Personalized Daily ArXiv Papers 2026-04-01

Model	Metric	Prompt usage	Completion usage	Total usage	Total arXiv papers	Scanned papers	Relevant papers
`gpt-5.4`	Tokens	169327	7469	176796	595	335	26
`gpt-5.4`	Cost	$0.42	$0.11	$0.54	595	335	26

Table of contents with paper titles:

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication Authors: Zichao Wei
Training-Free Dynamic Upcycling of Expert Language Models Authors: Eros Fanì, Oğuzhan Ersoy
Grokking From Abstraction to Intelligence Authors: Junjie Zhang, Zhen Shen, Gang Xiong, Xisong Dong
Tucker Attention: A generalization of approximate attention mechanisms Authors: Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer
OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models Authors: Tianran Liu, Shengwen Zhao, Mozhgan Pourkeshavarz, Weican Li, Nicholas Rhinehart
Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding Authors: Chengxi Li, Youssef Allouah, Rachid Guerraoui, Mikael Skoglund, Ming Xiao
APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay Authors: Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha
Lie Generator Networks for Nonlinear Partial Differential Equations Authors: Shafayeth Jamil, Rehan Kapadia
Big2Small: A Unifying Neural Network Framework for Model Compression Authors: Jing-Xiao Liao, Haoran Wang, Tao Li, Daoming Lyu, Yi Zhang, Chengjun Cai, Feng-Lei Fan
LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning Authors: Haihong Hao, Lei Chen, Mingfei Han, Changlin Li, Dong An, Yuqiang Yang, Zhihui Li, Xiaojun Chang
Metriplector: From Field Theory to Neural Architecture Authors: Dan Oprisa, Peter Toth
Minimum Norm Interpolation via The Local Theory of Banach Spaces: The Role of $2$-Uniform Convexity Authors: Gil Kur, Pierre Bizeul
From Density Matrices to Phase Transitions in Deep Learning: Spectral Early Warnings and Interpretability Authors: Max Hennick, Guillaume Corlouer
Is the Modality Gap a Bug or a Feature? A Robustness Perspective Authors: Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA Authors: Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu
A Pontryagin Method of Model-based Reinforcement Learning via Hamiltonian Actor-Critic Authors: Chengyang Gu, Yuxin Pan, Hui Xiong, Yize Chen
OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training Authors: Haiyue Song, Masao Utiyama
Concept frustration: Aligning human concepts and machine representations Authors: Enrico Parisini, Christopher J. Soelistyo, Ahab Isaac, Alessandro Barp, Christopher R. S. Banerji
Nonnegative Matrix Factorization in the Component-Wise L1 Norm for Sparse Data Authors: Giovanni Seraghiti, Kévin Dubrulle, Arnaud Vandaele, Nicolas Gillis
Tracking Equivalent Mechanistic Interpretations Across Neural Networks Authors: Alan Sun, Mariya Toneva
The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training Authors: Yongzhong Xu
A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models Authors: Lixin Xiu, Xufang Luo, Hideki Nakayama
Baby Scale: Investigating Models Trained on Individual Children's Language Input Authors: Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank
Think Anywhere in Code Generation Authors: Xue Jiang, Tianyu Zhang, Ge Li, Mengyang Liu, Taozhi Chen, Zhenhua Xu, Binhua Li, Wenpin Jiao, Zhi Jin, Yongbin Li, Yihong Dong
ASI-Evolve: AI Accelerates AI Authors: Weixian Xu, Tiantian Mi, Yixiu Liu, Yang Nan, Zhimeng Zhou, Lyumanshan Ye, Lin Zhang, Yu Qiao, Pengfei Liu
GENIE: Gram-Eigenmode INR Editing with Closed-Form Geometry Updates Authors: Samundra Karki, Adarsh Krishnamurthy, Baskar Ganapathysubramanian

1. On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

ArXiv ID: 2603.29069

Authors: Zichao Wei

Abstract: Integer multiplication has long been considered a hard problem for neural networks, with the difficulty widely attributed to the O(n) long-range dependency induced by carry chains. We argue that this diagnosis is wrong: long-range dependency is not an intrinsic property of multiplication, but a mirage produced by the choice of computational spacetime. We formalize the notion of mirage and provide a constructive proof: when two n-bit binary integers are laid out as a 2D outer-product grid, every step of long multiplication collapses into a $3 \times 3$ local neighborhood operation. Under this representation, a neural cellular automaton with only 321 learnable parameters achieves perfect length generalization up to $683\times$ the training range. Five alternative architectures -- including Transformer (6,625 params), Transformer+RoPE, and Mamba -- all fail under the same representation. We further analyze how partial successes locked the community into an incorrect diagnosis, and argue that any task diagnosed as requiring long-range dependency should first be examined for whether the dependency is intrinsic to the task or induced by the computational spacetime.

Comment: Architectural/mechanistic insight: shows apparent long-range dependency in multiplication is a representation-induced mirage and solves it with local neural cellular automata.

Relevance: 9 Novelty: 9
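
To make the grid layout described in the abstract concrete, here is a minimal sketch. It is not the paper's neural cellular automaton: it only lays two binary integers out as the outer-product grid of partial-product bits and recovers the product by summing anti-diagonals, with a carry that is handed only to the adjacent column. The function names and the least-significant-bit-first convention are assumptions made for illustration.

```python
def to_bits(x, n):
    """Return the n low bits of x, least-significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def outer_product_grid(a, b, n):
    """grid[i][j] = a_i AND b_j, the partial-product bit with weight 2**(i + j)."""
    a_bits, b_bits = to_bits(a, n), to_bits(b, n)
    return [[ai & bj for bj in b_bits] for ai in a_bits]


def multiply_via_grid(a, b, n):
    """Multiply two n-bit integers by summing anti-diagonals of the grid."""
    grid = outer_product_grid(a, b, n)
    result, carry = 0, 0
    for k in range(2 * n):  # output bit positions
        # All cells with i + j == k contribute to output column k.
        column = sum(grid[i][k - i] for i in range(n) if 0 <= k - i < n)
        total = column + carry
        result |= (total & 1) << k
        carry = total >> 1  # the (possibly multi-bit) carry goes to column k + 1 only
    return result


product = multiply_via_grid(13, 11, 4)
```

Both inputs are assumed to fit in `n` bits, so the product fits in `2 * n` bits and no carry is left over after the last column.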

2. Training-Free Dynamic Upcycling of Expert Language Models

ArXiv ID: 2603.29765

Authors: Eros Fanì, Oğuzhan Ersoy

Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model. Our method builds a single multitask model that preserves the capabilities of the original dense experts without requiring additional training. DUME is both cost-efficient and scalable: by leveraging the closed-form solution of ridge regression, it eliminates the need for further optimization and enables experts to be added dynamically while maintaining the model's original performance. We demonstrate that DUME consistently outperforms baseline approaches in both causal language modeling and reasoning settings. Finally, we also show that the DUME model can be fine-tuned to further improve performance. We show that, in the causal language modeling setting, DUME can retain up to 97.6% of a dense expert model specialized in one particular domain, and that it can also surpass it in the reasoning setting, where it can achieve 102.1% of the dense expert performance. Our code is available at: github.com/gensyn-ai/dume.

Comment: Architecture and training dynamics: training-free upcycling of dense domain experts into a unified MoE via closed-form ridge-regression routing/combination, directly targeting modular computation without additional optimization.

Relevance: 9 Novelty: 8
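
The abstract credits the closed-form solution of ridge regression for removing the need for further optimization. The sketch below shows only that generic closed form, $W = (X^\top X + \lambda I)^{-1} X^\top Y$, fitted as a linear router from features to one-hot expert labels. The synthetic features, the variable names, and the choice of one-hot domain labels as regression targets are assumptions for illustration, not details taken from DUME.

```python
import numpy as np


def ridge_closed_form(X, Y, lam=1.0):
    """Solve min_W ||X W - Y||^2 + lam * ||W||^2 without any iterative training."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


rng = np.random.default_rng(0)
num_experts, dim, per_domain = 3, 16, 200
lam = 1.0

# Hypothetical hidden features: one well-separated Gaussian cluster per domain.
centers = rng.normal(scale=5.0, size=(num_experts, dim))
X = np.concatenate([c + rng.normal(size=(per_domain, dim)) for c in centers])
labels = np.repeat(np.arange(num_experts), per_domain)
Y = np.eye(num_experts)[labels]          # one-hot routing targets (assumed)

W = ridge_closed_form(X, Y, lam=lam)
routed = (X @ W).argmax(axis=1)          # send each input to the top-scoring expert
accuracy = (routed == labels).mean()
```

Because the solution is a single linear solve, adding a new domain in this toy setup only means appending rows to `X` and a column to `Y` and solving again.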

3. Grokking From Abstraction to Intelligence

ArXiv ID: 2603.29262

Authors: Junjie Zhang, Zhen Shen, Gang Xiong, Xisong Dong

Abstract: Grokking in modular arithmetic has established itself as the quintessential fruit fly experiment, serving as a critical domain for investigating the mechanistic origins of model generalization. Despite its significance, existing research remains narrowly focused on specific local circuits or optimization tuning, largely overlooking the global structural evolution that fundamentally drives this phenomenon. We propose that grokking originates from a spontaneous simplification of internal model structures governed by the principle of parsimony. We integrate causal, spectral, and algorithmic complexity measures alongside Singular Learning Theory to reveal that the transition from memorization to generalization corresponds to the physical collapse of redundant manifolds and deep information compression, offering a novel perspective for understanding the mechanisms of model overfitting and generalization.

Comment: Representation learning theory and structure: analyzes grokking as global structural simplification and information compression using causal, spectral, and complexity-based measures.

Relevance: 9 Novelty: 8

4. Tucker Attention: A generalization of approximate attention mechanisms

ArXiv ID: 2603.30033

Authors: Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer

Abstract: The pursuit of reducing the memory footprint of the self-attention mechanism in multi-head self-attention (MHA) has spawned a rich portfolio of methods, e.g., grouped-query attention (GQA) and multi-head latent attention (MLA). These methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise questions of which objects they really approximate and how to interpret the low-rank behavior of the resulting representations. To answer these questions, this work proposes a generalized view on the weight objects in the self-attention layer and a factorization strategy, which allows us to construct a parameter-efficient scheme, called Tucker Attention. Tucker Attention requires an order of magnitude fewer parameters for comparable validation metrics, compared to GQA and MLA, as evaluated in LLM and ViT test cases. Additionally, Tucker Attention encompasses GQA, MLA, and MHA as special cases and is fully compatible with flash-attention and rotary position embeddings (RoPE). This generalization strategy yields insights into the actual ranks achieved by MHA, GQA, and MLA, and further enables simplifications for MLA.

Comment: Architecture and training dynamics: Tucker Attention generalizes MHA/GQA/MLA as a tensor-factorized attention family and studies the effective low-rank structure behind attention variants.

Relevance: 9 Novelty: 8
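
For readers unfamiliar with the Tucker format, the sketch below rebuilds a three-way weight tensor from a small core and one factor matrix per mode, and compares parameter counts. Which attention weights the paper stacks into a tensor, and along which modes, is not stated in the abstract; the (heads, model width, per-head width) layout, the sizes, and the ranks here are assumptions. The last lines show one way a grouped sharing pattern can be written in the same form; whether this matches the paper's construction for GQA is also an assumption.

```python
import numpy as np


def tucker_reconstruct(core, U_head, U_model, U_feat):
    """W[h, d, k] = sum_{a,b,c} core[a, b, c] * U_head[h, a] * U_model[d, b] * U_feat[k, c]."""
    return np.einsum("abc,ha,db,kc->hdk", core, U_head, U_model, U_feat)


def tucker_param_count(core, *factors):
    """Number of stored values: the core plus every factor matrix."""
    return core.size + sum(f.size for f in factors)


rng = np.random.default_rng(0)
H, D, K = 8, 64, 16           # heads, model width, per-head width (illustrative)
r_h, r_d, r_k = 2, 16, 8      # Tucker rank per mode (illustrative)

core = rng.normal(size=(r_h, r_d, r_k))
U_head = rng.normal(size=(H, r_h))
U_model = rng.normal(size=(D, r_d))
U_feat = rng.normal(size=(K, r_k))

W = tucker_reconstruct(core, U_head, U_model, U_feat)    # shape (H, D, K)
dense_params = H * D * K
tucker_params = tucker_param_count(core, U_head, U_model, U_feat)

# Grouped sharing: a one-hot head factor makes all heads in a group reuse the
# same core slice; the other two modes use identity factors (no compression).
groups = np.repeat(np.arange(r_h), H // r_h)             # head index -> group index
shared_core = rng.normal(size=(r_h, D, K))
W_grouped = tucker_reconstruct(shared_core, np.eye(r_h)[groups], np.eye(D), np.eye(K))
```

With these illustrative sizes the factorized form stores 1,424 values against 8,192 for the dense tensor.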

5. OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models

ArXiv ID: 2603.28887

Authors: Tianran Liu, Shengwen Zhao, Mozhgan Pourkeshavarz, Weican Li, Nicholas Rhinehart

Abstract: Data-driven autonomous driving simulation has long been constrained by its heavy reliance on pre-recorded driving logs or spatial priors, such as HD maps. This fundamental dependency severely limits scalability, restricting open-ended generation capabilities to the finite scale of existing collected datasets. To break this bottleneck, we present OccSim, the first occupancy world model-driven 3D simulator. OccSim obviates the requirement for continuous logs or HD maps; conditioned only on a single initial frame and a sequence of future ego-actions, it can stably generate over 3,000 continuous frames, enabling the continuous construction of large-scale 3D occupancy maps spanning over 4 kilometers for simulation. This represents a more than 80x improvement in stable generation length over previous state-of-the-art occupancy world models. OccSim is powered by two modules: a W-DiT-based static occupancy world model and the Layout Generator. W-DiT handles the ultra-long-horizon generation of static environments by explicitly introducing known rigid transformations in the architecture design, while the Layout Generator populates the dynamic foreground with reactive agents based on the synthesized road topology. With these designs, OccSim can synthesize massive, diverse simulation streams. Extensive experiments demonstrate its downstream utility: data collected directly from OccSim can pre-train 4D semantic occupancy forecasting models to achieve up to 67% zero-shot performance on unseen data, outperforming the previous asset-based simulator by 11%. When scaling the OccSim dataset to 5x the size, the zero-shot performance increases to about 74%, while the improvement over asset-based simulators expands to 22.1%.

Comment: Foundation world model: long-horizon occupancy world model generates kilometer-scale simulation from one frame plus future actions.

Relevance: 9 Novelty: 8

6. Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding

ArXiv ID: 2603.28780

Authors: Chengxi Li, Youssef Allouah, Rachid Guerraoui, Mikael Skoglund, Ming Xiao

Abstract: In this paper, we study the problem of distributed training (DT) under Byzantine attacks with communication constraints. While prior work has developed various robust aggregation rules at the server to enhance robustness to Byzantine attacks, the existing methods suffer from a critical limitation in that the solution error does not diminish when the local gradients sent by different devices vary considerably, as a result of data heterogeneity among the subsets held by different devices. To overcome this limitation, we propose a novel DT method, cyclic gradient coding-based DT (LAD). In LAD, the server allocates the entire training dataset to the devices before training begins. In each iteration, it assigns computational tasks redundantly to the devices using cyclic gradient coding. Each honest device then computes local gradients on a fixed number of data subsets and encodes the local gradients before transmitting to the server. The server aggregates the coded vectors from the honest devices and the potentially incorrect messages from Byzantine devices using a robust aggregation rule. Leveraging the redundancy of computation across devices, the convergence performance of LAD is analytically characterized, demonstrating improved robustness against Byzantine attacks and significantly lower solution error. Furthermore, we extend LAD to a communication-efficient variant, compressive and cyclic gradient coding-based DT (Com-LAD), which further reduces communication overhead under constrained settings. Numerical results validate the effectiveness of the proposed methods in enhancing both Byzantine resilience and communication efficiency.

Comment: Distributed training algorithm combining cyclic gradient coding, Byzantine robustness, and communication compression with convergence analysis.

Relevance: 9 Novelty: 8
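
The redundancy-plus-robust-aggregation idea in the abstract can be illustrated with a toy example. This is a sketch under stated assumptions, not the LAD algorithm: the cyclic assignment rule, the use of a plain average as the per-device encoding, the coordinate-wise median as the robust rule, and the attacker indices are all chosen for illustration.

```python
import numpy as np


def cyclic_assignment(num_devices, redundancy):
    """Device i gets data subsets i, i+1, ..., i+redundancy-1 (mod num_devices)."""
    return [[(i + s) % num_devices for s in range(redundancy)]
            for i in range(num_devices)]


rng = np.random.default_rng(0)
N, d, dim = 10, 5, 4
byzantine = {3, 7}                                 # hypothetical attacker indices

# Heterogeneous per-subset gradients; their mean is the full-data gradient.
subset_grads = rng.normal(size=(N, dim)) + np.arange(dim)
true_grad = subset_grads.mean(axis=0)

assignment = cyclic_assignment(N, d)
messages = []
for device, subsets in enumerate(assignment):
    if device in byzantine:
        messages.append(np.full(dim, 1e3))         # arbitrary corrupted message
    else:
        # Assumed encoding: average the gradients of the assigned subsets.
        messages.append(subset_grads[subsets].mean(axis=0))
messages = np.stack(messages)

naive = messages.mean(axis=0)                      # plain averaging, not robust
robust = np.median(messages, axis=0)               # coordinate-wise median

naive_error = np.linalg.norm(naive - true_grad)
robust_error = np.linalg.norm(robust - true_grad)
```

Because every device averages over `d` overlapping subsets, honest messages differ less from one another than raw per-subset gradients would, which is the effect that makes a robust rule less sensitive to data heterogeneity in this toy setting.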

7. APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

ArXiv ID: 2603.29093

Authors: Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

Abstract: LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL's +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.

Comment: Agent memory system with structured procedural-episodic replay and hybrid retrieval over plans, failures, and execution traces.

Relevance: 9 Novelty: 8

8. Lie Generator Networks for Nonlinear Partial Differential Equations

ArXiv ID: 2603.29264

Authors: Shafayeth Jamil, Rehan Kapadia

Abstract: Linear dynamical systems are fully characterized by their eigenspectra, accessible directly from the generator of the dynamics. For nonlinear systems governed by partial differential equations, no equivalent theory exists. We introduce Lie Generator Network--Koopman (LGN-KM), a neural operator that lifts nonlinear dynamics into a linear latent space and learns the continuous-time Koopman generator ($L_k$) through a decomposition $L_k = S - D_k$, where $S$ is skew-symmetric representing conservative inter-modal coupling, and $D_k$ is a positive-definite diagonal encoding modal dissipation. This architectural decomposition enforces stability and enables interpretability through direct spectral access to the learned dynamics. On two-dimensional Navier--Stokes turbulence, the generator recovers the known dissipation scaling and a complete multi-branch dispersion relation from trajectory data alone with no physics supervision. Independently trained models at different flow regimes recover matched gauge-invariant spectral structure, exposing a gauge freedom in the Koopman lifting. Because the generator is provably stable, it enables guaranteed long-horizon stability, continuous-time evaluation at arbitrary time, and physics-informed cross-viscosity model transfer.

Comment: Learns a stable Koopman-generator neural operator with explicit skew-symmetric/dissipative decomposition, directly targeting architectural mechanism and dynamics interpretability.

Relevance: 9 Novelty: 8
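
The stability claim for the decomposition $L_k = S - D_k$ can be checked numerically on a small random example. The sketch below is not the LGN-KM model: the softplus parametrization of the diagonal, the matrix size, and the helper names are assumptions. It only illustrates why a skew-symmetric part minus a positive diagonal gives eigenvalues with negative real part.

```python
import numpy as np


def build_generator(raw_coupling, raw_damping):
    """Return L = S - D with S skew-symmetric and D a positive diagonal."""
    S = raw_coupling - raw_coupling.T              # S.T == -S by construction
    d = np.log1p(np.exp(raw_damping))              # softplus keeps every rate > 0
    return S - np.diag(d), S, d


def propagate(L, x0, t):
    """Evaluate x(t) = exp(t L) x0 via the eigendecomposition of L.

    Assumes L is diagonalizable, which holds for generic random matrices.
    """
    eigvals, V = np.linalg.eig(L)
    coeffs = np.linalg.solve(V, x0.astype(complex))
    return (V @ (np.exp(t * eigvals) * coeffs)).real


rng = np.random.default_rng(0)
n = 6
L, S, d = build_generator(rng.normal(size=(n, n)), rng.normal(size=n))

# For a unit eigenvector v, v^H S v is purely imaginary, so
# Re(lambda) = -v^H D v <= -min(d) < 0: every mode decays.
max_real_part = np.linalg.eigvals(L).real.max()

x0 = rng.normal(size=n)
norms = [np.linalg.norm(propagate(L, x0, t)) for t in (0.0, 1.0, 2.0)]
```

Because the state at any real `t` comes from the same eigendecomposition, the sketch also shows in miniature what evaluation at arbitrary continuous time means for a linear generator.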

9. Big2Small: A Unifying Neural Network Framework for Model Compression

ArXiv ID: 2603.29768

Authors: Jing-Xiao Liao, Haoran Wang, Tao Li, Daoming Lyu, Yi Zhang, Chengjun Cai, Feng-Lei Fan

Abstract: With the development of foundational models, model compression has become a critical requirement. Various model compression approaches have been proposed such as low-rank decomposition, pruning, quantization, ergodic dynamic systems, and knowledge distillation, which are based on different heuristics. To elevate the field from fragmentation to a principled discipline, we construct a unifying mathematical framework for model compression grounded in measure theory. We further demonstrate that each model compression technique is mathematically equivalent to a neural network subject to a regularization. Building upon this mathematical and structural equivalence, we propose an experimentally verified data-free model compression framework, termed Big2Small, which translates Implicit Neural Representations (INRs) from data domain to the domain of network parameters. Big2Small trains compact INRs to encode the weights of larger models and reconstruct the weights during inference. To enhance reconstruction fidelity, we introduce Outlier-Aware Preprocessing to handle extreme weight values and a Frequency-Aware Loss function to preserve high-frequency details. Experiments on image classification and segmentation demonstrate that Big2Small achieves competitive accuracy and compression ratios compared to state-of-the-art baselines.

Comment: Presents a unifying mathematical framework for model compression, connecting pruning, quantization, low-rank methods, and distillation through regularized neural networks.

Relevance: 9 Novelty: 8

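10. LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning

ArXiv ID: 2603.29165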

Authors: Haihong Hao, Lei Chen, Mingfei Han, Changlin Li, Dong An, Yuqiang Yang, Zhihui Li, Xiaojun Chang

Abstract: Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover triggered when the agent deviates excessively. LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to dream ahead and reason about how actions will affect subsequent observations. Experiments on the R2R-CE, RxR-CE, and R2R-PE benchmarks achieve new SOTA results, and real-robot tests across diverse environments demonstrate LatentPilot's superior understanding of environment-action dynamics in the scene. Project page: https://abdd.top/latentpilot/

Comment: Learns action-conditioned latent visual dynamics for navigation, using latent memory carried across steps to dream ahead about future observations.

Relevance: 9 Novelty: 8

11. Metriplector: From Field Theory to Neural Architecture

ArXiv ID: 2603.29496

Authors: Dan Oprisa, Peter Toth

Abstract: We present Metriplector, a neural architecture primitive in which the input configures an abstract physical system--fields, sources, and operators--and the dynamics of that system is the computation. Multiple fields evolve via coupled metriplectic dynamics, and the stress-energy tensor $T^{\mu\nu}$, derived from Noether's theorem, provides the readout. The metriplectic formulation admits a natural spectrum of instantiations: the dissipative branch alone yields a screened Poisson equation solved exactly via conjugate gradient; activating the full structure--including the antisymmetric Poisson bracket--gives field dynamics for image recognition and language modeling. We evaluate Metriplector across four domains, each using a task-specific architecture built from this shared primitive with progressively richer physics: F1=1.0 on maze pathfinding, generalizing from 15x15 training grids to unseen 39x39 grids; 97.2% exact Sudoku solve rate with zero structural injection; 81.03% on CIFAR-100 with 2.26M parameters; and 1.182 bits/byte on language modeling with 3.6x fewer training tokens than a GPT baseline.

Comment: Proposes a new computation primitive where coupled metriplectic field dynamics define the network, making the core contribution an unusual neural architecture mechanism.

Relevance: 8 Novelty: 9
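
The abstract mentions a screened Poisson equation solved via conjugate gradient. As a generic illustration of that building block, and not of the Metriplector architecture, the sketch below solves $(\kappa^2 - \partial_x^2)\,u = f$ on a 1D grid with a hand-written conjugate gradient loop. The 1D domain, the zero Dirichlet boundaries, the finite-difference discretization, and the point-like source are all assumptions.

```python
import numpy as np


def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A given as a matvec."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def screened_poisson_matvec(u, kappa, h):
    """Apply (kappa^2 - d^2/dx^2) with zero Dirichlet boundaries on a 1D grid."""
    lap = -2.0 * u
    lap[1:] += u[:-1]
    lap[:-1] += u[1:]
    return kappa**2 * u - lap / h**2


n, kappa = 64, 2.0
h = 1.0 / (n + 1)
source = np.zeros(n)
source[n // 2] = 1.0 / h                           # point-like source mid-domain

u = conjugate_gradient(lambda v: screened_poisson_matvec(v, kappa, h), source)
residual = np.linalg.norm(screened_poisson_matvec(u, kappa, h) - source)
```

The operator is symmetric positive definite for any real `kappa`, which is the property conjugate gradient relies on; the screening term `kappa**2` makes the response to the source decay away from it.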

12. Minimum Norm Interpolation via The Local Theory of Banach Spaces: The Role of $2$-Uniform Convexity

ArXiv ID: 2603.28956

Authors: Gil Kur, Pierre Bizeul

Abstract: The minimum-norm interpolator (MNI) framework has recently attracted considerable attention as a tool for understanding generalization in overparameterized models, such as neural networks. In this work, we study the MNI under a $2$-uniform convexity assumption, which is weaker than requiring the norm to be induced by an inner product, and it typically does not admit a closed-form solution. At a high level, we show that this condition yields an upper bound on the MNI bias in both linear and nonlinear models. We further show that this bound is sharp for overparameterized linear regression when the unit ball of the norm is in isotropic (or John's) position, and the covariates are isotropic, symmetric, i.i.d. sub-Gaussian, such as vectors with i.i.d. Bernoulli entries. Finally, under the same assumption on the covariates, we prove sharp generalization bounds for the $\ell_p$-MNI when $p \in \bigl(1 + C/\log d, 2\bigr]$. To the best of our knowledge, this is the first work to establish sharp bounds for non-Gaussian covariates in linear models when the norm is not induced by an inner product. This work is deeply inspired by classical works on $K$-convexity, and more modern work on the geometry of 2-uniform and isotropic convex bodies.

Comment: Representation learning theory and structure: sharp generalization and bias bounds for minimum-norm interpolation under 2-uniform convexity, extending theory beyond inner-product norms and Gaussian covariates.