This is a remedial run for missed papers from 03/14/2026 to 03/15/2026.

Results generated on 03/21/2026.

Personalized Daily ArXiv Papers 2026-03-16

[gpt-5.4]	Prompt	Completion	Total
Token	208634	6781	215415
Cost	$0.52	$0.1	$0.62

Table of contents with paper titles:

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling Authors: Mayank Mishra, Shawn Tan, Ion Stoica, Joseph Gonzalez, Tri Dao
PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers Authors: Eshed Gal, Moshe Eliasof, Siddharth Rout, Eldad Haber
Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys Authors: Xu Yang, Jiapeng Zhang, Dongyang Zhao, Guo Chen, Zhuo Tang
Rigorous Asymptotics for First-Order Algorithms Through the Dynamical Cavity Method Authors: Yatin Dandi, David Gamarnik, Francisco Pernice, Lenka Zdeborová
Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients Authors: J Rosser
FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference Authors: Wilhelm Tranheden, Shahnawaz Ahmed, Devdatt Dubhashi, Jonna Matthiesen, Hannes von Essen
GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent Authors: Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev
On Interpolation Formulas Describing Neural Network Generalization Authors: Jin Guo, Roy Y. He, Jean-Michel Morel
Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors Authors: Mark Rofin, Jalal Naghiyev, Michael Hahn
Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales Authors: Yongzhong Xu
The Phenomenology of Hallucinations Authors: Valeria Ruscio, Keiran Thompson
Enhancing LLM Training via Spectral Clipping Authors: Xiaowen Jiang, Andrei Semenov, Sebastian U. Stich
Power-Law Spectrum of the Random Feature Model Authors: Elliot Paquette, Ke Liang Xiao, Yizhe Zhu
Effective Sparsity: A Unified Framework via Normalized Entropy and the Effective Number of Nonzeros Authors: Haoyu He, Hao Wang, Jiashan Wang, Hao Zeng
High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise Authors: Avik Kar, Siddharth Chandak, Rahul Singh, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos
ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference Authors: Surendra Pathak, Bo Han
Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models Authors: Ying Xie
SVD Contextual Sparsity Predictors for Fast LLM Inference Authors: Georgii Serbin, Kirill Koshkin, Zhongao Sun, Anastasiya Bistrigova, C. C. Korikov
Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention Authors: Markus W. Baumgartner, Anson Lei, Joe Watson, Ingmar Posner
OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism Authors: Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu
Interleaved Resampling and Refitting: Data and Compute-Efficient Evaluation of Black-Box Predictors Authors: Haichen Hu, David Simchi-Levi
On the (Generative) Linear Sketching Problem Authors: Xinyu Yuan, Yan Qiao, Zonghui Wang, Wenzhi Chen
Sampling Boltzmann distributions via normalizing flow approximation of transport maps Authors: Zia Ur Rehman, Gero Friesecke
Windowed Fourier Propagator: A Frequency-Local Neural Operator for Wave Equations in Inhomogeneous Media Authors: Yiyang Cai, Zixuan Qiu, Yunlu Shu, Jiamao Wu, Yingzhou Li, Tianyu Wang, Xi Chen
Convergence of Two Time-Scale Stochastic Approximation: A Martingale Approach Authors: Mathukumalli Vidyasagar
TMPDiff: Temporal Mixed-Precision for Diffusion Models Authors: Basile Lewandowski, Simon Kurz, Aditya Shankar, Robert Birke, Jian-Jia Chen, Lydia Y. Chen
$K-$means with learned metrics Authors: Pablo Groisman, Matthieu Jonckheere, Jordan Serres, Mariela Sued
SuperLocalMemory V3: Information-Geometric Foundations for Zero-LLM Enterprise Agent Memory Authors: Varun Pratap Bhardwaj
D-MEM: Dopamine-Gated Agentic Memory via Reward Prediction Error Routing Authors: Yuru Song, Qi Xin
ZOTTA: Test-Time Adaptation with Gradient-Free Zeroth-Order Optimization Authors: Ronghao Zhang, Shuaicheng Niu, Qi Deng, Yanjie Dong, Jian Chen, Runhao Zeng
From Specification to Architecture: A Theory Compiler for Knowledge-Guided Machine Learning Authors: Asela Hevapathige, Yu Xia, Sachith Seneviratne, Saman Halgamuge
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs Authors: Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
Structure-Dependent Regret and Constraint Violation Bounds for Online Convex Optimization with Time-Varying Constraints Authors: Xiufeng Liu, Qian Chen, Zhijin Wang, Ruyu Liu
AEX: Non-Intrusive Multi-Hop Attestation and Provenance for LLM APIs Authors: Yongjie Guan
Towards One-for-All Anomaly Detection for Tabular Data Authors: Shiyuan Li, Yixin Liu, Yu Zheng, Xiaofeng Cao, Shirui Pan, Heng Tao Shen
Not All Latent Spaces Are Flat: Hyperbolic Concept Control Authors: Maria Rosaria Briglia, Simone Facchiano, Paolo Cursi, Alessio Sampieri, Emanuele Rodolà, Guido Maria D'Amely di Melendugno, Luca Franco, Fabio Galasso, Iacopo Masi
Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI Authors: Mark Baciak, Thomas A. Cellucci, Deanna M. Falkowski
The Institutional Scaling Law: Non-Monotonic Fitness, Capability-Trust Divergence, and Symbiogenetic Scaling in Generative AI Authors: Mark Baciak, Thomas A. Cellucci
True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity Authors: Shivnath Tathe
Is the reconstruction loss culprit? An attempt to outperform JEPA Authors: Alexey Potapov, Oleg Shcherbakov, Ivan Kravchenko
Representation Alignment for Just Image Transformers is not Easier than You Think Authors: Jaeyo Shin, Jiwook Kim, Hyunjung Shim
WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems Authors: Yuchen Wang, Jiangtao Kong, Sizhe Wei, Xiaochang Li, Haohong Lin, Hongjue Zhao, Tianyi Zhou, Lu Gan, Huajie Shao
Human-like Object Grouping in Self-supervised Vision Transformers Authors: Hossein Adeli, Seoyoung Ahn, Andrew Luo, Mengmi Zhang, Nikolaus Kriegeskorte, Gregory Zelinsky
Exploring the Dimensions of a Variational Neuron Authors: Yves Ruffenach
IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring Authors: Xuan Cui, Huiyue Li, Run Zeng, Yunfei Zhao, Jinrui Qian, Wei Duan, Bo Liu, Zhanpeng Zhou
Align Forward, Adapt Backward: Closing the Discretization Gap in Logic Gate Networks Authors: Youngsung Kim
Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition Authors: Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo
DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression Authors: Bingzhou Li, Tao Huang
Soft Mean Expected Calibration Error (SMECE): A Calibration Metric for Probabilistic Labels Authors: Michael Leznik
ES-Merging: Biological MLLM Merging via Embedding Space Signals Authors: Wonbin Lee, Dongki Kim, Sung Ju Hwang
SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI Authors: Parth Patne, Mahdi Taheri, Ali Mahani, Maksim Jenihhin, Reza Mahani, Christian Herglotz
Exploiting temporal parallelism for LSTM Autoencoder acceleration on FPGA Authors: Aimilios Leftheriotis, Dimosthenis Masouros, Dimitrios Soudris, George Theodoridis
Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring Authors: Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu, Chengyang Fang, Xiaoshuai Hao, Can Ma, Weiping Wang
Routing Channel-Patch Dependencies in Time Series Forecasting with Graph Spectral Decomposition Authors: Dongyuan Li, Shun Zheng, Chang Xu, Jiang Bian, Renhe Jiang
U-Face: An Efficient and Generalizable Framework for Unsupervised Facial Attribute Editing via Subspace Learning Authors: Bo Liu, Xuan Cui, Run Zeng, Wei Duan, Chongwen Liu, Jinrui Qian, Lianggui Tang, Hongping Gan
On the Degrees of Freedom of Gridded Control Points in Learning-Based Medical Image Registration Authors: Wen Yan, Qianye Yang, Yipei Wang, Shonit Punwani, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt
High-Fidelity Compression of Seismic Velocity Models via SIREN Auto-Decoders Authors: Caiyun Liu, Xiaoxue Luo, Jie Xiong
PA-Net: Precipitation-Adaptive Mixture-of-Experts for Long-Tail Rainfall Nowcasting Authors: Xinyu Xiao, Sen Lei, Eryun Liu, Shiming Xiang, Hao Li, Cheng Yuan, Yuan Qi, Qizhao Jin

1. M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

ArXiv ID: 2603.14360

Authors: Mayank Mishra, Shawn Tan, Ion Stoica, Joseph Gonzalez, Tri Dao

Abstract: Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce Matrix-to-Matrix RNN (M$^2$RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance of non-linear RNNs is limited by their state size. We also demonstrate how the state size expansion mechanism enables efficient use of tensor cores. Empirically, M$^2$RNN achieves perfect state tracking generalization at sequence lengths not seen during training. These benefits also translate to large-scale language modeling. In hybrid settings that interleave recurrent layers with attention, Hybrid M$^2$RNN outperforms equivalent Gated DeltaNet hybrids by $0.4$-$0.5$ perplexity points on a 7B MoE model, while using $3\times$ smaller state sizes for the recurrent layers. Notably, replacing even a single recurrent layer with M$^2$RNN in an existing hybrid architecture yields accuracy gains comparable to Hybrid M$^2$RNN with minimal impact on training throughput. Further, the Hybrid Gated DeltaNet models with a single M$^2$RNN layer also achieve superior long-context generalization, outperforming state-of-the-art hybrid linear attention architectures by up to $8$ points on LongBench. Together, these results establish non-linear RNN layers as a compelling building block for efficient and scalable language models.

Comment: Introduces matrix-valued nonlinear recurrent layers as a scalable core architecture with stronger expressivity than standard transformer blocks.

Relevance: 10 Novelty: 9

2. PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers

ArXiv ID: 2603.13663

Authors: Eshed Gal, Moshe Eliasof, Siddharth Rout, Eldad Haber

Abstract: The success of vision transformers, especially for generative modeling, is limited by the quadratic cost and weak spatial inductive bias of self-attention. We propose PDE-SSM, a spatial state-space block that replaces attention with a learnable convection-diffusion-reaction partial differential equation. This operator encodes a strong spatial prior by modeling information flow via physically grounded dynamics rather than all-to-all token interactions. Solving the PDE in the Fourier domain yields global coupling with near-linear complexity of $O(N \log N)$, delivering a principled and scalable alternative to attention. We integrate PDE-SSM into a flow-matching generative model to obtain the PDE-based Diffusion Transformer PDE-SSM-DiT. Empirically, PDE-SSM-DiT matches or exceeds the performance of state-of-the-art Diffusion Transformers while substantially reducing compute. Our results show that, analogous to 1D settings where SSMs supplant attention, multi-dimensional PDE operators provide an efficient, inductive-bias-rich foundation for next-generation vision models.

Comment: Replaces transformer attention with a learnable Fourier-solved PDE state-space block, a core architectural innovation for efficient spatial mixing.

Relevance: 10 Novelty: 8
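
To make the "solve the PDE in the Fourier domain" idea from the PDE-SSM abstract concrete, here is a minimal NumPy sketch. It is not the authors' block: it assumes constant (not learned) coefficients, a periodic grid with unit spacing, and the linear equation u_t = D*lap(u) - v.grad(u) - r*u. The function name `spectral_cdr_step` and its parameters are hypothetical. Under those assumptions each Fourier mode evolves independently, so one FFT, one pointwise multiply, and one inverse FFT give global mixing at O(N log N) cost.

```python
import numpy as np

def spectral_cdr_step(u, diffusion, velocity, reaction, t=1.0):
    """Advance u_t = D*lap(u) - v.grad(u) - r*u by time t on a periodic 2D grid.

    Assumes constant coefficients and unit grid spacing (illustrative only).
    velocity is a (vy, vx) pair: vy acts along axis 0, vx along axis 1.
    """
    h, w = u.shape
    # Angular frequencies for each axis.
    ky = 2 * np.pi * np.fft.fftfreq(h)[:, None]
    kx = 2 * np.pi * np.fft.fftfreq(w)[None, :]
    vy, vx = velocity
    # Fourier symbol of the operator: d/dx -> i*kx, laplacian -> -(kx^2 + ky^2).
    symbol = -diffusion * (kx**2 + ky**2) - 1j * (vx * kx + vy * ky) - reaction
    u_hat = np.fft.fft2(u)
    # Each mode is scaled by exp(symbol * t); the input is real, so keep the real part.
    return np.real(np.fft.ifft2(u_hat * np.exp(symbol * t)))
```

With only convection switched on and an integer velocity, the step reduces to a circular shift of the grid, which is a convenient sanity check for the sign conventions.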

3. Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

ArXiv ID: 2603.14224

Authors: Xu Yang, Jiapeng Zhang, Dongyang Zhao, Guo Chen, Zhuo Tang

Abstract: The KV cache in self-attention has emerged as a major bottleneck in long-context and large-batch inference for LLMs. Existing approaches often treat sparsity prediction and compression as separate modules, relying on auxiliary index structures to select relevant tokens, and on complex quantization schemes to reduce memory usage. This fragmented design introduces redundant overhead and limits scalability. In this paper, we propose a novel paradigm: treating the compressed key representation not merely as storage, but as a self-indexing structure that directly enables efficient sparse attention. By designing a sign-based 1-bit vector quantization (VQ) scheme, our method unifies compression and retrieval in a single, hardware-friendly format. This approach eliminates the need for external indices or learning-based predictors, offering a lightweight yet robust solution for memory-constrained inference. All components are designed to be hardware-efficient and easy to implement. By implementing custom CUDA kernels, our method integrates seamlessly with FlashAttention, minimizing additional runtime and memory overhead. Experimental results demonstrate that our approach delivers both effectiveness and efficiency.

Comment: Model compression and efficiency: unifies KV-cache compression and sparse attention retrieval via self-indexing 1-bit quantized keys with custom CUDA integration.

Relevance: 10 Novelty: 8
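
The self-indexing idea in the abstract above (compressed keys doubling as the index for sparse attention) can be sketched in a few lines. This is a simplified illustration, not the paper's method: it uses plain per-dimension sign quantization rather than the paper's vector-quantization scheme, keeps full-precision keys for the final attention step, and runs in NumPy rather than custom CUDA kernels. All function names are hypothetical.

```python
import numpy as np

def sign_codes(keys):
    """1-bit code per dimension: +1 where the key entry is non-negative, else -1."""
    return np.where(keys >= 0, 1, -1).astype(np.int8)

def select_topk(query, codes, k):
    """Approximate q.k by q.sign(k) and return the indices of the k largest scores."""
    approx = codes.astype(query.dtype) @ query
    return np.argsort(-approx)[:k]

def sparse_attention(query, keys, values, idx):
    """Exact softmax attention restricted to the selected token indices."""
    logits = keys[idx] @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ values[idx]
```

A typical call would be `idx = select_topk(q, sign_codes(K), k)` followed by `sparse_attention(q, K, V, idx)`; when `k` equals the sequence length the result matches dense attention.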

4. Rigorous Asymptotics for First-Order Algorithms Through the Dynamical Cavity Method

ArXiv ID: 2603.14573

Authors: Yatin Dandi, David Gamarnik, Francisco Pernice, Lenka Zdeborová

Abstract: Dynamical Mean Field Theory (DMFT) provides an asymptotic description of the dynamics of macroscopic observables in certain disordered systems. Originally pioneered in the context of spin glasses by Sompolinsky and Zippelius (1982), it has since been used to derive asymptotic dynamical equations for a wide range of models in physics, high-dimensional statistics and machine learning. One of the main tools used by physicists to obtain these equations is the dynamical cavity method, which has remained largely non-rigorous. In contrast, existing mathematical formalizations have relied on alternative approaches, including Gaussian conditioning, large deviations over paths, or Fourier analysis. In this work, we formalize the dynamical cavity method and use it to give a new proof of the DMFT equations for General First Order Methods, a broad class of dynamics encompassing algorithms such as Gradient Descent and Approximate Message Passing.

Comment: Provides a rigorous formalization of the dynamical cavity method for first-order algorithms, yielding asymptotic theory for optimization dynamics.

Relevance: 9 Novelty: 9

5. Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

ArXiv ID: 2603.14665

Authors: J Rosser

Abstract: Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. However, models often learn broad concepts shared across many examples. Moreover, existing TDA methods are supervised -- they require a predefined query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") via dictionary learning in a preconditioned eigenspace. Each atom captures a shared update direction induced by a cluster of functionally similar documents, directly recovering the collective structure that per-document methods do not address. Among 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors -- refusal, arithmetic, yes/no classification, trivia QA -- without any behavioral labels. These atoms double as effective steering vectors: applying them as weight-space perturbations produces large, controllable shifts in model behavior (e.g., bulleted-list generation 33% to 94%; systematic refusal 50% to 0%). The method requires no query--document scoring stage, and scales independently of the number of query behaviors of interest. Code is available at https://github.com/jrosseruk/gradient_atoms.

Comment: Representation learning: unsupervised sparse dictionary decomposition of per-document training gradients to discover interpretable behavior atoms and steering directions.

Relevance: 9 Novelty: 9

6. FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

ArXiv ID: 2603.14591

Authors: Wilhelm Tranheden, Shahnawaz Ahmed, Devdatt Dubhashi, Jonna Matthiesen, Hannes von Essen

Abstract: Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification head a critical bottleneck that accounts for up to 60% of model parameters and 50% of inference compute. We introduce FlashHead, the first efficient drop-in replacement for the dense classification head that is training-free and hardware-friendly. FlashHead builds on principles from information retrieval, reframing the computation at the output head as a retrieval problem rather than a dense classification over the full vocabulary. FlashHead introduces four key innovations: (1) a balanced clustering scheme that structures vocabulary partitions into compact hardware-efficient tensors, (2) extending multiprobe retrieval to language model heads, enabling thousands of clusters to be scored in parallel, (3) a novel inference-time sampling mechanism that extends retrieval beyond top tokens, enabling probabilistic sampling across the full vocabulary, and (4) selective quantization, enabling effective low-bit computation in the head. Experiments on Llama-3.2, Gemma-3, and Qwen-3 show that FlashHead delivers model-level inference speedups of up to 1.75x while maintaining output accuracy compared to the original head. By overcoming the classification head bottleneck, FlashHead establishes a new benchmark for efficient inference and removes a key barrier to developing smaller, capable models for consumer hardware.

Comment: Inference efficiency: training-free retrieval-style replacement for the LM output head that reduces classification-head compute.

Relevance: 9 Novelty: 8
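
The retrieval framing in the FlashHead abstract (score clusters first, then score only the tokens inside the probed clusters) can be illustrated with a small two-stage argmax. This is a rough sketch under stated assumptions, not FlashHead itself: the equal-size `clusters` array is taken as given instead of being produced by the paper's balanced clustering, cluster centroids are simple means, and sampling and quantization are left out. The name `clustered_argmax` is hypothetical.

```python
import numpy as np

def clustered_argmax(hidden, weights, clusters, n_probe):
    """Two-stage greedy decoding over a partitioned vocabulary.

    hidden:   (d,) final hidden state
    weights:  (V, d) output-head matrix
    clusters: (C, M) integer array, row c lists the token ids in cluster c
    n_probe:  number of clusters to score exactly
    """
    centroids = weights[clusters].mean(axis=1)          # (C, d) mean embedding per cluster
    top = np.argsort(-(centroids @ hidden))[:n_probe]   # ids of the best-scoring clusters
    candidates = clusters[top].ravel()                  # token ids inside those clusters
    logits = weights[candidates] @ hidden               # exact logits for candidates only
    return int(candidates[np.argmax(logits)])
```

Probing every cluster recovers the dense argmax exactly; smaller `n_probe` values trade accuracy for fewer exact logit computations.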

7. GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

ArXiv ID: 2603.13875

Authors: Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev

Abstract: Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key--value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.

Comment: Memory-efficient architecture: writes long context into compact prefix memory via test-time gradient descent instead of large KV caches.

Relevance: 9 Novelty: 8

8. On Interpolation Formulas Describing Neural Network Generalization

ArXiv ID: 2603.13872

Authors: Jin Guo, Roy Y. He, Jean-Michel Morel

Abstract: In 2020 Domingos introduced an interpolation formula valid for "every model trained by gradient descent". He concluded that such models behave approximately as kernel machines. In this work, we extend the Domingos formula to stochastic training. We introduce a stochastic gradient kernel that extends the deterministic version via a continuous-time diffusion approximation. We prove stochastic Domingos theorems and show that the expected network output admits a kernel-machine representation with optimizer-specific weighting. It reveals that training samples contribute through loss-dependent weights and gradient alignment along the training trajectory. We then link the generalization error to the null space of the integral operator induced by the stochastic gradient kernel. The same path-kernel viewpoint provides a unified interpretation of diffusion models and GANs: diffusion induces stage-wise, noise-localized corrections, whereas GANs induce distribution-guided corrections shaped by discriminator geometry. We visualize the evolution of implicit kernels during optimization and quantify out-of-distribution behaviors through a series of numerical experiments. Our results support a feature-space memory view of learning: training stores data-dependent information in an evolving tangent feature geometry, and predictions at test time arise from kernel-weighted retrieval and aggregation of these stored features, with generalization governed by alignment between test points and the learned feature memory.

Comment: Theory of training dynamics: extends Domingos-style kernel interpolation to stochastic gradient training with optimizer-specific path kernels.

Relevance: 9 Novelty: 8

9. Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

ArXiv ID: 2603.14087

Authors: Mark Rofin, Jalal Naghiyev, Michael Hahn

Abstract: Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.

Comment: Representation learning analysis: identifies which next-token gradient components cause transformers to develop seemingly redundant abstract features.

Relevance: 9 Novelty: 8

10. Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

ArXiv ID: 2603.15678

Authors: Yongzhong Xu

Abstract: Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce Spectral Edge Dynamics (SED) to quantify this structure: a rolling-window SVD of parameter updates reveals a sharp boundary -- the spectral edge -- between coherent optimization directions and stochastic noise, identified via the maximum consecutive singular value ratio $σ_k / σ_{k+1}$. Across a 51M-parameter TinyStories model (4 seeds) and GPT-2 124M under distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse). The effective signal rank adapts to task complexity ($k^* = 2$ at 51M, $k^* = 3$ at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size -- a lag flip reflecting the timescale of trajectory integration. Johnson--Lindenstrauss projection to $d = 10W$ dimensions (e.g., $d = 100$ for $W = 10$) preserves the spectral gap within $5.7\%$, making the framework applicable to models of arbitrary scale. In companion work, the same spectral geometry provides early-warning signals of grokking -- predicting generalization $600$--$1{,}700$ steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.

Comment: Training dynamics analysis: spectral-edge SVD reveals low-rank signal-noise structure and phase transitions in transformer optimization trajectories.

Relevance: 9 Novelty: 8
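
The spectral-edge criterion described in the abstract above (the largest ratio between consecutive singular values of a window of parameter updates) is simple enough to sketch directly. The code below is an illustration based only on that description; the function names, the row-per-update layout, and the small floor used to avoid division by zero are assumptions, and the Johnson-Lindenstrauss projection step is omitted.

```python
import numpy as np

def spectral_edge(updates):
    """Locate the largest gap in the singular-value spectrum of a window.

    updates: (W, P) array with one flattened parameter update per row.
    Returns (k, ratio), where k is the number of singular values above the
    gap and ratio is sigma_k / sigma_{k+1}.
    """
    s = np.linalg.svd(updates, compute_uv=False)   # sorted in descending order
    ratios = s[:-1] / np.maximum(s[1:], 1e-12)     # floor avoids division by zero
    k = int(np.argmax(ratios)) + 1
    return k, float(ratios[k - 1])

def rolling_edges(all_updates, window):
    """Apply spectral_edge to every length-`window` slice of consecutive updates."""
    return [spectral_edge(all_updates[t - window:t])
            for t in range(window, len(all_updates) + 1)]
```

For a window whose singular values are 10, 9, 0.1, and 0.09, the edge sits at k = 2 with a ratio of about 90, separating two coherent directions from the noise floor.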

11. The Phenomenology of Hallucinations

ArXiv ID: 2603.13911

Authors: Valeria Ruscio, Keiran Thompson

Abstract: We show that language models hallucinate not because they fail to detect uncertainty, but because of a failure to integrate it into output generation. Across architectures, uncertain inputs are reliably identified, occupying high-dimensional regions with 2-3$\times$ the intrinsic dimensionality of factual inputs. However, this internal signal is weakly coupled to the output layer: uncertainty migrates into low-sensitivity subspaces, becoming geometrically amplified yet functionally silent. Topological analysis shows that uncertainty representations fragment rather than converging to a unified abstention state, while gradient and Fisher probes reveal collapsing sensitivity along the uncertainty direction. Because cross-entropy training provides no attractor for abstention and uniformly rewards confident prediction, associative mechanisms amplify these fractured activations until residual coupling forces a committed output despite internal detection. Causal interventions confirm this account by restoring refusal when uncertainty is directly connected to logits.

Comment: Representation-level theory of hallucination: uncertainty is internally encoded but weakly coupled to logits, explaining failure to abstain.

Relevance: 9 Novelty: 8

12. Enhancing LLM Training via Spectral Clipping

ArXiv ID: 2603.14315

Authors: Xiaowen Jiang, Andrei Semenov, Sebastian U. Stich

Abstract: While spectral-based optimizers like Muon operate directly on the spectrum of updates, standard adaptive methods such as AdamW do not account for the global spectral structure of weights and gradients, leaving them vulnerable to two empirical issues in large language model (LLM) training: (i) the optimizer updates can have large spectral norms, potentially destabilizing training and degrading generalization; (ii) stochastic gradient noise can exhibit sparse spectral spikes, with a few dominant singular values much larger than the rest. We propose SPECTRA, a general framework addressing these by (i) post-spectral clipping of updates to enforce spectral-norm constraints; (ii) optional pre-spectral clipping of gradients to suppress spectral noise spikes. We prove that post-clipping constitutes a Composite Frank-Wolfe method with spectral-norm constraints and weight regularization, recovering Frobenius and $\ell_{\infty}$-norm regularization with SGD-based and sign-based methods. We further analyze how pre-clipping mitigates sparse spectral spikes. We propose efficient soft spectral clipping via Newton-Schulz iterations, avoiding expensive SVD. Experiments on LLM pretraining show SPECTRA uniformly improves validation loss for various optimizers, including AdamW, Signum, and AdEMAMix, with the best-performing variants achieving state-of-the-art results. Models trained with SPECTRA exhibit smaller weight norms, confirming the link between spectral clipping and regularization.

Comment: Spectral clipping is a general optimizer-side efficiency/stability method for LLM training with theory and scalable Newton-Schulz implementation.

Relevance: 9 Novelty: 8
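
As a reference point for what "post-spectral clipping of updates" means in the SPECTRA abstract, the following sketch caps the singular values of an update matrix at a threshold. Note the substitution: the paper proposes soft clipping via Newton-Schulz iterations specifically to avoid an SVD, whereas this illustration uses an explicit SVD with hard clipping because it is the simplest exact form. The function name and the `max_norm` parameter are hypothetical.

```python
import numpy as np

def spectral_clip(update, max_norm):
    """Cap every singular value of `update` at max_norm (hard clipping via SVD)."""
    u, s, vt = np.linalg.svd(update, full_matrices=False)
    # Scale the columns of u by the clipped singular values, then recombine.
    return (u * np.minimum(s, max_norm)) @ vt
```

After clipping, the spectral norm of the result is at most `max_norm`, and updates already inside the constraint are returned unchanged up to floating-point error.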

13. Power-Law Spectrum of the Random Feature Model

ArXiv ID: 2603.14578

Authors: Elliot Paquette, Ke Liang Xiao, Yizhe Zhu

Abstract: Scaling laws for neural networks, in which the loss decays as a power-law in the number of parameters, data, and compute, depend fundamentally on the spectral structure of the data covariance, with power-law eigenvalue decay appearing ubiquitously in vision and language tasks. A central question is whether this spectral structure is preserved or destroyed when data passes through the basic building block of a neural network: a random linear projection followed by a nonlinear activation. We study this question for the random feature model: given data $x \sim N(0,H)\in \mathbb{R}^v$ where $H$ has $α$-power-law spectrum ($λ_j(H ) \asymp j^{-α}$, $α> 1$), a Gaussian sketch matrix $W \in \mathbb{R}^{v\times d}$, and an entrywise monomial $f(y) = y^{p}$, we characterize the eigenvalues of the population random-feature covariance $\mathbb{E}_{x }[\frac{1}{d}f(W^\top x )^{\otimes 2}]$. We prove matching upper and lower bounds: for all $1 \leq j \leq c_1 d \log^{-(p+1)}(d)$, the $j$-th eigenvalue is of order $\left(\log^{p-1}(j+1)/j\right)^α$. For $ c_1 d \log^{-(p+1)}(d)\leq j\leq d$, the $j$-th eigenvalue is of order $j^{-α}$ up to a polylog factor. That is, the power-law exponent $α$ is inherited exactly from the input covariance, modified only by a logarithmic correction that depends on the monomial degree $p$. The proof combines a dyadic head-tail decomposition with Wick chaos expansions for higher-order monomials and random matrix concentration inequalities.

Comment: Derives power-law spectral preservation results for random feature models, directly addressing representation structure in core architectures.

Relevance: 9 Novelty: 8

14. Effective Sparsity: A Unified Framework via Normalized Entropy and the Effective Number of Nonzeros

ArXiv ID: 2603.13826

Authors: Haoyu He, Hao Wang, Jiashan Wang, Hao Zeng

Abstract: Classical sparsity promoting methods rely on the l0 norm, which treats all nonzero components as equally significant. In practical inverse problems, however, solutions often exhibit many small amplitude components that have little effect on reconstruction but lead to an overestimation of signal complexity. We address this limitation by shifting the paradigm from discrete cardinality to effective sparsity. Our approach introduces the effective number of nonzeros (ENZ), a unified class of normalized entropy-based regularizers, including Shannon and Renyi forms, that quantifies the concentration of significant coefficients. We show that, unlike the classical l0 norm, the ENZ provides a stable and continuous measure of effective sparsity that is insensitive to negligible perturbations. For noisy linear inverse problems, we establish theoretical guarantees under the Restricted Isometry Property (RIP), proving that ENZ based recovery is unique and stable. We also derive a decomposition showing that the ENZ equals the support cardinality times a distributional efficiency term, thereby linking entropy with l0 regularization. Numerical experiments show that this effective sparsity framework outperforms traditional cardinality based methods in robustness and accuracy.

Comment: Defines effective sparsity via normalized-entropy regularizers with RIP-based recovery guarantees, offering a new theoretical sparsity framework.

Relevance: 9 Novelty: 8
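
One way to read the "effective number of nonzeros" in the abstract above is as the exponential of the Shannon entropy of the normalized coefficient magnitudes. That reading is an assumption made here for illustration; the paper's exact normalization and its Renyi variants may differ. Under this assumption, ENZ = exp(H) = k * exp(H - log k) for a vector with k nonzeros, which matches the abstract's "support cardinality times a distributional efficiency term" decomposition.

```python
import numpy as np

def effective_nonzeros(x):
    """exp(Shannon entropy) of p_i = |x_i| / sum_j |x_j| (assumed ENZ form)."""
    mags = np.abs(np.asarray(x, dtype=float))
    total = mags.sum()
    if total == 0:
        return 0.0                      # the zero vector has no active components
    p = mags[mags > 0] / total          # drop exact zeros so log(p) is finite
    return float(np.exp(-(p * np.log(p)).sum()))
```

A vector with four equal entries scores 4, a one-hot vector scores 1, and adding a negligible third component to two equal entries barely moves the score away from 2, whereas the l0 count would jump from 2 to 3.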

15. High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise

ArXiv ID: 2603.14514

Authors: Avik Kar, Siddharth Chandak, Rahul Singh, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos

Abstract: We present the first uniform-in-time high-probability bound for SGD under the PL condition, where the gradient noise contains both Markovian and martingale difference components. This significantly broadens the scope of finite-time guarantees, as the PL condition arises in many machine learning and deep learning models while Markovian noise naturally arises in decentralized optimization and online system identification problems. We further allow the magnitude of noise to grow with the function value, enabling the analysis of many practical sampling strategies. In addition to the high-probability guarantee, we establish a matching $1/k$ decay rate for the expected suboptimality. Our proof technique relies on the Poisson equation to handle the Markovian noise and a probabilistic induction argument to address the lack of almost-sure bounds on the objective. Finally, we demonstrate the applicability of our framework by analyzing three practical optimization problems: token-based decentralized linear regression, supervised learning with subsampling for privacy amplification, and online system identification.

Comment: Establishes the first uniform-in-time high-probability SGD bounds under PL with Markovian noise, a foundational optimization theory result.

Relevance: 9 Novelty: 8
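
The token-based decentralized regression setting mentioned in the abstract can be simulated in a few lines. This is an illustrative sketch under stated assumptions, not the paper's experiment: a token performs a lazy random walk on a ring of nodes, each node holds one data point, and SGD uses the gradient of the current node's local loss with a step size proportional to $1/k$. The ring topology, the step-size constants, and the name `token_sgd` are all assumptions.

```python
import numpy as np

def token_sgd(n_nodes=20, dim=5, n_steps=20000, step0=2.0, offset=50.0, seed=0):
    """SGD on least squares where the sampled node follows a Markov chain.

    Returns the suboptimality f(theta_k) - f* after each step.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_nodes, dim))
    b = A @ rng.standard_normal(dim) + 0.1 * rng.standard_normal(n_nodes)

    def loss(theta):
        return 0.5 * np.mean((A @ theta - b) ** 2)

    # Exact minimizer, used only to measure suboptimality.
    f_opt = loss(np.linalg.lstsq(A, b, rcond=None)[0])

    theta = np.zeros(dim)
    node = 0
    gaps = np.empty(n_steps)
    for k in range(1, n_steps + 1):
        # Lazy random walk on the ring: move left, stay, or move right.
        # Its stationary distribution is uniform over the nodes.
        node = (node + rng.integers(-1, 2)) % n_nodes
        # Gradient of the local loss 0.5 * (a_node . theta - b_node)^2.
        grad = (A[node] @ theta - b[node]) * A[node]
        theta = theta - step0 / (k + offset) * grad
        gaps[k - 1] = loss(theta) - f_opt
    return gaps

gaps = token_sgd()
```

Consecutive gradients are correlated because the token can only move to neighboring nodes, which is the kind of Markovian noise the abstract refers to. Running several seeds and plotting `gaps` on log-log axes is one way to compare the observed decay with the $1/k$ rate.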

16. ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

ArXiv ID: 2603.14549

Authors: Surendra Pathak, Bo Han

Abstract: While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance. We therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while aggressively slashing computational FLOPs by ~80%.

Comment: Model efficiency: training-free LVLM token pruning that corrects attention shift and merges redundant tokens while remaining KV-cache compatible.

Relevance: 9 Novelty: 8
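
The general prune-then-merge pattern described in this abstract can be sketched as follows. This is a generic illustration, not ASAP itself: it does not model the attention-shift correction or the bidirectional soft mask, and it assumes importance scores are already given. Each dropped token is folded into its most similar kept token (by cosine similarity) using a score-weighted average. The function name `prune_and_merge` and the merging rule are assumptions.

```python
import numpy as np

def prune_and_merge(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and merge the rest into them.

    tokens: (n, d) array of token features (assumed nonzero rows).
    scores: (n,) array of positive importance scores (assumed given).
    Returns the merged (keep, d) features and the indices of the kept tokens.
    """
    tokens = np.asarray(tokens, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]
    kept = np.sort(order[:keep])      # preserve original token order
    dropped = order[keep:]

    # Cosine similarity between each dropped token and each kept token.
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit[dropped] @ unit[kept].T
    target = sim.argmax(axis=1)       # most similar kept token per dropped token

    # Score-weighted sums; np.add.at accumulates correctly on repeated targets.
    merged = tokens[kept] * scores[kept][:, None]
    weight = scores[kept].copy()
    np.add.at(merged, target, tokens[dropped] * scores[dropped][:, None])
    np.add.at(weight, target, scores[dropped])
    return merged / weight[:, None], kept

rng = np.random.default_rng(0)
features = rng.standard_normal((10, 4))
importance = rng.random(10) + 0.1
merged, kept = prune_and_merge(features, importance, keep=4)
```

Because every output row is a weighted average of input rows, low-scoring tokens still contribute a little information instead of being discarded outright.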

17. Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models

ArXiv ID: 2603.14517

Authors: Ying Xie

Abstract: Large language models (LLMs) suffer from proactive interference (PI): outdated information in the context window disrupts retrieval of current values. This interference degrades retrieval accuracy log-linearly as stale associations accumulate, a bottleneck that persists regardless of context length and resists prompt-engineering mitigations. Biological brains resolve an analogous challenge through sleep-dependent memory consolidation: synaptic downscaling, selective replay, and targeted forgetting. We propose SleepGate, a biologically inspired framework that augments transformer-based LLMs with a learned sleep cycle over the key-value (KV) cache. SleepGate introduces three mechanisms: (1) a conflict-aware temporal tagger detecting when new entries supersede old ones; (2) a lightweight forgetting gate trained to selectively evict or compress stale cache entries; and (3) a consolidation module that merges surviving entries into compact summaries. These components activate periodically during inference in sleep micro-cycles, governed by an adaptive entropy-based trigger. We formalize a dual-phase training objective jointly optimizing language modeling during the wake phase and post-consolidation retrieval during the sleep phase. Theoretical analysis shows SleepGate reduces the interference horizon from O(n) to O(log n). In experiments with a small-scale transformer (4 layers, 793K parameters), SleepGate achieves 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while all five baselines -- full KV cache, sliding window, H2O, StreamingLLM, and decay-only ablation -- remain below 18%. Our framework offers an architecture-level solution that prompt engineering cannot address.

Comment: Inference-time KV-cache memory management architecture with selective forgetting/compression and theoretical interference reduction.

Relevance: 9 Novelty: 8

18. SVD Contextual Sparsity Predictors for Fast LLM Inference

ArXiv ID: 2603.14110

Authors: Georgii Serbin, Kirill Koshkin, Zhongao Sun, Anastasiya Bistrigova, C. C. Korikov

Abstract: Contextual sparsity is one of the approaches used to reduce computational complexity in the inference process of large language models (LLMs). Existing techniques for efficient LLM inference acceleration based on contextual sparsity with minimal accuracy degradation require training sparse pattern predictors. This paper presents a framework for accelerating inference of ReGLU-based feed-forward networks (FFNs) within LLMs. The proposed framework provides a fast, training-free method for building sparse pattern predictors using truncation-aware singular value decomposition (SVD) of the gate projection matrix, along with a threshold calibration algorithm, and inference executors supporting conditional computation on CUDA and CANN devices. Experiments on three sparse LLMs with an average activation sparsity level of 90% in the FFNs demonstrate up to a 1.8x reduction in end-to-end decoding time while maintaining less than 1% degradation in benchmark scores on tasks involving complex math and code generation. This work advances the deployment of LLMs on edge devices.

Comment: Uses training-free SVD-based contextual sparsity predictors for conditional FFN execution, directly targeting fast LLM inference.

Relevance: 9 Novelty: 7
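
The core idea of predicting active FFN neurons from a low-rank factorization of the gate projection can be sketched in NumPy. This is a simplified illustration under stated assumptions: it uses a plain truncated SVD rather than the truncation-aware variant the abstract mentions, it takes a fixed threshold instead of a calibrated one, and it processes a single input vector. All function names are hypothetical.

```python
import numpy as np

def build_predictor(W_gate, rank):
    """Factor W_gate (d_model x d_ff) into low-rank parts A (d_model x rank), B (rank x d_ff)."""
    U, s, Vt = np.linalg.svd(W_gate, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

def dense_reglu_ffn(x, W_gate, W_up, W_down):
    """Reference ReGLU FFN: (relu(x W_gate) * (x W_up)) W_down."""
    return (np.maximum(x @ W_gate, 0.0) * (x @ W_up)) @ W_down

def sparse_reglu_ffn(x, W_gate, W_up, W_down, predictor, threshold=0.0):
    """Evaluate only the neurons the low-rank predictor expects to be active."""
    A, B = predictor
    approx_gate = (x @ A) @ B                 # cheap estimate of x @ W_gate
    active = np.flatnonzero(approx_gate > threshold)
    # Exact computation restricted to the predicted-active columns and rows.
    gate = np.maximum(x @ W_gate[:, active], 0.0)
    up = x @ W_up[:, active]
    return (gate * up) @ W_down[active], active

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W_gate = rng.standard_normal((d_model, d_ff))
W_up = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))
x = rng.standard_normal(d_model)

predictor = build_predictor(W_gate, rank=4)
y_sparse, active = sparse_reglu_ffn(x, W_gate, W_up, W_down, predictor)
y_dense = dense_reglu_ffn(x, W_gate, W_up, W_down)
print(active.size, np.linalg.norm(y_sparse - y_dense))
```

A lower rank makes the predictor cheaper but causes more missed neurons. Lowering `threshold` below zero makes the selection more conservative, trading some speed for accuracy. With random Gaussian weights about half of the neurons are active, so the savings here are much smaller than in the highly sparse models the abstract describes.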

19. Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention

ArXiv ID: 2603.14483

Authors: Markus W. Baumgartner, Anson Lei, Joe Watson, Ingmar Posner

Abstract: Parametric system identification methods estimate the parameters of explicitly defined physical systems from data. Yet, they remain constrained by the need to provide an explicit function space, typically through a predefined library of candidate functions chosen via available domain knowledge. In contrast, deep learning can demonstrably model systems of broad complexity with high fidelity, but black-box function approximation typically fails to yield explicit descriptive or disentangled representations revealing the structure of a system. We develop a novel identifiability theorem, leveraging causal representation learning, to uncover disentangled representations of system parameters without structural assumptions. We derive a graphical criterion specifying when system parameters can be uniquely disentangled from raw trajectory data, up to permutation and diffeomorphism. Crucially, our analysis demonstrates that global causal structures provide a lower bound on the disentanglement guarantees achievable when considering local state-dependent causal structures. We instantiate system parameter identification as a variational inference problem, leveraging a sparsity-regularised transformer to uncover state-dependent causal structures. We empirically validate our approach across four synthetic domains, demonstrating its ability to recover highly disentangled representations that baselines fail to recover. Corroborating our theoretical analysis, our results confirm that enforcing local causal structure is often necessary for full identifiability.

Comment: Combines causal representation learning with sparse attention and proves identifiability conditions for disentangled system representations.