This is a remedial run for missed papers from 03/18/2026 to 03/18/2026.

Results generated on 03/21/2026.

Personalized Daily ArXiv Papers 2026-03-19

[gpt-5.4]	Prompt	Completion	Total
Token	151212	5195	156407
Cost	$0.38	$0.08	$0.46

Table of contents with paper titles:

ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit Authors: Louis-Pierre Chaintron, Lénaïc Chizat, Javier Maass
A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks Authors: Leonardo Del Grande, Christoph Brune, Marcello Carioni
Path-Constrained Mixture-of-Experts Authors: Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly
Learning When to Attend: Conditional Memory Access for Long-Context LLMs Authors: Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, Stefano Soatto
CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention Authors: Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song
Attention Sinks Induce Gradient Sinks Authors: Yihong Chen, Quanming Yao
A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models Authors: Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard
Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training Authors: Ben S. Southworth, Stephen Thomas
Only relative ranks matter in weight-clustered large language models Authors: Borja Aizpurua, Sukhbinder Singh, Román Orús
ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression Authors: Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, Xiaowen Chu
Computation-Utility-Privacy Tradeoffs in Bayesian Estimation Authors: Sitan Chen, Jingqiu Ding, Mahbod Majid, Walter McKelvie
Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum Authors: Nived Rajaraman, Audrey Huang, Miro Dudik, Robert Schapire, Dylan J. Foster, Akshay Krishnamurthy
Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models Authors: Rui Wu, Hong Xie, Yongjun Li
Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity Authors: Felix Schur
The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions Authors: Rui Wu, Hong Xie, Yongjun Li
Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs Authors: Abhishek Gupta, Aditya Mahajan
Variational Kernel Design for Internal Noise: Gaussian Chaos Noise, Representation Compatibility, and Reliable Deep Learning Authors: Ziran Liu
Discovering Decoupled Functional Modules in Large Language Models Authors: Yanke Yu, Jin Li, Ying Sun, Ping Li, Zhefeng Wang, Yi Zheng
How do LLMs Compute Verbal Confidence Authors: Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Velickovic
Gaussian Process Limit Reveals Structural Benefits of Graph Transformers Authors: Nil Ayday, Lingchu Yang, Debarghya Ghoshdastidar
The Phasor Transformer: Resolving Attention Bottlenecks on the Unit Circle Authors: Dibakar Sigdel
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation Authors: Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide
RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference Authors: Arpit Singh Gautam, Saurabh Jha
Learning Permutation Distributions via Reflected Diffusion on Ranks Authors: Sizhuang He, Yangtian Zhang, Shiyang Zhang, David van Dijk
rSDNet: Unified Robust Neural Learning against Label Noise and Adversarial Attacks Authors: Suryasis Jana, Abhik Ghosh
Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis Authors: Jaein Kim, Hee Bin Yoo, Dong-Sig Han, Byoung-Tak Zhang
RHYME-XT: A Neural Operator for Spatiotemporal Control Systems Authors: Marijn Ruiter, Miguel Aguiar, Jake Rap, Karl H. Johansson, Amritam Das
VideoAtlas: Navigating Long-Form Video in Logarithmic Compute Authors: Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, Naeemullah Khan
Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI Authors: Houston Haynes
Flow Matching Policy with Entropy Regularization Authors: Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control Authors: Zunzhe Zhang, Runhan Huang, Yicheng Liu, Shaoting Zhu, Linzhan Mou, Hang Zhao
LoST: Level of Semantics Tokenization for 3D Shapes Authors: Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen
Variational Phasor Circuits for Phase-Native Brain-Computer Interface Classification Authors: Dibakar Sigdel
Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training Authors: Sahil Tyagi, Feiyi Wang
Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity Authors: Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao, Wei Zhang, Ruobing Xie, Lei Jiang, Hayden Kwok-Hay So, Ngai Wong
KANtize: Exploring Low-bit Quantization of Kolmogorov-Arnold Networks for Efficient Inference Authors: Sohaib Errabii, Olivier Sentieys, Marcello Traiola
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs Authors: Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory Authors: Oliver Zahn, Simran Chana
From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs Authors: Boyong Wu, Sanghwan Kim, Zeynep Akata

1. ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit

ArXiv ID: 2603.18168

Authors: Louis-Pierre Chaintron, Lénaïc Chizat, Javier Maass

Abstract: We establish convergence of the training dynamics of residual neural networks (ResNets) to their joint infinite depth L, hidden width M, and embedding dimension D limit. Specifically, we consider ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime and prove that, after a bounded number of training steps, the error between the ResNet and its large-scale limit is O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). This error rate is empirically tight when measured in embedding space. For a budget of P = Theta(L M D) parameters, this yields a convergence rate O(P^(-1/6)) for the scalings of (L, M, D) that minimize the bound. Our analysis exploits in an essential way the depth-two structure of residual blocks and applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimension. From a technical viewpoint, this work completes the program initiated in the companion paper [Chi25], where it is proved that for a fixed embedding dimension D, the training dynamics converge to a Mean ODE dynamics at rate O(1/L + sqrt(D)/sqrt(L M)). Here, we study the large-D limit of this Mean ODE model and establish convergence at rate O(1/sqrt(D)), yielding the above bound by a triangle inequality. To handle the rich probabilistic structure of the limit dynamics and obtain one of the first rigorous quantitative convergence results for a DMFT-type limit, we combine the cavity method with propagation of chaos arguments at a functional level on so-called skeleton maps, which express the weight updates as functions of CLT-type sums from the past.

Comment: Theoretical characterization of large-scale ResNet training dynamics with rigorous convergence rates in the joint infinite depth-width-dimension limit.

Relevance: 10 Novelty: 9
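
Illustration (not from the paper): the arithmetic behind the O(P^(-1/6)) rate can be checked numerically. The sketch below evaluates the bound from the abstract with its unknown constant set to 1, which is an assumption, and uses the scaling L = M = D = P^(1/3) as one choice that balances the last two terms. The function names `error_bound` and `balanced_shape` are hypothetical.

```python
import math

def error_bound(L, M, D):
    """Bound from the abstract, with the hidden constant set to 1 (an assumption)."""
    return 1.0 / L + math.sqrt(D / (L * M)) + 1.0 / math.sqrt(D)

def balanced_shape(P):
    """One scaling with L * M * D = P that balances the terms: L = M = D = P^(1/3)."""
    s = P ** (1.0 / 3.0)
    return s, s, s

for P in (1e6, 1e9, 1e12):
    L, M, D = balanced_shape(P)
    bound = error_bound(L, M, D)
    # If the bound scales like P^(-1/6), this ratio stays bounded as P grows.
    ratio = bound / P ** (-1.0 / 6.0)
    print(f"P={P:.0e}  bound={bound:.4f}  bound / P^(-1/6) = {ratio:.3f}")
```

With P = L M D fixed, the middle term equals D / sqrt(P) and the last term equals 1 / sqrt(D), so they balance at D = P^(1/3), where both equal P^(-1/6); the 1/L term is then of lower order for any L of order at least P^(1/6).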

2. A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks

ArXiv ID: 2603.17785

Authors: Leonardo Del Grande, Christoph Brune, Marcello Carioni

Abstract: In this paper, we study total variation (TV)-regularized training of infinite-width shallow ReLU neural networks, formulated as a convex optimization problem over measures on the unit sphere. Our approach leverages the duality theory of TV-regularized optimization problems to establish rigorous guarantees on the sparsity of the solutions to the training problem. Our analysis further characterizes how and when this sparsity persists in a low noise regime and for small regularization parameter. The key observation that motivates our analysis is that, for ReLU activations, the associated dual certificate is piecewise linear in the weight space. Its linearity regions, which we name dual regions, are determined by the activation patterns of the data via the induced hyperplane arrangement. Taking advantage of this structure, we prove that, on each dual region, the dual certificate admits at most one extreme value. As a consequence, the support of any minimizer is finite, and its cardinality can be bounded from above by a constant depending only on the geometry of the data-induced hyperplane arrangement. Then, we further investigate sufficient conditions ensuring uniqueness of such sparse solution. Finally, under a suitable non-degeneracy condition on the dual certificate along the boundaries of the dual regions, we prove that in the presence of low label noise and for small regularization parameter, solutions to the training problem remain sparse with the same number of Dirac deltas. Additionally, their location and the amplitudes converge, and, in case the locations lie in the interior of a dual region, the convergence happens with a rate that depends linearly on the noise and the regularization parameter.

Comment: Foundational sparsity theory for infinite-width ReLU networks using dual certificates in TV-regularized training.

Relevance: 10 Novelty: 9

3. Path-Constrained Mixture-of-Experts

ArXiv ID: 2603.18297

Authors: Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly

Abstract: Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling by activating only a subset of parameters for each input. However, conventional MoE routing selects each layer's experts independently, creating N^L possible expert paths for N experts across L layers. This far exceeds typical training set sizes, leading to statistical inefficiency as the model may not learn meaningful structure over such a vast path space. To constrain it, we propose PathMoE, which shares router parameters across consecutive layers. Experiments on 0.9B and 16B parameter models demonstrate consistent improvements on perplexity and downstream tasks over independent routing, while eliminating the need for auxiliary load balancing losses. Analysis reveals that tokens following the same path naturally cluster by linguistic function, with PathMoE producing more concentrated groups, better cross-layer consistency, and greater robustness to routing perturbations. These results offer a new perspective for understanding MoE architectures through the lens of expert paths.

Comment: MoE architecture innovation: constraining cross-layer expert path space by sharing routers across layers.

Relevance: 10 Novelty: 8
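
Illustration (not from the paper): the N^L count in the abstract is easy to put next to a training set size. The sketch below assumes one expert per layer per token, and the sizes (8 experts, 24 layers, 10^12 training tokens) are hypothetical values chosen only to show the scale gap. It does not model the effect of router sharing.

```python
def num_expert_paths(num_experts, num_layers):
    """Number of distinct expert paths when every layer routes independently (N ** L)."""
    return num_experts ** num_layers

# Hypothetical sizes, for illustration only.
paths = num_expert_paths(num_experts=8, num_layers=24)
training_tokens = 10 ** 12

print(f"expert paths    : {paths:.3e}")
print(f"training tokens : {training_tokens:.3e}")
print(f"paths per token : {paths / training_tokens:.3e}")
```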

4. Learning When to Attend: Conditional Memory Access for Long-Context LLMs

ArXiv ID: 2603.17484

Authors: Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, Stefano Soatto

Abstract: Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to $\sim$2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.

Comment: Conditional attention architecture for long-context LLMs that learns token-wise global memory access.

Relevance: 10 Novelty: 8
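
Illustration (not from the paper): a dense NumPy reference for token-wise conditional attention, in which every query attends to a local causal window and only gated queries attend to the full causal prefix. The gate here is a random Boolean vector, which is an assumption standing in for whatever learned decision L2A uses; `conditional_attention`, `use_global`, and `window` are hypothetical names. This dense version saves no compute; it only shows the masking semantics.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_attention(q, k, v, use_global, window):
    """q, k, v: (T, d) arrays; use_global: (T,) bool gate; window: local context size."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    pos = np.arange(T)
    # Key j is visible to query i under the causal mask when j <= i.
    causal = pos[None, :] <= pos[:, None]
    # The local mask additionally requires the key to be within `window` positions.
    local = causal & (pos[:, None] - pos[None, :] < window)
    # Gated queries see the whole causal prefix; the rest see only the local window.
    mask = np.where(use_global[:, None], causal, local)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
use_global = rng.random(T) < 0.2  # hypothetical gate, not a learned one
out = conditional_attention(q, k, v, use_global, window=4)
print(out.shape, int(use_global.sum()), "tokens used global attention")
```

Setting the gate to all ones recovers ordinary causal attention, which gives a simple correctness check for any faster kernel that implements the same semantics.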

5. CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

ArXiv ID: 2603.17946

Authors: Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song

Abstract: Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model's accuracy.

Comment: KV-cache-efficient attention architecture conversion: covariance-aware factorization and nonuniform rank allocation for converting GQA to MLA.

Relevance: 9 Novelty: 8
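
Illustration (not from the paper): the difference between a weight-only low-rank approximation and one that accounts for input activations can be shown with a generic activation-aware truncated SVD. This is a textbook construction, not CARE's pipeline. It assumes a layer computes X @ W with activations X stored as rows and X of full column rank; all function names are hypothetical.

```python
import numpy as np

def truncated_svd(A, r):
    """Best rank-r approximation of A in Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def weight_only_lowrank(W, r):
    """Minimizes ||W - W_r||_F and ignores the inputs."""
    return truncated_svd(W, r)

def activation_aware_lowrank(W, X, r):
    """Rank-r W_r minimizing ||X W - X W_r||_F, assuming X has full column rank."""
    # R is the triangular factor of X, so R.T @ R equals the Gram matrix X.T @ X.
    R = np.linalg.qr(X, mode="r")
    # ||X (W - W_r)||_F = ||R (W - W_r)||_F, so truncate R @ W and map back through R.
    return np.linalg.solve(R, truncated_svd(R @ W, r))

rng = np.random.default_rng(0)
n, d_in, d_out, r = 512, 32, 24, 4
scales = np.logspace(0, -2, d_in)  # strongly anisotropic input features
X = rng.standard_normal((n, d_in)) * scales
W = rng.standard_normal((d_in, d_out))

err_weight = np.linalg.norm(X @ W - X @ weight_only_lowrank(W, r))
err_act = np.linalg.norm(X @ W - X @ activation_aware_lowrank(W, X, r))
print(f"output error, weight-only SVD      : {err_weight:.4f}")
print(f"output error, activation-aware SVD : {err_act:.4f}")
```

Because the activation-aware factor is optimal for the output error at a given rank, its error can never exceed that of the weight-only factor on the same calibration inputs; the gap grows as the activation covariance becomes more anisotropic.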

6. Attention Sinks Induce Gradient Sinks

ArXiv ID: 2603.17771

Authors: Yihong Chen, Quanming Yao

Abstract: Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing studies have largely focused on the forward pass, making it unclear whether their connection is direct or mediated by a training-time mechanism. We study this question from the perspective of backpropagation. Empirically and theoretically, we show that under a causal mask, attention sinks can induce pronounced gradient concentration, which we term gradient sinks. Furthermore, in pre-norm architectures with RMSNorm, massive activations can be understood as an adaptive response to this localized gradient pressure during training. To test this hypothesis, we introduce V-scale, a modification that adjusts value-path backpropagated gradients. In pretrained V-scale models, attention sinks are preserved whereas massive activations are suppressed. These results support the interpretation that gradient sinks are a key training-time mediator linking attention sinks and massive activations.

Comment: Mechanistic Transformer analysis linking attention sinks to gradient sinks and massive activations through backpropagation dynamics.

Relevance: 9 Novelty: 8

7. A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models

ArXiv ID: 2603.17896

Authors: Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard

Abstract: Understanding when learning is statistically possible yet computationally hard is a central challenge in high-dimensional statistics. In this work, we investigate this question in the context of single- and multi-index models, classes of functions widely studied as benchmarks to probe the ability of machine learning methods to discover features in high-dimensional data. Our main contribution is to show that a Noise Sensitivity Exponent (NSE) - a simple quantity determined by the activation function - governs the existence and magnitude of statistical-to-computational gaps within a broad regime of these models. We first establish that, in single-index models with large additive noise, the onset of a computational bottleneck is fully characterized by the NSE. We then demonstrate that the same exponent controls a statistical-computational gap in the specialization transition of large separable multi-index models, where individual components become learnable. Finally, in hierarchical multi-index models, we show that the NSE governs the optimal computational rate in which different directions are sequentially learned. Taken together, our results identify the NSE as a unifying property linking noise robustness, computational hardness, and feature specialization in high-dimensional learning.

Comment: Theory of statistical-to-computational gaps in high-dimensional learning via a unifying noise sensitivity exponent.

Relevance: 9 Novelty: 8

8. Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

ArXiv ID: 2603.17970

Authors: Ben S. Southworth, Stephen Thomas

Abstract: Orthogonalized-momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix-valued momentum updates via a short polar-decomposition iteration. However, polar-factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware-dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon's polar update with a triangular (Cholesky-like) whitening surrogate inspired by classical Gram-Schmidt and Gauss-Seidel ideas. We show that row-orthonormal matrices are fixed points of the MUD map, relate the inner step to symmetric Gauss-Seidel preconditioning of the Gram matrix, and prove quadratic local convergence near the fixed point. MUD yields consistent 10-50% wall-clock improvements over tuned AdamW and Muon in time-to-perplexity, typically converging slightly slower per step than Muon but with substantially lower optimizer overhead. Relative to Muon, MUD improves peak tokens/s by roughly $1.3-2.6\times$ across most settings and up to nearly $3\times$ on GPT-2 large on an A100. We also demonstrate training an ESM-2 150M protein language model, where MUD matches Muon-level validation perplexity in significantly less wall-clock time.

Comment: Training efficiency method: a lower-overhead whitening optimizer for faster transformer training with convergence analysis.

Relevance: 9 Novelty: 8
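
Illustration (not from the paper): the idea of whitening a momentum matrix with a triangular factor can be shown with an exact Cholesky whitening step. This is a simpler stand-in, not the MUD map itself, which the abstract describes in terms of a Gauss-Seidel inner step. The function name `cholesky_whiten` and the matrix sizes are hypothetical, and the input is assumed to have full row rank.

```python
import numpy as np

def cholesky_whiten(M, eps=0.0):
    """Return L^{-1} M, where L L^T = M M^T (+ eps I), so the result has orthonormal rows."""
    G = M @ M.T
    if eps:
        G = G + eps * np.eye(G.shape[0])
    L = np.linalg.cholesky(G)  # lower-triangular factor of the Gram matrix
    # A general solve is used for brevity; a triangular solve would be cheaper.
    return np.linalg.solve(L, M)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 16))  # hypothetical momentum matrix with rows <= columns
Q = cholesky_whiten(M)

print("rows orthonormal:", np.allclose(Q @ Q.T, np.eye(4)))
print("fixed point     :", np.allclose(cholesky_whiten(Q), Q))
```

The second check mirrors the fixed-point property stated in the abstract: a matrix whose rows are already orthonormal has an identity Gram matrix, so whitening leaves it unchanged.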

9. Only relative ranks matter in weight-clustered large language models

ArXiv ID: 2603.17917

Authors: Borja Aizpurua, Sukhbinder Singh, Román Orús

Abstract: Large language models (LLMs) contain billions of parameters, yet many exact values are not essential. We show that what matters most is the relative rank of weights (whether one connection is stronger or weaker than another) rather than precise magnitudes. To reduce the number of unique weight values, we apply weight clustering to pretrained models, replacing every weight matrix with K shared values from K-means. For Llama 3.1-8B-Instruct and SmolLM2-135M, reducing each matrix to only 16-64 distinct values preserves strong accuracy without retraining, providing a simple, training-free method to compress LLMs on disk. Optionally fine-tuning only the cluster means (centroids) recovers 30-40 percent of the remaining accuracy gap at minimal cost. We then systematically randomize cluster means while keeping assignments fixed. Scrambling the relative ranks of the clusters degrades quality sharply (perplexity can increase by orders of magnitude), even when global statistics such as mean and variance are preserved. In contrast, rank-preserving randomizations cause almost no loss at mid and late layers. On the other hand, when many layers are perturbed simultaneously, progressive layer-by-layer replacement reveals that scale drift, not rank distortion, is the dominant collapse mechanism; however, an affine correction w' = aw + b with a > 0 (which preserves both rank order and overall weight distribution) can substantially delay this drift. This rank-based perspective offers a new lens on model compression and robustness.

Comment: Model compression/representation learning: shows clustered LLM weights preserve performance primarily through relative rank structure rather than exact magnitudes.

Relevance: 9 Novelty: 8
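
Illustration (not from the paper): a minimal sketch of weight clustering on a single random matrix, using plain Lloyd iterations on scalars as a stand-in for whatever K-means implementation the authors use. It then contrasts a rank-preserving change of the centroids (an affine map with a > 0, as in the abstract) with a shuffle that breaks rank order. The function names, matrix size, and constants are hypothetical.

```python
import numpy as np

def kmeans_1d(values, k, iters=25):
    """Plain Lloyd iterations on scalars; centroids start at evenly spaced quantiles."""
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = values[assign == j]
            if members.size:  # leave empty clusters where they are
                centroids[j] = members.mean()
    assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, assign

def cluster_weights(W, k):
    """Return k shared values and, for each entry of W, the index of its value."""
    centroids, assign = kmeans_1d(W.ravel(), k)
    return centroids, assign.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
centroids, assign = cluster_weights(W, k=16)
W_clustered = centroids[assign]  # every entry replaced by its cluster's shared value

# Rank-preserving change: an affine map with a > 0 keeps the centroid ordering.
rank_preserving = 1.1 * centroids + 0.01
# Rank-scrambling change: the same values, shuffled across clusters.
rank_scrambled = rng.permutation(centroids)

print("distinct values :", np.unique(W_clustered).size)
print("clustering error:", np.linalg.norm(W - W_clustered))
print("scrambled error :", np.linalg.norm(W - rank_scrambled[assign]))
```

Storing `assign` with 4 bits per entry plus 16 centroids is what makes this kind of clustering a compression scheme; the sketch measures only reconstruction error on one matrix and says nothing about model accuracy.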

10. ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

ArXiv ID: 2603.17435

Authors: Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, Xiaowen Chu

Abstract: Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This "load-compressed, compute-decompressed" design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21x kernel-level speedup over NVIDIA's cuBLAS, and expedites end-to-end inference by an average of 1.22x over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.

Comment: High-performance systems: hardware-aware lossless compression with fused decompression-GEMM for faster, memory-efficient LLM inference on GPUs.

Relevance: 9 Novelty: 8

11. Computation-Utility-Privacy Tradeoffs in Bayesian Estimation

ArXiv ID: 2603.18254

Authors: Sitan Chen, Jingqiu Ding, Mahbod Majid, Walter McKelvie

Abstract: Bayesian methods lie at the heart of modern data science and provide a powerful scaffolding for estimation in data-constrained settings and principled quantification and propagation of uncertainty. Yet in many real-world use cases where these methods are deployed, there is a natural need to preserve the privacy of the individuals whose data is being scrutinized. While a number of works have attempted to approach the problem of differentially private Bayesian estimation through either reasoning about the inherent privacy of the posterior distribution or privatizing off-the-shelf Bayesian methods, these works generally do not come with rigorous utility guarantees beyond low-dimensional settings. In fact, even for the prototypical tasks of Gaussian mean estimation and linear regression, it was unknown how close one could get to the Bayes-optimal error with a private algorithm, even in the simplest case where the unknown parameter comes from a Gaussian prior. In this work, we give the first efficient algorithms for both of these problems that achieve mean-squared error $(1+o(1))\mathrm{OPT}$ and additionally show that both tasks exhibit an intriguing computational-statistical gap. For Bayesian mean estimation, we prove that the excess risk achieved by our method is optimal among all efficient algorithms within the low-degree framework, yet is provably worse than what is achievable by an exponential-time algorithm. For linear regression, we prove a qualitatively similar lower bound. Our algorithms draw upon the privacy-to-robustness framework of arXiv:2212.05015, but with the curious twist that to achieve private Bayes-optimal estimation, we need to design sum-of-squares-based robust estimators for inherently non-robust objects like the empirical mean and OLS estimator. Along the way we also add to the sum-of-squares toolkit a new kind of constraint based on short-flat decompositions.

Comment: Foundational theory for differentially private Bayesian estimation, giving efficient near-Bayes-optimal algorithms and computational-statistical lower bounds.

Relevance: 8 Novelty: 9
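
Illustration (not from the paper): for intuition about what "Bayes-optimal error" means in the simplest non-private case, the sketch below gives the textbook posterior mean and Bayes risk for a scalar Gaussian mean with a zero-mean Gaussian prior. It involves no privacy mechanism and is not the paper's algorithm; the function and parameter names are hypothetical.

```python
def posterior_mean(sample_mean, n, noise_var, prior_var):
    """Posterior mean of mu given n samples from N(mu, noise_var), with prior mu ~ N(0, prior_var)."""
    shrink = prior_var / (prior_var + noise_var / n)
    return shrink * sample_mean

def bayes_risk(n, noise_var, prior_var):
    """Mean-squared error of the posterior mean, i.e. the non-private optimum in this model."""
    return prior_var * (noise_var / n) / (prior_var + noise_var / n)

n, noise_var, prior_var = 10, 1.0, 1.0
print("shrunk estimate  :", posterior_mean(0.5, n, noise_var, prior_var))
print("Bayes risk       :", bayes_risk(n, noise_var, prior_var))
print("sample-mean risk :", noise_var / n)
```

The Bayes risk is always below both the prior variance and the sample-mean risk noise_var / n, which is the gap a private estimator has to preserve to be near Bayes-optimal.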

12. Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

ArXiv ID: 2603.18325

Authors: Nived Rajaraman, Audrey Huang, Miro Dudik, Robert Schapire, Dylan J. Foster, Akshay Krishnamurthy

Abstract: Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.

Comment: Provable algorithmic gains from autocurriculum for reasoning-model SFT and RL fine-tuning.

Relevance: 8 Novelty: 9

13. Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models

ArXiv ID: 2603.17384

Authors: Rui Wu, Hong Xie, Yongjun Li

Abstract: Current continuous generative models (e.g., Diffusion Models, Flow Matching) implicitly assume that locally consistent causal mechanisms naturally yield globally coherent counterfactuals. In this paper, we prove that this assumption fails fundamentally when the causal graph exhibits non-trivial homology (e.g., structural conflicts or hidden confounders). We formalize structural causal models as cellular sheaves over Wasserstein spaces, providing a strict algebraic topological definition of cohomological obstructions in measure spaces. To ensure computational tractability and avoid deterministic singularities (which we define as manifold tearing), we introduce entropic regularization and derive the Entropic Wasserstein Causal Sheaf Laplacian, a novel system of coupled non-linear Fokker-Planck equations. Crucially, we prove an entropic pullback lemma for the first variation of pushforward measures. By integrating this with the Implicit Function Theorem (IFT) on Sinkhorn optimality conditions, we establish a direct algorithmic bridge to automatic differentiation (VJP), achieving O(1)-memory reverse-mode gradients strictly independent of the iteration horizon. Empirically, our framework successfully leverages thermodynamic noise to navigate topological barriers ("entropic tunneling") in high-dimensional scRNA-seq counterfactuals. Finally, we invert this theoretical framework to introduce the Topological Causal Score, demonstrating that our Sheaf Laplacian acts as a highly sensitive algebraic detector for topology-aware causal discovery.

Comment: Foundational theory for generative causal models using sheaf/cohomology and an O(1)-memory reverse-mode differentiation bridge via Sinkhorn-IFT-VJP.

Relevance: 8 Novelty: 9

14. Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity

ArXiv ID: 2603.17577

Authors: Felix Schur

Abstract: Can latent actions and environment dynamics be recovered from offline trajectories when actions are never observed? We study this question in a setting where trajectories are action-free but tagged with demonstrator identity. We assume that each demonstrator follows a distinct policy, while the environment dynamics are shared across demonstrators and identity affects the next observation only through the chosen action. Under these assumptions, the conditional next-observation distribution $p(o_{t+1}\mid o_t,e)$ is a mixture of latent action-conditioned transition kernels with demonstrator-specific mixing weights. We show that this induces, for each state, a column-stochastic nonnegative matrix factorization of the observable conditional distribution. Using sufficiently scattered policy diversity and rank conditions, we prove that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels. We extend the result to continuous observation spaces via a Gram-determinant minimum-volume criterion, and show that continuity of the transition map over a connected state space upgrades local permutation ambiguities to a single global permutation. A small amount of labeled action data then suffices to fix this final ambiguity. These results establish demonstrator diversity as a principled source of identifiability for learning latent actions and dynamics from offline RL data.

Comment: Foundational identifiability theory for recovering latent actions and dynamics from offline trajectories using demonstrator diversity.

Relevance: 8 Novelty: 9
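
Illustration (not from the paper): the mixture structure in the abstract can be written as a matrix product for a single fixed state o_t. The sketch below builds random latent transition kernels and demonstrator policies, forms the observable conditional distribution, and shows the permutation ambiguity of the latent action labels. All sizes and variable names are hypothetical, and no identification algorithm is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_actions, n_demos = 6, 3, 5  # hypothetical sizes for one fixed state o_t

# T[:, a] = p(o_{t+1} | o_t, a): one column per latent action, shared by all demonstrators.
T = rng.dirichlet(np.ones(n_obs), size=n_actions).T
# Pi[:, e] = pi_e(a | o_t): one column of mixing weights per demonstrator.
Pi = rng.dirichlet(np.ones(n_actions), size=n_demos).T

# Observable quantity: P[:, e] = p(o_{t+1} | o_t, e) = sum_a T[:, a] * Pi[a, e].
P = T @ Pi

print("columns of P sum to 1:", np.allclose(P.sum(axis=0), 1.0))
print("rank of P            :", np.linalg.matrix_rank(P))

# Relabeling the latent actions in both factors leaves P unchanged,
# so the factors can be recovered at best up to a permutation.
perm = [2, 0, 1]
print("permuted factors agree:", np.allclose(T[:, perm] @ Pi[perm, :], P))
```

The rank of P is at most the number of latent actions, which is why the identifiability result needs rank conditions together with sufficiently diverse demonstrator policies.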

15. The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions

ArXiv ID: 2603.17385

Authors: Rui Wu, Hong Xie, Yongjun Li

Abstract: Judea Pearl's do-calculus provides a foundation for causal inference, but its translation to continuous generative models remains fraught with geometric challenges. We establish the fundamental limits of such interventions. We define the Counterfactual Event Horizon and prove the Manifold Tearing Theorem: deterministic flows inevitably develop finite-time singularities under extreme interventions. We establish the Causal Uncertainty Principle for the trade-off between intervention extremity and identity preservation. Finally, we introduce Geometry-Aware Causal Flow (GACF), a scalable algorithm that utilizes a topological radar to bypass manifold tearing, validated on high-dimensional scRNA-seq data.

Comment: Theoretical study of geometric limits of causal interventions in continuous generative models, introducing manifold tearing and a causal uncertainty principle.

Relevance: 8 Novelty: 9

16. Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs

ArXiv ID: 2603.17875

Authors: Abhishek Gupta, Aditya Mahajan

Abstract: Markov decision processes (MDPs) are viewed as an optimization of an objective function over certain linear operators over general function spaces. Using the well-established perturbation theory of linear operators, this viewpoint allows one to identify derivatives of the objective function as a function of the linear operators. This leads to a generalization of many well-known results in reinforcement learning to cases with general state and action spaces. Prior results of this type were only established in the finite-state finite-action MDP settings and in settings with certain linear function approximations. The framework also leads to new low-complexity PPO-type reinforcement learning algorithms for general state and action space MDPs.

Comment: Foundational theory for RL/MDPs: operator-theoretic derivation of policy-gradient results for general state/action spaces with unbounded costs.

Relevance: 8 Novelty: 9

17. Variational Kernel Design for Internal Noise: Gaussian Chaos Noise, Representation Compatibility, and Reliable Deep Learning

ArXiv ID: 2603.17365

Authors: Ziran Liu

Abstract: Internal noise in deep networks is usually inherited from heuristics such as dropout, hard masking, or additive perturbation. We ask two questions: what correlation geometry should internal noise have, and is the implemented perturbation compatible with the representations it acts on? We answer these questions through Variational Kernel Design (VKD), a framework in which a noise mechanism is specified by a law family, a correlation kernel, and an injection operator, and is derived from learning desiderata. In a solved spatial subfamily, a quadratic maximum-entropy principle over latent log-fields yields a Gaussian optimizer with precision given by the Dirichlet Laplacian, so the induced geometry is the Dirichlet Green kernel. Wick normalization then gives a canonical positive mean-one gate, Gaussian Chaos Noise (GCh). For the sample-wise gate used in practice, we prove exact Gaussian control of pairwise log-ratio deformation, margin-sensitive ranking stability, and an exact expected intrinsic roughness budget; hard binary masks instead induce singular or coherence-amplified distortions on positive coherent representations. On ImageNet and ImageNet-C, GCh consistently improves calibration and under shift also improves NLL at competitive accuracy.

Comment: Introduces a new internal-noise framework via variational kernel design, deriving Gaussian Chaos Noise with theoretical guarantees on representation distortion.