This is a remedial run for missed papers from 03/16/2026 to 03/16/2026.

Results generated on 03/21/2026.

Personalized Daily ArXiv Papers 2026-03-17

[gpt-5.4]	Prompt	Completion	Total
Token	207618	7887	215505
Cost	$0.52	$0.12	$0.64

Table of contents with paper titles:

Learning to Recall with Transformers Beyond Orthogonal Embeddings Authors: Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu
Mamba-3: Improved Sequence Modeling using State Space Principles Authors: Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, Albert Gu
Deep learning and the rate of approximation by flows Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen
Lost in Aggregation: On a Fundamental Expressivity Limit of Message-Passing Graph Neural Networks Authors: Eran Rosenbluth
Mixture-of-Depths Attention Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
A Family of LLMs Liberated from Static Vocabularies Authors: Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum
Local Urysohn Width: A Topological Complexity Measure for Classification Authors: Xin Li
Neural Networks as Local-to-Global Computations Authors: Vicente Bosca, Robert Ghrist
IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning Authors: Konstantinos Almpanakis, Anna Kreshuk
Self-Distillation of Hidden Layers for Self-Supervised Representation Learning Authors: Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, Graham W. Taylor
Grokking as a Variance-Limited Phase Transition: Spectral Gating and the Epsilon-Stability Threshold Authors: Pratyush Acharya, Habish Dhakal
FlashSampling: Fast and Memory-Efficient Exact Sampling Authors: Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang
Directional Routing in Transformers Authors: Kevin Taylor
Determinism in the Undetermined: Deterministic Output in Charge-Conserving Continuous-Time Neuromorphic Systems with Temporal Stochasticity Authors: Jing Yan, Kang You, Zhezhi He, Yaoyu Zhang
Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs Authors: Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta
Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps Authors: Pengcheng Cheng
Spiking Layer-Adaptive Magnitude-based Pruning Authors: Junqiao Wang, Zhehang Ye, Yuqi Ouyang
Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks Authors: Yuri Kinoshita, Naoki Nishikawa, Taro Toyoizumi
MoLoRA: Composable Specialization via Per-Token Adapter Routing Authors: Shrey Shah, Justin Wagle
Massive Redundancy in Gradient Transport Enables Sparse Online Learning Authors: Aur Shalev Merin
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models Authors: Sijie Li, Biao Qian, Jungong Han
More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search Authors: Gal Dalal, Assaf Hallak, Gal Chechik, Yftah Ziser
Deriving Hyperparameter Scaling Laws via Modern Optimization Theory Authors: Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto
Ablate and Rescue: A Causal Analysis of Residual Stream Hyper-Connections Authors: William Peng, Josheev Rai, Kevin Tseng, Siwei Wang, Sean Wu
Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems Authors: Vladyslav Parakhin
W2T: LoRA Weights Already Know What They Can Do Authors: Xiaolong Han, Ferrante Neri, Zijian Jiang, Fang Wu, Yanfang Ye, Lu Yin, Zehong Wang
SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing Authors: Yuhuan Liu, Haitian Zhong, Xinyuan Xia, Qiang Liu, Shu Wu, Liang Wang
Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs Authors: Linrui Ma, Yufei Cui, Kai Han, Yunhe Wang
Preconditioned One-Step Generative Modeling for Bayesian Inverse Problems in Function Spaces Authors: Zilan Cheng, Li-Lian Wang, Zhongjian Wang
LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs Authors: Ying Zhang, Hang Yu, Haipeng Zhang, Peng Di
Parallelised Differentiable Straightest Geodesics for 3D Meshes Authors: Hippolyte Verninas, Caner Korkmaz, Stefanos Zafeiriou, Tolga Birdal, Simone Foti
Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science Authors: Emmanuel Dupoux, Yann LeCun, Jitendra Malik
Rethinking Machine Unlearning: Models Designed to Forget via Key Deletion Authors: Sonia Laguna, Jorge da Silva Goncalves, Moritz Vandenhirtz, Alain Ryser, Irene Cannistraci, Julia E. Vogt
Transition Flow Matching Authors: Chenrui Ma
In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks Authors: Francesco Sovrano, Lidia Losavio, Giulia Vilone, Marc Langheinrich
Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning Authors: Ping Chen, Xiang Liu, Xingpeng Zhang, Fei Shen, Xun Gong, Zhaoxiang Liu, Zezhou Chen, Huan Hu, Kai Wang, Shiguo Lian
Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions Authors: Quoc Tran-Dinh, Nghia Nguyen-Trung
Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty Authors: Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, Yuqing Yang
PhasorFlow: A Python Library for Unit Circle Based Computing Authors: Dibakar Sigdel, Namuna Panday
Universe Routing: Why Self-Evolving Agents Need Epistemic Control Authors: Zhaohui Geoffrey Wang
Fold-CP: A Context Parallelism Framework for Biomolecular Modeling Authors: Dejun Lin, Simon Chu, Vishanth Iyer, Youhan Lee, John St John, Kevin Boyd, Brian Roland, Xiaowei Ren, Guoqing Zhou, Zhonglin Cao, Polina Binder, Yuliya Zhautouskaya, Jakub Zakrzewski, Maximilian Stadler, Kyle Gion, Yuxing Peng, Xi Chen, Tianjing Zhang, Philipp Junk, Michelle Dimon, Paweł Gniewek, Fabian Ortega, McKinley Polen, Ivan Grubisic, Ali Bashir, Graham Holt, Danny Kovtun, Matthias Grass, Luca Naef, Rui Wang, Jian Peng, Anthony Costa, Saee Paliwal, Eddie Calleja, Timur Rvachov, Neha Tadimeti, Roy Tal, Emine Kucukbenli
Interpretable Classification of Time Series Using Euler Characteristic Surfaces Authors: Salam Rabindrajit Luwang, Sushovan Majhi, Vishal Mandal, Atish J. Mitra, Md. Nurujjaman, Buddha Nath Sharma
Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models Authors: Lit Sin Tan, Junzhe Chen, Xiaolong Fu, Lichen Ma, Junshi Huang, Jianzhong Shi, Yan Li, Lijie Wen
Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies Authors: Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski
MONET: Modeling and Optimization of neural NEtwork Training from Edge to Data Centers Authors: Jérémy Morlier, Robin Geens, Stef Cuyckens, Arne Symons, Marian Verhelst, Vincent Gripon, Mathieu Léonardon
AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers Authors: Salim Khazem
SimCert: Probabilistic Certification for Behavioral Similarity in Deep Neural Network Compression Authors: Jingyang Li, Fu Song, Guoqiang Li
Controlled Langevin Dynamics for Sampling of Feedforward Neural Networks Trained with Minibatches Authors: Alessandro Zambon, Francesca Caruso, Riccardo Zecchina, Guido Tiana
Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction Authors: Yanghao Li, Changxin Liu, Yuhao Yi
Mechanistic Origin of Moral Indifference in Language Models Authors: Lingyu Li, Yan Teng, Yingchun Wang
Effective Distillation to Hybrid xLSTM Architectures Authors: Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Authors: Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi
Selective Memory for Artificial Intelligence: Write-Time Gating with Hierarchical Archiving Authors: Oliver Zahn, Simran Chana
Masked BRep Autoencoder via Hierarchical Graph Transformer Authors: Yifei Li, Kang Wu, Wenming Wu, Xiao-Ming Fu
TabKD: Tabular Knowledge Distillation through Interaction Diversity of Learned Feature Bins Authors: Shovon Niverd Pereira, Krishna Khadka, Yu Lei
Mechanistic Foundations of Goal-Directed Control Authors: Alma Lago
Tackling Over-smoothing on Hypergraphs: A Ricci Flow-guided Neural Diffusion Approach Authors: Mengyao Zhou, Zhiheng Zhou, Xiao Han, Xingqin Qi, Guanghui Wang, Guiying Yan
PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units Authors: Mark Deutel, Simon Geis, Axel Plinge
Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences Authors: Artem Sakhno, Ivan Sergeev, Alexey Shestov, Omar Zoloev, Elizaveta Kovtun, Gleb Gusev, Andrey Savchenko, Maksim Makarenko
CATFormer: When Continual Learning Meets Spiking Transformers With Dynamic Thresholds Authors: Vaishnavi Nagabhushana, Kartikay Agrawal, Ayon Borthakur
100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models Authors: Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Kadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou
AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation Authors: Yusuke Takagi, Motonari Kambara, Daichi Yashima, Koki Seno, Kento Tokura, Komei Sugiura

1. Learning to Recall with Transformers Beyond Orthogonal Embeddings

ArXiv ID: 2603.15923

Authors: Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu

Abstract: Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training and retrieve it at inference. Existing theoretical analyses typically study transformers under idealized assumptions such as infinite data or orthogonal embeddings. In realistic settings, however, models are trained on finite datasets with non-orthogonal (random) embeddings. We address this gap by analyzing a single-layer transformer with random embeddings trained with (empirical) gradient descent on a simple token-retrieval task, where the model must identify an informative token within a length-$L$ sequence and learn a one-to-one mapping from tokens to labels. Our analysis tracks the ``early phase'' of gradient descent and yields explicit formulas for the model's storage capacity -- revealing a multiplicative dependence between sample size $N$, embedding dimension $d$, and sequence length $L$. We validate these scalings numerically and further complement them with a lower bound for the underlying statistical problem, demonstrating that this multiplicative scaling is intrinsic under non-orthogonal embeddings.

Comment: Transformer theory under finite data and non-orthogonal embeddings, yielding explicit storage-capacity scalings.

Relevance: 10 Novelty: 9

2. Mamba-3: Improved Sequence Modeling using State Space Principles

ArXiv ID: 2603.15569

Authors: Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, Albert Gu

Abstract: Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While the current Transformer-based models deliver strong model quality, their quadratic compute and linear memory make inference expensive. This has spurred the development of sub-quadratic models with reduced linear compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule that enables richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation for better model performance without increasing decode latency. Together with architectural refinements, our Mamba-3 model achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points compared to the next best model (Gated DeltaNet), with Mamba-3's MIMO variant further improving accuracy by another 1.2 points for a total 1.8 point gain. Across state-size experiments, Mamba-3 achieves comparable perplexity to Mamba-2 despite using half of its predecessor's state size. Our evaluations demonstrate Mamba-3's ability to advance the performance-efficiency Pareto frontier.

Comment: State-space sequence architecture with complex recurrence and MIMO design improving the performance-efficiency frontier.

Relevance: 10 Novelty: 9

3. Deep learning and the rate of approximation by flows

ArXiv ID: 2603.15363

Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen

Abstract: We investigate the dependence of the approximation capacity of deep residual networks on their depth in a continuous dynamical systems setting. This can be formulated as the general problem of quantifying the minimal time-horizon required to approximate a diffeomorphism by flows driven by a given family $\mathcal F$ of vector fields. We show that this minimal time can be identified as a geodesic distance on a sub-Finsler manifold of diffeomorphisms, where the local geometry is characterised by a variational principle involving $\mathcal F$. This connects the learning efficiency of target relationships to their compatibility with the learning architectural choice. Further, the results suggest that the key approximation mechanism in deep learning, namely the approximation of functions by composition or dynamics, differs in a fundamental way from linear approximation theory, where linear spaces and norm-based rate estimates are replaced by manifolds and geodesic distances.

Comment: Gives a theoretical characterization of deep residual network approximation via geodesic distance on a sub-Finsler manifold of diffeomorphisms.

Relevance: 10 Novelty: 9

4. Lost in Aggregation: On a Fundamental Expressivity Limit of Message-Passing Graph Neural Networks

ArXiv ID: 2603.14846

Authors: Eran Rosenbluth

Abstract: We define a generic class of functions that captures most conceivable aggregations for Message-Passing Graph Neural Networks (MP-GNNs), and prove that any MP-GNN model with such aggregations induces only a polynomial number of equivalence classes on all graphs - while the number of non-isomorphic graphs is doubly-exponential (in number of vertices). Adding a familiar perspective, we observe that merely 2 iterations of Color Refinement (CR) induce at least an exponential number of equivalence classes, making the aforementioned MP-GNNs relatively infinitely weaker. Previous results state that MP-GNNs match full CR; however, they concern a weak, 'non-uniform', notion of distinguishing-power where each graph size may require a different MP-GNN to distinguish graphs up to that size. Our results concern both distinguishing between non-equivariant vertices and distinguishing between non-isomorphic graphs.

Comment: Proves a fundamental expressivity limit of message-passing GNNs under generic aggregation, separating them sharply from graph isomorphism procedures.

Relevance: 10 Novelty: 9
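
For readers unfamiliar with Color Refinement, the following is a minimal sketch of the standard procedure that the abstract compares MP-GNNs against. It is not taken from the paper: the function names, the adjacency-dict graph encoding, and the example graphs are illustrative assumptions. Two graphs land in the same equivalence class after k rounds when their color histograms match.

```python
from collections import Counter


def refine(adj, rounds):
    """Run `rounds` iterations of Color Refinement on an undirected graph.

    adj maps each vertex to a list of its neighbours (hypothetical encoding).
    Colors are nested tuples, so they are comparable across different graphs.
    """
    colors = {v: 0 for v in adj}  # uniform initial coloring
    for _ in range(rounds):
        # New color = (own color, sorted multiset of neighbour colors).
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return colors


def color_histogram(adj, rounds=2):
    """Graph-level summary: how many vertices carry each color."""
    return Counter(refine(adj, rounds).values())


# Illustrative graphs (assumptions, not from the paper).
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

same_class = color_histogram(six_cycle) == color_histogram(two_triangles)
different_class = color_histogram(path4) != color_histogram(star4)
```

In this sketch the 6-cycle and the pair of triangles share a class (both are 2-regular, a well-known blind spot of Color Refinement), while the 4-vertex path and star are separated after the first round.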

5. Mixture-of-Depths Attention

ArXiv ID: 2603.15619

Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang

Abstract: Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .

Comment: Introduces a new transformer attention primitive that mixes current-layer and cross-layer KV access, with an accompanying hardware-efficient algorithm nearly matching FlashAttention-2 efficiency.

Relevance: 10 Novelty: 8
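
As a rough illustration of the idea described in the abstract, the sketch below lets a single query attend over sequence KV pairs from the current layer together with depth KV pairs from preceding layers. It assumes one joint softmax over the concatenated set, which the abstract does not specify; the function names and data layout are hypothetical, and the released MoDA code may differ substantially.

```python
import math


def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


def mixed_depth_attend(query, seq_kv, depth_kv):
    """Attend jointly over current-layer and earlier-layer (key, value) pairs.

    ASSUMPTION: both sources share one softmax; this is an illustrative
    choice, not a statement of how the paper combines them.
    """
    keys = [k for k, _ in seq_kv] + [k for k, _ in depth_kv]
    values = [v for _, v in seq_kv] + [v for _, v in depth_kv]
    return attend(query, keys, values)


# Toy inputs: two sequence positions at this layer, one entry from an earlier layer.
query = [1.0, 0.0]
seq_kv = [([1.0, 0.0], [1.0, 2.0]), ([0.0, 1.0], [3.0, 4.0])]
depth_kv = [([0.5, 0.5], [5.0, 6.0])]
output, weights = mixed_depth_attend(query, seq_kv, depth_kv)
```

With an empty `depth_kv`, the sketch reduces to ordinary attention over the current layer's sequence KV pairs.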

6. A Family of LLMs Liberated from Static Vocabularies

ArXiv ID: 2603.15953

Authors: Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum

Abstract: Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.

Comment: Core transformer architecture redesign replacing static token vocabularies with hierarchical byte-level encoding/decoding.

Relevance: 9 Novelty: 9

7. Local Urysohn Width: A Topological Complexity Measure for Classification

ArXiv ID: 2603.15412

Authors: Xin Li

Abstract: We introduce \emph{local Urysohn width}, a complexity measure for classification problems on metric spaces. Unlike VC dimension, fat-shattering dimension, and Rademacher complexity, which characterize the richness of hypothesis \emph{classes}, Urysohn width characterizes the topological-geometric complexity of the classification \emph{problem itself}: the minimum number of connected, diameter-bounded local experts needed to correctly classify all points within a margin-safe region. We prove four main results. First, a \textbf{strict hierarchy theorem}: for every integer $w \geq 1$, there exists a classification problem on a \emph{connected} compact metric space (a bouquet of circles with first Betti number $\beta_1 = w$) whose Urysohn width is exactly~$w$, establishing that topological complexity of the input space forces classifier complexity. Second, a \textbf{topology $\times$ geometry scaling law}: width scales as $\Omega(w \cdot L/D_0)$, where $w$ counts independent loops and $L/D_0$ is the ratio of loop circumference to locality scale. Third, a \textbf{two-way separation from VC dimension}: there exist problem families where width grows unboundedly while VC dimension is bounded by a constant, and conversely, families where VC dimension grows unboundedly while width remains~1. Fourth, a \textbf{sample complexity lower bound}: any learner that must correctly classify all points in the safe region of a width-$w$ problem needs $\Omega(w \log w)$ samples, independent of VC dimension.

Comment: Develops a new theoretical complexity measure for classification based on local Urysohn width, with hierarchy and sample-complexity results.

Relevance: 9 Novelty: 9

8. Neural Networks as Local-to-Global Computations

ArXiv ID: 2603.14831

Authors: Vicente Bosca, Robert Ghrist

Abstract: We construct a cellular sheaf from any feedforward ReLU neural network by placing one vertex for each intermediate quantity in the forward pass and encoding each computational step - affine transformation, activation, output - as a restriction map on an edge. The restricted coboundary operator on the free coordinates is unitriangular, so its determinant is $1$ and the restricted Laplacian is positive definite for every activation pattern. It follows that the relative cohomology vanishes and the forward pass output is the unique harmonic extension of the boundary data. The sheaf heat equation converges exponentially to this output despite the state-dependent switching introduced by piecewise linear activations. Unlike the forward pass, the heat equation propagates information bidirectionally across layers, enabling pinned neurons that impose constraints in both directions, training through local discrepancy minimization without a backward pass, and per-edge diagnostics that decompose network behavior by layer and operation type. We validate the framework experimentally on small synthetic tasks, confirming the convergence theorems and demonstrating that sheaf-based training, while not yet competitive with stochastic gradient descent, obeys quantitative scaling laws predicted by the theory.

Comment: Reinterprets feedforward ReLU networks as local-to-global sheaf computations with harmonic extension and bidirectional heat-equation dynamics.

Relevance: 9 Novelty: 9

9. IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning

ArXiv ID: 2603.15263

Authors: Konstantinos Almpanakis, Anna Kreshuk

Abstract: Self-supervised learning (SSL) has revolutionized representation learning, with Joint-Embedding Architectures (JEAs) emerging as an effective approach for capturing semantic features. Existing JEAs rely on implicit or explicit batch interaction -- via negative sampling or statistical regularization -- to prevent representation collapse. This reliance becomes problematic in regimes where batch sizes must be small, such as high-dimensional scientific data, where memory constraints and class imbalance make large, well-balanced batches infeasible. We introduce IConE (Instance-Contrasted Embeddings), a framework that decouples collapse prevention from the training batch size. Rather than enforcing diversity through batch statistics, IConE maintains a global set of learnable auxiliary instance embeddings regularized by an explicit diversity objective. This transfers the anti-collapse mechanism from the transient batch to a dataset-level embedding space, allowing stable training even when batch statistics are unreliable, down to batch size 1. Across diverse 2D and 3D biomedical modalities, IConE outperforms strong contrastive and non-contrastive baselines throughout the small-batch regime (from B=1 to B=64) and demonstrates marked robustness to severe class imbalance. Geometric analysis shows that IConE preserves high intrinsic dimensionality in the learned representations, preventing the collapse observed in existing JEAs as batch sizes shrink.

Comment: Batch-independent collapse prevention for self-supervised representation learning via dataset-level auxiliary embeddings.

Relevance: 9 Novelty: 8

10. Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

ArXiv ID: 2603.15553

Authors: Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, Graham W. Taylor

Abstract: The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.

Comment: Self-supervised representation learning through hidden-layer self-distillation instead of only final-layer targets.

Relevance: 9 Novelty: 8

11. Grokking as a Variance-Limited Phase Transition: Spectral Gating and the Epsilon-Stability Threshold

ArXiv ID: 2603.15492

Authors: Pratyush Acharya, Habish Dhakal

Abstract: Standard optimization theories struggle to explain grokking, where generalization occurs long after training convergence. While geometric studies attribute this to slow drift, they often overlook the interaction between the optimizer's noise structure and landscape curvature. This work analyzes AdamW dynamics on modular arithmetic tasks, revealing a ``Spectral Gating'' mechanism that regulates the transition from memorization to generalization. We find that AdamW operates as a variance-gated stochastic system. Grokking is constrained by a stability condition: the generalizing solution resides in a sharp basin ($λ_{max}^H$) initially inaccessible under low-variance regimes. The ``delayed'' phase represents the accumulation of gradient variance required to lift the effective stability ceiling, permitting entry into this sharp manifold. Our ablation studies identify three complexity regimes: (1) \textbf{Capacity Collapse} ($P < 23$), where rank-deficiency prevents structural learning; (2) \textbf{The Variance-Limited Regime} ($P \approx 41$), where generalization waits for the spectral gate to open; and (3) \textbf{Stability Override} ($P > 67$), where memorization becomes dimensionally unstable. Furthermore, we challenge the "Flat Minima" hypothesis for algorithmic tasks, showing that isotropic noise injection fails to induce grokking. Generalization requires the \textit{anisotropic rectification} unique to adaptive optimizers, which directs noise into the tangent space of the solution manifold.

Comment: Training-dynamics theory of grokking as a variance-limited phase transition governed by optimizer-induced spectral gating.

Relevance: 9 Novelty: 8

12. FlashSampling: Fast and Memory-Efficient Exact Sampling

ArXiv ID: 2603.15854

Authors: Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang

Abstract: Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because $\argmax$ decomposes over a partition; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel-level decode workloads, and in end-to-end vLLM experiments, it reduces time per output token by up to $19\%$ on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth-bound postprocessing step into a lightweight epilogue. Project Page: https://github.com/FlashSampling/FlashSampling.

Comment: Presents an exact systems-level decoding primitive that fuses categorical sampling into the LM-head matmul to eliminate logits materialization and reduce memory traffic.

Relevance: 9 Novelty: 8
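
The tile-by-tile recipe in the abstract rests on the Gumbel-max trick plus the fact that an argmax over the whole vocabulary equals the argmax over per-tile winners. The sketch below checks that identity in plain Python for a single row. It is only a CPU illustration under stated assumptions: the function names, the `tile` parameter, and the list-based layout are hypothetical, and nothing here reflects the actual fused GPU kernel.

```python
import math
import random


def gumbel(rng):
    """Draw a standard Gumbel(0, 1) sample by inverse transform."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def dot(row, x):
    return sum(w * xi for w, xi in zip(row, x))


def sample_full(hidden, weight, noise):
    """Reference: materialize every logit, add noise, take the argmax."""
    noisy = [dot(row, hidden) + g for row, g in zip(weight, noise)]
    return max(range(len(noisy)), key=noisy.__getitem__)


def sample_tiled(hidden, weight, noise, tile):
    """Keep one (value, index) winner per vocabulary tile, then reduce."""
    winners = []
    for start in range(0, len(weight), tile):
        stop = min(start + tile, len(weight))
        # Only this tile's noisy logits exist at any one time.
        tile_vals = [(dot(weight[j], hidden) + noise[j], j) for j in range(start, stop)]
        winners.append(max(tile_vals, key=lambda t: t[0]))
    return max(winners, key=lambda t: t[0])[1]


# Toy problem: vocabulary of 37 tokens, hidden size 5 (arbitrary choices).
rng = random.Random(0)
hidden = [rng.uniform(-1, 1) for _ in range(5)]
weight = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(37)]
noise = [gumbel(rng) for _ in range(37)]
token_full = sample_full(hidden, weight, noise)
token_tiled = sample_tiled(hidden, weight, noise, tile=8)
```

Given the same noise, both functions return the same token for any tile size, which is the sense in which the tiled variant is exact rather than approximate.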

13. Directional Routing in Transformers

ArXiv ID: 2603.14923

Authors: Kevin Taylor

Abstract: We introduce directional routing, a lightweight mechanism that gives each transformer attention head learned suppression directions controlled by a shared router, at 3.9% parameter cost. We train a 433M-parameter model alongside an identical baseline in a single run, then trace the resulting circuits through mechanistic interpretability. Routing becomes the model's dominant computational pathway. Disabling it collapses factual recall to near-zero probability across all 8 test prompts and drops induction accuracy from 93.4% to 0.0%. Knocking out individual attention heads has negligible effect: the primary mover head's removal actually increases target probability, and induction heads retain 98.6% accuracy without their strongest member. The coordination mechanism is irreplaceable; the components it coordinates are not. The model also self-organizes, without explicit pressure, into two regimes: domain-adaptive routing in early layers and fixed syntactic pruning in late layers, where the least-varying layer is the most critical (+42.6 PPL when disabled). Routing reduces perplexity 31-56% relative to the baseline, though downstream multiple-choice benchmarks do not yet reflect these gains.

Comment: Proposes a lightweight transformer routing mechanism where attention heads use learned suppression directions controlled by a shared router, yielding a core architectural change analyzed mechanistically.

Relevance: 9 Novelty: 8
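
The abstract does not spell out how a suppression direction acts on a head's output. One plausible reading, stated here purely as an assumption, is that a router-produced gate in [0, 1] scales how much of the head output's component along a learned direction is removed. The sketch below shows that projection-style suppression; `suppress_direction` and its arguments are hypothetical names, not the paper's API.

```python
import math

def suppress_direction(head_out, direction, gate):
    """Remove a gated fraction of head_out's component along `direction`.

    Assumption: suppression is a scaled orthogonal projection; the paper may differ.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar component of the head output along the unit suppression direction.
    coeff = sum(h * u for h, u in zip(head_out, unit))
    return [h - gate * coeff * u for h, u in zip(head_out, unit)]

fully_suppressed = suppress_direction([3.0, 4.0], [1.0, 0.0], gate=1.0)
half_suppressed = suppress_direction([3.0, 4.0], [1.0, 0.0], gate=0.5)
untouched = suppress_direction([3.0, 4.0], [1.0, 0.0], gate=0.0)
```

With `gate=0.0` the head output passes through unchanged, which corresponds loosely to the "routing disabled" ablation described in the abstract.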

14. Determinism in the Undetermined: Deterministic Output in Charge-Conserving Continuous-Time Neuromorphic Systems with Temporal Stochasticity

ArXiv ID: 2603.15987

Authors: Jing Yan, Kang You, Zhezhi He, Yaoyu Zhang

Abstract: Achieving deterministic computation results in asynchronous neuromorphic systems remains a fundamental challenge due to the inherent temporal stochasticity of continuous-time hardware. To address this, we develop a unified continuous-time framework for spiking neural networks (SNNs) that couples the Law of Charge Conservation with minimal neuron-level constraints. This integration ensures that the terminal state depends solely on the aggregate input charge, providing a unique cumulated output invariant to temporal stochasticity. We prove that this mapping is strictly invariant to spike timing in acyclic networks, whereas recurrent connectivity can introduce temporal sensitivity. Furthermore, we establish an exact representational correspondence between these charge-conserving SNNs and quantized artificial neural networks, bridging the gap between static deep learning and event-driven dynamics without approximation errors. These results establish a rigorous theoretical basis for designing continuous-time neuromorphic systems that harness the efficiency of asynchronous processing while maintaining algorithmic determinism.

Comment: Provides a theoretical foundation for charge-conserving continuous-time SNNs, proving spike-timing-invariant computation and exact correspondence to quantized ANNs.

Relevance: 9 Novelty: 8

ArXiv ID: 2603.15051

Authors: Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta

Abstract: Token-level Chain-of-Thought (CoT) prompting has become a standard way to elicit multi-step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent-space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum-step budget. Our empirical evaluation across three mathematical word-problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed-step latent refinement while reducing average latent refinement steps by 48-60% under the same maximum-step budget. Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.

Comment: Introduces adaptive latent-space reasoning with dynamic halting, a core architectural efficiency idea for implicit reasoning in LLMs.

Relevance: 9 Novelty: 8
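
The adaptive halting idea in the AdaAnchor abstract (stop refining once the anchors stop changing, within a shared maximum-step budget) can be sketched with a toy refinement map. The stability test used here, an L2 change below a tolerance, and the `contraction` stand-in for the model's refinement step are assumptions; the abstract does not specify the actual criterion.

```python
import math

def refine_with_halting(anchor, refine, max_steps, tol):
    """Apply `refine` until the anchor moves less than `tol` or the budget is spent."""
    steps = 0
    for steps in range(1, max_steps + 1):
        new_anchor = refine(anchor)
        # Anchor stability: L2 distance between consecutive iterates.
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_anchor, anchor)))
        anchor = new_anchor
        if delta < tol:
            break
    return anchor, steps

def contraction(target, rate):
    # Toy refinement step: move a fraction `rate` of the way toward `target`.
    return lambda x: [xi + rate * (ti - xi) for xi, ti in zip(x, target)]

target = [1.0, 1.0]
# A fast-converging ("easy") instance and a slow-converging ("hard") one.
_, easy_steps = refine_with_halting([0.0, 0.0], contraction(target, 0.5), max_steps=32, tol=1e-3)
_, hard_steps = refine_with_halting([0.0, 0.0], contraction(target, 0.1), max_steps=32, tol=1e-3)
```

The easy instance halts well before the budget while the hard one uses more of it, which is the allocation behavior the abstract describes.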

16. Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps

ArXiv ID: 2603.14734

Authors: Pengcheng Cheng

Abstract: Learning solution operators of partial differential equations (PDEs) from data has emerged as a promising route to fast surrogate models in multi-query scientific workflows. However, for geometric PDEs whose inputs and outputs transform under changes of local frame (gauge), many existing operator-learning architectures remain representation-dependent, brittle under metric perturbations, and sensitive to discretization changes. We propose Gauge-Equivariant Intrinsic Neural Operators (GINO), a class of neural operators that parameterize elliptic solution maps primarily through intrinsic spectral multipliers acting on geometry-dependent spectra, coupled with gauge-equivariant nonlinearities. This design decouples geometry from learnable functional dependence and enforces consistency under frame transformations. We validate GINO on controlled problems on the flat torus ($\mathbb{T}^2$), where ground-truth resolvent operators and regularized Helmholtz--Hodge decompositions admit closed-form Fourier representations, enabling theory-aligned diagnostics. Across experiments E1--E6, GINO achieves low operator-approximation error, near machine-precision gauge equivariance, robustness to structured metric perturbations, strong cross-resolution generalization with small commutation error under restriction/prolongation, and structure-preserving performance on a regularized exact/coexact decomposition task. Ablations further link the smoothness of the learned spectral multiplier to stability under geometric perturbations. These results suggest that enforcing intrinsic structure and gauge equivariance yields operator surrogates that are more geometry-consistent and discretization-robust for elliptic PDEs on form-valued fields.

Comment: Presents gauge-equivariant intrinsic neural operators, a core operator-learning architecture with strong geometry-consistency guarantees.

Relevance: 9 Novelty: 8

17. Spiking Layer-Adaptive Magnitude-based Pruning

ArXiv ID: 2603.14946

Authors: Junqiao Wang, Zhehang Ye, Yuqi Ouyang

Abstract: Spiking Neural Networks (SNNs) provide energy-efficient computation but their deployment is constrained by dense connectivity and high spiking operation costs. Existing magnitude-based pruning strategies, when naively applied to SNNs, fail to account for temporal accumulation, non-uniform timestep contributions, and membrane stability, often leading to severe performance degradation. This paper proposes Spiking Layer-Adaptive Magnitude-based Pruning (SLAMP), a theory-guided pruning framework that generalizes layer-adaptive magnitude pruning to temporal SNNs by explicitly controlling worst-case output distortion across layers and timesteps. SLAMP formulates sparsity allocation as a temporal distortion-constrained optimization problem, yielding time-aware layer importance scores that reduce to conventional layer-adaptive pruning in the single-timestep limit. An efficient two-stage procedure is derived, combining temporal score estimation, global sparsity allocation, and magnitude pruning with retraining for stability recovery. Experiments on CIFAR10, CIFAR100, and the event-based CIFAR10-DVS datasets demonstrate that SLAMP achieves substantial connectivity and spiking operation reductions while preserving accuracy, enabling efficient and deployable SNN inference.

Comment: Introduces a theory-guided pruning framework for temporal SNNs with time-aware layer importance and distortion-constrained sparsity allocation.

Relevance: 9 Novelty: 8

18. Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks

ArXiv ID: 2603.14830

Authors: Yuri Kinoshita, Naoki Nishikawa, Taro Toyoizumi

Abstract: Dataset distillation, a training-aware data compression technique, has recently attracted increasing attention as an effective tool for mitigating costs of optimization and data storage. However, progress remains largely empirical. Mechanisms underlying the extraction of task-relevant information from the training process and the efficient encoding of such information into synthetic data points remain elusive. In this paper, we theoretically analyze practical algorithms of dataset distillation applied to the gradient-based training of two-layer neural networks with width $L$. By focusing on a non-linear task structure called multi-index model, we prove that the low-dimensional structure of the problem is efficiently encoded into the resulting distilled data. This dataset reproduces a model with high generalization ability for a required memory complexity of $\tilde{\Theta}(r^2 d + L)$, where $d$ and $r$ are the input and intrinsic dimensions of the task. To the best of our knowledge, this is one of the first theoretical works that include a specific task structure, leverage its intrinsic dimensionality to quantify the compression rate and study dataset distillation implemented solely via gradient-based algorithms.

Comment: Provides theory for dataset distillation showing efficient encoding of low-dimensional task structure under gradient-based training of neural networks.

Relevance: 9 Novelty: 8

19. MoLoRA: Composable Specialization via Per-Token Adapter Routing

ArXiv ID: 2603.15965

Authors: Shrey Shah, Justin Wagle

Abstract: Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like "write code to solve this equation," which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work $N$ for $N$ tokens versus $K \cdot N$ for per-sequence routing with $K$ adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.

Comment: Model architecture: per-token adapter routing with Mixture-of-LoRA enables composable specialization within a single sequence.

Relevance: 9 Novelty: 8
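
Per-token adapter routing, as described in the MoLoRA abstract, can be illustrated with a small pure-Python sketch: a gate scores each token, the top-scoring adapter is selected, and only that adapter's low-rank update B(Ax) is applied, so N tokens cost N adapter applications. The identity base layer, the linear gate, and the tiny rank-1 adapters below are assumptions chosen to keep the example readable; they are not taken from the paper.

```python
def matvec(matrix, vec):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def lora_delta(adapter, x):
    # Low-rank update B @ (A @ x) for an adapter stored as the pair (A, B).
    a, b = adapter
    return matvec(b, matvec(a, x))

def route_tokens(tokens, gate, adapters):
    """Apply exactly one adapter per token, chosen by the top gate score."""
    outputs, choices = [], []
    for x in tokens:
        scores = matvec(gate, x)
        k = max(range(len(scores)), key=scores.__getitem__)
        delta = lora_delta(adapters[k], x)
        # Assumption: the frozen base layer is the identity, so output = x + delta.
        outputs.append([xi + di for xi, di in zip(x, delta)])
        choices.append(k)
    return outputs, choices

# Two rank-1 adapters on 2-dimensional tokens (toy values).
adapters = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # adapter 0 amplifies coordinate 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # adapter 1 amplifies coordinate 1
]
gate = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical linear router
tokens = [[2.0, 1.0], [1.0, 3.0]]
outputs, choices = route_tokens(tokens, gate, adapters)
```

Here the two tokens in the same sequence are sent to different adapters, which is the case that per-sequence routing cannot express without running every adapter over the whole sequence.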

20. Massive Redundancy in Gradient Transport Enables Sparse Online Learning

ArXiv ID: 2603.15195

Authors: Aur Shalev Merin

Abstract: Real-time recurrent learning (RTRL) computes exact online gradients by propagating a Jacobian tensor forward through recurrent dynamics, but at O(n^4) cost per step. Prior work has sought structured approximations (rank-1 compression, graph-based sparsity, Kronecker factorization). We show that, in the continuous error signal regime, the recurrent Jacobian is massively redundant: propagating through a random 6% of paths (k=4 of n=64) recovers 84 +/- 6% of full RTRL's adaptation ability across five seeds, and the absolute count k=4 remains effective from n=64 to n=256 (6% to 1.6%, recovery 84 to 78%), meaning sparse RTRL becomes relatively cheaper as networks grow. In RNNs, the recovery is selection-invariant (even adversarial path selection works) and exhibits a step-function transition from zero to any nonzero propagation. Spectral analysis reveals the mechanism: the Jacobian is full-rank but near-isotropic (condition numbers 2.6-6.5), so any random subset provides a directionally representative gradient estimate. On chaotic dynamics (Lorenz attractor), sparse propagation is more numerically stable than full RTRL (CV 13% vs. 88%), as subsampling avoids amplifying pathological spectral modes. The redundancy extends to LSTMs (k=4 matches full RTRL) and to transformers via sparse gradient transport (50% head sparsity outperforms the dense reference; 33% is borderline), with higher thresholds reflecting head specialization rather than isotropy. On real primate neural data, sparse RTRL (k=4) adapts online to cross-session electrode drift (80 +/- 11% recovery, 5 seeds), where sparse propagation is again more stable than full RTRL. Without continuous error signal, Jacobian propagation accumulates numerical drift and degrades all RTRL variants, a scope condition for all forward-mode methods. Results hold with SGD (92 +/- 1% recovery), suggesting independence from optimizer choice.

Comment: Shows strong redundancy in online gradient transport and proposes sparse propagation schemes that retain most adaptation ability, a foundational efficiency result for recurrent and transformer training dynamics.

Relevance: 8 Novelty: 9

21. Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models

ArXiv ID: 2603.16001

Authors: Sijie Li, Biao Qian, Jungong Han

Abstract: Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.

Comment: Model compression: asymmetric text-visual pruning for LVLMs based on modality-specific sensitivity analysis and adaptive token calibration.

Relevance: 9 Novelty: 7

22. More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search

ArXiv ID: 2603.15377

Authors: Gal Dalal, Assaf Hallak, Gal Chechik, Yftah Ziser

Abstract: Wider beam search should improve LLM reasoning, but when should you stop widening? Prior work on beam width selection has focused on inference efficiency \citep{qin2025dsbd, freitag2017beam}, without analyzing whether wider search can hurt output quality. We present an analysis, grounded in Extreme Value Theory, that answers this question. Beam selection over noisy scorer outputs introduces a systematic overestimation bias that grows with the candidate pool size, and we derive a maximum useful beam width $\hat{k}$ beyond which search degrades performance. This critical width depends on the signal-to-noise ratio of the scorer: $\hat{k}$ grows exponentially with $(Δ/σ)^2$, where $Δ > 0$ is the quality advantage of correct paths over incorrect ones and $σ$ is the scorer noise. We validate this theory by comparing perplexity-guided and PRM-guided beam search across three 7B-parameter models and ten domains on MR-BEN (5,975 questions). Perplexity scoring, with its high noise, yields $\hat{k} = 1$: search provides no benefit at any width tested. PRM scoring, with lower noise, yields $\hat{k} \geq 4$, with gains of up to 8.9 percentage points. The same model, the same algorithm, but different scorers place $\hat{k}$ at opposite ends of the beam width range. Our analysis identifies the scorer's signal-to-noise ratio as the key quantity governing beam width selection, and we propose diagnostic indicators for choosing the beam width in practice.
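
The overestimation bias mentioned in the abstract can be seen in a small Monte Carlo experiment: if every candidate has the same true quality and the scorer adds zero-mean Gaussian noise, the winner's score still overstates its quality, and the gap grows with the number of candidates. This sketch illustrates only that selection effect; it does not reproduce the paper's derivation of $\hat{k}$. The function name, noise level, and trial count are assumptions.

```python
import random

def mean_winner_noise(k, sigma, trials, rng):
    """Average noise carried by the top-scoring candidate among k equally good ones."""
    total = 0.0
    for _ in range(trials):
        # All candidates have true quality 0, so the selected score is pure noise.
        total += max(rng.gauss(0.0, sigma) for _ in range(k))
    return total / trials

rng = random.Random(0)
bias_by_width = {k: mean_winner_noise(k, 1.0, 5000, rng) for k in (1, 4, 16)}
```

With a single candidate the average bias is close to zero, and it increases as the pool widens, so a noisier scorer (larger $σ$) needs a larger quality gap $Δ$ before widening the beam helps rather than hurts.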

Comment: Theoretical analysis of beam search overestimation bias with explicit critical-width scaling laws for LLM inference.