This is a remedial run for missed papers from 03/12/2026 to 03/12/2026.

Results generated on 03/21/2026.

Personalized Daily ArXiv Papers 2026-03-13

[gpt-5.4]	Prompt	Completion	Total
Token	153004	5639	158643
Cost	$0.38	$0.08	$0.47

Table of contents with paper titles:

Statistical and structural identifiability in representation learning Authors: Walter Nelson, Marco Fumero, Theofanis Karaletsos, Francesco Locatello
Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks Authors: Yuval Ran-Milo
Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing Authors: Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun
Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks Authors: Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Leszek Rutkowski
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan
AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization Authors: Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang, Linghe Kong, Guihai Chen, Dawei Yin
Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem Authors: Vugar Ismailov
HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers Authors: Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis
GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models Authors: Xiaoyun Liu, Divya Saxena, Jiannong Cao, Yuqing Zhao, Yiying Dong, Penghui Ruan
LongFlow: Efficient KV Cache Compression for Reasoning M Authors: Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang
Disentangled Representation Learning through Unsupervised Symmetry Group Discovery Authors: Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li
Chemical Reaction Networks Learn Better than Spiking Neural Networks Authors: Sophie Jaffard, Ivo F. Sbalzarini
Language Generation with Replay: A Learning-Theoretic View of Model Collapse Authors: Giorgio Racca, Michal Valko, Amartya Sanyal
Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights Authors: Yulu Gan, Phillip Isola
Sinkhorn-Drifting Generative Models Authors: Ping He, Om Khangaonkar, Hamed Pirsiavash, Yikun Bai, Soheil Kolouri
Pruning-induced phases in fully-connected neural networks: the eumentia, the dementia, and the amentia Authors: Haining Pan, Nakul Aggarwal, J. H. Pixley
Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE Authors: Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models Authors: Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich
Geometry-Aware Probabilistic Circuits via Voronoi Tessellations Authors: Sahil Sidheekh, Sriraam Natarajan
Truth as a Compression Artifact in Language Model Training Authors: Konstantin Krestnikov
On-Average Stability of Multipass Preconditioned SGD and Effective Dimension Authors: Simon Vary, Tyler Farghly, Ilja Kuzborskij, Patrick Rebeschini
Harnessing Data Asymmetry: Manifold Learning in the Finsler World Authors: Thomas Dagès, Simon Weber, Daniel Cremers, Ron Kimmel
KernelFoundry: Hardware-aware evolutionary GPU kernel optimization Authors: Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, Benjamin Ummenhofer
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models Authors: Jae-Won Chung, Jeff J. Ma, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury
AutoScout: Structured Optimization for Automating ML System Configuration Authors: Jimmy Shong, Yuhan Ding, Yihan Jiang, Liheng Jing, Haonan Chen, Gaokai Zhang, Aditya Akella, Fan Lai
Separable neural architectures as a primitive for unified predictive and generative intelligence Authors: Reza T. Batley, Apurba Sarker, Rajib Mostakim, Andrew Klichine, Sourav Saha
A Quantitative Characterization of Forgetting in Post-Training Authors: Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan
Diffusion Models Generalize but Not in the Way You Might Think Authors: Tim Kaiser, Markus Kollmann
Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control Authors: Ihor Kendiukhov
Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA Authors: Rickard Brännvall
Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference Authors: Valentyn Melnychuk, Vahid Balazadeh, Stefan Feuerriegel, Rahul G. Krishnan
Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors Authors: Zehua Zou, Yiran Ma, Yulong Zhang, Zhengnan Li, Zeyu Yang, Jinhao Xie, Xiaoyu Jiang, Zhichao Chen
HCP-DCNet: A Hierarchical Causal Primitive Dynamic Composition Network for Self-Improving Causal Understanding Authors: Ming Lei, Shufan Wu, Christophe Baehr
One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies Authors: Shaolong Li, Lichao Sun, Yongchao Chen
Revisiting Model Stitching In the Foundation Model Era Authors: Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo
Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness Authors: Arman Bolatov, Samuel Horváth, Martin Takáč, Eduard Gorbunov
A Stable Neural Statistical Dependence Estimator for Autoencoder Feature Analysis Authors: Bo Hu, Jose C Principe
Probing Length Generalization in Mamba via Image Reconstruction Authors: Jan Rathjens, Robin Schiewer, Laurenz Wiskott, Anand Subramoney
TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition Authors: Prabhu Vellaisamy, Shreesh Tripathi, Vignesh Natarajan, Surya Santhan Thenarasu, Shawn Blanton, John P. Shen
Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group Authors: Alan Garbarz
Context-dependent manifold learning: A neuromodulated constrained autoencoder approach Authors: Jérôme Adriaens, Guillaume Drion, Pierre Sacré
Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency Authors: Xinyan Jiang, Wenjing Yu, Di Wang, Lijie Hu
NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation Authors: Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen, Zhenyu Wang, Tianhao Liu, Siqi Chen, Weilin Huang
BiGain: Unified Token Compression for Joint Generation and Classification Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen
SpectralGuard: Detecting Memory Collapse Attacks in State Space Models Authors: Davi Bonetto
A Geometrically-Grounded Drive for MDL-Based Optimization in Deep Learning Authors: Ming Lei, Shufan Wu, Christophe Baehr
Efficient Reasoning with Balanced Thinking Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Event-Driven Video Generation Authors: Chika Maduabuchi
OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure Authors: Chuancheng Shi, Wenhua Wu, Fei Shen, Xiaogang Zhu, Kun Hu, Zhiyong Wang
Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder Authors: Alaa Yasser, Kittipat Phunjanna, Marcos Escudero Viñolo, Catarina Barata, Jenny Benois-Pineau

1. Statistical and structural identifiability in representation learning

ArXiv ID: 2603.11970

Authors: Walter Nelson, Marco Fumero, Theofanis Karaletsos, Francesco Locatello

Abstract: Representation learning models exhibit a surprising stability in their internal representations. Whereas most prior work treats this stability as a single property, we formalize it as two distinct concepts: statistical identifiability (consistency of representations across runs) and structural identifiability (alignment of representations with some unobserved ground truth). Recognizing that perfect pointwise identifiability is generally unrealistic for modern representation learning models, we propose new model-agnostic definitions of statistical and structural near-identifiability of representations up to some error tolerance $ε$. Leveraging these definitions, we prove a statistical $ε$-near-identifiability result for the representations of models with nonlinear decoders, generalizing existing identifiability theory beyond last-layer representations in e.g. generative pre-trained transformers (GPTs) to near-identifiability of the intermediate representations of a broad class of models including (masked) autoencoders (MAEs) and supervised learners. Although these weaker assumptions confer weaker identifiability, we show that independent components analysis (ICA) can resolve much of the remaining linear ambiguity for this class of models, and validate and measure our near-identifiability claims empirically. With additional assumptions on the data-generating process, statistical identifiability extends to structural identifiability, yielding a simple and practical recipe for disentanglement: ICA post-processing of latent representations. On synthetic benchmarks, this approach achieves state-of-the-art disentanglement using a vanilla autoencoder. With a foundation model-scale MAE for cell microscopy, it disentangles biological variation from technical batch effects, substantially improving downstream generalization.

Comment: Representation learning theory: formalizes statistical vs structural identifiability and proves near-identifiability beyond last-layer representations.

Relevance: 10 Novelty: 9

2. Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

ArXiv ID: 2603.11487

Authors: Yuval Ran-Milo

Abstract: Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. Are sinks a byproduct of the optimization/training regime? Or are they sometimes functionally necessary in softmax Transformers? We prove that, in some settings, it is the latter: computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025; Guo et al., 2024). We also prove that non-normalized ReLU attention can solve the same task without any sink, confirming that the normalization constraint is the fundamental driver of sink behavior. Experiments validate our predictions and demonstrate they extend beyond the theoretically analyzed setting: softmax models develop strong sinks while ReLU attention eliminates them in both single-head and multi-head variants.
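
Illustrative sketch (not from the paper): the intuition in the abstract can be checked on a toy example. Softmax weights are positive and sum to 1, so the output is a convex combination of the token values; a zero "default" output then needs mass on a position whose value is zero. The scalar values, the scores, and the choice of position 0 as the sink are all assumptions made for this example.

```python
import math

def softmax_weights(scores):
    """Normalized attention: weights are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """Non-normalized ReLU attention: weights can all be zero."""
    return [max(0.0, s) for s in scores]

def attend(weights, values):
    """Weighted sum of scalar token values."""
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical toy sequence: position 0 is a sink token (e.g. BOS) with value 0.
values = [0.0, 2.0, 4.0, 6.0]

# No trigger is present, so the target output is 0.
# Softmax without a sink: any weights over the content tokens give an
# output between 2.0 and 6.0, never 0. Uniform scores give their mean.
content_only = attend(softmax_weights([0.0, 0.0, 0.0]), values[1:])

# Softmax with a sink: concentrating mass on position 0 yields roughly 0.
sink_weights = softmax_weights([20.0, 0.0, 0.0, 0.0])
softmax_default = attend(sink_weights, values)

# ReLU attention: non-positive scores give all-zero weights, so the output
# is exactly 0 with no sink position involved.
relu_default = attend(relu_weights([-1.0, -1.0, -1.0]), values[1:])

print(content_only, sink_weights[0], softmax_default, relu_default)
```

The softmax variant only approaches zero by putting nearly all of its mass on the zero-valued position, while the ReLU variant reaches zero with no such position.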

Comment: Provides a proof that attention sinks are functionally necessary in softmax Transformers for trigger-conditional computation.

Relevance: 10 Novelty: 9

3. Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

ArXiv ID: 2603.11535

Authors: Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun

Abstract: Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6$\times$ fewer tokens.
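
Illustrative sketch (not the authors' implementation): a minimal version of threshold routing, assuming each expert tracks an EMA of a per-batch score quantile. The quantile estimator, the `target_load` and `momentum` parameters, and the Gaussian router scores are assumptions; the abstract only states that the threshold is an EMA estimated from the global token distribution.

```python
import random

class ThresholdExpert:
    """One expert with an EMA acceptance threshold (illustrative sketch)."""

    def __init__(self, target_load, momentum=0.9):
        self.target_load = target_load  # hypothetical: desired fraction of tokens
        self.momentum = momentum        # hypothetical EMA coefficient
        self.threshold = 0.0

    def update(self, batch_scores):
        # Assumption: the per-batch estimate is the (1 - target_load) quantile
        # of this expert's router scores; the paper's estimator may differ.
        ordered = sorted(batch_scores)
        idx = min(len(ordered) - 1, int((1.0 - self.target_load) * len(ordered)))
        self.threshold = (self.momentum * self.threshold
                          + (1.0 - self.momentum) * ordered[idx])

    def accepts(self, score):
        return score > self.threshold

def route(token_scores, experts):
    """Route one token using only its own scores (no batch dependence)."""
    return [i for i, (s, e) in enumerate(zip(token_scores, experts)) if e.accepts(s)]

rng = random.Random(0)
experts = [ThresholdExpert(target_load=0.25) for _ in range(4)]

# Training phase: update thresholds from batches of router scores.
for _ in range(200):
    batch = [[rng.gauss(0.0, 1.0) for _ in experts] for _ in range(256)]
    for i, expert in enumerate(experts):
        expert.update([scores[i] for scores in batch])

# Inference phase: thresholds are frozen; each token is routed on its own.
tokens = [[rng.gauss(0.0, 1.0) for _ in experts] for _ in range(5000)]
assignments = [route(t, experts) for t in tokens]
loads = [sum(i in a for a in assignments) / len(tokens) for i in range(len(experts))]
experts_per_token = sorted({len(a) for a in assignments})
print(loads, experts_per_token)
```

Each token can activate a different number of experts, which is the dynamic computation allocation the abstract refers to, while per-expert load stays near the target without an auxiliary loss.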

Comment: Model architecture innovation: threshold-based MoE routing gives causal dynamic computation allocation with load balancing without auxiliary losses.

Relevance: 10 Novelty: 8

4. Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

ArXiv ID: 2603.12354

Authors: Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Leszek Rutkowski

Abstract: Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network's structural "kinetic utility". First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92$\times$) without sacrificing the full-model accuracy.

Comment: Model compression and dynamic networks: unified utility metric for structural pruning and routing based on alternating gradient flow.

Relevance: 9 Novelty: 8

5. Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

ArXiv ID: 2603.12038

Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan

Abstract: Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.
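
Illustrative sketch (not the authors' implementation): the slow/fast alternation can be written as a small scheduling loop. Several pieces are assumptions: punctuation stands in for the semantic-boundary detector, plain top-k stands in for the Selector, fast steps also attend to tokens generated since the last refresh, and the token list is fixed in advance instead of being decoded.

```python
def is_boundary(token):
    # Assumption: sentence-final punctuation marks a semantic boundary.
    return token in {".", "?", "!"}

def select_topk(scores, k):
    # Stand-in for the Selector: keep the k highest-scoring positions.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def slow_fast_schedule(tokens, score_fn, k):
    memory = []        # sparse memory chosen at the last slow step
    last_refresh = 0   # first position generated after the last slow step
    log = []
    for step in range(len(tokens)):
        history = list(range(step))
        if step == 0 or is_boundary(tokens[step - 1]):
            # Slow step: dense attention over the full history, then refresh.
            scores = [score_fn(step, j) for j in history]
            memory = select_topk(scores, k)
            last_refresh = step
            attended = history
            kind = "slow"
        else:
            # Fast step: sparse memory plus tokens generated since the refresh.
            attended = sorted(set(memory) | set(range(last_refresh, step)))
            kind = "fast"
        log.append((kind, len(attended)))
    return log

tokens = "The cat sat . It ran far away . Done".split()

def recency_score(step, j):
    # Hypothetical attention score: closer positions score higher.
    return -abs(step - j)

log = slow_fast_schedule(tokens, recency_score, k=2)
for token, (kind, attended) in zip(tokens, log):
    print(f"{token:>5}  {kind}  attends to {attended} positions")
```

In this toy run the fast steps after the first sentence attend to fewer positions than dense decoding would, and the gap grows with context length.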

Comment: Inference efficiency for transformers: training-free decoding acceleration using stable within-sentence attention support and sparse memory refresh.

Relevance: 9 Novelty: 8

6. AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

ArXiv ID: 2603.11873

Authors: Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang, Linghe Kong, Guihai Chen, Dawei Yin

Abstract: The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.
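
Illustrative sketch (not the authors' CUDA kernel): the "decide-once, apply-everywhere" idea reduces to choosing adapters once per token and folding the selected low-rank updates into the backbone weight, W' = W + sum of B_i A_i over the selected adapters. The top-k gate, the tiny integer matrices, and the function names are assumptions for this example; plain Python lists stand in for GPU tensors.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def pre_gate(gate_scores, top_k):
    # One routing decision per token, made before any adapter layer runs.
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:top_k])

def merge_adapters(weight, adapters, selected):
    # Fold every selected LoRA update B @ A into a copy of the backbone weight.
    merged = [row[:] for row in weight]
    for idx in selected:
        b, a = adapters[idx]
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, d in enumerate(row):
                merged[i][j] += d
    return merged

weight = [[1, 0], [0, 1]]                 # backbone weight of one layer
adapters = [                              # rank-1 (B, A) pairs, hypothetical values
    ([[1], [0]], [[0, 1]]),
    ([[0], [1]], [[1, 0]]),
    ([[1], [1]], [[1, 1]]),
]
selected = pre_gate([0.1, 0.7, 0.2], top_k=2)
merged = merge_adapters(weight, adapters, selected)

# Check: the merged weight matches applying each adapter separately.
x = [3, 5]
unfused = matvec(weight, x)
for idx in selected:
    b, a = adapters[idx]
    update = matvec(b, matvec(a, x))
    unfused = [u + d for u, d in zip(unfused, update)]
fused = matvec(merged, x)
print(selected, merged, fused, unfused)
```

Because the selection is fixed before the token is processed, the same `selected` list could be reused for every adapter layer, which is what makes a single fused pass possible.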

Comment: Systems co-design for dynamic sparse models: token-level pre-gating and fused kernels to make dynamic LoRA/MoE-style adapter inference efficient.

Relevance: 9 Novelty: 8

7. Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem

ArXiv ID: 2603.11972

Authors: Vugar Ismailov

Abstract: Deep Operator Networks (DeepONets) provide a branch-trunk neural architecture for approximating nonlinear operators acting between function spaces. In the classical operator approximation framework, the input is a function $u\in C(K_1)$ defined on a compact set $K_1$ (typically a compact subset of a Banach space), and the operator maps $u$ to an output function $G(u)\in C(K_2)$ defined on a compact Euclidean domain $K_2\subset\mathbb{R}^d$. In this paper, we develop a topological extension in which the operator input lies in an arbitrary Hausdorff locally convex space $X$. We construct topological feedforward neural networks on $X$ using continuous linear functionals from the dual space $X^*$ and introduce topological DeepONets whose branch component acts on $X$ through such linear measurements, while the trunk component acts on the Euclidean output domain. Our main theorem shows that continuous operators $G:V\to C(K;\mathbb{R}^m)$, where $V\subset X$ and $K\subset\mathbb{R}^d$ are compact, can be uniformly approximated by such topological DeepONets. This extends the classical Chen-Chen operator approximation theorem from spaces of continuous functions to locally convex spaces and yields a branch-trunk approximation theorem beyond the Banach-space setting.
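
Illustrative sketch (notation not from the paper): by analogy with the classical Chen-Chen branch-trunk form, where the branch network sees point evaluations $u(x_j)$, one plausible reading of the topological version replaces those evaluations with continuous linear functionals $f_j \in X^*$. The symbols $p$, $n$, $q$, $c_i^k$, $\xi_{ij}^k$, $\theta_i^k$, $w_k$, $\zeta_k$ and the restriction to scalar outputs are assumptions; the paper's exact construction may differ.

```latex
% Scalar-output case (m = 1); all symbols below are illustrative notation.
G(u)(y) \;\approx\; \sum_{k=1}^{p} b_k(u)\, t_k(y),
\qquad
b_k(u) = \sum_{i=1}^{n} c_i^{k}\,
  \sigma\!\Big( \sum_{j=1}^{q} \xi_{ij}^{k}\, f_j(u) + \theta_i^{k} \Big),
\qquad
t_k(y) = \sigma\big( w_k \cdot y + \zeta_k \big),
% with f_1, \dots, f_q \in X^* continuous linear functionals, u \in V, y \in K.
```

Choosing $f_j(u) = u(x_j)$ for sample points $x_j$ gives back the classical form with point evaluations.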

Comment: Theory of neural operators: extends DeepONet universal approximation from Banach-function settings to general locally convex spaces.

Relevance: 9 Novelty: 8

8. HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

ArXiv ID: 2603.12222

Authors: Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis

Abstract: Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
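
Illustrative sketch (not the authors' implementation): a Gumbel-Sigmoid gate adds logistic noise to a learnable logit and squashes the result with a temperature-scaled sigmoid. The two-level combination below, where a head-level macro gate multiplies per-dimension micro gates, is an assumption about how the granularities interact; the logits and temperature are made-up values.

```python
import math
import random

def gumbel_sigmoid(logit, temperature, rng):
    # Logistic noise is the difference of two Gumbel samples.
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / temperature))

def sample_gates(head_logits, dim_logits, temperature, rng):
    # Assumption: the effective gate of a dimension is macro gate * micro gate.
    gates = []
    for head_logit, dims in zip(head_logits, dim_logits):
        macro = gumbel_sigmoid(head_logit, temperature, rng)
        gates.append([macro * gumbel_sigmoid(d, temperature, rng) for d in dims])
    return gates

def expected_keep_fraction(gates):
    # Soft proxy for retained compute, usable as a budget penalty term.
    flat = [g for head in gates for g in head]
    return sum(flat) / len(flat)

rng = random.Random(0)
head_logits = [6.0, -6.0]                           # head 0 kept, head 1 pruned
dim_logits = [[6.0, 6.0, -6.0], [6.0, 6.0, 6.0]]    # one micro-pruned dim in head 0

samples = [sample_gates(head_logits, dim_logits, 0.5, rng) for _ in range(500)]
mean_kept = sum(s[0][0] for s in samples) / len(samples)
mean_pruned_head = sum(s[1][0] for s in samples) / len(samples)
mean_fraction = sum(expected_keep_fraction(s) for s in samples) / len(samples)
print(mean_kept, mean_pruned_head, mean_fraction)
```

A strongly negative macro logit suppresses every dimension of that head regardless of its micro logits, which is how a single gate can remove a whole head.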

Comment: Compression methodology: end-to-end multi-granular stochastic auto-pruning for ViTs across heads, FFNs, and intra-block dimensions.

Relevance: 9 Novelty: 8

9. GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models

ArXiv ID: 2603.13418

Authors: Xiaoyun Liu, Divya Saxena, Jiannong Cao, Yuqing Zhao, Yiying Dong, Penghui Ruan

Abstract: Structured pruning is widely used to compress large language models (LLMs), yet its effectiveness depends heavily on neuron importance estimation. Most existing methods estimate neuron importance from activation statistics on a single calibration dataset, which introduces calibration bias and degrades downstream cross-task generalization. We observe that neurons exhibit heterogeneous distribution sensitivity, with distribution-robust neurons maintaining consistent rankings across datasets and distribution-sensitive neurons showing high cross-dataset ranking variance. Based on this, we identify two structural limitations in existing methods. First, ranking all neurons within a shared space causes distribution-sensitive neurons that strongly activate on calibration inputs to dominate, crowding out distribution-robust neurons critical for out-of-distribution tasks. Second, applying activation-based importance metrics uniformly can be unreliable. Distribution-sensitive neurons that infrequently activate on calibration data receive insufficient activation signal for accurate local ranking. To address these limitations, we propose GPrune-LLM, a generalization-aware structured pruning framework that explicitly accounts for neuron differences in cross-distribution behavior. We first partition neurons into behavior-consistent modules to localize ranking competition, then evaluate activation-based metric reliability per module according to distribution sensitivity and score magnitude. For modules where activation-based scoring is unreliable, we switch to an activation-independent metric. Finally, we adaptively learn module-wise sparsity. Extensive experiments across multiple downstream tasks demonstrate GPrune-LLM's consistent improvements in post-compression generalization, particularly at high sparsity, and reduced dependence on importance metric choice.
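
Illustrative sketch (not the authors' implementation): the cross-dataset ranking variance mentioned in the abstract can be computed by ranking neurons on each calibration set and taking the variance of each neuron's rank. The activation scores, the 0.5 cutoff, and the two-way robust/sensitive split are assumptions for this example.

```python
from statistics import pvariance

def ranks(scores):
    """Rank 0 is the most important neuron on this dataset."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def rank_variance(scores_by_dataset):
    """Variance of each neuron's rank across calibration datasets."""
    per_dataset = [ranks(scores) for scores in scores_by_dataset]
    num_neurons = len(scores_by_dataset[0])
    return [pvariance([r[i] for r in per_dataset]) for i in range(num_neurons)]

# Hypothetical importance scores for 4 neurons on 3 calibration datasets.
scores_by_dataset = [
    [0.9, 0.10, 0.5, 0.3],
    [0.8, 0.70, 0.4, 0.1],
    [1.0, 0.05, 0.6, 0.2],
]
variances = rank_variance(scores_by_dataset)

threshold = 0.5  # hypothetical cutoff between robust and sensitive neurons
sensitive = [i for i, v in enumerate(variances) if v > threshold]
robust = [i for i, v in enumerate(variances) if v <= threshold]
print(variances, sensitive, robust)
```

Neuron 0 ranks first on every dataset and has zero variance, while neuron 1 swings between ranks and lands in the sensitive group.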

Comment: Compression methodology: structured LLM pruning guided by cross-distribution neuron sensitivity to improve post-pruning generalization.

Relevance: 9 Novelty: 8

10. LongFlow: Efficient KV Cache Compression for Reasoning M

ArXiv ID: 2603.11504

Authors: Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8 times throughput improvement with 80% KV cache compression with minimal impact on model accuracy.
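
Illustrative sketch (not the authors' kernel): one way to reuse an intermediate attention result for eviction is to treat the current query's softmax weights as the importance score and keep only the highest-weight cache entries. Using the softmax weights as the metric, the `budget` parameter, and the toy keys and values are assumptions; the paper derives its own metric and fuses these steps into one operator.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_and_evict(query, keys, values, budget):
    """Single-query attention that also returns a pruned KV cache."""
    scores = [dot(query, k) / math.sqrt(len(query)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # intermediate result, reused below
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    # Keep the `budget` entries with the largest weight for the current query.
    ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    keep = sorted(ranked[:budget])
    return output, [keys[i] for i in keep], [values[i] for i in keep], keep

query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.5, 0.5]]

output, new_keys, new_values, keep = attend_and_evict(query, keys, values, budget=2)
print(output, keep)
```

No extra pass over the cache is needed here: the weights used for eviction are the same ones the attention output already required.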

Comment: Inference efficiency: KV-cache compression for long-output reasoning models with negligible-overhead importance estimation and fused custom kernel.

Relevance: 9 Novelty: 8

11. Disentangled Representation Learning through Unsupervised Symmetry Group Discovery

ArXiv ID: 2603.11790

Authors: Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri

Abstract: Symmetry-based disentangled representation learning leverages the group structure of environment transformations to uncover the latent factors of variation. Prior approaches to symmetry-based disentanglement have required strong prior knowledge of the symmetry group's structure, or restrictive assumptions about the subgroup properties. In this work, we remove these constraints by proposing a method whereby an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment. We prove the identifiability of the true symmetry group decomposition under minimal assumptions, and derive two algorithms: one for discovering the group decomposition from interaction data, and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without assuming specific subgroup properties. Our method is validated on three environments exhibiting different group decompositions, where it outperforms existing LSBD approaches.

Comment: Representation learning theory: unsupervised symmetry group discovery with identifiability guarantees for symmetry-based disentanglement.

Relevance: 9 Novelty: 8

12. IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

ArXiv ID: 2603.12201

Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li

Abstract: Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
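
Illustrative sketch (not the authors' implementation): the Full/Shared partition amounts to a lookup from each layer to the Full layer whose top-k indices it reuses. Reading "nearest" as the nearest preceding Full layer, requiring layer 0 to be Full, and the `toy_indexer` function are assumptions for this example.

```python
def assign_index_sources(num_layers, full_layers):
    """Map each layer to the Full layer whose top-k indices it uses."""
    full = set(full_layers)
    if 0 not in full:
        # Assumption: the first layer must run its own indexer.
        raise ValueError("layer 0 must be a Full layer")
    sources, current = [], 0
    for layer in range(num_layers):
        if layer in full:
            current = layer
        sources.append(current)
    return sources

def run_indexers(num_layers, full_layers, indexer):
    """Run the indexer on Full layers only and share results elsewhere."""
    sources = assign_index_sources(num_layers, full_layers)
    cache, per_layer, calls = {}, [], 0
    for layer, src in enumerate(sources):
        if src == layer:
            cache[src] = indexer(layer)
            calls += 1
        per_layer.append(cache[src])
    return per_layer, calls

def toy_indexer(layer):
    # Hypothetical indexer: returns made-up top-k token indices for a layer.
    return [layer, layer + 1]

num_layers = 8
sources = assign_index_sources(num_layers, full_layers=[0, 4])
per_layer, calls = run_indexers(num_layers, [0, 4], toy_indexer)
removed_fraction = 1.0 - calls / num_layers
print(sources, calls, removed_fraction)
```

With 2 Full layers out of 8, 75% of indexer calls are skipped, which matches the scale of the reduction quoted in the abstract.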

Comment: Attention efficiency: cross-layer reuse of sparse attention top-k indices cuts indexer cost with training-free and training-aware configurations.

Relevance: 9 Novelty: 8

13. Chemical Reaction Networks Learn Better than Spiking Neural Networks

ArXiv ID: 2603.12060

Authors: Sophie Jaffard, Ivo F. Sbalzarini

Abstract: We mathematically prove that chemical reaction networks without hidden layers can solve tasks for which spiking neural networks require hidden layers. Our proof uses the deterministic mass-action kinetics formulation of chemical reaction networks. Specifically, we prove that a certain reaction network without hidden layers can learn a classification task previously proved to be achievable by a spiking neural network with hidden layers. We provide analytical regret bounds for the global behavior of the network and analyze its asymptotic behavior and Vapnik-Chervonenkis dimension. In a numerical experiment, we confirm the learning capacity of the proposed chemical reaction network for classifying handwritten digits in pixel images, and we show that it solves the task more accurately and efficiently than a spiking neural network with hidden layers. This provides a motivation for machine learning in chemical computers and a mathematical explanation for how biological cells might exhibit more efficient learning behavior within biochemical reaction networks than neuronal networks.

Comment: Theoretical architecture result proving stronger expressivity of chemical reaction networks than spiking neural networks, with regret and VC-dimension analysis.

Relevance: 8 Novelty: 9

14. Language Generation with Replay: A Learning-Theoretic View of Model Collapse

ArXiv ID: 2603.11784

Authors: Giorgio Racca, Michal Valko, Amartya Sanyal

Abstract: As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web; together, these trends raise the likelihood of generated text re-entering future training corpora, increasing the associated risk of performance degradation often called model collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or, in some cases, blissful ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective: we study it through the theoretical lens of the language generation in the limit framework, introducing a replay adversary that augments the example stream with the generator's own past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation: while replay is benign for the strongest notion of uniform generation, it provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering, while our separations show when these ideas can fail.

Comment: Learning theory for representation/data dynamics: formal characterization of model collapse under replayed self-generated text.

Relevance: 8 Novelty: 9

15. Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

ArXiv ID: 2603.12228

Authors: Yulu Gan, Phillip Isola

Abstract: Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task-experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples $N$ parameter perturbations at random, selects the top $K$, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and ES for contemporary large-scale models.

Comment: Representation/post-training insight: shows large pretrained models contain dense nearby task experts, enabling parallel random perturbation selection and ensembling.

Relevance: 8 Novelty: 9
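
The sample-select-vote procedure described in the abstract can be sketched as below. This is a minimal illustration on a generic parameter vector, not the authors' implementation: the function names (`majority_vote`, `perturb_select_vote`), the isotropic Gaussian perturbation with scale `sigma`, and the user-supplied `predict` and `score` callables are all assumptions made for the example.

```python
import numpy as np


def majority_vote(votes, n_classes):
    """votes: (K, n_examples) integer labels. Returns the most frequent label per example."""
    n_examples = votes.shape[1]
    counts = np.zeros((n_classes, n_examples), dtype=int)
    cols = np.arange(n_examples)
    for row in votes:
        counts[row, cols] += 1  # one vote per ensemble member per example
    return counts.argmax(axis=0)  # ties resolve to the smallest label


def perturb_select_vote(theta, predict, score, n_classes,
                        n_samples=32, top_k=5, sigma=0.05, seed=0):
    """Sample N random perturbations of theta, keep the top K by score,
    and return a majority-vote predictor over the kept candidates.

    predict(params, x) -> integer labels; score(params) -> float (higher is better).
    Both are hypothetical callables supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    # Assumption: isotropic Gaussian noise around the pretrained weights.
    candidates = [theta + sigma * rng.standard_normal(theta.shape)
                  for _ in range(n_samples)]
    scores = np.array([score(c) for c in candidates])
    order = np.argsort(scores)[::-1][:top_k]  # indices of the K best candidates
    top = [candidates[i] for i in order]

    def ensemble_predict(x):
        votes = np.stack([predict(c, x) for c in top])
        return majority_vote(votes, n_classes)

    return ensemble_predict, top, scores[order]
```

Each candidate is scored independently, so the loop over `candidates` is the part that would run in parallel in a larger setting.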

16. Sinkhorn-Drifting Generative Models

ArXiv ID: 2603.12366

Authors: Ping He, Om Khangaonkar, Hamed Pirsiavash, Yikun Bai, Soheil Kolouri

Abstract: We establish a theoretical link between the recently proposed "drifting" generative dynamics and gradient flows induced by the Sinkhorn divergence. In a particle discretization, the drift field admits a cross-minus-self decomposition: an attractive term toward the target distribution and a repulsive/self-correction term toward the current model, both expressed via one-sided normalized Gibbs kernels. We show that Sinkhorn divergence yields an analogous cross-minus-self structure, but with each term defined by entropic optimal-transport couplings obtained through two-sided Sinkhorn scaling (i.e., enforcing both marginals). This provides a precise sense in which drifting acts as a surrogate for a Sinkhorn-divergence gradient flow, interpolating between one-sided normalization and full two-sided Sinkhorn scaling. Crucially, this connection resolves an identifiability gap in prior drifting formulations: leveraging the definiteness of the Sinkhorn divergence, we show that zero drift (equilibrium of the dynamics) implies that the model and target measures match. Experiments show that Sinkhorn drifting reduces sensitivity to kernel temperature and improves one-step generative quality, trading off additional training time for a more stable optimization, without altering the inference procedure used by drift methods. These theoretical gains translate to strong low-temperature improvements in practice: on FFHQ-ALAE at the lowest temperature setting we evaluate, Sinkhorn drifting reduces mean FID from 187.7 to 37.1 and mean latent EMD from 453.3 to 144.4, while on MNIST it preserves full class coverage across the temperature sweep. Project page: https://mint-vu.github.io/SinkhornDrifting/

Comment: Generative modeling theory: links drifting dynamics to Sinkhorn-divergence gradient flows and resolves equilibrium identifiability.

Relevance: 8 Novelty: 9
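
The contrast the abstract draws between one-sided normalization of a Gibbs kernel and two-sided Sinkhorn scaling can be illustrated with a small sketch. It shows only the two normalizations, not the paper's drift field or training loop. The squared-Euclidean cost, uniform marginals, the temperature name `eps`, and all function names are assumptions made for the example.

```python
import numpy as np


def gibbs_kernel(x, y, eps):
    """K_ij = exp(-||x_i - y_j||^2 / eps); squared-Euclidean cost is an assumption."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-cost / eps)


def one_sided_plan(K):
    """Row-normalize K so each source particle carries mass 1/n.
    Only the row marginal is enforced; column sums are left unconstrained."""
    n = K.shape[0]
    return K / K.sum(axis=1, keepdims=True) / n


def sinkhorn_plan(K, n_iters=500):
    """Two-sided Sinkhorn scaling toward uniform row and column marginals."""
    n, m = K.shape
    a = np.full(n, 1.0 / n)  # target row marginal
    b = np.full(m, 1.0 / m)  # target column marginal
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)    # rescale rows to match a
        v = b / (K.T @ u)  # rescale columns to match b
    return u[:, None] * K * v[None, :]
```

Comparing `one_sided_plan(K).sum(axis=0)` with `sinkhorn_plan(K).sum(axis=0)` on the same kernel shows which marginal each normalization leaves free.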

17. Pruning-induced phases in fully-connected neural networks: the eumentia, the dementia, and the amentia

ArXiv ID: 2603.12316

Authors: Haining Pan, Nakul Aggarwal, J. H. Pixley

Abstract: Modern neural networks are heavily overparameterized, and pruning, which removes redundant neurons or connections, has emerged as a key approach to compressing them without sacrificing performance. However, while practical pruning methods are well developed, whether pruning induces sharp phase transitions in the neural networks and, if so, to what universality class they belong, remain open questions. To address this, we study fully-connected neural networks trained on MNIST, independently varying the dropout (i.e., removing neurons) rate at both the training and evaluation stages to map the phase diagram. We identify three distinct phases: eumentia (the network learns), dementia (the network has forgotten), and amentia (the network cannot learn), sharply distinguished by the power-law scaling of the cross-entropy loss with the training dataset size. In the eumentia phase, the algebraic decay of the loss, as documented in the machine learning literature as neural scaling laws, is from the perspective of statistical mechanics the hallmark of quasi-long-range order. We demonstrate that the transition between the eumentia and dementia phases is accompanied by scale invariance, with a diverging length scale that exhibits hallmarks of a Berezinskii-Kosterlitz-Thouless-like transition; the phase structure is robust across different network widths and depths. Our results establish that dropout-induced pruning provides a concrete setting in which neural network behavior can be understood through the lens of statistical mechanics.

Comment: Theory of compression dynamics: identifies pruning-induced phase transitions in fully connected networks with statistical-mechanics analysis.

Relevance: 8 Novelty: 9

18. Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE

ArXiv ID: 2603.11611

Authors: Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander

Abstract: Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which become especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache, while achieving comparable final loss. In this work, we present a systematic study examining the impact of partial RoPE on training dynamics and convergence across architectures and datasets. Our findings uncover several notable patterns: (1) applying RoPE to only a small fraction of dimensions (around 10%) achieves convergence comparable to using full RoPE; (2) these trends hold consistently across model sizes, sequence lengths, architectures, and datasets of varying quality, with higher-quality data resulting in lower overall loss and similar benchmark performance; and (3) some models trained with NoPE (No Positional Encoding) showcase unstable learning trajectories, which can be alleviated through minimal RoPE application or through QK-Norm, although the latter converges to a higher loss. Together, these results offer practical guidance for model designers aiming to balance efficiency and training stability, while emphasizing the previously overlooked importance of partial RoPE.

Comment: Transformer efficiency: analyzes partial RoPE as a core positional-encoding design that preserves convergence while greatly reducing cache memory.

Relevance: 9 Novelty: 7
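
A minimal sketch of partial RoPE, in which only a fraction of each head's dimensions is rotated and the rest pass through unchanged, might look as follows. The function name `partial_rope`, the choice to rotate the leading dimensions, the half-split pairing of rotated dimensions, and the frequency base of 10000 are assumptions made for the example. The abstract does not specify these details.

```python
import numpy as np


def partial_rope(x, rotary_fraction=0.1, base=10000.0):
    """Apply rotary embeddings to the first `rotary_fraction` of the head dimensions.

    x: array of shape (seq_len, head_dim), one attention head's queries or keys.
    Dimensions beyond the rotated slice are returned unchanged.
    """
    seq_len, head_dim = x.shape
    rot_dim = int(head_dim * rotary_fraction)
    rot_dim -= rot_dim % 2  # rotations act on pairs, so keep rot_dim even
    if rot_dim == 0:
        return x.copy()  # no rotated dimensions: no positional signal is added
    half = rot_dim // 2
    inv_freq = base ** (-np.arange(half) / half)  # one frequency per rotated pair
    # The cos/sin table has shape (seq_len, half), so it shrinks with rotary_fraction.
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dim]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dim:]], axis=-1)
```

With `head_dim=64` and `rotary_fraction=0.1`, this sketch rotates 6 dimensions and stores 3 cosine and 3 sine values per position instead of 32 of each.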

19. Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

ArXiv ID: 2603.12248

Authors: Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich

Abstract: Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.

Comment: Training objective methodology for language models: sequence-level feature matching through energy-based fine-tuning with theoretical grounding.