This is a remedial run for missed papers from 03/12/2026 to 03/12/2026.

Results generated on 03/21/2026.

Personalized Daily ArXiv Papers 2026-03-13

[gpt-5.4]  Prompt   Completion  Total
Token      153004   5639        158643
Cost       $0.38    $0.08       $0.47

Table of contents with paper titles:

  1. Statistical and structural identifiability in representation learning Authors: Walter Nelson, Marco Fumero, Theofanis Karaletsos, Francesco Locatello

  2. Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks Authors: Yuval Ran-Milo

  3. Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing Authors: Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun

  4. Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks Authors: Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Leszek Rutkowski

  5. Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan

  6. AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization Authors: Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang, Linghe Kong, Guihai Chen, Dawei Yin

  7. Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem Authors: Vugar Ismailov

  8. HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers Authors: Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis

  9. GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models Authors: Xiaoyun Liu, Divya Saxena, Jiannong Cao, Yuqing Zhao, Yiying Dong, Penghui Ruan

  10. LongFlow: Efficient KV Cache Compression for Reasoning M Authors: Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

  11. Disentangled Representation Learning through Unsupervised Symmetry Group Discovery Authors: Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri

  12. IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li

  13. Chemical Reaction Networks Learn Better than Spiking Neural Networks Authors: Sophie Jaffard, Ivo F. Sbalzarini

  14. Language Generation with Replay: A Learning-Theoretic View of Model Collapse Authors: Giorgio Racca, Michal Valko, Amartya Sanyal

  15. Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights Authors: Yulu Gan, Phillip Isola

  16. Sinkhorn-Drifting Generative Models Authors: Ping He, Om Khangaonkar, Hamed Pirsiavash, Yikun Bai, Soheil Kolouri

  17. Pruning-induced phases in fully-connected neural networks: the eumentia, the dementia, and the amentia Authors: Haining Pan, Nakul Aggarwal, J. H. Pixley

  18. Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE Authors: Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander

  19. Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models Authors: Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich

  20. Geometry-Aware Probabilistic Circuits via Voronoi Tessellations Authors: Sahil Sidheekh, Sriraam Natarajan

  21. Truth as a Compression Artifact in Language Model Training Authors: Konstantin Krestnikov

  22. On-Average Stability of Multipass Preconditioned SGD and Effective Dimension Authors: Simon Vary, Tyler Farghly, Ilja Kuzborskij, Patrick Rebeschini

  23. Harnessing Data Asymmetry: Manifold Learning in the Finsler World Authors: Thomas Dagès, Simon Weber, Daniel Cremers, Ron Kimmel

  24. KernelFoundry: Hardware-aware evolutionary GPU kernel optimization Authors: Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, Benjamin Ummenhofer

  25. Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models Authors: Jae-Won Chung, Jeff J. Ma, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury

  26. AutoScout: Structured Optimization for Automating ML System Configuration Authors: Jimmy Shong, Yuhan Ding, Yihan Jiang, Liheng Jing, Haonan Chen, Gaokai Zhang, Aditya Akella, Fan Lai

  27. Separable neural architectures as a primitive for unified predictive and generative intelligence Authors: Reza T. Batley, Apurba Sarker, Rajib Mostakim, Andrew Klichine, Sourav Saha

  28. A Quantitative Characterization of Forgetting in Post-Training Authors: Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

  29. Diffusion Models Generalize but Not in the Way You Might Think Authors: Tim Kaiser, Markus Kollmann

  30. Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control Authors: Ihor Kendiukhov

  31. Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA Authors: Rickard Brännvall

  32. Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference Authors: Valentyn Melnychuk, Vahid Balazadeh, Stefan Feuerriegel, Rahul G. Krishnan

  33. Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors Authors: Zehua Zou, Yiran Ma, Yulong Zhang, Zhengnan Li, Zeyu Yang, Jinhao Xie, Xiaoyu Jiang, Zhichao Chen

  34. HCP-DCNet: A Hierarchical Causal Primitive Dynamic Composition Network for Self-Improving Causal Understanding Authors: Ming Lei, Shufan Wu, Christophe Baehr

  35. One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies Authors: Shaolong Li, Lichao Sun, Yongchao Chen

  36. Revisiting Model Stitching In the Foundation Model Era Authors: Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

  37. Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness Authors: Arman Bolatov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

  38. A Stable Neural Statistical Dependence Estimator for Autoencoder Feature Analysis Authors: Bo Hu, Jose C Principe

  39. Probing Length Generalization in Mamba via Image Reconstruction Authors: Jan Rathjens, Robin Schiewer, Laurenz Wiskott, Anand Subramoney

  40. TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition Authors: Prabhu Vellaisamy, Shreesh Tripathi, Vignesh Natarajan, Surya Santhan Thenarasu, Shawn Blanton, John P. Shen

  41. Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group Authors: Alan Garbarz

  42. Context-dependent manifold learning: A neuromodulated constrained autoencoder approach Authors: Jérôme Adriaens, Guillaume Drion, Pierre Sacré

  43. Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency Authors: Xinyan Jiang, Wenjing Yu, Di Wang, Lijie Hu

  44. NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation Authors: Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen, Zhenyu Wang, Tianhao Liu, Siqi Chen, Weilin Huang

  45. BiGain: Unified Token Compression for Joint Generation and Classification Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen

  46. SpectralGuard: Detecting Memory Collapse Attacks in State Space Models Authors: Davi Bonetto

  47. A Geometrically-Grounded Drive for MDL-Based Optimization in Deep Learning Authors: Ming Lei, Shufan Wu, Christophe Baehr

  48. Efficient Reasoning with Balanced Thinking Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian

  49. Event-Driven Video Generation Authors: Chika Maduabuchi

  50. OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure Authors: Chuancheng Shi, Wenhua Wu, Fei Shen, Xiaogang Zhu, Kun Hu, Zhiyong Wang

  51. Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder Authors: Alaa Yasser, Kittipat Phunjanna, Marcos Escudero Viñolo, Catarina Barata, Jenny Benois-Pineau


1. Statistical and structural identifiability in representation learning

ArXiv ID: 2603.11970

Authors: Walter Nelson, Marco Fumero, Theofanis Karaletsos, Francesco Locatello

Abstract: Representation learning models exhibit a surprising stability in their internal representations. Whereas most prior work treats this stability as a single property, we formalize it as two distinct concepts: statistical identifiability (consistency of representations across runs) and structural identifiability (alignment of representations with some unobserved ground truth). Recognizing that perfect pointwise identifiability is generally unrealistic for modern representation learning models, we propose new model-agnostic definitions of statistical and structural near-identifiability of representations up to some error tolerance $ε$. Leveraging these definitions, we prove a statistical $ε$-near-identifiability result for the representations of models with nonlinear decoders, generalizing existing identifiability theory beyond last-layer representations in e.g. generative pre-trained transformers (GPTs) to near-identifiability of the intermediate representations of a broad class of models including (masked) autoencoders (MAEs) and supervised learners. Although these weaker assumptions confer weaker identifiability, we show that independent components analysis (ICA) can resolve much of the remaining linear ambiguity for this class of models, and validate and measure our near-identifiability claims empirically. With additional assumptions on the data-generating process, statistical identifiability extends to structural identifiability, yielding a simple and practical recipe for disentanglement: ICA post-processing of latent representations. On synthetic benchmarks, this approach achieves state-of-the-art disentanglement using a vanilla autoencoder. With a foundation model-scale MAE for cell microscopy, it disentangles biological variation from technical batch effects, substantially improving downstream generalization.

Comment: Representation learning theory: formalizes statistical vs structural identifiability and proves near-identifiability beyond last-layer representations.

Relevance: 10 Novelty: 9


2. Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

ArXiv ID: 2603.11487

Authors: Yuval Ran-Milo

Abstract: Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. Are sinks a byproduct of the optimization/training regime? Or are they sometimes functionally necessary in softmax Transformers? We prove that, in some settings, it is the latter: computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025; Guo et al., 2024). We also prove that non-normalized ReLU attention can solve the same task without any sink, confirming that the normalization constraint is the fundamental driver of sink behavior. Experiments validate our predictions and demonstrate they extend beyond the theoretically analyzed setting: softmax models develop strong sinks while ReLU attention eliminates them in both single-head and multi-head variants.

Comment: Provides a proof that attention sinks are functionally necessary in softmax Transformers for trigger-conditional computation.

Relevance: 10 Novelty: 9
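
Illustrative sketch (not from the paper): the intuition in the abstract can be checked on a toy scalar example. The values, scores, and the BOS-like zero-valued position below are all assumptions made for illustration. Softmax weights sum to 1, so the output is a convex combination of the values; the only way to get close to zero is to park the mass on a zero-valued anchor. ReLU weights are not normalized and can all be zero.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu(scores):
    """Unnormalized ReLU attention weights."""
    return [max(0.0, s) for s in scores]


def attend(weights, values):
    """Weighted sum of scalar values."""
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical toy sequence: position 0 is a BOS-like token with value 0,
# positions 1..3 carry content.
values = [0.0, 2.0, 4.0, 6.0]
content = values[1:]

# Trigger case: uniform softmax over the content tokens returns their average.
trigger_out = attend(softmax([0.0, 0.0, 0.0]), content)

# No-trigger case with softmax: the weights must sum to 1, so the head can only
# approach zero by putting almost all mass on the zero-valued position 0.
sink_weights = softmax([20.0, 0.0, 0.0, 0.0])
softmax_default_out = attend(sink_weights, values)

# No-trigger case with ReLU attention: negative scores give all-zero weights,
# so the output is exactly zero and no sink position is needed.
relu_default_out = attend(relu([-1.0, -1.0, -1.0]), content)

print(trigger_out, sink_weights[0], softmax_default_out, relu_default_out)
```

In this toy setup the softmax head reaches its default state only through the anchor at position 0, while the ReLU head reaches it with no anchor at all.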


3. Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

ArXiv ID: 2603.11535

Authors: Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun

Abstract: Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6$\times$ fewer tokens.

Comment: Model architecture innovation: threshold-based MoE routing gives causal dynamic computation allocation with load balancing without auxiliary losses.

Relevance: 10 Novelty: 8
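
Illustrative sketch (not from the paper): a minimal version of threshold routing as the abstract describes it, where each token is compared against a per-expert EMA threshold independently of other tokens. The update rule is an assumption: here the threshold tracks the batch score above which a target fraction of tokens falls. The class name `ExpertThreshold`, the Gaussian router scores, and all hyperparameters are hypothetical.

```python
import random


class ExpertThreshold:
    """Per-expert EMA threshold. The update rule is an assumed stand-in."""

    def __init__(self, target_fraction, decay=0.9):
        self.target_fraction = target_fraction
        self.decay = decay
        self.threshold = 0.0

    def update(self, batch_scores):
        # Assumed rule: find the score that admits `target_fraction` of the
        # batch, then blend it into the running threshold.
        ranked = sorted(batch_scores, reverse=True)
        k = max(1, round(self.target_fraction * len(ranked)))
        cutoff = ranked[k - 1]
        self.threshold = self.decay * self.threshold + (1 - self.decay) * cutoff

    def accepts(self, score):
        # Depends only on this token's own score, so routing is causal.
        return score > self.threshold


rng = random.Random(0)
experts = [ExpertThreshold(target_fraction=0.25) for _ in range(4)]

# "Training": each expert observes router scores for batches of 256 tokens.
for _ in range(200):
    for expert in experts:
        expert.update([rng.gauss(0.0, 1.0) for _ in range(256)])

# "Inference": fresh tokens are routed one at a time against fixed thresholds.
tokens = [[rng.gauss(0.0, 1.0) for _ in experts] for _ in range(4000)]
experts_per_token = [
    sum(e.accepts(s) for e, s in zip(experts, scores)) for scores in tokens
]
load = [
    sum(e.accepts(scores[i]) for scores in tokens) / len(tokens)
    for i, e in enumerate(experts)
]

print("per-expert load:", [round(x, 3) for x in load])
print("experts per token (min, max):", min(experts_per_token), max(experts_per_token))
```

Under these assumptions each expert ends up with roughly the target share of tokens without an auxiliary loss, while the number of experts per token varies, which is the dynamic-computation property the abstract emphasizes.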


4. Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

ArXiv ID: 2603.12354

Authors: Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Leszek Rutkowski

Abstract: Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network's structural "kinetic utility". First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92$\times$) without sacrificing the full-model accuracy.

Comment: Model compression and dynamic networks: unified utility metric for structural pruning and routing based on alternating gradient flow.

Relevance: 9 Novelty: 8


5. Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

ArXiv ID: 2603.12038

Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan

Abstract: Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.

Comment: Inference efficiency for transformers: training-free decoding acceleration using stable within-sentence attention support and sparse memory refresh.

Relevance: 9 Novelty: 8


6. AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

ArXiv ID: 2603.11873

Authors: Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang, Linghe Kong, Guihai Chen, Dawei Yin

Abstract: The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.

Comment: Systems co-design for dynamic sparse models: token-level pre-gating and fused kernels to make dynamic LoRA/MoE-style adapter inference efficient.

Relevance: 9 Novelty: 8
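
Illustrative sketch (not from the paper): the "decide-once, apply-everywhere" idea can be shown with plain Python matrices. One gating decision picks an adapter for every layer, and the chosen low-rank update is merged into each backbone weight as W + BA before the forward pass. The gate, the adapter names, and all weights are hypothetical, and the real system performs the merge in a fused CUDA kernel rather than in Python.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matadd(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def pre_gate(x, gate_vectors):
    """Hypothetical token-level gate: pick one adapter name for all layers."""
    flat = [row[0] for row in x]
    return max(gate_vectors, key=lambda n: sum(g * v for g, v in zip(gate_vectors[n], flat)))


def forward_unfused(x, layers, name):
    """Backbone path plus a separate LoRA path at every layer."""
    for layer in layers:
        b, a = layer["adapters"][name]
        x = matadd(matmul(layer["W"], x), matmul(b, matmul(a, x)))
    return x


def forward_fused(x, layers, name):
    """Merge W + B A once per layer, then run a single matmul per layer."""
    merged = [matadd(l["W"], matmul(*l["adapters"][name])) for l in layers]
    for w in merged:
        x = matmul(w, x)
    return x


# Toy two-layer model with rank-1 adapters (integer weights keep results exact).
layers = [
    {"W": [[1, 0], [0, 1]],
     "adapters": {"math": ([[1], [2]], [[1, 1]]), "code": ([[0], [1]], [[2, 0]])}},
    {"W": [[2, 1], [0, 1]],
     "adapters": {"math": ([[1], [0]], [[0, 1]]), "code": ([[1], [1]], [[1, 0]])}},
]
gate_vectors = {"math": [1, 0], "code": [0, 1]}

x = [[1], [2]]                      # one token as a column vector
chosen = pre_gate(x, gate_vectors)  # single decision reused by every layer
out_unfused = forward_unfused(x, layers, chosen)
out_fused = forward_fused(x, layers, chosen)
print(chosen, out_unfused, out_fused)
```

Because the adapter choice is fixed before the token enters the network, the merged weights give the same output as the separate backbone and adapter paths, with one matrix product per layer instead of three.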


7. Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem

ArXiv ID: 2603.11972

Authors: Vugar Ismailov

Abstract: Deep Operator Networks (DeepONets) provide a branch-trunk neural architecture for approximating nonlinear operators acting between function spaces. In the classical operator approximation framework, the input is a function $u\in C(K_1)$ defined on a compact set $K_1$ (typically a compact subset of a Banach space), and the operator maps $u$ to an output function $G(u)\in C(K_2)$ defined on a compact Euclidean domain $K_2\subset\mathbb{R}^d$. In this paper, we develop a topological extension in which the operator input lies in an arbitrary Hausdorff locally convex space $X$. We construct topological feedforward neural networks on $X$ using continuous linear functionals from the dual space $X^*$ and introduce topological DeepONets whose branch component acts on $X$ through such linear measurements, while the trunk component acts on the Euclidean output domain. Our main theorem shows that continuous operators $G:V\to C(K;\mathbb{R}^m)$, where $V\subset X$ and $K\subset\mathbb{R}^d$ are compact, can be uniformly approximated by such topological DeepONets. This extends the classical Chen-Chen operator approximation theorem from spaces of continuous functions to locally convex spaces and yields a branch-trunk approximation theorem beyond the Banach-space setting.

Comment: Theory of neural operators: extends DeepONet universal approximation from Banach-function settings to general locally convex spaces.

Relevance: 9 Novelty: 8


8. HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

ArXiv ID: 2603.12222

Authors: Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis

Abstract: Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.

Comment: Compression methodology: end-to-end multi-granular stochastic auto-pruning for ViTs across heads, FFNs, and intra-block dimensions.

Relevance: 9 Novelty: 8
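
Illustrative sketch (not from the paper): a Gumbel-Sigmoid gate is a sigmoid applied to a learnable logit plus logistic noise, divided by a temperature. The snippet below samples one macro gate for a head and micro gates for its dimensions. Combining the two levels by multiplication and summarizing cost as an expected keep fraction are assumptions for illustration, as are all logits and the temperature.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gumbel_sigmoid(logit, tau, rng):
    """Relaxed binary gate in (0, 1).

    The difference of two independent Gumbel(0, 1) samples follows a
    Logistic(0, 1) distribution, which is sampled here as log(u) - log(1 - u).
    """
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((logit + logistic_noise) / tau)


rng = random.Random(0)
tau = 0.5

# Hypothetical learnable logits: one macro gate for an attention head and
# four micro gates for dimensions inside that head.
macro_logit = 1.0
micro_logits = [2.0, -2.0, 0.5, -0.5]

# One stochastic forward pass: a dimension survives only as far as both its
# own micro gate and the enclosing macro gate are open (assumed product rule).
macro = gumbel_sigmoid(macro_logit, tau, rng)
effective = [macro * gumbel_sigmoid(l, tau, rng) for l in micro_logits]

# Assumed differentiable cost proxy: expected fraction of dimensions kept.
expected_keep = sigmoid(macro_logit) * sum(sigmoid(l) for l in micro_logits) / len(micro_logits)

# Sanity check: thresholding the relaxed gate at 0.5 keeps the unit with
# probability sigmoid(logit), independent of the temperature.
samples = [gumbel_sigmoid(macro_logit, tau, rng) for _ in range(20000)]
hard_rate = sum(s > 0.5 for s in samples) / len(samples)

print(effective, round(expected_keep, 4), round(hard_rate, 3))
```

Lowering `tau` pushes sampled gates toward 0 or 1, which is how a relaxation of this kind can settle into a discrete sub-network by the end of training.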


9. GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models

ArXiv ID: 2603.13418

Authors: Xiaoyun Liu, Divya Saxena, Jiannong Cao, Yuqing Zhao, Yiying Dong, Penghui Ruan

Abstract: Structured pruning is widely used to compress large language models (LLMs), yet its effectiveness depends heavily on neuron importance estimation. Most existing methods estimate neuron importance from activation statistics on a single calibration dataset, which introduces calibration bias and degrades downstream cross-task generalization. We observe that neurons exhibit heterogeneous distribution sensitivity, with distribution-robust neurons maintaining consistent rankings across datasets and distribution-sensitive neurons showing high cross-dataset ranking variance. Based on this, we identify two structural limitations in existing methods. First, ranking all neurons within a shared space causes distribution-sensitive neurons that strongly activate on calibration inputs to dominate, crowding out distribution-robust neurons critical for out-of-distribution tasks. Second, applying activation-based importance metrics uniformly can be unreliable. Distribution-sensitive neurons that infrequently activate on calibration data receive insufficient activation signal for accurate local ranking. To address these limitations, we propose GPrune-LLM, a generalization-aware structured pruning framework that explicitly accounts for neuron differences in cross-distribution behavior. We first partition neurons into behavior-consistent modules to localize ranking competition, then evaluate activation-based metric reliability per module according to distribution sensitivity and score magnitude. For modules where activation-based scoring is unreliable, we switch to an activation-independent metric. Finally, we adaptively learn module-wise sparsity. Extensive experiments across multiple downstream tasks demonstrate GPrune-LLM's consistent improvements in post-compression generalization, particularly at high sparsity, and reduced dependence on importance metric choice.

Comment: Compression methodology: structured LLM pruning guided by cross-distribution neuron sensitivity to improve post-pruning generalization.

Relevance: 9 Novelty: 8


10. LongFlow: Efficient KV Cache Compression for Reasoning M

ArXiv ID: 2603.11504

Authors: Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8 times throughput improvement with 80% KV cache compression with minimal impact on model accuracy.

Comment: Inference efficiency: KV-cache compression for long-output reasoning models with negligible-overhead importance estimation and fused custom kernel.

Relevance: 9 Novelty: 8


11. Disentangled Representation Learning through Unsupervised Symmetry Group Discovery

ArXiv ID: 2603.11790

Authors: Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri

Abstract: Symmetry-based disentangled representation learning leverages the group structure of environment transformations to uncover the latent factors of variation. Prior approaches to symmetry-based disentanglement have required strong prior knowledge of the symmetry group's structure, or restrictive assumptions about the subgroup properties. In this work, we remove these constraints by proposing a method whereby an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment. We prove the identifiability of the true symmetry group decomposition under minimal assumptions, and derive two algorithms: one for discovering the group decomposition from interaction data, and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without assuming specific subgroup properties. Our method is validated on three environments exhibiting different group decompositions, where it outperforms existing LSBD approaches.

Comment: Representation learning theory: unsupervised symmetry group discovery with identifiability guarantees for symmetry-based disentanglement.

Relevance: 9 Novelty: 8


12. IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

ArXiv ID: 2603.12201

Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li

Abstract: Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).

Comment: Attention efficiency: cross-layer reuse of sparse attention top-k indices cuts indexer cost with training-free and training-aware configurations.

Relevance: 9 Novelty: 8
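
Illustrative sketch (not from the paper): the Full/Shared split can be expressed as a small cache over layers. A Full layer runs the indexer and stores its top-k token indices; a Shared layer reuses the stored result. The sketch assumes "nearest" means the most recent preceding Full layer, and `toy_indexer` is a hypothetical stand-in for the real lightning indexer.

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_with_index_cache(num_layers, full_layers, indexer, k):
    """Return per-layer top-k indices and the number of indexer calls made."""
    cached = None
    calls = 0
    per_layer = []
    for layer in range(num_layers):
        if layer in full_layers:
            # Full layer: run its own indexer and refresh the cache.
            cached = top_k_indices(indexer(layer), k)
            calls += 1
        if cached is None:
            raise ValueError("the first layer must be a Full layer")
        # Shared layers fall through and reuse the cached indices.
        per_layer.append(cached)
    return per_layer, calls


def toy_indexer(layer):
    """Hypothetical indexer: deterministic relevance scores for 12 tokens."""
    return [(t * 7 + layer * 3) % 11 for t in range(12)]


num_layers = 8
full_layers = {0, 4}  # assumed interleaved pattern: 1 Full layer per 4 layers
per_layer, calls = run_with_index_cache(num_layers, full_layers, toy_indexer, k=4)
removed_fraction = 1 - calls / num_layers

print(per_layer[0], per_layer[4], calls, removed_fraction)
```

With one Full layer in every four, the indexer runs in 2 of 8 layers, which corresponds to the 75% reduction in indexer computations quoted in the abstract; whether quality holds depends on how similar the selections really are across layers.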


13. Chemical Reaction Networks Learn Better than Spiking Neural Networks

ArXiv ID: 2603.12060

Authors: Sophie Jaffard, Ivo F. Sbalzarini

Abstract: We mathematically prove that chemical reaction networks without hidden layers can solve tasks for which spiking neural networks require hidden layers. Our proof uses the deterministic mass-action kinetics formulation of chemical reaction networks. Specifically, we prove that a certain reaction network without hidden layers can learn a classification task previously proved to be achievable by a spiking neural network with hidden layers. We provide analytical regret bounds for the global behavior of the network and analyze its asymptotic behavior and Vapnik-Chervonenkis dimension. In a numerical experiment, we confirm the learning capacity of the proposed chemical reaction network for classifying handwritten digits in pixel images, and we show that it solves the task more accurately and efficiently than a spiking neural network with hidden layers. This provides a motivation for machine learning in chemical computers and a mathematical explanation for how biological cells might exhibit more efficient learning behavior within biochemical reaction networks than neuronal networks.

Comment: Theoretical architecture result proving stronger expressivity of chemical reaction networks than spiking neural networks, with regret and VC-dimension analysis.

Relevance: 8 Novelty: 9


14. Language Generation with Replay: A Learning-Theoretic View of Model Collapse

ArXiv ID: 2603.11784

Authors: Giorgio Racca, Michal Valko, Amartya Sanyal

Abstract: As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web; together, these trends raise the likelihood of generated text re-entering future training corpora, increasing the associated risk of performance degradation often called model collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or, in some cases, blissful ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective: we study it through the theoretical lens of the language generation in the limit framework, introducing a replay adversary that augments the example stream with the generator's own past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation: while replay is benign for the strongest notion of uniform generation, it provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering, while our separations show when these ideas can fail.

Comment: Learning theory for representation/data dynamics: formal characterization of model collapse under replayed self-generated text.

Relevance: 8 Novelty: 9


15. Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

ArXiv ID: 2603.12228

Authors: Yulu Gan, Phillip Isola

Abstract: Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task-experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples $N$ parameter perturbations at random, selects the top $K$, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and ES for contemporary large-scale models.

Comment: Representation/post-training insight: shows large pretrained models contain dense nearby task experts, enabling parallel random perturbation selection and ensembling.

Relevance: 8 Novelty: 9
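
The sample, select, and vote procedure mentioned in the abstract can be sketched on a toy problem. Everything here is an assumption made for illustration: the linear classifier stands in for a pretrained model, Gaussian noise stands in for the perturbation distribution, and accuracy on a small labelled set stands in for the selection score. It is not the paper's code.

```python
import random
from collections import Counter


def predict(weights, x):
    """Toy linear classifier: label 1 if the dot product is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def accuracy(weights, data):
    """Fraction of (x, y) pairs in data that the weights classify correctly."""
    return sum(predict(weights, x) == y for x, y in data) / len(data)


def sample_and_select(base, data, n, k, sigma, rng):
    """Draw n Gaussian perturbations of base and keep the k most accurate."""
    candidates = [[w + rng.gauss(0.0, sigma) for w in base] for _ in range(n)]
    candidates.sort(key=lambda c: accuracy(c, data), reverse=True)
    return candidates[:k]


def majority_vote(experts, x):
    """Ensemble prediction: the most common label among the experts."""
    votes = Counter(predict(w, x) for w in experts)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
data = [([1.0, 0.0], 1), ([-1.0, 0.0], 0), ([0.5, 0.2], 1), ([-0.3, 0.1], 0)]
experts = sample_and_select([0.1, 0.0], data, n=20, k=3, sigma=0.5, rng=rng)
label = majority_vote(experts, [1.0, 0.0])
```

Each perturbation is scored independently, so the sampling step is trivially parallel; an odd k avoids ties in the binary vote.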


16. Sinkhorn-Drifting Generative Models

ArXiv ID: 2603.12366

Authors: Ping He, Om Khangaonkar, Hamed Pirsiavash, Yikun Bai, Soheil Kolouri

Abstract: We establish a theoretical link between the recently proposed "drifting" generative dynamics and gradient flows induced by the Sinkhorn divergence. In a particle discretization, the drift field admits a cross-minus-self decomposition: an attractive term toward the target distribution and a repulsive/self-correction term toward the current model, both expressed via one-sided normalized Gibbs kernels. We show that Sinkhorn divergence yields an analogous cross-minus-self structure, but with each term defined by entropic optimal-transport couplings obtained through two-sided Sinkhorn scaling (i.e., enforcing both marginals). This provides a precise sense in which drifting acts as a surrogate for a Sinkhorn-divergence gradient flow, interpolating between one-sided normalization and full two-sided Sinkhorn scaling. Crucially, this connection resolves an identifiability gap in prior drifting formulations: leveraging the definiteness of the Sinkhorn divergence, we show that zero drift (equilibrium of the dynamics) implies that the model and target measures match. Experiments show that Sinkhorn drifting reduces sensitivity to kernel temperature and improves one-step generative quality, trading off additional training time for a more stable optimization, without altering the inference procedure used by drift methods. These theoretical gains translate to strong low-temperature improvements in practice: on FFHQ-ALAE at the lowest temperature setting we evaluate, Sinkhorn drifting reduces mean FID from 187.7 to 37.1 and mean latent EMD from 453.3 to 144.4, while on MNIST it preserves full class coverage across the temperature sweep. Project page: https://mint-vu.github.io/SinkhornDrifting/

Comment: Generative modeling theory: links drifting dynamics to Sinkhorn-divergence gradient flows and resolves equilibrium identifiability.

Relevance: 8 Novelty: 9
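
The contrast between one-sided normalization of a Gibbs kernel and two-sided Sinkhorn scaling can be seen in a few lines. This sketch only illustrates the two normalizations on 1D points with uniform marginals; it does not reproduce the paper's drift field, and the squared-distance cost, the function names, and the fixed iteration count are assumptions made for the example.

```python
import math


def gibbs_kernel(xs, ys, eps):
    """K[i][j] = exp(-|x_i - y_j|^2 / eps) for 1D points (assumed cost)."""
    return [[math.exp(-((x - y) ** 2) / eps) for y in ys] for x in xs]


def row_normalize(K):
    """One-sided normalization: every row sums to 1, columns are unconstrained."""
    return [[k / sum(row) for k in row] for row in K]


def sinkhorn_plan(K, iters=200):
    """Two-sided scaling: alternate row and column rescaling so the plan
    diag(u) K diag(v) has uniform row and column marginals."""
    n, m = len(K), len(K[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


K = gibbs_kernel([0.0, 1.0], [0.0, 0.5, 2.0], eps=1.0)
one_sided = row_normalize(K)   # rows sum to 1; column mass is uneven
two_sided = sinkhorn_plan(K)   # both marginals are (approximately) uniform
```

With the row-normalized kernel, the far-away target point receives much less total mass than the others, whereas the Sinkhorn plan forces every target point to receive its share.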


17. Pruning-induced phases in fully-connected neural networks: the eumentia, the dementia, and the amentia

ArXiv ID: 2603.12316

Authors: Haining Pan, Nakul Aggarwal, J. H. Pixley

Abstract: Modern neural networks are heavily overparameterized, and pruning, which removes redundant neurons or connections, has emerged as a key approach to compressing them without sacrificing performance. However, while practical pruning methods are well developed, whether pruning induces sharp phase transitions in the neural networks and, if so, to what universality class they belong, remain open questions. To address this, we study fully-connected neural networks trained on MNIST, independently varying the dropout (i.e., removing neurons) rate at both the training and evaluation stages to map the phase diagram. We identify three distinct phases: eumentia (the network learns), dementia (the network has forgotten), and amentia (the network cannot learn), sharply distinguished by the power-law scaling of the cross-entropy loss with the training dataset size. In the eumentia phase, the algebraic decay of the loss, documented in the machine learning literature as neural scaling laws, is from the perspective of statistical mechanics the hallmark of quasi-long-range order. We demonstrate that the transition between the eumentia and dementia phases is accompanied by scale invariance, with a diverging length scale that exhibits hallmarks of a Berezinskii-Kosterlitz-Thouless-like transition; the phase structure is robust across different network widths and depths. Our results establish that dropout-induced pruning provides a concrete setting in which neural network behavior can be understood through the lens of statistical mechanics.

Comment: Theory of compression dynamics: identifies pruning-induced phase transitions in fully connected networks with statistical-mechanics analysis.

Relevance: 8 Novelty: 9


18. Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE

ArXiv ID: 2603.11611

Authors: Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander

Abstract: Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which becomes especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache, while achieving comparable final loss. In this work, we present a systematic study examining the impact of partial RoPE on training dynamics and convergence across architectures and datasets. Our findings uncover several notable patterns: (1) applying RoPE to only a small fraction of dimensions (around 10%) achieves convergence comparable to using full RoPE; (2) these trends hold consistently across model sizes, sequence lengths, architectures, and datasets of varying quality, with higher-quality data resulting in lower overall loss and similar benchmark performance; and (3) some models trained with NoPE (No Positional Encoding) show unstable learning trajectories, which can be alleviated through minimal RoPE application or through QK-Norm, which converges to a higher loss. Together, these results offer practical guidance for model designers aiming to balance efficiency and training stability, while emphasizing the previously overlooked importance of partial RoPE.

Comment: Transformer efficiency: analyzes partial RoPE as a core positional-encoding design that preserves convergence while greatly reducing cache memory.

Relevance: 9 Novelty: 7
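
Partial RoPE, as described in the abstract, rotates only a fraction of the hidden dimensions and leaves the rest untouched. A minimal sketch on a single vector follows. The interleaved pairing of dimensions, the use of the rotated width in the frequency exponent, and the name `partial_rope` are assumptions made for this example; real implementations differ in layout and may not match the paper's setup.

```python
import math


def partial_rope(x, position, rotary_fraction, base=10000.0):
    """Rotate the first rotary_fraction of x by position-dependent angles.

    Dimensions are rotated in adjacent pairs (x[0], x[1]), (x[2], x[3]), ...
    Dimensions beyond the rotated width pass through unchanged.
    """
    d = len(x)
    rot_dim = int(d * rotary_fraction)
    rot_dim -= rot_dim % 2  # rotations act on pairs, so keep the width even
    out = list(x)
    for i in range(0, rot_dim, 2):
        theta = position * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


x = [1.0, 0.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = partial_rope(x, position=1, rotary_fraction=0.25)  # only x[0], x[1] rotate
```

In this sketch only `rot_dim / 2` angles are needed per position, so any cached cos/sin table shrinks in proportion to the rotated fraction; a fraction of 0 reduces to no positional encoding at all.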


19. Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

ArXiv ID: 2603.12248

Authors: Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich

Abstract: Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.

Comment: Training objective methodology for language models: sequence-level feature matching through energy-based fine-tuning with theoretical grounding.

Relevance: 8 Novelty: 8


20. Geometry-Aware Probabilistic Circuits via Voronoi Tessellations

ArXiv ID: 2603.11946

Authors: Sahil Sidheekh, Sriraam Natarajan

Abstract: Probabilistic circuits (PCs) enable exact and tractable inference but employ data independent mixture weights that limit their ability to capture local geometry of the data manifold. We propose Voronoi tessellations (VT) as a natural way to incorporate geometric structure directly into the sum nodes of a PC. However, naïvely introducing such structure breaks tractability. We formalize this incompatibility and develop two complementary solutions: (1) an approximate inference framework that provides guaranteed lower and upper bounds for inference, and (2) a structural condition for VT under which exact tractable inference is recovered. Finally, we introduce a differentiable relaxation for VT that enables gradient-based learning and empirically validate the resulting approach on standard density estimation tasks.

Comment: Probabilistic modeling architecture: geometry-aware probabilistic circuits with Voronoi-structured sum nodes and tractability conditions.

Relevance: 8 Novelty: 8


21. Truth as a Compression Artifact in Language Model Training

ArXiv ID: 2603.11749

Authors: Konstantin Krestnikov

Abstract: Why do language models trained on contradictory data prefer correct answers? In controlled experiments with small transformers (3.5M--86M parameters), we show that this preference tracks the compressibility structure of errors rather than truth per se. We train GPT-2 style models on corpora where each mathematical problem appears with both correct and incorrect solutions -- a denoising design that directly models conflicting information about the same fact. When errors are random, models extract the correct signal with accuracy scaling from 65% to 85% with model size. When errors follow a coherent alternative rule system, accuracy drops to chance (~45--51%): the model cannot distinguish the false system from truth. A multi-rule experiment reveals a sharp crossover: a single coherent alternative rule eliminates truth bias entirely, but adding a second competing rule restores most of it (47%->78%), with continued growth through N=10 (88%). The same pattern reproduces on real Wikipedia text (71% vs 46%). We propose the Compression--Consistency Principle as an explanatory hypothesis: in these settings, gradient descent favors the most compressible answer cluster, not truth per se. Truth bias emerges only when falsehood is structurally incoherent. Whether this principle extends to large-scale pretraining remains an open question.

Comment: Representation-learning insight: argues truth preference emerges from compression structure, supported by controlled transformer training studies.

Relevance: 8 Novelty: 8


22. On-Average Stability of Multipass Preconditioned SGD and Effective Dimension

ArXiv ID: 2603.11989

Authors: Simon Vary, Tyler Farghly, Ilja Kuzborskij, Patrick Rebeschini

Abstract: We study trade-offs between the population risk curvature, geometry of the noise, and preconditioning on the generalisation ability of the multipass Preconditioned Stochastic Gradient Descent (PSGD). Many practical optimisation heuristics implicitly navigate this trade-off in different ways -- for instance, some aim to whiten gradient noise, while others aim to align updates with expected loss curvature. When the geometry of the population risk curvature and the geometry of the gradient noise do not match, an aggressive choice that improves one aspect can amplify instability along the other, leading to suboptimal statistical behavior. In this paper we employ on-average algorithmic stability to connect generalisation of PSGD to the effective dimension that depends on these sources of curvature. While existing techniques for on-average stability of SGD are limited to a single pass, as a first contribution we develop a new on-average stability analysis for multipass SGD that handles the correlations induced by data reuse. This allows us to derive excess risk bounds that depend on the effective dimension. In particular, we show that an improperly chosen preconditioner can yield suboptimal effective dimension dependence in both optimisation and generalisation. Finally, we complement our upper bounds with matching, instance-dependent lower bounds.

Comment: Foundational optimization theory: multipass PSGD stability analysis with effective-dimension-dependent excess risk bounds and matching lower bounds.

Relevance: 8 Novelty: 8


23. Harnessing Data Asymmetry: Manifold Learning in the Finsler World

ArXiv ID: 2603.11396

Authors: Thomas Dagès, Simon Weber, Daniel Cremers, Ron Kimmel

Abstract: Manifold learning is a fundamental task at the core of data analysis and visualisation. It aims to capture the simple underlying structure of complex high-dimensional data by preserving pairwise dissimilarities in low-dimensional embeddings. Traditional methods rely on symmetric Riemannian geometry, thus forcing symmetric dissimilarities and embedding spaces, e.g. Euclidean. However, in practice this discards valuable asymmetric information inherent to the non-uniformity of data samples. We propose harnessing this asymmetry by switching to Finsler geometry, an asymmetric generalisation of Riemannian geometry, and propose a Finsler manifold learning pipeline that constructs asymmetric dissimilarities and embeds in a Finsler space. This greatly broadens the applicability of existing asymmetric embedders beyond traditionally directed data to any data. We also modernise asymmetric embedders by generalising current reference methods to asymmetry, yielding Finsler t-SNE and Finsler UMAP. On controlled synthetic and large real datasets, we show that our asymmetric pipeline reveals valuable information lost in the traditional pipeline, e.g. density hierarchies, and consistently provides higher-quality embeddings than its Euclidean counterparts.

Comment: Foundational representation learning: extends manifold learning from symmetric Riemannian to asymmetric Finsler geometry with generalized t-SNE/UMAP.

Relevance: 8 Novelty: 8


24. KernelFoundry: Hardware-aware evolutionary GPU kernel optimization

ArXiv ID: 2603.12440

Authors: Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, Benjamin Ummenhofer

Abstract: Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel optimization strategies, and performance profiling outputs. Most existing LLM-based approaches to kernel generation rely on simple prompting and feedback loops, incorporating hardware awareness only indirectly through profiling feedback. We introduce KernelFoundry, an evolutionary framework that efficiently explores the GPU kernel design space through three key mechanisms: (1) MAP-Elites quality-diversity search with kernel-specific behavioral dimensions to sustain exploration across diverse optimization strategies; (2) meta-prompt evolution, which co-evolves prompts with kernels to uncover task-specific optimization strategies, and (3) template-based parameter optimization to tune kernels to inputs and hardware. We evaluate this framework on KernelBench, robust-kbench, and custom tasks, generating SYCL kernels as a cross-platform GPU programming model and CUDA kernels for comparison to prior work. Our approach consistently outperforms the baseline methods, achieving an average speedup of 2.3x on KernelBench for SYCL. Moreover, KernelFoundry is implemented as a distributed framework with remote access to diverse hardware, enabling rapid benchmarking and featuring a flexible user input layer that supports kernel generation for a wide range of real-world use cases beyond benchmarking.

Comment: Systems-level GPU optimization: evolutionary MAP-Elites framework for hardware-aware kernel search and prompt co-evolution.

Relevance: 8 Novelty: 8
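
The MAP-Elites quality-diversity search named in the abstract keeps one best candidate per cell of a discretized behavior space. The sketch below shows only that archive rule. The two-dimensional descriptor in [0, 1), the bin count, and the use of a speedup number as fitness are hypothetical choices for illustration; the paper's kernel-specific behavioral dimensions are not specified here.

```python
def try_insert(archive, candidate, fitness, descriptor, bins):
    """Store candidate if its cell is empty or it beats the current elite.

    archive maps a cell (tuple of bin indices) to (fitness, candidate).
    descriptor is a tuple of behavior values, each assumed to lie in [0, 1].
    Returns True if the candidate was stored.
    """
    cell = tuple(min(int(d * bins), bins - 1) for d in descriptor)
    incumbent = archive.get(cell)
    if incumbent is None or fitness > incumbent[0]:
        archive[cell] = (fitness, candidate)
        return True
    return False


archive = {}
# Hypothetical candidates: (name, measured speedup, behavior descriptor).
try_insert(archive, "kernel_a", 1.2, (0.1, 0.9), bins=4)
try_insert(archive, "kernel_b", 1.0, (0.1, 0.9), bins=4)  # same cell, slower
try_insert(archive, "kernel_c", 2.0, (0.6, 0.2), bins=4)  # new cell
```

Because a slower candidate can still survive if it occupies an otherwise empty cell, the archive preserves diverse strategies instead of collapsing onto a single best kernel.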


25. Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models

ArXiv ID: 2603.12118

Authors: Jae-Won Chung, Jeff J. Ma, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury

Abstract: Any-to-Any models are an emerging class of multimodal models that accept combinations of multimodal data (e.g., text, image, video, audio) as input and generate them as output. Serving these models is challenging; different requests with different input and output modalities traverse different paths through the model computation graph, and each component of the model has different scaling characteristics. We present Cornserve, a distributed serving system for generic Any-to-Any models. Cornserve provides a flexible task abstraction for expressing Any-to-Any model computation graphs, enabling component disaggregation and independent scaling. The distributed runtime dispatches compute to the data plane via an efficient record-and-replay execution model that keeps track of data dependencies, and forwards tensor data between components directly from the producer to the consumer. Built on Kubernetes with approximately 23K new lines of Python, Cornserve supports diverse Any-to-Any models and delivers up to 3.81$\times$ higher throughput and 5.79$\times$ lower tail latency. Cornserve is open-source, and the demo video is available on YouTube.

Comment: Distributed systems contribution: disaggregated serving architecture for any-to-any multimodal models with flexible computation-graph execution.

Relevance: 8 Novelty: 8


26. AutoScout: Structured Optimization for Automating ML System Configuration

ArXiv ID: 2603.11603

Authors: Jimmy Shong, Yuhan Ding, Yihan Jiang, Liheng Jing, Haonan Chen, Gaokai Zhang, Aditya Akella, Fan Lai

Abstract: Machine learning (ML) systems expose a rapidly expanding configuration space spanning model-parallelism strategies, communication optimizations, and low-level runtime parameters. End-to-end system efficiency is highly sensitive to these choices, yet identifying high-performance configurations is challenging due to heterogeneous feature types (e.g., sparse and dense parameters), conditional dependencies (e.g., valid execution parameters only under specific upstream decisions), and the high search (profiling) cost. Existing approaches either optimize a narrow subset of configuration dimensions or rely on ad-hoc heuristics that fail to generalize as configuration spaces continue to grow. We present AutoScout, a general-purpose systems configurator for ML training, fine-tuning, and inference. It formulates the system configuration as a mixed-discrete/continuous optimization problem with hierarchical dependencies and introduces a hybrid optimization framework that jointly refines sparse structural decisions and dense execution parameters. To reduce profiling cost, AutoScout adaptively prioritizes high-impact configuration features and ensembles simulators with varying fidelity. Across diverse models, hardware platforms, and deployment objectives, AutoScout consistently identifies high-performance configurations, achieving 2.7-3.0$\times$ training speedup over expert-tuned settings.

Comment: Systems-level optimizer for ML configuration spaces with hierarchical mixed discrete/continuous decisions and multi-fidelity profiling.

Relevance: 8 Novelty: 8


27. Separable neural architectures as a primitive for unified predictive and generative intelligence

ArXiv ID: 2603.12244

Authors: Reza T. Batley, Apurba Sarker, Rajib Mostakim, Andrew Klichine, Sourav Saha

Abstract: Intelligent systems across physics, language and perception often exhibit factorisable structure, yet are typically modelled by monolithic neural architectures that do not explicitly exploit this structure. The separable neural architecture (SNA) addresses this by formalising a representational class that unifies additive, quadratic and tensor-decomposed neural models. By constraining interaction order and tensor rank, SNAs impose a structural inductive bias that factorises high-dimensional mappings into low-arity components. Separability need not be a property of the system itself: it often emerges in the coordinates or representations through which the system is expressed. Crucially, this coordinate-aware formulation reveals a structural analogy between chaotic spatiotemporal dynamics and linguistic autoregression. By treating continuous physical states as smooth, separable embeddings, SNAs enable distributional modelling of chaotic systems. This approach mitigates the nonphysical drift characteristics of deterministic operators whilst remaining applicable to discrete sequences. The compositional versatility of this approach is demonstrated across four domains: autonomous waypoint navigation via reinforcement learning, inverse generation of multifunctional microstructures, distributional modelling of turbulent flow and neural language modelling. These results establish the separable neural architecture as a domain-agnostic primitive for predictive and generative intelligence, capable of unifying both deterministic and distributional representations.

Comment: Introduces separable neural architectures as a core architectural primitive that factorizes high-dimensional mappings via controlled interaction order and tensor rank.

Relevance: 8 Novelty: 8


28. A Quantitative Characterization of Forgetting in Post-Training

ArXiv ID: 2603.12163

Authors: Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

Abstract: Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develop theoretical results under a two-mode mixture abstraction (representing old and new tasks), proposed by Chen et al. (2025) (arXiv:2510.18874), and formalize forgetting in two forms: (i) mass forgetting, where the old mixture weight collapses to zero, and (ii) old-component drift, where an already-correct old component shifts during training. For equal-covariance Gaussian modes, we prove that forward-KL objectives trained on data from the new distribution drive the old weight to zero, while reverse-KL objectives converge to the true target (thereby avoiding mass forgetting) and perturb the old mean only through overlap-gated misassignment probabilities controlled by the Bhattacharyya coefficient, yielding drift that decays exponentially with mode separation and a locally well-conditioned geometry with exponential convergence. We further quantify how replay interacts with these objectives. For forward-KL, replay must modify the training distribution to change the population optimum; for reverse-KL, replay leaves the population objective unchanged but prevents finite-batch old-mode starvation through bounded importance weighting. Finally, we analyze three recently proposed near-on-policy post-training methods, SDFT (arXiv:2601.19897), TTT-Discover (arXiv:2601.16175), and OAPL (arXiv:2602.19362), via the same lens and derive explicit conditions under which each retains old mass and exhibits overlap-controlled drift. Overall, our results show that forgetting can be precisely quantified based on the interaction between divergence direction, geometric behavioral overlap, sampling regime, and the visibility of past behavior during training.

Comment: Theoretical analysis of forgetting in post-training, deriving objective-dependent conditions for mass forgetting and component drift.

Relevance: 8 Novelty: 8


29. Diffusion Models Generalize but Not in the Way You Might Think

ArXiv ID: 2603.13419

Authors: Tim Kaiser, Markus Kollmann

Abstract: Standard evaluation metrics suggest that Denoising Diffusion Models based on U-Net or Transformer architectures generalize well in practice. However, as it can be shown that an optimal Diffusion Model fully memorizes the training data, the model error determines generalization. Here, we show that although sufficiently large denoiser models show increasing memorization of the training set with increasing training time, the resulting denoising trajectories do not follow this trend. Our experiments indicate that the reason for this observation is rooted in the fact that overfitting occurs at intermediate noise levels, but the distribution of noisy training data at these noise levels has little overlap with denoising trajectories during inference. To gain more insight, we make use of a 2D toy diffusion model to show that overfitting at intermediate noise levels is largely determined by model error and the density of the data support. While the optimal denoising flow field localizes sharply around training samples, sufficient model error or dense support on the data manifold suppresses exact recall, yielding a smooth, generalizing flow field. To further support our results, we investigate how several factors, such as training time, model size, dataset size, condition granularity, and diffusion guidance, influence generalization behavior.

Comment: Foundational analysis of memorization and generalization dynamics in diffusion models across noise levels and denoising trajectories.

Relevance: 8 Novelty: 8


30. Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control

ArXiv ID: 2603.11940

Authors: Ihor Kendiukhov

Abstract: Mechanistic interpretability of biological foundation models has relied on selective feature sampling, pairwise interaction testing, and observational trajectory analysis. Each of these can introduce systematic bias. Here we present three experiments that address these limitations through exhaustive circuit tracing, higher order combinatorial ablation, and causal trajectory steering in Geneformer, a transformer based single cell foundation model. First, exhaustive tracing of all 4065 active sparse autoencoder features at layer 5 yields 1393850 significant downstream edges, a 27 fold expansion over selective sampling. This reveals a heavy tailed hub distribution in which 1.8 percent of features account for disproportionate connectivity and 40 percent of the top 20 hubs lack biological annotation. These results indicate systematic annotation bias in prior selective analyses. Second, three way combinatorial ablation across 8 feature triplets shows that redundancy deepens monotonically with interaction order, with a three way ratio of 0.59 versus a pairwise ratio of 0.74, and with zero synergy. This confirms that the model architecture is subadditive at all tested orders. Third, trajectory guided feature steering establishes a causal link between layer position and differentiation directionality. Late layer features at L17 consistently push cell states toward maturity, with fraction positive equal to 1.0. Early and mid layer features at L0 and L11 mostly push away from maturity, with fraction positive ranging from 0.00 to 0.58. Together these results move from correlation toward causal evidence for layer dependent control of cell state.

Comment: Representation learning and mechanistic interpretability study using exhaustive circuit tracing and higher-order ablations to characterize internal feature organization.

Relevance: 8 Novelty: 8


31. Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA

ArXiv ID: 2603.11799

Authors: Rickard Brännvall

Abstract: Membership inference attacks (MIAs) are becoming standard tools for auditing the privacy of machine learning models. The leading attacks -- LiRA (Carlini et al., 2022) and RMIA (Zarifzadeh et al., 2024) -- appear to use distinct scoring strategies, while the recently proposed BASE (Lassila et al., 2025) was shown to be equivalent to RMIA, making it difficult for practitioners to choose among them. We show that all three are instances of a single exponential-family log-likelihood ratio framework, differing only in their distributional assumptions and the number of parameters estimated per data point. This unification reveals a hierarchy (BASE1-4) that connects RMIA and LiRA as endpoints of a spectrum of increasing model complexity. Within this framework, we identify variance estimation as the key bottleneck at small shadow-model budgets and propose BaVarIA, a Bayesian variance inference attack that replaces threshold-based parameter switching with conjugate normal-inverse-gamma priors. BaVarIA yields a Student-t predictive (BaVarIA-t) or a Gaussian with stabilized variance (BaVarIA-n), providing stable performance without additional hyperparameter tuning. Across 12 datasets and 7 shadow-model budgets, BaVarIA matches or improves upon LiRA and RMIA, with the largest gains in the practically important low-shadow-model and offline regimes.

Comment: Unifies major membership inference attacks under an exponential-family likelihood-ratio framework and introduces Bayesian variance estimation for low-shadow-model regimes.

Relevance: 8 Novelty: 8


32. Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference

ArXiv ID: 2603.12037

Authors: Valentyn Melnychuk, Vahid Balazadeh, Stefan Feuerriegel, Rahul G. Krishnan

Abstract: Foundation models based on prior-data fitted networks (PFNs) have shown strong empirical performance in causal inference by framing the task as an in-context learning problem. However, it is unclear whether PFN-based causal estimators provide uncertainty quantification that is consistent with classical frequentist estimators. In this work, we address this gap by analyzing the frequentist consistency of PFN-based estimators for the average treatment effect (ATE). (1) We show that existing PFNs, when interpreted as Bayesian ATE estimators, can exhibit prior-induced confounding bias: the prior is not asymptotically overwritten by data, which, in turn, prevents frequentist consistency. (2) As a remedy, we suggest employing a calibration procedure based on a one-step posterior correction (OSPC). We show that the OSPC helps to restore frequentist consistency and can yield a semi-parametric Bernstein-von Mises theorem for calibrated PFNs (i.e., both the calibrated PFN-based estimators and the classical semi-parametric efficient estimators converge in distribution with growing data size). (3) Finally, we implement OSPC through tailoring martingale posteriors on top of the PFNs. In this way, we are able to recover functional nuisance posteriors from PFNs, required by the OSPC. In multiple (semi-)synthetic experiments, PFNs calibrated with our martingale posterior OSPC produce ATE uncertainty that (i) asymptotically matches frequentist uncertainty and (ii) is well calibrated in finite samples in comparison to other Bayesian ATE estimators.

Comment: Provides theory for prior-data fitted networks, proving inconsistency and proposing a calibrated posterior correction with Bernstein-von Mises guarantees.

Relevance: 8 Novelty: 8


33. Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors

ArXiv ID: 2603.11473

Authors: Zehua Zou, Yiran Ma, Yulong Zhang, Zhengnan Li, Zeyu Yang, Jinhao Xie, Xiaoyu Jiang, Zhichao Chen

Abstract: Nonlinear Probabilistic Latent Variable Models (NPLVMs) are a cornerstone of soft sensor modeling due to their capacity for uncertainty delineation. However, conventional NPLVMs are trained using amortized variational inference, where neural networks parameterize the variational posterior. While facilitating model implementation, this parameterization converts the distributional optimization problem within an infinite-dimensional function space to parameter optimization within a finite-dimensional parameter space, which introduces an approximation error gap, thereby degrading soft sensor modeling accuracy. To alleviate this issue, we introduce KProxNPLVM, a novel NPLVM that instead relaxes the objective itself to improve performance. Specifically, we first prove that the conventional approach induces an approximation error. Based on this, we use the Wasserstein distance as a proximal operator to relax the learning objective, yielding a new variational inference strategy derived from solving the relaxed optimization problem. On this foundation, we rigorously derive KProxNPLVM's optimization procedure and prove that the algorithm converges in a way that sidesteps the approximation error. Finally, extensive experiments on synthetic and real-world industrial datasets are conducted to demonstrate the efficacy of the proposed KProxNPLVM.

Comment: Advances probabilistic latent variable modeling with a new proximal variational inference objective and convergence analysis to reduce amortization error.

Relevance: 8 Novelty: 8


34. HCP-DCNet: A Hierarchical Causal Primitive Dynamic Composition Network for Self-Improving Causal Understanding

ArXiv ID: 2603.12305

Authors: Ming Lei, Shufan Wu, Christophe Baehr

Abstract: The ability to understand and reason about cause and effect -- encompassing interventions, counterfactuals, and underlying mechanisms -- is a cornerstone of robust artificial intelligence. While deep learning excels at pattern recognition, it fundamentally lacks a model of causality, making systems brittle under distribution shifts and unable to answer "what-if" questions. This paper introduces the Hierarchical Causal Primitive Dynamic Composition Network (HCP-DCNet), a unified framework that bridges continuous physical dynamics with discrete symbolic causal inference. Departing from monolithic representations, HCP-DCNet decomposes causal scenes into reusable, typed causal primitives organized into four abstraction layers: physical, functional, event, and rule. A dual-channel routing network dynamically composes these primitives into task-specific, fully differentiable Causal Execution Graphs (CEGs). Crucially, the system employs a causal-intervention-driven meta-evolution strategy, enabling autonomous self-improvement through a constrained Markov decision process. We establish rigorous theoretical guarantees, including type-safe composition, routing convergence, and universal approximation of causal dynamics. Extensive experiments across simulated physical and social environments demonstrate that HCP-DCNet significantly outperforms state-of-the-art baselines in causal discovery, counterfactual reasoning, and compositional generalization. This work provides a principled, scalable, and interpretable architecture for building AI systems with human-like causal abstraction and continual self-refinement capabilities.

Comment: Dynamic composition architecture with typed causal primitives and routing into differentiable execution graphs directly targets core model architecture design.

Relevance: 8 Novelty: 8


35. One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies

ArXiv ID: 2603.12480

Authors: Shaolong Li, Lichao Sun, Yongchao Chen

Abstract: Generative flow and diffusion models provide the continuous, multimodal action distributions needed for high-precision robotic policies. However, their reliance on iterative sampling introduces severe inference latency, degrading control frequency and harming performance in time-sensitive manipulation. To address this problem, we propose the One-Step Flow Policy (OFP), a from-scratch self-distillation framework for high-fidelity, single-step action generation without a pre-trained teacher. OFP unifies a self-consistency loss to enforce coherent transport across time intervals, and a self-guided regularization to sharpen predictions toward high-density expert modes. In addition, a warm-start mechanism leverages temporal action correlations to minimize the generative transport distance. Evaluations across 56 diverse simulated manipulation tasks demonstrate that a one-step OFP achieves state-of-the-art results, outperforming 100-step diffusion and flow policies while accelerating action generation by over $100\times$. We further integrate OFP into the $π_{0.5}$ model on RoboTwin 2.0, where one-step OFP surpasses the original 10-step policy. These results establish OFP as a practical, scalable solution for highly accurate and low-latency robot control.

Comment: Model efficiency via one-step self-distillation for diffusion/flow visuomotor policies, reducing iterative sampling cost by 100x.

Relevance: 8 Novelty: 8


36. Revisiting Model Stitching In the Foundation Model Era

ArXiv ID: 2603.12433

Authors: Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.

Comment: Representation learning via model stitching: a systematic study of cross-model feature compatibility in heterogeneous vision foundation models.

Relevance: 8 Novelty: 7


37. Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness

ArXiv ID: 2603.12512

Authors: Arman Bolatov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

Abstract: We consider distributed optimization under Byzantine attacks in the presence of $(L_0,L_1)$-smoothness, a generalization of standard $L$-smoothness that captures functions with state-dependent gradient Lipschitz constants. We propose Byz-NSGDM, a normalized stochastic gradient descent method with momentum that achieves robustness against Byzantine workers while maintaining convergence guarantees. Our algorithm combines momentum normalization with Byzantine-robust aggregation enhanced by Nearest Neighbor Mixing (NNM) to handle both the challenges posed by $(L_0,L_1)$-smoothness and Byzantine adversaries. We prove that Byz-NSGDM achieves a convergence rate of $O(K^{-1/4})$ up to a Byzantine bias floor proportional to the robustness coefficient and gradient heterogeneity. Experimental validation on heterogeneous MNIST classification, synthetic $(L_0,L_1)$-smooth optimization, and character-level language modeling with a small GPT model demonstrates the effectiveness of our approach against various Byzantine attack strategies. An ablation study further shows that Byz-NSGDM is robust across a wide range of momentum and learning rate choices.

Comment: Distributed optimization theory: Byzantine-robust training under generalized (L0,L1)-smoothness with convergence guarantees.

Relevance: 8 Novelty: 7


38. A Stable Neural Statistical Dependence Estimator for Autoencoder Feature Analysis

ArXiv ID: 2603.11428

Authors: Bo Hu, Jose C Principe

Abstract: Statistical dependence measures like mutual information are ideal for analyzing autoencoders, but they can be ill-posed for deterministic, static, noise-free networks. We adopt the variational (Gaussian) formulation that makes dependence among inputs, latents, and reconstructions measurable, and we propose a stable neural dependence estimator based on an orthonormal density-ratio decomposition. Unlike MINE, our method avoids input concatenation and product-of-marginals re-pairing, reducing computational cost and improving stability. We introduce an efficient NMF-like scalar objective and demonstrate empirically that assuming Gaussian noise to form an auxiliary variable enables meaningful dependence measurements and supports quantitative feature analysis, with a sequential convergence of singular values.

Comment: Representation analysis: stable neural statistical dependence estimator for quantifying input-latent-reconstruction dependence in autoencoders.

Relevance: 8 Novelty: 7


39. Probing Length Generalization in Mamba via Image Reconstruction

ArXiv ID: 2603.12499

Authors: Jan Rathjens, Robin Schiewer, Laurenz Wiskott, Anand Subramoney

Abstract: Mamba has attracted widespread interest as a general-purpose sequence model due to its low computational complexity and competitive performance relative to transformers. However, its performance can degrade when inference sequence lengths exceed those seen during training. We study this phenomenon using a controlled vision task in which Mamba reconstructs images from sequences of image patches. By analyzing reconstructions at different stages of sequence processing, we reveal that Mamba qualitatively adapts its behavior to the distribution of sequence lengths encountered during training, resulting in strategies that fail to generalize beyond this range. To support our analysis, we introduce a length-adaptive variant of Mamba that improves performance across training sequence lengths. Our results provide an intuitive perspective on length generalization in Mamba and suggest directions for improving the architecture.

Comment: Core architecture analysis: probes Mamba length generalization failure modes and introduces a length-adaptive variant.

Relevance: 8 Novelty: 7


40. TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition

ArXiv ID: 2603.12465

Authors: Prabhu Vellaisamy, Shreesh Tripathi, Vignesh Natarajan, Surya Santhan Thenarasu, Shawn Blanton, John P. Shen

Abstract: Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.

Comment: Systems analysis methodology: decomposes LLM inference host-side overhead into actionable components and characterizes host-device boundedness.

Relevance: 8 Novelty: 7


41. Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group

ArXiv ID: 2603.12459

Authors: Alan Garbarz

Abstract: We present an alternative way of solving the steerable kernel constraint that appears in the design of steerable equivariant convolutional neural networks. We find explicit real and complex bases which are ready to use, for different symmetry groups and for feature maps of arbitrary tensor type. A major advantage of this method is that it bypasses the need to numerically or analytically compute Clebsch-Gordan coefficients and works directly with the representations of the input and output feature maps. The strategy is to find a basis of kernels that respect a simpler invariance condition at some point $x_0$, and then steer it with the defining equation of steerability to move to some arbitrary point $x=g\cdot x_0$. This idea has been mentioned in the literature before, but has not been developed in depth or in general. Here we describe how it works with minimal technical tools to make it accessible for a general audience.

Comment: Explicit kernel-basis construction for equivariant CNNs that avoids Clebsch-Gordan coefficients and generalizes across symmetry groups.

Relevance: 8 Novelty: 7


42. Context-dependent manifold learning: A neuromodulated constrained autoencoder approach

ArXiv ID: 2603.11673

Authors: Jérôme Adriaens, Guillaume Drion, Pierre Sacré

Abstract: Constrained autoencoders (cAE) provide a successful path towards interpretable dimensionality reduction by enforcing geometric structure on latent spaces. However, standard cAEs cannot adapt to varying physical parameters or environmental conditions without conflating these contextual shifts with the primary input. To address this, we integrated a neuromodulatory mechanism into the cAE framework to allow for context-dependent manifold learning. This paper introduces the Neuromodulated Constrained Autoencoder (NcAE), which adaptively parameterizes geometric constraints via gain and bias tuning conditioned on static contextual information. Experimental results on dynamical systems show that the NcAE accurately captures how manifold geometry varies across different regimes while maintaining rigorous projection properties. These results demonstrate that neuromodulation effectively decouples global contextual parameters from local manifold representations. This architecture provides a foundation for developing more flexible, physics-informed representations in systems subject to (non-stationary) environmental constraints.

Comment: Autoencoder architecture for context-dependent manifold learning using neuromodulated geometric constraints.

Relevance: 8 Novelty: 7


43. Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency

ArXiv ID: 2603.12298

Authors: Xinyan Jiang, Wenjing Yu, Di Wang, Lijie Hu

Abstract: Activation engineering enables precise control over Large Language Models (LLMs) without the computational cost of fine-tuning. However, existing methods deriving vectors from static activation differences are susceptible to high-dimensional noise and layer-wise semantic drift, often capturing spurious correlations rather than the target intent. To address this, we propose Global Evolutionary Refined Steering (GER-steer), a training-free framework that is grounded in the geometric stability of the network's representation evolution. GER-steer exploits this global signal to rectify raw steering vectors, effectively decoupling robust semantic intent from orthogonal artifacts. Extensive evaluations confirm that GER-steer consistently outperforms baselines, delivering superior efficacy and generalization without layer-specific tuning, establishing a universal solution for reliable model alignment.

Comment: Activation engineering method that improves steering vectors via cross-layer representation evolution, directly targeting core representation/control methodology in LLMs.

Relevance: 8 Novelty: 7


44. NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation

ArXiv ID: 2603.12378

Authors: Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen, Zhenyu Wang, Tianhao Liu, Siqi Chen, Weilin Huang

Abstract: Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly Low-Rank Adaptation (LoRA), have become essential for adapting Large Language Models (LLMs) to downstream tasks. While the recent FlyLoRA framework successfully leverages bio-inspired sparse random projections to mitigate parameter interference, it relies on a static, magnitude-based routing mechanism that is agnostic to input context. In this paper, we propose NeuroLoRA, a novel Mixture-of-Experts (MoE) based LoRA framework inspired by biological neuromodulation -- the dynamic regulation of neuronal excitability based on context. NeuroLoRA retains the computational efficiency of frozen random projections while introducing a lightweight, learnable neuromodulation gate that contextually rescales the projection space prior to expert selection. We further propose a Contrastive Orthogonality Loss to explicitly enforce separation between expert subspaces, enhancing both task decoupling and continual learning capacity. Extensive experiments on MMLU, GSM8K, and ScienceQA demonstrate that NeuroLoRA consistently outperforms FlyLoRA and other strong baselines across single-task adaptation, multi-task model merging, and sequential continual learning scenarios, while maintaining comparable parameter efficiency.

Comment: MoE-style PEFT architecture with context-aware neuromodulation gating and orthogonality regularization for better expert separation.

Relevance: 8 Novelty: 7


45. BiGain: Unified Token Compression for Joint Generation and Classification

ArXiv ID: 2603.12240

Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen

Abstract: Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.

Comment: Training-free token compression for diffusion backbones using frequency-aware merging/downsampling, directly addressing efficient model computation.

Relevance: 8 Novelty: 7


46. SpectralGuard: Detecting Memory Collapse Attacks in State Space Models

ArXiv ID: 2603.12414

Authors: Davi Bonetto

Abstract: State Space Models (SSMs) such as Mamba achieve linear-time sequence processing through input-dependent recurrence, but this mechanism introduces a critical safety vulnerability. We show that the spectral radius rho(A-bar) of the discretized transition operator governs effective memory horizon: when an adversary drives rho toward zero through gradient-based Hidden State Poisoning, memory collapses from millions of tokens to mere dozens, silently destroying reasoning capacity without triggering output-level alarms. We prove an Evasion Existence Theorem showing that for any output-only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection, then introduce SpectralGuard, a real-time monitor that tracks spectral stability across all model layers. SpectralGuard achieves F1=0.961 against non-adaptive attackers and retains F1=0.842 under the strongest adaptive setting, with sub-15ms per-token latency. Causal interventions and cross-architecture transfer to hybrid SSM-Attention systems confirm that spectral monitoring provides a principled, deployable safety layer for recurrent foundation models.

Comment: Systems/theory for state-space models: spectral-radius analysis of memory collapse with a real-time architectural monitor.

Relevance: 8 Novelty: 7
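
Note on the spectral-radius claim: the link between rho and the memory horizon can be illustrated with a toy scalar recurrence. The sketch below is not the paper's definition. It assumes a single channel h_t = rho * h_{t-1} + x_t with 0 < rho < 1, so an input from k steps back is weighted by rho**k, and it defines the horizon as the first k at which that weight drops below a tolerance eps. The function name and the eps default are illustrative choices.

```python
import math


def memory_horizon(rho, eps=1e-3):
    """Smallest integer k with rho**k <= eps for a scalar decay factor rho.

    Toy model only: assumes h_t = rho * h_{t-1} + x_t with 0 < rho < 1,
    so the contribution of an input k steps back scales as rho**k.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    # rho**k <= eps  <=>  k >= log(eps) / log(rho)  (both logs are negative)
    return math.ceil(math.log(eps) / math.log(rho))


if __name__ == "__main__":
    for rho in (0.999999, 0.99, 0.9, 0.5):
        print(f"rho={rho}: horizon ~ {memory_horizon(rho)} steps")
```

Under this toy model, a decay factor very close to 1 gives a horizon in the millions of steps, while a factor around 0.5 to 0.9 gives a horizon of only tens of steps, which is the qualitative collapse the abstract describes.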


47. A Geometrically-Grounded Drive for MDL-Based Optimization in Deep Learning

ArXiv ID: 2603.12304

Authors: Ming Lei, Shufan Wu, Christophe Baehr

Abstract: This paper introduces a novel optimization framework that fundamentally integrates the Minimum Description Length (MDL) principle into the training dynamics of deep neural networks. Moving beyond its conventional role as a model selection criterion, we reformulate MDL as an active, adaptive driving force within the optimization process itself. The core of our method is a geometrically-grounded cognitive manifold whose evolution is governed by a coupled Ricci flow, enriched with a novel MDL Drive term derived from first principles. This drive, modulated by the task-loss gradient, creates a seamless harmony between data fidelity and model simplification, actively compressing the internal representation during training. We establish a comprehensive theoretical foundation, proving key properties including the monotonic decrease of description length, a finite number of topological phase transitions via a geometric surgery protocol, and the emergence of universal critical behavior. Furthermore, we provide a practical, computationally efficient algorithm with $O(N \log N)$ per-iteration complexity, alongside guarantees for numerical stability and exponential convergence under convexity assumptions. Empirical validation on synthetic regression and classification tasks confirms the theoretical predictions, demonstrating the algorithm's efficacy in achieving robust generalization and autonomous model simplification. This work provides a principled path toward more autonomous, generalizable, and interpretable AI systems by unifying geometric deep learning with information-theoretic principles.

Comment: Representation learning/compression: integrates MDL directly into training dynamics with a theoretical geometric optimization framework.

Relevance: 8 Novelty: 7


48. Efficient Reasoning with Balanced Thinking

ArXiv ID: 2603.12372

Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian

Abstract: Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .

Comment: Efficiency for transformers/LRMs: training-free hidden-state steering to adapt reasoning compute between overthinking and underthinking.

Relevance: 8 Novelty: 7


49. Event-Driven Video Generation

ArXiv ID: 2603.13402

Authors: Chika Maduabuchi

Abstract: State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing failure modes in state persistence, spatial accuracy, support relations, and contact stability without sacrificing appearance. These results indicate that explicit event grounding is a practical abstraction for reducing interaction hallucinations in video generation.

Comment: Core architecture innovation for video transformers: event-gated sampling adds explicit interaction structure to DiT generation.

Relevance: 8 Novelty: 7


50. OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure

ArXiv ID: 2603.11493

Authors: Chuancheng Shi, Wenhua Wu, Fei Shen, Xiaogang Zhu, Kun Hu, Zhiyong Wang

Abstract: Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when suppressing selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces where their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAE) to achieve high-resolution feature disentanglement and subsequently redefines erasure as an analytical orthogonalization projection that preserves the benign manifold's invariance. OrthoEraser first employs SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.

Comment: Uses sparse autoencoders to disentangle superposed features and applies orthogonal projection for concept erasure, directly targeting representation structure.

Relevance: 8 Novelty: 7


51. Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder

ArXiv ID: 2603.11793

Authors: Alaa Yasser, Kittipat Phunjanna, Marcos Escudero Viñolo, Catarina Barata, Jenny Benois-Pineau

Abstract: Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramér's V: 0.381 -> 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer contributes to the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes. Keywords: Bias, CLIP, Mechanistic Interpretability, Vision Transformer, Fairness.

Comment: Mechanistic interpretability of transformers by localizing demographic bias to individual attention heads in CLIP's vision encoder.

Relevance: 8 Novelty: 7
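
The abstract above reports global bias as Cramér's V. As a reminder of what that statistic measures, here is a minimal sketch of the standard, uncorrected definition computed from a contingency table of counts. Which variables the paper cross-tabulates and whether it applies a bias correction are not stated in the abstract, so the function name `cramers_v` and the toy tables are assumptions for illustration only.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table of counts.

    Uses the plain chi-square statistic without bias correction and
    assumes every row and column total is nonzero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical toy tables (rows: predicted class, columns: demographic group).
independent = [[10, 20], [20, 40]]  # rows are proportional: no association
associated = [[30, 0], [0, 30]]     # prediction fully determined by group
```

V ranges from 0 (no association) to 1 (perfect association), so a drop such as the one quoted in the abstract indicates a weaker statistical link between the two variables.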


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0}

Rules:

  • ARXIVID: the arXiv ID.
  • COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
  • RELEVANCE: integer from 1 to 10.
  • NOVELTY: integer from 1 to 10.
  • Do not output markdown, code fences, or any extra text.
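
A response in this format can be checked mechanically. The sketch below is one possible validator, not part of the pipeline that produced this page: `parse_response` and `REQUIRED_KEYS` are hypothetical names, and skipping blank lines is an assumed leniency. The sample line reuses the ID, comment, and scores of entry 51 above.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_response(text):
    """Parse a JSONL response and enforce the rules stated in the prompt.

    Raises ValueError on malformed JSON (including markdown fences or
    extra text), on missing or unexpected keys, and on scores that are
    not integers from 1 to 10.
    """
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumed leniency: tolerate blank lines
        record = json.loads(line)  # JSONDecodeError subclasses ValueError
        if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
            raise ValueError(f"line {line_no}: expected keys {sorted(REQUIRED_KEYS)}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(score, bool) or not isinstance(score, int):
                raise ValueError(f"line {line_no}: {key} must be an integer")
            if not 1 <= score <= 10:
                raise ValueError(f"line {line_no}: {key} must be in 1..10")
        records.append(record)
    return records


sample = (
    '{"ARXIVID":"2603.11793","COMMENT":"Mechanistic interpretability of '
    "transformers by localizing demographic bias to individual attention "
    "heads in CLIP's vision encoder.\",\"RELEVANCE\":8,\"NOVELTY\":7}\n"
)
records = parse_response(sample)
```

The placeholder object shown in the prompt uses 0 for both scores, so it would be rejected by this check; it illustrates the shape of a line, not a valid answer.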

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

  • 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
  • 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
  • 5-6: touches the target topics, but the main contribution is elsewhere.
  • 3-4: largely outside the target topics, often application-focused or domain-specific.
  • 1-2: unrelated.

Rare exception: If a paper looks off-topic at first glance but plausibly introduces a new foundational direction with major future impact, you may still assign Relevance 9-10.

Novelty Scoring

  • 9-10: new paradigm, theory, or major methodological breakthrough.
  • 7-8: substantial methodological advance or strong new insight.
  • 5-6: meaningful but incremental extension or refinement.
  • 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
  • 1-2: little originality; mainly standard application of existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Focus on foundational research. Keep papers whose main contribution is methodological, theoretical, or systems-level. Filter out papers that are mainly application-driven.

  1. Model Architecture
     • Keep: Mixture-of-Experts (MoE), Transformers, conditional or dynamic networks, autoencoders, or analysis and innovation on core architectures.
     • Filter: papers that mainly apply existing architectures to a task without architectural insight.

  2. Model Compression and Efficiency
     • Keep: sparsity, pruning, quantization, low-rank methods, cache, or other algorithmic and theoretical efficiency advances.
     • Filter: straightforward application of known compression methods to a new task.

  3. High Performance Computing
     • Keep: algorithmic or systems innovations for training large models, distributed training, or memory optimization.
     • Filter: incremental engineering improvements without clear methodological contribution.

  4. Representation Learning
     • Keep: work on how networks encode information, feature or dictionary learning, sparse or contrastive methods, or training dynamics.
     • Filter: standard applications of known techniques without new theoretical or methodological insight.

Usually irrelevant unless the core contribution is clearly foundational:

  • Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning
  • Domain applications such as medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, etc.