Personalized Daily ArXiv Papers 2025-09-29

[gpt-5]	Prompt	Completion	Total
Token	89447	81726	171173
Cost	$0.11	$0.82	$0.93

Total arXiv papers: 939

Total scanned papers: 569

Total relevant papers: 60

Table of contents with paper titles:

Active Attacks: Red-teaming LLMs via Adaptive Environments Authors: Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
Beyond Johnson-Lindenstrauss: Uniform Bounds for Sketched Bilinear Forms Authors: Rohan Deb, Qiaobo Li, Mayank Shrivastava, Arindam Banerjee
$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization Authors: Yuandong Tian
Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova
HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space Authors: Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang
Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models Authors: Aleksandar Terzić, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi
OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule Authors: Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen
Statistical Advantage of Softmax Attention: Insights from Single-Location Regression Authors: O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts Authors: Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Partial Parameter Updates for Efficient Distributed Training Authors: Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert
A Law of Data Reconstruction for Random Features (and Beyond) Authors: Leonardo Iurada, Simone Bombari, Tatiana Tommasi, Marco Mondelli
Enhancing Low-Rank Adaptation with Structured Nonlinear Transformations Authors: Guanzhi Deng, Mingyang Liu, Dapeng Wu, Yinqiao Li, Linqi Song
Wavelet-Induced Rotary Encodings: RoPE Meets Graphs Authors: Isaac Reid, Arijit Sehanobish, Cedrik Höfs, Bruno Mlodozeniec, Leonhard Vulpius, Federico Barbero, Adrian Weller, Krzysztof Choromanski, Richard E. Turner, Petar Veličković
Concept-SAE: Active Causal Probing of Visual Model Behavior Authors: Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu
Mechanistic Independence: A Principle for Identifiable Disentangled Representations Authors: Stefan Matthes, Zhiwei Han, Hao Shen
Linear Causal Representation Learning by Topological Ordering, Pruning, and Disentanglement Authors: Hao Chen, Lin Liu, Yu Guang Wang
REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model Authors: Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qinghua Zhao, Linqi Song, Lijie Wen
Neural Feature Geometry Evolves as Discrete Ricci Flow Authors: Moritz Hehl, Max von Renesse, Melanie Weber
Scale-Wise VAR is Secretly Discrete Diffusion Authors: Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel
Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding Authors: Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou
Bilinear relational structure fixes reversal curse and enables consistent model editing Authors: Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, Meeyoung Cha
StateX: Enhancing RNN Recall via Post-training State Expansion Authors: Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu
SlimDiff: Training-Free, Activation-Guided Hands-free Slimming of Diffusion Models Authors: Arani Roy, Shristi Das Biswas, Kaushik Roy
LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning Authors: Marco Paul E. Apolinario, Kaushik Roy
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters Authors: Krishu K Thapa, Reet Barik, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath
Blockwise Hadamard high-Rank Adaptation for Parameter-Efficient LLM Fine-Tuning Authors: Feng Yu, Jia Hu, Geyong Min
InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models Authors: Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Yuan Xie, Hongxia Yang
Lightweight error mitigation strategies for post-training N:M activation sparsity in LLMs Authors: Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov
CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones Authors: Wenyi Gong, Mieszko Lis
Stochastic activations Authors: Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou
IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method Authors: Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time Authors: Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang
General Pruning Criteria for Fast SBL Authors: Jakob Möderl, Erik Leitinger, Bernard Henri Fleury
Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs Authors: Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Masahiro Ikeda
Global Convergence in Neural ODEs: Impact of Activation Functions Authors: Tianxiang Gao, Siyuan Sun, Hailiang Liu, Hongyang Gao
Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning Authors: Naicheng He, Kaicheng Guo, Arjun Prakash, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris
Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models Authors: Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting Authors: Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe
Toward a Physics of Deep Learning and Brains Authors: Arsham Ghavasieh, Meritxell Vila-Minana, Akanksha Khurd, John Beggs, Gerardo Ortiz, Santo Fortunato
TRACE: Learning to Compute on Graphs Authors: Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu
Differentiable Structure Learning for General Binary Data Authors: Chang Deng, Bryon Aragam
Overclocking Electrostatic Generative Models Authors: Daniil Shlenskii, Alexander Korotin
A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems Authors: Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Christopher Ré, Scott W. Linderman
Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics Authors: Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang
R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning Authors: Hongyu Shan, Mingyang Song, Chang Dai, Di Liang, Han Chen
From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement Authors: Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Zike Yuan, Yang Xiang, Buzhou Tang
ChaosNexus: A Foundation Model for Universal Chaotic System Forecasting with Multi-scale Representations Authors: Chang Liu, Bohao Zhao, Jingtao Ding, Yong Li
Effective continuous equations for adaptive SGD: a stochastic analysis view Authors: Luca Callisti, Marco Romito, Francesco Triggiano
Prophecy: Inferring Formal Properties from Neuron Activations Authors: Divya Gopinath, Corina S. Pasareanu, Muhammad Usman
A Data-driven Typology of Vision Models from Integrated Representational Metrics Authors: Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla
SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders Authors: Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto
Kernel Regression of Multi-Way Data via Tensor Trains with Hadamard Overparametrization: The Dynamic Graph Flow Case Authors: Duc Thien Nguyen, Konstantinos Slavakis, Eleftherios Kofidis, Dimitris Pados
Sharpness-Aware Minimization Can Hallucinate Minimizers Authors: Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang
A circuit for predicting hierarchical structure in-context in Large Language Models Authors: Tankred Saanum, Can Demircan, Samuel J. Gershman, Eric Schulz
IndiSeek learns information-guided disentangled representations Authors: Yu Gui, Cong Ma, Zongming Ma
Null-Space Filtering for Data-Free Continual Model Merging: Preserving Transparency, Promoting Fidelity Authors: Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Transformers Can Learn Connectivity in Some Graphs but Not Others Authors: Amit Roy, Abulhair Saparov
Understanding and Enhancing Mask-Based Pretraining towards Universal Representations Authors: Mingze Dong, Leda Wang, Yuval Kluger

1. Active Attacks: Red-teaming LLMs via Adaptive Environments

ArXiv ID: 2509.21947

Authors: Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim

Abstract: We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods -- including GFlowNets, PPO, and REINFORCE -- by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than $400\times$) with only a 6% increase in computation. Our code is publicly available at https://github.com/dbsxodud-11/active_attacks.

Comment: Author match

2. Beyond Johnson-Lindenstrauss: Uniform Bounds for Sketched Bilinear Forms

ArXiv ID: 2509.21847

Authors: Rohan Deb, Qiaobo Li, Mayank Shrivastava, Arindam Banerjee

Abstract: Uniform bounds on sketched inner products of vectors or matrices underpin several important computational and statistical results in machine learning and randomized algorithms, including the Johnson-Lindenstrauss (J-L) lemma, the Restricted Isometry Property (RIP), randomized sketching, and approximate linear algebra. However, many modern analyses involve sketched bilinear forms, for which existing uniform bounds either do not apply or are not sharp on general sets. In this work, we develop a general framework to analyze such sketched bilinear forms and derive uniform bounds in terms of geometric complexities of the associated sets. Our approach relies on generic chaining and introduces new techniques for handling suprema over pairs of sets. We further extend these results to the setting where the bilinear form involves a sum of $T$ independent sketching matrices and show that the deviation scales as $\sqrt{T}$. This unified analysis recovers known results such as the J-L lemma as special cases, while extending RIP-type guarantees. Additionally, we obtain improved convergence bounds for sketched Federated Learning algorithms where such cross terms arise naturally due to sketched gradient compression, and design sketched variants of bandit algorithms with sharper regret bounds that depend on the geometric complexity of the action and parameter sets, rather than the ambient dimension.

Comment: Compression/Efficiency/HPC Theory: Uniform bounds for sketched bilinear forms via generic chaining; extends JL/RIP and improves guarantees for gradient compression and bandits.

Relevance: 10 Novelty: 9
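
Illustration (not from the paper): the simplest instance of a sketched bilinear form is the inner product of two sketched vectors. The snippet below is a minimal sketch assuming a dense Gaussian sketching matrix with N(0, 1/m) entries; the names `gaussian_sketch`, `d`, and `m` are hypothetical, and the paper's uniform bounds over sets are not reproduced here, only the single-pair concentration they generalize.

```python
import numpy as np

def gaussian_sketch(m, d, rng):
    """Return an m x d matrix with i.i.d. N(0, 1/m) entries (assumed sketch type)."""
    return rng.standard_normal((m, d)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 2000, 1000  # ambient dimension and sketch dimension (illustrative values)

# Two correlated unit vectors in R^d.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
z = rng.standard_normal(d)
z /= np.linalg.norm(z)
y = x + z
y /= np.linalg.norm(y)

S = gaussian_sketch(m, d, rng)

exact = x @ y                # <x, y>
approx = (S @ x) @ (S @ y)   # <Sx, Sy>, an unbiased estimate of <x, y>
print(f"exact={exact:.3f}  sketched={approx:.3f}")
```

For unit vectors the standard deviation of the sketched inner product is at most sqrt(2/m), so increasing `m` tightens the estimate.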

3. $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization

ArXiv ID: 2509.21519

Authors: Yuandong Tian

Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework to characterize what kind of features emerge, and how and under which conditions they emerge during training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning and (III) interactive feature learning, characterized by the structure of the backpropagated gradient $G_F$ across layers. In (I), $G_F$ is random, and the top layer overfits to a random hidden representation. In (II), the gradient of each node (column of $G_F$) only depends on its own activation, and thus each hidden node learns its representation independently from $G_F$, which now carries information about target labels, thanks to weight decay. Interestingly, the independent dynamics follow exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of memorization and generalization, and reveals the underlying cause why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.

Comment: Representation Learning: provides a mathematical framework for grokking and feature emergence with provable dynamics and scaling laws in neural networks.

Relevance: 10 Novelty: 9

4. Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

ArXiv ID: 2509.22445

Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova

Abstract: The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.

Comment: Model Compression and Generalization Theory: proposes asymptotically optimal MDL objectives for Transformers grounded in Kolmogorov complexity; constructs a tractable variational objective.

Relevance: 10 Novelty: 9

5. HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

ArXiv ID: 2509.22299

Authors: Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang

Abstract: Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where $d$ is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of 20% to 25% in most models, while also reducing FLOPs by nearly 20%. The code can be found at https://github.com/LLIKKE/HEAPr.

Comment: Model Architecture (MoE) + Compression/Efficiency: second-order, Hessian-based atomic expert pruning with reduced complexity enables fine-grained MoE compression.

Relevance: 10 Novelty: 9

6. Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models

ArXiv ID: 2509.22284

Authors: Aleksandar Terzić, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi

Abstract: Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model's expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model

Comment: Model Architecture with structured sparsity: PD-SSM factorizes transition matrices (one-hot P times diagonal D) enabling FSA state tracking at diagonal-SSM cost with strong expressivity guarantees.

Relevance: 10 Novelty: 9
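
Illustration (not from the paper): the abstract states that the transition matrix is the product of a column one-hot matrix $P$ and a complex diagonal matrix $D$. The snippet below is a minimal sketch of why applying such a product to a state vector costs time linear in the state size. The function `pd_matvec` and the index array `col_to_row` are hypothetical names, and how PD-SSM generates $P$ and $D$ from the input is not shown.

```python
import numpy as np

def pd_matvec(col_to_row, diag, x):
    """Compute (P @ D) @ x where P has exactly one 1 per column.

    col_to_row[j] is the row index of the 1 in column j of P;
    diag holds the complex diagonal of D.
    """
    y = diag * x                    # D @ x: N multiplications
    out = np.zeros_like(y)
    np.add.at(out, col_to_row, y)   # P @ y: scatter-add, N additions
    return out

N = 4
col_to_row = np.array([1, 2, 2, 0])                  # columns 1 and 2 both map to row 2
diag = np.exp(1j * np.array([0.0, 0.5, 1.0, 1.5]))   # unit-modulus complex entries
x = np.arange(1, N + 1).astype(complex)

# Dense reference: build P explicitly and multiply.
P = np.zeros((N, N))
P[col_to_row, np.arange(N)] = 1.0
dense = (P @ np.diag(diag)) @ x

fast = pd_matvec(col_to_row, diag, x)
print(np.allclose(dense, fast))
```

The dense product needs O(N^2) work per step, while the structured version touches each state entry a constant number of times.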

7. OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule

ArXiv ID: 2509.21623

Authors: Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen

Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model's weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.

Comment: Model Compression and Efficiency: context-aware online low-rank KV cache compression with Oja’s rule and a hybrid storage policy; practical long-context memory optimization compatible with FlashAttention.

Relevance: 10 Novelty: 8
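
Illustration (not from the paper): Oja's rule for online PCA, which the abstract names as the mechanism for adapting the projection basis, can be sketched generically as a Hebbian update followed by re-orthonormalization. This is a minimal sketch under stated assumptions: the QR normalization step, the learning rate, and the synthetic data are choices made for this example, and OjaKV's hybrid storage policy and update schedule are not reproduced.

```python
import numpy as np

def oja_update(W, x, lr):
    """One Oja subspace step: Hebbian update, then re-orthonormalize the basis."""
    W = W + lr * np.outer(x, x @ W)   # pull the basis toward high-variance directions
    Q, _ = np.linalg.qr(W)            # keep the columns orthonormal (assumed normalization)
    return Q

rng = np.random.default_rng(0)
d, r = 16, 2   # vector dimension and target rank (illustrative values)

# Synthetic stand-in for key/value vectors: most variance lies in the first two coordinates.
scales = np.array([5.0, 4.0] + [0.1] * (d - 2))
X = rng.standard_normal((2000, d)) * scales

# Random orthonormal starting basis, updated one vector at a time.
W = np.linalg.qr(rng.standard_normal((d, r)))[0]
for x in X:
    W = oja_update(W, x, lr=1e-2)

# Fraction of the learned basis that lies in the true dominant subspace.
captured = np.linalg.norm(W[:2, :]) ** 2 / r
print(f"captured fraction: {captured:.3f}")
```

A vector `v` would then be stored as the `r`-dimensional code `W.T @ v` and approximately reconstructed as `W @ (W.T @ v)`.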

8. Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

ArXiv ID: 2509.21936

Authors: O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová

Abstract: Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.

Comment: Representation Learning/Architecture Analysis: high-dimensional theory shows softmax attention attains Bayes risk and outperforms linear attention.

Relevance: 10 Novelty: 8

9. Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

ArXiv ID: 2509.21892

Authors: Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

Abstract: Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. Intuitively, activating more experts at inference, $k'$ (where $k' > k$), means engaging a larger set of model parameters for the computation and thus is expected to improve performance. However, contrary to this intuition, we find the scaling range to be so narrow that performance begins to degrade rapidly after only a slight increase in the number of experts. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across computational budgets at inference. We conduct extensive experiments on various MoE settings. Our results show that EMoE significantly expands the effective performance-scaling range, extending it to as much as 2-3$\times$ the training-time $k$, while also pushing the model's peak performance to a higher level.

Comment: Model Architecture (MoE): training framework that enables scaling the number of activated experts at inference by fostering expert collaboration and robust routing.

Relevance: 10 Novelty: 8
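
Illustration (not from the paper): the quantity being scaled at inference is the number of experts $k$ selected per token by the router. The snippet below is a minimal sketch of generic top-$k$ routing with linear experts, assuming gates are renormalized by a softmax over the selected experts. The names `moe_forward`, `router_w`, and `expert_ws` are hypothetical, and EMoE's training procedure is not shown.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, k):
    """Route x to its top-k experts and mix their outputs with renormalized gates."""
    logits = router_w @ x                             # one score per expert
    top = np.argsort(logits)[-k:]                     # indices of the k largest scores
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                              # softmax over the selected experts only
    return sum(g * (expert_ws[e] @ x) for g, e in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
router_w = rng.standard_normal((n_experts, d))
expert_ws = rng.standard_normal((n_experts, d, d))    # each expert is a linear map here
x = rng.standard_normal(d)

# The same weights evaluated with different numbers of active experts.
outputs = {k: moe_forward(x, router_w, expert_ws, k) for k in range(1, n_experts + 1)}
for k, out in outputs.items():
    print(k, np.round(np.linalg.norm(out), 3))
```

With `k=1` the output is that of the single highest-scoring expert; with `k=n_experts` it is the full softmax mixture.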

10. Partial Parameter Updates for Efficient Distributed Training

ArXiv ID: 2509.22418

Authors: Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert

Abstract: We introduce a memory- and compute-efficient method for low-communication distributed training. Existing methods reduce communication by performing multiple local updates between infrequent global synchronizations. We demonstrate that their efficiency can be significantly improved by restricting backpropagation: instead of updating all the parameters, each node updates only a fixed subset while keeping the remainder frozen during local steps. This constraint substantially reduces peak memory usage and training FLOPs, while a full forward pass over all parameters eliminates the need for cross-node activation exchange. Experiments on a $1.3$B-parameter language model trained across $32$ nodes show that our method matches the perplexity of prior low-communication approaches under identical token and bandwidth budgets while reducing training FLOPs and peak memory.

Comment: High Performance Computing and Efficiency: introduces partial parameter updates for low-communication distributed training, reducing memory/FLOPs and avoiding activation exchange while maintaining perplexity.

Relevance: 10 Novelty: 8

11. A Law of Data Reconstruction for Random Features (and Beyond)

ArXiv ID: 2509.22214

Authors: Leonardo Iurada, Simone Bombari, Tatiana Tommasi, Marco Mondelli

Abstract: Large-scale deep learning models are known to memorize parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. In this work, we consider memorization from the perspective of data reconstruction, demonstrating that this can be achieved when $p$ is larger than $dn$, where $d$ is the dimensionality of the data. More specifically, we show that, in the random features model, when $p \gg dn$, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.

Comment: Representation Learning/Theory: shows a law for full data reconstruction (p ≳ d·n) in random features and beyond, with an accompanying reconstruction method.

Relevance: 9 Novelty: 9

12. Enhancing Low-Rank Adaptation with Structured Nonlinear Transformations

ArXiv ID: 2509.21870

Authors: Guanzhi Deng, Mingyang Liu, Dapeng Wu, Yinqiao Li, Linqi Song

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models. However, its linear nature limits expressiveness. We propose LoRAN, a non-linear extension of LoRA that applies lightweight transformations to the low-rank updates. We further introduce Sinter, a sine-based activation that adds structured perturbations without increasing parameter count. Experiments across summarization and classification tasks show that LoRAN consistently improves over QLoRA. Ablation studies reveal that Sinter outperforms standard activations such as Sigmoid, ReLU, and Tanh, highlighting the importance of activation design in low-rank tuning.

Comment: Model Compression and Efficiency: non-linear low-rank adaptation (LoRAN) with sine-based activation for parameter-efficient fine-tuning.

Relevance: 10 Novelty: 7

13. Wavelet-Induced Rotary Encodings: RoPE Meets Graphs

ArXiv ID: 2509.22259

Authors: Isaac Reid, Arijit Sehanobish, Cedrik Höfs, Bruno Mlodozeniec, Leonhard Vulpius, Federico Barbero, Adrian Weller, Krzysztof Choromanski, Richard E. Turner, Petar Veličković

Abstract: We introduce WIRE: Wavelet-Induced Rotary Encodings. WIRE extends Rotary Position Encodings (RoPE), a popular algorithm in LLMs and ViTs, to graph-structured data. We demonstrate that WIRE is more general than RoPE, recovering the latter in the special case of grid graphs. WIRE also enjoys a host of desirable theoretical properties, including equivariance under node ordering permutation, compatibility with linear attention, and (under select assumptions) asymptotic dependence on graph resistive distance. We test WIRE on a range of synthetic and real-world tasks, including identifying monochromatic subgraphs, semantic segmentation of point clouds, and more standard graph benchmarks. We find it to be effective in settings where the underlying graph structure is important.

Comment: Model Architecture/Representation Learning: Generalizes RoPE to graphs (WIRE) with theoretical guarantees (permutation equivariance, linear-attention compatibility).

Relevance: 9 Novelty: 8

14. Concept-SAE: Active Causal Probing of Visual Model Behavior

ArXiv ID: 2509.22015

Authors: Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu

Abstract: Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.

Comment: Representation Learning: Concept-grounded sparse autoencoders with dual supervision enabling causal probing via interventions.

Relevance: 9 Novelty: 8

15. Mechanistic Independence: A Principle for Identifiable Disentangled Representations

ArXiv ID: 2509.22196

Authors: Stefan Matthes, Zhiwei Han, Hao Shen

Abstract: Disentangled representations seek to recover latent factors of variation underlying observed data, yet their identifiability is still not fully understood. We introduce a unified framework in which disentanglement is achieved through mechanistic independence, which characterizes latent factors by how they act on observed variables rather than by their latent distribution. This perspective is invariant to changes of the latent density, even when such changes induce statistical dependencies among factors. Within this framework, we propose several related independence criteria -- ranging from support-based and sparsity-based to higher-order conditions -- and show that each yields identifiability of latent subspaces, even under nonlinear, non-invertible mixing. We further establish a hierarchy among these criteria and provide a graph-theoretic characterization of latent subspaces as connected components. Together, these results clarify the conditions under which disentangled representations can be identified without relying on statistical assumptions.

Comment: Representation Learning: principled identifiability of disentangled latent subspaces under nonlinear, non-invertible mixing via mechanistic independence.

Relevance: 9 Novelty: 8

16. Linear Causal Representation Learning by Topological Ordering, Pruning, and Disentanglement

ArXiv ID: 2509.22553

Authors: Hao Chen, Lin Liu, Yu Guang Wang

Abstract: Causal representation learning (CRL) has garnered increasing interest from the causal inference and artificial intelligence communities, due to its capability of disentangling potentially complex data-generating mechanisms into causally interpretable latent features, by leveraging the heterogeneity of modern datasets. In this paper, we further contribute to the CRL literature, by focusing on the stylized linear structural causal model over the latent features and assuming a linear mixing function that maps latent features to the observed data or measurements. Existing linear CRL methods often rely on stringent assumptions, such as accessibility to single-node interventional data or restrictive distributional constraints on latent features and exogenous measurement noise. However, these prerequisites can be challenging to satisfy in certain scenarios. In this work, we propose a novel linear CRL algorithm that, unlike most existing linear CRL methods, operates under weaker assumptions about environment heterogeneity and data-generating distributions while still recovering latent causal features up to an equivalence class. We further validate our new algorithm via synthetic experiments and an interpretability analysis of large language models (LLMs), demonstrating both its superiority over competing methods in finite samples and its potential in integrating causality into AI.

Comment: Representation Learning: linear causal representation learning with topological ordering, pruning, and disentanglement to recover latent causal features under weaker assumptions.

Relevance: 9 Novelty: 8

17. REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model

ArXiv ID: 2509.22518

Authors: Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qinghua Zhao, Linqi Song, Lijie Wen

Abstract: Understanding how Large Language Models (LLMs) perform complex reasoning and their failure mechanisms is a challenge in interpretability research. To provide a measurable geometric analysis perspective, we define the concept of the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations corresponding to all correctly reasoned generations. This structure can be conceptualized as the embodiment of the effective thinking paths that the model has learned to successfully solve a given task. Based on this concept, we build REMA, a framework that explains the origins of failures by quantitatively comparing the spatial relationships of internal model representations corresponding to both erroneous and correct reasoning samples. Specifically, REMA first quantifies the geometric deviation of each erroneous representation by calculating its k-nearest neighbors distance to the approximated manifold formed by correct representations, thereby providing a unified failure signal. It then localizes the divergence points where these deviations first become significant by tracking this deviation metric across the model's layers and comparing it against a baseline of internal fluctuations from correct representations, thus identifying where the reasoning chain begins to go off-track. Our extensive experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations. The results also validate the effectiveness of the REMA framework in analyzing the origins of reasoning failures. This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for in-depth understanding and diagnosis of the internal computational processes of black-box models.

Comment: Representation Learning: proposes a low-dimensional reasoning manifold and geometric deviation metric to localize failure points in LLM reasoning.

Relevance: 9 Novelty: 8
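
Illustrative sketch (not the authors' code): the REMA abstract above describes scoring an erroneous representation by its k-nearest-neighbor distance to the correct representations, then finding the first layer where that score exceeds a baseline of fluctuations among correct samples. The minimal Python below shows one way this could look. The function names, the use of the mean distance to the k nearest neighbors, the leave-one-out baseline, and the `factor` threshold are all assumptions made for illustration.

```python
import math


def knn_deviation(point, correct_points, k=3):
    """Mean Euclidean distance from `point` to its k nearest correct representations."""
    distances = sorted(math.dist(point, c) for c in correct_points)
    k = min(k, len(distances))
    return sum(distances[:k]) / k


def first_divergent_layer(error_layers, correct_layers, k=3, factor=2.0):
    """Return the first layer whose deviation exceeds `factor` times the baseline, else None.

    error_layers[l]   : representation of one erroneous sample at layer l
    correct_layers[l] : list of correct-sample representations at layer l
    """
    for layer, (err, correct) in enumerate(zip(error_layers, correct_layers)):
        # Baseline (assumed): average leave-one-out kNN distance among correct samples.
        baseline = sum(
            knn_deviation(c, correct[:i] + correct[i + 1:], k)
            for i, c in enumerate(correct)
        ) / len(correct)
        if knn_deviation(err, correct, k) > factor * baseline:
            return layer
    return None
```

Here each representation is a plain tuple or list of floats, and each layer needs at least two correct samples for the leave-one-out baseline to be defined.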

18. Neural Feature Geometry Evolves as Discrete Ricci Flow

ArXiv ID: 2509.22362

Authors: Moritz Hehl, Max von Renesse, Melanie Weber

Abstract: Deep neural networks learn feature representations via complex geometric transformations of the input data manifold. Despite the models' empirical success across domains, our understanding of neural feature representations is still incomplete. In this work we investigate neural feature geometry through the lens of discrete geometry. Since the input data manifold is typically unobserved, we approximate it using geometric graphs that encode local similarity structure. We provide theoretical results on the evolution of these graphs during training, showing that nonlinear activations play a crucial role in shaping feature geometry in feedforward neural networks. Moreover, we discover that the geometric transformations resemble a discrete Ricci flow on these graphs, suggesting that neural feature geometry evolves analogously to Ricci flow. This connection is supported by experiments on over 20,000 feedforward neural networks trained on binary classification tasks across both synthetic and real-world datasets. We observe that the emergence of class separability corresponds to the emergence of community structure in the associated graph representations, which is known to relate to discrete Ricci flow dynamics. Building on these insights, we introduce a novel framework for locally evaluating geometric transformations through comparison with discrete Ricci flow dynamics. Our results suggest practical design principles, including a geometry-informed early-stopping heuristic and a criterion for selecting network depth.

Comment: Representation learning: theoretical link between neural feature geometry and discrete Ricci flow, explaining training dynamics and informing design (depth, early stopping).

Relevance: 9 Novelty: 8

19. Scale-Wise VAR is Secretly Discrete Diffusion

ArXiv ID: 2509.22636

Authors: Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel

Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency and unified architecture with language and vision. Among them, next-scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion. We term this reinterpretation Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR and reduce its architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion-based perspective of VAR leads to consistent gains in efficiency and generation.

Comment: Model architecture: proves VAR with Markovian attention is equivalent to discrete diffusion, enabling diffusion-style iterative refinement for AR transformers and improving efficiency.

Relevance: 9 Novelty: 8

20. Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

ArXiv ID: 2509.22134

Authors: Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou

Abstract: Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B), GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference.

Comment: Efficiency for LLM inference: Group Tree Optimization aligns training with speculative decoding’s tree policy, with a provable reward tied to acceptance length and speedup.

Relevance: 9 Novelty: 8

21. Bilinear relational structure fixes reversal curse and enables consistent model editing

ArXiv ID: 2509.21993

Authors: Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, Meeyoung Cha

Abstract: The reversal curse -- a language model's (LM) inability to infer an unseen fact "B is A" from a learned fact "A is B" -- is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. By training LMs from scratch on a synthetic dataset of relational knowledge graphs, we demonstrate that bilinear relational structure emerges in their hidden representations. This structure substantially alleviates the reversal curse, enabling LMs to infer unseen reverse facts. Crucially, we also find that this bilinear structure plays a key role in consistent model editing. When a fact is updated in a LM with this structure, the edit correctly propagates to its reverse and other logically dependent facts. In contrast, models lacking this representation not only suffer from the reversal curse but also fail to generalize edits, further introducing logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn enable LMs to behave in a logically consistent manner after editing. This implies that the success of model editing depends critically not just on editing algorithms but on the underlying representational geometry of the knowledge being modified.

Comment: Representation Learning: identifies bilinear relational structure in LM representations that fixes reversal curse and enables consistent model editing—linking internal geometry to logical generalization.

Relevance: 9 Novelty: 8

22. StateX: Enhancing RNN Recall via Post-training State Expansion

ArXiv ID: 2509.22630

Authors: Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun

Abstract: While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.

Comment: Model Architecture and Efficiency: post-training recurrent state expansion to boost recall for linear-attention/SSM RNNs with minimal parameter growth.

Relevance: 9 Novelty: 8

23. IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning

ArXiv ID: 2509.22621

Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu

Abstract: Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.

Comment: Matches Representation Learning/Training Dynamics: self-distills SFT to align internal activations with ICL, transferring in-context computation mechanisms to improve accuracy and calibration.

Relevance: 9 Novelty: 8

24. SlimDiff: Training-Free, Activation-Guided Hands-free Slimming of Diffusion Models

ArXiv ID: 2509.21498

Authors: Arani Roy, Shristi Das Biswas, Kaushik Roy

Abstract: Diffusion models (DMs), lauded for their generative performance, are computationally prohibitive due to their billion-scale parameters and iterative denoising dynamics. Existing efficiency techniques, such as quantization, timestep reduction, or pruning, offer savings in compute, memory, or runtime but are strictly bottlenecked by reliance on fine-tuning or retraining to recover performance. In this work, we introduce SlimDiff, an automated activation-informed structural compression framework that reduces both attention and feedforward dimensionalities in DMs, while being entirely gradient-free. SlimDiff reframes DM compression as a spectral approximation task, where activation covariances across denoising timesteps define low-rank subspaces that guide dynamic pruning under a fixed compression budget. This activation-aware formulation mitigates error accumulation across timesteps by applying module-wise decompositions over functional weight groups: query--key interactions, value--output couplings, and feedforward projections, rather than isolated matrix factorizations, while adaptively allocating sparsity across modules to respect the non-uniform geometry of diffusion trajectories. SlimDiff achieves up to 35% acceleration and ~100M parameter reduction over baselines, with generation quality on par with uncompressed models without any backpropagation. Crucially, our approach requires only about 500 calibration samples, over 70× fewer than prior methods. To our knowledge, this is the first closed-form, activation-guided structural compression of DMs that is entirely training-free, providing both theoretical clarity and practical efficiency.

Comment: Compression/Efficiency: training-free, activation-guided structural slimming (low-rank/sparsity) of diffusion models for speed and parameter reduction.

Relevance: 9 Novelty: 8

25. LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning

ArXiv ID: 2509.21617

Authors: Marco Paul E. Apolinario, Kaushik Roy

Abstract: On-device learning is essential for personalization, privacy, and long-term adaptation in resource-constrained environments. Achieving this requires efficient learning, both fine-tuning existing models and continually acquiring new tasks without catastrophic forgetting. Yet both settings are constrained by the high memory cost of storing activations during backpropagation. Existing activation compression methods reduce this cost but rely on repeated low-rank decompositions, introducing computational overhead. Also, such methods have not been explored for continual learning. We propose LANCE (Low-rank Activation Compression), a framework that performs one-shot higher-order Singular Value Decomposition (SVD) to obtain a reusable low-rank subspace for activation projection. This eliminates repeated decompositions, reducing both memory and computation. Moreover, fixed low-rank subspaces further enable on-device continual learning by allocating tasks to orthogonal subspaces without storing large task-specific matrices. Experiments show that LANCE reduces activation storage up to 250× while maintaining accuracy comparable to full backpropagation on CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200 datasets. On continual learning benchmarks (Split CIFAR-100, Split MiniImageNet, 5-Datasets), it achieves performance competitive with orthogonal gradient projection methods at a fraction of the memory cost. These results position LANCE as a practical and scalable solution for efficient fine-tuning and continual learning on edge devices.

Comment: Compression/Efficiency: Low-rank activation compression via one-shot HOSVD for memory-optimized backprop; reusable subspaces enable efficient on-device continual learning.

Relevance: 9 Novelty: 7

26. OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

ArXiv ID: 2509.22033

Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets

Abstract: Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed to mitigate these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance for other downstream tasks compared to traditional SAEs.

Comment: Representation Learning: Sparse autoencoders with orthogonality regularization to mitigate feature absorption/composition while scaling linearly.

Relevance: 9 Novelty: 7
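
Illustrative sketch (not the authors' code): the OrtSAE abstract above describes a training penalty on high pairwise cosine similarity between SAE features. A minimal Python version of such a penalty, under the assumption that it is the mean squared cosine similarity over all distinct feature pairs, might look like the following. This naive form is quadratic in the number of features; the linear-scaling procedure mentioned in the abstract is not reproduced here, and `orthogonality_penalty` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def orthogonality_penalty(features):
    """Mean squared cosine similarity over all distinct pairs of feature vectors.

    Requires at least two features; returns 0.0 when all features are mutually orthogonal.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine(features[i], features[j]) ** 2
    return total / (n * (n - 1) / 2)
```

In training, a term like this would be added to the usual SAE reconstruction and sparsity losses with some weighting coefficient, which is left unspecified here.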

27. PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters

ArXiv ID: 2509.21619

Authors: Krishu K Thapa, Reet Barik, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath

Abstract: Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. It is observed that most of the learning (higher change in weights) takes place in the earlier stage of the training loop. These changes stabilize as training continues, enabling them to be captured by matrices of a low intrinsic rank. Therefore, we propose an approach to identify such states of partial convergence and dynamically switch from full parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible approach that leverages user-defined hyperparameters to determine the switching point and assign a rank specific to each module layer based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of its original size, resulting in a 3x improvement in throughput and a 1.5x reduction in average training time per epoch, while also reducing GPU memory consumption by 20%.

Comment: Model Compression/Efficiency: hybrid pretraining that switches to Low-Rank Adapters with per-layer rank selection to cut trainable parameters and memory.

Relevance: 9 Novelty: 7

28. Blockwise Hadamard high-Rank Adaptation for Parameter-Efficient LLM Fine-Tuning

ArXiv ID: 2509.21637

Authors: Feng Yu, Jia Hu, Geyong Min

Abstract: Parameter-efficient fine-tuning (PEFT) methods must be resource-efficient yet handle heterogeneous reasoning transformations, and classical low-rank adaptation (LoRA) is constrained by the nominal rank $r$. Hadamard-style extensions like HiRA raise the nominal rank but couple every update to the global energy pattern of the frozen weight matrix, while ABBA trades this inductive bias for fully learned dense intermediates. To address the limitation of global modulation, we propose Block Hadamard high-Rank Adaptation (BHRA), which partitions each weight matrix and applies HiRA-style multiplicative modulation independently within every block, preserving the PEFT parameter footprint while unlocking localized rank amplification. Our empirical analyses reveal that this blockwise design maintains rich spectra across rank budgets, mitigating the collapse induced by global modulation. Across eight commonsense reasoning tasks and two arithmetic benchmarks with Llama-3.2 1B/3B, Mistral-7B, and Gemma-2 9B, BHRA consistently surpasses strong PEFT baselines under matched parameter budgets.

Comment: Model Compression/Efficiency (PEFT): blockwise Hadamard high-rank adaptation increases effective rank under fixed parameter budget, improving fine-tuning efficiency.

Relevance: 9 Novelty: 7
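
Illustrative sketch (not the authors' code): the BHRA abstract above describes partitioning each weight matrix and applying HiRA-style multiplicative modulation independently within every block. Assuming that HiRA-style modulation means multiplying the frozen weight element-wise by a low-rank product, a per-block update could be sketched in plain Python as below. The `factors` layout, the function names, and the uniform block size are assumptions for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def blockwise_hadamard_update(w0, factors, block_rows, block_cols):
    """Build delta_W block by block as W0_block * (B @ A), element-wise.

    factors[(i, j)] holds the (B, A) pair for the block in block-row i and
    block-column j, with B of shape (block_rows, r) and A of shape (r, block_cols).
    Blocks without an entry in `factors` keep a zero update.
    """
    delta = [[0.0] * len(w0[0]) for _ in w0]
    for (bi, bj), (b, a) in factors.items():
        low_rank = matmul(b, a)  # per-block low-rank product
        for r in range(block_rows):
            for c in range(block_cols):
                row, col = bi * block_rows + r, bj * block_cols + c
                # Modulate the frozen weight by the block's own low-rank pattern.
                delta[row][col] = w0[row][col] * low_rank[r][c]
    return delta
```

Because each block has its own factors, the element-wise product in one block does not depend on the frozen weights of any other block, which is the localized behavior the abstract contrasts with global modulation.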

29. InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

ArXiv ID: 2509.22536

Authors: Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Yuan Xie, Hongxia Yang

Abstract: The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continue pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.

Comment: Model Compression and Efficiency: comprehensive FP8 quantized training recipe; High Performance Computing: hybrid-granularity quantization improves throughput, memory, and training time for LLMs.

Relevance: 9 Novelty: 7

30. Lightweight error mitigation strategies for post-training N:M activation sparsity in LLMs

ArXiv ID: 2509.22166

Authors: Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov

Abstract: The demand for efficient large language model (LLM) inference has intensified the focus on sparsification techniques. While semi-structured (N:M) pruning is well-established for weights, its application to activation pruning remains underexplored despite its potential for dynamic, input-adaptive compression and reductions in I/O overhead. This work presents a comprehensive analysis of methods for post-training N:M activation pruning in LLMs. Across multiple LLMs, we demonstrate that pruning activations enables superior preservation of generative capabilities compared to weight pruning at equivalent sparsity levels. We evaluate lightweight, plug-and-play error mitigation techniques and pruning criteria, establishing strong hardware-friendly baselines that require minimal calibration. Furthermore, we explore sparsity patterns beyond NVIDIA's standard 2:4, showing that the 16:32 pattern achieves performance nearly on par with unstructured sparsity. However, considering the trade-off between flexibility and hardware implementation complexity, we focus on the 8:16 pattern as a superior candidate. Our findings provide both effective practical methods for activation pruning and a motivation for future hardware to support more flexible sparsity patterns. Our code is available at https://anonymous.4open.science/r/Structured-Sparse-Activations-Inference-EC3C/README.md .

Comment: Strong match to Model Compression and Efficiency: post-training N:M activation sparsity for LLMs with lightweight error mitigation and analysis of hardware-friendly patterns (e.g., 8:16, 16:32).

Relevance: 9 Novelty: 7
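
Illustrative sketch (not the authors' code): N:M sparsity, as discussed in the abstract above, keeps at most N non-zero values in every consecutive group of M. The Python below applies this to a flat activation vector using a plain magnitude criterion. The abstract mentions several pruning criteria and error mitigation techniques; magnitude selection is only an assumption here, and `nm_sparsify` is a hypothetical name.

```python
def nm_sparsify(activations, n=2, m=4):
    """Keep the n largest-magnitude values in each consecutive group of m; zero the rest."""
    if len(activations) % m != 0:
        raise ValueError("length must be a multiple of m")
    out = []
    for start in range(0, len(activations), m):
        group = activations[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out
```

Calling `nm_sparsify(x, n=8, m=16)` would give the 8:16 pattern the abstract focuses on, and `n=2, m=4` the standard 2:4 pattern.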

31. CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones

ArXiv ID: 2509.21764

Authors: Wenyi Gong, Mieszko Lis

Abstract: Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatial-aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIOU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.

Comment: Compression/Efficiency: spatial-preserving token merging tailored to ViT backbones with window/relative position designs, improving speed with minimal accuracy loss.

Relevance: 9 Novelty: 7
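
Illustrative sketch (not the authors' code): the CubistMerge abstract above mentions a max-magnitude-per-dimension token representation. One plausible reading, assumed here, is that a merged token takes, in each feature dimension, the signed value with the largest absolute value among the tokens being merged. The 2D reduction strategy and spatial-aware matching from the abstract are not reproduced, and `max_magnitude_merge` is a hypothetical name.

```python
def max_magnitude_merge(tokens):
    """Merge token vectors by keeping, per dimension, the signed value of largest magnitude."""
    # zip(*tokens) iterates over dimensions; max(..., key=abs) keeps the sign of the winner.
    return [max(values, key=abs) for values in zip(*tokens)]
```

Compared with averaging, this choice avoids shrinking a strong activation when it is merged with tokens that are near zero in that dimension.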

32. Stochastic activations

ArXiv ID: 2509.22358

Authors: Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazar\'e, Herv\'e J\'egou

Abstract: We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup on the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.

Comment: Model Architecture and Efficiency: introduces stochastic activations to enable ReLU at inference for sparse latent vectors and reduced FLOPs; addresses optimization/training dynamics of activation functions.

Relevance: 9 Novelty: 7
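
A minimal NumPy sketch of the Bernoulli choice between SiLU and ReLU described in the abstract. The function name, the `p_relu` parameter, and the granularity of the draw (one draw per call rather than per token or per element) are assumptions made for illustration; the abstract does not state them.

```python
import numpy as np

def silu(x):
    # SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def stochastic_activation(x, p_relu, rng, training=True):
    """Pick ReLU with probability p_relu, otherwise SiLU.

    Assumption: a single Bernoulli draw per call. With training=False the
    function always uses ReLU, mirroring the ReLU-at-inference setting in
    which negative pre-activations become exact zeros.
    """
    if not training:
        return relu(x)
    use_relu = rng.random() < p_relu     # Bernoulli(p_relu) draw
    return relu(x) if use_relu else silu(x)
```

At inference the ReLU output is exactly zero for negative inputs, which is what makes the hidden vectors sparse and lets the following matrix product skip those entries.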

33. IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method

ArXiv ID: 2509.22463

Authors: Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu

Abstract: High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques, such as distillation, can be detrimental to the performance of these models, exemplified by PCformer. To explore more optimizable ODE-based Transformer architectures, we propose the Iterative Implicit Euler Transformer (IIET), which simplifies high-order methods using an iterative implicit Euler approach. This simplification not only leads to superior performance but also facilitates model compression compared to PCformer. To enhance inference efficiency, we introduce Iteration Influence-Aware Distillation (IIAD). Through a flexible threshold, IIAD allows users to effectively balance the performance-efficiency trade-off. On lm-evaluation-harness, IIET boosts average accuracy by 2.65% over vanilla Transformers and 0.8% over PCformer. Its efficient variant, E-IIET, significantly cuts inference overhead by 55% while retaining 99.4% of the original task accuracy. Moreover, the most efficient IIET variant achieves an average performance gain exceeding 1.6% over vanilla Transformer with comparable speed.

Comment: Model Architecture + Efficiency: ODE-based Transformer using iterative implicit Euler with an influence-aware distillation scheme for improved performance-efficiency and compressibility.

Relevance: 9 Novelty: 7
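
For readers unfamiliar with the numerical method behind the name, the sketch below shows the classical implicit Euler step solved by fixed-point iteration. It is the textbook scheme, not the IIET architecture; in a Transformer reading, `f` would play the role of a residual block, which is an interpretation assumed here rather than taken from the abstract.

```python
def implicit_euler_step(f, y, h, num_iters=3):
    """One implicit Euler step y_next = y + h * f(y_next).

    The implicit equation is solved approximately by fixed-point iteration,
    starting from the explicit Euler guess. The iteration converges when
    h * f is a contraction (for example, small h and Lipschitz f).
    """
    y_next = y + h * f(y)            # explicit Euler initial guess
    for _ in range(num_iters):
        y_next = y + h * f(y_next)   # refine using the latest estimate
    return y_next
```

For `f = lambda y: -y`, `y = 1.0`, and `h = 0.1`, the exact implicit Euler result is `1 / 1.1`, and each extra iteration shrinks the error by a factor of `h`. The number of iterations is the knob that trades accuracy against compute.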

34. Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time

ArXiv ID: 2509.22572

Authors: Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang

Abstract: Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. However, existing approaches primarily rely on output-level sampling while overlooking the role of model architecture. In mainstream Mixture-of-Experts (MoE) LLMs, we observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing a new and underexplored source of diversity. Motivated by this observation, we propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES integrates two key components: (1) Dynamic MoE, which enables direct control of expert counts during inference to generate diverse reasoning trajectories without additional cost; and (2) Expert Configuration Inheritance, which preserves consistent expert counts within a reasoning path while varying them across runs, thereby balancing stability and diversity throughout the search. Extensive experiments across MoE architectures, verifiers and reasoning benchmarks (i.e., math, code and knowledge) demonstrate that DES reliably outperforms TTS baselines, enhancing accuracy and stability without additional cost. These results highlight DES as a practical and scalable form of architecture-aware TTS, illustrating how structural flexibility in modern LLMs can advance reasoning.

Comment: Model Architecture (MoE) + Test-time scaling: controls number of activated experts during inference to induce diverse reasoning paths without extra cost.

Relevance: 9 Novelty: 7
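
The idea of treating the number of activated experts as a search dimension can be sketched with a toy top-k MoE layer. Everything here (function name, linear experts, softmax over the selected logits) is an assumption for illustration and is not taken from the paper.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, k):
    """Toy MoE layer where the number of active experts k is a call-time argument.

    x:         (d,) input token
    router_w:  (d, num_experts) router weights
    expert_ws: (num_experts, d, d) one linear map per expert
    Returns the mixed output and the indices of the selected experts.
    """
    logits = x @ router_w
    top = np.argsort(logits)[-k:]                 # k highest-scoring experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                            # softmax over selected experts
    out = sum(g * (x @ expert_ws[e]) for g, e in zip(gate, top))
    return out, top
```

A search procedure in this spirit would run several reasoning paths, each with its own fixed `k` kept constant along the path, and let a verifier choose among the resulting candidates.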

35. General Pruning Criteria for Fast SBL

ArXiv ID: 2509.21572

Authors: Jakob Möderl, Erik Leitinger, Bernard Henri Fleury

Abstract: Sparse Bayesian learning (SBL) associates to each weight in the underlying linear model a hyperparameter by assuming that each weight is Gaussian distributed with zero mean and precision (inverse variance) equal to its associated hyperparameter. The method estimates the hyperparameters by marginalizing out the weights and performing (marginalized) maximum likelihood (ML) estimation. SBL causes many hyperparameter estimates to diverge to infinity, effectively setting the estimates of the corresponding weights to zero (i.e., pruning the corresponding weights from the model) and thereby yielding a sparse estimate of the weight vector. In this letter, we analyze the marginal likelihood as a function of a single hyperparameter while keeping the others fixed, when the Gaussian assumptions on the noise samples and the weight distribution that underlie the derivation of SBL are weakened. We derive sufficient conditions that lead, on the one hand, to finite hyperparameter estimates and, on the other, to infinite ones. Finally, we show that in the Gaussian case, the two conditions are complementary and coincide with the pruning condition of fast SBL (F-SBL), thereby providing additional insights into this algorithm.

Comment: Compression/Efficiency: theoretical pruning criteria in Sparse Bayesian Learning clarifying when weights are pruned (sparsity).

Relevance: 9 Novelty: 7
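
As background for the Gaussian case, the sketch below encodes the single-hyperparameter update from the classical fast marginal likelihood algorithm of Tipping and Faul. It assumes that "F-SBL" in the abstract refers to that algorithm; the names `s` (sparsity factor) and `q` (quality factor) follow the usual notation and are not taken from the abstract.

```python
import math

def fast_sbl_update(s, q):
    """Maximizer of the marginal likelihood in one precision hyperparameter.

    s: sparsity factor of the basis vector (s > 0)
    q: quality factor of the basis vector
    If q**2 > s, the optimum is finite and the weight is kept. Otherwise
    the optimum is at infinity, which prunes the weight from the model.
    """
    if q * q > s:
        return s * s / (q * q - s)
    return math.inf
```

For example, `fast_sbl_update(1.0, 2.0)` gives a finite precision of `1/3`, while `fast_sbl_update(4.0, 1.0)` returns infinity and the corresponding weight is removed.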

36. Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs

ArXiv ID: 2509.21895

Authors: Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Masahiro Ikeda

Abstract: We derive a new Rademacher complexity bound for deep neural networks using Koopman operators, group representations, and reproducing kernel Hilbert spaces (RKHSs). The proposed bound describes why the models with high-rank weight matrices generalize well. Although there are existing bounds that attempt to describe this phenomenon, these existing bounds can be applied to limited types of models. We introduce an algebraic representation of neural networks and a kernel function to construct an RKHS to derive a bound for a wider range of realistic models. This work paves the way for the Koopman-based theory for Rademacher complexity bounds to be valid for more practical situations.

Comment: Representation Learning/Training Dynamics: new Rademacher complexity bounds via RKHS/Koopman explain why high-rank weight matrices generalize.

Relevance: 8 Novelty: 8

37. Global Convergence in Neural ODEs: Impact of Activation Functions

ArXiv ID: 2509.22436

Authors: Tianxiang Gao, Siyuan Sun, Hailiang Liu, Hongyang Gao

Abstract: Neural Ordinary Differential Equations (ODEs) have been successful in various applications due to their continuous nature and parameter-sharing efficiency. However, these unique characteristics also introduce challenges in training, particularly with respect to gradient computation accuracy and convergence analysis. In this paper, we address these challenges by investigating the impact of activation functions. We demonstrate that the properties of activation functions, specifically smoothness and nonlinearity, are critical to the training dynamics. Smooth activation functions guarantee globally unique solutions for both forward and backward ODEs, while sufficient nonlinearity is essential for maintaining the spectral properties of the Neural Tangent Kernel (NTK) during training. Together, these properties enable us to establish the global convergence of Neural ODEs under gradient descent in overparameterized regimes. Our theoretical findings are validated by numerical experiments, which not only support our analysis but also provide practical guidelines for scaling Neural ODEs, potentially leading to faster training and improved performance in real-world applications.

Comment: Training Dynamics/Architecture: establishes global convergence of Neural ODEs under smooth, sufficiently nonlinear activations via NTK properties.

Relevance: 8 Novelty: 8

38. Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning

ArXiv ID: 2509.22335

Authors: Naicheng He, Kaicheng Guo, Arjun Prakash, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris

Abstract: We investigate why deep neural networks suffer from loss of plasticity in deep continual learning, failing to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at new-task initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. To characterize the necessary condition for successful training, we introduce the notion of $\tau$-trainability and show that current plasticity preserving algorithms can be unified under this framework. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying $L2$ penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.

Comment: Representation Learning/Training Dynamics: links Hessian spectral collapse to loss of plasticity and proposes regularizers (feature-rank, L2) to preserve trainability.

Relevance: 8 Novelty: 8
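
The two regularization targets named in the abstract can be made concrete with a short sketch. The entropy-based effective rank used here is one common definition and is an assumption; the paper may define effective feature rank differently. The function names are hypothetical.

```python
import numpy as np

def effective_rank(features):
    """Entropy-based effective rank of a (batch, dim) feature matrix.

    Computed as exp(H(p)), where p is the distribution obtained by
    normalizing the singular values to sum to one.
    """
    s = np.linalg.svd(np.asarray(features, dtype=float), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                       # drop exact zeros before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

def l2_penalty(params, coeff):
    """Standard L2 penalty: coeff times the sum of squared parameter entries."""
    return coeff * sum(float(np.sum(np.square(p))) for p in params)
```

A collapsed representation (all features along one direction) has effective rank close to 1, while features spread evenly over `d` directions have effective rank close to `d`.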

39. Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

ArXiv ID: 2509.21761

Authors: Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen

Abstract: Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe that proves the existence of learnable backdoor features encoded within the representations. Building on this insight, we further develop Backdoor Attention Head Attribution (BAHA), efficiently pinpointing the specific attention heads responsible for processing these features. Our primary experiments reveal that these heads are relatively sparse; ablating a minimal ~3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we further employ these findings to construct the Backdoor Vector derived from these attributed heads as a master controller for the backdoor. Through only a 1-point intervention on a single representation, the vector can either boost ASR up to ~100% on clean inputs, or completely neutralize the backdoor, suppressing ASR down to ~0% on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.

Comment: Representation Learning: mechanistic attribution of backdoor features and attention heads; Sparsity: identifies a sparse set of heads enabling backdoor control via a single vector intervention.

Relevance: 8 Novelty: 8

40. Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting

ArXiv ID: 2509.22615

Authors: Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe

Abstract: Modern vision language pipelines are driven by RGB vision encoders trained on massive image text corpora. While these pipelines have enabled impressive zero shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance aware pruning, and batched CUDA kernels, achieving over 90x faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero shot ImageNet-1K performance while compressing inputs 3 to 20x relative to pixels. While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge cloud learning.

Comment: Matches Model Compression and Efficiency: 2D Gaussian Splatting compresses visual inputs and reduces tokenization cost; High Performance Computing: batched CUDA kernels with ~90x faster fitting and high GPU utilization; efficient adapter over a frozen Transformer.

Relevance: 8 Novelty: 8

41. Toward a Physics of Deep Learning and Brains

ArXiv ID: 2509.22649

Authors: Arsham Ghavasieh, Meritxell Vila-Minana, Akanksha Khurd, John Beggs, Gerardo Ortiz, Santo Fortunato

Abstract: Deep neural networks and brains both learn and share superficial similarities: processing nodes are likened to neurons and adjustable weights are likened to modifiable synapses. But can a unified theoretical framework be found to underlie them both? Here we show that the equations used to describe neuronal avalanches in living brains can also be applied to cascades of activity in deep neural networks. These equations are derived from non-equilibrium statistical physics and show that deep neural networks learn best when poised between absorbing and active phases. Because these networks are strongly driven by inputs, however, they do not operate at a true critical point but within a quasi-critical regime -- one that still approximately satisfies crackling noise scaling relations. By training networks with different initializations, we show that maximal susceptibility is a more reliable predictor of learning than proximity to the critical point itself. This provides a blueprint for engineering improved network performance. Finally, using finite-size scaling we identify distinct universality classes, including Barkhausen noise and directed percolation. This theoretical framework demonstrates that universal features are shared by both biological and artificial neural networks.

Comment: Representation learning/training dynamics: non-equilibrium statistical physics framework (quasi-criticality, susceptibility, universality classes) explaining when networks learn best.

Relevance: 8 Novelty: 8

42. TRACE: Learning to Compute on Graphs

ArXiv ID: 2509.21886

Authors: Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu

Abstract: Learning to compute, the ability to model the functional behavior of a computational graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. This flawed assumption, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on electronic circuits, one of the most complex and economically critical classes of computational graphs. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning to compute on graphs.

Comment: Model Architecture: Hierarchical Transformer backbone aligned with stepwise computation and a function-shift learning objective for graph computation.

Relevance: 8 Novelty: 8

43. Differentiable Structure Learning for General Binary Data

ArXiv ID: 2509.21658

Authors: Chang Deng, Bryon Aragam

Abstract: Existing methods for differentiable structure learning in discrete data typically assume that the data are generated from specific structural equation models. However, these assumptions may not align with the true data-generating process, which limits the general applicability of such methods. Furthermore, current approaches often ignore the complex dependence structure inherent in discrete data and consider only linear effects. We propose a differentiable structure learning framework that is capable of capturing arbitrary dependencies among discrete variables. We show that although general discrete models are unidentifiable from purely observational data, it is possible to characterize the complete set of compatible parameters and structures. Additionally, we establish identifiability up to Markov equivalence under mild assumptions. We formulate the learning problem as a single differentiable optimization task in the most general form, thereby avoiding the unrealistic simplifications adopted by previous methods. Empirical results demonstrate that our approach effectively captures complex relationships in discrete data.

Comment: Matches Representation Learning: general differentiable structure learning for binary variables with identifiability analysis and a single differentiable optimization, capturing arbitrary dependencies.

Relevance: 8 Novelty: 8

44. Overclocking Electrostatic Generative Models

ArXiv ID: 2509.22454

Authors: Daniil Shlenskii, Alexander Korotin

Abstract: Electrostatic generative models such as PFGM++ have recently emerged as a powerful framework, achieving state-of-the-art performance in image synthesis. PFGM++ operates in an extended data space with auxiliary dimensionality $D$, recovering the diffusion model framework as $D\to\infty$, while yielding superior empirical results for finite $D$. Like diffusion models, PFGM++ relies on expensive ODE simulations to generate samples, making it computationally costly. To address this, we propose Inverse Poisson Flow Matching (IPFM), a novel distillation framework that accelerates electrostatic generative models across all values of $D$. Our IPFM reformulates distillation as an inverse problem: learning a generator whose induced electrostatic field matches that of the teacher. We derive a tractable training objective for this problem and show that, as $D \to \infty$, our IPFM closely recovers Score Identity Distillation (SiD), a recent method for distilling diffusion models. Empirically, our IPFM produces distilled generators that achieve near-teacher or even superior sample quality using only a few function evaluations. Moreover, we observe that distillation converges faster for finite $D$ than in the $D \to \infty$ (diffusion) limit, which is consistent with prior findings that finite-$D$ PFGM++ models exhibit more favorable optimization and sampling properties.

Comment: Matches Model Compression and Efficiency: introduces IPFM, a distillation framework to accelerate electrostatic generative models (PFGM++) across auxiliary dimensions, reducing function evaluations.

Relevance: 8 Novelty: 8

45. A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems

ArXiv ID: 2509.21716

Authors: Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Christopher Ré, Scott W. Linderman

Abstract: Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. This unifying view highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, our framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.

Comment: High Performance Computing: unifies fixed-point parallelization of sequential models via an LDS framework for efficient, scalable evaluation.

Relevance: 8 Novelty: 8
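
To make the fixed-point view concrete, here is a minimal sketch of a Jacobi-style iteration for a scalar nonlinear recursion x_t = f(x_{t-1}, u_t). It illustrates the general idea only and is not the paper's LDS formulation; the function names and the requirement that `f` be vectorized over time steps are assumptions of this sketch.

```python
import numpy as np

def sequential_rollout(f, x0, inputs):
    """Reference: evaluate x_t = f(x_{t-1}, u_t) one step at a time."""
    xs, x = [], x0
    for u in inputs:
        x = f(x, u)
        xs.append(x)
    return np.array(xs)

def jacobi_rollout(f, x0, inputs, num_iters):
    """Jacobi-style fixed-point iteration over the whole trajectory.

    Every sweep updates all T states at once from the previous sweep's
    states, so the T evaluations of f inside a sweep are independent and
    could run in parallel. After T sweeps the result equals the
    sequential rollout; fewer sweeps give an approximation.
    """
    inputs = np.asarray(inputs, dtype=float)
    xs = np.zeros(len(inputs))                       # initial guess
    for _ in range(num_iters):
        prev = np.concatenate(([x0], xs[:-1]))       # shifted previous iterate
        xs = f(prev, inputs)                         # all time steps at once
    return xs
```

With `f = lambda x, u: np.tanh(0.5 * x + u)`, running `len(inputs)` sweeps reproduces the sequential trajectory. Whether far fewer sweeps suffice in practice depends on how contractive the recursion is, which is the kind of question the abstract says the framework clarifies.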

46. Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics

ArXiv ID: 2509.22279

Authors: Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang

Abstract: Time Series Analysis is widely used in various real-world applications such as weather forecasting, financial fraud detection, imputation for missing data in IoT systems, and classification for action recognition. Mixture-of-Experts (MoE), as a powerful architecture, though demonstrating effectiveness in NLP, still falls short in adapting to versatile tasks in time series analytics due to its task-agnostic router and the lack of capability in modeling channel correlations. In this study, we propose a novel, general MoE-based time series framework called PatchMoE to support the intricate "knowledge" utilization for distinct tasks, thus task-aware. Based on the observation that hierarchical representations often vary across tasks, e.g., forecasting vs. classification, we propose a Recurrent Noisy Gating to utilize the hierarchical information in routing, thus obtaining task-specific capability. The routing strategy operates on time series tokens in both temporal and channel dimensions, and is encouraged by a meticulously designed Temporal & Channel Load Balancing Loss to model the intricate temporal and channel correlations. Comprehensive experiments on five downstream tasks demonstrate the state-of-the-art performance of PatchMoE.

Comment: Model Architecture: Mixture-of-Experts with task-aware recurrent noisy gating and temporal/channel token routing with load-balancing loss.