Personalized Daily ArXiv Papers 2025-10-07

[gpt-5]	Prompt	Completion	Total
Token	109658	106795	216453
Cost	$0.14	$1.07	$1.21

Total arXiv papers: 1350

Total scanned papers: 866

Total relevant papers: 70

Table of contents with paper titles:

Boomerang Distillation Enables Zero-Shot Model Size Interpolation Authors: Sara Kangaslahti, Nihal V. Nayak, Jonathan Geuter, Marco Fumero, Francesco Locatello, David Alvarez-Melis
Implicit Models: Expressive Power Scales with Test-Time Compute Authors: Jialin Liu, Lisang Ding, Stanley Osher, Wotao Yin
PolyKAN: A Polyhedral Analysis Framework for Provable and Minimal KAN Compression Authors: Di Zhang
Replacing Softmax Similarity with a Sharpened Angular Similarity: Theory and Practice of Scaling To Billion-Context Attention Authors: Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava
Allocation of Parameters in Transformers Authors: Ruoxi Yu, Haotian Jiang, Jingpu Cheng, Penghao Yu, Qianxiao Li, Zhong Li
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space Authors: Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility Authors: Annan Yu, Danielle C. Maddix, Boran Han, Xiyuan Zhang, Abdul Fatir Ansari, Oleksandr Shchur, Christos Faloutsos, Andrew Gordon Wilson, Michael W. Mahoney, Yuyang Wang
Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention Authors: Haiquan Qiu, Quanming Yao
UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs Authors: Yizhuo Ding, Wanying Qu, Jiawei Geng, Wenqi Shao, Yanwei Fu
PT$^2$-LLM: Post-Training Ternarization for Large Language Models Authors: Xianglong Yan, Chengzhu Bao, Zhiteng Li, Tianao Zhang, Kaicheng Yang, Haotong Qin, Ruobing Xie, Xingwu Sun, Yulun Zhang
MemMamba: Rethinking Memory Patterns in State Space Model Authors: Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun
A Mathematical Explanation of Transformers for Large Language Models and GPTs Authors: Xue-Cheng Tai, Hao Liu, Lingfeng Li, Raymond H. Chan
Post-training quantization of vision encoders needs prefixing registers Authors: Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee
StructPrune: Structured Global Pruning asymptotics with $\mathcal{O}(\sqrt{N})$ GPU Memory Authors: Xinyuan Song, Guangji Bai, Liang Zhao
SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size Authors: Junhao Xia, Ming Zhao, Limin Xiao, Xiujun Zhang
On Structured State-Space Duality Authors: Jerry Yao-Chieh Hu, Xiwen Zhang, Weimin Wu, Han Liu
On residual network depth Authors: Benoit Dherin, Michael Munn
Multilingual Routing in Mixture-of-Experts Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng
Quantization Range Estimation for Convolutional Neural Networks Authors: Bingtao Yang, Yujia Wang, Mengzhi Jiao, Hongwei Huo
From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing Authors: Rana Shahout, Colin Cai, Yilun Du, Minlan Yu, Michael Mitzenmacher
On the Limitations and Capabilities of Position Embeddings for Length Generalization Authors: Yang Chen, Yitao Liang, Zhouchen Lin
PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling Authors: Yukun Zhang, Xueqing Zhou
Learning without Global Backpropagation via Synergistic Information Distillation Authors: Chenhao Ye, Ming Tang
Paris: A Decentralized Trained Open-Weight Diffusion Model Authors: Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy
Composite Optimization with Error Feedback: the Dual Averaging Approach Authors: Yuan Gao, Anton Rodomanov, Jeremy Rack, Sebastian Stich
Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts Authors: Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
Arithmetic-Mean $\mu$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets Authors: Haosong Zhang, Shenxi Wu, Yichi Zhang, Wei Lin
Platonic Transformers: A Solid Choice For Equivariance Authors: Mohammad Mohaiminul Islam, Rishabh Anand, David R. Wessels, Friso de Kruiff, Thijs P. Kuipers, Rex Ying, Clara I. S\'anchez, Sharvaree Vadgama, Georg B\"okman, Erik J. Bekkers
Decrypt Modality Gap in Multimodal Contrastive Learning: From Convergent Representation to Pair Alignment Authors: Lingjie Yi, Raphael Douady, Chao Chen
Provable Affine Identifiability of Nonlinear CCA under Latent Distributional Priors Authors: Zhiwei Han, Stefan Matthes, Hao Shen
Expand Neurons, Not Parameters Authors: Linghao Kong, Inimai Subramanian, Yonadav Shavit, Micah Adler, Dan Alistarh, Nir Shavit
What Scales in Cross-Entropy Scaling Law? Authors: Junxi Yan, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu
Understanding the Role of Training Data in Test-Time Scaling Authors: Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni
HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks Authors: Nghiem T. Diep, Dung Le, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho
Distributed Low-Communication Training with Decoupled Momentum Optimization Authors: Sasho Nedelkoski, Alexander Acker, Odej Kao, Soeren Becker, Dominik Scheinert
DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks Authors: Nghiem T. Diep, Hien Dang, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders Authors: Xu Wang, Yan Hu, Benyou Wang, Difan Zou
Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models Authors: Tianao Zhang, Zhiteng Li, Xianglong Yan, Haotong Qin, Yong Guo, Yulun Zhang
Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models Authors: Sofiane Ennadir, Levente Z\'olyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao
Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy Authors: Karthik Viswanathan, Sang Eon Park
Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices Authors: Congzheng Song, Xinyu Tang
KVComm: Enabling Efficient LLM Communication through Selective KV Sharing Authors: Xiangyu Shi, Marco Chiesa, Gerald Q. Maguire Jr., Dejan Kostic
Optimal Scaling Needs Optimal Norm Authors: Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim
Sharp Lower Bounds for Linearized ReLU^k Approximation on the Sphere Authors: Tong Mao, Jinchao Xu
How does the optimizer implicitly bias the model merging loss landscape? Authors: Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw
Decision Potential Surface: A Theoretical and Practical Approximation of LLM's Decision Boundary Authors: Zi Liang, Zhiyao Wu, Haoyang Shang, Yulin Jin, Qingqing Ye, Huadi Zheng, Peizhao Hu, Haibo Hu
From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models Authors: Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia
Internal states before wait modulate reasoning patterns Authors: Dmitrii Troitskii, Koyena Pal, Chris Wendler, Callum Stuart McDougall, Neel Nanda
Learning to Interpret Weight Differences in Language Models Authors: Avichal Goel, Yoon Kim, Nir Shavit, Tony T. Wang
Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining Authors: Anirudh Subramanyam, Yuxin Chen, Robert L. Grossman
QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks Authors: Qian Chen, Linxin Yang, Akang Wang, Xiaodong Luo, Yin Zhang
Decomposing Attention To Find Context-Sensitive Neurons Authors: Alex Gibson
Learning Linear Regression with Low-Rank Tasks in-Context Authors: Kaito Takanami, Takashi Takahashi, Yoshiyuki Kabashima
Efficient Training of Spiking Neural Networks by Spike-aware Data Pruning Authors: Chenxiang Ma, Xinyi Chen, Yujie Wu, Kay Chen Tan, Jibin Wu
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking Authors: Ali Saheb Pasand, Elvis Dohmatob
Activation Steering with a Feedback Controller Authors: Dung V. Nguyen, Hieu M. Vu, Nhi Y. Pham, Lei Zhang, Tan M. Nguyen
REG: A Regularization Optimizer for Robust Training Dynamics Authors: Zehua Liu, Han Wu, Xiaojin Fu, Shuqi Liu, Xiongwei Han, Tao Zhong, Mingxuan Yuan
Why Cannot Neural Networks Master Extrapolation? Insights from Physical Laws Authors: Ramzi Dakhmouche, Hossein Gorji
Rethinking Inter-LoRA Orthogonality in Adapter Merging: Insights from Orthogonal Monte Carlo Dropout Authors: Andi Zhang, Xuan Ding, Haofan Wang, Steven McDonagh, Samuel Kaski
Adaptively Sampling-Reusing-Mixing Decomposed Gradients to Speed Up Sharpness Aware Minimization Authors: Jiaxin Deng, Junbiao Pang
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning Authors: Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework Authors: Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar, Jayvart Sharma, Ryan Lagasse
Compressed Concatenation of Small Embedding Models Authors: Mohamed Ayoub Ben Ayad, Michael Dinzinger, Kanishka Ghosh Dastidar, Jelena Mitrovic, Michael Granitzer
GRACE: Generative Representation Learning via Contrastive Policy Optimization Authors: Jiashuo Sun, Shixuan Liu, Zhaochen Su, Xianrui Zhong, Pengcheng Jiang, Bowen Jin, Peiran Li, Weijia Shi, Jiawei Han
Light Differentiable Logic Gate Networks Authors: Lukas R\"uttgers, Till Aczel, Andreas Plesner, Roger Wattenhofer
Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs Authors: Emanuele Mule, Stefano Fiorini, Antonio Purificato, Federico Siciliano, Stefano Coniglio, Fabrizio Silvestri
From Moments to Models: Graphon Mixture-Aware Mixup and Contrastive Learning Authors: Ali Azizpour, Reza Ramezanpour, Ashutosh Sabharwal, Santiago Segarra
MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment Authors: Yufei Li, Yu Fu, Yue Dong, Cong Liu

1. Boomerang Distillation Enables Zero-Shot Model Size Interpolation

ArXiv ID: 2510.05064

Authors: Sara Kangaslahti, Nihal V. Nayak, Jonathan Geuter, Marco Fumero, Francesco Locatello, David Alvarez-Melis

Abstract: Large language models (LLMs) are typically deployed under diverse memory and compute constraints. Existing approaches build model families by training each size independently, which is prohibitively expensive and provides only coarse-grained size options. In this work, we identify a novel phenomenon that we call boomerang distillation: starting from a large base model (the teacher), one first distills down to a small student and then progressively reconstructs intermediate-sized models by re-incorporating blocks of teacher layers into the student without any additional training. This process produces zero-shot interpolated models of many intermediate sizes whose performance scales smoothly between the student and teacher, often matching or surpassing pretrained or distilled models of the same size. We further analyze when this type of interpolation succeeds, showing that alignment between teacher and student through pruning and distillation is essential. Boomerang distillation thus provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments. The code and models are available at https://github.com/dcml-lab/boomerang-distillation.

Comment: Strongly matches Model Compression/Efficiency and Model Architecture: zero-shot model size interpolation by re-incorporating teacher blocks after distillation (no extra training).

Relevance: 10 Novelty: 9

2. Implicit Models: Expressive Power Scales with Test-Time Compute

ArXiv ID: 2510.03638

Authors: Jialin Liu, Lisang Ding, Stanley Osher, Wotao Yin

Abstract: Implicit models, an emerging model class, compute outputs by iterating a single parameter block to a fixed point. This architecture realizes an infinite-depth, weight-tied network that trains with constant memory, significantly reducing memory needs for the same level of performance compared to explicit models. While it is empirically known that these compact models can often match or even exceed larger explicit networks by allocating more test-time compute, the underlying mechanism remains poorly understood. We study this gap through a nonparametric analysis of expressive power. We provide a strict mathematical characterization, showing that a simple and regular implicit operator can, through iteration, progressively express more complex mappings. We prove that for a broad class of implicit models, this process lets the model's expressive power scale with test-time compute, ultimately matching a much richer function class. The theory is validated across three domains: image reconstruction, scientific computing, and operations research, demonstrating that as test-time iterations increase, the complexity of the learned mapping rises, while the solution quality simultaneously improves and stabilizes.

Comment: Model Architecture and Efficiency: theory for implicit (infinite-depth, weight-tied) models showing expressive power scales with test-time iterations and constant-memory training.

Relevance: 10 Novelty: 9

3. PolyKAN: A Polyhedral Analysis Framework for Provable and Minimal KAN Compression

ArXiv ID: 2510.04205

Authors: Di Zhang

Abstract: Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to traditional Multi-Layer Perceptrons (MLPs), offering enhanced interpretability and a strong mathematical foundation. However, their parameter efficiency remains a significant challenge for practical deployment. This paper introduces PolyKAN, a novel theoretical framework for KAN compression that provides formal guarantees on both model size reduction and approximation error. By leveraging the inherent piecewise polynomial structure of KANs, we formulate the compression problem as one of optimal polyhedral region merging. We establish a rigorous polyhedral characterization of KANs, develop a complete theory of $\epsilon$-equivalent compression, and design an optimal dynamic programming algorithm that guarantees minimal compression under specified error bounds. Our theoretical analysis demonstrates that PolyKAN achieves provably minimal compression while maintaining strict error control, with polynomial-time complexity in all network parameters. The framework provides the first formal foundation for KAN compression with mathematical guarantees, opening new directions for efficient deployment of interpretable neural architectures.

Comment: Model Compression and Efficiency: provable, minimal KAN compression via polyhedral analysis and ε-equivalent compression with an optimal DP algorithm.

Relevance: 10 Novelty: 9

4. Replacing Softmax Similarity with a Sharpened Angular Similarity: Theory and Practice of Scaling To Billion-Context Attention

ArXiv ID: 2510.04008

Authors: Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava

Abstract: Softmax Attention has a quadratic time complexity, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention (an exact, GPU-optimized implementation of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular (cosine) similarity, and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text classification, RACE Attention matches the accuracy of strong baselines while reducing runtime and memory. In a controlled scale test, it processes up to 12 million tokens during a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU, well beyond the practical limits of the current state-of-the-art attention implementations. RACE Attention thus offers a practical, theoretically grounded mechanism for outrageously long context windows on today's hardware. We hope that it gets adopted in practice.

Comment: Matches Model Compression/Efficiency and HPC: replaces Softmax with linear-time RACE attention via sharpened angular similarity, randomized projections, and soft LSH; enables million-token contexts with reduced memory/runtime.

Relevance: 10 Novelty: 9

5. Allocation of Parameters in Transformers

ArXiv ID: 2510.03784

Authors: Ruoxi Yu, Haotian Jiang, Jingpu Cheng, Penghao Yu, Qianxiao Li, Zhong Li

Abstract: Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters -- mainly attention heads and head dimensions -- should be allocated across layers to balance expressivity and efficiency. We first provide mathematical analysis on the role of early layers in information extraction from an approximation perspective, with a theoretical characterization on the trade-off between the number of heads and head dimension under a fixed parameter budget. In addition, we uncover and prove the \emph{saturation} behavior of softmax activations: Continuously increasing head dimensions can lead to diminishing returns in learning errors, particularly for long sequences. Supported by both theory and experiments, this saturation pattern suggests that later layers can operate more efficiently with reduced parameters. Combining these insights, we propose principled strategies for allocating attention heads and dimensions across Transformers' layers, shedding light on theoretically-grounded model efficiency of Transformer-based architectures.

Comment: Strongly matches Model Architecture/Efficiency: theoretical allocation of attention heads and dimensions across Transformer layers with saturation analysis.

Relevance: 10 Novelty: 8

6. Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

ArXiv ID: 2510.04476

Authors: Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge

Abstract: Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache, speeding decode, but leave compute, which determines prefill and training speed, largely unchanged. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Because CCA is orthogonal to head-sharing, we combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so that users can tune compression toward either FLOP or memory limits without sacrificing quality. Experiments show that CCGQA consistently outperforms both GQA and MLA at equal KV-cache compression on dense and MoE models. Additionally, we show that CCGQA outperforms all other attention methods on MoE models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache compression with no drop in performance compared to standard MHA. CCA and CCGQA also dramatically reduce the FLOP cost of attention which leads to substantially faster training and prefill than existing methods. On H100 GPUs, our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA, and accelerates backward by about 1.3x.

Comment: Model Compression and Efficiency: introduces compressed convolutional attention (CCA/CCGQA) reducing KV-cache and FLOPs with significant speedups; applicable to dense and MoE models.

Relevance: 10 Novelty: 8

7. Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility

ArXiv ID: 2510.03358

Authors: Annan Yu, Danielle C. Maddix, Boran Han, Xiyuan Zhang, Abdul Fatir Ansari, Oleksandr Shchur, Christos Faloutsos, Andrew Gordon Wilson, Michael W. Mahoney, Yuyang Wang

Abstract: Transformers are widely used across data modalities, and yet the principles distilled from text models often transfer imperfectly to models trained to other modalities. In this paper, we analyze Transformers through the lens of rank structure. Our focus is on the time series setting, where the structural properties of the data differ remarkably from those of text or vision. We show that time-series embeddings, unlike text or vision, exhibit sharply decaying singular value spectra: small patch sizes and smooth continuous mappings concentrate the data into low-rank subspaces. From this, we prove that the associated $Q/K/V$ projections admit accurate low-rank approximations, and that attention layers become compressible in proportion to the decay of the embedding spectrum. We introduce the concept of flow-of-ranks, a phenomenon by which nonlinear mixing across depth inflates the rank, explaining why early layers are most amenable to compression and why ranks grow with depth. Guided by these theoretical and empirical results, we use these insights to compress Chronos, a large time series foundation model, achieving a reduction of $65\%$ in inference time and $81\%$ in memory, without loss of accuracy. Our findings provide principled guidance for allocating width, depth, and heads in time series foundation models, and for exploiting their inherent compressibility.

Comment: Model Compression and Efficiency: establishes low-rank structure in time-series embeddings, proves compressibility of Q/K/V and attention, introduces flow-of-ranks; guides width/depth/head allocation and achieves large inference/memory reductions on a foundation TS model.

Relevance: 10 Novelty: 8

ArXiv ID: 2508.04581

Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis

Abstract: Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.

Comment: Model Architecture and Efficiency: structured cross-layer weight sharing via matrix dictionary learning for attention projections, yielding 66.7% parameter reduction with strong performance.

Relevance: 10 Novelty: 8

9. ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

ArXiv ID: 2505.02819

Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko

Abstract: We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.

Comment: Matches Model Compression and Efficiency: training-free depth pruning by replacing Transformer blocks with a linear operator using small calibration data; no retraining needed.

Relevance: 10 Novelty: 8

10. Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

ArXiv ID: 2510.04212

Authors: Haiquan Qiu, Quanming Yao

Abstract: The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosions. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem.

Comment: Matches Low-Precision Training and Efficiency: mechanistic analysis of flash attention failures under low precision and a minimal modification to mitigate biased rounding errors.

Relevance: 10 Novelty: 8

11. UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs

ArXiv ID: 2510.03291

Authors: Yizhuo Ding, Wanying Qu, Jiawei Geng, Wenqi Shao, Yanwei Fu

Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, existing methods struggle to balance efficiency and robustness: local metric approaches prune layer by layer but often collapse under high sparsity, whereas global feedback methods enforce consistency at the cost of expensive weight updates or restrictive semi-structured formats. We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination, enabled by a mirror descent based optimization, all without updating model weights. UniPruning leverages fast layer-wise scoring and a lightweight global controller to allocate a single sparsity budget, supporting both unstructured and semi-structured N :M pruning within one framework. After a brief calibration, it can generate pruning masks for arbitrary sparsity levels in one shot, and adapts seamlessly to hardware-aware constraints. Extensive experiments on multiple pretrained LLM families and standard benchmarks show that UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy. Ablation studies further highlight the importance of mirror descent and local saliency anchoring. Overall, UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs. Our code is available at: https://github.com/RainbowQTT/UniPruning.

Comment: Compression/Efficiency: unified post-training pruning with mirror descent combining local saliency and global coordination; supports unstructured and N:M sparsity with one-shot mask generation.

Relevance: 10 Novelty: 8

12. PT$^2$-LLM: Post-Training Ternarization for Large Language Models

ArXiv ID: 2510.03267

Authors: Xianglong Yan, Chengzhu Bao, Zhiteng Li, Tianao Zhang, Kaicheng Yang, Haotong Qin, Ruobing Xie, Xingwu Sun, Yulun Zhang

Abstract: Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. Ternarization has gained attention as a promising compression technique, delivering substantial size reduction and high computational efficiency. However, its potential in the post-training quantization (PTQ) setting remains underexplored, due to the challenge of training-free parameter optimization and the quantization difficulty posed by outliers and dispersed weights. To address these issues, we propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline: (1) Iterative Ternary Fitting (ITF), which alternates between optimal ternary grid construction and flexible rounding to minimize quantization error, and (2) Activation-aware Grid Alignment (AGA), which further refines the ternary grid to better match full-precision outputs. In addition, we propose a plug-and-play Structural Similarity-based Reordering (SSR) strategy that leverages inter-column structural similarity to ease quantization and mitigate outlier effects, further enhancing overall performance. Extensive experiments demonstrate that PT$^2$-LLM delivers competitive performance against state-of-the-art (SOTA) 2-bit PTQ methods with lower memory cost, while also accelerating both prefill and decoding to achieve end-to-end speedup. The code and models will be available at https://github.com/XIANGLONGYAN/PT2-LLM.

Comment: Directly matches Model Compression and Efficiency: post-training ternarization (quantization) for LLMs with asymmetric ternary quantizer, iterative fitting, and activation-aware refinement.

Relevance: 10 Novelty: 8

13. MemMamba: Rethinking Memory Patterns in State Space Model

ArXiv ID: 2510.03279

Authors: Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun

Abstract: With the explosive growth of data, long-sequence modeling has become increasingly important in tasks such as natural language processing and bioinformatics. However, existing methods face inherent trade-offs between efficiency and memory. Recurrent neural networks suffer from gradient vanishing and explosion, making them hard to scale. Transformers can model global dependencies but are constrained by quadratic complexity. Recently, selective state-space models such as Mamba have demonstrated high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. In this work, we conduct mathematical derivations and information-theoretic analysis to systematically uncover the memory decay mechanism of Mamba, answering a fundamental question: what is the nature of Mamba's long-range memory and how does it retain information? To quantify key information loss, we further introduce horizontal-vertical memory fidelity metrics that capture degradation both within and across layers. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba, a novel architectural framework that integrates state summarization mechanism together with cross-layer and cross-token attention, which alleviates long-range forgetting while preserving linear complexity. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks such as PG19 and Passkey Retrieval, while delivering a 48% speedup in inference efficiency. Both theoretical analysis and empirical results demonstrate that MemMamba achieves a breakthrough in the complexity-memory trade-off, offering a new paradigm for ultra-long sequence modeling.

Comment: Strongly matches Model Architecture: theoretical analysis of Mamba’s memory decay and a new MemMamba architecture adding state summarization and cross-layer/token attention with linear complexity.

Relevance: 10 Novelty: 8

14. A Mathematical Explanation of Transformers for Large Language Models and GPTs

ArXiv ID: 2510.03989

Authors: Xue-Cheng Tai, Hao Liu, Lingfeng Li, Raymond H. Chan

Abstract: The Transformer architecture has revolutionized the field of sequence modeling and underpins the recent breakthroughs in large language models (LLMs). However, a comprehensive mathematical theory that explains its structure and operations remains elusive. In this work, we propose a novel continuous framework that rigorously interprets the Transformer as a discretization of a structured integro-differential equation. Within this formulation, the self-attention mechanism emerges naturally as a non-local integral operator, and layer normalization is characterized as a projection to a time-dependent constraint. This operator-theoretic and variational perspective offers a unified and interpretable foundation for understanding the architecture's core components, including attention, feedforward layers, and normalization. Our approach extends beyond previous theoretical analyses by embedding the entire Transformer operation in continuous domains for both token indices and feature dimensions. This leads to a principled and flexible framework that not only deepens theoretical insight but also offers new directions for architecture design, analysis, and control-based interpretations. This new interpretation provides a step toward bridging the gap between deep learning architectures and continuous mathematical modeling, and contributes a foundational perspective to the ongoing development of interpretable and theoretically grounded neural network models.

Comment: Model Architecture—provides a continuous operator-theoretic formulation of Transformers (self-attention as integral operator, layer norm as projection), deepening theoretical foundations.

Relevance: 10 Novelty: 8

15. Post-training quantization of vision encoders needs prefixing registers

ArXiv ID: 2510.04547

Authors: Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee

Abstract: Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.

Comment: Model Compression and Efficiency—training-free post-training quantization for vision encoders via prefix registers (RegCache) to suppress activation outliers.

Relevance: 10 Novelty: 8

16. StructPrune: Structured Global Pruning asymptotics with $\mathcal{O}(\sqrt{N})$ GPU Memory

ArXiv ID: 2510.03246

Authors: Xinyuan Song, Guangji Bai, Liang Zhao

Abstract: Pruning is critical for scaling large language models (LLMs). Global pruning achieves strong performance but requires $\mathcal{O}(N)$ memory, which is infeasible for billion-parameter models. Local pruning reduces GPU memory usage to that of a single layer by pruning layers independently, but it neglects inter-layer dependencies and often leads to suboptimal performance in high-sparsity regimes. Unlike unstructured pruning, structured pruning produces regular sparsity patterns that align well with GPU kernels and library optimizations, making it more hardware-efficient. However, structured pruning typically relies on global pruning, since structured patterns are more prone to severe performance degradation under local optimization. To jointly achieve structured pruning and the memory efficiency of local pruning, we propose a divide-and-conquer strategy that decomposes the global pruning problem into coordinated subproblems across different modules, each of which fits within limited GPU memory. Building on this idea, we design \textbf{STRUPRUNE}, an ADMM-based framework that integrates structured sparsity into the pruning process, combining the memory efficiency of local pruning with the hardware compatibility of structured methods. We derive a closed-form analytical solution for structured pruning masks that provides an explicit rule for layer-wise sparsity allocation, and further develop an energy-based asymptotic framework yielding a softmax-form allocation scheme that simplifies optimization while adapting to heterogeneous layer importance. Experiments demonstrate that STRUPRUNE matches the perplexity of global structured pruning while reducing memory cost from $\mathcal{O}(N)$ to $\mathcal{O}(\sqrt{N})$, enabling practical deployment at the billion-parameter scale.

Comment: Model Compression/Efficiency—structured global pruning with O(sqrt(N)) memory via ADMM and derived layer-wise sparsity allocation.

Relevance: 10 Novelty: 8

17. SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size

ArXiv ID: 2510.03275

Authors: Junhao Xia, Ming Zhao, Limin Xiao, Xiujun Zhang

Abstract: Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size, a novel framework that enables extremely low-bit quantization of LLMs while preserving their linguistic reasoning capabilities. A distinctive feature of SDQ-LLM is the continuous adjustability of the Over-Sampling Ratio (OSR), enabling dynamic adaptation to memory or VRAM constraints by selecting fractional OSR (e.g. 2.5 times) for an optimal trade-off between model size and accuracy. SDQ-LLM uses upsampling combined with Sigma-Delta Quantizer to binarize or ternarize LLMs weights, encoding high-precision parameters into 1-bit or 1.58-bit representations, replacing the multiplication operations within linear layers with addition. This approach significantly enhances inference efficiency under extremely low-bit quantization. To further reduce the loss of quantization precision, we incorporate Hadamard-based weight smoothing prior to quantization, improving the stability and robustness of the weight representations. Furthermore, to fully leverage the continuity of the OSR and reduce precision loss, recognizing the correlation between quantization sensitivity and weight variance, we propose a fine-grained, layer- and linear-wise OSR allocation strategy, MultiOSR. This strategy distributes OSR both across layers and within each layer, based on weight variance and parameter scale. Finally, extensive experiments on OPT and LLaMA model families demonstrate that SDQ-LLM achieves a more efficient and high-precision performance even under highly aggressive low-OSR settings. Our code is available at https://github.com/Dreamlittlecat/LLM-Quant-Factory.

Comment: Model Compression and Efficiency: introduces Sigma-Delta 1-bit/1.58-bit quantization for LLMs with adjustable and fine-grained OSR allocation plus Hadamard-based weight smoothing.

Relevance: 10 Novelty: 8

18. On Structured State-Space Duality

ArXiv ID: 2510.04944

Authors: Jerry Yao-Chieh Hu, Xiwen Zhang, Weimin Wu, Han Liu

Abstract: Structured State-Space Duality (SSD) [Dao & Gu, ICML 2024] is an equivalence between a simple Structured State-Space Model (SSM) and a masked attention mechanism. In particular, a state-space model with a scalar-times-identity state matrix is equivalent to a masked self-attention with a $1$-semiseparable causal mask. Consequently, the same sequence transformation (model) has two algorithmic realizations: as a linear-time $O(T)$ recurrence or as a quadratic-time $O(T^2)$ attention. In this note, we formalize and generalize this duality: (i) we extend SSD from the scalar-identity case to general diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs match the scalar case's training complexity lower bounds while supporting richer dynamics; (iii) we establish a necessary and sufficient condition under which an SSM is equivalent to $1$-semiseparable masked attention; and (iv) we show that such duality fails to extend to standard softmax attention due to rank explosion. Together, these results tighten bridge between recurrent SSMs and Transformers, and widen the design space for expressive yet efficient sequence models.

Comment: Matches Model Architecture: formalizes and generalizes the SSM–masked-attention duality, providing necessary/sufficient conditions and training complexity bounds; expands efficient Transformer/SSM design space.

Relevance: 10 Novelty: 8

19. On residual network depth

ArXiv ID: 2510.03470

Authors: Benoit Dherin, Michael Munn

Abstract: Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal, explaining the historical necessity of normalization layers in training deep models. This insight offers a first principles explanation for the historical dependence on normalization layers and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network's inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity controls that also implicitly regularizes the model's complexity.

Comment: Model Architecture: Residual Expansion Theorem giving first-principles analysis of depth in residual networks and principled scaling to control combinatorial path growth.

Relevance: 9 Novelty: 9

20. Multilingual Routing in Mixture-of-Experts

ArXiv ID: 2510.04694

Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng

Abstract: Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.

Comment: Mixture-of-Experts: analysis of multilingual routing dynamics with inference-time router steering to enhance cross-lingual expert utilization.

Relevance: 10 Novelty: 7

21. Quantization Range Estimation for Convolutional Neural Networks

ArXiv ID: 2510.04044

Authors: Bingtao Yang, Yujia Wang, Mengzhi Jiao, Hongwei Huo

Abstract: Post-training quantization for reducing the storage of deep neural network models has been demonstrated to be an effective way in various tasks. However, low-bit quantization while maintaining model accuracy is a challenging problem. In this paper, we present a range estimation method to improve the quantization performance for post-training quantization. We model the range estimation into an optimization problem of minimizing quantization errors by layer-wise local minima. We prove this problem is locally convex and present an efficient search algorithm to find the optimal solution. We propose the application of the above search algorithm to the transformed weights space to do further improvement in practice. Our experiments demonstrate that our method outperforms state-of-the-art performance generally on top-1 accuracy for image classification tasks on the ResNet series models and Inception-v3 model. The experimental results show that the proposed method has almost no loss of top-1 accuracy in 8-bit and 6-bit settings for image classifications, and the accuracy of 4-bit quantization is also significantly improved. The code is available at https://github.com/codeiscommitting/REQuant.

Comment: Strongly matches Model Compression/Efficiency: post-training quantization with provable local convexity and efficient range search.

Relevance: 10 Novelty: 7

22. From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing

ArXiv ID: 2510.03293

Authors: Rana Shahout, Colin Cai, Yilun Du, Minlan Yu, Michael Mitzenmacher

Abstract: Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts through a learned gate function. While conditional routing reduces training costs, it shifts the burden on inference memory: expert parameters and activations consume memory, limiting the number of experts per device. As tokens are routed, some experts become overloaded while others are underutilized. Because experts are mapped to GPUs, this imbalance translates directly into degraded system performance in terms of latency, throughput, and cost. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy. LASER adapts to the shape of the gate's score distribution. When scores provide a clear preference, it routes to the strongest experts; when scores are more uniform, it broadens the set of viable experts and routes to the least-loaded among them. Because LASER relies only on gate scores from a trained model, it integrates directly into existing MoE inference pipelines without retraining or finetuning. We evaluate LASER on Mixtral-8x7B and DeepSeek-MoE-16b-chat across four datasets (ARC-Easy, ARC-Challenge, MMLU, and GSM8K). LASER improves load balancing, translating into lower latency and higher throughput, while keeping the accuracy changes negligible.

Comment: Model Architecture + HPC/Efficiency: MoE inference-time routing that adapts to gate score distributions to balance expert load and reduce latency without retraining.

Relevance: 10 Novelty: 7

23. On the Limitations and Capabilities of Position Embeddings for Length Generalization

ArXiv ID: 2510.04130

Authors: Yang Chen, Yitao Liang, Zhouchen Lin

Abstract: In Transformers, Position Embeddings (PEs) significantly influence Length Generalization (LG) performance, yet their fundamental role remains unclear. In this work, we investigate the limitations and capabilities of PEs in achieving LG. We theoretically analyze PEs in Position-Only Linear Attentions (POLAs), introducing Linear Representation Complexity (LRC) to characterize when PEs enable LG. Our analysis shows that PEs do not expand computational capabilities but structure learned computations across positions. Extending to practical Transformers, we propose Sequential Representation Complexity (SRC) and conjecture that LG is possible if and only if SRC remains invariant across scales. We support this hypothesis with empirical evidence in various reasoning tasks. To enhance LG, we introduce Scale Hint, allowing flexible instance scaling, and a Learning-Based Position Embedding framework that automatically learns positional relations. Our work provides theoretical insights and practical strategies for improving LG in Transformers.

Comment: Matches Model Architecture (Transformers): theoretical analysis of position embeddings for length generalization (LRC/SRC) plus a learning-based PE framework and scale hints.

Relevance: 9 Novelty: 8

24. PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling

ArXiv ID: 2510.03272

Authors: Yukun Zhang, Xueqing Zhou

Abstract: The Transformer architecture has revolutionized artificial intelligence, yet a principled theoretical understanding of its internal mechanisms remains elusive. This paper introduces a novel analytical framework that reconceptualizes the Transformer's discrete, layered structure as a continuous spatiotemporal dynamical system governed by a master Partial Differential Equation (PDE). Within this paradigm, we map core architectural components to distinct mathematical operators: self-attention as a non-local interaction, the feed-forward network as a local reaction, and, critically, residual connections and layer normalization as indispensable stabilization mechanisms. We do not propose a new model, but rather employ the PDE system as a theoretical probe to analyze the mathematical necessity of these components. By comparing a standard Transformer with a PDE simulator that lacks explicit stabilizers, our experiments provide compelling empirical evidence for our central thesis. We demonstrate that without residual connections, the system suffers from catastrophic representational drift, while the absence of layer normalization leads to unstable, explosive training dynamics. Our findings reveal that these seemingly heuristic "tricks" are, in fact, fundamental mathematical stabilizers required to tame an otherwise powerful but inherently unstable continuous system. This work offers a first-principles explanation for the Transformer's design and establishes a new paradigm for analyzing deep neural networks through the lens of continuous dynamics.

Comment: Model Architecture: PDE-based continuous dynamical system analysis of Transformer components (attention, FFN, residuals, layer norm) as stabilizers.

Relevance: 9 Novelty: 8

25. Learning without Global Backpropagation via Synergistic Information Distillation

ArXiv ID: 2510.03273

Authors: Chenhao Ye, Ming Tang

Abstract: Backpropagation (BP), while foundational to deep learning, imposes two critical scalability bottlenecks: update locking, where network modules remain idle until the entire backward pass completes, and high memory consumption due to storing activations for gradient computation. To address these limitations, we introduce Synergistic Information Distillation (SID), a novel training framework that reframes deep learning as a cascade of local cooperative refinement problems. In SID, a deep network is structured as a pipeline of modules, each imposed with a local objective to refine a probabilistic belief about the ground-truth target. This objective balances fidelity to the target with consistency to the belief from its preceding module. By decoupling the backward dependencies between modules, SID enables parallel training and hence eliminates update locking and drastically reduces memory requirements. Meanwhile, this design preserves the standard feed-forward inference pass, making SID a versatile drop-in replacement for BP. We provide a theoretical foundation, proving that SID guarantees monotonic performance improvement with network depth. Empirically, SID consistently matches or surpasses the classification accuracy of BP, exhibiting superior scalability and pronounced robustness to label noise.Code is available at: https://github.com/ychAlbert/sid-bp

Comment: High Performance Computing: training without global backprop via local synergistic distillation to remove update locking and reduce activation memory, enabling parallel module updates.

Relevance: 9 Novelty: 8

26. Paris: A Decentralized Trained Open-Weight Diffusion Model

ArXiv ID: 2510.03434

Authors: Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy

Abstract: We present Paris, the first publicly released diffusion model pre-trained entirely through decentralized computation. Paris demonstrates that high-quality text-to-image generation can be achieved without centrally coordinated infrastructure. Paris is open for research and commercial use. Paris required implementing our Distributed Diffusion Training framework from scratch. The model consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation with no gradient, parameter, or intermediate activation synchronization. Rather than requiring synchronized gradient updates across thousands of GPUs, we partition data into semantically coherent clusters where each expert independently optimizes its subset while collectively approximating the full distribution. A lightweight transformer router dynamically selects appropriate experts at inference, achieving generation quality comparable to centrally coordinated baselines. Eliminating synchronization enables training on heterogeneous hardware without specialized interconnects. Empirical validation confirms that Paris's decentralized training maintains generation quality while removing the dedicated GPU cluster requirement for large-scale diffusion models. Paris achieves this using 14$\times$ less training data and 16$\times$ less compute than the prior decentralized baseline.

Comment: Matches Model Architecture (MoE) and High Performance/Systems: decentralized training of independent experts with a router, eliminating synchronization.

Relevance: 9 Novelty: 8

27. Composite Optimization with Error Feedback: the Dual Averaging Approach

ArXiv ID: 2510.03507

Authors: Yuan Gao, Anton Rodomanov, Jeremy Rack, Sebastian Stich

Abstract: Communication efficiency is a central challenge in distributed machine learning training, and message compression is a widely used solution. However, standard Error Feedback (EF) methods (Seide et al., 2014), though effective for smooth unconstrained optimization with compression (Karimireddy et al., 2019), fail in the broader and practically important setting of composite optimization, which captures, e.g., objectives consisting of a smooth loss combined with a non-smooth regularizer or constraints. The theoretical foundation and behavior of EF in the context of the general composite setting remain largely unexplored. In this work, we consider composite optimization with EF. We point out that the basic EF mechanism and its analysis no longer stand when a composite part is involved. We argue that this is because of a fundamental limitation in the method and its analysis technique. We propose a novel method that combines Dual Averaging with EControl (Gao et al., 2024), a state-of-the-art variant of the EF mechanism, and achieves for the first time a strong convergence analysis for composite optimization with error feedback. Along with our new algorithm, we also provide a new and novel analysis template for inexact dual averaging method, which might be of independent interest. We also provide experimental results to complement our theoretical findings.

Comment: Matches Model Compression and Efficiency and High-Performance Computing: communication-efficient distributed training with compression via a new EF–Dual Averaging method and convergence analysis for composite objectives.

Relevance: 9 Novelty: 8

28. Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts

ArXiv ID: 2510.05040

Authors: Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi

Abstract: Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semiautoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By doing a majority vote over diverse block-sized generation paths, HEX robustly avoids failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX even yields significant gains on MATH benchmark from 16.40% to 40.00%, scientific reasoning on ARC-C from 54.18% to 87.80%, and TruthfulQA from 28.36% to 57.46%. Our results establish a new paradigm for test-time scaling in diffusion-based LLMs (dLLMs), revealing that the sequence in which masking is performed plays a critical role in determining performance during inference.

Comment: Matches Model Architecture and Efficiency: reveals implicit Mixture-of-Experts–like specialization in diffusion LLMs and proposes a training-free test-time ensembling method (HEX) across generation schedules.

Relevance: 9 Novelty: 8

29. Arithmetic-Mean $\mu$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets

ArXiv ID: 2510.04327

Authors: Haosong Zhang, Shenxi Wu, Yichi Zhang, Wei Lin

Abstract: Choosing an appropriate learning rate remains a key challenge in scaling depth of modern deep networks. The classical maximal update parameterization ($\mu$P) enforces a fixed per-layer update magnitude, which is well suited to homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in heterogeneous architectures where residual accumulation and convolutions introduce imbalance across layers. We introduce Arithmetic-Mean $\mu$P (AM-$\mu$P), which constrains not each individual layer but the network-wide average one-step pre-activation second moment to a constant scale. Combined with a residual-aware He fan-in initialization - scaling residual-branch weights by the number of blocks ($\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$) - AM-$\mu$P yields width-robust depth laws that transfer consistently across depths. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects are constant-level as $N\gg k$. For standard residual networks with general conv+MLP blocks, we establish $\eta^\star(L)=\Theta(L^{-3/2})$, with $L$ the minimal depth. Empirical results across a range of depths confirm the $-3/2$ scaling law and enable zero-shot learning-rate transfer, providing a unified and practical LR principle for convolutional and deep residual networks without additional tuning overhead.

Comment: Matches Training Dynamics/Initialization and scaling laws: introduces AM-μP with provable learning-rate depth scaling (L^{-3/2}) for CNNs/ResNets enabling zero-shot LR transfer.

Relevance: 9 Novelty: 8

30. Platonic Transformers: A Solid Choice For Equivariance

ArXiv ID: 2510.03511

Authors: Mohammad Mohaiminul Islam, Rishabh Anand, David R. Wessels, Friso de Kruiff, Thijs P. Kuipers, Rex Ying, Clara I. S\'anchez, Sharvaree Vadgama, Georg B\"okman, Erik J. Bekkers

Abstract: While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

Comment: Model Architecture: introduces an equivariant Transformer via Platonic group–based attention and weight sharing; formally equivalent to dynamic group convolution and includes a linear-time convolutional variant (Efficiency).

Relevance: 9 Novelty: 8

31. Decrypt Modality Gap in Multimodal Contrastive Learning: From Convergent Representation to Pair Alignment

ArXiv ID: 2510.03268

Authors: Lingjie Yi, Raphael Douady, Chao Chen

Abstract: Multimodal contrastive learning (MCL) aims to embed data from different modalities in a shared embedding space. However, empirical evidence shows that representations from different modalities occupy completely separate regions of embedding space, a phenomenon referred to as the modality gap. Moreover, experimental findings on how the size of the modality gap influences downstream performance are inconsistent. These observations raise two key questions: (1) What causes the modality gap? (2) How does it affect downstream tasks? To address these questions, this paper introduces the first theoretical framework for analyzing the convergent optimal representations of MCL and the modality alignment when training is optimized. Specifically, we prove that without any constraint or under the cone constraint, the modality gap converges to zero. Under the subspace constraint (i.e., representations of two modalities fall into two distinct hyperplanes due to dimension collapse), the modality gap converges to the smallest angle between the two hyperplanes. This result identifies \emph{dimension collapse} as the fundamental origin of the modality gap. Furthermore, our theorems demonstrate that paired samples cannot be perfectly aligned under the subspace constraint. The modality gap influences downstream performance by affecting the alignment between sample pairs. We prove that, in this case, perfect alignment between two modalities can still be achieved via two ways: hyperplane rotation and shared space projection.

Comment: Matches Representation Learning: first theoretical framework explaining modality gap in multimodal contrastive learning via dimension collapse and alignment theory.

Relevance: 9 Novelty: 8

32. Provable Affine Identifiability of Nonlinear CCA under Latent Distributional Priors

ArXiv ID: 2510.04758

Authors: Zhiwei Han, Stefan Matthes, Hao Shen

Abstract: In this work, we establish conditions under which nonlinear CCA recovers the ground-truth latent factors up to an orthogonal transform after whitening. Building on the classical result that linear mappings maximize canonical correlations under Gaussian priors, we prove affine identifiability for a broad class of latent distributions in the population setting. Central to our proof is a reparameterization result that transports the analysis from observation space to source space, where identifiability becomes tractable. We further show that whitening is essential for ensuring boundedness and well-conditioning, thereby underpinning identifiability. Beyond the population setting, we prove that ridge-regularized empirical CCA converges to its population counterpart, transferring these guarantees to the finite-sample regime. Experiments on a controlled synthetic dataset and a rendered image dataset validate our theory and demonstrate the necessity of its assumptions through systematic ablations.

Comment: Representation Learning: proves affine identifiability for nonlinear CCA under latent priors, with whitening necessity and finite-sample convergence guarantees.

Relevance: 9 Novelty: 8

33. Expand Neurons, Not Parameters

ArXiv ID: 2510.04500

Authors: Linghao Kong, Inimai Subramanian, Yonadav Shavit, Micah Adler, Dan Alistarh, Nir Shavit

Abstract: This work demonstrates how increasing the number of neurons in a network without increasing its number of non-zero parameters improves performance. We show that this gain corresponds with a decrease in interference between multiple features that would otherwise share the same neurons. To reduce such entanglement at a fixed non-zero parameter count, we introduce Fixed Parameter Expansion (FPE): replace a neuron with multiple children and partition the parent's weights disjointly across them, so that each child inherits a non-overlapping subset of connections. On symbolic tasks, specifically Boolean code problems, clause-aligned FPE systematically reduces polysemanticity metrics and yields higher task accuracy. Notably, random splits of neuron weights approximate these gains, indicating that reduced collisions, not precise assignment, are a primary driver. Consistent with the superposition hypothesis, the benefits of FPE grow with increasing interference: when polysemantic load is high, accuracy improvements are the largest. Transferring these insights to real models (classifiers over CLIP embeddings and deeper multilayer networks) we find that widening networks while maintaining a constant non-zero parameter count consistently increases accuracy. These results identify an interpretability-grounded mechanism to leverage width against superposition, improving performance without increasing the number of non-zero parameters. Such a direction is well matched to modern accelerators, where memory movement of non-zero parameters, rather than raw compute, is the dominant bottleneck.

Comment: Model Architecture and Efficiency—Fixed Parameter Expansion widens networks at constant non-zero parameters to reduce polysemanticity and improve accuracy.

Relevance: 9 Novelty: 8

34. What Scales in Cross-Entropy Scaling Law?

ArXiv ID: 2510.04067

Authors: Junxi Yan, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu

Abstract: The cross-entropy scaling law has long served as a key tool for guiding the development of large language models. It shows that cross-entropy loss decreases in a predictable power-law rate as the model size increases. However, recent evidence indicates that this law breaks down at very large scales: the loss decreases more slowly than expected, which causes significant trouble for developing large language models. In this paper, we hypothesize that the root cause lies in the fact that cross-entropy itself does not truly scale; instead, only one of its hidden components does. To investigate this, we introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. We show both theoretically and empirically that this decomposition precisely captures the training dynamics and optimization objectives. Through extensive experiments on multiple datasets and 32 models spanning five orders of magnitude in size, we find that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant. Moreover, error-entropy constitutes the dominant share of cross-entropy in small models but diminishes in proportion as models grow larger. This explains why the cross-entropy scaling law appears accurate at small scales but fails at very large ones. Our findings establish the error-entropy scaling law as a more accurate description of model behavior. We believe it will have wide applications in the training, understanding, and future development of large language models.

Comment: Representation Learning/Training Dynamics: theoretical decomposition of cross-entropy into error-entropy/self-alignment/confidence, identifying error-entropy as the true scaling component.

Relevance: 9 Novelty: 8

35. Understanding the Role of Training Data in Test-Time Scaling

ArXiv ID: 2510.03605

Authors: Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni

Abstract: Test-time scaling improves the reasoning capabilities of large language models (LLMs) by allocating extra compute to generate longer Chains-of-Thoughts (CoTs). This enables models to tackle more complex problem by breaking them down into additional steps, backtracking, and correcting mistakes. Despite its strong performance--demonstrated by OpenAI's o1 and DeepSeek R1, the conditions in the training data under which long CoTs emerge, and when such long CoTs improve the performance, remain unclear. In this paper, we study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. Our analysis provides a theoretical explanation for several intriguing observations: First, at any fixed test error, increasing test-time compute allows us to reduce the number of in-context examples (context length) in training prompts. Second, if the skills required to solve a downstream task are not sufficiently present in the training data, increasing test-time compute can harm performance. Finally, we characterize task hardness via the smallest eigenvalue of its feature covariance matrix and show that training on a diverse, relevant, and hard set of tasks results in best performance for test-time scaling. We confirm our findings with experiments on large, nonlinear transformer architectures.

Comment: Representation Learning/Training Dynamics: theoretical analysis of test-time scaling for transformers, linking training data properties to benefits of long chain-of-thought.

Relevance: 9 Novelty: 8

36. HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks

ArXiv ID: 2510.04295

Authors: Nghiem T. Diep, Dung Le, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho

Abstract: Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique that adapts large pre-trained models by adding low-rank matrices to their weight updates. However, in the context of fine-tuning multi-head self-attention (MHA), LoRA has been employed to adapt each attention head separately, thereby overlooking potential synergies across different heads. To mitigate this issue, we propose a novel Hyper-shared Low-Rank Adaptation (HoRA) method, which utilizes joint hypernetworks to generate low-rank matrices across attention heads. By coupling their adaptation through a shared generator, HoRA encourages cross-head information sharing, and thus directly addresses the aforementioned limitation of LoRA. By comparing LoRA and HoRA through the lens of hierarchical mixture of experts, our theoretical findings reveal that the latter achieves superior sample efficiency to the former. Furthermore, through extensive experiments across diverse language and vision benchmarks, we demonstrate that HoRA outperforms LoRA and other PEFT methods while requiring only a marginal increase in the number of trainable parameters.

Comment: PEFT/Low-Rank: cross-head shared low-rank adapters generated by joint hypernetworks; theoretical sample-efficiency gains via a hierarchical MoE perspective.

Relevance: 9 Novelty: 8

37. Distributed Low-Communication Training with Decoupled Momentum Optimization

ArXiv ID: 2510.03371

Authors: Sasho Nedelkoski, Alexander Acker, Odej Kao, Soeren Becker, Dominik Scheinert

Abstract: The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.

Comment: High Performance Computing: reduces communication via decoupled momentum optimization and DCT-based momentum compression with infrequent syncs for distributed training.

Relevance: 9 Novelty: 7

38. DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks

ArXiv ID: 2510.04331

Authors: Nghiem T. Diep, Hien Dang, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho

Abstract: Parameter-efficient fine-tuning (PEFT) methods have become the standard paradigm for adapting large-scale models. Among these techniques, Weight-Decomposed Low-Rank Adaptation (DoRA) has been shown to improve both the learning capacity and training stability of the vanilla Low-Rank Adaptation (LoRA) method by explicitly decomposing pre-trained weights into magnitude and directional components. In this work, we propose DoRAN, a new variant of DoRA designed to further stabilize training and boost the sample efficiency of DoRA. Our approach includes two key stages: (i) injecting noise into the denominator of DoRA's weight decomposition, which serves as an adaptive regularizer to mitigate instabilities; and (ii) replacing static low-rank matrices with auxiliary networks that generate them dynamically, enabling parameter coupling across layers and yielding better sample efficiency in both theory and practice. Comprehensive experiments on vision and language benchmarks show that DoRAN consistently outperforms LoRA, DoRA, and other PEFT baselines. These results underscore the effectiveness of combining stabilization through noise-based regularization with network-based parameter generation, offering a promising direction for robust and efficient fine-tuning of foundation models.

Comment: Model Compression and Efficiency: stabilizes and enhances low-rank adaptation (DoRA) via noise injection and auxiliary networks that generate low-rank factors, improving PEFT.

Relevance: 9 Novelty: 7

39. Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

ArXiv ID: 2510.03659

Authors: Xu Wang, Yan Hu, Benyou Wang, Difan Zou

Abstract: Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective model behavior steering. Yet, a fundamental question remains unanswered: does higher interpretability indeed imply better steering utility? To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels, and evaluate their interpretability and steering utility based on SAEBench (arXiv:2501.12345) and AxBench (arXiv:2502.23456) respectively, and perform a rank-agreement analysis via Kendall's rank coefficients (tau b). Our analysis reveals only a relatively weak positive association (tau b approx 0.298), indicating that interpretability is an insufficient proxy for steering performance. We conjecture the interpretability utility gap may stem from the selection of SAE features, as not all of them are equally effective for steering. To further find features that truly steer the behavior of LLMs, we propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next token distribution. We show that our method improves the steering performance of three LLMs by 52.52 percent compared to the current best output score based criterion (arXiv:2503.34567). Strikingly, after selecting features with high Delta Token Confidence, the correlation between interpretability and utility vanishes (tau b approx 0), and can even become negative. This further highlights the divergence between interpretability and utility for the most effective steering features.

Comment: Matches Representation Learning and Sparse methods: analyzes Sparse Autoencoders’ interpretability vs. steering utility and proposes Delta Token Confidence for feature selection.

Relevance: 9 Novelty: 7

40. Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models

ArXiv ID: 2510.03274

Authors: Tianao Zhang, Zhiteng Li, Xianglong Yan, Haotong Qin, Yong Guo, Yulun Zhang

Abstract: Diffusion large language models (dLLMs), which offer bidirectional context and flexible masked-denoising generation, are emerging as a compelling alternative to autoregressive (AR) LLMs. However, like AR LLMs, their model sizes continue to grow, motivating weight compression for deployment. Although post-training quantization (PTQ) is effective for AR LLMs, directly transferring it to dLLMs at 2-bit leads to unsatisfactory performance. To tackle these challenges, we propose Quant-dLLM, an ultra-low-bit PTQ framework tailored to dLLMs. Since masked-denoising activations in dLLMs differ from the fully visible signals assumed by standard PTQ methods, we introduce Masked Calibration Simulation (MCS) to align calibration with the timestep-dependent masking, which yields more reliable calibrations. Moreover, we propose a Data-aware Any-order Quantizer (DAQ) that learns ultra-low-bit weight representations via an optimization algorithm. It performs iterative approximation guided by our simulated calibration data. In addition, under a strict 2-bit budget, we introduce Adaptive Blockwise Mixed Precision (ABMP), a sensitivity-based precision allocation scheme that adaptively assigns bit width across channel groups. When restricted to 2-bit precision, Quant-dLLM consistently achieves higher accuracy than state-of-the-art (SOTA) AR-transfer PTQ methods on dLLMs. The code and models will be available at: https://github.com/ZTA2785/Quant-dLLM.

Comment: Matches Model Compression and Efficiency: proposes ultra-low-bit (2-bit) post-training quantization tailored to diffusion LLMs with masked calibration simulation and adaptive blockwise mixed precision.

Relevance: 9 Novelty: 7

41. Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models

ArXiv ID: 2510.03339

Authors: Sofiane Ennadir, Levente Z\'olyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao

Abstract: Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for downstream tasks. While much of the literature has focused on attention mechanisms, the role of pooling remains underexplored despite its critical impact on model behavior. In this paper, we introduce a theoretical framework that rigorously characterizes the expressivity of Transformer-based models equipped with widely used pooling methods by deriving closed-form bounds on their representational capacity and the ability to distinguish similar inputs. Our analysis extends to different variations of attention formulations, demonstrating that these bounds hold across diverse architectural variants. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks. This work positions pooling as a key architectural component in Transformer models and lays the foundation for more principled model design beyond attention alone.

Comment: Strongly matches Model Architecture: theoretical expressivity bounds and analysis of pooling mechanisms in Transformers, offering principled architectural guidance.

Relevance: 9 Novelty: 7

42. Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy

ArXiv ID: 2510.04285

Authors: Karthik Viswanathan, Sang Eon Park

Abstract: We introduce a cumulant-expansion framework for quantifying how large language models (LLMs) internalize higher-order statistical structure during next-token prediction. By treating the softmax entropy of each layer's logit distribution as a perturbation around its "center" distribution, we derive closed-form cumulant observables that isolate successively higher-order correlations. Empirically, we track these cumulants in GPT-2 and Pythia models on Pile-10K prompts. (i) Structured prompts exhibit a characteristic rise-and-plateau profile across layers, whereas token-shuffled prompts remain flat, revealing the dependence of the cumulant profile on meaningful context. (ii) During training, all cumulants increase monotonically before saturating, directly visualizing the model's progression from capturing variance to learning skew, kurtosis, and higher-order statistical structures. (iii) Mathematical prompts show distinct cumulant signatures compared to general text, quantifying how models employ fundamentally different processing mechanisms for mathematical versus linguistic content. Together, these results establish cumulant analysis as a lightweight, mathematically grounded probe of feature-learning dynamics in high-dimensional neural networks.

Comment: Representation Learning: introduces cumulant-expansion probes of softmax entropy to quantify higher-order feature-learning dynamics across layers and training.

Relevance: 9 Novelty: 7

43. Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices

ArXiv ID: 2510.03425

Authors: Congzheng Song, Xinyu Tang

Abstract: Fine-tuning large language models (LLMs) with backpropagation\textemdash even for a subset of parameters such as LoRA\textemdash can be much more memory-consuming than inference and is often deemed impractical for resource-constrained mobile devices. Alternative methods, such as zeroth-order optimization (ZO), can greatly reduce the memory footprint but come at the cost of significantly slower model convergence (10$\times$ to 100$\times$ more steps than backpropagation). We propose a memory-efficient implementation of backpropagation (MeBP) on mobile devices that provides better trade-off between memory usage and compute time, while converging faster and achieving better performance than the ZO baseline. We verify the effectiveness of MeBP on an iPhone 15 Pro Max and show that various LLMs, ranging from 0.5B to 4B parameters, can be fine-tuned using less than 1GB of memory. We release an example of the MeBP implementation at https://github.com/apple/ml-mebp.

Comment: High Performance Computing: memory-efficient backpropagation enabling on-device fine-tuning of LLMs (<1GB), a systems-level memory optimization for training.

Relevance: 9 Novelty: 7

ArXiv ID: 2510.03346

Authors: Xiangyu Shi, Marco Chiesa, Gerald Q. Maguire Jr., Dejan Kostic

Abstract: Large Language Models (LLMs) are increasingly deployed in multi-agent systems, where effective inter-model communication is crucial. Existing communication protocols either rely on natural language, incurring high inference costs and information loss, or on hidden states, which suffer from information concentration bias and inefficiency. To address these limitations, we propose KVComm, a novel communication framework that enables efficient communication between LLMs through selective sharing of KV pairs. KVComm leverages the rich information encoded in the KV pairs while avoiding the pitfalls of hidden states. We introduce a KV layer-wise selection strategy based on attention importance scores with a Gaussian prior to identify the most informative KV pairs for communication. Extensive experiments across diverse tasks and model pairs demonstrate that KVComm achieves comparable performance to the upper-bound method, which directly merges inputs to one model without any communication, while transmitting as few as 30\% of layers' KV pairs. Our study highlights the potential of KV pairs as an effective medium for inter-LLM communication, paving the way for scalable and efficient multi-agent systems.

Comment: Compression/Efficiency and Systems: selective KV sharing based on attention-importance with Gaussian prior reduces inter-LLM communication while retaining performance.

Relevance: 9 Novelty: 7

45. Optimal Scaling Needs Optimal Norm

ArXiv ID: 2510.03871

Authors: Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim

Abstract: Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.

Comment: Matches High Performance Computing/training dynamics: discovers an operator-norm invariance governing optimal LR/batch scaling for LLM training and reports distributed Scion implementation and large-scale scaling rules.

Relevance: 8 Novelty: 8

46. Sharp Lower Bounds for Linearized ReLU^k Approximation on the Sphere

ArXiv ID: 2510.04060

Authors: Tong Mao, Jinchao Xu

Abstract: We prove a saturation theorem for linearized shallow ReLU$^k$ neural networks on the unit sphere $\mathbb S^d$. For any antipodally quasi-uniform set of centers, if the target function has smoothness $r>\tfrac{d+2k+1}{2}$, then the best $\mathcal{L}^2(\mathbb S^d)$ approximation cannot converge faster than order $n^{-\frac{d+2k+1}{2d}}$. This lower bound matches existing upper bounds, thereby establishing the exact saturation order $\tfrac{d+2k+1}{2d}$ for such networks. Our results place linearized neural-network approximation firmly within the classical saturation framework and show that, although ReLU$^k$ networks outperform finite elements under equal degrees $k$, this advantage is intrinsically limited.

Comment: Model Architecture / Representation Learning: theoretical saturation bounds for linearized shallow ReLU^k networks, analyzing approximation capacity of the architecture.

Relevance: 8 Novelty: 8

47. How does the optimizer implicitly bias the model merging loss landscape?

ArXiv ID: 2510.04686

Authors: Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw

Abstract: Model merging methods combine models with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which linearly interpolates between model weights, and task arithmetic, which combines task vectors obtained by the difference between finetuned and base models. While useful in practice, what properties make merging effective are poorly understood. This paper explores how the optimization process affects the loss landscape geometry and its impact on merging success. We show that a single quantity -- the effective noise scale -- unifies the impact of optimizer and data choices on model merging. Across architectures and datasets, the effectiveness of merging success is a non-monotonic function of effective noise, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale, exhibiting the same qualitative trend. Unlike prior work that connects optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its downstream consequences for model merging, suggesting the possibility of further manipulating the training dynamics to improve merging effectiveness.

Comment: Representation Learning/Training Dynamics: shows how optimizer-induced effective noise shapes the global loss landscape and predicts model merging success.

Relevance: 8 Novelty: 8

48. Decision Potential Surface: A Theoretical and Practical Approximation of LLM's Decision Boundary

ArXiv ID: 2510.03271

Authors: Zi Liang, Zhiyao Wu, Haoyang Shang, Yulin Jin, Qingqing Ye, Huadi Zheng, Peizhao Hu, Haibo Hu

Abstract: Decision boundary, the subspace of inputs where a machine learning model assigns equal classification probabilities to two classes, is pivotal in revealing core model properties and interpreting behaviors. While analyzing the decision boundary of large language models (LLMs) has raised increasing attention recently, constructing it for mainstream LLMs remains computationally infeasible due to the enormous vocabulary-sequence sizes and the auto-regressive nature of LLMs. To address this issue, in this paper we propose Decision Potential Surface (DPS), a new notion for analyzing LLM decision boundary. DPS is defined on the confidences in distinguishing different sampling sequences for each input, which naturally captures the potential of decision boundary. We prove that the zero-height isohypse in DPS is equivalent to the decision boundary of an LLM, with enclosed regions representing decision regions. By leveraging DPS, for the first time in the literature, we propose an approximate decision boundary construction algorithm, namely $K$-DPS, which only requires K-finite times of sequence sampling to approximate an LLM's decision boundary with negligible error. We theoretically derive the upper bounds for the absolute error, expected error, and the error concentration between K-DPS and the ideal DPS, demonstrating that such errors can be trade-off with sampling times. Our results are empirically validated by extensive experiments across various LLMs and corpora.

Comment: Representation Learning: introduces Decision Potential Surface to approximate LLM decision boundaries with provable error bounds via K-sampling.

Relevance: 8 Novelty: 8

49. From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models

ArXiv ID: 2510.05095

Authors: Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia

Abstract: Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers, yielding strong gains on multi-step and mathematical tasks. Yet aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored. The statistically correct objective for preference alignment requires marginalizing over reasoning traces, but this computation is intractable in practice. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from stochastic trace sampling. To address this challenge, we frame preference optimization for LRMs through the lens of the bias--variance trade-off and propose Bias--Variance Optimized Preference Optimization (BVPO), a simple, drop-in method that mixes two gradient estimators: a high-variance trace-based estimator and a low-variance empty-trace estimator obtained by disabling reasoning trace generation. Our theory shows that BVPO strictly reduces trace-induced variance for any nontrivial mixture, provides a closed-form choice of the mixing weight that minimizes mean-squared error relative to the true marginal gradient, and under standard smoothness and step-size conditions, tightens classical convergence bounds for stochastic gradient descent. Empirically, BVPO improves alignment over the best baseline by up to 7.8 points on AlpacaEval~2 and 6.8 points on Arena-Hard. Despite being trained only on general conversational data, BVPO also boosts reasoning performance for base models by up to 4.0 points on the average of six math reasoning benchmarks. These results identify variance from trace sampling as a key bottleneck and demonstrate that directly optimizing the bias--variance trade-off yields more stable training and stronger overall performance.

Comment: Matches Representation Learning/training dynamics: a variance-optimized preference optimization method with theory for aligning large reasoning models.

Relevance: 8 Novelty: 8

50. Internal states before wait modulate reasoning patterns

ArXiv ID: 2510.04128

Authors: Dmitrii Troitskii, Koyena Pal, Chris Wendler, Callum Stuart McDougall, Neel Nanda

Abstract: Prior work has shown that a significant driver of performance in reasoning models is their ability to reason and self-correct. A distinctive marker in these reasoning traces is the token wait, which often signals reasoning behavior such as backtracking. Despite being such a complex behavior, little is understood of exactly why models do or do not decide to reason in this particular manner, which limits our understanding of what makes a reasoning model so effective. In this work, we address the question whether model's latents preceding wait tokens contain relevant information for modulating the subsequent reasoning process. We train crosscoders at multiple layers of DeepSeek-R1-Distill-Llama-8B and its base version, and introduce a latent attribution technique in the crosscoder setting. We locate a small set of features relevant for promoting/suppressing wait tokens' probabilities. Finally, through a targeted series of experiments analyzing max activating examples and causal interventions, we show that many of our identified features indeed are relevant for the reasoning process and give rise to different types of reasoning patterns such as restarting from the beginning, recalling prior knowledge, expressing uncertainty, and double-checking.

Comment: Matches Representation Learning/mechanistic interpretability: identifies latent features that modulate ‘wait’ tokens and causally links them to reasoning patterns in transformers.