Personalized Daily ArXiv Papers 2025-09-30

[gpt-5]	Prompt	Completion	Total
Token	153021	137039	290060
Cost	$0.19	$1.37	$1.56

Total arXiv papers: 1801

Total scanned papers: 1072

Total relevant papers: 107

Table of contents with paper titles:

Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization Authors: Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions Authors: Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Florent Krzakala, Lenka Zdeborov\'a
The Impossibility of Inverse Permutation Learning in Transformer Models Authors: Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah
LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport Authors: Ashkan Shahbazi, Chayne Thrash, Yikun Bai, Keaton Hamm, Navid NaderiAlizadeh, Soheil Kolouri
Pretraining Large Language Models with NVFP4 Authors: NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu
Clebsch-Gordan Transformer: Fast and Global Equivariant Attention Authors: Owen Lewis Howell, Linfeng Zhao, Xupeng Zhu, Yaoyao Qian, Haojie Huang, Lingfeng Sun, Wil Thomason, Robert Platt, Robin Walters
On the Capacity of Self-Attention Authors: Micah Adler
Tracing the Representation Geometry of Language Models from Pretraining to Post-training Authors: Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, Blake A. Richards
Towards a Comprehensive Scaling Law of Mixture-of-Experts Authors: Guoliang Zhao, Yuhan Fu, Shuaipeng Li, Xingwu Sun, Ruobing Xie, An Wang, Weidong Han, Zhen Yang, Weixuan Sun, Yudong Zhang, Cheng-zhong Xu, Di Wang, Jie Jiang
Compute-Optimal Quantization-Aware Training Authors: Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun
ProxyAttn: Guided Sparse Attention via Representative Heads Authors: Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che
Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization Authors: Chris Kolb, Laetitia Frost, Bernd Bischl, David R\"ugamer
SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights Authors: Lorenz K. M\"uller, Philippe Bich, Jiawei Zhuang, Ahmet \c{C}elik, Luca Benfenati, Lukas Cavigelli
Tequila: Trapping-free Ternary Quantization for Large Language Models Authors: Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, Dapeng Wu
InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation Authors: Weilin Zhao, Zihan Zhou, Zhou Su, Chaojun Xiao, Yuxuan Li, Yanghao Li, Yudi Zhang, Weilun Zhao, Zhen Li, Yuxiang Huang, Ao Sun, Xu Han, Zhiyuan Liu
MoE-PHDS: One MoE checkpoint for flexible runtime sparsity Authors: Lauren. A Hannah, Soheil Zibakhsh, Kumari Nishu, Arnav Kundu, Mohammad Samragh Razlighi, Mehrdad Farajtabar, Minsik Cho
PATCH: Learnable Tile-level Hybrid Sparsity for LLMs Authors: Younes Hourri, Mohammad Mozaffari, Maryam Mehri Dehnavi
PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference Authors: Enda Yu, Zhaoning Zhang, Dezun Dong, Yongwei Wu, Xiangke Liao
Parameterized Hardness of Zonotope Containment and Neural Network Verification Authors: Vincent Froese, Moritz Grillo, Christoph Hertrich, Moritz Stargalla
Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment Authors: Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son
LLaDA-MoE: A Sparse MoE Diffusion Language Model Authors: Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen
Uni-NTFM: A Unified Foundation Model for EEG Signal Representation Learning Authors: Zhisheng Chen, Yingwei Zhang, Qizhen Lan, Tianyu Liu, Huacan Wang, Yi Ding, Ziyu Jia, Ronghao Chen, Kun Wang, Xinliang Zhou
Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms Authors: Jiahao Ying, Mingbao Lin, Qianru Sun, Yixin Cao
F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning Authors: Hangwei Zhang, Chun Kang, Yan Wang, Difan Zou
HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation Authors: Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen
AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models Authors: Jihu Guo, Tenghui Ma, Wei Gao, Peng Sun, Jiaxing Li, Xun Chen, Yuyang Jin, Dahua Lin
SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching Authors: Xinye Zhao, Spyridon Mastorakis
Short window attention enables long-term memorization Authors: Lo\"ic Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazar\'e, Gabriel Synnaeve, Herv\'e J\'egou
MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts Authors: Jiayu Liu, Zhenya Huang, Anya Sims, Enhong Chen, Yee Whye Teh, Ning Miao
Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression Authors: Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie
A multiscale analysis of mean-field transformers in the moderate interaction regime Authors: Giuseppe Bruno, Federico Pasqualotto, Andrea Agazzi
Scaling with Collapse: Efficient and Predictable Training of LLM Families Authors: Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness
Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models Authors: Jonas H\"ubotter, Patrik Wolf, Alexander Shevchenko, Dennis J\"uni, Andreas Krause, Gil Kur
Negative Pre-activations Differentiate Syntax Authors: Linghao Kong, Angelina Ning, Micah Adler, Nir Shavit
FS-KAN: Permutation Equivariant Kolmogorov-Arnold Networks via Function Sharing Authors: Ran Elbaz, Guy Bar-Shalom, Yam Eitan, Fabrizio Frasca, Haggai Maron
Signal Preserving Weight Initialization for Odd-Sigmoid Activations Authors: Hyunwoo Lee, Hayoung Choi, Hyunju Kim
Identity Bridge: Enabling Implicit Reasoning via Shared Latent Memory Authors: Pengxiao Lin, Zheng-An Chen, Zhi-Qin John Xu
Neighborhood Sampling Does Not Learn the Same Graph Neural Network Authors: Zehao Niu, Mihai Anitescu, Jie Chen
Conda: Column-Normalized Adam for Training Large Language Models Faster Authors: Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin
Statistical Learning Guarantees for Group-Invariant Barron Functions Authors: Yahong Yang, Wei Zhu
Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime Authors: Leonardo Defilippis, Yizhou Xu, Julius Girardin, Emanuele Troiani, Vittorio Erba, Lenka Zdeborov\'a, Bruno Loureiro, Florent Krzakala
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought Authors: Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
LLM Interpretability with Identifiable Temporal-Instantaneous Representation Authors: Xiangchen Song, Jiaqi Sun, Zijian Li, Yujia Zheng, Kun Zhang
High-Dimensional Analysis of Single-Layer Attention for Sparse-Token Classification Authors: Nicholas Barnfield, Hugo Cui, Yue M. Lu
CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers Authors: Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang
VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs Authors: Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee
Learning to Ponder: Adaptive Reasoning in Latent Space Authors: Yixin He, Lumingyuan Tang
Scalable GANs with Transformers Authors: Sangeek Hyun, MinKyu Lee, Jae-Pil Heo
Effective Quantization of Muon Optimizer States Authors: Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, S. Sathiya Keerthi
Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs Authors: Tanya Chowdhury, Atharva Nijasure, Yair Zick, James Allan
Query Circuits: Explaining How Language Models Answer User Prompts Authors: Tung-Yu Wu, Fazl Barez
A Second-Order Perspective on Pruning at Initialization and Knowledge Transfer Authors: Leonardo Iurada, Beatrice Occhiena, Tatiana Tommasi
Limit Analysis for Symbolic Multi-step Reasoning Tasks with Information Propagation Rules Based on Transformers Authors: Tian Qin, Yuhan Chen, Zhiwei Wang, Zhi-Qin John Xu
Measuring Sparse Autoencoder Feature Sensitivity Authors: Claire Tian, Katherine Tian, Nathan Hu
LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models Authors: Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja
Memory-Efficient Fine-Tuning via Low-Rank Activation Compression Authors: Jiang-Xin Shi, Wen-Da Wei, Jin-Fei Qi, Xuanyu Chen, Tong Wei, Yu-Feng Li
HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models Authors: Zhinan Xie, Peisong Wang, Jian Cheng
MonoCon: A general framework for learning ultra-compact high-fidelity representations using monotonicity constraints Authors: Shreyas Gokhale
Beyond Outliers: A Study of Optimizers Under Quantization Authors: Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh
Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM Authors: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang
Dense associative memory on the Bures-Wasserstein space Authors: Chandan Tankala, Krishnakumar Balasubramanian
LLM DNA: Tracing Model Evolution via Functional Representations Authors: Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He
Space Group Conditional Flow Matching Authors: Omri Puny, Yaron Lipman, Benjamin Kurt Miller
Define latent spaces by example: optimisation over the outputs of generative models Authors: Samuel Willis, Alexandru I. Stere, Dragos D. Margineantu, Henry T. Oldroyd, John A. Fozard, Carl Henrik Ek, Henry Moss, Erik Bodin
Sequential Diffusion Language Models Authors: Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang
Hyperspherical Latents Improve Continuous-Token Autoregressive Generation Authors: Guolin Ke, Hui Xue
Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning Authors: Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen
Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement Authors: Yu-Che Tsai, Kuan-Yu Chen, Yuan-Chi Li, Yuan-Hao Chen, Ching-Yu Tsai, Shou-De Lin
SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression Authors: Haoming Wen, Yushi Bai, Juanzi Li, Jie Tang
Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers Authors: Xianhang Li, Chen Huang, Chun-Liang Li, Eran Malach, Josh Susskind, Vimal Thilak, Etai Littwin
Adaptive Canonicalization with Application to Invariant Anisotropic Geometric Networks Authors: Ya-Wei Eileen Lin, Ron Levie
Model Merging Scaling Laws in Large Language Models Authors: Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang
Discrete Variational Autoencoding via Policy Search Authors: Michael Drolet, Firas Al-Hafez, Aditya Bhatt, Jan Peters, Oleg Arenz
Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding Authors: Lin Long, Changdae Oh, Seongheon Park, Yixuan Li
Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models Authors: Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, Jun Yu
Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs Authors: Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo
Understanding and Enhancing the Planning Capability of Language Models via Multi-Token Prediction Authors: Qimin Zhong, Hao Liao, Siwei Wang, Mingyang Zhou, Xiaoqun Wu, Rui Mao, Wei Chen
Concept activation vectors: a unifying view and adversarial attacks Authors: Ekkehard Schnoor, Malik Tiomoko, Jawher Said, Alex Jung, Wojciech Samek
Why Alignment Must Precede Distillation: A Minimal Working Explanation Authors: Sungmin Cha, Kyunghyun Cho
Pretraining Scaling Laws for Generative Evaluations of Language Models Authors: Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo
DRIFT-Net: A Spectral--Coupled Neural Operator for PDEs Learning Authors: Jiayi Li, Flora D. Salim
Multi-Scale Geometric Autoencoder Authors: Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Li Shen
Dynamic Orthogonal Continual Fine-tuning for Mitigating Catastrophic Forgettings Authors: Zhixin Zhang, Zeming Wei, Meng Sun
Characteristic Root Analysis and Regularization for Linear Time Series Forecasting Authors: Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li, Tobias Schlagenhauf
Scaling LLM Test-Time Compute with Mobile NPU on Smartphones Authors: Zixu Hao, Jianyu Wei, Tuowei Wang, Minxing Huang, Huiqiang Jiang, Shiqi Jiang, Ting Cao, Ju Ren
SpecExit: Accelerating Large Reasoning Model via Speculative Exit Authors: Rubing Yang, Huajun Bai, Song Liu, Guanghua Yu, Runzhi Fan, Yanbin Dang, Jiejing Zhang, Kai Liu, Jianchen Zhu, Peng Chen
Toward Preference-aligned Large Language Models via Residual-based Model Steering Authors: Lucio La Cava, Andrea Tagarelli
Sparse Deep Additive Model with Interactions: Enhancing Interpretability and Predictability Authors: Yi-Ting Hung, Li-Hsiang Lin, Vince D. Calhoun
Semantic Compression via Multimodal Representation Learning Authors: Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello
Graph Your Own Prompt Authors: Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao
What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples? Authors: Mohammed Sabry, Anya Belz
Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models Authors: Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, Pietro Mazzaglia
Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Understanding Catastrophic Interference On the Identifibility of Latent Representations Authors: Yuke Li, Yujia Zheng, Tianyi Xiong, Zhenyi Wang, Heng Huang
Echo Flow Networks Authors: Hongbo Liu, Jia Xu
Uncovering Grounding IDs: How External Cues Shape Multi-Modal Binding Authors: Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah
Convolutional Set Transformer Authors: Federico Chinello, Giacomo Boracchi
Sketching Low-Rank Plus Diagonal Matrices Authors: Andres Fernandez, Felix Dangel, Philipp Hennig, Frank Schneider
Interpretable Kernel Representation Learning at Scale: A Unified Framework Utilizing Nystr\"om Approximation Authors: Maedeh Zarvandi, Michael Timothy, Theresa Wasserer, Debarghya Ghoshdastidar
HTMA-Net: Towards Multiplication-Avoiding Neural Networks via Hadamard Transform and In-Memory Computing Authors: Emadeldeen Hamdan, Ahmet Enis Cetin
Beyond Softmax: A Natural Parameterization for Categorical Random Variables Authors: Alessandro Manenti, Cesare Alippi
Intra-request branch orchestration for efficient LLM reasoning Authors: Weifan Jiang, Rana Shahout, Yilun Du, Michael Mitzenmacher, Minlan Yu
One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning Authors: Minh Le, Bao-Ngoc Dao, Huy Nguyen, Quyen Tran, Anh Nguyen, Nhat Ho
Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures Authors: Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini
BiHDTrans: binary hyperdimensional transformer for efficient multivariate time series classification Authors: Jingtao Zhang, Yi Liu, Qi Shen, Changhong Wang
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer Authors: Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox

1. Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

ArXiv ID: 2509.23202

Authors: Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

Abstract: The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.

Comment: Compression/Efficiency: FP4 post-training quantization tailored to MXFP4/NVFP4 (MR-GPTQ) with format-specific algorithms and high-performance GPU kernels for LLM inference.

Relevance: 10 Novelty: 9

2. Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions

ArXiv ID: 2509.24914

Authors: Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Florent Krzakala, Lenka Zdeborov\'a

Abstract: We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks, given by the recently introduced attention-indexed model. Using tools from random matrix theory, spin-glass physics, and approximate message passing, we derive sharp asymptotics for training and test errors, locate interpolation and recovery thresholds, and characterize the limiting spectral distribution of the learned weights. Weight decay induces an implicit nuclear-norm regularization, favoring low-rank query and key matrices. Leveraging this, we compare the standard factorized training of query and key matrices with a direct parameterization in which their product is trained element-wise, revealing the inductive bias introduced by the factorized form. Remarkably, the predicted spectral distribution echoes empirical trends reported in large-scale transformers, offering a theoretical perspective consistent with these phenomena.

Comment: Architecture analysis: theoretical study of single-head attention’s inductive bias and spectral properties, revealing implicit low-rank regularization via weight decay.

Relevance: 10 Novelty: 9

3. The Impossibility of Inverse Permutation Learning in Transformer Models

ArXiv ID: 2509.24125

Authors: Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah

Abstract: In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original (canonical'') string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input withscratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate ``thinking'' tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).

Comment: Model Architecture Theory: impossibility result for inverse permutation learning in decoder-only Transformers; shows fixes via encoder-decoder or scratch tokens.

Relevance: 10 Novelty: 9

4. LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport

ArXiv ID: 2509.23436

Authors: Ashkan Shahbazi, Chayne Thrash, Yikun Bai, Keaton Hamm, Navid NaderiAlizadeh, Soheil Kolouri

Abstract: Transformers have proven highly effective across a wide range of modalities. However, the quadratic complexity of the standard softmax attention mechanism poses a fundamental barrier to scaling them to long context windows. A large body of work addresses this with linear attention, which reformulates attention as a kernel function and approximates it with finite feature maps to achieve linear-time computation. Orthogonal to computational scaling, most attention mechanisms -- both quadratic and linear -- produce row-normalized maps that can over-focus on a few tokens, degrading robustness and information flow. Enforcing doubly-stochastic attention alleviates this by balancing token participation across rows and columns, but existing doubly-stochastic attention mechanisms typically introduce substantial overhead, undermining scalability. We propose LOTFormer, a principled attention mechanism that is simultaneously linear-time and doubly-stochastic. Our approach exploits the connection between attention maps and transportation plans between query and key measures. The central idea is to constrain the transport plan to be low-rank by conditioning it on a learnable pivot measure with small support. Concretely, we solve two entropic optimal transport problems (queries $\to$ pivot and pivot $\to$ keys) and compose them into a conditional (glued) coupling. This yields an attention matrix that is provably doubly-stochastic, has rank at most $r \ll n$, and applies to values in $O(nr)$ time without forming the full $n \times n$ map. The pivot locations and masses are learned end-to-end. Empirically, LOTFormer achieves state-of-the-art results on the Long Range Arena benchmark, surpassing prior linear and transport-based attention methods in both accuracy and efficiency.

Comment: Efficiency: linear-time attention via low-rank entropic optimal transport with provably doubly-stochastic maps and O(nr) compute.

Relevance: 10 Novelty: 9

5. Pretraining Large Language Models with NVFP4

ArXiv ID: 2509.25149

Authors: NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu

Abstract: Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens -- the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.

Comment: Model Compression and Efficiency + High Performance Computing: stable 4-bit (NVFP4) LLM pretraining using RHT, 2D quantization, stochastic rounding, and selective high-precision layers.

Relevance: 10 Novelty: 9

6. Clebsch-Gordan Transformer: Fast and Global Equivariant Attention

ArXiv ID: 2509.24093

Authors: Owen Lewis Howell, Linfeng Zhao, Xupeng Zhu, Yaoyao Qian, Haojie Huang, Lingfeng Sun, Wil Thomason, Robert Platt, Robin Walters

Abstract: The global attention mechanism is one of the keys to the success of transformer architecture, but it incurs quadratic computational costs in relation to the number of tokens. On the other hand, equivariant models, which leverage the underlying geometric structures of problem instance, often achieve superior accuracy in physical, biochemical, computer vision, and robotic tasks, at the cost of additional compute requirements. As a result, existing equivariant transformers only support low-order equivariant features and local context windows, limiting their expressiveness and performance. This work proposes Clebsch-Gordan Transformer, achieving efficient global attention by a novel Clebsch-Gordon Convolution on $\SO(3)$ irreducible representations. Our method enables equivariant modeling of features at all orders while achieving ${O}(N \log N)$ input token complexity. Additionally, the proposed method scales well with high-order irreducible features, by exploiting the sparsity of the Clebsch-Gordon matrix. Lastly, we also incorporate optional token permutation equivariance through either weight sharing or data augmentation. We benchmark our method on a diverse set of benchmarks including n-body simulation, QM9, ModelNet point cloud classification and a robotic grasping dataset, showing clear gains over existing equivariant transformers in GPU memory size, speed, and accuracy.

Comment: Matches Model Architecture and Efficiency: O(N log N) global equivariant attention via Clebsch–Gordan convolution supporting high-order features.

Relevance: 10 Novelty: 9

7. On the Capacity of Self-Attention

ArXiv ID: 2509.22840

Authors: Micah Adler

Abstract: While self-attention is known to learn relations among tokens, we lack a formal understanding of its capacity: how many distinct relations can a single layer reliably recover for a given budget? To formalize this, we introduce Relational Graph Recognition (RGR), where the key-query channel represents a graph on $m$ items with $m'$ directed edges, and, given a context of items, must recover the neighbors of each item. We measure resources by the total key dimension $D_K = h\,d_k$. Within this framework, we analytically derive a capacity scaling law and validate it empirically. We show that $D_K = \Theta(m' \log m' / d_{\text{model}})$ is both necessary (information-theoretic lower bound) and sufficient (explicit construction) in a broad class of graphs to recover $m'$ relations. This scaling law directly leads to a new, capacity-based rationale for multi-head attention that applies even when each item only attends to a single target. When embeddings are uncompressed ($m = d_{\text{model}}$) and the graph is a permutation, a single head suffices. However, compression ($m > d_{\text{model}}$) forces relations into overlapping subspaces, creating interference that a single large head cannot disentangle. Our analysis shows that allocating a fixed $D_K$ across many small heads mitigates this interference, increasing the number of recoverable relations. Controlled single-layer experiments mirror the theory, revealing a sharp performance threshold that matches the predicted capacity scaling and confirms the benefit of distributing $D_K$ across multiple heads. Altogether, these results provide a concrete scaling law for self-attention capacity and a principled design rule for allocating key-query budget across heads.

Comment: Matches Model Architecture (theory): capacity scaling law for self-attention and principled multi-head budget allocation.

Relevance: 10 Novelty: 9

8. Tracing the Representation Geometry of Language Models from Pretraining to Post-training

ArXiv ID: 2509.23024

Authors: Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, Blake A. Richards

Abstract: Standard training metrics like loss fail to explain the emergence of complex capabilities in large language models. We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training, measuring effective rank (RankMe) and eigenspectrum decay ($\alpha$-ReQ). With OLMo (1B-7B) and Pythia (160M-12B) models, we uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. The initial "warmup" phase exhibits rapid representational collapse. This is followed by an "entropy-seeking" phase, where the manifold's dimensionality expands substantially, coinciding with peak n-gram memorization. Subsequently, a "compression-seeking" phase imposes anisotropic consolidation, selectively preserving variance along dominant eigendirections while contracting others, a transition marked with significant improvement in downstream task performance. We show these phases can emerge from a fundamental interplay of cross-entropy optimization under skewed token frequencies and representational bottlenecks ($d \ll |V|$). Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data, improving in-distribution performance while degrading out-of-distribution robustness. Conversely, RLVR induces "compression-seeking", enhancing reward alignment but reducing generation diversity.

Comment: Matches Representation Learning criterion: spectral geometry (effective rank, eigenspectrum decay) tracing phases from pretraining to post-training.

Relevance: 10 Novelty: 9

9. Towards a Comprehensive Scaling Law of Mixture-of-Experts

ArXiv ID: 2509.23678

Authors: Guoliang Zhao, Yuhan Fu, Shuaipeng Li, Xingwu Sun, Ruobing Xie, An Wang, Weidong Han, Zhen Yang, Weixuan Sun, Yudong Zhang, Cheng-zhong Xu, Di Wang, Jie Jiang

Abstract: Mixture-of-Experts (MoE) models have become the consensus approach for enabling parameter-efficient scaling and cost-effective deployment in large language models. However, existing scaling laws for dense models are inapplicable to MoE models, which stems from three critical challenges: the multiplicity of influencing factors, their intricate coupling relationships and the non-monotonic nature of their performance impacts. They collectively necessitate a fine-grained investigation into MoE-specific scaling laws. In this work, we perform a systematic decomposition of MoE settings, identifying five key factors that influence model performance from both size and structural perspectives (data size ($D$), total model size ($N$), activated model size ($N_a$), number of active experts ($G$) and the ratio of shared experts ($S$)). Specifically, we design $446$ controlled experiments to characterize their marginal effects, ultimately constructing a comprehensive and precise joint MoE scaling law that considers all essential factors. Furthermore, we derive the theoretically optimal and practically efficiency-aware optimal configurations for $G$, $S$ and $N_a/N$ with detailed analyses. Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and data size. With the scaling of $N$, the optimal activation parameter ratio of $N_a/N$ becomes sparser. Our proposed MoE scaling law could function as an accurate and insightful guidance to facilitate future MoE model design and training.

Comment: Matches Mixture-of-Experts + Scaling Laws: comprehensive MoE scaling law over key factors (N, Na, G, S, D) with optimal configuration guidance.

Relevance: 10 Novelty: 9

10. Compute-Optimal Quantization-Aware Training

ArXiv ID: 2509.22935

Authors: Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun

Abstract: Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.

Comment: Model Compression and Efficiency: compute allocation scaling law for QAT vs FP, tokens-per-parameter-byte predictor, and fused cooldown+QAT for efficient quantized training.

Relevance: 10 Novelty: 8

11. ProxyAttn: Guided Sparse Attention via Representative Heads

ArXiv ID: 2509.24745

Authors: Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che

Abstract: The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.

Comment: Compression/Efficiency: guided sparse attention via representative heads with dynamic budgets, yielding training-free acceleration of attention and prefill.

Relevance: 10 Novelty: 8

12. Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization

ArXiv ID: 2509.23898

Authors: Chris Kolb, Laetitia Frost, Bernd Bischl, David R\"ugamer

Abstract: Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum using non-smooth structured $L_{2,2/D}$ penalization, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$-regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.

Comment: Compression/Efficiency: differentiable structured sparsity via D-Gating, theoretically equivalent to non-smooth group penalties with improved optimization dynamics.

Relevance: 10 Novelty: 8

13. SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights

ArXiv ID: 2509.22944

Authors: Lorenz K. M\"uller, Philippe Bich, Jiawei Zhuang, Ahmet \c{C}elik, Luca Benfenati, Lukas Cavigelli

Abstract: Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.

Comment: Compression/Efficiency: calibration-free low-precision LLM quantization using Sinkhorn-normalized per-row/column scaling to reduce matrix imbalance.

Relevance: 10 Novelty: 8

14. Tequila: Trapping-free Ternary Quantization for Large Language Models

ArXiv ID: 2509.23809

Authors: Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, Dapeng Wu

Abstract: Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making it not feasible. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as deadzone trapping: a large number of weights are trapped at the deadzone boundary. This occurs because these weights receive only noisy, uninformative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within <1% gap) with a 3.0x inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim.

Comment: Compression/Efficiency: ternary quantization for LLMs with trapping-free optimization by repurposing deadzone-trapped weights as dynamic biases.

Relevance: 10 Novelty: 8

15. InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation

ArXiv ID: 2509.24663

Authors: Weilin Zhao, Zihan Zhou, Zhou Su, Chaojun Xiao, Yuxuan Li, Yanghao Li, Yudi Zhang, Weilun Zhao, Zhen Li, Yuxiang Huang, Ao Sun, Xu Han, Zhiyuan Liu

Abstract: Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional \textit{pretrain-on-short, finetune-on-long} workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce dense-sparse switchable attention framework, termed as InfLLM-V2. InfLLM-V2 is a trainable sparse attention that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths, by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4$\times$ faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (https://huggingface.co/openbmb/MiniCPM4.1-8B), a hybrid reasoning model, providing a reproducible implementation for the research community.

Comment: Trainable sparse attention with dense–sparse switch; parameter reuse enables seamless short-to-long adaptation and 4x speedups.

Relevance: 10 Novelty: 8

16. MoE-PHDS: One MoE checkpoint for flexible runtime sparsity

ArXiv ID: 2509.23012

Authors: Lauren. A Hannah, Soheil Zibakhsh, Kumari Nishu, Arnav Kundu, Mohammad Samragh Razlighi, Mehrdad Farajtabar, Minsik Cho

Abstract: Sparse Mixtures of Experts (MoEs) are typically trained to operate at a fixed sparsity level, e.g. $k$ in a top-$k$ gating function. This global sparsity level determines an operating point on the accuracy/latency curve; currently, meeting multiple efficiency targets means training and maintaining multiple models. This practice complicates serving, increases training and maintenance costs, and limits flexibility in meeting diverse latency, efficiency, and energy requirements. We show that pretrained MoEs are more robust to runtime sparsity shifts than commonly assumed, and introduce MoE-PHDS ({\bf P}ost {\bf H}oc {\bf D}eclared {\bf S}parsity), a lightweight SFT method that turns a single checkpoint into a global sparsity control surface. PHDS mixes training across sparsity levels and anchors with a short curriculum at high sparsity, requiring no architectural changes. The result is predictable accuracy/latency tradeoffs from one model: practitioners can ``dial $k$'' at inference time without swapping checkpoints, changing architecture, or relying on token-level heuristics. Experiments on OLMoE-1B-7B-0125, Qwen1.5-MoE-A2.7B, and proprietary models fit on multiple operating points show that PHDS matches or exceeds well-specified oracle models, improves cross-sparsity agreement by up to 22\% vs. well-specified oracle models, and enables simplified, flexible runtime MoE deployment by making global sparsity a first-class serving primitive.

Comment: Model Architecture and Efficiency (MoE): PHDS enables flexible runtime sparsity (dial k) from a single MoE checkpoint via lightweight SFT, giving controllable accuracy/latency trade-offs.

Relevance: 10 Novelty: 8

17. PATCH: Learnable Tile-level Hybrid Sparsity for LLMs

ArXiv ID: 2509.23410

Authors: Younes Hourri, Mohammad Mozaffari, Maryam Mehri Dehnavi

Abstract: Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.

Comment: Compression and Efficiency: hybrid tile-level sparsity mixing dense and 2:4 patterns via learnable masks for LLMs, enabling controllable sparsity with practical speedups.

Relevance: 10 Novelty: 8

18. PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference

ArXiv ID: 2509.23638

Authors: Enda Yu, Zhaoning Zhang, Dezun Dong, Yongwei Wu, Xiangke Liao

Abstract: Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware. Offloading expert weights to CPU memory results in PCIe transfer latency that exceeds GPU computation by several folds. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity. Our solution includes: 1) Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans balancing prefetching costs and loading overhead; 3) Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.

Comment: Matches Model Compression and Efficiency + High-Performance Systems for MoE inference: prediction-driven expert prefetching/scheduling (LLaPor, PreSched, AsyncIO) to overcome PCIe/memory bottlenecks.

Relevance: 10 Novelty: 8

19. Parameterized Hardness of Zonotope Containment and Neural Network Verification

ArXiv ID: 2509.22849

Authors: Vincent Froese, Moritz Grillo, Christoph Hertrich, Moritz Stargalla

Abstract: Neural networks with ReLU activations are a widely used model in machine learning. It is thus important to have a profound understanding of the properties of the functions computed by such networks. Recently, there has been increasing interest in the (parameterized) computational complexity of determining these properties. In this work, we close several gaps and resolve an open problem posted by Froese et al. [COLT '25] regarding the parameterized complexity of various problems related to network verification. In particular, we prove that deciding positivity (and thus surjectivity) of a function $f\colon\mathbb{R}^d\to\mathbb{R}$ computed by a 2-layer ReLU network is W[1]-hard when parameterized by $d$. This result also implies that zonotope (non-)containment is W[1]-hard with respect to $d$, a problem that is of independent interest in computational geometry, control theory, and robotics. Moreover, we show that approximating the maximum within any multiplicative factor in 2-layer ReLU networks, computing the $L_p$-Lipschitz constant for $p\in(0,\infty]$ in 2-layer networks, and approximating the $L_p$-Lipschitz constant in 3-layer networks are NP-hard and W[1]-hard with respect to $d$. Notably, our hardness results are the strongest known so far and imply that the naive enumeration-based methods for solving these fundamental problems are all essentially optimal under the Exponential Time Hypothesis.

Comment: Foundational theory/verification: strongest parameterized hardness results for properties of ReLU networks (e.g., positivity, Lipschitz), directly analyzing neural network architecture capabilities.

Relevance: 9 Novelty: 9

20. Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

ArXiv ID: 2509.22745

Authors: Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son

Abstract: Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs as they fail to prevent drift in harmful input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while maintaining task utility within 1% degradation and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.

Comment: Model Architecture (MoE): safety-preserving fine-tuning by aligning MoE routing weights to prevent harmful expert drift (safety routing alignment).

Relevance: 10 Novelty: 7

21. LLaDA-MoE: A Sparse MoE Diffusion Language Model

ArXiv ID: 2509.24389

Authors: Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen

Abstract: We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available at Huggingface.

Comment: Mixture-of-Experts (sparse MoE) architecture for diffusion language models; efficient inference with few active parameters.

Relevance: 10 Novelty: 7

22. Uni-NTFM: A Unified Foundation Model for EEG Signal Representation Learning

ArXiv ID: 2509.24222

Authors: Zhisheng Chen, Yingwei Zhang, Qizhen Lan, Tianyu Liu, Huacan Wang, Yi Ding, Ziyu Jia, Ronghao Chen, Kun Wang, Xinliang Zhou

Abstract: Foundation models pretrained on various and unlabeled data have demonstrated significant success in natural language and vision, but their application to electroencephalography (EEG) remains challenged due to the signal's unique properties. Existing brain foundation models that inherit architectures designed for text or images lead to three limitations in pre-training: 1) conflating time-domain waveform patterns with frequency-domain rhythmic features in a single processing stream, 2) ignoring the critical spatial topology of electrodes with different standards, and 3) reliance on the inflexible, dense network to process functionally distinct EEG patterns. To address these challenges, we introduce the Unified Neural Topological Foundation Model (Uni-NTFM), which is designed based on neuroscience principles to produce universal and interpretable representations. Uni-NTFM integrates three core innovations: 1) a decoupled architecture parallelly encodes time, frequency, and raw signal representations before performing cross-domain feature integration; 2) a topological embedding mechanism to unify electrodes from different international standards and generate structured input sequences for brain regions; and 3) a Mixture-of-Experts neural Transformer that efficiently scales model capacity by routing signal patterns to specialized subnetworks. The largest model, Uni-NTFM$_{large}$, has a record-breaking 1.9B parameters and was pretrained on over 28,000 hours of diverse EEG data via a dual-domain masked reconstruction objective. Uni-NTFM significantly outperforms existing task-specific methods and foundation models across nine distinct downstream tasks under both linear probing and fine-tuning settings, demonstrating a superior ability to learn universal representations of brain activity.

Comment: Model Architecture (MoE): Mixture-of-Experts Transformer to scale capacity via routing specialized subnetworks; decoupled time/frequency streams and topological embeddings.

Relevance: 10 Novelty: 7

23. Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms

ArXiv ID: 2509.23933

Authors: Jiahao Ying, Mingbao Lin, Qianru Sun, Yixin Cao

Abstract: Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. However, current research remains largely performance-centric, with limited understanding of its internal mechanisms, thereby constraining broader progress. In this work, we use an internal metric to investigate the mechanisms of MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. Through systematic analyses of a wide range of publicly available MoE models, we uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal while MUI reveals deeper insights; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity. Together, these results demonstrate the potential of MUI as a complementary indicator to benchmark performance, offering new insights into the capacity, dynamics, and specialization of MoE models. Our project can be found at https://yingjiahao14.github.io/MoE-MUI/.

Comment: Model Architecture (MoE): introduces an internal metric to analyze routing and expert/neuron utilization, revealing specialization and training dynamics.

Relevance: 10 Novelty: 7

24. F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning

ArXiv ID: 2509.23173

Authors: Hangwei Zhang, Chun Kang, Yan Wang, Difan Zou

Abstract: Parameter-efficient fine-tuning (PEFT) of powerful pre-trained models for complex downstream tasks has proven effective in vision and language processing, yet this paradigm remains unexplored in scientific machine learning, where the objective is to model complex physical systems. We conduct the first systematic study of PEFT for pre-trained Large Operator Models (LOMs) obtained by scaling variants of Fourier Neural Operator. First, we observe that the widely used Low-Rank Adaptation (LoRA) yields markedly poorer performance on LOMs than Adapter tuning. Then, we further theoretically establish that stacked LoRA incurs a depth-amplified lower bound on approximation error within Fourier layers, whereas adapters retain universal approximation capacity and, by concentrating parameters on energy-dominant low-frequency modes, attain exponentially decaying error with bottleneck width in the Fourier domain. Motivated by the robust empirical gains of adapters and by our theoretical characterization of PDE solutions as spectrally sparse, we introduce Frequency-Adaptive Adapter (F-Adapter). F-Adapter allocates adapter capacity based on spectral complexity, assigning higher-dimension modules to low-frequency components and lower-dimension modules to high-frequency components. Our F-Adapters establish state-of-the-art (SOTA) results on multiple challenging 3D Navier-Stokes benchmarks, markedly enhancing both generalization and spectral fidelity over LoRA and other PEFT techniques commonly used in LLMs. To the best of our knowledge, this work is the first to explore PEFT for scientific machine-learning and establishes F-Adapter as an effective paradigm for this domain.

Comment: PEFT/Model Architecture: frequency-adaptive adapters for Fourier operator models with theory on LoRA’s approximation limits vs adapters; parameter-efficient fine-tuning advances.

Relevance: 9 Novelty: 8

25. HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

ArXiv ID: 2509.23736

Authors: Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen

Abstract: In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2\% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9\% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential by a sota rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce multi-scale ViT-based tokenizer in image reconstruction and image generation. We hope our findings and designs advance the ViT-based tokenizers in visual generation tasks.

Comment: Model Architecture: multi-scale ViT-based tokenizer with scale-causal attention, improving latent representation quality for reconstruction/generation.

Relevance: 9 Novelty: 8

26. AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models

ArXiv ID: 2509.23722

Authors: Jihu Guo, Tenghui Ma, Wei Gao, Peng Sun, Jiaxing Li, Xun Chen, Yuyang Jin, Dahua Lin

Abstract: Pipeline parallelism is widely used to train large language models (LLMs). However, increasing heterogeneity in model architectures exacerbates pipeline bubbles, thereby reducing training efficiency. Existing approaches overlook the co-optimization of model partition, model placement, and workload scheduling, resulting in limited efficiency improvement or even performance degradation. To respond, we propose AdaPtis, an LLM training system that supports adaptive pipeline parallelism. First, we develop a pipeline performance model to accurately estimate training throughput. Second, AdaPtis jointly optimizes model partition, model placement, and workload scheduling policies guided by this performance model. Third, we design a unified pipeline executor that efficiently supports the execution of diverse pipeline strategies. Extensive experiments show that AdaPtis achieves an average speedup of 1.42x (up to 2.14x) over Megatron-LM I-1F1B across various LLM architectures and scales.

Comment: High Performance Computing: adaptive pipeline parallelism co-optimizing partition, placement, and scheduling guided by a performance model.

Relevance: 9 Novelty: 8

ArXiv ID: 2509.24832

Authors: Xinye Zhao, Spyridon Mastorakis

Abstract: As large language models (LLMs) continue to scale, the memory footprint of key-value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or reusing shared prefixes or frequently ocurred text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which frequently occurs in tasks such as multi-document summarization and conversational agents. We propose \textit{SemShareKV}, a KV cache sharing and compression framework that accelerates LLM inference by reusing KVCache in semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt's cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25$\times$ speedup and 42\% lower GPU memory usage with 5k tokens input, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.

Comment: Efficiency/Cache: semantic KV-cache sharing across prompts using token-level LSH and RoPE-aware matching to reduce memory and computation.

Relevance: 9 Novelty: 8

28. Short window attention enables long-term memorization

ArXiv ID: 2509.24552

Authors: Lo\"ic Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazar\'e, Gabriel Synnaeve, Herv\'e J\'egou

Abstract: Recent works show that hybrid architectures combining sliding window softmax attention layers with linear recurrent neural network (RNN) layers outperform both of these architectures taken separately. However, the impact of the window length and the interplay between softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding with SWAX is that larger sliding windows do not improve the long-context performance. In fact, short window attention encourages the model to better train the long-term memory of the xLSTM, by relying less on the softmax attention mechanism for long context-retrieval. The issue with small sliding windows is that they are detrimental for short-context tasks, which could be solved with information from moderately larger sliding windows otherwise. Therefore, we train SWAX by stochastically changing the sliding window size, forcing the model to leverage both a longer context window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular window attention both on short and long-context problems.

Comment: Model Architecture: hybrid sliding-window attention + xLSTM with stochastic window-size training; insight that short windows strengthen long-term memory.

Relevance: 9 Novelty: 8

29. MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts

ArXiv ID: 2509.25020

Authors: Jiayu Liu, Zhenya Huang, Anya Sims, Enhong Chen, Yee Whye Teh, Ning Miao

Abstract: The current paradigm for reasoning in large language models (LLMs) involves models "thinking out loud" via a sequence of tokens, known as chain-of-thought (CoT). This approach, while effective, has several significant drawbacks. Firstly, inference requires autoregressive generation of often thousands of CoT tokens, which is slow and computationally expensive. Secondly, it constrains reasoning to the discrete space of tokens, creating an information bottleneck across reasoning steps. Thirdly, it fundamentally entangles reasoning with token generation, forcing LLMs to "think while speaking," which causes potentially short-sighted reasoning. In light of these limitations, we re-imagine reasoning in LLMs and present a new paradigm: MARCOS. In our approach, rather than autoregressively generating tokens, we model reasoning as a hidden Markov chain of continuous, high-dimensional "thoughts". Each reasoning step involves a transition of the internal thoughts, where explicit reasoning steps (which may consist of hundreds of tokens) serve as observable variables, which are windows to peek into the implicit thoughts. Since this latent process is incompatible with the standard supervised learning, we further propose a two-phase variational training scheme. Our experiments on three benchmarks demonstrate that MARCOS outperforms existing continuous reasoning methods and, for the first time, achieves performance comparable to token-based CoT, even surpassing it by 4.7% on GSM8K with up to 15.7x speedup in inference. Beyond this, MARCOS offers additional advantages, such as step-level instead of token-level control over randomness, opening significant opportunities for reinforcement learning and reasoning in LLMs.

Comment: Model Architecture/Training Dynamics: replaces token CoT with a latent Markov chain of continuous thoughts and variational training for faster, decoupled reasoning.

Relevance: 9 Novelty: 8

30. Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression

ArXiv ID: 2509.23779

Authors: Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie

Abstract: State-space models (SSMs), particularly Mamba, emerge as an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate Mamba's in-context learning (ICL) capabilities competitive with Transformers, a critical capacity for large foundation models. However, theoretical understanding of Mamba's ICL remains limited, restricting deeper insights into its underlying mechanisms. Even fundamental tasks such as linear regression ICL, widely studied as a standard theoretical benchmark for Transformers, have not been thoroughly analyzed in the context of Mamba. To address this gap, we study the training dynamics of Mamba on the linear regression ICL task. By developing novel techniques tackling non-convex optimization with gradient descent related to Mamba's structure, we establish an exponential convergence rate to ICL solution, and derive a loss bound that is comparable to Transformer's. Importantly, our results reveal that Mamba can perform a variant of \textit{online gradient descent} to learn the latent function in context. This mechanism is different from that of Transformer, which is typically understood to achieve ICL through gradient descent emulation. The theoretical results are verified by experimental simulation.

Comment: Model Architecture/Training Dynamics: theoretical analysis showing Mamba performs online gradient descent for in-context linear regression.

Relevance: 9 Novelty: 8

31. A multiscale analysis of mean-field transformers in the moderate interaction regime

ArXiv ID: 2509.25040

Authors: Giuseppe Bruno, Federico Pasqualotto, Andrea Agazzi

Abstract: In this paper, we study the evolution of tokens through the depth of encoder-only transformer models at inference time by modeling them as a system of particles interacting in a mean-field way and studying the corresponding dynamics. More specifically, we consider this problem in the moderate interaction regime, where the number $N$ of tokens is large and the inverse temperature parameter $\beta$ of the model scales together with $N$. In this regime, the dynamics of the system displays a multiscale behavior: a fast phase, where the token empirical measure collapses on a low-dimensional space, an intermediate phase, where the measure further collapses into clusters, and a slow one, where such clusters sequentially merge into a single one. We provide a rigorous characterization of the limiting dynamics in each of these phases and prove convergence in the above mentioned limit, exemplifying our results with some simulations.

Comment: Transformer Theory: mean-field multiscale analysis of token dynamics through depth (fast/intermediate/slow phases).

Relevance: 9 Novelty: 8

32. Scaling with Collapse: Efficient and Predictable Training of LLM Families

ArXiv ID: 2509.25087

Authors: Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness

Abstract: Effective LLM training relies on consistency, meaning that key quantities -- such as final losses and optimal hyperparameters -- scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can collapse onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under practical scaling recipes, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, Celerity, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.

Comment: HPC/Training Efficiency: loss-curve collapse as a signature of optimal scaling; enables early diagnostics and hyperparameter tuning for LLM families.

Relevance: 9 Novelty: 8

33. Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

ArXiv ID: 2509.24510

Authors: Jonas H\"ubotter, Patrik Wolf, Alexander Shevchenko, Dennis J\"uni, Andreas Krause, Gil Kur

Abstract: Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models with most test data being in-distribution questions these explanations. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization, focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.

Comment: Representation Learning/Training Dynamics: theoretical framework explaining test-time training as specialization after generalization; supported by sparse autoencoder analyses.

Relevance: 9 Novelty: 8

34. Negative Pre-activations Differentiate Syntax

ArXiv ID: 2509.24198

Authors: Linghao Kong, Angelina Ning, Micah Adler, Nir Shavit

Abstract: A recently discovered class of entangled neurons, known as Wasserstein neurons, is disproportionately critical in large language models despite constituting only a very small fraction of the network: their targeted removal collapses the model, consistent with their unique role in differentiating similar inputs. Interestingly, in Wasserstein neurons immediately preceding smooth activation functions, such differentiation manifests in the negative pre-activation space, especially in early layers. Pairs of similar inputs are driven to highly distinct negative values, and these pairs involve syntactic tokens such as determiners and prepositions. We show that this negative region is functional rather than simply favorable for optimization. A minimal, sign-specific intervention that zeroes only the negative pre-activations of a small subset of entangled neurons significantly weakens overall model function and disrupts grammatical behavior, while both random and perplexity-matched controls leave grammatical performance largely unchanged. Part of speech analysis localizes the excess surprisal to syntactic scaffolding tokens, and layer-specific interventions reveal that small local degradations accumulate across depth. Over training checkpoints, the same ablation impairs grammatical behavior as Wasserstein neurons emerge and stabilize. Together, these results identify negative differentiation in a sparse subset of entangled neurons as a crucial mechanism that language models rely on for syntax.

Comment: Strong match to representation learning/interpretability: identifies a sparse mechanism (negative pre-activations in entangled neurons) underlying syntax processing with causal interventions.

Relevance: 9 Novelty: 8

ArXiv ID: 2509.24472

Authors: Ran Elbaz, Guy Bar-Shalom, Yam Eitan, Fabrizio Frasca, Haggai Maron

Abstract: Permutation equivariant neural networks employing parameter-sharing schemes have emerged as powerful models for leveraging a wide range of data symmetries, significantly enhancing the generalization and computational efficiency of the resulting models. Recently, Kolmogorov-Arnold Networks (KANs) have demonstrated promise through their improved interpretability and expressivity compared to traditional architectures based on MLPs. While equivariant KANs have been explored in recent literature for a few specific data types, a principled framework for applying them to data with permutation symmetries in a general context remains absent. This paper introduces Function Sharing KAN (FS-KAN), a principled approach to constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups, unifying and significantly extending previous work in this domain. We derive the basic construction of these FS-KAN layers by generalizing parameter-sharing schemes to the Kolmogorov-Arnold setup and provide a theoretical analysis demonstrating that FS-KANs have the same expressive power as networks that use standard parameter-sharing layers, allowing us to transfer well-known and important expressivity results from parameter-sharing networks to FS-KANs. Empirical evaluations on multiple data types and symmetry groups show that FS-KANs exhibit superior data efficiency compared to standard parameter-sharing layers, by a wide margin in certain cases, while preserving the interpretability and adaptability of KANs, making them an excellent architecture choice in low-data regimes.

Comment: Model Architecture: principled permutation-equivariant Kolmogorov–Arnold Networks via function sharing with theoretical expressivity guarantees and symmetry-aware design.

Relevance: 9 Novelty: 8

36. Signal Preserving Weight Initialization for Odd-Sigmoid Activations

ArXiv ID: 2509.23085

Authors: Hyunwoo Lee, Hayoung Choi, Hyunju Kim

Abstract: Activation functions critically influence trainability and expressivity, and recent work has therefore explored a broad range of nonlinearities. However, activations and weight initialization are interdependent: without an appropriate initialization method, nonlinearities can cause saturation, variance collapse, and increased learning rate sensitivity. We address this by defining an odd sigmoid function class and, given any activation f in this class, proposing an initialization method tailored to f. The method selects a noise scale in closed form so that forward activations remain well dispersed up to a target layer, thereby avoiding collapse to zero or saturation. Empirically, the approach trains reliably without normalization layers, exhibits strong data efficiency, and enables learning for activations under which standard initialization methods (Xavier, He, Orthogonal) often do not converge reliably.

Comment: Model Architecture/Training Dynamics: closed-form signal-preserving weight initialization tailored to odd-sigmoid activations, enabling stable training without normalization.

Relevance: 9 Novelty: 8

37. Identity Bridge: Enabling Implicit Reasoning via Shared Latent Memory

ArXiv ID: 2509.24653

Authors: Pengxiao Lin, Zheng-An Chen, Zhi-Qin John Xu

Abstract: Despite remarkable advances, large language models often fail at compositional reasoning tasks, a phenomenon exemplified by the ``curse of two-hop reasoning''. This paper introduces the Identity Bridge, a simple yet powerful mechanism that resolves this compositionality gap by supervising the model on a zero-hop identity task. We demonstrate empirically that this addition enables models to successfully perform out-of-distribution two-hop reasoning, a task they otherwise completely fail. To explain this phenomenon, we provide a theoretical analysis using a simplified Emb-MLP model, proving that identity supervision reshapes the model's latent geometry. We show this alignment is induced by an implicit nuclear-norm regularization during optimization, which favors low-rank solutions that share structure across tasks. For complex tasks, we use small initialization or weight decay to enhance the regularization effect, which enhances the latent space alignment effect and slows down the generalization decay. Finally, we extend our investigation to large-scale models, observing that they still achieve two-hop reasoning through the latent memory, which provides crucial inspiration for enhancing their implicit reasoning abilities.

Comment: Representation Learning: identity supervision aligns latent geometry (implicit nuclear-norm regularization) to enable compositional reasoning (two-hop), with theory and scaling evidence.

Relevance: 9 Novelty: 8

38. Neighborhood Sampling Does Not Learn the Same Graph Neural Network

ArXiv ID: 2509.22868

Authors: Zehao Niu, Mihai Anitescu, Jie Chen

Abstract: Neighborhood sampling is an important ingredient in the training of large-scale graph neural networks. It suppresses the exponential growth of the neighborhood size across network layers and maintains feasible memory consumption and time costs. While it becomes a standard implementation in practice, its systemic behaviors are less understood. We conduct a theoretical analysis by using the tool of neural tangent kernels, which characterize the (analogous) training dynamics of neural networks based on their infinitely wide counterparts -- Gaussian processes (GPs). We study several established neighborhood sampling approaches and the corresponding posterior GP. With limited samples, the posteriors are all different, although they converge to the same one as the sample size increases. Moreover, the posterior covariance, which lower-bounds the mean squared prediction error, is uncomparable, aligning with observations that no sampling approach dominates.

Comment: Representation Learning/Training Dynamics: NTK/GP-theoretic analysis of neighborhood sampling shows distinct posterior processes and errors versus full GNNs, clarifying systemic behaviors.

Relevance: 9 Novelty: 8

39. Conda: Column-Normalized Adam for Training Large Language Models Faster

ArXiv ID: 2509.24218

Authors: Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin

Abstract: Large language models (LLMs) have demonstrated impressive generalization and emergent capabilities, yet their pre-training remains computationally expensive and sensitive to optimization dynamics. While Adam-based optimizers offer fast convergence by adapting learning rates coordinate-wise, recent studies reveal that their updates often suffer from poor spectral conditioning and low-rank structures, hindering efficiency. Muon addresses this issue via global spectral normalization but lacks the per-coordinate adaptivity of Adam. In this work, we propose \textbf{Column-Normalized Adam (Conda)}, a novel optimizer that bridges the strengths of both approaches. Conda projects updates into an orthogonal subspace and applies column-wise second moment normalization based on the projected gradients, thereby achieving both improved spectral conditioning and maintaining coordinate-wise adaptivity. This design alleviates the spectral pathologies of Adam while preserving its fast convergence behavior. Extensive experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training. Remarkably, on the LLaMA series, \textbf{Conda achieves $2{\sim}2.5\times$ the convergence speed of AdamW, measured in both training steps and training time.} Further ablations demonstrate its robustness under diverse training setups. These results collectively highlight Conda as an effective and broadly applicable optimizer for large-scale LLM training. The code is released on https://github.com/jie040109/Conda

Comment: HPC/Optimization for large-scale training: Column-Normalized Adam improves spectral conditioning while retaining per-coordinate adaptivity, yielding 2–2.5x faster LLM pretraining.

Relevance: 9 Novelty: 8

40. Statistical Learning Guarantees for Group-Invariant Barron Functions

ArXiv ID: 2509.23474

Authors: Yahong Yang, Wei Zhu

Abstract: We investigate the generalization error of group-invariant neural networks within the Barron framework. Our analysis shows that incorporating group-invariant structures introduces a group-dependent factor $\delta_{G,\Gamma,\sigma} \le 1$ into the approximation rate. When this factor is small, group invariance yields substantial improvements in approximation accuracy. On the estimation side, we establish that the Rademacher complexity of the group-invariant class is no larger than that of the non-invariant counterpart, implying that the estimation error remains unaffected by the incorporation of symmetry. Consequently, the generalization error can improve significantly when learning functions with inherent group symmetries. We further provide illustrative examples demonstrating both favorable cases, where $\delta_{G,\Gamma,\sigma}\approx |G|^{-1}$, and unfavorable ones, where $\delta_{G,\Gamma,\sigma}\approx 1$. Overall, our results offer a rigorous theoretical foundation showing that encoding group-invariant structures in neural networks leads to clear statistical advantages for symmetric target functions.

Comment: Architecture/representation theory: statistical learning guarantees for group-invariant neural networks in the Barron framework; analyses of approximation and estimation errors.

Relevance: 9 Novelty: 8

41. Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime

ArXiv ID: 2509.24882

Authors: Leonardo Defilippis, Yizhou Xu, Julius Girardin, Emanuele Troiani, Vittorio Erba, Lenka Zdeborov\'a, Bruno Loureiro, Florent Krzakala

Abstract: Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.

Comment: Matches Representation Learning criterion: theoretical scaling laws and spectral analysis linking weight spectra to generalization.

Relevance: 9 Novelty: 8

42. SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

ArXiv ID: 2509.24006

Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen

Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.

Comment: Compression/Efficiency and Transformer Attention: Sparse-Linear Attention fuses sparse and linear attention with a custom GPU kernel, drastically reducing attention cost with minimal quality loss.

Relevance: 9 Novelty: 8

43. Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

ArXiv ID: 2509.23365

Authors: Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian

Abstract: Previous work shows that the chain of continuous thought (continuous CoT) improves the reasoning capability of large language models (LLMs) by enabling implicit parallel thinking, and a subsequent work provided theoretical insight by showing that a two-layer transformer equipped with continuous CoT can efficiently solve directed graph reachability by maintaining a superposition of multiple reasoning traces in the continuous thought. However, it remains unclear how the superposition mechanism is naturally learned from gradient-based training methods. To fill this gap, we theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem to unveil how the superposition mechanism emerges during training in two training stages -- (i) a thought-generation stage that autoregressively expands the continuous thought, and (ii) a prediction stage that converts the thought into the final answer. Our analysis reveals that during training using continuous thought, the index-matching logit, an important quantity which reflects the strength of the model's local search ability, will first increase and then remain bounded under mild assumptions. The bounded index-matching logit effectively balances exploration and exploitation during the reasoning process: the model will exploit local problem structures to identify plausible search traces, and assign comparable weights to multiple such traces to explore when it is uncertain about which solution is correct, which results in superposition. Our experimental results tracking the growth of logits further validate our theory.

Comment: Representation Learning/Training Dynamics: theory of how superposition emerges in continuous chain-of-thought within transformers through two-stage training dynamics.

Relevance: 9 Novelty: 8

44. LLM Interpretability with Identifiable Temporal-Instantaneous Representation

ArXiv ID: 2509.23323

Authors: Xiangchen Song, Jiaqi Sun, Zijian Li, Yujia Zheng, Kun Zhang

Abstract: Despite Large Language Models' remarkable capabilities, understanding their internal representations remains challenging. Mechanistic interpretability tools such as sparse autoencoders (SAEs) were developed to extract interpretable features from LLMs but lack temporal dependency modeling, instantaneous relation representation, and more importantly theoretical guarantees, undermining both the theoretical foundations and the practical confidence necessary for subsequent analyses. While causal representation learning (CRL) offers theoretically grounded approaches for uncovering latent concepts, existing methods cannot scale to LLMs' rich conceptual space due to inefficient computation. To bridge the gap, we introduce an identifiable temporal causal representation learning framework specifically designed for LLMs' high-dimensional concept space, capturing both time-delayed and instantaneous causal relations. Our approach provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity. By extending SAE techniques with our temporal causal framework, we successfully discover meaningful concept relationships in LLM activations. Our findings show that modeling both temporal and instantaneous conceptual relationships advances the interpretability of LLMs.

Comment: Matches Representation Learning/Identifiable Interpretability: identifiable temporal-instantaneous causal representation framework scaling to LLM activations with theoretical guarantees.

Relevance: 9 Novelty: 8

45. High-Dimensional Analysis of Single-Layer Attention for Sparse-Token Classification

ArXiv ID: 2509.25153

Authors: Nicholas Barnfield, Hugo Cui, Yue M. Lu

Abstract: When and how can an attention mechanism learn to selectively attend to informative tokens, thereby enabling detection of weak, rare, and sparsely located features? We address these questions theoretically in a sparse-token classification model in which positive samples embed a weak signal vector in a randomly chosen subset of tokens, whereas negative samples are pure noise. In the long-sequence limit, we show that a simple single-layer attention classifier can in principle achieve vanishing test error when the signal strength grows only logarithmically in the sequence length $L$, whereas linear classifiers require $\sqrt{L}$ scaling. Moving from representational power to learnability, we study training at finite $L$ in a high-dimensional regime, where sample size and embedding dimension grow proportionally. We prove that just two gradient updates suffice for the query weight vector of the attention classifier to acquire a nontrivial alignment with the hidden signal, inducing an attention map that selectively amplifies informative tokens. We further derive an exact asymptotic expression for the test error and training loss of the trained attention-based classifier, and quantify its capacity -- the largest dataset size that is typically perfectly separable -- thereby explaining the advantage of adaptive token selection over nonadaptive linear baselines.

Comment: Matches Architecture/Training Dynamics: theoretical analysis of single-layer attention’s adaptive token selection and learnability in high dimensions.

Relevance: 9 Novelty: 8

46. CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers

ArXiv ID: 2509.24416

Authors: Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang

Abstract: Visual generation quality has been greatly promoted with the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, these attributions also hinder the practical deployment of DiTs on edge devices, limiting their development and application. Serve as an efficient model compression technique, model post-training quantization (PTQ) can reduce the memory consumption and speed up the inference, with inevitable performance degradation. To alleviate the degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. To be specific, CLQ consists of three key designs. First, we observe that the calibration data used by most of the PTQ methods can not honestly represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS) to search. We evaluate CLQ with both image generation and video generation models and successfully compress the model into W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at \hyperlink{https://github.com/Kai-Liu001/CLQ}{https://github.com/Kai-Liu001/CLQ}.

Comment: Compression/Efficiency: PTQ for Diffusion Transformers with cross-block calibration, orthogonal-based smoothing via Hadamard transforms, and cross-layer parameter search.

Relevance: 9 Novelty: 7

47. VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

ArXiv ID: 2506.22694

Authors: Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee

Abstract: In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporates language modeling head (LM head) during drafting process. A drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, a target model, accepting a subset as its valid generation. As it is usually considered that the speculative decoding requires one-to-one mapping between vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead to improve the generation speed in memory-bound environment. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected by the most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in memory-bound process which is often the case on edge devices, resulting in higher memory-bound speed up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.

Comment: Compression/Efficiency: vocabulary pruning of the drafter LM head for speculative decoding to reduce memory-bound drafting latency.

Relevance: 9 Novelty: 7

48. Learning to Ponder: Adaptive Reasoning in Latent Space

ArXiv ID: 2509.24238

Authors: Yixin He, Lumingyuan Tang

Abstract: Test-time compute has emerged as a key paradigm for enhancing LLM reasoning, yet prevailing approaches like Best-of-N and majority voting apply uniform depth across inputs, wasting computation on simple queries while potentially under-thinking complex ones. We present FR-Ponder, a single-graph, backbone-training-free framework that allocates instance-adaptive reasoning compute via latent steering. A less than 1M-param controller observes hidden states and decides to halt or apply a small ponder step by adding a pre-computed steering vector to frozen representations. Our method extracts the latent steering vector associated with deeper reasoning outputs and direct IO from LLM and re-applies it through a tunable scaling factor, allowing the model to adapt its reasoning depth to the complexity of each input. To balance performance and computational cost, we employ Group Relative Policy Optimization (GRPO) as a reward signal to adaptively regulate reasoning depth, achieving task accuracy while mitigating overreasoning. Through curriculum learning and careful reward engineering, FR-Ponder learns calibrated compute allocation correlated with problem difficulty. On GSM8K and MATH500, FR-Ponder improves the compute-accuracy frontier, delivering lower FLOPs with better matched accuracy and comparing favorably to early-exit baselines, without modifying backbone weights. Analyses visualize interpretable steering directions and show learned compute allocation correlates with problem difficulty.

Comment: Conditional/Dynamic Networks: instance-adaptive test-time compute via a controller that applies latent steering and learned halting without changing backbone.

Relevance: 9 Novelty: 7

49. Scalable GANs with Transformers

ArXiv ID: 2509.24935

Authors: Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

Abstract: Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based and latent-space GANs, can be easily trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines.

Comment: Model Architecture/Scalability: purely transformer-based GANs trained in VAE latent space with scale-friendly stabilization (intermediate supervision, width-aware LR).

Relevance: 9 Novelty: 7

50. Effective Quantization of Muon Optimizer States

ArXiv ID: 2509.23106

Authors: Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, S. Sathiya Keerthi

Abstract: The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and up to 2x computational efficiency over AdamW in LLM pretraining. Like AdamW, Muon is stateful, requiring storage of both model weights and accumulated gradients. While 8-bit AdamW variants mitigate this overhead using blockwise quantization, they are typically stable only under dynamic quantization - which improves stability on linear quantization for extreme values. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization, supporting both linear and dynamic schemes. We demonstrate that 8-bit Muon maintains stability under both, while delivering $\sim$74\% reduction in memory footprint compared to full-precision Muon. In extensive experiments, 8-bit Muon closely matches the performance of Muon while outperforming AdamW and 8-bit AdamW in pre-training a 1.6B model on 4B FineWeb tokens. It also shows competitive results when fine-tuning the Llama 3.2 3B model on post-training data. We also provide a theoretical perspective to help explain this robustness under quantization.

Comment: Model Compression/Efficiency: 8-bit quantization of Muon optimizer states (blockwise, linear/dynamic) with robustness analysis, yielding large memory savings.

Relevance: 9 Novelty: 7

51. Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs

ArXiv ID: 2509.23684

Authors: Tanya Chowdhury, Atharva Nijasure, Yair Zick, James Allan

Abstract: Fine-tuned Large Language Models (LLMs) encode rich task-specific features, but the form of these representations, especially within MLP layers, remains unclear. Empirical inspection of LoRA updates shows that new features concentrate in mid-layer MLPs, yet the scale of these layers obscures meaningful structure. Prior probing suggests that statistical priors may strengthen, split, or vanish across depth, motivating the need to study how neurons work together rather than in isolation. We introduce a mechanistic interpretability framework based on coalitional game theory, where neurons mimic agents in a hedonic game whose preferences capture their synergistic contributions to layer-local computations. Using top-responsive utilities and the PAC-Top-Cover algorithm, we extract stable coalitions of neurons: groups whose joint ablation has non-additive effects. We then track their transitions across layers as persistence, splitting, merging, or disappearance. Applied to LLaMA, Mistral, and Pythia rerankers fine-tuned on scalar IR tasks, our method finds coalitions with consistently higher synergy than clustering baselines. By revealing how neurons cooperate to encode features, hedonic coalitions uncover higher-order structure beyond disentanglement and yield computational units that are functionally important, interpretable, and predictive across domains.

Comment: Representation Learning: mechanistic interpretability of Transformer MLPs via coalitional game theory to reveal synergistic neuron groups.

Relevance: 9 Novelty: 7

52. Query Circuits: Explaining How Language Models Answer User Prompts

ArXiv ID: 2509.24808

Authors: Tung-Yu Wu, Fazl Barez

Abstract: Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside a model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, resulting in more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric to evaluate how well a discovered circuit recovers the model's decision for a specific input, and is broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods to efficiently identify circuits that are sparse yet faithfully describe the model's behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that there exist extremely sparse query circuits within the model that can recover much of its performance on single queries. For example, a circuit covering only 1.3% of model connections can recover about 60% of performance on an MMLU questions. Overall, query circuits provide a step towards faithful, scalable explanations of how language models process individual inputs.

Comment: Representation Learning: input-level mechanistic explanations via sparse “query circuits” within the model, with a fidelity metric (NDF) and efficient discovery.

Relevance: 9 Novelty: 7

53. A Second-Order Perspective on Pruning at Initialization and Knowledge Transfer

ArXiv ID: 2509.24066

Authors: Leonardo Iurada, Beatrice Occhiena, Tatiana Tommasi

Abstract: The widespread availability of pre-trained vision models has enabled numerous deep learning applications through their transferable representations. However, their computational and storage costs often limit practical deployment. Pruning-at-Initialization has emerged as a promising approach to compress models before training, enabling efficient task-specific adaptation. While conventional wisdom suggests that effective pruning requires task-specific data, this creates a challenge when downstream tasks are unknown in advance. In this paper, we investigate how data influences the pruning of pre-trained vision models. Surprisingly, pruning on one task retains the model's zero-shot performance also on unseen tasks. Furthermore, fine-tuning these pruned models not only improves performance on original seen tasks but can recover held-out tasks' performance. We attribute this phenomenon to the favorable loss landscapes induced by extensive pre-training on large-scale datasets.

Comment: Compression/Efficiency: analyzes pruning-at-initialization for pre-trained models with second-order perspective; shows cross-task transferability.

Relevance: 9 Novelty: 7

54. Limit Analysis for Symbolic Multi-step Reasoning Tasks with Information Propagation Rules Based on Transformers

ArXiv ID: 2509.23178

Authors: Tian Qin, Yuhan Chen, Zhiwei Wang, Zhi-Qin John Xu

Abstract: Transformers are able to perform reasoning tasks, however the intrinsic mechanism remains widely open. In this paper we propose a set of information propagation rules based on Transformers and utilize symbolic reasoning tasks to theoretically analyze the limit reasoning steps. We show that the limit number of reasoning steps is between $O(3^{L-1})$ and $O(2^{L-1})$ for a model with $L$ attention layers in a single-pass.

Comment: Theoretical analysis of Transformer reasoning limits via information propagation rules (steps scale with number of layers).

Relevance: 9 Novelty: 7

55. Measuring Sparse Autoencoder Feature Sensitivity

ArXiv ID: 2509.23717

Authors: Claire Tian, Katherine Tian, Nathan Hu

Abstract: Sparse Autoencoder (SAE) features have become essential tools for mechanistic interpretability research. SAE features are typically characterized by examining their activating examples, which are often "monosemantic" and align with human interpretable concepts. However, these examples don't reveal feature sensitivity: how reliably a feature activates on texts similar to its activating examples. In this work, we develop a scalable method to evaluate feature sensitivity. Our approach avoids the need to generate natural language descriptions for features; instead we use language models to generate text with the same semantic properties as a feature's activating examples. We then test whether the feature activates on these generated texts. We demonstrate that sensitivity measures a new facet of feature quality and find that many interpretable features have poor sensitivity. Human evaluation confirms that when features fail to activate on our generated text, that text genuinely resembles the original activating examples. Lastly, we study feature sensitivity at the SAE level and observe that average feature sensitivity declines with increasing SAE width across 7 SAE variants. Our work establishes feature sensitivity as a new dimension for evaluating both individual features and SAE architectures.

Comment: Representation learning/interpretability: introduces sensitivity metric and evaluation protocol for Sparse Autoencoder features.

Relevance: 9 Novelty: 7

56. LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

ArXiv ID: 2509.23729

Authors: Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja

Abstract: Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens and intermediate layer activations produced by them exhibit significantly higher statistical variance and entropy compared to text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens varies significantly over different layers, with some layers having lower entropy activation distributions. We empirically show that such layers in these models can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. Additionally, we also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.

Comment: Model Compression and Efficiency: ultra-low-bit (<4-bit) layerwise quantization for MLLMs with selective per-layer PTQ based on activation entropy.

Relevance: 9 Novelty: 7

57. Memory-Efficient Fine-Tuning via Low-Rank Activation Compression

ArXiv ID: 2509.23472

Authors: Jiang-Xin Shi, Wen-Da Wei, Jin-Fei Qi, Xuanyu Chen, Tong Wei, Yu-Feng Li

Abstract: The parameter-efficient fine-tuning paradigm has garnered significant attention with the advancement of foundation models. Although numerous methods have been proposed to reduce the number of trainable parameters, their substantial memory overhead remains a critical bottleneck that hinders practical deployment. In this paper, we observe that model activations constitute a major source of memory consumption, especially under large batch sizes and long context lengths; however, the rank of the activations remains consistently low. Motivated by this insight, we propose a memory-efficient fine-tuning approach Low-Rank Activation Compression (LoRAct). Unlike prior work, LoRAct provides a more flexible and versatile compressing strategy that can be applied online during the forward pass without the need for any calibration data. Moreover, LoRAct incorporates a novel sampling-based orthogonal decomposition algorithm specifically designed for low-rank matrices, offering improved computational efficiency and a tighter error bound compared to the widely used RSVD. Experiments on both vision and language tasks demonstrate the effectiveness of LoRAct. Notably, LoRAct further reduces activation memory by approximately 80% in comparison with the widely adopted LoRA method, while maintaining competitive performance. The source code is available at https://github.com/shijxcs/meft.

Comment: Model Compression and Efficiency: low-rank activation compression during fine-tuning with a novel sampling-based orthogonal decomposition for low-rank matrices.

Relevance: 9 Novelty: 7

58. HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

ArXiv ID: 2509.23928

Authors: Zhinan Xie, Peisong Wang, Jian Cheng

Abstract: Speculative decoding is an effective approach for accelerating inference in Large Language models (LLMs), but its adaptation to Vision-Language models (VLMs) remains challenging for additional visual tokens in multimodal inputs. First, owing to the fact that the drafter and the target VLM may derived from different families, the semantic representations of visual tokens in the target VLM are misaligned with those in the drafter, introducing bias into the KV-cache during the prefill stage. Second, the large number of visual tokens substantially slows down the drafter's self-attention during the decoding stage. We propose Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models (HiViS), an explicit-implicit input decomposition framework that alleviates the above inefficiency. All visual tokens are removed from the drafter's input, retaining only textual tokens as explicit inputs, while directly reusing the target VLM's corresponding last-layer hidden states as implicit visual information without additional processing. To train the drafter efficiently, we introduces multi-step self-feedback training strategy with dynamic data selection and sequential embedding supervision to simulate reasoning during training. Our approach compresses the prefill sequence length of the drafter to only 0.7%-1.3% of the target VLM's input, while maintaining lossless generation quality. Extensive experiments across diverse models and tasks demonstrate up to 2.65x speedup, confirming the effectiveness of HiViS in accelerating VLM inference.

Comment: Model Compression/Efficiency: removes visual tokens from the drafter and reuses target hidden states to shorten prefill and speed up speculative decoding in VLMs while preserving quality.

Relevance: 9 Novelty: 7

59. MonoCon: A general framework for learning ultra-compact high-fidelity representations using monotonicity constraints

ArXiv ID: 2509.22931

Authors: Shreyas Gokhale

Abstract: Learning high-quality, robust, efficient, and disentangled representations is a central challenge in artificial intelligence (AI). Deep metric learning frameworks tackle this challenge primarily using architectural and optimization constraints. Here, we introduce a third approach that instead relies on $\textit{functional}$ constraints. Specifically, we present MonoCon, a simple framework that uses a small monotonic multi-layer perceptron (MLP) head attached to any pre-trained encoder. Due to co-adaptation between encoder and head guided by contrastive loss and monotonicity constraints, MonoCon learns robust, disentangled, and highly compact embeddings at a practically negligible performance cost. On the CIFAR-100 image classification task, MonoCon yields representations that are nearly 9x more compact and 1.5x more robust than the fine-tuned encoder baseline, while retaining 99\% of the baseline's 5-NN classification accuracy. We also report a 3.4x more compact and 1.4x more robust representation on an SNLI sentence similarity task for a marginal reduction in the STSb score, establishing MonoCon as a general domain-agnostic framework. Crucially, these robust, ultra-compact representations learned via functional constraints offer a unified solution to critical challenges in disparate contexts ranging from edge computing to cloud-scale retrieval.

Comment: Model Compression/Efficiency + Representation Learning: monotonic MLP head imposes functional constraints to learn ultra-compact, robust embeddings across domains.

Relevance: 9 Novelty: 7

60. Beyond Outliers: A Study of Optimizers Under Quantization

ArXiv ID: 2509.23500

Authors: Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh

Abstract: As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both areas, systematic evidence on optimizer-quantization interactions remains limited. To fill this gap, we study the impact of optimizer choice on model robustness under quantization, considering both post-training quantization (PTQ), and quantization-aware training (QAT). We first train full-precision models, ranging from 50M to 1.5B parameters, with six optimizers, to explore the hyperparameter landscape, and establish well-tuned baselines. We then apply PTQ to evaluate how model performance degrades when trained with different optimizers. We find that outlier-related metrics, such as the max-to-mean ratio (MMR) and Kurtosis, fail to predict the PTQ performance across different optimizers. We show analytically that this is due to the MMR capturing only isolated layer errors, while ignoring how quantization errors accumulate and propagate through the network. To study the QAT degradation, we train quantized models from scratch and compare them to our original-precision baselines. We find that optimizers performing well in the original pretraining setup may not remain optimal under QAT, and that models trained with Shampoo show the lowest accuracy degradation. Finally, we derive scaling laws for quantization-aware training under different optimizers, showing that Shampoo achieves the highest parameter efficiency of all tested optimizers.

Comment: Compression/Efficiency: systematic analysis of optimizer–quantization interactions (PTQ/QAT), with scaling laws and optimizer comparisons (e.g., Shampoo) for quantized training.

Relevance: 9 Novelty: 7

61. Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

ArXiv ID: 2509.22832

Authors: Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang

Abstract: Training Large Language Models(LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between transformer components, parallelism strategies(data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight sampling based hardware-aware prediction models for key operations; (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors-4.98\% on Perlmutter(A100) and 9.38\% on Vista(GH200)-for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.

Comment: High Performance Computing: fine-grained, hardware-aware GPU performance modeling for distributed LLM training across parallelism strategies; systems-level innovation enabling planning without on-cluster runs.

Relevance: 9 Novelty: 7

62. Dense associative memory on the Bures-Wasserstein space

ArXiv ID: 2509.23162

Authors: Chandan Tankala, Krishnakumar Balasubramanian

Abstract: Dense associative memories (DAMs) store and retrieve patterns via energy-functional fixed points, but existing models are limited to vector representations. We extend DAMs to probability distributions equipped with the 2-Wasserstein distance, focusing mainly on the Bures-Wasserstein class of Gaussian densities. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics aggregating optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity, provide quantitative retrieval guarantees under Wasserstein perturbations, and validate the model on synthetic and real-world distributional tasks. This work elevates associative memory from vectors to full distributions, bridging classical DAMs with modern generative modeling and enabling distributional storage and retrieval in memory-augmented learning.

Comment: Model Architecture + Representation Learning: extends dense associative memory to the Bures–Wasserstein space with storage/retrieval theory, a foundational representational framework.

Relevance: 8 Novelty: 8

63. LLM DNA: Tracing Model Evolution via Functional Representations

ArXiv ID: 2509.24496

Authors: Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He

Abstract: The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.

Comment: Representation Learning: defines a low-dimensional, bi-Lipschitz functional representation (LLM DNA) of model behavior with theoretical guarantees and a training-free extraction pipeline.

Relevance: 8 Novelty: 8

64. Space Group Conditional Flow Matching

ArXiv ID: 2509.23822

Authors: Omri Puny, Yaron Lipman, Benjamin Kurt Miller

Abstract: Inorganic crystals are periodic, highly-symmetric arrangements of atoms in three-dimensional space. Their structures are constrained by the symmetry operations of a crystallographic \emph{space group} and restricted to lie in specific affine subspaces known as \emph{Wyckoff positions}. The frequency an atom appears in the crystal and its rough positioning are determined by its Wyckoff position. Most generative models that predict atomic coordinates overlook these symmetry constraints, leading to unrealistically high populations of proposed crystals exhibiting limited symmetry. We introduce Space Group Conditional Flow Matching, a novel generative framework that samples significantly closer to the target population of highly-symmetric, stable crystals. We achieve this by conditioning the entire generation process on a given space group and set of Wyckoff positions; specifically, we define a conditionally symmetric noise base distribution and a group-conditioned, equivariant, parametric vector field that restricts the motion of atoms to their initial Wyckoff position. Our form of group-conditioned equivariance is achieved using an efficient reformulation of \emph{group averaging} tailored for symmetric crystals. Importantly, it reduces the computational overhead of symmetrization to a negligible level. We achieve state of the art results on crystal structure prediction and de novo generation benchmarks. We also perform relevant ablations.

Comment: Model Architecture/Equivariance: group-conditioned flow matching with efficient group-averaged equivariant vector fields enforcing space-group/Wyckoff constraints.

Relevance: 8 Novelty: 8

65. Define latent spaces by example: optimisation over the outputs of generative models

ArXiv ID: 2509.23800

Authors: Samuel Willis, Alexandru I. Stere, Dragos D. Margineantu, Henry T. Oldroyd, John A. Fozard, Carl Henrik Ek, Henry Moss, Erik Bodin

Abstract: Modern generative AI models such as diffusion and flow matching can sample from rich data distributions, but many downstream tasks -- such as experimental design or creative content generation -- require a higher level of control than unconstrained sampling. The challenge is to efficiently identify outputs that are both probable under the model and satisfy task-specific constraints. We address this by introducing surrogate latent spaces: non-parametric, low-dimensional Euclidean embeddings that can be extracted from any generative model without additional training. The axes in the Euclidean space can be defined via examples, providing a simple and interpretable approach to define custom latent spaces that both express intended features and are convenient to use in downstream tasks. The representation is Euclidean and has controllable dimensionality, permitting direct application of standard optimisation algorithms to traverse the outputs of generative models. Our approach is architecture-agnostic, incurs almost no additional computational cost, and generalises across modalities, including images, audio, videos, and structured objects like proteins.

Comment: Representation Learning/Control: introduces surrogate Euclidean latent spaces extracted from generative models for constraint-aware optimization without retraining.

Relevance: 8 Novelty: 8

66. Sequential Diffusion Language Models

ArXiv ID: 2509.24007

Authors: Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang

Abstract: Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1 higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and codes: https://github.com/OpenGVLab/SDLM

Comment: Model architecture and efficiency: introduces Next Sequence Prediction and SDLM to retrofit AR LMs for diffusion-style inference while preserving KV-cache compatibility and adaptive decoding.

Relevance: 8 Novelty: 8

67. Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

ArXiv ID: 2509.24335

Authors: Guolin Ke, Hui Xue

Abstract: Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs -- including after CFG -- to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.

Comment: Model Architecture/Training Dynamics: stabilizes continuous-token AR generation via hyperspherical VAE latents to prevent variance collapse, improving AR image models.

Relevance: 8 Novelty: 8

68. Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning

ArXiv ID: 2509.24372

Authors: Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen

Abstract: Fine-tuning pre-trained large language models (LLMs) for down-stream tasks is a critical step in the AI deployment pipeline. Reinforcement learning (RL) is arguably the most prominent fine-tuning method, contributing to the birth of many state-of-the-art LLMs. In contrast, evolution strategies (ES), which once showed comparable performance to RL on models with a few million parameters, was neglected due to the pessimistic perception of its scalability to larger models. In this work, we report the first successful attempt to scale up ES for fine-tuning the full parameters of LLMs, showing the surprising fact that ES can search efficiently over billions of parameters and outperform existing RL fine-tuning methods in multiple respects, including sample efficiency, tolerance to long-horizon rewards, robustness to different base LLMs, less tendency to reward hacking, and more stable performance across runs. It therefore serves as a basis to unlock a new direction in LLM fine-tuning beyond what current RL techniques provide. The source codes are provided at: https://github.com/VsonicV/es-fine-tuning-paper.

Comment: Matches high-performance training algorithms: scales Evolution Strategies to full-parameter LLM fine-tuning at billion-parameter scale, offering a systems-level alternative to RL.

Relevance: 8 Novelty: 8

69. Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

ArXiv ID: 2509.24291

Authors: Yu-Che Tsai, Kuan-Yu Chen, Yuan-Chi Li, Yuan-Hao Chen, Ching-Yu Tsai, Shou-De Lin

Abstract: Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.

Comment: Matches Representation Learning: generative iterative refinement for contrastive sentence embeddings using autoregressive LLMs.

Relevance: 8 Novelty: 8

70. SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

ArXiv ID: 2509.25176

Authors: Haoming Wen, Yushi Bai, Juanzi Li, Jie Tang

Abstract: We introduce SIRI, Scaling Iterative Reinforcement Learning with Interleaved Compression, a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Existing studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget, by dynamically adjusting the maximum rollout length during training. The compression phase cuts the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase then relaxes the length limit, providing space for the model to explore and plan in long-horizon settings. Remarkably, we find that after each compression-expansion cycle, the model's performance improves even as its output length decreases, steadily pushing it closer to the Pareto frontier in the performance-efficiency trade-off. Training on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations, and SIRI-high achieves the highest accuracy compared to all other methods (Figure 1). Our findings shed light on the potential of periodically oscillating the LRM's output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging towards an optimal "sweet spot" between the two. Our models are publicly available.

Comment: Model Compression/Efficiency: interleaved compression–expansion of rollout length in RL training to reduce redundant tokens and improve the performance–efficiency Pareto for LRMs.

Relevance: 8 Novelty: 8

71. Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

ArXiv ID: 2509.24317

Authors: Xianhang Li, Chen Huang, Chun-Liang Li, Eran Malach, Josh Susskind, Vimal Thilak, Etai Littwin

Abstract: Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representation by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel-reconstruction objective under V-JEPA masking, then (ii) freeze it and train a student to predict the teacher's latents on masked regions. This leads to a two-stage, unregularized scheme that we refer to as SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), increasing transparency, efficiency, and scalability while preserving the ability of representation to generalize under frozen evaluation. Empirically, our student models outperform recently proposed V-JEPA 2 encoders under frozen backbone evaluation across diverse benchmarks. They are also more compute-optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V-JEPA's accuracy-FLOPs Pareto frontier. Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even with small, sub-optimal teachers. This points to a compute budget allocation that should overwhelmingly favor the student. These results position SALT as a simple, scalable, and compute-efficient alternative to EMA-based self-distillation for video representation learning.

Comment: Representation Learning + Efficiency: SALT replaces EMA with a frozen teacher for masked-latent prediction, improving compute-efficiency and scalability in video SSL.

Relevance: 8 Novelty: 8

72. Adaptive Canonicalization with Application to Invariant Anisotropic Geometric Networks

ArXiv ID: 2509.24886

Authors: Ya-Wei Eileen Lin, Ron Levie

Abstract: Canonicalization is a widely used strategy in equivariant machine learning, enforcing symmetry in neural networks by mapping each input to a standard form. Yet, it often introduces discontinuities that can affect stability during training, limit generalization, and complicate universal approximation theorems. In this paper, we address this by introducing \emph{adaptive canonicalization}, a general framework in which the canonicalization depends both on the input and the network. Specifically, we present the adaptive canonicalization based on prior maximization, where the standard form of the input is chosen to maximize the predictive confidence of the network. We prove that this construction yields continuous and symmetry-respecting models that admit universal approximation properties. We propose two applications of our setting: (i) resolving eigenbasis ambiguities in spectral graph neural networks, and (ii) handling rotational symmetries in point clouds. We empirically validate our methods on molecular and protein classification, as well as point cloud classification tasks. Our adaptive canonicalization outperforms the three other common solutions to equivariant machine learning: data augmentation, standard canonicalization, and equivariant architectures.

Comment: Model Architecture/Equivariance: adaptive canonicalization ensures continuity and symmetry with universal approximation; addresses eigenbasis/rotation ambiguities in geometric networks.

Relevance: 8 Novelty: 8

73. Model Merging Scaling Laws in Large Language Models

ArXiv ID: 2509.24244

Authors: Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang

Abstract: We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget--turning merging from heuristic practice into a computationally efficient, planable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.

Comment: Matches Efficiency/Scaling Laws: empirical and theoretical scaling laws for model merging, enabling compute-efficient composition of specialists as an alternative to multitask training.

Relevance: 8 Novelty: 8

74. Discrete Variational Autoencoding via Policy Search

ArXiv ID: 2509.24716

Authors: Michael Drolet, Firas Al-Hafez, Aditya Bhatt, Jan Peters, Oleg Arenz

Abstract: Discrete latent bottlenecks in variational autoencoders (VAEs) offer high bit efficiency and can be modeled with autoregressive discrete distributions, enabling parameter-efficient multimodal search with transformers. However, discrete random variables do not allow for exact differentiable parameterization; therefore, discrete VAEs typically rely on approximations, such as Gumbel-Softmax reparameterization or straight-through gradient estimates, or employ high-variance gradient-free methods such as REINFORCE that have had limited success on high-dimensional tasks such as image reconstruction. Inspired by popular techniques in policy search, we propose a training framework for discrete VAEs that leverages the natural gradient of a non-parametric encoder to update the parametric encoder without requiring reparameterization. Our method, combined with automatic step size adaptation and a transformer-based encoder, scales to challenging datasets such as ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces, achieving a 20% improvement on FID Score for ImageNet 256.

Comment: Matches Autoencoders/Discrete Representation Learning: new policy-search-based training for discrete VAEs avoiding reparameterization, with scaling to high-dimensional data.

Relevance: 8 Novelty: 8

75. Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

ArXiv ID: 2509.23050

Authors: Lin Long, Changdae Oh, Seongheon Park, Yixuan Li

Abstract: Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP) -- memorized textual patterns from pre-training while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.

Comment: Representation Learning: layer-wise chain-of-embedding analysis revealing a Visual Integration Point and TVI metric to quantify language prior in LVLMs.