Personalized Daily ArXiv Papers 2026-02-09

[gpt-5]	Prompt	Completion	Total
Token	50077	42089	92166
Cost	$0.06	$0.42	$0.48

Total arXiv papers: 591

Total scanned papers: 334

Total relevant papers: 41

Table of contents with paper titles:

To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training Authors: Meghana Madhyastha, Daniel Haziza, Jesse Cai, Newsha Ardalani, Zhiqi Bu, Carole-Jean Wu
MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models Authors: Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath
POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models Authors: Yi Chen, Wonjin Shin, Shuhong Liu, Tho Mai, Jeongmo Lee, Chuanbo Hua, Kun Wang, Jun Liu, Joo-Young Kim
Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning Authors: Nan Chen, Soledad Villar, Soufiane Hayou
NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models Authors: Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi
Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations Authors: Alec S. Xu, Can Yaras, Matthew Asato, Qing Qu, Laura Balzano
SOCKET: SOft Collision Kernel EsTimator for Sparse Attention Authors: Sahil Joshi, Agniva Chowdhury, Wyatt Bellinger, Amar Kanakamedala, Ekam Singh, Hoang Anh Duy Le, Aditya Desai, Anshumali Shrivastava
Compressing LLMs with MoP: Mixture of Pruners Authors: Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Leandro Giusti Mugnaini, Keith Ando Ogawa, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings Authors: Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard
EUGens: Efficient, Unified, and General Dense Layers Authors: Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski
Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay Authors: Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu
Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization Authors: Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun
Disentanglement by means of action-induced representations Authors: Gorka Muñoz-Gil, Hendrik Poulsen Nautrup, Arunava Majumder, Paulin de Schoulepnikoff, Florian Fürrutter, Marius Krumm, Hans J. Briegel
High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory Authors: Sota Nishiyama, Masaaki Imaizumi
Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning Authors: Deqian Kong, Minglu Zhao, Aoyang Qin, Bo Pang, Chenxin Tao, David Hartmann, Edouardo Honig, Dehong Xu, Amit Kumar, Matt Sarte, Chuan Li, Jianwen Xie, Ying Nian Wu
Learning a Generative Meta-Model of LLM Activations Authors: Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford, Jacob Steinhardt
From Kepler to Newton: Inductive Biases Guide Learned World Models in Transformers Authors: Ziming Liu, Sophia Sanborn, Surya Ganguli, Andreas Tolias
Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers Authors: Liangyu Wang, Siqi Zhang, Junjie Wang, Yiming Dong, Bo Zheng, Zihan Qiu, Shengkun Tang, Di Wang, Rui Men, Dayiheng Liu
Deep networks learn to parse uniform-depth context-free languages from local statistics Authors: Jack T. Parley, Francesco Cagnetta, Matthieu Wyart
A Multiplicative Neural Network Architecture: Locality and Regularity of Approximation Authors: Hee-Sun Choi, Beom-Seok Han
PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference Authors: Rui Ning, Wei Zhang, Fan Lai
Fine-Grained Model Merging via Modular Expert Recombination Authors: Haiyun Qiu, Xingyu Wu, Liang Feng, Kay Chen Tan
Revisiting the Shape Convention of Transformer Language Models Authors: Feng-Ting Liao, Meng-Hsi Chen, Guan-Ting Yi, Da-shan Shiu
Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization Authors: Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich
Algebraic Robustness Verification of Neural Networks Authors: Yulia Alexandr, Hao Duan, Guido Montúfar
HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction Authors: Shengxuan Qiu, Haochen Huang, Shuzhang Zhong, Pengfei Zuo, Meng Li
Explaining Grokking in Transformers through the Lens of Inductive Bias Authors: Jaisidh Singh, Diganta Misra, Antonio Orvieto
Robustness Beyond Known Groups with Low-rank Adaptation Authors: Abinitha Gourabathina, Hyewon Jeong, Teya Bergamaschi, Marzyeh Ghassemi, Collin Stultz
SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass Authors: Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System Authors: Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Yineng Zhang, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu
SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs Authors: Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti
Weisfeiler and Lehman Go Categorical Authors: Seongjin Choi, Gahee Kim, Se-Young Yun
Vision Transformer Finetuning Benefits from Non-Smooth Components Authors: Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko
Multi-Way Representation Alignment Authors: Akshit Achara, Tatiana Gaintseva, Mateo Mahaut, Pritish Chakraborty, Viktor Stenby Johansson, Melih Barsbey, Emanuele Rodolà, Donato Crisostomi
Accelerating Vision Transformers on Brain Processing Unit Authors: Jinchi Tang, Yan Guo
Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity Authors: Bowen Zhang, Meiyi Wang, Harold Soh
Endogenous Resistance to Activation Steering in Language Models Authors: Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe, Mike Vaiana, Diogo de Lucena, Judd Rosenblatt, Michael S. A. Graziano
The Quantum Sieve Tracer: A Hybrid Framework for Layer-Wise Activation Tracing in Large Language Models Authors: Jonathan Pan
Diffeomorphism-Equivariant Neural Networks Authors: Josephine Elisabeth Oettinger, Zakhar Shumaylov, Johannes Bostelmann, Jan Lellmann, Carola-Bibiane Schönlieb
Same Answer, Different Representations: Hidden instability in VLMs Authors: Farooq Ahmad Wani, Alessandro Suglia, Rohit Saxena, Aryo Pradipta Gema, Wai-Chung Kwan, Fazl Barez, Maria Sofia Bucarelli, Fabrizio Silvestri, Pasquale Minervini
Optimal rates for density and mode estimation with expand-and-sparsify representations Authors: Kaushik Sinha, Christopher Tosh

1. To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training

ArXiv ID: 2602.06183

Authors: Meghana Madhyastha, Daniel Haziza, Jesse Cai, Newsha Ardalani, Zhiqi Bu, Carole-Jean Wu

Abstract: Training of Large Language Models is generally bottlenecked by matrix multiplications. In the Transformer architecture, a large portion of these operations happens in the Feed Forward Network (FFN), and this portion increases for larger models, up to 50% of the total pretraining floating point operations. We show that we can leverage hardware-accelerated sparsity to accelerate all matrix multiplications in the FFN, with 2:4 sparsity for weights and v:n:m (Venom) sparsity for activations. Our recipe relies on sparse training steps to accelerate a large part of the pretraining, combined with regular dense training steps towards the end. Overall, models trained with this approach exhibit the same performance on our quality benchmarks, and can speed up training end-to-end by 1.4 to 1.7x. This approach is applicable to all NVIDIA GPUs starting with the A100 generation, is orthogonal to common optimization techniques such as quantization, and can also be applied to mixture-of-experts model architectures.

Comment: Compression/Efficiency: combines 2:4 structured weight sparsity with v:n:m activation sparsity and sparse-to-dense training to accelerate LLM pretraining with maintained quality.

Relevance: 10 Novelty: 8
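
To make the 2:4 weight pattern from entry 1 concrete, here is a minimal pure-Python sketch. It is not the authors' recipe: it assumes the common magnitude-based rule (keep the two largest-magnitude values in every contiguous group of four), which the abstract does not confirm, and the function name `prune_2_4` is hypothetical.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude entries in each contiguous group of four.

    Assumption: magnitude-based selection; the abstract does not say how
    the kept positions are chosen.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries (ties broken by position).
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


weights = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.4, 0.0]
print(prune_2_4(weights))  # [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.4, 0.0]
```

Every group of four ends up with at most two nonzero values, which is the structure that sparse tensor cores can exploit.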

2. MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

ArXiv ID: 2602.06154

Authors: Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

Abstract: Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or improves upon standard MoE at full width and consistently shifts the Pareto frontier for accuracy vs. cost, achieving comparable performance with significantly fewer FLOPs.

Comment: Model Architecture/Efficiency: Mixture of Slimmable Experts (MoSE) introduces slimmable experts within MoE for conditional widths, enabling continuous accuracy–compute trade-offs from a single pretrained model.

Relevance: 10 Novelty: 8
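
The nested, slimmable structure described in entry 2 can be illustrated with a small sketch. The code below is an assumption-laden toy, not the MoSE implementation: it assumes a two-layer ReLU expert whose narrower variants reuse the first `width` hidden units, and the name `slimmable_expert` is hypothetical.

```python
def slimmable_expert(x, w_in, w_out, width):
    """Run a two-layer ReLU expert using only its first `width` hidden units.

    w_in:  one row of input weights per hidden unit.
    w_out: one row per output, with one column per hidden unit.
    Assumption: narrower experts are prefixes of the full expert.
    """
    if not 1 <= width <= len(w_in):
        raise ValueError("width must be between 1 and the full hidden size")
    # Hidden activations for the nested prefix of units.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(w_in[j], x))) for j in range(width)]
    # Output projection restricted to the same prefix.
    return [sum(row[j] * hidden[j] for j in range(width)) for row in w_out]


x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
w_out = [[1.0, 1.0, 1.0, 1.0]]
print(slimmable_expert(x, w_in, w_out, width=2))  # [3.0]
print(slimmable_expert(x, w_in, w_out, width=4))  # [6.0]
```

Because the same weights serve every width, the compute per selected expert scales with `width` without storing separate models.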

3. POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models

ArXiv ID: 2602.06822

Authors: Yi Chen, Wonjin Shin, Shuhong Liu, Tho Mai, Jeongmo Lee, Chuanbo Hua, Kun Wang, Jun Liu, Joo-Young Kim

Abstract: Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions during inference, overlooking sparsity patterns that emerge in the autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions, where prefilling defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing, including offline calibration, retraining, or learning predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring smaller computational overhead and minimizing inference latency.

Comment: Compression/Efficiency: online structural pruning with context-conditioned dynamic masks for LLMs/MoEs/VLMs; plug-and-play efficient inference.

Relevance: 10 Novelty: 8
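
The retained / candidate / pruned partition in entry 3 can be sketched as follows. This is only an illustration under stated assumptions: the importance scores, the split fractions `keep_frac` and `prune_frac`, and both function names are hypothetical, and the abstract does not specify how POP scores channels.

```python
def partition_channels(prefill_scores, keep_frac=0.5, prune_frac=0.25):
    """Split channel indices into retained, candidate, and pruned regions.

    Assumption: channels are ranked by a prefill-time importance score.
    """
    order = sorted(range(len(prefill_scores)), key=lambda c: -prefill_scores[c])
    n = len(order)
    n_keep = int(n * keep_frac)
    n_prune = int(n * prune_frac)
    retained = order[:n_keep]
    candidate = order[n_keep:n - n_prune]
    pruned = order[n - n_prune:]
    return retained, candidate, pruned


def decode_mask(retained, candidate, token_scores, k):
    """Pick the active channels for one decoding step.

    Only the candidate region is re-scored, so the per-token cost does not
    grow with the full channel count.
    """
    chosen = sorted(candidate, key=lambda c: -token_scores[c])[:k]
    return sorted(retained + chosen)


prefill = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2, 0.8, 0.4]
retained, candidate, pruned = partition_channels(prefill)
step_scores = [0.0, 0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.1]
print(decode_mask(retained, candidate, step_scores, k=1))  # [0, 2, 3, 4, 6]
```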

4. Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning

ArXiv ID: 2602.06204

Authors: Nan Chen, Soledad Villar, Soufiane Hayou

Abstract: Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($\mu$A), a theoretical framework that characterizes how the "optimal" learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $\mu$A is inspired by the Maximal-Update Parametrization ($\mu$P) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision-language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.

Comment: Low-Rank/Training Dynamics: learning-rate scaling laws across LoRA ranks (μA) with transfer to full finetuning; hyperparameter transfer theory for low-rank adaptation.

Relevance: 10 Novelty: 8
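
The two regimes named in entry 4 (rank-invariant and inverse-in-rank) translate into a simple transfer rule. The sketch below is hypothetical: the function name, the regime labels, and the exact `base_rank / rank` factor are illustrative assumptions, and the abstract does not say which initialization or LoRA scaling factor produces which regime.

```python
def transfer_lr(base_lr, base_rank, rank, regime):
    """Carry a learning rate tuned at `base_rank` over to another rank.

    regime="invariant": the optimal learning rate is roughly unchanged.
    regime="inverse":   the optimal learning rate scales as 1 / rank.
    Assumption: the regime is known from the chosen LoRA configuration.
    """
    if base_rank <= 0 or rank <= 0:
        raise ValueError("ranks must be positive")
    if regime == "invariant":
        return base_lr
    if regime == "inverse":
        return base_lr * base_rank / rank
    raise ValueError("regime must be 'invariant' or 'inverse'")


# A rate tuned at rank 8, reused at rank 32 under each regime.
print(transfer_lr(1e-3, 8, 32, "invariant"))
print(transfer_lr(1e-3, 8, 32, "inverse"))
```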

5. NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

ArXiv ID: 2602.06694

Authors: Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

Abstract: Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, and compresses full-precision weights to low-rank binary matrices and scales. Specifically, it utilizes an efficient alternating direction method of multipliers (ADMM) to precisely initialize latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8$\times$ in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.

Comment: Compression/Efficiency: PTQ to sub-1-bit via low-rank binary factorization with ADMM initialization and reconstruction; state-of-the-art ultra-low-bit LLM quantization.

Relevance: 10 Novelty: 8
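
A short arithmetic sketch shows why a low-rank binary factorization, as in entry 5, can fall below one bit per weight. It assumes a `rows x cols` matrix is replaced by two binary factors of shapes `rows x rank` and `rank x cols`; the 16-bit per-row and per-column scales in the second call are a hypothetical storage layout, not something the abstract specifies.

```python
def bits_per_weight(rows, cols, rank, scale_bits=0):
    """Average storage per original weight for a rank-`rank` binary factorization.

    The two binary factors cost rank * (rows + cols) bits in total;
    `scale_bits` is any extra storage for scales (an assumption here).
    """
    binary_bits = rank * (rows + cols)
    return (binary_bits + scale_bits) / (rows * cols)


# Binary factors only: 1024 * 8192 bits spread over 4096 * 4096 weights.
print(bits_per_weight(4096, 4096, 1024))  # 0.5

# Adding a hypothetical 16-bit scale per row and per column.
print(bits_per_weight(4096, 4096, 1024, scale_bits=16 * (4096 + 4096)))  # 0.5078125
```

For a square `n x n` matrix the binary cost is `2 * rank / n` bits per weight, so any rank below `n / 2` is sub-1-bit before scale overhead.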

6. Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations

ArXiv ID: 2602.06208

Authors: Alec S. Xu, Can Yaras, Matthew Asato, Qing Qu, Laura Balzano

Abstract: Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. While this has inspired new research into low-rank training, compression, and adaptation, theoretical justification for these dynamics in nonlinear networks remains limited. To address this gap, this paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training. Theoretically, we precisely characterize these invariant subspaces for two-layer networks with smooth nonlinear activations, providing insight into their emergence. Experimentally, we validate that this phenomenon extends beyond our theoretical assumptions. Leveraging these insights, we empirically show there exists a low-rank MLP parameterization that, when initialized within the appropriate subspaces, matches the classification performance of fully-parameterized counterparts on a variety of classification tasks.

Comment: Compression/Efficiency and Representation Learning: proves emergent low-rank/invariant subspace training dynamics in MLPs and motivates effective low-rank parameterizations.

Relevance: 10 Novelty: 8

7. SOCKET: SOft Collision Kernel EsTimator for Sparse Attention

ArXiv ID: 2602.06283

Authors: Sahil Joshi, Agniva Chowdhury, Wyatt Bellinger, Amar Kanakamedala, Ekam Singh, Hoang Anh Duy Le, Aditya Desai, Anshumali Shrivastava

Abstract: Exploiting sparsity during long-context inference is central to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends critically on efficient scoring and selection of relevant tokens at inference time. We revisit Locality-Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Our key insight is that hard LSH produces discrete collision signals and is therefore poorly suited for ranking. In contrast, soft LSH aggregates graded collision evidence across hash tables, preserving the stability of relative ordering among the true top-$k$ tokens. This transformation elevates LSH from a candidate-generation heuristic to a principled and mathematically grounded scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without an ad-hoc voting mechanism, and matches or surpasses established sparse attention baselines across multiple long-context benchmarks using a diverse set of models. With a custom CUDA kernel for scoring keys and a Flash Decode Triton backend for sparse attention, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention, making it an effective tool for long-context inference. Code is open-sourced at https://github.com/amarka8/SOCKET.

Comment: Compression/Efficiency: sparse attention via soft LSH scoring kernel for top-k token selection; systems-level acceleration with custom CUDA/Triton yielding up to 1.5× throughput over FlashAttention.

Relevance: 10 Novelty: 8
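
The contrast between hard and soft collisions in entry 7 can be illustrated with a toy sign-hash (random-hyperplane style) example. This is not SOCKET's kernel: the sigmoid relaxation, the `temperature` parameter, the fixed hyperplanes, and all function names are assumptions chosen only to show how graded evidence can break ties that hard bucket matches leave unresolved.

```python
import math


def sign_bits(vec, planes):
    """Hard hash code: the sign of the projection onto each hyperplane."""
    return [1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else -1 for plane in planes]


def hard_score(query, key, tables):
    """Count the tables in which query and key land in the same bucket."""
    return sum(1 for planes in tables if sign_bits(query, planes) == sign_bits(key, planes))


def soft_score(query, key, tables, temperature=4.0):
    """Sum, over tables, of a soft probability that the query matches the key's code.

    Assumption: each bit match is relaxed to a sigmoid of the query projection.
    """
    total = 0.0
    for planes in tables:
        prob = 1.0
        for plane, bit in zip(planes, sign_bits(key, planes)):
            proj = sum(p * q for p, q in zip(plane, query))
            prob *= 1.0 / (1.0 + math.exp(-temperature * bit * proj))
        total += prob
    return total


# Two hash tables with two fixed hyperplanes each (illustrative, not random).
tables = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]
query = [1.0, 0.2]
near_key = [1.0, -0.1]  # close to the query in angle
far_key = [0.2, 1.0]    # much further from the query

print(hard_score(query, near_key, tables), hard_score(query, far_key, tables))
print(soft_score(query, near_key, tables) > soft_score(query, far_key, tables))
```

In this toy setup both keys collide with the query in exactly one table, so the hard count ties, while the soft score ranks the nearer key higher.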

8. Compressing LLMs with MoP: Mixture of Pruners

ArXiv ID: 2602.06127

Authors: Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Leandro Giusti Mugnaini, Keith Ando Ogawa, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao

Abstract: The high computational demands of Large Language Models (LLMs) motivate methods that reduce parameter count and accelerate inference. In response, model pruning emerges as an effective strategy, yet current methods typically focus on a single dimension: depth or width. We introduce MoP (Mixture of Pruners), an iterative framework that unifies these dimensions. At each iteration, MoP generates two branches (pruning in depth versus pruning in width) and selects a candidate to advance the path. On LLaMA-2 and LLaMA-3, MoP advances the frontier of structured pruning, exceeding the accuracy of competing methods across a broad set of compression regimes. It also consistently outperforms depth-only and width-only pruning. Furthermore, MoP translates structural pruning into real speedup, reducing end-to-end latency by 39% at 40% compression. Finally, extending MoP to the vision-language model LLaVA-1.5, we notably improve computational efficiency and demonstrate that text-only recovery fine-tuning can restore performance even on visual tasks.

Comment: Compression/Efficiency: structured pruning via a mixture-of-pruners combining depth and width pruning, yielding latency reductions and improved accuracy under fixed compression.

Relevance: 10 Novelty: 8

9. Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings

ArXiv ID: 2602.06218

Authors: Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard

Abstract: Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.

Comment: Representation Learning: aligned sparse autoencoder with an iso-energy inductive bias to analyze and disentangle VLM embedding geometry (bimodal vs. unimodal atoms).

Relevance: 9 Novelty: 8

10. EUGens: Efficient, Unified, and General Dense Layers

ArXiv ID: 2410.09771

Authors: Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski

Abstract: Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, Efficient, Unified and General dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to the first unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EUGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to 27%) and memory efficiency (up to 30%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.

Comment: Model Architecture and Efficiency: introduces EUGens, a new dense layer class using random features to approximate FFLs, reducing inference from quadratic to linear time and enabling backprop-free layer-wise transfer.

Relevance: 9 Novelty: 8

11. Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay

ArXiv ID: 2602.06797

Authors: Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu

Abstract: We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $\beta>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD) (Hu et al. (2024)) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned -- a strategy widely adopted in practice -- and characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.

Comment: Training dynamics: optimal learning-rate schedules under functional scaling laws applicable to LLM pretraining; theory-driven.

Relevance: 9 Novelty: 8
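
The two schedule shapes in entry 11 can be written down directly. `power_decay_lr` implements the formula quoted in the abstract, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$. `wsd_lr` is only a rough sketch of a warmup-stable-decay shape: its linear warmup, linear decay, and the `warmup_frac` and `decay_frac` defaults are assumptions, not the paper's optimal schedule.

```python
def power_decay_lr(step, horizon, eta_peak, beta):
    """Power decay to zero: eta_peak * (1 - step / horizon) ** (2 * beta - 1)."""
    if not 0 <= step <= horizon:
        raise ValueError("step must lie in [0, horizon]")
    return eta_peak * (1.0 - step / horizon) ** (2.0 * beta - 1.0)


def wsd_lr(step, horizon, eta_max, warmup_frac=0.05, decay_frac=0.1):
    """Warmup-stable-decay shape with linear ramps (illustrative assumption)."""
    warmup_end = warmup_frac * horizon
    decay_start = (1.0 - decay_frac) * horizon
    if step < warmup_end:
        return eta_max * step / warmup_end
    if step <= decay_start:
        return eta_max
    return eta_max * (horizon - step) / (horizon - decay_start)


# With beta = 1.5 the power-decay exponent is 2, so the rate falls quadratically.
for step in (0, 50, 100):
    print(step, power_decay_lr(step, 100, 0.1, 1.5), wsd_lr(step, 100, 0.1))
```

The power-decay schedule starts shrinking immediately, while the WSD shape holds the maximum rate through most of the horizon and drops only at the end.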

12. Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization

ArXiv ID: 2602.06385

Authors: Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun

Abstract: Spectral gradient descent (SpecGD) orthogonalizes the matrix parameter updates and has inspired practical optimizers such as Muon. They often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.

Comment: Low-Rank/Training Dynamics: theoretical analysis of SpecGF/Muon in LoRA-style matrix factorization showing uniform spectral growth and global convergence properties.

Relevance: 9 Novelty: 8

13. Disentanglement by means of action-induced representations

ArXiv ID: 2602.06741

Authors: Gorka Muñoz-Gil, Hendrik Poulsen Nautrup, Arunava Majumder, Paulin de Schoulepnikoff, Florian Fürrutter, Marius Krumm, Hans J. Briegel

Abstract: Learning interpretable representations with variational autoencoders (VAEs) is a major goal of representation learning. The main challenge lies in obtaining disentangled representations, where each latent dimension corresponds to a distinct generative factor. This difficulty is fundamentally tied to the inability to perform nonlinear independent component analysis. Here, we introduce the framework of action-induced representations (AIRs) which models representations of physical systems given experiments (or actions) that can be performed on them. We show that, in this framework, we can provably disentangle degrees of freedom w.r.t. their action dependence. We further introduce a variational AIR architecture (VAIR) that can extract AIRs and therefore achieve provable disentanglement where standard VAEs fail. Beyond state representation, VAIR also captures the action dependence of the underlying generative factors, directly linking experiments to the degrees of freedom they influence.

Comment: Representation Learning: introduces action-induced representations with provable disentanglement and a variational AIR architecture (VAIR).

Relevance: 9 Novelty: 8

14. High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory

ArXiv ID: 2602.06320

Authors: Sota Nishiyama, Masaaki Imaizumi

Abstract: Modern machine learning models are typically trained via multi-pass stochastic gradient descent (SGD) with small batch sizes, and understanding their dynamics in high dimensions is of great interest. However, an analytical framework for describing the high-dimensional asymptotic behavior of multi-pass SGD with small batch sizes for nonlinear models is currently missing. In this study, we address this gap by analyzing the high-dimensional dynamics of a stochastic differential equation called a stochastic gradient flow (SGF), which approximates multi-pass SGD in this regime. In the limit where the number of data samples $n$ and the dimension $d$ grow proportionally, we derive a closed system of low-dimensional and continuous-time equations and prove that it characterizes the asymptotic distribution of the SGF parameters. Our theory is based on the dynamical mean-field theory (DMFT) and is applicable to a wide range of models encompassing generalized linear models and two-layer neural networks. We further show that the resulting DMFT equations recover several existing high-dimensional descriptions of SGD dynamics as special cases, thereby providing a unifying perspective on prior frameworks such as online SGD and high-dimensional linear regression. Our proof builds on the existing DMFT technique for gradient flow and extends it to handle the stochasticity in SGF using tools from stochastic calculus.

Comment: Training Dynamics Theory: DMFT-based high-dimensional limit for stochastic gradient flow covering GLMs and two-layer nets; unifies prior SGD dynamics frameworks.

Relevance: 9 Novelty: 8

15. Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning

ArXiv ID: 2602.06584

Authors: Deqian Kong, Minglu Zhao, Aoyang Qin, Bo Pang, Chenxin Tao, David Hartmann, Edouardo Honig, Dehong Xu, Amit Kumar, Matt Sarte, Chuan Li, Jianwen Xie, Ying Nian Wu

Abstract: Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Training a 0.2B-parameter model from scratch on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.

Comment: Model Architecture/Inference-time Computation: decouples reasoning into latent thought vectors and a decoder, enabling gradient-based refinement over a learned latent manifold.

Relevance: 9 Novelty: 8

16. Learning a Generative Meta-Model of LLM Activations

ArXiv ID: 2602.06964

Authors: Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford, Jacob Steinhardt

Abstract: Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.

Comment: Representation Learning/Interpretability: trains diffusion meta-models on LLM activations to learn a prior over internal states, improving intervention fidelity and sparsity of concepts.

Relevance: 9 Novelty: 8

17. From Kepler to Newton: Inductive Biases Guide Learned World Models in Transformers

ArXiv ID: 2602.06923

Authors: Ziming Liu, Sophia Sanborn, Surya Ganguli, Andreas Tolias

Abstract: Can general-purpose AI architectures go beyond prediction to discover the physical laws governing the universe? True intelligence relies on "world models" -- causal abstractions that allow an agent to not only predict future states but understand the underlying governing dynamics. While previous "AI Physicist" approaches have successfully recovered such laws, they typically rely on strong, domain-specific priors that effectively "bake in" the physics. Conversely, Vafa et al. recently showed that generic Transformers fail to acquire these world models, achieving high predictive accuracy without capturing the underlying physical laws. We bridge this gap by systematically introducing three minimal inductive biases. We show that ensuring spatial smoothness (by formulating prediction as continuous regression) and stability (by training with noisy contexts to mitigate error accumulation) enables generic Transformers to surpass prior failures and learn a coherent Keplerian world model, successfully fitting ellipses to planetary trajectories. However, true physical insight requires a third bias: temporal locality. By restricting the attention window to the immediate past -- imposing the simple assumption that future states depend only on the local state rather than a complex history -- we force the model to abandon curve-fitting and discover Newtonian force representations. Our results demonstrate that simple architectural choices determine whether an AI becomes a curve-fitter or a physicist, marking a critical step toward automated scientific discovery.

Comment: Inductive Bias/Architecture: shows how minimal biases (spatial smoothness, stability via noisy contexts, temporal locality via restricted attention) guide transformers from curve-fitting to learning Newtonian world models.

Relevance: 9 Novelty: 7
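
Sketch for item 17: the temporal-locality bias described in the abstract amounts to a causal attention mask with a short look-back window. The minimal Python sketch below builds such a mask. The function name `local_causal_mask`, the `window` parameter, and the choice to count the current position inside the window are illustrative assumptions, not details taken from the paper.

```python
def local_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j.

    Assumption for illustration: the window includes the current position,
    so window=1 means "attend only to yourself".
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        # Causal (j <= i) and local (j is within the last `window` positions).
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = local_causal_mask(seq_len=5, window=2)
for row in mask:
    # "x" marks an allowed query/key pair, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

With a full causal mask every row would be filled up to the diagonal; shrinking `window` is what removes access to the longer history.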

18. Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers

ArXiv ID: 2602.06079

Authors: Liangyu Wang, Siqi Zhang, Junjie Wang, Yiming Dong, Bo Zheng, Zihan Qiu, Shengkun Tang, Di Wang, Rui Men, Dayiheng Liu

Abstract: The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing the load imbalance. For Tensor Parallelism, we design an Asynchronous Compute pipeline utilizing Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. Extensive evaluations on the Qwen3 model family (up to 32B parameters) on 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57x speedup in end-to-end iteration time and reducing optimizer step latency by 5.8x compared to the baseline.

Comment: High-Performance Computing: distributed training innovation for matrix-based optimizers with asynchronous scheduling and load-balanced partitioning.

Relevance: 9 Novelty: 7
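
Sketch for item 18: the abstract does not spell out the alpha-Balanced Static Partitioning strategy, so the example below swaps in the classical longest-processing-time greedy heuristic to show the general idea of balancing optimizer work across ranks while keeping each tensor whole (atomic). The tensor names, cost values, and the function `assign_whole_tensors` are hypothetical.

```python
import heapq


def assign_whole_tensors(costs: dict[str, float], num_ranks: int) -> list[list[str]]:
    """Greedy LPT assignment: the largest tensor goes to the least-loaded rank.

    Each tensor is assigned to exactly one rank, so a matrix-based optimizer
    step can always see the full matrix.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (current load, rank id)
    heapq.heapify(heap)
    assignment: list[list[str]] = [[] for _ in range(num_ranks)]
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(name)
        heapq.heappush(heap, (load + cost, rank))
    return assignment


# Hypothetical per-tensor optimizer-step costs (arbitrary units).
costs = {"attn.qkv": 9, "mlp.up": 8, "mlp.down": 8, "attn.out": 3, "embed": 2}
assignment = assign_whole_tensors(costs, num_ranks=2)
loads = [sum(costs[name] for name in names) for names in assignment]
print(assignment, loads)
```

The heuristic is static: it is computed once from the cost estimates and does not react to runtime variation.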

19. Deep networks learn to parse uniform-depth context-free languages from local statistics

ArXiv ID: 2602.06065

Authors: Jack T. Parley, Francesco Cagnetta, Matthieu Wyart

Abstract: Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.

Comment: Representation Learning: theoretical and empirical insights into how deep nets learn hierarchical structure from local statistics in PCFGs.

Relevance: 9 Novelty: 7

20. A Multiplicative Neural Network Architecture: Locality and Regularity of Approximation

ArXiv ID: 2602.06374

Authors: Hee-Sun Choi, Beom-Seok Han

Abstract: We introduce a multiplicative neural network architecture in which multiplicative interactions constitute the fundamental representation, rather than appearing as auxiliary components within an additive model. We establish a universal approximation theorem for this architecture and analyze its approximation properties in terms of locality and regularity in Bessel potential spaces. To complement the theoretical results, we conduct numerical experiments on representative targets exhibiting sharp transition layers or pointwise loss of higher-order regularity. The experiments focus on the spatial structure of approximation errors and on regularity-sensitive quantities, in particular the convergence of Zygmund-type seminorms. The results show that the proposed multiplicative architecture yields residual error structures that are more tightly aligned with regions of reduced regularity and exhibits more stable convergence in regularity-sensitive metrics. These results demonstrate that adopting a multiplicative representation format has concrete implications for the localization and regularity behavior of neural network approximations, providing a direct connection between architectural design and analytical properties of the approximating functions.

Comment: Model Architecture: proposes a multiplicative neural network with universal approximation and locality/regularity analysis.

Relevance: 9 Novelty: 7

21. PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference

ArXiv ID: 2602.06072

Authors: Rui Ning, Wei Zhang, Fan Lai

Abstract: Attention efficiency is critical to large language model (LLM) inference. While prior advances optimize attention execution for individual requests (e.g., FlashAttention), production LLM serving relies on batching requests with highly heterogeneous sequence lengths for high serving throughput. This mismatch induces severe computation and I/O imbalance, exacerbates stragglers, and underutilizes GPU resources. We present PackInfer, a kernel-level attention framework that enables compute- and I/O-aware execution for heterogeneous batched inference. PackInfer orchestrates batched requests into load-balanced execution groups, effectively saturating GPU utilization by packing multiple requests into unified kernel launches. By constructing attention kernels directly over packed query-key regions, PackInfer eliminates redundant computation and balances thread-block execution. It then incorporates I/O-aware grouping that co-locates shared-prefix requests and reorganizes KV caches into group-contiguous layouts, reducing memory fragmentation and redundant data movement as generation evolves. Evaluations on real-world workloads show that PackInfer reduces inference latency by 13.0-20.1%, and improves throughput by 20% compared to the state-of-the-art FlashAttention.

Comment: High-Performance Computing: kernel-level attention packing and KV-cache reorganization for heterogeneous batched LLM inference (compute- and I/O-aware execution).

Relevance: 9 Novelty: 7
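
Sketch for item 21: a rough way to see the imbalance the abstract describes is to compare a padded batch with a packed one. The example below is not PackInfer's kernel-level method; it uses a plain first-fit-decreasing heuristic over token counts, and the request lengths and the `capacity` value are made up. Token counts are also only a crude proxy, since attention cost grows faster than linearly with sequence length.

```python
def padded_utilization(lengths: list[int]) -> float:
    """Fraction of useful tokens when every request is padded to the longest."""
    return sum(lengths) / (len(lengths) * max(lengths))


def pack_first_fit_decreasing(lengths: list[int], capacity: int) -> list[list[int]]:
    """Pack requests into groups whose total token count stays within capacity."""
    groups: list[list[int]] = []
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError("request longer than group capacity")
        for group in groups:
            if sum(group) + length <= capacity:
                group.append(length)
                break
        else:
            # No existing group has room, so open a new one.
            groups.append([length])
    return groups


lengths = [4096, 128, 256, 64]  # hypothetical heterogeneous batch
print(f"padded utilization: {padded_utilization(lengths):.1%}")
print("packed groups:", pack_first_fit_decreasing(lengths, capacity=4096))
```

In this toy batch the three short requests share one group instead of each being padded out to 4096 tokens.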

22. Fine-Grained Model Merging via Modular Expert Recombination

ArXiv ID: 2602.06552

Authors: Haiyun Qiu, Xingyu Wu, Liang Feng, Kay Chen Tan

Abstract: Model merging constructs versatile models by integrating task-specific models without requiring labeled data or expensive joint retraining. Although recent methods improve adaptability to heterogeneous tasks by generating customized merged models for each instance, they face two critical limitations. First, the instance-specific merged models lack reusability, restricting the exploitation of high-quality merging configurations and efficient batch inference. Second, these methods treat each task-specific model as a monolithic whole, overlooking the diverse mergeability of homologous components such as attention and multilayer perceptron layers, and the differing merging sensitivities across components. To address these limitations, we propose MERGE (Modular Expert Recombination for fine-Grained mErging), a method that enables component-wise model merging and input-aware, on-demand module recombination at inference. MERGE formulates component-wise merging as a bi-objective optimization problem that balances cross-task performance and storage efficiency, and develops a surrogate-assisted evolutionary algorithm to efficiently identify Pareto-optimal merging configurations. These high-quality configurations underpin a reusable modular expert library, from which a lightweight routing network dynamically activates and recombines modular experts to assemble input-specific models and enable efficient inference under storage constraints. Extensive experiments across various model scales, task types, and fine-tuning strategies demonstrate that MERGE consistently outperforms strong baselines and generalizes effectively.

Comment: Model Architecture: fine-grained, component-wise model merging with a reusable modular expert library and input-aware routing (conditional/dynamic networks).

Relevance: 9 Novelty: 7
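
Sketch for item 22: the bi-objective formulation in the abstract (higher cross-task performance, lower storage) implies selecting non-dominated configurations. The example below shows only that generic Pareto filter, not the surrogate-assisted evolutionary search; the configuration names and numbers are invented for illustration.

```python
def pareto_front(configs: list[tuple[str, float, float]]) -> list[str]:
    """Return names of configurations not dominated by any other.

    Each config is (name, performance, storage). Performance is maximized
    and storage is minimized.
    """
    front = []
    for name, perf, storage in configs:
        dominated = any(
            # Another config is at least as good on both objectives
            # and strictly better on at least one.
            (p >= perf and s <= storage) and (p > perf or s < storage)
            for _, p, s in configs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical merging configurations: (name, accuracy, storage in GB).
configs = [
    ("A", 0.80, 10.0),
    ("B", 0.85, 14.0),
    ("C", 0.78, 12.0),
    ("D", 0.85, 20.0),
    ("E", 0.90, 30.0),
]
print(pareto_front(configs))
```

Here C loses to A on both objectives and D matches B's accuracy at higher storage, so only A, B, and E remain as trade-off candidates.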

23. Revisiting the Shape Convention of Transformer Language Models

ArXiv ID: 2602.06471

Authors: Feng-Ting Liao, Meng-Hsi Chen, Guan-Ting Yi, Da-shan Shiu

Abstract: Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformers, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that parameters saved by using a lighter hourglass FFN can be more effectively utilized, such as by enlarging model hidden dimensions under fixed budgets. We confirm these hypotheses through empirical validation across model scales: hourglass FFNs outperform conventional FFNs at scales up to 400M parameters and achieve comparable performance at larger scales up to 1B parameters; hourglass FFN variants with reduced FFN and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and the balance between attention and FFN towards efficient and expressive modern language models.

Comment: Model Architecture/Efficiency: replaces Transformer FFN with deeper hourglass FFNs and rebalances attention vs FFN under fixed budgets, challenging the narrow–wide–narrow MLP convention.

Relevance: 9 Novelty: 7
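
Sketch for item 23: a back-of-the-envelope parameter count shows why a stack of hourglass sub-MLPs can be lighter than a conventional FFN. The formulas assume plain two-matrix MLPs without biases or gating, and the sizes (`d_model=1024`, `bottleneck=256`, `depth=4`) are illustrative choices rather than configurations from the paper.

```python
def conventional_ffn_params(d_model: int, expansion: int) -> int:
    """Narrow-wide-narrow FFN: d_model -> expansion * d_model -> d_model."""
    hidden = expansion * d_model
    return 2 * d_model * hidden


def hourglass_ffn_params(d_model: int, bottleneck: int, depth: int) -> int:
    """Stack of `depth` wide-narrow-wide sub-MLPs: d_model -> bottleneck -> d_model."""
    return depth * 2 * d_model * bottleneck


conventional = conventional_ffn_params(d_model=1024, expansion=4)
hourglass = hourglass_ffn_params(d_model=1024, bottleneck=256, depth=4)
print(conventional, hourglass, hourglass / conventional)
```

Under these assumed sizes the hourglass stack uses a quarter of the FFN parameters, which is the kind of saving the abstract proposes to reinvest elsewhere, for example in a larger hidden dimension.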

24. Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization

ArXiv ID: 2602.06880

Authors: Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich

Abstract: Adaptive methods like Adam have become the de facto standard for large-scale vector and Euclidean optimization due to their coordinate-wise adaptation with a second-order nature. More recently, matrix-based spectral optimizers like Muon (Jordan et al., 2024b) show the power of treating weight matrices as matrices rather than long vectors. Linking these is hard because many natural generalizations are not feasible to implement, and we also cannot simply move the Adam adaptation to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This decoupling produces DeVA (Decoupled Variance Adaptation), a framework that bridges between vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence. Our implementation is available at https://github.com/Tsedao/Decoupled-Variance-Adaptation

Comment: Optimization/Efficiency: decouples variance adaptation and scale-invariant terms (DeVA), bridging Adam-like methods with matrix spectral optimizers for faster large-scale training.

Relevance: 8 Novelty: 8
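
Sketch for item 24: the abstract does not give the exact decomposition, so the example below shows only an elementary coordinate-wise identity that may help in reading it: an AdaGrad-style direction g / sqrt(v) can be written as sign(g) times |g| / sqrt(v), where the sign part is unchanged when the gradient is rescaled by a positive constant. Treating these two factors as the "scale-invariant" and "variance adaptation" terms is an assumption made here for illustration, and the helper names are hypothetical.

```python
import math


def adagrad_direction(grad: list[float], second_moment: list[float],
                      eps: float = 1e-12) -> list[float]:
    """Coordinate-wise g / (sqrt(v) + eps)."""
    return [g / (math.sqrt(v) + eps) for g, v in zip(grad, second_moment)]


def split_update(grad: list[float], second_moment: list[float],
                 eps: float = 1e-12) -> tuple[list[int], list[float]]:
    """Split the direction into a sign factor and a non-negative magnitude factor."""
    signs = [(g > 0) - (g < 0) for g in grad]  # -1, 0, or +1
    scale = [abs(g) / (math.sqrt(v) + eps) for g, v in zip(grad, second_moment)]
    return signs, scale


grad = [0.5, -2.0, 0.0]
second_moment = [0.25, 16.0, 1.0]  # hypothetical accumulated squared gradients

signs, scale = split_update(grad, second_moment)
recombined = [s * a for s, a in zip(signs, scale)]
print(recombined, adagrad_direction(grad, second_moment))

# Rescaling the gradient by a positive constant leaves the sign factor unchanged.
rescaled_signs, _ = split_update([10.0 * g for g in grad], second_moment)
print(signs == rescaled_signs)
```

For matrices, the analogue of the sign factor in spectral optimizers is an orthogonalized update; that part is not shown here.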

25. Algebraic Robustness Verification of Neural Networks

ArXiv ID: 2602.06105

Authors: Yulia Alexandr, Hao Duan, Guido Montúfar

Abstract: We formulate formal robustness verification of neural networks as an algebraic optimization problem. We leverage the Euclidean Distance (ED) degree, which is the generic number of complex critical points of the distance minimization problem to a classifier's decision boundary, as an architecture-dependent measure of the intrinsic complexity of robustness verification. To make this notion operational, we define the associated ED discriminant, which characterizes input points at which the number of real critical points changes, distinguishing test instances that are easier or harder to verify. We provide an explicit algorithm for computing this discriminant. We further introduce the parameter discriminant of a neural network, identifying parameters where the ED degree drops and the decision boundary exhibits reduced algebraic complexity. We derive closed-form expressions for the ED degree for several classes of neural architectures, as well as formulas for the expected number of real critical points in the infinite-width limit. Finally, we present an exact robustness certification algorithm based on numerical homotopy continuation, establishing a concrete link between metric algebraic geometry and neural network verification.

Comment: Theory for Robustness Verification: formulates verification via ED degree/discriminants and provides an exact certification algorithm via homotopy; architecture-dependent complexity measure.

Relevance: 8 Novelty: 8
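
Sketch for item 25: the simplest instance of the distance-minimization problem in the abstract is a linear classifier f(x) = w·x + b, whose decision boundary is a hyperplane. There the problem has a single critical point, the orthogonal projection, and the robustness radius is |f(x)| / ||w||. The code below covers only this textbook case, not the homotopy-continuation algorithm; the weights and test point are made-up numbers.

```python
import math


def linear_score(w: list[float], b: float, x: list[float]) -> float:
    """Evaluate f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_boundary(w: list[float], b: float, x: list[float]) -> float:
    """Euclidean distance from x to the hyperplane {z : w . z + b = 0}."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("w must be non-zero")
    return abs(linear_score(w, b, x)) / norm


def nearest_boundary_point(w: list[float], b: float, x: list[float]) -> list[float]:
    """Orthogonal projection of x onto the decision boundary."""
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0:
        raise ValueError("w must be non-zero")
    step = linear_score(w, b, x) / norm_sq
    return [xi - step * wi for xi, wi in zip(x, w)]


w, b, x = [3.0, 4.0], -5.0, [3.0, 4.0]  # hypothetical classifier and input
radius = distance_to_boundary(w, b, x)
closest = nearest_boundary_point(w, b, x)
print(radius, closest)

# x is certified robust to any L2 perturbation strictly smaller than `radius`.
epsilon = 1.0
print("certified at epsilon =", epsilon, ":", radius > epsilon)
```

For nonlinear decision boundaries there can be many critical points, which is the complexity the ED degree is meant to count.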

26. HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction

ArXiv ID: 2602.06527

Authors: Shengxuan Qiu, Haochen Huang, Shuzhang Zhong, Pengfei Zuo, Meng Li

Abstract: Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.

Comment: MoE Efficiency: training-free online expand–reduce control for multi-path decoding in mixture-of-experts LLMs, reallocating compute under fixed budgets.