Personalized Daily ArXiv Papers 2025-10-02

[gpt-5]	Prompt	Completion	Total
Token	73396	66962	140358
Cost	$0.09	$0.67	$0.76

Total arXiv papers: 759

Total scanned papers: 487

Total relevant papers: 43

Table of contents with paper titles:

A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws Authors: Hong-Yi Wang, Di Luo, Tomaso Poggio, Isaac L. Chuang, Liu Ziyin
Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space Authors: Houjun Liu, Shikhar Murty, Christopher D. Manning, R\'obert Csord\'as
Cutting the Skip: Training Residual-Free Transformers Authors: Yiping Ji, James Martens, Jianqiao Zheng, Ziqin Zhou, Peyman Moghadam, Xinyu Zhang, Hemanth Saratchandran, Simon Lucey
PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning Authors: Xin Yu, Cong Xie, Ziyu Zhao, Tiantian Fan, Lingzhou Xue, Zhi Zhang
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features Authors: Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu
Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs Authors: Leyla Mirvakhabova, Babak Ehteshami Bejnordi, Gaurav Kumar, Hanxue Liang, Wanru Zhao, Paul Whatmough
Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time Authors: Blake Bordelon, Mary I. Letey, Cengiz Pehlevan
Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning Authors: Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama
Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space? Authors: Nandan Kumar Jha, Brandon Reagen
Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity Authors: Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri
LoRAFusion: Efficient LoRA Fine-Tuning for LLMs Authors: Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, Gennady Pekhimenko
Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models Authors: JingChuan Guan, Tomoyuki Kubota, Yasuo Kuniyoshi, Kohei Nakajima
On Predictability of Reinforcement Learning Dynamics for Large Language Models Authors: Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang
Composer: A Search Framework for Hybrid Neural Architecture Design Authors: Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J. Yadwadkar, Carole-Jean Wu
Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning Authors: Zeru Shi, Yingjia Wan, Zhenting Wang, Qifan Wang, Fan Yang, Elisa Kreiss, Ruixiang Tang
Randomized Matrix Sketching for Neural Network Training and Gradient Monitoring Authors: Harbir Antil, Deepanshu Verma
Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis Authors: Hongkang Li, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Meng Wang
Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs Authors: Kairun Zhang, Haoyu Li, Yanjun Zhao, Yifan Sun, Huan Zhang
Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling Authors: Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang
Feature Identification via the Empirical NTK Authors: Jennifer Lin
Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling Authors: Federico Tiblias, Irina Bigoulaeva, Jingcheng Niu, Simone Balloccu, Iryna Gurevych
The Pitfalls of KV Cache Compression Authors: Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel
Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks Authors: Hanjiang Hu, Bowei Li, Ziwei Wang, Tianhao Wei, Casidhe Hutchison, Eric Sample, Changliu Liu
ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models Authors: Dongqi Zheng
A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models Authors: Leah Bar, Liron Mor Yosef, Shai Zucker, Neta Shoham, Inbar Seroussi, Nir Sochen
On the Benefits of Weight Normalization for Overparameterized Matrix Sensing Authors: Yudong Wei, Liang Zhang, Bingcong Li, Niao He
Rethinking Thinking Tokens: LLMs as Improvement Operators Authors: Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal
Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models Authors: Shutong Wu, Jiawei Zhang
Learning Inter-Atomic Potentials without Explicit Equivariance Authors: Ahmed A. Elhag, Arun Raja, Alex Morehead, Samuel M. Blau, Garrett M. Morris, Michael M. Bronstein
Nonparametric Identification of Latent Concepts Authors: Yujia Zheng, Shaoan Xie, Kun Zhang
Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls Authors: Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Vi\'egas, Martin Wattenberg, Andrew Lee
GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling Authors: Jose I. Mestre, Alberto Fern\'andez-Hern\'andez, Cristian P\'erez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ort\'i
Generalized Parallel Scaling with Interdependent Generations Authors: Harry Dong, David Brandfonbrener, Eryk Helenowski, Yun He, Mrinal Kumar, Han Fang, Yuejie Chi, Karthik Abinav Sankararaman
Learning Energy-based Variational Latent Prior for VAEs Authors: Debottam Dutta, Chaitanya Amballa, Zhongweiyang Xu, Yu-Lin Wei, Romit Roy Choudhury
Equivariant Geometric Scattering Networks via Vector Diffusion Wavelets Authors: David R. Johnson, Rishabh Anand, Smita Krishnaswamy, Michael Perlmutter
Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum Authors: Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong
Large Language Models Inference Engines based on Spiking Neural Networks Authors: Adarsha Balaji, Sandeep Madireddy
Continual Learning with Query-Only Attention Authors: Gautham Bekal, Ashish Pujari, Scott David Kelly
Exploring System 1 and 2 communication for latent reasoning in LLMs Authors: Julian Coda-Forno, Zhuokai Zhao, Qiang Zhang, Dipesh Tamboli, Weiwei Li, Xiangjun Fan, Lizhu Zhang, Eric Schulz, Hsiao-Ping Tseng
Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness Authors: Tsubasa Takahashi, Shojiro Yamabe, Futa Waseda, Kento Sasaki
Geometric Properties of Neural Multivariate Regression Authors: George Andriopoulos, Zixuan Dong, Bimarsha Adhikari, Keith Ross
BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner Authors: Hengkui Wu, Liujiang Liu, Jihua He, Qihao Wang, Keke Zhao, Shuyang Hu, Renle Fu, Dahao Liang, Lingyu Zeng, Bruce Liu, Yuan Liu, Jin Zhan, Jiaqiang Niu, Xinglong Jia, Yaqin Hu, Wenjun Ji, Panpan Chi, Ken Chen, Hengyuan Wu, Yingsi Xin, Yongfeng Zhu, Yuexin Wang, Manqi Ruan, Ningtao Bian, Xiaohua Wu, Weipeng Xu
ACON: Optimizing Context Compression for Long-horizon LLM Agents Authors: Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan

1. A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws

ArXiv ID: 2510.00504

Authors: Hong-Yi Wang, Di Luo, Tomaso Poggio, Isaac L. Chuang, Liu Ziyin

Abstract: When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the \textit{dynamical} lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L\sim d^{-\alpha}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.

Comment: Compression/Efficiency Theory: proves polylogarithmic compression of models and datasets, establishing a dynamical lottery ticket hypothesis and boosted scaling laws.

Relevance: 10 Novelty: 9

2. Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space

ArXiv ID: 2510.00219

Authors: Houjun Liu, Shikhar Murty, Christopher D. Manning, R\'obert Csord\'as

Abstract: Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and are limited to only serially-generated, natural-language verbalization to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a "bubble" of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs as well as non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA after pretraining across 150M to 772M parameter scales. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train and test-time behavior for reasoning models.

Comment: Model Architecture: introduces adaptive parallel computation in transformers by forking/deleting residual streams learned during pretraining (dynamic networks).

Relevance: 10 Novelty: 9

3. Cutting the Skip: Training Residual-Free Transformers

ArXiv ID: 2510.00345

Authors: Yiping Ji, James Martens, Jianqiao Zheng, Ziqin Zhou, Peyman Moghadam, Xinyu Zhang, Hemanth Saratchandran, Simon Lucey

Abstract: Transformers have achieved remarkable success across a wide range of applications, a feat often attributed to their scalability. Yet training them without skip (residual) connections remains notoriously difficult. While skips stabilize optimization, they also disrupt the hierarchical structure of representations, raising the long-standing question of whether transformers can be trained efficiently without them. In this work, we address this problem by analyzing the Jacobian of a skipless transformer block, showing why skips improve conditioning and revealing that their stabilization benefits can be recovered through a principled initialization strategy. Building on this insight, we introduce the first method that enables stable and efficient training of skipless transformers without altering the standard architecture. We validate our approach on Vision Transformers (ViTs) in both supervised and self-supervised settings, demonstrating that skipless ViTs trained with our initialization overcome the usual optimization barriers, learn richer hierarchical representations, and outperform strong baselines, that incorporate skip connections, on dense prediction benchmarks. These results show that skip connections are not a fundamental requirement for training ViTs and open new avenues for hierarchical representation learning in vision models.

Comment: Model Architecture/Training Dynamics: enables stable training of residual-free transformers via principled initialization based on Jacobian conditioning analysis.

Relevance: 10 Novelty: 8

4. PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning

ArXiv ID: 2510.00192

Authors: Xin Yu, Cong Xie, Ziyu Zhao, Tiantian Fan, Lingzhou Xue, Zhi Zhang

Abstract: Low-rank adaptation (LoRA) has become a widely used paradigm for parameter-efficient fine-tuning of large language models, yet its representational capacity often lags behind full fine-tuning. Within the context of LoRA, a key open question is how to obtain expressive low-rank adapters from over-parameterized spaces. We propose \textit{PrunedLoRA}, a new framework that leverages structured pruning to obtain highly representative low-rank adapters from an over-parameterized initialization. Unlike prior approaches that impose a fixed low-rank budget, PrunedLoRA dynamically prunes less important components during fine-tuning and prevents their reactivation, enabling flexible and adaptive rank allocation. For structured pruning, by minimizing the pruning error for overall loss, we provide fine-grained pruning and recovery updates in a gradient-based pruning strategy with grounded interpretation. We provide the first theoretical analysis of the robustness of structured pruning and provably show that under the impact of weight perturbation, gradient-based pruning is more robust than activation-based pruning with respect to overall loss. Empirically, PrunedLoRA consistently outperforms LoRA and its variants across supervised fine-tuning tasks in mathematical reasoning, code generation, and natural language understanding, and it also demonstrates advantages over existing structured pruning methods across diverse sparsity levels.

Comment: Model Compression and Efficiency—structured pruning within low-rank adapters (LoRA) with theoretical robustness analysis and dynamic rank allocation.

Relevance: 10 Novelty: 8

5. AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

ArXiv ID: 2510.00404

Authors: Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu

Abstract: Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there remains no principled framework to derive SAEs from the original dictionary learning formulation. In this work, we introduce such a framework by unrolling the proximal gradient method for sparse coding. We show that a single-step update naturally recovers common SAE variants, including ReLU, JumpReLU, and TopK. Through this lens, we reveal a fundamental limitation of existing SAEs: their sparsity-inducing regularizers enforce non-negativity, preventing a single feature from representing bidirectional concepts (e.g., male vs. female). This structural constraint fragments semantic axes into separate, redundant features, limiting representational completeness. To address this issue, we propose AbsTopK SAE, a new variant derived from the $\ell_0$ sparsity constraint that applies hard thresholding over the largest-magnitude activations. By preserving both positive and negative activations, AbsTopK uncovers richer, bidirectional conceptual representations. Comprehensive experiments across four LLMs and seven probing and steering tasks show that AbsTopK improves reconstruction fidelity, enhances interpretability, and enables single features to encode contrasting concepts. Remarkably, AbsTopK matches or even surpasses the Difference-in-Mean method, a supervised approach that requires labeled data for each concept and has been shown in prior work to outperform SAEs.

Comment: Representation Learning and Sparsity: derives SAEs from proximal gradient unrolling and introduces AbsTopK (|·|-TopK) to recover bidirectional features under an ℓ0-inspired sparsity constraint.

Relevance: 10 Novelty: 8

6. Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs

ArXiv ID: 2510.01185

Authors: Leyla Mirvakhabova, Babak Ehteshami Bejnordi, Gaurav Kumar, Hanxue Liang, Wanru Zhao, Paul Whatmough

Abstract: Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, hindering performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention; notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, Llama3.2 LLM backbones) show DPSL consistently outperforms upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models.

Comment: MoE: router regularization via Dirichlet-prior shaping to improve expert balance and specialization in upcycled sparse MoEs.

Relevance: 10 Novelty: 8

7. Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time

ArXiv ID: 2510.01098

Authors: Blake Bordelon, Mary I. Letey, Cengiz Pehlevan

Abstract: We study in-context learning (ICL) of linear regression in a deep linear self-attention model, characterizing how performance depends on various computational and statistical resources (width, depth, number of training steps, batch size and data per context). In a joint limit where data dimension, context length, and residual stream width scale proportionally, we analyze the limiting asymptotics for three ICL settings: (1) isotropic covariates and tasks (ISO), (2) fixed and structured covariance (FS), and (3) where covariances are randomly rotated and structured (RRS). For ISO and FS settings, we find that depth only aids ICL performance if context length is limited. Alternatively, in the RRS setting where covariances change across contexts, increasing the depth leads to significant improvements in ICL, even at infinite context length. This provides a new solvable toy model of neural scaling laws which depends on both width and depth of a transformer and predicts an optimal transformer shape as a function of compute. This toy model enables computation of exact asymptotics for the risk as well as derivation of powerlaws under source/capacity conditions for the ICL tasks.

Comment: Model Architecture/Representation Learning: theoretical scaling laws for deep linear self-attention (depth vs width vs context) and training dynamics.

Relevance: 9 Novelty: 9

8. Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning

ArXiv ID: 2510.00570

Authors: Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama

Abstract: Mixture-of-Experts (MoE) has emerged as a powerful framework for multi-task learning (MTL). However, existing MoE-MTL methods often rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing during the transition from single-task to multi-task learning (STL to MTL). To address these limitations, we propose adaptive shared experts (ASE) within a low-rank adaptation (LoRA) based MoE, where shared experts are assigned router-computed gating weights jointly normalized with sparse experts. This design facilitates STL to MTL transition, enhances expert specialization, and cooperation. Furthermore, we incorporate fine-grained experts by increasing the number of LoRA experts while proportionally reducing their rank, enabling more effective knowledge sharing under a comparable parameter budget. Extensive experiments on the PASCAL-Context benchmark, under unified training settings, demonstrate that ASE consistently improves performance across diverse configurations and validates the effectiveness of fine-grained designs for MTL.

Comment: Model Architecture and Efficiency: MoE with adaptive shared experts and LoRA-based fine-grained low-rank experts for multi-task learning.

Relevance: 10 Novelty: 7

9. Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

ArXiv ID: 2510.00537

Authors: Nandan Kumar Jha, Brandon Reagen

Abstract: As large language models (LLMs) scale, the question is not only how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite -- Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI) -- we quantify how many latent directions are meaningfully activated across LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.

Comment: Representation/Architecture Analysis: introduces spectral utilization diagnostics (hard/soft rank, concentration, SUI) revealing FFN latent-space scaling laws.

Relevance: 9 Novelty: 8

10. Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

ArXiv ID: 2510.00304

Authors: Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri

Abstract: Deep learning models excel in stationary data but struggle in non-stationary environments due to a phenomenon known as loss of plasticity (LoP), the degradation of their ability to learn in the future. This work presents a first-principles investigation of LoP in gradient-based learning. Grounded in dynamical systems theory, we formally define LoP by identifying stable manifolds in the parameter space that trap gradient trajectories. Our analysis reveals two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Our framework uncovers a fundamental tension: properties that promote generalization in static settings, such as low-rank representations and simplicity biases, directly contribute to LoP in continual learning scenarios. We validate our theoretical analysis with numerical simulations and explore architectural choices or targeted perturbations as potential mitigation strategies.

Comment: Representation Learning and Training Dynamics: provides a mathematical framework for loss of plasticity, identifying frozen units and cloned-unit manifolds and linking to low-rank/simplicity biases.

Relevance: 9 Novelty: 8

11. LoRAFusion: Efficient LoRA Fine-Tuning for LLMs

ArXiv ID: 2510.00206

Authors: Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, Gennady Pekhimenko

Abstract: Low-Rank Adaptation (LoRA) has become the leading Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs), as it significantly reduces GPU memory usage while maintaining competitive fine-tuned model quality on downstream tasks. Despite these benefits, we identify two key inefficiencies in existing LoRA fine-tuning systems. First, they incur substantial runtime overhead due to redundant memory accesses on large activation tensors. Second, they miss the opportunity to concurrently fine-tune multiple independent LoRA adapters that share the same base model on the same set of GPUs. This leads to missed performance gains such as reduced pipeline bubbles, better communication overlap, and improved GPU load balance. To address these issues, we introduce LoRAFusion, an efficient LoRA fine-tuning system for LLMs. At the kernel level, we propose a graph-splitting method that fuses memory-bound operations. This design eliminates unnecessary memory accesses and preserves the performance of compute-bound GEMMs without incurring the cost of recomputation or synchronization. At the scheduling level, LoRAFusion introduces an adaptive batching algorithm for multi-job fine-tuning. It first splits LoRA adapters into groups to intentionally stagger batch execution across jobs, and then solves a bin-packing problem within each group to generate balanced, dependency-aware microbatches. LoRAFusion achieves up to $1.96\times$ ($1.47\times$ on average) end-to-end speedup compared to Megatron-LM, and up to $1.46\times$ ($1.29\times$ on average) improvement over mLoRA, the state-of-the-art multi-LoRA fine-tuning system. Our fused kernel achieves up to $1.39\times$ ($1.27\times$ on average) kernel performance improvement and can directly serve as a plug-and-play replacement in existing LoRA systems. We open-source LoRAFusion at https://github.com/CentML/lorafusion.

Comment: High Performance Computing and Efficiency: fused kernels for LoRA and adaptive multi-job scheduling for concurrent fine-tuning; systems-level innovation for PEFT.

Relevance: 9 Novelty: 8

12. Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models

ArXiv ID: 2510.00563

Authors: JingChuan Guan, Tomoyuki Kubota, Yasuo Kuniyoshi, Kohei Nakajima

Abstract: State space models (SSMs) have gained attention by showing potential to outperform Transformers. However, previous studies have not sufficiently addressed the mechanisms underlying their high performance owing to a lack of theoretical explanation of SSMs' learning dynamics. In this study, we provide such an explanation and propose an improved training strategy. The memory capacity of SSMs can be evaluated by examining how input time series are stored in their current state. Such an examination reveals a tradeoff between memory accuracy and length, as well as the theoretical equivalence between the structured state space sequence model (S4) and a simplified S4 with diagonal recurrent weights. This theoretical foundation allows us to elucidate the learning dynamics, proving the importance of initial parameters. Our analytical results suggest that successful learning requires the initial memory structure to be the longest possible even if memory accuracy may deteriorate or the gradient lose the teacher information. Experiments on tasks requiring long memory confirmed that extending memory is difficult, emphasizing the importance of initialization. Furthermore, we found that fixing recurrent weights can be more advantageous than adapting them because it achieves comparable or even higher performance with faster convergence. Our results provide a new theoretical foundation for SSMs and potentially offer a novel optimization strategy.

Comment: Model Architecture: theoretical analysis of SSM learning dynamics and an initialization/weight-freezing optimization strategy.

Relevance: 9 Novelty: 8

13. On Predictability of Reinforcement Learning Dynamics for Large Language Models

ArXiv ID: 2510.00553

Authors: Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang

Abstract: Recent advances in reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99\% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5 speedup while retaining \textgreater 96\% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward principled, interpretable, and efficient training paradigm for LLMs.

Comment: Representation Learning/Training Dynamics: identifies low-rank (rank-1) structure in RL-induced parameter updates and exploits it for efficient training speedups.

Relevance: 9 Novelty: 8

14. Composer: A Search Framework for Hybrid Neural Architecture Design

ArXiv ID: 2510.00379

Authors: Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J. Yadwadkar, Carole-Jean Wu

Abstract: Hybrid model architectures that combine computational primitives (e.g., Attention, MLP) in different ratios have shown promising performance beyond Transformers. Some studies have shown that different interleavings of primitives can affect model quality as well. However, prior works explore the hybrid model architecture design space manually. Due to the large design space and training costs, discovering hybrid models that combine key computational primitives for pre-training is challenging. In this work, we take a principled approach in designing a modular hybrid model architecture search framework -- Composer. Composer explores model architectures at a small scale and extrapolates the top-performing model architectures to a larger scale using our proposed scaling strategies. Using Composer, we discover new hybrid LLM architectures that outperform Llama 3.2. Compared to Llama 3.2 and previous state-of-the-art baselines, the new model architectures consistently reduce validation loss at parameter scales of 350M-3B and improve evaluation accuracy on the downstream tasks by up to 2.8-8.3% (1.1-3.1% on average) while improving both training and inference efficiency.

Comment: Model Architecture and Efficiency: framework for searching hybrid Attention/MLP architectures with scalable extrapolation strategies for LLMs.

Relevance: 9 Novelty: 8

15. Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning

ArXiv ID: 2510.01032

Authors: Zeru Shi, Yingjia Wan, Zhenting Wang, Qifan Wang, Fan Yang, Elisa Kreiss, Ruixiang Tang

Abstract: Motivated by the puzzling observation that inserting long sequences of meaningless tokens before the query prompt can consistently enhance LLM reasoning performance, this work analyzes the underlying mechanism driving this phenomenon and based on these insights proposes a more principled method that allows for similar performance gains. First, we find that the improvements arise from a redistribution of activations in the LLM's MLP layers, where near zero activations become less frequent while large magnitude activations increase. This redistribution enhances the model's representational capacity by suppressing weak signals and promoting stronger, more informative ones. Building on this insight, we propose the Activation Redistribution Module (ARM), a lightweight inference-time technique that modifies activations directly without altering the input sequence. ARM adaptively identifies near-zero activations after the non-linear function and shifts them outward, implicitly reproducing the beneficial effects of meaningless tokens in a controlled manner. Extensive experiments across diverse benchmarks and model architectures clearly show that ARM consistently improves LLM performance on reasoning tasks while requiring only a few lines of simple code to implement. Our findings deliver both a clear mechanistic explanation for the unexpected benefits of meaningless tokens and a simple yet effective technique that harnesses activation redistribution to further improve LLM performance.

Comment: Model Architecture and Representation Learning: mechanistic analysis of MLP activation distributions and an inference-time activation redistribution module (ARM) that improves reasoning.

Relevance: 9 Novelty: 8

16. Randomized Matrix Sketching for Neural Network Training and Gradient Monitoring

ArXiv ID: 2510.00442

Authors: Harbir Antil, Deepanshu Verma

Abstract: Neural network training relies on gradient computation through backpropagation, yet memory requirements for storing layer activations present significant scalability challenges. We present the first adaptation of control-theoretic matrix sketching to neural network layer activations, enabling memory-efficient gradient reconstruction in backpropagation. This work builds on recent matrix sketching frameworks for dynamic optimization problems, where similar state trajectory storage challenges motivate sketching techniques. Our approach sketches layer activations using three complementary sketch matrices maintained through exponential moving averages (EMA) with adaptive rank adjustment, automatically balancing memory efficiency against approximation quality. Empirical evaluation on MNIST, CIFAR-10, and physics-informed neural networks demonstrates a controllable accuracy-memory tradeoff. We demonstrate a gradient monitoring application on MNIST showing how sketched activations enable real-time gradient norm tracking with minimal memory overhead. These results establish that sketched activation storage provides a viable path toward memory-efficient neural network training and analysis.

Comment: Model Compression and Efficiency: adapts matrix sketching to layer activations for memory-efficient backprop and gradient monitoring, enabling reduced activation storage.

Relevance: 9 Novelty: 8

17. Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis

ArXiv ID: 2510.00399

Authors: Hongkang Li, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Meng Wang

Abstract: The Mamba model has gained significant attention for its computational advantages over Transformer-based models, while achieving comparable performance across a wide range of language tasks. Like Transformers, Mamba exhibits in-context learning (ICL) capabilities, i.e., making predictions for new tasks based on a prompt containing input-label pairs and a query, without requiring fine-tuning. Despite its empirical success, the theoretical understanding of Mamba remains limited, largely due to the nonlinearity introduced by its gating mechanism. To the best of our knowledge, this paper presents the first theoretical analysis of the training dynamics of a one-layer Mamba model, which consists of a linear attention component followed by a nonlinear gating layer, and its ICL generalization on unseen binary classification tasks, even when the prompt includes additive outliers. Our analysis shows that Mamba leverages the linear attention layer to select informative context examples and uses the nonlinear gating layer to suppress the influence of outliers. By establishing and comparing to the analysis of linear Transformers under the same setting, we show that although Mamba may require more training iterations to converge, it maintains accurate predictions even when the proportion of outliers exceeds the threshold that a linear Transformer can tolerate. These theoretical findings are supported by empirical experiments.

Comment: Model Architecture/Theory: first theoretical analysis of one-layer Mamba’s ICL generalization with outliers, contrasting linear attention vs. nonlinear gating.

Relevance: 9 Novelty: 7

18. Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs

ArXiv ID: 2510.00419

Authors: Kairun Zhang, Haoyu Li, Yanjun Zhao, Yifan Sun, Huan Zhang

Abstract: Zeroth-order optimizers have recently emerged as a practical approach for fine-tuning large language models (LLMs), significantly reducing GPU memory consumption compared to traditional first-order methods. Yet, existing zeroth-order methods rely on hand-crafted, static sampling strategies that are not adaptable to model-specific structures. To address this, we propose ZO Fine-tuner, a learning-based zeroth-order optimizer for LLMs that automatically learns efficient perturbation strategies through a compact and memory-efficient design. Crucially, our approach is motivated by the observation that only a small number of foundation models and their derivatives are widely adopted in practice. Therefore, learning the optimizer once for a given LLM and reusing it across diverse downstream tasks is both feasible and highly desirable. Accordingly, ZO Fine-tuner is designed to scale learning to learn (L2L) to the foundation-model era by supporting one-time training per LLM with minimal overhead. Experiments on 4 LLMs and 7 datasets show that ZO Fine-tuner outperforms prior zeroth-order baselines in 82.1\% of task-model combinations, thereby demonstrating strong performance and scalability for efficient LLM fine-tuning. Our code is available at https://github.com/ASTRAL-Group/ZO_Fine_tuner.git.

Comment: High Performance/Efficiency—learning-based zeroth-order optimizer for LLM fine-tuning reducing memory with L2L-style perturbation strategies.

Relevance: 9 Novelty: 7

19. Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling

ArXiv ID: 2510.00028

Authors: Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang

Abstract: Extending the context window support of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, enable longer input length support without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy from axis-aligned quantizers vs. rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of the PI+PTQ approach and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Following the analysis results, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for Key and Query weights (with an optional symmetric variant to preserve logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning to the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces the model's perplexity on long-context workloads by more than 14%, while preserving short-context performance, inference throughput, and compatibility with existing LLM system stacks.

Comment: Model Compression and Efficiency: analyzes RoPE interpolation under post-training quantization and proposes an interpolation-aware, per-band weight rescaling (Q-ROAR) guided by new diagnostics.

Relevance: 9 Novelty: 7

20. Feature Identification via the Empirical NTK

ArXiv ID: 2510.00468

Authors: Jennifer Lin

Abstract: We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface the features used by trained neural networks. Across two standard toy models for mechanistic interpretability, Toy Models of Superposition (TMS) and a 1-layer MLP trained on modular addition, we find that the eNTK exhibits sharp spectral cliffs whose top eigenspaces align with ground-truth features. In TMS, the eNTK recovers the ground-truth features in both the sparse (high superposition) and dense regimes. In modular arithmetic, the eNTK can be used to recover Fourier feature families. Moreover, we provide evidence that a layerwise eNTK localizes features to specific layers and that the evolution of the eNTK eigenspectrum can be used to diagnose the grokking phase transition. These results suggest that eNTK analysis may provide a practical handle for feature discovery and for detecting phase changes in small models.

Comment: Representation Learning/Training dynamics: empirical NTK eigenanalysis surfaces learned features and tracks grokking phase changes.

Relevance: 9 Novelty: 7

21. Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling

ArXiv ID: 2510.01025

Authors: Federico Tiblias, Irina Bigoulaeva, Jingcheng Niu, Simone Balloccu, Iryna Gurevych

Abstract: The linear representation hypothesis states that language models (LMs) encode concepts as directions in their latent space, forming organized, multidimensional manifolds. Prior efforts focus on discovering specific geometries for specific features, and thus lack generalization. We introduce Supervised Multi-Dimensional Scaling (SMDS), a model-agnostic method to automatically discover feature manifolds. We apply SMDS to temporal reasoning as a case study, finding that different features form various geometric structures such as circles, lines, and clusters. SMDS reveals many insights on these structures: they consistently reflect the properties of the concepts they represent; are stable across model families and sizes; actively support reasoning in models; and dynamically reshape in response to context changes. Together, our findings shed light on the functional role of feature manifolds, supporting a model of entity-based reasoning in which LMs encode and transform structured representations.

Comment: Representation Learning: SMDS method to discover and analyze feature manifolds in LLM latent space.

Relevance: 9 Novelty: 7

22. The Pitfalls of KV Cache Compression

ArXiv ID: 2510.00231

Authors: Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel

Abstract: KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls practitioners should be aware of when deploying KV cache compressed LLMs. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example of that, we highlight system prompt leakage as a case study, empirically showing the impact of compression on leakage and general instruction following. We show several factors that play a role in prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.

Comment: Model Compression and Efficiency: critical analysis of KV cache compression with improved eviction policies for multi-instruction prompting in LLMs.

Relevance: 9 Novelty: 7

23. Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks

ArXiv ID: 2510.00083

Authors: Hanjiang Hu, Bowei Li, Ziwei Wang, Tianhao Wei, Casidhe Hutchison, Eric Sample, Changliu Liu

Abstract: Deep neural networks have been widely adopted in many vision and robotics applications with visual inputs. It is essential to verify its robustness against semantic transformation perturbations, such as brightness and contrast. However, current certified training and robustness certification methods face the challenge of over-parameterization, which hinders the tightness and scalability due to the over-complicated neural networks. To this end, we first analyze stability and variance of layers and neurons against input perturbation, showing that certifiable robustness can be indicated by a fundamental Unbiased and Smooth Neuron metric (USN). Based on USN, we introduce a novel neural network pruning method that removes neurons with low USN and retains those with high USN, thereby preserving model expressiveness without over-parameterization. To further enhance this pruning process, we propose a new Wasserstein distance loss to ensure that pruned neurons are more concentrated across layers. We validate our approach through extensive experiments on the challenging robust keypoint detection task, which involves realistic brightness and contrast perturbations, demonstrating that our method achieves superior robustness certification performance and efficiency compared to baselines.

Comment: Model Compression and Efficiency: robust pruning guided by an Unbiased and Smooth Neuron metric (USN) plus a Wasserstein loss to enhance certifiable robustness.

Relevance: 9 Novelty: 7

24. ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models

ArXiv ID: 2510.00071

Authors: Dongqi Zheng

Abstract: Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods face the challenge of balancing reasoning quality with inference cost reduction. We propose \textbf{Adaptive Reasoning Suppression (ARS)}, a novel training-free approach that dynamically suppresses redundant reasoning steps while preserving accuracy through adaptive certainty monitoring. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. Our extensive evaluation across mathematical reasoning benchmarks using multiple model architectures demonstrates that ARS achieves up to 53%, 46.1%, and 57.9% in token, latency and energy reduction, while maintaining or improving accuracy.

Comment: Model Compression and Efficiency: training-free adaptive suppression of reasoning steps for LRLMs to reduce tokens/latency while preserving accuracy.

Relevance: 9 Novelty: 7

25. A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models

ArXiv ID: 2510.00666

Authors: Leah Bar, Liron Mor Yosef, Shai Zucker, Neta Shoham, Inbar Seroussi, Nir Sochen

Abstract: The foundational premise of generative AI for images is the assumption that images are inherently low-dimensional objects embedded within a high-dimensional space. Additionally, it is often implicitly assumed that thematic image datasets form smooth or piecewise smooth manifolds. Common approaches overlook the geometric structure and focus solely on probabilistic methods, approximating the probability distribution through universal approximation techniques such as the kernel method. In some generative models, the low dimensional nature of the data manifest itself by the introduction of a lower dimensional latent space. Yet, the probability distribution in the latent or the manifold coordinate space is considered uninteresting and is predefined or considered uniform. This study unifies the geometric and probabilistic perspectives by providing a geometric framework and a kernel-based probabilistic method simultaneously. The resulting framework demystifies diffusion models by interpreting them as a projection mechanism onto the manifold of ``good images''. This interpretation leads to the construction of a new deterministic model, the Manifold-Probabilistic Projection Model (MPPM), which operates in both the representation (pixel) space and the latent space. We demonstrate that the Latent MPPM (LMPPM) outperforms the Latent Diffusion Model (LDM) across various datasets, achieving superior results in terms of image restoration and generation.

Comment: Model Architecture and Representation Learning: introduces a deterministic Manifold-Probabilistic Projection Model unifying geometric manifold structure with kernel-based probabilistic modeling, reinterpreting diffusion as projection.

Relevance: 8 Novelty: 8

26. On the Benefits of Weight Normalization for Overparameterized Matrix Sensing

ArXiv ID: 2510.01175

Authors: Yudong Wei, Liang Zhang, Bingcong Li, Niao He

Abstract: While normalization techniques are widely used in deep learning, their theoretical understanding remains relatively limited. In this work, we establish the benefits of (generalized) weight normalization (WN) applied to the overparameterized matrix sensing problem. We prove that WN with Riemannian optimization achieves linear convergence, yielding an exponential speedup over standard methods that do not use WN. Our analysis further demonstrates that both iteration and sample complexity improve polynomially as the level of overparameterization increases. To the best of our knowledge, this work provides the first characterization of how WN leverages overparameterization for faster convergence in matrix sensing.

Comment: Training dynamics/optimization analysis of weight normalization showing faster convergence in overparameterized matrix sensing (Representation Learning / Model Architecture analysis).

Relevance: 8 Novelty: 8

27. Rethinking Thinking Tokens: LLMs as Improvement Operators

ArXiv ID: 2510.01123

Authors: Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal

Abstract: Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts" with a continuum of possible strategies. We identify an interesting inference family Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via degree of parallelism, and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR) (iteratively improve a single candidate answer) which provides performance superior to long CoT. Success of such model orchestrations raises the question whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).

Comment: Inference efficiency and dynamic refinement: Parallel-Distill-Refine orchestrates bounded workspace and parallelism to improve accuracy-latency trade-offs (HPC/algorithmic efficiency; conditional computation).

Relevance: 8 Novelty: 8

28. Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models

ArXiv ID: 2510.00294

Authors: Shutong Wu, Jiawei Zhang

Abstract: Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Thanks to their bidirectional attention mechanism, DLLMs are more capable of capturing the connection of context, and thus show unique advantages in challenges like the famous "reversal curse" or learning under data-constrained scenarios. However, this bidirectional nature also brings an obstacle that DLLMs are not inherently compatible with KV Cache, and consequently, the inference efficiency is not competitive compared with autoregressive models. Taking advantage of their inherent capability of multi-token prediction, existing parallel decoding algorithms can speed up the DLLM inference, but at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (Freedave), a novel fast sampling algorithm tailored for DLLMs that achieves lossless parallel decoding. Specifically, we propose a pipeline of parallel-decoded candidate generation and verification, which is guaranteed to reproduce the same sequence generated by static sampling, without introducing extra model forward calls. By applying Freedave, the throughput of DLLMs can be boosted up to $2.8\times$ without performance degradation on math reasoning tasks.

Comment: Model efficiency/HPC: lossless parallel decoding for diffusion LLMs via draft-and-verify without extra forward passes; substantial inference speedup.

Relevance: 8 Novelty: 8

29. Learning Inter-Atomic Potentials without Explicit Equivariance

ArXiv ID: 2510.00027

Authors: Ahmed A. Elhag, Arun Raja, Alex Morehead, Samuel M. Blau, Garrett M. Morris, Michael M. Bronstein

Abstract: Accurate and scalable machine-learned inter-atomic potentials (MLIPs) are essential for molecular simulations ranging from drug discovery to new material design. Current state-of-the-art models enforce roto-translational symmetries through equivariant neural network architectures, a hard-wired inductive bias that can often lead to reduced flexibility, computational efficiency, and scalability. In this work, we introduce TransIP: Transformer-based Inter-Atomic Potentials, a novel training paradigm for interatomic potentials achieving symmetry compliance without explicit architectural constraints. Our approach guides a generic non-equivariant Transformer-based model to learn SO(3)-equivariance by optimizing its representations in the embedding space. Trained on the recent Open Molecules (OMol25) collection, a large and diverse molecular dataset built specifically for MLIPs and covering different types of molecules (including small organics, biomolecular fragments, and electrolyte-like species), TransIP attains comparable performance in machine-learning force fields versus state-of-the-art equivariant baselines. Further, compared to a data augmentation baseline, TransIP achieves 40% to 60% improvement in performance across varying OMol25 dataset sizes. More broadly, our work shows that learned equivariance can be a powerful and efficient alternative to equivariant or augmentation-based MLIP models.

Comment: Model Architecture/Representation Learning: learns SO(3) equivariance in a non-equivariant Transformer for inter-atomic potentials, avoiding hard-wired symmetry constraints.

Relevance: 8 Novelty: 8

30. Nonparametric Identification of Latent Concepts

ArXiv ID: 2510.00136

Authors: Yujia Zheng, Shaoan Xie, Kun Zhang

Abstract: We are born with the ability to learn concepts by comparing diverse observations. This helps us to understand the new world in a compositional manner and facilitates extrapolation, as objects naturally consist of multiple concepts. In this work, we argue that the cognitive mechanism of comparison, fundamental to human learning, is also vital for machines to recover true concepts underlying the data. This offers correctness guarantees for the field of concept learning, which, despite its impressive empirical successes, still lacks general theoretical support. Specifically, we aim to develop a theoretical framework for the identifiability of concepts with multiple classes of observations. We show that with sufficient diversity across classes, hidden concepts can be identified without assuming specific concept types, functional relations, or parametric generative models. Interestingly, even when conditions are not globally satisfied, we can still provide alternative guarantees for as many concepts as possible based on local comparisons, thereby extending the applicability of our theory to more flexible scenarios. Moreover, the hidden structure between classes and concepts can also be identified nonparametrically. We validate our theoretical results in both synthetic and real-world settings.

Comment: Representation Learning: provides a nonparametric identifiability theory for latent concepts from multi-class observations, offering foundational guarantees on recovering representations.

Relevance: 8 Novelty: 8

31. Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

ArXiv ID: 2510.00184

Authors: Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Vi\'egas, Martin Wattenberg, Andrew Lee

Abstract: Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via \emph{implicit chain-of-thought}, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to cache'' andretrieve'' pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the ``running sum'' via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.

Comment: Representation Learning: reverse-engineers transformer mechanisms for long-range dependencies (attention DAG caching, Minkowski-sum digit geometry) and training dynamics with an auxiliary inductive-bias loss.