Personalized Daily ArXiv Papers 2025-09-24

[gpt-5]	Prompt	Completion	Total
Token	88113	89226	177339
Cost	$0.11	$0.89	$1.0

Total arXiv papers: 838

Total scanned papers: 487

Total relevant papers: 58

Table of contents with paper titles:

Circuit Complexity From Physical Constraints: Scaling Limitations of Attention Authors: Benjamin Prada, Ankur Mali
Qwen3-Omni Technical Report Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure Authors: Boao Kong, Junzhu Liang, Yuxi Liu, Renjia Deng, Kun Yuan
PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zheng Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong
Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts Authors: Qi Wang, Hanyang Peng, Yue Yu
PiMoE: Token-Level Routing for Integrating High-Precision Computation and Reasoning Authors: Hengbo Xiao, Jingyuan Fan, Xin Tong, Jingzhao Zhang, Chao Lu, Guannan He
Learning From Simulators: A Theory of Simulation-Grounded Learning Authors: Carson Dudley, Marisa Eisenberg
Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation Authors: Junzhuo Li, Bo Wang, Xiuze Zhou, Xuming Hu
Accurate and Efficient Low-Rank Model Merging in Core Space Authors: Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bart{\l}omiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer
Global Minimizers of Sigmoid Contrastive Loss Authors: Kiril Bangachev, Guy Bresler, Iliyas Noman, Yury Polyanskiy
Evolution of Concepts in Language Model Pre-Training Authors: Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu
nDNA -- the Semantic Helix of Artificial Cognition Authors: Amitava Das
Otters: An Energy-Efficient SpikingTransformer via Optical Time-to-First-Spike Encoding Authors: Zhanglu Yan, Jiayi Mao, Qianhui Liu, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Weng-Fai Wong
FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction Authors: Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen
HyperAdapt: Simple High-Rank Adaptation Authors: Abel Gurung, Joseph Campbell
Theory of periodic convolutional neural network Authors: Yuqing Liu
ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching Authors: Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu
SBVR: Summation of BitVector Representation for Efficient LLM Quantization Authors: Wonjun Bang, Jongseok Park, Hongseung Yu, Kyungmin Bin, Kyunghan Lee
KANO: Kolmogorov-Arnold Neural Operator Authors: Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong, Murphy Yuezhen Niu, Zheng Zhang
Understanding Post-Training Structural Changes in Large Language Models Authors: Xinyu He, Xianghui Cao
AdaMixT: Adaptive Weighted Mixture of Multi-Scale Expert Transformers for Time Series Forecasting Authors: Huanyao Zhang, Jiaye Lin, Wentao Zhang, Haitao Yuan, Guoliang Li
Self-Evolving LLMs via Continual Instruction Tuning Authors: Le Huang, Jiazheng Kang, Cheng Hou, Zhe Zhao, Zhenxiang Yan, Chuan Shi, Ting Bai
Adaptive Overclocking: Dynamic Control of Thinking Path Length via Real-Time Reasoning Signals Authors: Shuhao Jiang, Songbo Wang, Yang Qiao, Chun Xu, Chaoyang Zheng, Shengyi Zhou, Huanjun Wang, Fangming Li, Cong Zhang, Jiyu Wang
Confidence-gated training for efficient early-exit neural networks Authors: Saad Mokssit, Ouassim Karrakchou, Alejandro Mousist, Mounir Ghogho
Towards Provable Emergence of In-Context Reinforcement Learning Authors: Jiuqi Wang, Rohan Chandra, Shangtong Zhang
Sparse Training Scheme for Multimodal LLM Authors: Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang
Soft Tokens, Hard Truths Authors: Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier
Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws Authors: Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu
Pareto-optimal Tradeoffs Between Communication and Computation with Flexible Gradient Tracking Authors: Yan Huang, Jinming Xu, Li Chai, Jiming Chen, Karl H. Johansson
Interpreting vision transformers via residual replacement model Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang
Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization Authors: Nathan Egbuna, Saatvik Gaur, Sunishchal Dev, Ashwinee Panda, Maheep Chaudhary
Probabilistic Token Alignment for Large Language Model Fusion Authors: Runjia Zeng, James Chenhao Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu
Statistical Insight into Meta-Learning via Predictor Subspace Characterization and Quantification of Task Diversity Authors: Saptati Datta, Nicolas W. Hengartner, Yulia Pimonova, Natalie E. Klein, Nicholas Lubbers
Probabilistic Geometric Principal Component Analysis with application to neural data Authors: Han-Lin Hsieh, Maryam M. Shanechi
The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs Authors: Hinata Tezuka, Naoya Inoue
Recovering Wasserstein Distance Matrices from Few Measurements Authors: Muhammad Rana, Abiy Tasissa, HanQin Cai, Yakov Gavriyelov, Keaton Hamm
Exploring Heterophily in Graph-level Tasks Authors: Qinhan Hou, Yilun Zheng, Xichun Zhang, Sitao Luan, Jing Tang
Self Identity Mapping Authors: Xiuding Cai, Yaoyao Zhu, Linjie Fu, Dong Miao, Yu Yao
ConceptFlow: Hierarchical and Fine-grained Concept-Based Explanation for Convolutional Neural Networks Authors: Xinyu Mu, Hui Dou, Furao Shen, Jian Zhao
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels Authors: Junjie Ye, Yuming Yang, Yang Nan, Shuo Li, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan
Dendritic Resonate-and-Fire Neuron for Effective and Efficient Long Sequence Modeling Authors: Dehao Zhang, Malu Zhang, Shuai Wang, Jingya Wang, Wenjie Wei, Zeyu Ma, Guoqing Wang, Yang Yang, HaiZhou Li
MeshODENet: A Graph-Informed Neural Ordinary Differential Equation Neural Network for Simulating Mesh-Based Physical Systems Authors: Kangzheng Liu, Leixin Ma
Variational Task Vector Composition Authors: Boyuan Zhang, Yingjun Du, Xiantong Zhen, Ling Shao
Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise Authors: Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le
LightCode: Compiling LLM Inference for Photonic-Electronic Systems Authors: Ryan Tomich, Zhizhen Zhong, Dirk Englund
Shared-Weights Extender and Gradient Voting for Neural Network Expansion Authors: Nikolas Chatzis, Ioannis Kordonis, Manos Theodosis, Petros Maragos
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT Authors: Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, Anthony Hartshorn
Robust LLM Training Infrastructure at ByteDance Authors: Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe Authors: Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun
Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
Graph-based Clustering Revisited: A Relaxation of Kernel $k$-Means Perspective Authors: Wenlong Lyu, Yuheng Jia, Hui Liu, Junhui Hou
OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment Authors: Teng Xiao, Zuchao Li, Lefei Zhang
Energy-convergence trade off for the training of neural networks on bio-inspired hardware Authors: Nikhil Garg, Paul Uriarte Vicandi, Yanming Zhang, Alexandre Baigol, Donato Francesco Falcone, Saketh Ram Mamidala, Bert Jan Offrein, Laura B\'egon-Lours
PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality Authors: Byeongho Yu, Changhun Lee, Jungyu Jin, Eunhyeok Park
Subspace Clustering of Subspaces: Unifying Canonical Correlation Analysis and Subspace Clustering Authors: Paris A. Karakasis, Nicholas D. Sidiropoulos
Flow-Induced Diagonal Gaussian Processes Authors: Moule Lin, Andrea Patane, Weipeng Jing, Shuhao Guan, Goetz Botterweck
HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork Authors: Jindi Lv, Yuhao Zhou, Yuxin Tian, Qing Ye, Wentao Feng, Jiancheng Lv

1. Circuit Complexity From Physical Constraints: Scaling Limitations of Attention

ArXiv ID: 2509.19161

Authors: Benjamin Prada, Ankur Mali

Abstract: We argue that the standard circuit complexity measures derived from $NC, AC, TC$ provide limited practical information and are now insufficient to further differentiate model expressivity. To address these new limitations, we define a novel notion of local uniformity and a family of circuit complexity classes $RC(\cdot)$ that capture the fundamental constraints of scaling physical circuits. Through the lens of $RC(\cdot)$, we show that attention mechanisms with $\omega(n^{3/2})$ runtime cannot scale to accommodate the entropy of increasingly complex datasets. Our results simultaneously provide a methodology for defining meaningful bounds on transformer expressivity and naturally expose the restricted viability of attention.

Comment: Model Architecture/Theory: introduces new circuit complexity classes capturing physical constraints and derives scaling limits for attention-based Transformers.

Relevance: 10 Novelty: 9

2. Qwen3-Omni Technical Report

ArXiv ID: 2509.17765

Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin

Abstract: We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

Comment: Model Architecture: MoE (Thinker–Talker) unifying text/image/audio/video; Efficiency: low-latency streaming replacing diffusion with causal ConvNet and codebook prediction.

Relevance: 10 Novelty: 8

3. MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

ArXiv ID: 2509.17238

Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

Abstract: The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction.To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.

Comment: MoE inference-time method: training-free hyper-parallel token-level scaling (RoE) with efficient batching and specialized KV cache for better accuracy–compute tradeoffs.

Relevance: 10 Novelty: 8

4. CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

ArXiv ID: 2509.18993

Authors: Boao Kong, Junzhu Liang, Yuxi Liu, Renjia Deng, Kun Yuan

Abstract: Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose Cross-layer Low-Rank residual Network (CR-Net), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.

Comment: Cross-layer low-rank residual architecture with activation recomputation for memory/compute efficiency (Compression/Efficiency; Low-rank; Systems-level memory optimization).

Relevance: 10 Novelty: 8

5. PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models

ArXiv ID: 2509.16989

Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zheng Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong

Abstract: Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and model expressiveness. While existing ultra-low-bit PTQ methods rely on binary approximations or complex compensation mechanisms, they suffer from either limited representational capacity or computational overhead that undermines their efficiency gains. We introduce PTQ to Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit representation. PTQTP achieves multiplication-free inference, identical to 1-bit quantization, while maintaining superior expressiveness through its novel structured decomposition. Our approach provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment across diverse modern LLMs without architectural modifications; and (3) uniform ternary operations that eliminate the need for mixed-precision or compensation schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention versus 0% for competing approaches. PTQTP approaches and sometimes surpasses 1.58-bit quantization-aware training performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods. These results establish PTQTP as a practical solution for efficient LLM deployment in resource-constrained environments.

Comment: Model Compression and Efficiency: ternary trit-plane PTQ enabling multiplication-free inference with a progressive approximation guaranteeing global weight consistency.

Relevance: 10 Novelty: 8

6. Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

ArXiv ID: 2509.18542

Authors: Qi Wang, Hanyang Peng, Yue Yu

Abstract: Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To circumvent the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Llama2-Chat and Code Llama). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a single lightweight stage of router training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.

Comment: Model Architecture (MoE): builds a coherent MoE from disparate pretrained models via training-free functional alignment and router learning.

Relevance: 10 Novelty: 8

7. PiMoE: Token-Level Routing for Integrating High-Precision Computation and Reasoning

ArXiv ID: 2509.18169

Authors: Hengbo Xiao, Jingyuan Fan, Xin Tong, Jingzhao Zhang, Chao Lu, Guannan He

Abstract: Complex systems typically rely on high-precision numerical computation to support decisions, but current large language models (LLMs) cannot yet incorporate such computations as an intrinsic and interpretable capability with existing architectures. Mainstream multi-agent approaches can leverage external experts, but inevitably introduce communication overhead and suffer from inefficient multimodal emergent capability and limited scalability. To this end, we propose PiMoE (Physically-isolated Mixture of Experts), a training and inference architecture for integrating computation and reasoning. Instead of the workflow paradigm of tool invocation, PiMoE endogenously integrates computational capabilities into neural networks after separately training experts, a text-to-computation module, and a router. At inference, the router directs computation and reasoning at the token level, thereby enabling iterative alternation within a single chain of thought. We evaluate PiMoE on two reasoning-computation tasks against LLM finetuning and the multi-agent system approaches. Results show that the PiMoE architecture achieves not only higher accuracy than directly finetuning LLMs but also significant improvements in response latency, token usage, and GPU energy consumption compared with mainstream multi-agent approaches. PiMoE offers an efficient, interpretable, and scalable paradigm for next-generation scientific or industrial intelligent systems.

Comment: Model Architecture (MoE): token-level routing in a physically-isolated Mixture-of-Experts integrating computation modules with reasoning within a single chain of thought.

Relevance: 10 Novelty: 8

8. Learning From Simulators: A Theory of Simulation-Grounded Learning

ArXiv ID: 2509.18990

Authors: Carson Dudley, Marisa Eisenberg

Abstract: Simulation-Grounded Neural Networks (SGNNs) are predictive models trained entirely on synthetic data from mechanistic simulations. They have achieved state-of-the-art performance in domains where real-world labels are limited or unobserved, but lack a formal underpinning. We present the foundational theory of simulation-grounded learning. We show that SGNNs implement amortized Bayesian inference under a simulation prior and converge to the Bayes-optimal predictor. We derive generalization bounds under model misspecification and prove that SGNNs can learn unobservable scientific quantities that empirical methods provably cannot. We also formalize a novel form of mechanistic interpretability uniquely enabled by SGNNs: by attributing predictions to the simulated mechanisms that generated them, SGNNs yield posterior-consistent, scientifically grounded explanations. We provide numerical experiments to validate all theoretical predictions. SGNNs recover latent parameters, remain robust under mismatch, and outperform classical tools: in a model selection task, SGNNs achieve half the error of AIC in distinguishing mechanistic dynamics. These results establish SGNNs as a principled and practical framework for scientific prediction in data-limited regimes.

Comment: Representation Learning: foundational theory showing SGNNs perform amortized Bayesian inference under a simulation prior, with generalization bounds and mechanistic interpretability.

Relevance: 9 Novelty: 9

9. Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation

ArXiv ID: 2509.16882

Authors: Junzhuo Li, Bo Wang, Xiuze Zhou, Xuming Hu

Abstract: Mixture-of-Experts (MoE) models offer immense capacity via sparsely gated expert subnetworks, yet adapting them to multiple domains without catastrophic forgetting remains an open challenge. Existing approaches either incur prohibitive computation, suffer cross-domain interference, or require separate runs per domain. We propose DES-MoE, a dynamic expert specialization framework for multi-domain adaptation of Mixture-of-Experts models. DES-MoE addresses catastrophic forgetting through three innovations: (1) an adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation, (2) real-time expert-domain correlation mapping to isolate domain-specific gradients, and (3) a three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters. Evaluated on six domains (math, code, law, etc.), DES-MoE matches single-domain ESFT performance while training one unified model, reduces forgetting by 89% compared to full fine-tuning as domains scale from 2 to 6, and achieves 68% faster convergence than conventional methods. Our work establishes dynamic expert isolation as a scalable paradigm for multi-task MoE adaptation.

Comment: Model Architecture: MoE adaptation with dynamic expert specialization and router distillation to avoid catastrophic forgetting; training schedule isolates domain-specific gradients.

Relevance: 10 Novelty: 7

10. Accurate and Efficient Low-Rank Model Merging in Core Space

ArXiv ID: 2509.17786

Authors: Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bart{\l}omiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer

Abstract: In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging.

Comment: Model Compression/Efficiency: low-rank (LoRA) model merging via a shared Core Space with proof of no information loss and complexity analysis.

Relevance: 9 Novelty: 8

11. Global Minimizers of Sigmoid Contrastive Loss

ArXiv ID: 2509.18552

Authors: Kiril Bangachev, Guy Bresler, Iliyas Noman, Yury Polyanskiy

Abstract: The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call $(\mathsf{m}, \mathsf{b}{\mathsf{rel}})$-Constellations. $(\mathsf{m}, \mathsf{b}$. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.}})$-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin $\mathsf{m}$ and relative bias $\mathsf{b}_{\mathsf{rel}

Comment: Representation Learning/Theory: analyzes sigmoid contrastive loss with trainable temperature/bias (SigLIP), introduces constellation characterization and reparameterization.

Relevance: 9 Novelty: 8

12. Evolution of Concepts in Language Model Pre-Training

ArXiv ID: 2509.17196

Authors: Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu

Abstract: Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.

Comment: Representation Learning: analyzes training dynamics via sparse dictionary crosscoders to track interpretable features through pre-training.

Relevance: 9 Novelty: 8

13. nDNA -- the Semantic Helix of Artificial Cognition

ArXiv ID: 2509.18216

Authors: Amitava Das

Abstract: As AI foundation models grow in capability, a deeper question emerges: What shapes their internal cognitive identity -- beyond fluency and output? Benchmarks measure behavior, but the soul of a model resides in its latent geometry. In this work, we propose Neural DNA (nDNA) as a semantic-genotypic representation that captures this latent identity through the intrinsic geometry of belief. At its core, nDNA is synthesized from three principled and indispensable dimensions of latent geometry: spectral curvature, which reveals the curvature of conceptual flow across layers; thermodynamic length, which quantifies the semantic effort required to traverse representational transitions through layers; and belief vector field, which delineates the semantic torsion fields that guide a model's belief directional orientations. Like biological DNA, it encodes ancestry, mutation, and semantic inheritance, found in finetuning and alignment scars, cultural imprints, and architectural drift. In naming it, we open a new field: Neural Genomics, where models are not just tools, but digital semantic organisms with traceable inner cognition. Modeling statement. We read AI foundation models as semantic fluid--dynamics: meaning is transported through layers like fluid in a shaped conduit; nDNA is the physics-grade readout of that flow -- a geometry-first measure of how meaning is bent, paid for, and pushed -- yielding a stable, coordinate-free neural DNA fingerprint tied to on-input behavior; with this fingerprint we cross into biology: tracing lineages across pretraining, fine-tuning, alignment, pruning, distillation, and merges; measuring inheritance between checkpoints; detecting drift as traits shift under new data or objectives; and, ultimately, studying the evolution of artificial cognition to compare models, diagnose risks, and govern change over time.

Comment: Proposes intrinsic latent-geometry metrics (spectral curvature, thermodynamic length, belief vector field) to fingerprint models (Representation Learning theory).

Relevance: 9 Novelty: 8

14. Otters: An Energy-Efficient SpikingTransformer via Optical Time-to-First-Spike Encoding

ArXiv ID: 2509.18968

Authors: Zhanglu Yan, Jiayi Mao, Qianhui Liu, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Weng-Fai Wong

Abstract: Spiking neural networks (SNNs) promise high energy efficiency, particularly with time-to-first-spike (TTFS) encoding, which maximizes sparsity by emitting at most one spike per neuron. However, such energy advantage is often unrealized because inference requires evaluating a temporal decay function and subsequent multiplication with the synaptic weights. This paper challenges this costly approach by repurposing a physical hardware `bug', namely, the natural signal decay in optoelectronic devices, as the core computation of TTFS. We fabricated a custom indium oxide optoelectronic synapse, showing how its natural physical decay directly implements the required temporal function. By treating the device's analog output as the fused product of the synaptic weight and temporal decay, optoelectronic synaptic TTFS (named Otters) eliminates these expensive digital operations. To use the Otters paradigm in complex architectures like the transformer, which are challenging to train directly due to the sparsity issue, we introduce a novel quantized neural network-to-SNN conversion algorithm. This complete hardware-software co-design enables our model to achieve state-of-the-art accuracy across seven GLUE benchmark datasets and demonstrates a 1.77$\times$ improvement in energy efficiency over previous leading SNNs, based on a comprehensive analysis of compute, data movement, and memory access costs using energy measurements from a commercial 22nm process. Our work thus establishes a new paradigm for energy-efficient SNNs, translating fundamental device physics directly into powerful computational primitives. All codes and data are open source.

Comment: Model Architecture and Efficiency: spiking Transformer with optical TTFS encoding and QNN-to-SNN conversion for energy-efficient inference.

Relevance: 9 Novelty: 8

15. FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

ArXiv ID: 2509.18362

Authors: Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen

Abstract: As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This paper introduces FastMTP, a simple yet effective method that improves multi-step draft quality by aligning MTP training with its inference pattern, significantly enhancing speculative decoding performance. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens and maintain high acceptance rates across multiple recursive draft steps. By integrating language-aware dynamic vocabulary compression into the MTP head, we further reduce computational overhead in the drafting process. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction with lossless output quality, outperforming vanilla MTP by 82%. FastMTP requires only lightweight training and seamlessly integrates with existing inference frameworks, offering a practical and rapidly deployable solution for accelerating LLM inference.

Comment: Model Compression and Efficiency/HPC: accelerates LLM inference via enhanced multi-token prediction aligned with inference and dynamic vocabulary compression for speculative decoding.

Relevance: 9 Novelty: 8

16. HyperAdapt: Simple High-Rank Adaptation

ArXiv ID: 2509.18629

Authors: Abel Gurung, Joseph Campbell

Abstract: Foundation models excel across diverse tasks, but adapting them to specialized applications often requires fine-tuning, an approach that is memory and compute-intensive. Parameter-efficient fine-tuning (PEFT) methods mitigate this by updating only a small subset of weights. In this paper, we introduce HyperAdapt, a parameter-efficient fine-tuning method that significantly reduces the number of trainable parameters compared to state-of-the-art methods like LoRA. Specifically, HyperAdapt adapts a pre-trained weight matrix by applying row- and column-wise scaling through diagonal matrices, thereby inducing a high-rank update while requiring only $n+m$ trainable parameters for an $n \times m$ matrix. Theoretically, we establish an upper bound on the rank of HyperAdapt's updates, and empirically, we confirm that it consistently induces high-rank transformations across model layers. Experiments on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks with models up to 14B parameters demonstrate that HyperAdapt matches or nearly matches the performance of full fine-tuning and state-of-the-art PEFT methods while using orders of magnitude fewer trainable parameters.

Comment: Model compression and efficiency (PEFT): row/column-wise diagonal scaling inducing high-rank updates with only O(n+m) trainable parameters per matrix.

Relevance: 9 Novelty: 8

17. Theory of periodic convolutional neural network

ArXiv ID: 2509.18744

Authors: Yuqing Liu

Abstract: We introduce a novel convolutional neural network architecture, termed the \emph{periodic CNN}, which incorporates periodic boundary conditions into the convolutional layers. Our main theoretical contribution is a rigorous approximation theorem: periodic CNNs can approximate ridge functions depending on $d-1$ linear variables in a $d$-dimensional input space, while such approximation is impossible in lower-dimensional ridge settings ($d-2$ or fewer variables). This result establishes a sharp characterization of the expressive power of periodic CNNs. Beyond the theory, our findings suggest that periodic CNNs are particularly well-suited for problems where data naturally admits a ridge-like structure of high intrinsic dimension, such as image analysis on wrapped domains, physics-informed learning, and materials science. The work thus both expands the mathematical foundation of CNN approximation theory and highlights a class of architectures with surprising and practically relevant approximation capabilities.

Comment: Model architecture: introduces periodic CNNs and proves sharp approximation properties, characterizing expressive power.

Relevance: 9 Novelty: 8

18. ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching

ArXiv ID: 2509.16857

Authors: Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu

Abstract: Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with model computation. We present ShadowServe, the first SmartNIC-accelerated, interference-free prefix caching system for LLM serving. ShadowServe separates a control plane on the host and a data plane fully offloaded to the SmartNIC, which eliminates interference to both host GPU and CPU. To overcome the SmartNIC's limited compute and memory resources, we design a chunked pipeline that parallelizes data plane operations across the SmartNIC's compute resources, and a minimal-copy memory management scheme that reduces memory pressure on the SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2x lower loaded time-per-output-token (TPOT), and reduces time-to-first-token (TTFT) by up to 1.38x in low-bandwidth scenarios (<= 20 Gbps), translating to up to 1.35x higher throughput.

Comment: HPC/systems: SmartNIC-offloaded, interference-free KV cache fetching for distributed prefix caching; pipeline and memory designs for serving efficiency.

Relevance: 9 Novelty: 8

19. SBVR: Summation of BitVector Representation for Efficient LLM Quantization

ArXiv ID: 2509.18172

Authors: Wonjun Bang, Jongseok Park, Hongseung Yu, Kyungmin Bin, Kyunghan Lee

Abstract: With the advent of large language models (LLMs), numerous Post-Training Quantization (PTQ) strategies have been proposed to alleviate deployment barriers created by their enormous parameter counts. Quantization achieves compression by limiting the number of representable points in the data. Therefore, the key to achieving efficient quantization is selecting the optimal combination of representation points, or codes, for the given data. Existing PTQ solutions adopt two major approaches to this problem: Round-To-Nearest (RTN)-based methods and codebook-based methods. RTN-based methods map LLM weights onto uniformly distributed integer grids, failing to account for the Gaussian-like weight distribution of LLM weights. Codebook-based methods mitigate this issue by constructing distribution-aware codebooks; however, they suffer from random and strided memory access patterns, resulting in degraded inference speed that is exacerbated by the limited size of GPU L1 cache. To overcome these limitations, we propose a novel LLM quantization method, SBVR (Summation of BitVector Representation), that enables Gaussian-like code representation in a hardware-friendly manner for fast inference. SBVR maps weight values to non-uniform representation points whose distribution follows the actual distribution of LLM weights, enabling more accurate compression. Additionally, we design a custom CUDA kernel that allows matrix-vector multiplication directly in the SBVR format without decompression, thereby enabling high-performance execution of SBVR-compressed models. Our evaluations of SBVR on various models demonstrate state-of-the-art perplexity and accuracy benchmark performance while delivering a 2.21x- 3.04x end-to-end token-generation speedup over naive FP16 models in the 4-bit quantization regime.

Comment: Strong match to Model Compression and Efficiency: non-uniform PTQ with hardware-friendly codes and custom CUDA enabling compute in quantized domain.

Relevance: 9 Novelty: 8

20. KANO: Kolmogorov-Arnold Neural Operator

ArXiv ID: 2509.16825

Authors: Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong, Murphy Yuezhen Niu, Zheng Zhang

Abstract: We introduce Kolmogorov--Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases with intrinsic symbolic interpretability. We theoretically demonstrate that KANO overcomes the pure-spectral bottleneck of Fourier Neural Operator (FNO): KANO remains expressive over generic position-dependent dynamics for any physical input, whereas FNO stays practical only for spectrally sparse operators and strictly imposes a fast-decaying input Fourier tail. We verify our claims empirically on position-dependent differential operators, for which KANO robustly generalizes but FNO fails to. In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients and attains $\approx 6\times10^{-6}$ state infidelity from projective measurement data, substantially outperforming that of the FNO trained with ideal full wave function data, $\approx 1.5\times10^{-2}$, by orders of magnitude.

Comment: Model Architecture: introduces a new neural operator (KANO) combining spectral and spatial bases with theoretical expressivity gains over FNO.

Relevance: 9 Novelty: 8

21. Understanding Post-Training Structural Changes in Large Language Models

ArXiv ID: 2509.17866

Authors: Xinyu He, Xianghui Cao

Abstract: Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two consistent and unexpected structural changes:(1) a near-uniform geometric scaling of singular values across layers, which theoretically modulates attention scores; and (2) highly consistent orthogonal transformations are applied to the left and right singular vectors of each matrix. Disrupting this orthogonal consistency leads to catastrophic performance degradation. Based on these findings, we propose a simple yet effective framework that interprets post-training as a reparameterization of fixed subspaces in the pretrained parameter space. Further experiments reveal that singular value scaling behaves as a secondary effect, analogous to a temperature adjustment, whereas the core functional transformation lies in the coordinated rotation of singular vectors. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.

Comment: Representation Learning/Training Dynamics: SVD-based analysis reveals uniform singular value scaling and coordinated singular-vector rotations after post-training.

Relevance: 9 Novelty: 8

22. AdaMixT: Adaptive Weighted Mixture of Multi-Scale Expert Transformers for Time Series Forecasting

ArXiv ID: 2509.18107

Authors: Huanyao Zhang, Jiaye Lin, Wentao Zhang, Haitao Yuan, Guoliang Li

Abstract: Multivariate time series forecasting involves predicting future values based on historical observations. However, existing approaches primarily rely on predefined single-scale patches or lack effective mechanisms for multi-scale feature fusion. These limitations hinder them from fully capturing the complex patterns inherent in time series, leading to constrained performance and insufficient generalizability. To address these challenges, we propose a novel architecture named Adaptive Weighted Mixture of Multi-Scale Expert Transformers (AdaMixT). Specifically, AdaMixT introduces various patches and leverages both General Pre-trained Models (GPM) and Domain-specific Models (DSM) for multi-scale feature extraction. To accommodate the heterogeneity of temporal features, AdaMixT incorporates a gating network that dynamically allocates weights among different experts, enabling more accurate predictions through adaptive multi-scale fusion. Comprehensive experiments on eight widely used benchmarks, including Weather, Traffic, Electricity, ILI, and four ETT datasets, consistently demonstrate the effectiveness of AdaMixT in real-world scenarios.

Comment: Model Architecture (MoE): introduces an adaptive weighted mixture of multi-scale expert Transformers with a gating network for time series forecasting.

Relevance: 9 Novelty: 7

23. Self-Evolving LLMs via Continual Instruction Tuning

ArXiv ID: 2509.18133

Authors: Le Huang, Jiazheng Kang, Cheng Hou, Zhe Zhao, Zhenxiang Yan, Chuan Shi, Ting Bai

Abstract: In real-world industrial settings, large language models (LLMs) must learn continually to keep pace with diverse and evolving tasks, requiring self-evolution to refine knowledge under dynamic data distributions. However, existing continual learning (CL) approaches, such as replay and parameter isolation, often suffer from catastrophic forgetting: training on new tasks degrades performance on earlier ones by overfitting to the new distribution and weakening generalization.We propose MoE-CL, a parameter-efficient adversarial mixture-of-experts framework for industrial-scale, self-evolving continual instruction tuning of LLMs. MoE-CL uses a dual-expert design: (1) a dedicated LoRA expert per task to preserve task-specific knowledge via parameter independence, mitigating forgetting; and (2) a shared LoRA expert to enable cross-task transfer. To prevent transferring task-irrelevant noise through the shared pathway, we integrate a task-aware discriminator within a GAN. The discriminator encourages the shared expert to pass only task-aligned information during sequential training. Through adversarial learning, the shared expert acquires generalized representations that mimic the discriminator, while dedicated experts retain task-specific details, balancing knowledge retention and cross-task generalization and thereby supporting self-evolution.Extensive experiments on the public MTL5 benchmark and an industrial Tencent3 benchmark validate the effectiveness of MoE-CL for continual instruction tuning. In real-world A/B testing for content compliance review on the Tencent Video platform, MoE-CL reduced manual review costs by 15.3%. These results demonstrate that MoE-CL is practical for large-scale industrial deployment where continual adaptation and stable transfer are critical.

Comment: Mixture-of-Experts design with dual LoRA experts and an adversarial discriminator for continual instruction tuning—parameter-efficient architecture (Model Architecture; Efficiency).

Relevance: 9 Novelty: 7

24. Adaptive Overclocking: Dynamic Control of Thinking Path Length via Real-Time Reasoning Signals

ArXiv ID: 2509.17000

Authors: Shuhao Jiang, Songbo Wang, Yang Qiao, Chun Xu, Chaoyang Zheng, Shengyi Zhou, Huanjun Wang, Fangming Li, Cong Zhang, Jiyu Wang

Abstract: Large Reasoning Models (LRMs) often suffer from computational inefficiency due to overthinking, where a fixed reasoning budget fails to match the varying complexity of tasks. To address this issue, we propose Adaptive Overclocking, a method that makes the overclocking hyperparameter $\alpha$ dynamic and context-aware. Our method adjusts reasoning speed in real time through two complementary signals: (1) token-level model uncertainty for fine-grained step-wise control, and (2) input complexity estimation for informed initialization. We implement this approach with three strategies: Uncertainty-Aware Alpha Scheduling (UA-$\alpha$S), Complexity-Guided Alpha Initialization (CG-$\alpha$I), and a Hybrid Adaptive Control (HAC) that combines both. Experiments on GSM8K, MATH, and SVAMP show that HAC achieves superior accuracy-latency trade-offs, reducing unnecessary computation on simple problems while allocating more resources to challenging ones. By mitigating overthinking, Adaptive Overclocking enhances both efficiency and overall reasoning performance.

Comment: Conditional/Dynamic Networks: real-time adaptive control of reasoning compute via uncertainty signals and input complexity for better accuracy-latency trade-offs.

Relevance: 9 Novelty: 7

25. Confidence-gated training for efficient early-exit neural networks

ArXiv ID: 2509.17885

Authors: Saad Mokssit, Ouassim Karrakchou, Alejandro Mousist, Mounir Ghogho

Abstract: Early-exit neural networks reduce inference cost by enabling confident predictions at intermediate layers. However, joint training often leads to gradient interference, with deeper classifiers dominating optimization. We propose Confidence-Gated Training (CGT), a paradigm that conditionally propagates gradients from deeper exits only when preceding exits fail. This encourages shallow classifiers to act as primary decision points while reserving deeper layers for harder inputs. By aligning training with the inference-time policy, CGT mitigates overthinking, improves early-exit accuracy, and preserves efficiency. Experiments on the Indian Pines and Fashion-MNIST benchmarks show that CGT lowers average inference cost while improving overall accuracy, offering a practical solution for deploying deep models in resource-constrained environments.

Comment: Model Compression and Efficiency / Conditional/Dynamic Networks: confidence-gated training for early-exit networks aligns training with inference to reduce compute while maintaining accuracy.

Relevance: 9 Novelty: 7

26. Towards Provable Emergence of In-Context Reinforcement Learning

ArXiv ID: 2509.18389

Authors: Jiuqi Wang, Rohan Chandra, Shangtong Zhang

Abstract: Typically, a modern reinforcement learning (RL) agent solves a task by updating its neural network parameters to adapt its policy to the task. Recently, it has been observed that some RL agents can solve a wide range of new out-of-distribution tasks without parameter updates after pretraining on some task distribution. When evaluated in a new task, instead of making parameter updates, the pretrained agent conditions its policy on additional input called the context, e.g., the agent's interaction history in the new task. The agent's performance increases as the information in the context increases, with the agent's parameters fixed. This phenomenon is typically called in-context RL (ICRL). The pretrained parameters of the agent network enable the remarkable ICRL phenomenon. However, many ICRL works perform the pretraining with standard RL algorithms. This raises the central question this paper aims to address: Why can the RL pretraining algorithm generate network parameters that enable ICRL? We hypothesize that the parameters capable of ICRL are minimizers of the pretraining loss. This work provides initial support for this hypothesis through a case study. In particular, we prove that when a Transformer is pretrained for policy evaluation, one of the global minimizers of the pretraining loss can enable in-context temporal difference learning.

Comment: Representation Learning/Training Dynamics: theoretical analysis showing when Transformer pretraining yields in-context RL.

Relevance: 9 Novelty: 7

27. Sparse Training Scheme for Multimodal LLM

ArXiv ID: 2509.18150

Authors: Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient due to the significantly longer input sequences introduced by multimodal data and the low utilization of inter-layer computations. To address this challenge, we shift the focus to the training process itself and propose a novel training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). This scheme consists of two key components: the Visual Token Compressor, which reduces the information load by compressing visual tokens, and the Layer Dynamic Skipper, which mitigates the computational overhead by dynamically skipping unnecessary layers in the language model during both forward and backward passes. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.

Comment: Model Compression/Efficiency: sparse training with visual token compression and dynamic layer skipping during forward/backward passes across MLLMs.

Relevance: 9 Novelty: 7

28. Soft Tokens, Hard Truths

ArXiv ID: 2509.19170

Authors: Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier

Abstract: The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs match discrete-token CoTs for pass@1 and surpass them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.

Comment: Representation Learning/Training Dynamics: learning continuous (soft) Chain-of-Thought tokens via RL to enhance reasoning diversity and performance.

Relevance: 8 Novelty: 8

29. Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws

ArXiv ID: 2509.19189

Authors: Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu

Abstract: Scaling laws have played a cornerstone role in guiding the training of large language models (LLMs). However, most existing works on scaling laws primarily focus on the final-step loss, overlooking the loss dynamics during the training process and, crucially, the impact of learning rate schedule (LRS). In this paper, we aim to bridge this gap by studying a teacher-student kernel regression setup trained via online stochastic gradient descent (SGD). Leveraging a novel intrinsic time viewpoint and stochastic differential equation (SDE) modeling of SGD, we introduce the Functional Scaling Law (FSL), which characterizes the evolution of population risk during the training process for general LRSs. Remarkably, the impact of the LRSs is captured through an explicit convolution-type functional term, making their effects fully tractable. To illustrate the utility of FSL, we analyze three widely used LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- under both data-limited and compute-limited regimes. We provide theoretical justification for widely adopted empirical practices in LLMs pre-training such as (i) higher-capacity models are more data- and compute-efficient; (ii) learning rate decay can improve training efficiency; (iii) WSD-like schedules can outperform direct-decay schedules. Lastly, we explore the practical relevance of FSL as a surrogate model for fitting, predicting and optimizing the loss curves in LLM pre-training, with experiments conducted across model sizes ranging from 0.1B to 1B parameters. We hope our FSL framework can deepen the understanding of LLM pre-training dynamics and provide insights for improving large-scale model training.

Comment: Representation learning/training dynamics: theoretical analysis of SGD dynamics and learning-rate schedules via a functional scaling law.

Relevance: 8 Novelty: 8

30. Pareto-optimal Tradeoffs Between Communication and Computation with Flexible Gradient Tracking

ArXiv ID: 2509.18129

Authors: Yan Huang, Jinming Xu, Li Chai, Jiming Chen, Karl H. Johansson

Abstract: This paper addresses distributed optimization problems in non-i.i.d. scenarios, focusing on the interplay between communication and computation efficiency. To this end, we propose FlexGT, a flexible snapshot gradient tracking method with tunable numbers of local updates and neighboring communications in each round. Leveraging a unified convergence analysis framework, we prove that FlexGT achieves a linear or sublinear convergence rate depending on objective-specific properties--from (strongly) convex to nonconvex--and the above-mentioned tunable parameters. FlexGT is provably robust to the heterogeneity across nodes and attains the best-known communication and computation complexity among existing results. Moreover, we introduce an accelerated gossip-based variant, termed Acc-FlexGT, and show that with prior knowledge of the graph, it achieves a Pareto-optimal trade-off between communication and computation. Particularly, Acc-FlexGT achieves the optimal iteration complexity of $\tilde{\mathcal{O}} \left( L/\epsilon +L\sigma ^2/\left( n\epsilon^2 \sqrt{1-\sqrt{\rho _W}} \right) \right) $ for the nonconvex case, matching the existing lower bound up to a logarithmic factor, and improves the existing results for the strongly convex case by a factor of $\tilde{\mathcal{O}} \left( 1/\sqrt{\epsilon} \right)$, where $\epsilon$ is the targeted accuracy, $n$ the number of nodes, $L$ the Lipschitz constant, $\rho_W$ the spectrum gap of the graph, and $\sigma$ the stochastic gradient variance. Numerical examples are provided to demonstrate the effectiveness of the proposed methods.

Comment: High Performance Computing: distributed optimization algorithm with tunable comm/comp tradeoffs and near-optimal iteration complexity guarantees.

Relevance: 8 Novelty: 8

31. Interpreting vision transformers via residual replacement model

ArXiv ID: 2509.17401

Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang

Abstract: How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by introducing the residual replacement model, which replaces ViT computations with interpretable features in the residual stream. Our analysis reveals not only a feature evolution from low-level patterns to high-level semantics, but also how ViTs encode curves and spatial positions through specialized feature types. The residual replacement model scalably produces a faithful yet parsimonious circuit for human-scale interpretability by significantly simplifying the original computations. As a result, this framework enables intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility of our framework in debiasing spurious correlations.

Comment: Representation Learning/Interpretability: interprets ViTs using sparse autoencoders and a residual replacement model to reveal feature circuits.