Personalized Daily ArXiv Papers 2025-09-23

[gpt-5]	Prompt	Completion	Total
Token	76776	79041	155817
Cost	$0.1	$0.79	$0.89

Total arXiv papers: 964

Total scanned papers: 548

Total relevant papers: 48

Table of contents with paper titles:

PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zheng Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong
Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few Authors: Qishuai Wen, Zhiyuan Huang, Chun-Guang Li
The Complexity of Finding Local Optima in Contrastive Learning Authors: Jingming Yan, Yiyuan Luo, Vaggos Chatziafratis, Ioannis Panageas, Parnian Shahkar, Stelios Stavroulakis
Qwen3-Omni Technical Report Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho
Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data Authors: Tianyi Chen, Pengxiao Lin, Zhiwei Wang, Zhi-Qin John Xu
Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation Authors: Junzhuo Li, Bo Wang, Xiuze Zhou, Xuming Hu
Understanding Post-Training Structural Changes in Large Language Models Authors: Xinyu He, Xianghui Cao
Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features Authors: Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun
Robust LLM Training Infrastructure at ByteDance Authors: Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang
Bias-variance Tradeoff in Tensor Estimation Authors: Shivam Kumar, Haotian Xu, Carlos Misael Madrid Padilla, Yuehaw Khoo, Oscar Hernan Madrid Padilla, Daren Wang
ViTCAE: ViT-based Class-conditioned Autoencoder Authors: Vahid Jebraeeli, Hamid Krim, Derya Cansever
LightCode: Compiling LLM Inference for Photonic-Electronic Systems Authors: Ryan Tomich, Zhizhen Zhong, Dirk Englund
Deep Hierarchical Learning with Nested Subspace Networks Authors: Paulius Rauba, Mihaela van der Schaar
Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking Authors: Ting Han, Linara Adilova, Henning Petzka, Jens Kleesiek, Michael Kamp
Accurate and Efficient Low-Rank Model Merging in Core Space Authors: Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bart{\l}omiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer
Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
Language Modeling with Learned Meta-Tokens Authors: Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari
Flow-Induced Diagonal Gaussian Processes Authors: Moule Lin, Andrea Patane, Weipeng Jing, Shuhao Guan, Goetz Botterweck
Elucidating the Design Space of FP4 training Authors: Robert Hu, Carlo Luschi, Paul Balanca
DISCO: Disentangled Communication Steering for Large Language Models Authors: Max Torop, Aria Masoomi, Masih Eskandar, Jennifer Dy
Evolution of Concepts in Language Model Pre-Training Authors: Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu
Interpreting vision transformers via residual replacement model Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang
The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs Authors: Hinata Tezuka, Naoya Inoue
ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching Authors: Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu
SEQR: Secure and Efficient QR-based LoRA Routing Authors: William Fleshman, Benjamin Van Durme
KANO: Kolmogorov-Arnold Neural Operator Authors: Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong, Murphy Yuezhen Niu, Zheng Zhang
Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization Authors: Jingfeng Wu, Peter L. Bartlett, Jason D. Lee, Sham M. Kakade, Bin Yu
Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise Authors: Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le
$\boldsymbol{\lambda}$-Orthogonality Regularization for Compatible Representation Learning Authors: Simone Ricci, Niccol`o Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo
Dendritic Resonate-and-Fire Neuron for Effective and Efficient Long Sequence Modeling Authors: Dehao Zhang, Malu Zhang, Shuai Wang, Jingya Wang, Wenjie Wei, Zeyu Ma, Guoqing Wang, Yang Yang, HaiZhou Li
Digging Into the Internal: Causality-Based Analysis of LLM Function Calling Authors: Zhenlan Ji, Daoyuan Wu, Wenxuan Wang, Pingchuan Ma, Shuai Wang, Lei Ma
EMPEROR: Efficient Moment-Preserving Representation of Distributions Authors: Xinran Liu, Shansita D. Sharma, Soheil Kolouri
Geometric Mixture Classifier (GMC): A Discriminative Per-Class Mixture of Hyperplanes Authors: Prasanth K K, Shubham Sharma
Low-Rank Adaptation of Evolutionary Deep Neural Networks for Efficient Learning of Time-Dependent PDEs Authors: Jiahao Zhang, Shiheng Zhang, Guang Lin
Adaptive Overclocking: Dynamic Control of Thinking Path Length via Real-Time Reasoning Signals Authors: Shuhao Jiang, Songbo Wang, Yang Qiao, Chun Xu, Chaoyang Zheng, Shengyi Zhou, Huanjun Wang, Fangming Li, Cong Zhang, Jiyu Wang
Stabilizing Information Flow Entropy: Regularization for Safe and Interpretable Autonomous Driving Perception Authors: Haobo Yang, Shiyan Zhang, Zhuoyi Yang, Jilong Guo, Jun Yang, Xinyu Zhang
PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality Authors: Byeongho Yu, Changhun Lee, Jungyu Jin, Eunhyeok Park
Probabilistic Token Alignment for Large Language Model Fusion Authors: Runjia Zeng, James Chenhao Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels Authors: Junjie Ye, Yuming Yang, Yang Nan, Shuo Li, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan
Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe Authors: Chong You, Rajesh Jayaram, Ananda Theertha Suresh, Robin Nittka, Felix Yu, Sanjiv Kumar
A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective Authors: Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu
GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2 Authors: Savini Kashmira, Jayanaka Dantanarayana, Thamirawaran Sathiyalogeswaran, Yichao Yuan, Nishil Talati, Krisztian Flautner, Lingjia Tang, Jason Mars
Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models Authors: Jinyeong Kim, Seil Kang, Jiwoo Park, Junhyeok Kim, Seong Jae Hwang
Causal Representation Learning from Multimodal Clinical Records under Non-Random Modality Missingness Authors: Zihan Liang, Ziwen Pan, Ruoxuan Xiong
Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization Authors: Manish Acharya, David Hyde
Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning Authors: Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan
Unrolled Graph Neural Networks for Constrained Optimization Authors: Samar Hadou, Alejandro Ribeiro

1. PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models

ArXiv ID: 2509.16989

Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zheng Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong

Abstract: Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and model expressiveness. While existing ultra-low-bit PTQ methods rely on binary approximations or complex compensation mechanisms, they suffer from either limited representational capacity or computational overhead that undermines their efficiency gains. We introduce PTQ to Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit representation. PTQTP achieves multiplication-free inference, identical to 1-bit quantization, while maintaining superior expressiveness through its novel structured decomposition. Our approach provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment across diverse modern LLMs without architectural modifications; and (3) uniform ternary operations that eliminate the need for mixed-precision or compensation schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention versus 0% for competing approaches. PTQTP approaches and sometimes surpasses 1.58-bit quantization-aware training performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods. These results establish PTQTP as a practical solution for efficient LLM deployment in resource-constrained environments.

Comment: Compression/Efficiency: introduces ternary post-training quantization to trit-planes enabling multiplication-free inference with a progressive approximation algorithm.

Relevance: 10 Novelty: 9

2. Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few

ArXiv ID: 2509.16875

Authors: Qishuai Wen, Zhiyuan Huang, Chun-Guang Li

Abstract: Attention mechanisms in Transformers have gained significant empirical success. Nonetheless, the optimization objectives underlying their forward pass are still unclear. Additionally, the quadratic complexity of self-attention is increasingly prohibitive. Unlike the prior work on addressing the interpretability or efficiency issue separately, we propose a unified optimization objective to alleviate both issues simultaneously. By unrolling the optimization over the objective, we derive an inherently interpretable and efficient attention mechanism, which compresses all tokens into low-dimensional structures by contracting a few representative tokens and then broadcasting the contractions back. This Contract-and-Broadcast Self-Attention (CBSA) mechanism can not only scale linearly but also generalize existing attention mechanisms as its special cases. Experiments further demonstrate comparable performance and even superior advantages of CBSA on several visual tasks. Code is available at this https URL.

Comment: Model Architecture: derives an interpretable Contract-and-Broadcast Self-Attention that compresses tokens into low-dimensional structures and generalizes existing attention; Model Compression and Efficiency: achieves linear-time attention by contracting a few representative tokens.

Relevance: 10 Novelty: 9

3. The Complexity of Finding Local Optima in Contrastive Learning

ArXiv ID: 2509.16898

Authors: Jingming Yan, Yiyuan Luo, Vaggos Chatziafratis, Ioannis Panageas, Parnian Shahkar, Stelios Stavroulakis

Abstract: Contrastive learning is a powerful technique for discovering meaningful data representations by optimizing objectives based on $\textit{contrastive information}$, often given as a set of weighted triplets ${(x_i, y_i^+, z_{i}^-)}_{i = 1}^m$ indicating that an "anchor" $x_i$ is more similar to a "positive" example $y_i$ than to a "negative" example $z_i$. The goal is to find representations (e.g., embeddings in $\mathbb{R}^d$ or a tree metric) where anchors are placed closer to positive than to negative examples. While finding $\textit{global}$ optima of contrastive objectives is $\mathsf{NP}$-hard, the complexity of finding $\textit{local}$ optima -- representations that do not improve by local search algorithms such as gradient-based methods -- remains open. Our work settles the complexity of finding local optima in various contrastive learning problems by proving $\mathsf{PLS}$-hardness in discrete settings (e.g., maximize satisfied triplets) and $\mathsf{CLS}$-hardness in continuous settings (e.g., minimize Triplet Loss), where $\mathsf{PLS}$ (Polynomial Local Search) and $\mathsf{CLS}$ (Continuous Local Search) are well-studied complexity classes capturing local search dynamics in discrete and continuous optimization, respectively. Our results imply that no polynomial time algorithm (local search or otherwise) can find a local optimum for various contrastive learning problems, unless $\mathsf{PLS}\subseteq\mathsf{P}$ (or $\mathsf{CLS}\subseteq \mathsf{P}$ for continuous problems). Even in the unlikely scenario that $\mathsf{PLS}\subseteq\mathsf{P}$ (or $\mathsf{CLS}\subseteq \mathsf{P}$), our reductions imply that there exist instances where local search algorithms need exponential time to reach a local optimum, even for $d=1$ (embeddings on a line).

Comment: Representation Learning Theory: proves PLS/CLS-hardness for finding local optima in contrastive objectives (e.g., triplet loss), clarifying limits of local search/gradient methods.

Relevance: 10 Novelty: 9

4. Qwen3-Omni Technical Report

ArXiv ID: 2509.17765

Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin

Abstract: We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

Comment: Model Architecture: Thinker–Talker Mixture-of-Experts unifying text/image/audio/video; Efficiency: replaces diffusion with causal ConvNet over discrete codebooks for low-latency streaming.

Relevance: 10 Novelty: 8

5. MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

ArXiv ID: 2509.17238

Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

Abstract: The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction.To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.

Comment: Model Architecture & Efficiency: inference-time hyper-parallel scaling for MoE (RoE) with stochastic routing and specialized KV caching to boost token-level accuracy.

Relevance: 10 Novelty: 8

6. Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data

ArXiv ID: 2509.17514

Authors: Tianyi Chen, Pengxiao Lin, Zhiwei Wang, Zhi-Qin John Xu

Abstract: State Space Models (SSMs) have emerged as promising alternatives to attention mechanisms, with the Mamba architecture demonstrating impressive performance and linear complexity for processing long sequences. However, the fundamental differences between Mamba and Transformer architectures remain incompletely understood. In this work, we use carefully designed synthetic tasks to reveal Mamba's inherent limitations. Through experiments, we identify that Mamba's nonlinear convolution introduces an asymmetry bias that significantly impairs its ability to recognize symmetrical patterns and relationships. Using composite function and inverse sequence matching tasks, we demonstrate that Mamba strongly favors compositional solutions over symmetrical ones and struggles with tasks requiring the matching of reversed sequences. We show these limitations stem not from the SSM module itself but from the nonlinear convolution preceding it, which fuses token information asymmetrically. These insights provide a new understanding of Mamba's constraints and suggest concrete architectural improvements for future sequence models.

Comment: Matches Model Architecture: foundational analysis of Mamba’s nonlinear convolution revealing asymmetry bias and architectural limitations/suggested fixes.

Relevance: 10 Novelty: 8

7. Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation

ArXiv ID: 2509.16882

Authors: Junzhuo Li, Bo Wang, Xiuze Zhou, Xuming Hu

Abstract: Mixture-of-Experts (MoE) models offer immense capacity via sparsely gated expert subnetworks, yet adapting them to multiple domains without catastrophic forgetting remains an open challenge. Existing approaches either incur prohibitive computation, suffer cross-domain interference, or require separate runs per domain. We propose DES-MoE, a dynamic expert specialization framework for multi-domain adaptation of Mixture-of-Experts models. DES-MoE addresses catastrophic forgetting through three innovations: (1) an adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation, (2) real-time expert-domain correlation mapping to isolate domain-specific gradients, and (3) a three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters. Evaluated on six domains (math, code, law, etc.), DES-MoE matches single-domain ESFT performance while training one unified model, reduces forgetting by 89% compared to full fine-tuning as domains scale from 2 to 6, and achieves 68% faster convergence than conventional methods. Our work establishes dynamic expert isolation as a scalable paradigm for multi-task MoE adaptation.

Comment: Model architecture (MoE): dynamic expert routing/specialization with expert isolation and adaptive fine-tuning to mitigate catastrophic forgetting; training dynamics for MoE.

Relevance: 10 Novelty: 7

8. Understanding Post-Training Structural Changes in Large Language Models

ArXiv ID: 2509.17866

Authors: Xinyu He, Xianghui Cao

Abstract: Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two consistent and unexpected structural changes:(1) a near-uniform geometric scaling of singular values across layers, which theoretically modulates attention scores; and (2) highly consistent orthogonal transformations are applied to the left and right singular vectors of each matrix. Disrupting this orthogonal consistency leads to catastrophic performance degradation. Based on these findings, we propose a simple yet effective framework that interprets post-training as a reparameterization of fixed subspaces in the pretrained parameter space. Further experiments reveal that singular value scaling behaves as a secondary effect, analogous to a temperature adjustment, whereas the core functional transformation lies in the coordinated rotation of singular vectors. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.

Comment: Representation Learning: training-dynamics analysis via SVD reveals structured singular value scaling and coordinated orthogonal rotations after post-training, offering foundational insight into parameter evolution.

Relevance: 9 Novelty: 8

9. Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features

ArXiv ID: 2509.16629

Authors: Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun

Abstract: Positional encoding is essential for supplementing transformer with positional information of tokens. Existing positional encoding methods demand predefined token/feature order, rendering them unsuitable for real-world data with non-sequential yet causally-related features. To address this limitation, we propose CAPE, a novel method that identifies underlying causal structure over non-sequential features as a weighted directed acyclic graph (DAG) using generalized structural equation modeling. The DAG is then embedded in hyperbolic space where its geometric structure is well-preserved using a hyperboloid model-based approach that effectively captures two important causal graph properties (causal strength & causal specificity). This step yields causality-aware positional encodings for the features, which are converted into their rotary form for integrating with transformer's self-attention mechanism. Theoretical analysis reveals that CAPE-generated rotary positional encodings possess three valuable properties for enhanced self-attention, including causal distance-induced attenuation, causal generality-induced attenuation, and robustness to positional disturbances. We evaluate CAPE over both synthetic and real-word datasets, empirically demonstrating its theoretical properties and effectiveness in enhancing transformer for data with non-sequential features. Our code is available at https://github.com/Catchxu/CAPE.

Comment: Model Architecture + Representation Learning: causality-induced positional encodings via DAG-to-hyperbolic embeddings integrated as rotary PEs for Transformers.

Relevance: 9 Novelty: 8

10. Robust LLM Training Infrastructure at ByteDance

ArXiv ID: 2509.16293

Authors: Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang

Abstract: The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.

Comment: High Performance Computing: systems-level robust training infrastructure enabling large-scale, failure-tolerant LLM training across 200k+ GPUs.

Relevance: 9 Novelty: 8

11. Bias-variance Tradeoff in Tensor Estimation

ArXiv ID: 2509.17382

Authors: Shivam Kumar, Haotian Xu, Carlos Misael Madrid Padilla, Yuehaw Khoo, Oscar Hernan Madrid Padilla, Daren Wang

Abstract: We study denoising of a third-order tensor when the ground-truth tensor is not necessarily Tucker low-rank. Specifically, we observe $$ Y=X^\ast+Z\in \mathbb{R}^{p_{1} \times p_{2} \times p_{3}}, $$ where $X^\ast$ is the ground-truth tensor, and $Z$ is the noise tensor. We propose a simple variant of the higher-order tensor SVD estimator $\widetilde{X}$. We show that uniformly over all user-specified Tucker ranks $(r_{1},r_{2},r_{3})$, $$ | \widetilde{X} - X^* |{ \mathrm{F}}^2 = O \Big( \kappa^2 \Big{ r)$ decreases and the variance increases. As a byproduct we also obtain a convenient bias-variance decomposition for the vanilla low-rank SVD matrix estimators.}r_{2}r_{3}+\sum_{k=1}^{3} p_{k} r_{k} \Big} \; + \; \xi_{(r_{1},r_{2},r_{3})}^2\Big) \quad \text{ with high probability.} $$ Here, the bias term $\xi_{(r_1,r_2,r_3)}$ corresponds to the best achievable approximation error of $X^\ast$ over the class of tensors with Tucker ranks $(r_1,r_2,r_3)$; $\kappa^2$ quantifies the noise level; and the variance term $\kappa^2 {r_{1}r_{2}r_{3}+\sum_{k=1}^{3} p_{k} r_{k}}$ scales with the effective number of free parameters in the estimator $\widetilde{X}$. Our analysis achieves a clean rank-adaptive bias--variance tradeoff: as we increase the ranks of estimator $\widetilde{X}$, the bias $\xi(r_{1},r_{2},r_{3

Comment: Low-rank/tensor estimation theory: rank-adaptive bias–variance tradeoff for HOSVD-like estimator; foundational in low-rank representation learning.

Relevance: 9 Novelty: 8

12. ViTCAE: ViT-based Class-conditioned Autoencoder

ArXiv ID: 2509.16554

Authors: Vahid Jebraeeli, Hamid Krim, Derya Cansever

Abstract: Vision Transformer (ViT) based autoencoders often underutilize the global Class token and employ static attention mechanisms, limiting both generative control and optimization efficiency. This paper introduces ViTCAE, a framework that addresses these issues by re-purposing the Class token into a generative linchpin. In our architecture, the encoder maps the Class token to a global latent variable that dictates the prior distribution for local, patch-level latent variables, establishing a robust dependency where global semantics directly inform the synthesis of local details. Drawing inspiration from opinion dynamics, we treat each attention head as a dynamical system of interacting tokens seeking consensus. This perspective motivates a convergence-aware temperature scheduler that adaptively anneals each head's influence function based on its distributional stability. This process enables a principled head-freezing mechanism, guided by theoretically-grounded diagnostics like an attention evolution distance and a consensus/cluster functional. This technique prunes converged heads during training to significantly improve computational efficiency without sacrificing fidelity. By unifying a generative Class token with an adaptive attention mechanism rooted in multi-agent consensus theory, ViTCAE offers a more efficient and controllable approach to transformer-based generation.

Comment: Model Architecture: ViT-based class-conditioned autoencoder where the class token controls a global latent prior; Model Compression and Efficiency: convergence-aware attention temperature scheduler that freezes/prunes converged heads to cut compute during training.

Relevance: 9 Novelty: 8

13. LightCode: Compiling LLM Inference for Photonic-Electronic Systems

ArXiv ID: 2509.16443

Authors: Ryan Tomich, Zhizhen Zhong, Dirk Englund

Abstract: The growing demand for low-latency, energy-efficient inference in large language models (LLMs) has catalyzed interest in heterogeneous architectures. While GPUs remain dominant, they are poorly suited for integration with emerging domain-specific accelerators like the Photonic Tensor Units (PTUs), which offer low-power, high-throughput linear computation. This motivates hybrid compilation strategies that combine photonic and electronic resources. We present LightCode, a compiler framework and simulator for mapping LLM inference workloads across hybrid photonic-electronic systems. LightCode introduces the Stacked Graph, an intermediate representation that encodes multiple hardware-specific realizations of each tensor operation. Hardware assignment is formulated as a constrained subgraph selection problem optimized for latency or energy under parametric cost models. We evaluate LightCode on the prefill stage of GPT-2 and Llama-7B showing that under our workload and hardware assumptions, (i) Photonic hardware reduced energy by up to 50% in our simulated workloads at maximum sequence length; (ii) multiplexing and assignment strategy yielded latency improvements exceeding 10x; and (iii) Optimizing for latency or energy resulted in distinct hardware mappings in our simulations. LightCode offers a module, foundational framework and simulator for compiling LLMs to emerging photonic accelerators.

Comment: High Performance Computing/Systems: compiler IR (Stacked Graph) and hardware assignment for hybrid photonic–electronic LLM inference optimizing latency/energy.

Relevance: 9 Novelty: 8

14. Deep Hierarchical Learning with Nested Subspace Networks

ArXiv ID: 2509.17874

Authors: Paulius Rauba, Mihaela van der Schaar

Abstract: Large neural networks are typically trained for a fixed computational budget, creating a rigid trade-off between performance and efficiency that is ill-suited for deployment in resource-constrained or dynamic environments. Existing approaches to this problem present a difficult choice: training a discrete collection of specialist models is computationally prohibitive, while dynamic methods like slimmable networks often lack the flexibility to be applied to large, pre-trained foundation models. In this work, we propose Nested Subspace Networks (NSNs), a novel architectural paradigm that enables a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets at inference time. The core of our approach is to re-parameterize linear layers to satisfy a nested subspace property, such that the function computed at a given rank is a strict subspace of the function at any higher rank. We show that this entire hierarchy of models can be optimized jointly via an uncertainty-aware objective that learns to balance the contributions of different ranks based on their intrinsic difficulty. We demonstrate empirically that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier. For example, a single NSN-adapted model can achieve a 50% reduction in inference FLOPs with only a 5 percentage point loss in accuracy. Our findings establish NSNs as a powerful framework for creating the next generation of adaptive foundation models.

Comment: Model Architecture & Efficiency: Nested Subspace Networks reparameterize linear layers for rank-nested functions enabling continuous compute scaling in pre-trained LLMs.

Relevance: 9 Novelty: 8

15. Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking

ArXiv ID: 2509.17738

Authors: Ting Han, Linara Adilova, Henning Petzka, Jens Kleesiek, Michael Kamp

Abstract: Neural collapse, i.e., the emergence of highly symmetric, class-wise clustered representations, is frequently observed in deep networks and is often assumed to reflect or enable generalization. In parallel, flatness of the loss landscape has been theoretically and empirically linked to generalization. Yet, the causal role of either phenomenon remains unclear: Are they prerequisites for generalization, or merely by-products of training dynamics? We disentangle these questions using grokking, a training regime in which memorization precedes generalization, allowing us to temporally separate generalization from training dynamics and we find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed generalization. Furthermore, we show theoretically that neural collapse implies relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.

Comment: Representation Learning: disentangles flatness vs. neural collapse via grokking, showing flatness predicts generalization; theoretical link between collapse and flatness.

Relevance: 9 Novelty: 8

16. Accurate and Efficient Low-Rank Model Merging in Core Space

ArXiv ID: 2509.17786

Authors: Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bart{\l}omiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer

Abstract: In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging.

Comment: Model Compression/Efficiency: efficient low-rank (LoRA) model merging in a shared core space with formal guarantees and improved accuracy/compute trade-offs.

Relevance: 9 Novelty: 8

17. Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

ArXiv ID: 2509.18085

Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli

Abstract: Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.

Comment: Efficiency/HPC: lossless speculative decoding for diffusion LLMs with parallel verification and calibrated draft graphs; complements KV caching and multi-token methods.

Relevance: 9 Novelty: 8

18. Language Modeling with Learned Meta-Tokens

ArXiv ID: 2509.16278

Authors: Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari

Abstract: While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model with a modified GPT-2 architecture equipped with meta-attention in addition to causal multi-head attention, and study the impact of these tokens on a suite of synthetic tasks. We find that data-efficient language model pre-training on fewer than 100B tokens utilizing meta-tokens and our meta-attention mechanism achieves strong performance on these tasks after fine-tuning. We suggest that these gains arise due to the meta-tokens sharpening the positional encoding. This enables them to operate as trainable, content-based landmarks, implicitly compressing preceding context and "caching" it in the meta-token. At inference-time, the meta-token points to relevant context, facilitating length generalization up to 2$\times$ its context window, even after extension with YaRN. We provide further evidence of these behaviors by visualizing model internals to study the residual stream, and assessing the compression quality by information-theoretic analysis on the rate-distortion tradeoff. Our findings suggest that pre-training LMs with meta-tokens offers a simple, data-efficient method to enhance long-context language modeling performance, while introducing new insights into the nature of their behavior towards length generalization.

Comment: Model Architecture: introduces meta-attention and learned meta-tokens that compress/cache context; Efficiency: enables long-context length generalization via implicit context compression.

Relevance: 9 Novelty: 8

19. Flow-Induced Diagonal Gaussian Processes

ArXiv ID: 2509.17153

Authors: Moule Lin, Andrea Patane, Weipeng Jing, Shuhao Guan, Goetz Botterweck

Abstract: We present Flow-Induced Diagonal Gaussian Processes (FiD-GP), a compression framework that incorporates a compact inducing weight matrix to project a neural network's weight uncertainty into a lower-dimensional subspace. Critically, FiD-GP relies on normalising-flow priors and spectral regularisations to augment its expressiveness and align the inducing subspace with feature-gradient geometry through a numerically stable projection mechanism objective. Furthermore, we demonstrate how the prediction framework in FiD-GP can help to design a single-pass projection for Out-of-Distribution (OoD) detection. Our analysis shows that FiD-GP improves uncertainty estimation ability on various tasks compared with SVGP-based baselines, satisfies tight spectral residual bounds with theoretically guaranteed OoD detection, and significantly compresses the neural network's storage requirements at the cost of increased inference computation dependent on the number of inducing weights employed. Specifically, in a comprehensive empirical study spanning regression, image classification, semantic segmentation, and out-of-distribution detection benchmarks, it cuts Bayesian training cost by several orders of magnitude, compresses parameters by roughly 51%, reduces model size by about 75%, and matches state-of-the-art accuracy and uncertainty estimation.

Comment: Matches Model Compression and Efficiency: compresses NN weight uncertainty via low-dimensional inducing weights with normalizing-flow priors and spectral regularization.

Relevance: 9 Novelty: 8

20. Elucidating the Design Space of FP4 training

ArXiv ID: 2509.17791

Authors: Robert Hu, Carlo Luschi, Paul Balanca

Abstract: The increasing computational demands of foundation models have spurred research into low-precision training, with 4-bit floating-point (\texttt{FP4}) formats emerging as a frontier for maximizing hardware throughput. While numerous techniques have been proposed to stabilize \texttt{FP4} training, they often present isolated solutions with varying, and not always clear, computational overheads. This paper aims to provide a unified view of the design space of \texttt{FP4} training. We introduce a comprehensive, quantisation gradient-based framework for microscaling quantization that allows for a theoretical analysis of the computational costs associated with different stabilization methods on both the forward and backward passes. Using a simulator built on this framework, we conduct an extensive empirical study across a wide range of machine learning tasks, including regression, image classification, diffusion models, and language models. By systematically evaluating thousands of combinations of techniques, such as novel gradient approximations, rounding strategies, and scaling methods, we identify which configurations offer the most favourable performance-to-overhead trade-off. We find that the techniques enabling the best trade-off involve carefully combining Hadamard transformations, tensor scaling and stochastic rounding. We further find that using \texttt{UE5M3} as a scaling factor potentially offers a good compromise between range and precision with manageable computational overhead.

Comment: Matches Model Compression and Efficiency: unified FP4 low-precision training framework for microscaling quantization with cost analysis and stabilization techniques.

Relevance: 9 Novelty: 8

21. DISCO: Disentangled Communication Steering for Large Language Models

ArXiv ID: 2509.16820

Authors: Max Torop, Aria Masoomi, Masih Eskandar, Jennifer Dy

Abstract: A variety of recent methods guide large language model outputs via the inference-time addition of steering vectors to residual-stream or attention-head representations. In contrast, we propose to inject steering vectors directly into the query and value representation spaces within attention heads. We provide evidence that a greater portion of these spaces exhibit high linear discriminability of concepts --a key property motivating the use of steering vectors-- than attention head outputs. We analytically characterize the effect of our method, which we term DISentangled COmmunication (DISCO) Steering, on attention head outputs. Our analysis reveals that DISCO disentangles a strong but underutilized baseline, steering attention inputs, which implicitly modifies queries and values in a rigid manner. In contrast, DISCO's direct modulation of these components enables more granular control. We find that DISCO achieves superior performance over a number of steering vector baselines across multiple datasets on LLaMA 3.1 8B and Gemma 2 9B, with steering efficacy scoring up to 19.1% higher than the runner-up. Our results support the conclusion that the query and value spaces are powerful building blocks for steering vector methods.

Comment: Matches Model Architecture: steers models by directly modulating attention queries/values, enabling finer control than residual-stream steering.

Relevance: 9 Novelty: 8

22. Evolution of Concepts in Language Model Pre-Training

ArXiv ID: 2509.17196

Authors: Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu

Abstract: Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.

Comment: Representation Learning: tracks emergence of linear interpretable features during LLM pre-training via sparse dictionary learning; analyzes training dynamics.

Relevance: 9 Novelty: 7

23. Interpreting vision transformers via residual replacement model

ArXiv ID: 2509.17401

Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang

Abstract: How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by introducing the residual replacement model, which replaces ViT computations with interpretable features in the residual stream. Our analysis reveals not only a feature evolution from low-level patterns to high-level semantics, but also how ViTs encode curves and spatial positions through specialized feature types. The residual replacement model scalably produces a faithful yet parsimonious circuit for human-scale interpretability by significantly simplifying the original computations. As a result, this framework enables intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility of our framework in debiasing spurious correlations.

Comment: Representation Learning/Interpretability: sparse autoencoder features and residual replacement model to explain ViT circuits across layers.

Relevance: 9 Novelty: 7

24. The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs

ArXiv ID: 2509.17030

Authors: Hinata Tezuka, Naoya Inoue

Abstract: Recent studies have suggested a processing framework for multilingual inputs in decoder-based LLMs: early layers convert inputs into English-centric and language-agnostic representations; middle layers perform reasoning within an English-centric latent space; and final layers generate outputs by transforming these representations back into language-specific latent spaces. However, the internal dynamics of such transformation and the underlying mechanism remain underexplored. Towards a deeper understanding of this framework, we propose and empirically validate The Transfer Neurons Hypothesis: certain neurons in the MLP module are responsible for transferring representations between language-specific latent spaces and a shared semantic latent space. Furthermore, we show that one function of language-specific neurons, as identified in recent studies, is to facilitate movement between latent spaces. Finally, we show that transfer neurons are critical for reasoning in multilingual LLMs.

Comment: Representation learning: identifies and validates MLP “transfer neurons” enabling transitions between language-specific and shared latent spaces in multilingual LLMs.

Relevance: 9 Novelty: 7

25. ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching

ArXiv ID: 2509.16857

Authors: Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu

Abstract: Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with model computation. We present ShadowServe, the first SmartNIC-accelerated, interference-free prefix caching system for LLM serving. ShadowServe separates a control plane on the host and a data plane fully offloaded to the SmartNIC, which eliminates interference to both host GPU and CPU. To overcome the SmartNIC's limited compute and memory resources, we design a chunked pipeline that parallelizes data plane operations across the SmartNIC's compute resources, and a minimal-copy memory management scheme that reduces memory pressure on the SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2x lower loaded time-per-output-token (TPOT), and reduces time-to-first-token (TTFT) by up to 1.38x in low-bandwidth scenarios (<= 20 Gbps), translating to up to 1.35x higher throughput.

Comment: High Performance Computing: SmartNIC-offloaded, interference-free KV cache fetching and memory management for distributed prefix caching in LLM serving.

Relevance: 9 Novelty: 7

26. SEQR: Secure and Efficient QR-based LoRA Routing

ArXiv ID: 2509.18093

Authors: William Fleshman, Benjamin Van Durme

Abstract: Low-Rank Adaptation (LoRA) has become a standard technique for parameter-efficient fine-tuning of large language models, enabling large libraries of LoRAs, each for a specific task or domain. Efficiently selecting the correct LoRA adapter for a given input remains a challenge, particularly in secure environments where supervised training of routers may raise privacy concerns. Motivated by previous approaches, we formalize the goal of unsupervised LoRA routing in terms of activation norm maximization, providing a theoretical framework for analysis. We demonstrate the discriminative power of activation norms and introduce SEQR, an unsupervised LoRA routing algorithm designed to maximize efficiency while providing strict routing guarantees. SEQR provably identifies the norm-maximizing adapter with significantly greater efficiency, making it a highly scalable and effective solution for dynamic LoRA composition. We validate our results through experiments that demonstrate improved multi-task performance and efficiency.

Comment: Model Compression/Efficiency: unsupervised LoRA routing via activation-norm maximization (SEQR) with provable guarantees for efficient adapter selection.

Relevance: 9 Novelty: 7

27. KANO: Kolmogorov-Arnold Neural Operator

ArXiv ID: 2509.16825

Authors: Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong, Murphy Yuezhen Niu, Zheng Zhang

Abstract: We introduce Kolmogorov--Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases with intrinsic symbolic interpretability. We theoretically demonstrate that KANO overcomes the pure-spectral bottleneck of Fourier Neural Operator (FNO): KANO remains expressive over generic position-dependent dynamics for any physical input, whereas FNO stays practical only for spectrally sparse operators and strictly imposes a fast-decaying input Fourier tail. We verify our claims empirically on position-dependent differential operators, for which KANO robustly generalizes but FNO fails to. In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients and attains $\approx 6\times10^{-6}$ state infidelity from projective measurement data, substantially outperforming that of the FNO trained with ideal full wave function data, $\approx 1.5\times10^{-2}$, by orders of magnitude.

Comment: Model Architecture: introduces a dual-domain neural operator (KANO) with theoretical expressivity gains over FNO for position-dependent operators.

Relevance: 8 Novelty: 8

28. Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization

ArXiv ID: 2509.17251

Authors: Jingfeng Wu, Peter L. Bartlett, Jason D. Lee, Sham M. Kakade, Bin Yu

Abstract: Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal, while both ridge regression and online stochastic gradient descent (SGD) are polynomially suboptimal for certain categories of such problems. Moving beyond minimax theory, this work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem. Our analysis yields three key findings. First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems -- those with fast and continuously decaying covariance spectra -- which includes all problems satisfying the standard capacity condition.

Comment: Representation Learning/Training Dynamics: instance-wise finite-sample risk comparisons showing implicit regularization (GD) dominates explicit (ridge) under broad conditions.

Relevance: 8 Novelty: 8

29. Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise

ArXiv ID: 2509.18001

Authors: Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le

Abstract: Sharpness-aware minimization (SAM) has emerged as a highly effective technique for improving model generalization, but its underlying principles are not fully understood. We investigated the phenomenon known as m-sharpness, where the performance of SAM improves monotonically as the micro-batch size for computing perturbations decreases. Leveraging an extended Stochastic Differential Equation (SDE) framework, combined with an analysis of the structure of stochastic gradient noise (SGN), we precisely characterize the dynamics of various SAM variants. Our findings reveal that the stochastic noise introduced during SAM perturbations inherently induces a variance-based sharpness regularization effect. Motivated by our theoretical insights, we introduce Reweighted SAM, which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate the effectiveness of our theoretical analysis and proposed method.

Comment: Matches Representation Learning: theoretical SDE analysis of SGN in SAM explains m-sharpness and motivates a new sharpness-weighted sampling method.

Relevance: 8 Novelty: 8

30. $\boldsymbol{\lambda}$-Orthogonality Regularization for Compatible Representation Learning

ArXiv ID: 2509.16664

Authors: Simone Ricci, Niccol`o Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo

Abstract: Retrieval systems rely on representations learned by increasingly powerful models. However, due to the high training cost and inconsistencies in learned representations, there is significant interest in facilitating communication between representations and ensuring compatibility across independently trained neural networks. In the literature, two primary approaches are commonly used to adapt different learned representations: affine transformations, which adapt well to specific distributions but can significantly alter the original representation, and orthogonal transformations, which preserve the original structure with strict geometric constraints but limit adaptability. A key challenge is adapting the latent spaces of updated models to align with those of previous models on downstream distributions while preserving the newly learned representation spaces. In this paper, we impose a relaxed orthogonality constraint, namely $\lambda$-orthogonality regularization, while learning an affine transformation, to obtain distribution-specific adaptation while retaining the original learned representations. Extensive experiments across various architectures and datasets validate our approach, demonstrating that it preserves the model's zero-shot performance and ensures compatibility across model updates. Code available at: https://github.com/miccunifi/lambda_orthogonality

Comment: Representation Learning: λ-orthogonality regularization to learn near-orthogonal affine mappings that preserve latent geometry while aligning representations across independently trained models.