Previous Day 2025-09-22
Monthly Overview 2025-09
Next Day 2025-09-24

Personalized Daily ArXiv Papers 2025-09-23

[gpt-5] Prompt Completion Total
Token 76776 79041 155817
Cost $0.1 $0.79 $0.89

Total arXiv papers: 964

Total scanned papers: 548

Total relevant papers: 48

Table of contents with paper titles:

  1. PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zheng Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong

  2. Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few Authors: Qishuai Wen, Zhiyuan Huang, Chun-Guang Li

  3. The Complexity of Finding Local Optima in Contrastive Learning Authors: Jingming Yan, Yiyuan Luo, Vaggos Chatziafratis, Ioannis Panageas, Parnian Shahkar, Stelios Stavroulakis

  4. Qwen3-Omni Technical Report Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin

  5. MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

  6. Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data Authors: Tianyi Chen, Pengxiao Lin, Zhiwei Wang, Zhi-Qin John Xu

  7. Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation Authors: Junzhuo Li, Bo Wang, Xiuze Zhou, Xuming Hu

  8. Understanding Post-Training Structural Changes in Large Language Models Authors: Xinyu He, Xianghui Cao

  9. Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features Authors: Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun

  10. Robust LLM Training Infrastructure at ByteDance Authors: Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang

  11. Bias-variance Tradeoff in Tensor Estimation Authors: Shivam Kumar, Haotian Xu, Carlos Misael Madrid Padilla, Yuehaw Khoo, Oscar Hernan Madrid Padilla, Daren Wang

  12. ViTCAE: ViT-based Class-conditioned Autoencoder Authors: Vahid Jebraeeli, Hamid Krim, Derya Cansever

  13. LightCode: Compiling LLM Inference for Photonic-Electronic Systems Authors: Ryan Tomich, Zhizhen Zhong, Dirk Englund

  14. Deep Hierarchical Learning with Nested Subspace Networks Authors: Paulius Rauba, Mihaela van der Schaar

  15. Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking Authors: Ting Han, Linara Adilova, Henning Petzka, Jens Kleesiek, Michael Kamp

  16. Accurate and Efficient Low-Rank Model Merging in Core Space Authors: Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bart{\l}omiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer

  17. Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli

  18. Language Modeling with Learned Meta-Tokens Authors: Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari

  19. Flow-Induced Diagonal Gaussian Processes Authors: Moule Lin, Andrea Patane, Weipeng Jing, Shuhao Guan, Goetz Botterweck

  20. Elucidating the Design Space of FP4 training Authors: Robert Hu, Carlo Luschi, Paul Balanca

  21. DISCO: Disentangled Communication Steering for Large Language Models Authors: Max Torop, Aria Masoomi, Masih Eskandar, Jennifer Dy

  22. Evolution of Concepts in Language Model Pre-Training Authors: Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu

  23. Interpreting vision transformers via residual replacement model Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang

  24. The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs Authors: Hinata Tezuka, Naoya Inoue

  25. ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching Authors: Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu

  26. SEQR: Secure and Efficient QR-based LoRA Routing Authors: William Fleshman, Benjamin Van Durme

  27. KANO: Kolmogorov-Arnold Neural Operator Authors: Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong, Murphy Yuezhen Niu, Zheng Zhang

  28. Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization Authors: Jingfeng Wu, Peter L. Bartlett, Jason D. Lee, Sham M. Kakade, Bin Yu

  29. Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise Authors: Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le

  30. $\boldsymbol{\lambda}$-Orthogonality Regularization for Compatible Representation Learning Authors: Simone Ricci, Niccol`o Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo

  31. Dendritic Resonate-and-Fire Neuron for Effective and Efficient Long Sequence Modeling Authors: Dehao Zhang, Malu Zhang, Shuai Wang, Jingya Wang, Wenjie Wei, Zeyu Ma, Guoqing Wang, Yang Yang, HaiZhou Li

  32. Digging Into the Internal: Causality-Based Analysis of LLM Function Calling Authors: Zhenlan Ji, Daoyuan Wu, Wenxuan Wang, Pingchuan Ma, Shuai Wang, Lei Ma

  33. EMPEROR: Efficient Moment-Preserving Representation of Distributions Authors: Xinran Liu, Shansita D. Sharma, Soheil Kolouri

  34. Geometric Mixture Classifier (GMC): A Discriminative Per-Class Mixture of Hyperplanes Authors: Prasanth K K, Shubham Sharma

  35. Low-Rank Adaptation of Evolutionary Deep Neural Networks for Efficient Learning of Time-Dependent PDEs Authors: Jiahao Zhang, Shiheng Zhang, Guang Lin

  36. Adaptive Overclocking: Dynamic Control of Thinking Path Length via Real-Time Reasoning Signals Authors: Shuhao Jiang, Songbo Wang, Yang Qiao, Chun Xu, Chaoyang Zheng, Shengyi Zhou, Huanjun Wang, Fangming Li, Cong Zhang, Jiyu Wang

  37. Stabilizing Information Flow Entropy: Regularization for Safe and Interpretable Autonomous Driving Perception Authors: Haobo Yang, Shiyan Zhang, Zhuoyi Yang, Jilong Guo, Jun Yang, Xinyu Zhang

  38. PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality Authors: Byeongho Yu, Changhun Lee, Jungyu Jin, Eunhyeok Park

  39. Probabilistic Token Alignment for Large Language Model Fusion Authors: Runjia Zeng, James Chenhao Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu

  40. Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels Authors: Junjie Ye, Yuming Yang, Yang Nan, Shuo Li, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan

  41. Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe Authors: Chong You, Rajesh Jayaram, Ananda Theertha Suresh, Robin Nittka, Felix Yu, Sanjiv Kumar

  42. A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective Authors: Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu

  43. GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2 Authors: Savini Kashmira, Jayanaka Dantanarayana, Thamirawaran Sathiyalogeswaran, Yichao Yuan, Nishil Talati, Krisztian Flautner, Lingjia Tang, Jason Mars

  44. Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models Authors: Jinyeong Kim, Seil Kang, Jiwoo Park, Junhyeok Kim, Seong Jae Hwang

  45. Causal Representation Learning from Multimodal Clinical Records under Non-Random Modality Missingness Authors: Zihan Liang, Ziwen Pan, Ruoxuan Xiong

  46. Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization Authors: Manish Acharya, David Hyde

  47. Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning Authors: Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan

  48. Unrolled Graph Neural Networks for Constrained Optimization Authors: Samar Hadou, Alejandro Ribeiro


1. PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models

ArXiv ID: 2509.16989

Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zheng Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong

Abstract: Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and model expressiveness. While existing ultra-low-bit PTQ methods rely on binary approximations or complex compensation mechanisms, they suffer from either limited representational capacity or computational overhead that undermines their efficiency gains. We introduce PTQ to Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit representation. PTQTP achieves multiplication-free inference, identical to 1-bit quantization, while maintaining superior expressiveness through its novel structured decomposition. Our approach provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment across diverse modern LLMs without architectural modifications; and (3) uniform ternary operations that eliminate the need for mixed-precision or compensation schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention versus 0% for competing approaches. PTQTP approaches and sometimes surpasses 1.58-bit quantization-aware training performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods. These results establish PTQTP as a practical solution for efficient LLM deployment in resource-constrained environments.

Comment: Compression/Efficiency: introduces ternary post-training quantization to trit-planes enabling multiplication-free inference with a progressive approximation algorithm.

Relevance: 10 Novelty: 9


2. Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few

ArXiv ID: 2509.16875

Authors: Qishuai Wen, Zhiyuan Huang, Chun-Guang Li

Abstract: Attention mechanisms in Transformers have gained significant empirical success. Nonetheless, the optimization objectives underlying their forward pass are still unclear. Additionally, the quadratic complexity of self-attention is increasingly prohibitive. Unlike the prior work on addressing the interpretability or efficiency issue separately, we propose a unified optimization objective to alleviate both issues simultaneously. By unrolling the optimization over the objective, we derive an inherently interpretable and efficient attention mechanism, which compresses all tokens into low-dimensional structures by contracting a few representative tokens and then broadcasting the contractions back. This Contract-and-Broadcast Self-Attention (CBSA) mechanism can not only scale linearly but also generalize existing attention mechanisms as its special cases. Experiments further demonstrate comparable performance and even superior advantages of CBSA on several visual tasks. Code is available at this https URL.

Comment: Model Architecture: derives an interpretable Contract-and-Broadcast Self-Attention that compresses tokens into low-dimensional structures and generalizes existing attention; Model Compression and Efficiency: achieves linear-time attention by contracting a few representative tokens.

Relevance: 10 Novelty: 9


3. The Complexity of Finding Local Optima in Contrastive Learning

ArXiv ID: 2509.16898

Authors: Jingming Yan, Yiyuan Luo, Vaggos Chatziafratis, Ioannis Panageas, Parnian Shahkar, Stelios Stavroulakis

Abstract: Contrastive learning is a powerful technique for discovering meaningful data representations by optimizing objectives based on $\textit{contrastive information}$, often given as a set of weighted triplets ${(x_i, y_i^+, z_{i}^-)}_{i = 1}^m$ indicating that an "anchor" $x_i$ is more similar to a "positive" example $y_i$ than to a "negative" example $z_i$. The goal is to find representations (e.g., embeddings in $\mathbb{R}^d$ or a tree metric) where anchors are placed closer to positive than to negative examples. While finding $\textit{global}$ optima of contrastive objectives is $\mathsf{NP}$-hard, the complexity of finding $\textit{local}$ optima -- representations that do not improve by local search algorithms such as gradient-based methods -- remains open. Our work settles the complexity of finding local optima in various contrastive learning problems by proving $\mathsf{PLS}$-hardness in discrete settings (e.g., maximize satisfied triplets) and $\mathsf{CLS}$-hardness in continuous settings (e.g., minimize Triplet Loss), where $\mathsf{PLS}$ (Polynomial Local Search) and $\mathsf{CLS}$ (Continuous Local Search) are well-studied complexity classes capturing local search dynamics in discrete and continuous optimization, respectively. Our results imply that no polynomial time algorithm (local search or otherwise) can find a local optimum for various contrastive learning problems, unless $\mathsf{PLS}\subseteq\mathsf{P}$ (or $\mathsf{CLS}\subseteq \mathsf{P}$ for continuous problems). Even in the unlikely scenario that $\mathsf{PLS}\subseteq\mathsf{P}$ (or $\mathsf{CLS}\subseteq \mathsf{P}$), our reductions imply that there exist instances where local search algorithms need exponential time to reach a local optimum, even for $d=1$ (embeddings on a line).

Comment: Representation Learning Theory: proves PLS/CLS-hardness for finding local optima in contrastive objectives (e.g., triplet loss), clarifying limits of local search/gradient methods.

Relevance: 10 Novelty: 9


4. Qwen3-Omni Technical Report

ArXiv ID: 2509.17765

Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin

Abstract: We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

Comment: Model Architecture: Thinker–Talker Mixture-of-Experts unifying text/image/audio/video; Efficiency: replaces diffusion with causal ConvNet over discrete codebooks for low-latency streaming.

Relevance: 10 Novelty: 8


5. MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

ArXiv ID: 2509.17238

Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

Abstract: The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction.To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.

Comment: Model Architecture & Efficiency: inference-time hyper-parallel scaling for MoE (RoE) with stochastic routing and specialized KV caching to boost token-level accuracy.

Relevance: 10 Novelty: 8


6. Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data

ArXiv ID: 2509.17514

Authors: Tianyi Chen, Pengxiao Lin, Zhiwei Wang, Zhi-Qin John Xu

Abstract: State Space Models (SSMs) have emerged as promising alternatives to attention mechanisms, with the Mamba architecture demonstrating impressive performance and linear complexity for processing long sequences. However, the fundamental differences between Mamba and Transformer architectures remain incompletely understood. In this work, we use carefully designed synthetic tasks to reveal Mamba's inherent limitations. Through experiments, we identify that Mamba's nonlinear convolution introduces an asymmetry bias that significantly impairs its ability to recognize symmetrical patterns and relationships. Using composite function and inverse sequence matching tasks, we demonstrate that Mamba strongly favors compositional solutions over symmetrical ones and struggles with tasks requiring the matching of reversed sequences. We show these limitations stem not from the SSM module itself but from the nonlinear convolution preceding it, which fuses token information asymmetrically. These insights provide a new understanding of Mamba's constraints and suggest concrete architectural improvements for future sequence models.

Comment: Matches Model Architecture: foundational analysis of Mamba’s nonlinear convolution revealing asymmetry bias and architectural limitations/suggested fixes.

Relevance: 10 Novelty: 8


7. Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation

ArXiv ID: 2509.16882

Authors: Junzhuo Li, Bo Wang, Xiuze Zhou, Xuming Hu

Abstract: Mixture-of-Experts (MoE) models offer immense capacity via sparsely gated expert subnetworks, yet adapting them to multiple domains without catastrophic forgetting remains an open challenge. Existing approaches either incur prohibitive computation, suffer cross-domain interference, or require separate runs per domain. We propose DES-MoE, a dynamic expert specialization framework for multi-domain adaptation of Mixture-of-Experts models. DES-MoE addresses catastrophic forgetting through three innovations: (1) an adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation, (2) real-time expert-domain correlation mapping to isolate domain-specific gradients, and (3) a three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters. Evaluated on six domains (math, code, law, etc.), DES-MoE matches single-domain ESFT performance while training one unified model, reduces forgetting by 89% compared to full fine-tuning as domains scale from 2 to 6, and achieves 68% faster convergence than conventional methods. Our work establishes dynamic expert isolation as a scalable paradigm for multi-task MoE adaptation.

Comment: Model architecture (MoE): dynamic expert routing/specialization with expert isolation and adaptive fine-tuning to mitigate catastrophic forgetting; training dynamics for MoE.

Relevance: 10 Novelty: 7


8. Understanding Post-Training Structural Changes in Large Language Models

ArXiv ID: 2509.17866

Authors: Xinyu He, Xianghui Cao

Abstract: Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two consistent and unexpected structural changes:(1) a near-uniform geometric scaling of singular values across layers, which theoretically modulates attention scores; and (2) highly consistent orthogonal transformations are applied to the left and right singular vectors of each matrix. Disrupting this orthogonal consistency leads to catastrophic performance degradation. Based on these findings, we propose a simple yet effective framework that interprets post-training as a reparameterization of fixed subspaces in the pretrained parameter space. Further experiments reveal that singular value scaling behaves as a secondary effect, analogous to a temperature adjustment, whereas the core functional transformation lies in the coordinated rotation of singular vectors. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.

Comment: Representation Learning: training-dynamics analysis via SVD reveals structured singular value scaling and coordinated orthogonal rotations after post-training, offering foundational insight into parameter evolution.

Relevance: 9 Novelty: 8


9. Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features

ArXiv ID: 2509.16629

Authors: Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun

Abstract: Positional encoding is essential for supplementing transformer with positional information of tokens. Existing positional encoding methods demand predefined token/feature order, rendering them unsuitable for real-world data with non-sequential yet causally-related features. To address this limitation, we propose CAPE, a novel method that identifies underlying causal structure over non-sequential features as a weighted directed acyclic graph (DAG) using generalized structural equation modeling. The DAG is then embedded in hyperbolic space where its geometric structure is well-preserved using a hyperboloid model-based approach that effectively captures two important causal graph properties (causal strength & causal specificity). This step yields causality-aware positional encodings for the features, which are converted into their rotary form for integrating with transformer's self-attention mechanism. Theoretical analysis reveals that CAPE-generated rotary positional encodings possess three valuable properties for enhanced self-attention, including causal distance-induced attenuation, causal generality-induced attenuation, and robustness to positional disturbances. We evaluate CAPE over both synthetic and real-word datasets, empirically demonstrating its theoretical properties and effectiveness in enhancing transformer for data with non-sequential features. Our code is available at https://github.com/Catchxu/CAPE.

Comment: Model Architecture + Representation Learning: causality-induced positional encodings via DAG-to-hyperbolic embeddings integrated as rotary PEs for Transformers.

Relevance: 9 Novelty: 8


10. Robust LLM Training Infrastructure at ByteDance

ArXiv ID: 2509.16293

Authors: Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang

Abstract: The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.

Comment: High Performance Computing: systems-level robust training infrastructure enabling large-scale, failure-tolerant LLM training across 200k+ GPUs.

Relevance: 9 Novelty: 8


11. Bias-variance Tradeoff in Tensor Estimation

ArXiv ID: 2509.17382

Authors: Shivam Kumar, Haotian Xu, Carlos Misael Madrid Padilla, Yuehaw Khoo, Oscar Hernan Madrid Padilla, Daren Wang

Abstract: We study denoising of a third-order tensor when the ground-truth tensor is not necessarily Tucker low-rank. Specifically, we observe $$ Y=X^\ast+Z\in \mathbb{R}^{p_{1} \times p_{2} \times p_{3}}, $$ where $X^\ast$ is the ground-truth tensor, and $Z$ is the noise tensor. We propose a simple variant of the higher-order tensor SVD estimator $\widetilde{X}$. We show that uniformly over all user-specified Tucker ranks $(r_{1},r_{2},r_{3})$, $$ | \widetilde{X} - X^* |{ \mathrm{F}}^2 = O \Big( \kappa^2 \Big{ r)$ decreases and the variance increases. As a byproduct we also obtain a convenient bias-variance decomposition for the vanilla low-rank SVD matrix estimators.}r_{2}r_{3}+\sum_{k=1}^{3} p_{k} r_{k} \Big} \; + \; \xi_{(r_{1},r_{2},r_{3})}^2\Big) \quad \text{ with high probability.} $$ Here, the bias term $\xi_{(r_1,r_2,r_3)}$ corresponds to the best achievable approximation error of $X^\ast$ over the class of tensors with Tucker ranks $(r_1,r_2,r_3)$; $\kappa^2$ quantifies the noise level; and the variance term $\kappa^2 {r_{1}r_{2}r_{3}+\sum_{k=1}^{3} p_{k} r_{k}}$ scales with the effective number of free parameters in the estimator $\widetilde{X}$. Our analysis achieves a clean rank-adaptive bias--variance tradeoff: as we increase the ranks of estimator $\widetilde{X}$, the bias $\xi(r_{1},r_{2},r_{3

Comment: Low-rank/tensor estimation theory: rank-adaptive bias–variance tradeoff for HOSVD-like estimator; foundational in low-rank representation learning.

Relevance: 9 Novelty: 8


12. ViTCAE: ViT-based Class-conditioned Autoencoder

ArXiv ID: 2509.16554

Authors: Vahid Jebraeeli, Hamid Krim, Derya Cansever

Abstract: Vision Transformer (ViT) based autoencoders often underutilize the global Class token and employ static attention mechanisms, limiting both generative control and optimization efficiency. This paper introduces ViTCAE, a framework that addresses these issues by re-purposing the Class token into a generative linchpin. In our architecture, the encoder maps the Class token to a global latent variable that dictates the prior distribution for local, patch-level latent variables, establishing a robust dependency where global semantics directly inform the synthesis of local details. Drawing inspiration from opinion dynamics, we treat each attention head as a dynamical system of interacting tokens seeking consensus. This perspective motivates a convergence-aware temperature scheduler that adaptively anneals each head's influence function based on its distributional stability. This process enables a principled head-freezing mechanism, guided by theoretically-grounded diagnostics like an attention evolution distance and a consensus/cluster functional. This technique prunes converged heads during training to significantly improve computational efficiency without sacrificing fidelity. By unifying a generative Class token with an adaptive attention mechanism rooted in multi-agent consensus theory, ViTCAE offers a more efficient and controllable approach to transformer-based generation.

Comment: Model Architecture: ViT-based class-conditioned autoencoder where the class token controls a global latent prior; Model Compression and Efficiency: convergence-aware attention temperature scheduler that freezes/prunes converged heads to cut compute during training.

Relevance: 9 Novelty: 8


13. LightCode: Compiling LLM Inference for Photonic-Electronic Systems

ArXiv ID: 2509.16443

Authors: Ryan Tomich, Zhizhen Zhong, Dirk Englund

Abstract: The growing demand for low-latency, energy-efficient inference in large language models (LLMs) has catalyzed interest in heterogeneous architectures. While GPUs remain dominant, they are poorly suited for integration with emerging domain-specific accelerators like the Photonic Tensor Units (PTUs), which offer low-power, high-throughput linear computation. This motivates hybrid compilation strategies that combine photonic and electronic resources. We present LightCode, a compiler framework and simulator for mapping LLM inference workloads across hybrid photonic-electronic systems. LightCode introduces the Stacked Graph, an intermediate representation that encodes multiple hardware-specific realizations of each tensor operation. Hardware assignment is formulated as a constrained subgraph selection problem optimized for latency or energy under parametric cost models. We evaluate LightCode on the prefill stage of GPT-2 and Llama-7B showing that under our workload and hardware assumptions, (i) Photonic hardware reduced energy by up to 50% in our simulated workloads at maximum sequence length; (ii) multiplexing and assignment strategy yielded latency improvements exceeding 10x; and (iii) Optimizing for latency or energy resulted in distinct hardware mappings in our simulations. LightCode offers a module, foundational framework and simulator for compiling LLMs to emerging photonic accelerators.

Comment: High Performance Computing/Systems: compiler IR (Stacked Graph) and hardware assignment for hybrid photonic–electronic LLM inference optimizing latency/energy.

Relevance: 9 Novelty: 8


14. Deep Hierarchical Learning with Nested Subspace Networks

ArXiv ID: 2509.17874

Authors: Paulius Rauba, Mihaela van der Schaar

Abstract: Large neural networks are typically trained for a fixed computational budget, creating a rigid trade-off between performance and efficiency that is ill-suited for deployment in resource-constrained or dynamic environments. Existing approaches to this problem present a difficult choice: training a discrete collection of specialist models is computationally prohibitive, while dynamic methods like slimmable networks often lack the flexibility to be applied to large, pre-trained foundation models. In this work, we propose Nested Subspace Networks (NSNs), a novel architectural paradigm that enables a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets at inference time. The core of our approach is to re-parameterize linear layers to satisfy a nested subspace property, such that the function computed at a given rank is a strict subspace of the function at any higher rank. We show that this entire hierarchy of models can be optimized jointly via an uncertainty-aware objective that learns to balance the contributions of different ranks based on their intrinsic difficulty. We demonstrate empirically that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier. For example, a single NSN-adapted model can achieve a 50% reduction in inference FLOPs with only a 5 percentage point loss in accuracy. Our findings establish NSNs as a powerful framework for creating the next generation of adaptive foundation models.

Comment: Model Architecture & Efficiency: Nested Subspace Networks reparameterize linear layers for rank-nested functions enabling continuous compute scaling in pre-trained LLMs.

Relevance: 9 Novelty: 8


15. Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking

ArXiv ID: 2509.17738

Authors: Ting Han, Linara Adilova, Henning Petzka, Jens Kleesiek, Michael Kamp

Abstract: Neural collapse, i.e., the emergence of highly symmetric, class-wise clustered representations, is frequently observed in deep networks and is often assumed to reflect or enable generalization. In parallel, flatness of the loss landscape has been theoretically and empirically linked to generalization. Yet, the causal role of either phenomenon remains unclear: Are they prerequisites for generalization, or merely by-products of training dynamics? We disentangle these questions using grokking, a training regime in which memorization precedes generalization, allowing us to temporally separate generalization from training dynamics and we find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed generalization. Furthermore, we show theoretically that neural collapse implies relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.

Comment: Representation Learning: disentangles flatness vs. neural collapse via grokking, showing flatness predicts generalization; theoretical link between collapse and flatness.

Relevance: 9 Novelty: 8


16. Accurate and Efficient Low-Rank Model Merging in Core Space

ArXiv ID: 2509.17786

Authors: Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bart{\l}omiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer

Abstract: In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging.

Comment: Model Compression/Efficiency: efficient low-rank (LoRA) model merging in a shared core space with formal guarantees and improved accuracy/compute trade-offs.

Relevance: 9 Novelty: 8


17. Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

ArXiv ID: 2509.18085

Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli

Abstract: Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.

Comment: Efficiency/HPC: lossless speculative decoding for diffusion LLMs with parallel verification and calibrated draft graphs; complements KV caching and multi-token methods.

Relevance: 9 Novelty: 8


18. Language Modeling with Learned Meta-Tokens

ArXiv ID: 2509.16278

Authors: Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari

Abstract: While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model with a modified GPT-2 architecture equipped with meta-attention in addition to causal multi-head attention, and study the impact of these tokens on a suite of synthetic tasks. We find that data-efficient language model pre-training on fewer than 100B tokens utilizing meta-tokens and our meta-attention mechanism achieves strong performance on these tasks after fine-tuning. We suggest that these gains arise due to the meta-tokens sharpening the positional encoding. This enables them to operate as trainable, content-based landmarks, implicitly compressing preceding context and "caching" it in the meta-token. At inference-time, the meta-token points to relevant context, facilitating length generalization up to 2$\times$ its context window, even after extension with YaRN. We provide further evidence of these behaviors by visualizing model internals to study the residual stream, and assessing the compression quality by information-theoretic analysis on the rate-distortion tradeoff. Our findings suggest that pre-training LMs with meta-tokens offers a simple, data-efficient method to enhance long-context language modeling performance, while introducing new insights into the nature of their behavior towards length generalization.

Comment: Model Architecture: introduces meta-attention and learned meta-tokens that compress/cache context; Efficiency: enables long-context length generalization via implicit context compression.

Relevance: 9 Novelty: 8


19. Flow-Induced Diagonal Gaussian Processes

ArXiv ID: 2509.17153

Authors: Moule Lin, Andrea Patane, Weipeng Jing, Shuhao Guan, Goetz Botterweck

Abstract: We present Flow-Induced Diagonal Gaussian Processes (FiD-GP), a compression framework that incorporates a compact inducing weight matrix to project a neural network's weight uncertainty into a lower-dimensional subspace. Critically, FiD-GP relies on normalising-flow priors and spectral regularisations to augment its expressiveness and align the inducing subspace with feature-gradient geometry through a numerically stable projection mechanism objective. Furthermore, we demonstrate how the prediction framework in FiD-GP can help to design a single-pass projection for Out-of-Distribution (OoD) detection. Our analysis shows that FiD-GP improves uncertainty estimation ability on various tasks compared with SVGP-based baselines, satisfies tight spectral residual bounds with theoretically guaranteed OoD detection, and significantly compresses the neural network's storage requirements at the cost of increased inference computation dependent on the number of inducing weights employed. Specifically, in a comprehensive empirical study spanning regression, image classification, semantic segmentation, and out-of-distribution detection benchmarks, it cuts Bayesian training cost by several orders of magnitude, compresses parameters by roughly 51%, reduces model size by about 75%, and matches state-of-the-art accuracy and uncertainty estimation.

Comment: Matches Model Compression and Efficiency: compresses NN weight uncertainty via low-dimensional inducing weights with normalizing-flow priors and spectral regularization.

Relevance: 9 Novelty: 8


20. Elucidating the Design Space of FP4 training

ArXiv ID: 2509.17791

Authors: Robert Hu, Carlo Luschi, Paul Balanca

Abstract: The increasing computational demands of foundation models have spurred research into low-precision training, with 4-bit floating-point (\texttt{FP4}) formats emerging as a frontier for maximizing hardware throughput. While numerous techniques have been proposed to stabilize \texttt{FP4} training, they often present isolated solutions with varying, and not always clear, computational overheads. This paper aims to provide a unified view of the design space of \texttt{FP4} training. We introduce a comprehensive, quantisation gradient-based framework for microscaling quantization that allows for a theoretical analysis of the computational costs associated with different stabilization methods on both the forward and backward passes. Using a simulator built on this framework, we conduct an extensive empirical study across a wide range of machine learning tasks, including regression, image classification, diffusion models, and language models. By systematically evaluating thousands of combinations of techniques, such as novel gradient approximations, rounding strategies, and scaling methods, we identify which configurations offer the most favourable performance-to-overhead trade-off. We find that the techniques enabling the best trade-off involve carefully combining Hadamard transformations, tensor scaling and stochastic rounding. We further find that using \texttt{UE5M3} as a scaling factor potentially offers a good compromise between range and precision with manageable computational overhead.

Comment: Matches Model Compression and Efficiency: unified FP4 low-precision training framework for microscaling quantization with cost analysis and stabilization techniques.

Relevance: 9 Novelty: 8


21. DISCO: Disentangled Communication Steering for Large Language Models

ArXiv ID: 2509.16820

Authors: Max Torop, Aria Masoomi, Masih Eskandar, Jennifer Dy

Abstract: A variety of recent methods guide large language model outputs via the inference-time addition of steering vectors to residual-stream or attention-head representations. In contrast, we propose to inject steering vectors directly into the query and value representation spaces within attention heads. We provide evidence that a greater portion of these spaces exhibit high linear discriminability of concepts --a key property motivating the use of steering vectors-- than attention head outputs. We analytically characterize the effect of our method, which we term DISentangled COmmunication (DISCO) Steering, on attention head outputs. Our analysis reveals that DISCO disentangles a strong but underutilized baseline, steering attention inputs, which implicitly modifies queries and values in a rigid manner. In contrast, DISCO's direct modulation of these components enables more granular control. We find that DISCO achieves superior performance over a number of steering vector baselines across multiple datasets on LLaMA 3.1 8B and Gemma 2 9B, with steering efficacy scoring up to 19.1% higher than the runner-up. Our results support the conclusion that the query and value spaces are powerful building blocks for steering vector methods.

Comment: Matches Model Architecture: steers models by directly modulating attention queries/values, enabling finer control than residual-stream steering.

Relevance: 9 Novelty: 8


22. Evolution of Concepts in Language Model Pre-Training

ArXiv ID: 2509.17196

Authors: Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu

Abstract: Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.

Comment: Representation Learning: tracks emergence of linear interpretable features during LLM pre-training via sparse dictionary learning; analyzes training dynamics.

Relevance: 9 Novelty: 7


23. Interpreting vision transformers via residual replacement model

ArXiv ID: 2509.17401

Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang

Abstract: How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by introducing the residual replacement model, which replaces ViT computations with interpretable features in the residual stream. Our analysis reveals not only a feature evolution from low-level patterns to high-level semantics, but also how ViTs encode curves and spatial positions through specialized feature types. The residual replacement model scalably produces a faithful yet parsimonious circuit for human-scale interpretability by significantly simplifying the original computations. As a result, this framework enables intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility of our framework in debiasing spurious correlations.

Comment: Representation Learning/Interpretability: sparse autoencoder features and residual replacement model to explain ViT circuits across layers.

Relevance: 9 Novelty: 7


24. The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs

ArXiv ID: 2509.17030

Authors: Hinata Tezuka, Naoya Inoue

Abstract: Recent studies have suggested a processing framework for multilingual inputs in decoder-based LLMs: early layers convert inputs into English-centric and language-agnostic representations; middle layers perform reasoning within an English-centric latent space; and final layers generate outputs by transforming these representations back into language-specific latent spaces. However, the internal dynamics of such transformation and the underlying mechanism remain underexplored. Towards a deeper understanding of this framework, we propose and empirically validate The Transfer Neurons Hypothesis: certain neurons in the MLP module are responsible for transferring representations between language-specific latent spaces and a shared semantic latent space. Furthermore, we show that one function of language-specific neurons, as identified in recent studies, is to facilitate movement between latent spaces. Finally, we show that transfer neurons are critical for reasoning in multilingual LLMs.

Comment: Representation learning: identifies and validates MLP “transfer neurons” enabling transitions between language-specific and shared latent spaces in multilingual LLMs.

Relevance: 9 Novelty: 7


25. ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching

ArXiv ID: 2509.16857

Authors: Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu

Abstract: Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with model computation. We present ShadowServe, the first SmartNIC-accelerated, interference-free prefix caching system for LLM serving. ShadowServe separates a control plane on the host and a data plane fully offloaded to the SmartNIC, which eliminates interference to both host GPU and CPU. To overcome the SmartNIC's limited compute and memory resources, we design a chunked pipeline that parallelizes data plane operations across the SmartNIC's compute resources, and a minimal-copy memory management scheme that reduces memory pressure on the SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2x lower loaded time-per-output-token (TPOT), and reduces time-to-first-token (TTFT) by up to 1.38x in low-bandwidth scenarios (<= 20 Gbps), translating to up to 1.35x higher throughput.

Comment: High Performance Computing: SmartNIC-offloaded, interference-free KV cache fetching and memory management for distributed prefix caching in LLM serving.

Relevance: 9 Novelty: 7


26. SEQR: Secure and Efficient QR-based LoRA Routing

ArXiv ID: 2509.18093

Authors: William Fleshman, Benjamin Van Durme

Abstract: Low-Rank Adaptation (LoRA) has become a standard technique for parameter-efficient fine-tuning of large language models, enabling large libraries of LoRAs, each for a specific task or domain. Efficiently selecting the correct LoRA adapter for a given input remains a challenge, particularly in secure environments where supervised training of routers may raise privacy concerns. Motivated by previous approaches, we formalize the goal of unsupervised LoRA routing in terms of activation norm maximization, providing a theoretical framework for analysis. We demonstrate the discriminative power of activation norms and introduce SEQR, an unsupervised LoRA routing algorithm designed to maximize efficiency while providing strict routing guarantees. SEQR provably identifies the norm-maximizing adapter with significantly greater efficiency, making it a highly scalable and effective solution for dynamic LoRA composition. We validate our results through experiments that demonstrate improved multi-task performance and efficiency.

Comment: Model Compression/Efficiency: unsupervised LoRA routing via activation-norm maximization (SEQR) with provable guarantees for efficient adapter selection.

Relevance: 9 Novelty: 7


27. KANO: Kolmogorov-Arnold Neural Operator

ArXiv ID: 2509.16825

Authors: Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong, Murphy Yuezhen Niu, Zheng Zhang

Abstract: We introduce Kolmogorov--Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases with intrinsic symbolic interpretability. We theoretically demonstrate that KANO overcomes the pure-spectral bottleneck of Fourier Neural Operator (FNO): KANO remains expressive over generic position-dependent dynamics for any physical input, whereas FNO stays practical only for spectrally sparse operators and strictly imposes a fast-decaying input Fourier tail. We verify our claims empirically on position-dependent differential operators, for which KANO robustly generalizes but FNO fails to. In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients and attains $\approx 6\times10^{-6}$ state infidelity from projective measurement data, substantially outperforming that of the FNO trained with ideal full wave function data, $\approx 1.5\times10^{-2}$, by orders of magnitude.

Comment: Model Architecture: introduces a dual-domain neural operator (KANO) with theoretical expressivity gains over FNO for position-dependent operators.

Relevance: 8 Novelty: 8


28. Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization

ArXiv ID: 2509.17251

Authors: Jingfeng Wu, Peter L. Bartlett, Jason D. Lee, Sham M. Kakade, Bin Yu

Abstract: Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal, while both ridge regression and online stochastic gradient descent (SGD) are polynomially suboptimal for certain categories of such problems. Moving beyond minimax theory, this work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem. Our analysis yields three key findings. First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems -- those with fast and continuously decaying covariance spectra -- which includes all problems satisfying the standard capacity condition.

Comment: Representation Learning/Training Dynamics: instance-wise finite-sample risk comparisons showing implicit regularization (GD) dominates explicit (ridge) under broad conditions.

Relevance: 8 Novelty: 8


29. Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise

ArXiv ID: 2509.18001

Authors: Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le

Abstract: Sharpness-aware minimization (SAM) has emerged as a highly effective technique for improving model generalization, but its underlying principles are not fully understood. We investigated the phenomenon known as m-sharpness, where the performance of SAM improves monotonically as the micro-batch size for computing perturbations decreases. Leveraging an extended Stochastic Differential Equation (SDE) framework, combined with an analysis of the structure of stochastic gradient noise (SGN), we precisely characterize the dynamics of various SAM variants. Our findings reveal that the stochastic noise introduced during SAM perturbations inherently induces a variance-based sharpness regularization effect. Motivated by our theoretical insights, we introduce Reweighted SAM, which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate the effectiveness of our theoretical analysis and proposed method.

Comment: Matches Representation Learning: theoretical SDE analysis of SGN in SAM explains m-sharpness and motivates a new sharpness-weighted sampling method.

Relevance: 8 Novelty: 8


30. $\boldsymbol{\lambda}$-Orthogonality Regularization for Compatible Representation Learning

ArXiv ID: 2509.16664

Authors: Simone Ricci, Niccol`o Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo

Abstract: Retrieval systems rely on representations learned by increasingly powerful models. However, due to the high training cost and inconsistencies in learned representations, there is significant interest in facilitating communication between representations and ensuring compatibility across independently trained neural networks. In the literature, two primary approaches are commonly used to adapt different learned representations: affine transformations, which adapt well to specific distributions but can significantly alter the original representation, and orthogonal transformations, which preserve the original structure with strict geometric constraints but limit adaptability. A key challenge is adapting the latent spaces of updated models to align with those of previous models on downstream distributions while preserving the newly learned representation spaces. In this paper, we impose a relaxed orthogonality constraint, namely $\lambda$-orthogonality regularization, while learning an affine transformation, to obtain distribution-specific adaptation while retaining the original learned representations. Extensive experiments across various architectures and datasets validate our approach, demonstrating that it preserves the model's zero-shot performance and ensures compatibility across model updates. Code available at: https://github.com/miccunifi/lambda_orthogonality

Comment: Representation Learning: λ-orthogonality regularization to learn near-orthogonal affine mappings that preserve latent geometry while aligning representations across independently trained models.

Relevance: 8 Novelty: 7


31. Dendritic Resonate-and-Fire Neuron for Effective and Efficient Long Sequence Modeling

ArXiv ID: 2509.17186

Authors: Dehao Zhang, Malu Zhang, Shuai Wang, Jingya Wang, Wenjie Wei, Zeyu Ma, Guoqing Wang, Yang Yang, HaiZhou Li

Abstract: The explosive growth in sequence length has intensified the demand for effective and efficient long sequence modeling. Benefiting from intrinsic oscillatory membrane dynamics, Resonate-and-Fire (RF) neurons can efficiently extract frequency components from input signals and encode them into spatiotemporal spike trains, making them well-suited for long sequence modeling. However, RF neurons exhibit limited effective memory capacity and a trade-off between energy efficiency and training speed on complex temporal tasks. Inspired by the dendritic structure of biological neurons, we propose a Dendritic Resonate-and-Fire (D-RF) model, which explicitly incorporates a multi-dendritic and soma architecture. Each dendritic branch encodes specific frequency bands by utilizing the intrinsic oscillatory dynamics of RF neurons, thereby collectively achieving comprehensive frequency representation. Furthermore, we introduce an adaptive threshold mechanism into the soma structure that adjusts the threshold based on historical spiking activity, reducing redundant spikes while maintaining training efficiency in long sequence tasks. Extensive experiments demonstrate that our method maintains competitive accuracy while substantially ensuring sparse spikes without compromising computational efficiency during training. These results underscore its potential as an effective and efficient solution for long sequence modeling on edge platforms.

Comment: Model Architecture + Efficiency/Sparsity: introduces a multi-dendritic resonate-and-fire neuron with adaptive thresholds for long-sequence modeling, ensuring sparse spikes without sacrificing training efficiency.

Relevance: 8 Novelty: 7


32. Digging Into the Internal: Causality-Based Analysis of LLM Function Calling

ArXiv ID: 2509.16268

Authors: Zhenlan Ji, Daoyuan Wu, Wenxuan Wang, Pingchuan Ma, Shuai Wang, Lei Ma

Abstract: Function calling (FC) has emerged as a powerful technique for facilitating large language models (LLMs) to interact with external systems and perform structured tasks. However, the mechanisms through which it influences model behavior remain largely under-explored. Besides, we discover that in addition to the regular usage of FC, this technique can substantially enhance the compliance of LLMs with user instructions. These observations motivate us to leverage causality, a canonical analysis method, to investigate how FC works within LLMs. In particular, we conduct layer-level and token-level causal interventions to dissect FC's impact on the model's internal computational logic when responding to user queries. Our analysis confirms the substantial influence of FC and reveals several in-depth insights into its mechanisms. To further validate our findings, we conduct extensive experiments comparing the effectiveness of FC-based instructions against conventional prompting methods. We focus on enhancing LLM safety robustness, a critical LLM application scenario, and evaluate four mainstream LLMs across two benchmark datasets. The results are striking: FC shows an average performance improvement of around 135% over conventional prompting methods in detecting malicious inputs, demonstrating its promising potential to enhance LLM reliability and capability in practical applications.

Comment: Representation/Training Dynamics: layer- and token-level causal interventions to analyze how function calling alters internal computation and compliance.

Relevance: 8 Novelty: 7


33. EMPEROR: Efficient Moment-Preserving Representation of Distributions

ArXiv ID: 2509.16379

Authors: Xinran Liu, Shansita D. Sharma, Soheil Kolouri

Abstract: We introduce EMPEROR (Efficient Moment-Preserving Representation of Distributions), a mathematically rigorous and computationally efficient framework for representing high-dimensional probability measures arising in neural network representations. Unlike heuristic global pooling operations, EMPEROR encodes a feature distribution through its statistical moments. Our approach leverages the theory of sliced moments: features are projected onto multiple directions, lightweight univariate Gaussian mixture models (GMMs) are fit to each projection, and the resulting slice parameters are aggregated into a compact descriptor. We establish determinacy guarantees via Carleman's condition and the Cram\'er-Wold theorem, ensuring that the GMM is uniquely determined by its sliced moments, and we derive finite-sample error bounds that scale optimally with the number of slices and samples. Empirically, EMPEROR captures richer distributional information than common pooling schemes across various data modalities, while remaining computationally efficient and broadly applicable.

Comment: Representation Learning: moment-preserving distributional descriptors (sliced GMM moments) with identifiability and finite-sample guarantees; efficient alternative to global pooling.

Relevance: 8 Novelty: 7


34. Geometric Mixture Classifier (GMC): A Discriminative Per-Class Mixture of Hyperplanes

ArXiv ID: 2509.16769

Authors: Prasanth K K, Shubham Sharma

Abstract: Many real world categories are multimodal, with single classes occupying disjoint regions in feature space. Classical linear models (logistic regression, linear SVM) use a single global hyperplane and perform poorly on such data, while high-capacity methods (kernel SVMs, deep nets) fit multimodal structure but at the expense of interpretability, heavier tuning, and higher computational cost. We propose the Geometric Mixture Classifier (GMC), a discriminative model that represents each class as a mixture of hyperplanes. Within each class, GMC combines plane scores via a temperature-controlled soft-OR (log-sum-exp), smoothly approximating the max; across classes, standard softmax yields probabilistic posteriors. GMC optionally uses Random Fourier Features (RFF) for nonlinear mappings while keeping inference linear in the number of planes and features. Our practical training recipe: geometry-aware k-means initialization, silhouette-based plane budgeting, alpha annealing, usage-aware L2 regularization, label smoothing, and early stopping, makes GMC plug-and-play. Across synthetic multimodal datasets (moons, circles, blobs, spirals) and tabular/image benchmarks (iris, wine, WDBC, digits), GMC consistently outperforms linear baselines and k-NN, is competitive with RBF-SVM, Random Forests, and small MLPs, and provides geometric introspection via per-plane and class responsibility visualizations. Inference scales linearly in planes and features, making GMC CPU-friendly, with single-digit microsecond latency per example, often faster than RBF-SVM and compact MLPs. Post-hoc temperature scaling reduces ECE from about 0.06 to 0.02. GMC thus strikes a favorable balance of accuracy, interpretability, and efficiency: it is more expressive than linear models and lighter, more transparent, and faster than kernel or deep models.

Comment: Model Architecture: introduces a discriminative per-class mixture-of-hyperplanes with temperature-controlled soft-OR (log-sum-exp), yielding an interpretable conditional mixture model with linear-time inference.

Relevance: 8 Novelty: 7


35. Low-Rank Adaptation of Evolutionary Deep Neural Networks for Efficient Learning of Time-Dependent PDEs

ArXiv ID: 2509.16395

Authors: Jiahao Zhang, Shiheng Zhang, Guang Lin

Abstract: We study the Evolutionary Deep Neural Network (EDNN) framework for accelerating numerical solvers of time-dependent partial differential equations (PDEs). We introduce a Low-Rank Evolutionary Deep Neural Network (LR-EDNN), which constrains parameter evolution to a low-rank subspace, thereby reducing the effective dimensionality of training while preserving solution accuracy. The low-rank tangent subspace is defined layer-wise by the singular value decomposition (SVD) of the current network weights, and the resulting update is obtained by solving a well-posed, tractable linear system within this subspace. This design augments the underlying numerical solver with a parameter efficient EDNN component without requiring full fine-tuning of all network weights. We evaluate LR-EDNN on representative PDE problems and compare it against corresponding baselines. Across cases, LR-EDNN achieves comparable accuracy with substantially fewer trainable parameters and reduced computational cost. These results indicate that low-rank constraints on parameter velocities, rather than full-space updates, provide a practical path toward scalable, efficient, and reproducible scientific machine learning for PDEs.

Comment: Model Compression and Efficiency: constrains updates to a low-rank subspace (layer-wise SVD), cutting trainable parameters and compute while preserving accuracy.

Relevance: 8 Novelty: 7


36. Adaptive Overclocking: Dynamic Control of Thinking Path Length via Real-Time Reasoning Signals

ArXiv ID: 2509.17000

Authors: Shuhao Jiang, Songbo Wang, Yang Qiao, Chun Xu, Chaoyang Zheng, Shengyi Zhou, Huanjun Wang, Fangming Li, Cong Zhang, Jiyu Wang

Abstract: Large Reasoning Models (LRMs) often suffer from computational inefficiency due to overthinking, where a fixed reasoning budget fails to match the varying complexity of tasks. To address this issue, we propose Adaptive Overclocking, a method that makes the overclocking hyperparameter $\alpha$ dynamic and context-aware. Our method adjusts reasoning speed in real time through two complementary signals: (1) token-level model uncertainty for fine-grained step-wise control, and (2) input complexity estimation for informed initialization. We implement this approach with three strategies: Uncertainty-Aware Alpha Scheduling (UA-$\alpha$S), Complexity-Guided Alpha Initialization (CG-$\alpha$I), and a Hybrid Adaptive Control (HAC) that combines both. Experiments on GSM8K, MATH, and SVAMP show that HAC achieves superior accuracy-latency trade-offs, reducing unnecessary computation on simple problems while allocating more resources to challenging ones. By mitigating overthinking, Adaptive Overclocking enhances both efficiency and overall reasoning performance.

Comment: Conditional/Dynamic Networks: real-time, uncertainty- and complexity-driven adaptive compute to control reasoning path length for efficiency.

Relevance: 8 Novelty: 7


37. Stabilizing Information Flow Entropy: Regularization for Safe and Interpretable Autonomous Driving Perception

ArXiv ID: 2509.16277

Authors: Haobo Yang, Shiyan Zhang, Zhuoyi Yang, Jilong Guo, Jun Yang, Xinyu Zhang

Abstract: Deep perception networks in autonomous driving traditionally rely on data-intensive training regimes and post-hoc anomaly detection, often disregarding fundamental information-theoretic constraints governing stable information processing. We reconceptualize deep neural encoders as hierarchical communication chains that incrementally compress raw sensory inputs into task-relevant latent features. Within this framework, we establish two theoretically justified design principles for robust perception: (D1) smooth variation of mutual information between consecutive layers, and (D2) monotonic decay of latent entropy with network depth. Our analysis shows that, under realistic architectural assumptions, particularly blocks comprising repeated layers of similar capacity, enforcing smooth information flow (D1) naturally encourages entropy decay (D2), thus ensuring stable compression. Guided by these insights, we propose Eloss, a novel entropy-based regularizer designed as a lightweight, plug-and-play training objective. Rather than marginal accuracy improvements, this approach represents a conceptual shift: it unifies information-theoretic stability with standard perception tasks, enabling explicit, principled detection of anomalous sensor inputs through entropy deviations. Experimental validation on large-scale 3D object detection benchmarks (KITTI and nuScenes) demonstrates that incorporating Eloss consistently achieves competitive or improved accuracy while dramatically enhancing sensitivity to anomalies, amplifying distribution-shift signals by up to two orders of magnitude. This stable information-compression perspective not only improves interpretability but also establishes a solid theoretical foundation for safer, more robust autonomous driving perception systems.

Comment: Representation Learning/Training Dynamics: entropy-based regularizer enforcing smooth mutual information flow and monotonic entropy decay across layers.

Relevance: 8 Novelty: 7


38. PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality

ArXiv ID: 2509.16598

Authors: Byeongho Yu, Changhun Lee, Jungyu Jin, Eunhyeok Park

Abstract: To mitigate the hallucination problem in large language models, DoLa exploits early exit logits from the same model as a contrastive prior. However, we found that these early exit logits tend to be flat, low in magnitude, and fail to reflect meaningful contrasts. To address this, we propose PruneCD, a novel contrastive decoding method that constructs the amateur model via layer pruning rather than early exit. This design leads to more informative and well-aligned logits, enabling more effective contrastive decoding. Through qualitative and quantitative analyses, we demonstrate that PruneCD consistently improves factuality with minimal inference overhead, offering a robust and practical approach to mitigating hallucinations in LLMs.

Comment: Model Compression/Efficiency: uses layer pruning to construct an auxiliary model for contrastive decoding, improving factuality with minimal overhead.

Relevance: 8 Novelty: 7


39. Probabilistic Token Alignment for Large Language Model Fusion

ArXiv ID: 2509.17276

Authors: Runjia Zeng, James Chenhao Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu

Abstract: Training large language models (LLMs) from scratch can yield models with unique functionalities and strengths, but it is costly and often leads to redundant capabilities. A more cost-effective alternative is to fuse existing pre-trained LLMs with different architectures into a more powerful model. However, a key challenge in existing model fusion is their dependence on manually predefined vocabulary alignment, which may not generalize well across diverse contexts, leading to performance degradation in several evaluation. To solve this, we draw inspiration from distribution learning and propose the probabilistic token alignment method as a general and soft mapping for alignment, named as PTA-LLM. Our approach innovatively reformulates token alignment into a classic mathematical problem: optimal transport, seamlessly leveraging distribution-aware learning to facilitate more coherent model fusion. Apart from its inherent generality, PTA-LLM exhibits interpretability from a distributional perspective, offering insights into the essence of the token alignment. Empirical results demonstrate that probabilistic token alignment enhances the target model's performance across multiple capabilities. Our code is avaliable at https://runjia.tech/neurips_pta-llm/.

Comment: Model Architecture/Systems: probabilistic token alignment via optimal transport for fusing heterogeneous LLMs; Efficiency: reduces manual alignment and retraining costs.

Relevance: 8 Novelty: 7


40. Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels

ArXiv ID: 2509.16596

Authors: Junjie Ye, Yuming Yang, Yang Nan, Shuo Li, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan

Abstract: Large language models (LLMs) acquire substantial world knowledge during pre-training, which is further shaped by post-training techniques such as supervised fine-tuning (SFT). However, the impact of SFT on a model's knowledge remains underexplored, limiting our ability to control knowledge change behavior in fine-tuned models. To address this gap, we evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families. Surprisingly, models fine-tuned on 1,920 samples perform up to 14% worse than those fine-tuned on only 240 samples. Furthermore, varying the level of knowledge mastery in the fine-tuning data leads to performance fluctuations of over 12%. To investigate these effects, we analyze model behavior at both the token and parameter levels. Our analysis reveals that up to 90% of parameter updates during SFT do not contribute to knowledge enhancement. Restoring these updates can improve performance on the CBQA task, depending on the characteristics of the fine-tuning data. These insights offer practical guidance for developing fine-tuning strategies that more effectively strengthen model knowledge.

Comment: Representation Learning/Training Dynamics: token- and parameter-level analysis of SFT’s impact on model knowledge; identifies redundant parameter updates.

Relevance: 8 Novelty: 7


41. Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe

ArXiv ID: 2509.16411

Authors: Chong You, Rajesh Jayaram, Ananda Theertha Suresh, Robin Nittka, Felix Yu, Sanjiv Kumar

Abstract: Dual encoder (DE) models, where a pair of matching query and document are embedded into similar vector representations, are widely used in information retrieval due to their simplicity and scalability. However, the Euclidean geometry of the embedding space limits the expressive power of DEs, which may compromise their quality. This paper investigates such limitations in the context of hierarchical retrieval (HR), where the document set has a hierarchical structure and the matching documents for a query are all of its ancestors. We first prove that DEs are feasible for HR as long as the embedding dimension is linear in the depth of the hierarchy and logarithmic in the number of documents. Then we study the problem of learning such embeddings in a standard retrieval setup where DEs are trained on samples of matching query and document pairs. Our experiments reveal a lost-in-the-long-distance phenomenon, where retrieval accuracy degrades for documents further away in the hierarchy. To address this, we introduce a pretrain-finetune recipe that significantly improves long-distance retrieval without sacrificing performance on closer documents. We experiment on a realistic hierarchy from WordNet for retrieving documents at various levels of abstraction, and show that pretrain-finetune boosts the recall on long-distance pairs from 19% to 76%. Finally, we demonstrate that our method improves retrieval of relevant products on a shopping queries dataset.

Comment: Matches Representation Learning: analyzes embedding geometry limits of dual encoders in hierarchical retrieval and proposes a pretrain-finetune training recipe.

Relevance: 8 Novelty: 7


42. A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

ArXiv ID: 2509.16499

Authors: Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu

Abstract: The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse -- a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.

Comment: Matches Representation Learning: analyzes training dynamics (generalization→memorization) and proposes entropy-based data selection to mitigate collapse.

Relevance: 8 Novelty: 7


43. GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2

ArXiv ID: 2509.16248

Authors: Savini Kashmira, Jayanaka Dantanarayana, Thamirawaran Sathiyalogeswaran, Yichao Yuan, Nishil Talati, Krisztian Flautner, Lingjia Tang, Jason Mars

Abstract: This paper presents GraphMend, a high-level compiler that eliminates FX graph breaks in PyTorch 2 programs. Although PyTorch 2 introduced TorchDynamo and TorchInductor to enable just-in-time graph compilation, unresolved dynamic control flow and unsupported Python constructs often fragment models into multiple FX graphs. These fragments force frequent fallbacks to eager mode, incur costly CPU-to-GPU synchronizations, and reduce optimization opportunities. GraphMend addresses this limitation by analyzing and transforming source code before execution. Built on the Jac compilation framework, GraphMend introduces two code transformations that remove graph breaks due to dynamic control flow and Python I/O functions. This design allows PyTorch's compilation pipeline to capture larger, uninterrupted FX graphs without requiring manual refactoring by developers. Evaluation across eight Hugging Face models shows that GraphMend removes all fixable graph breaks due to dynamic control flow and Python I/O functions, driving the break count to 0 in 6 models and reducing it from 5 to 2 in another model. On NVIDIA RTX 3090 and A40 GPUs, GraphMend achieves up to 75% latency reductions and up to 8% higher end-to-end throughput. These results demonstrate that high-level code transformation is an effective complement to PyTorch's dynamic JIT compilation pipeline, substantially improving both usability and performance.

Comment: Matches High Performance Computing: compiler-level transformations eliminate FX graph breaks, enabling larger JIT graphs and reducing CPU–GPU syncs.

Relevance: 8 Novelty: 7


44. Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models

ArXiv ID: 2509.17588

Authors: Jinyeong Kim, Seil Kang, Jiwoo Park, Junhyeok Kim, Seong Jae Hwang

Abstract: Large Vision-Language Models (LVLMs) answer visual questions by transferring information from images to text through a series of attention heads. While this image-to-text information flow is central to visual question answering, its underlying mechanism remains difficult to interpret due to the simultaneous operation of numerous attention heads. To address this challenge, we propose head attribution, a technique inspired by component attribution methods, to identify consistent patterns among attention heads that play a key role in information transfer. Using head attribution, we investigate how LVLMs rely on specific attention heads to identify and answer questions about the main object in an image. Our analysis reveals that a distinct subset of attention heads facilitates the image-to-text information flow. Remarkably, we find that the selection of these heads is governed by the semantic content of the input image rather than its visual appearance. We further examine the flow of information at the token level and discover that (1) text information first propagates to role-related tokens and the final token before receiving image information, and (2) image information is embedded in both object-related and background tokens. Our work provides evidence that image-to-text information flow follows a structured process, and that analysis at the attention-head level offers a promising direction toward understanding the mechanisms of LVLMs.

Comment: Matches Representation Learning: attention-head attribution to elucidate image-to-text information flow in LVLMs.

Relevance: 8 Novelty: 7


45. Causal Representation Learning from Multimodal Clinical Records under Non-Random Modality Missingness

ArXiv ID: 2509.17228

Authors: Zihan Liang, Ziwen Pan, Ruoxuan Xiong

Abstract: Clinical notes contain rich patient information, such as diagnoses or medications, making them valuable for patient representation learning. Recent advances in large language models have further improved the ability to extract meaningful representations from clinical texts. However, clinical notes are often missing. For example, in our analysis of the MIMIC-IV dataset, 24.5% of patients have no available discharge summaries. In such cases, representations can be learned from other modalities such as structured data, chest X-rays, or radiology reports. Yet the availability of these modalities is influenced by clinical decision-making and varies across patients, resulting in modality missing-not-at-random (MMNAR) patterns. We propose a causal representation learning framework that leverages observed data and informative missingness in multimodal clinical records. It consists of: (1) an MMNAR-aware modality fusion component that integrates structured data, imaging, and text while conditioning on missingness patterns to capture patient health and clinician-driven assignment; (2) a modality reconstruction component with contrastive learning to ensure semantic sufficiency in representation learning; and (3) a multitask outcome prediction model with a rectifier that corrects for residual bias from specific modality observation patterns. Comprehensive evaluations across MIMIC-IV and eICU show consistent gains over the strongest baselines, achieving up to 13.8% AUC improvement for hospital readmission and 13.1% for ICU admission.

Comment: Representation Learning: MMNAR-aware multimodal fusion with causal modeling and contrastive reconstruction to learn robust patient representations.

Relevance: 7 Novelty: 7


46. Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization

ArXiv ID: 2509.17405

Authors: Manish Acharya, David Hyde

Abstract: The sliced Wasserstein distance (SW) reduces optimal transport on $\mathbb{R}^d$ to a sum of one-dimensional projections, and thanks to this efficiency, it is widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper presents an alternate, novel approach: learning directions with Bayesian optimization (BO), particularly in settings where SW appears inside an optimization loop (e.g., gradient flows). We introduce a family of drop-in selectors for projection directions: BOSW, a one-shot BO scheme on the unit sphere; RBOSW, a periodic-refresh variant; ABOSW, an adaptive hybrid that seeds from competitive QSW sets and performs a few lightweight BO refinements; and ARBOSW, a restarted hybrid that periodically relearns directions during optimization. Our BO approaches can be composed with QSW and its variants (demonstrated by ABOSW/ARBOSW) and require no changes to downstream losses or gradients. We provide numerical experiments where our methods achieve state-of-the-art performance, and on the experimental suite of the original QSW paper, we find that ABOSW and ARBOSW can achieve convergence comparable to the best QSW variants with modest runtime overhead.

Comment: Algorithmic efficiency: adaptive Bayesian optimization to learn projection directions for sliced Wasserstein, improving computation in optimization loops.

Relevance: 7 Novelty: 7


47. Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

ArXiv ID: 2509.17552

Authors: Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan

Abstract: The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.

Comment: Representation Learning: training-free integration of non-text foundation model representations into LLMs via in-context representation learning; Efficiency: avoids fine-tuning.

Relevance: 7 Novelty: 7


48. Unrolled Graph Neural Networks for Constrained Optimization

ArXiv ID: 2509.17156

Authors: Samar Hadou, Alejandro Ribeiro

Abstract: In this paper, we unroll the dynamics of the dual ascent (DA) algorithm in two coupled graph neural networks (GNNs) to solve constrained optimization problems. The two networks interact with each other at the layer level to find a saddle point of the Lagrangian. The primal GNN finds a stationary point for a given dual multiplier, while the dual network iteratively refines its estimates to reach an optimal solution. We force the primal and dual networks to mirror the dynamics of the DA algorithm by imposing descent and ascent constraints. We propose a joint training scheme that alternates between updating the primal and dual networks. Our numerical experiments demonstrate that our approach yields near-optimal near-feasible solutions and generalizes well to out-of-distribution (OOD) problems.

Comment: Matches Model Architecture: unrolls dual ascent into coupled primal/dual GNNs with training constraints mirroring the optimization dynamics.

Relevance: 7 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.