Personalized Daily ArXiv Papers 2025-11-05

[gpt-5]	Prompt	Completion	Total
Token	80661	65226	145887
Cost	$0.1	$0.65	$0.75

Total arXiv papers: 829

Total scanned papers: 476

Total relevant papers: 56

Table of contents with paper titles:

On the Emergence of Induction Heads for In-Context Learning Authors: Tiberiu Musat, Tiago Pimentel, Lorenzo Noci, Alessandro Stolfo, Mrinmaya Sachan, Thomas Hofmann
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants Authors: Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Ang\'elica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali
LongCat-Flash-Omni Technical Report Authors: Meituan LongCat Team, Bairui Wang, Bayan, Bin Xiao, Bo Zhang, Bolin Rong, Borun Chen, Chang Wan, Chao Zhang, Chen Huang, Chen Chen, Chen Chen, Chengxu Yang, Chengzuo Yang, Cong Han, Dandan Peng, Delian Ruan, Detai Xin, Disong Wang, Dongchao Yang, Fanfan Liu, Fengjiao Chen, Fengyu Yang, Gan Dong, Gang Huang, Gang Xu, Guanglu Wan, Guoqiang Tan, Guoqiao Yu, Haibo Qiu, Hao Lu, Hongbo Liu, Hongyu Xiang, Jiaheng Wu, Jian Yang, Jiaxing Liu, Jing Huang, Jingang Wang, Jinrui Ding, Juchao Jiang, Jun Kuang, Jun Wang, Junhui Mei, Ke Ding, Kefeng Zhang, Lei Chen, Liang Shi, Limeng Qiao, Liming Zheng, Lin Ma, Liuyang Guo, Liya Ma, Luying Sun, Man Gao, Mengshen Zhu, Miao Cao, Minliang Lin, Nuo Xu, Peng Shi, Qi Zhang, Qian Fang, Qian Wang, Qian Yang, Quanxiu Wang, Rongxiang Weng, Rongxin Guo, Ruoxuan Liang, Senbin Yang, Shanbo Xu, Shanglin Lei, Shengze Ye, Shimin Chen, Shuaiqi Chen, Shujie Hu, Shuo Li, Siqi Yang, Siyu Xu, Siyu Ren, Song Li, Songxiang Liu, Tianhao Bai, Tianye Dai, Wei Hong, Wei Wang, Weixiao Zhao, Wengang Cao, Wenlong Zhu, Wenlong He, Xi Su, Xi Nan, Xiaohan Zhao, Xiaohao Wang, Xiaoyu Zhao, Xiaoyu Wang, Xiaoyu Li, Xin Pan, Xin Chen, Xiusong Sun, Xu Xiang, Xudong Xing, Xuezhi Cao, Xunliang Cai, Yang Yang, Yanli Tan, Yao Yao, Yerui Sun, Yi Chen, Yifan Lu, Yin Gong, Yining Zhang, Yitian Chen, Yiyang Gan, Yuchen Tang, Yuchen Xie, Yueqian Wang, Yuewen Zheng, Yufei Zhang, Yufeng Zhong, Yulei Qian, Yuqi Peng, Yuwei Jiang, Zeyang Hu, Zheng Zhang, Zhengkun Tian, Zhiqing Hong, Zhixiong Zeng, Zhuqi Mi, Ziran Li, Ziwen Wang, Ziyi Zhao, Ziyuan Zhuang, Zizhe Zhao
Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining Authors: Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error Authors: Fengjuan Wang, Zhiyi Su, Xingzhu Hu, Cheng Wang, Mou Sun
Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits Authors: Dowon Kim, MinJae Lee, Janghyeon Kim, HyuckSung Kwon, Hyeonggyu Jeong, Sang-Soo Park, Minyong Yoon, Si-Dong Roh, Yongsuk Kwon, Jinin So, Jungwook Choi
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials Authors: Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li
Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? Authors: Weijie Su
MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling Authors: Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan
The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks Authors: Khoat Than
Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving Authors: Yuchen Zhang, Hanyue Du, Chun Cao, Jingwei Xu
ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks Authors: Chengzhang Yu, Zening Lu, Chenyang Zheng, Chiyue Wang, Yiming Zhang, Zhanpeng Jin
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs Authors: Octavian Alexandru Trifan, Karthik Sangaiah, Muhammad Awad, Muhammad Osama, Sumanth Gudaparthi, Alexandru Nicolau, Alexander Veidenbaum, Ganesh Dasika
From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models Authors: Xingqi Cui, Chieh-Jan Mike Liang, Jiarong Xing, Haoran Qiu
Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering Authors: Hossein Abdi, Mingfei Sun, Wei Pan
A new class of Markov random fields enabling lightweight sampling Authors: Jean-Baptiste Courbot, Hugo Gangloff, Bruno Colicchio
Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes Authors: Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Bulk-boundary decomposition of neural networks Authors: Donghee Lee, Hye-Sung Lee, Jaeok Yi
Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks Authors: Xiumei Deng, Zehui Xiong, Binbin Chen, Dong In Kim, Merouane Debbah, H. Vincent Poor
Language Modeling With Factorization Memory Authors: Lee Xiong, Maksim Tkachenko, Johanes Effendi, Ting Cai
Calibration Across Layers: Understanding Calibration Evolution in LLMs Authors: Abhinav Joshi, Areeb Ahmad, Ashutosh Modi
Efficiently Training A Flat Neural Network Before It has been Quantizated Authors: Peng Xia, Junbiao Pang, Tianyang Cai
From Uniform to Adaptive: General Skip-Block Mechanisms for Efficient PDE Neural Operators Authors: Lei Liu, Zhongyi Yu, Hong Wang, Huanshuo Dong, Haiyang Xin, Hongwei Zhao, Bin Li
Apriel-H1: Towards Efficient Enterprise Reasoning Models Authors: Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra, Sebastien Paquet, Srinivas Sunkara, Val\'erie B\'ecaert, Sathwik Tejaswi Madhusudhan, Torsten Scholak
LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS Authors: Stefan F. Schouten, Peter Bloem
Calibrating and Rotating: A Unified Framework for Weight Conditioning in PEFT Authors: Da Chang, Peng Xue, Yu Li, Yongxiang Liu, Pengxiang Xu, Shixun Zhang
Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects Authors: Mansi Choudhary, Karthik Sangaiah, Sonali Singh, Muhammad Osama, Lisa Wu Wills, Ganesh Dasika
Redundancy Maximization as a Principle of Associative Memory Learning Authors: Mark Bl\"umel, Andreas C. Schneider, Valentin Neuhaus, David A. Ehrlich, Marcel Graetz, Michael Wibral, Abdullah Makkeh, Viola Priesemann
Efficient Vector Symbolic Architectures from Histogram Recovery Authors: Zirui Deng, Netanel Raviv
Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving Authors: Chengying Huan, Ziheng Meng, Yongchao Liu, Zhengyi Yang, Yun Zhu, Yue Yun, Shipeng Li, Rong Gu, Xiabao Wu, Haitao Zhang, Chuntao Hong, Shaonan Ma, Guihai Chen, Chen Tian
A Proof of Learning Rate Transfer under $\mu$P Authors: Soufiane Hayou
Accelerated Frank-Wolfe Algorithms: Complementarity Conditions and Sparsity Authors: Dan Garber
Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering Authors: Eric Bigelow, Daniel Wurgaft, YingQiao Wang, Noah Goodman, Tomer Ullman, Hidenori Tanaka, Ekdeep Singh Lubana
Real-time Continual Learning on Intel Loihi 2 Authors: Elvin Hajizada, Danielle Rager, Timothy Shea, Leobardo Campos-Macias, Andreas Wild, Eyke H\"ullermeier, Yulia Sandamirskaya, Mike Davies
Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold Authors: Xinghan Li, Haodong Wen, Kaifeng Lyu
DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching Authors: Zicheng Xu, Guanchu Wang, Yu-Neng Chuang, Guangyao Zheng, Alexander S. Szalay, Zirui Liu, Vladimir Braverman
More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks Authors: Swapnil Bhosale, Cosmin Frateanu, Camilla Clark, Arnoldas Jasonas, Chris Mitchell, Xiatian Zhu, Vamsi Krishna Ithapu, Giacomo Ferroni, Cagdas Bilen, Sanjeel Parekh
Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior Authors: Daniel Aarao Reis Arturi, Eric Zhang, Andrew Ansah, Kevin Zhu, Ashwinee Panda, Aishwarya Balwani
Natural Building Blocks for Structured World Models: Theory, Evidence, and Scaling Authors: Lancelot Da Costa, Sanjeev Namjoshi, Mohammed Abbas Ansari, Bernhard Sch\"olkopf
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization Authors: Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding
EchoLSTM: A Self-Reflective Recurrent Network for Stabilizing Long-Range Memory Authors: Prasanth K K, Shubham Sharma
Energy-Efficient Deep Learning Without Backpropagation: A Rigorous Evaluation of Forward-Only Algorithms Authors: Przemys{\l}aw Spyra, Witold Dzwinel
In Good GRACEs: Principled Teacher Selection for Knowledge Distillation Authors: Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham Kakade, Surbhi Goel
Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs Authors: Bo Bai
When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding Authors: Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang
ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models Authors: Lejs Deen Behric, Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil
In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization Authors: Cooper Simpson, Stephen Becker, Alireza Doostan
A Non-Adversarial Approach to Idempotent Generative Modelling Authors: Mohammed Al-Jaff, Giovanni Luca Marchetti, Michael C Welle, Jens Lundell, Mats G. Gustafsson, Gustav Eje Henter, Hossein Azizpour, Danica Kragic
Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning Authors: Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia, Stefano Soatto
OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning Authors: Kevin Valencia, Thilina Balasooriya, Xihaier Luo, Shinjae Yoo, David Keetae Park
How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks Authors: Wanda Hou, Leon Zhou, Hong-Ye Hu, Yi-Zhuang You, Xiao-Liang Qi
Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph Authors: Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang
DoFlow: Causal Generative Flows for Interventional and Counterfactual Time-Series Prediction Authors: Dongze Wu, Feng Qiu, Yao Xie
Mutual Information guided Visual Contrastive Learning Authors: Hanyang Chen, Yanchao Yang
The stability of shallow neural networks on spheres: A sharp spectral analysis Authors: Xinliang Liu, Tong Mao, Jinchao Xu
Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement Authors: Sekh Mainul Islam, Pepa Atanasova, Isabelle Augenstein

1. On the Emergence of Induction Heads for In-Context Learning

ArXiv ID: 2511.01033

Authors: Tiberiu Musat, Tiago Pimentel, Lorenzo Noci, Alessandro Stolfo, Mrinmaya Sachan, Thomas Hofmann

Abstract: Transformers have become the dominant architecture for natural language processing. Part of their success is owed to a remarkable capability known as in-context learning (ICL): they can acquire and apply novel associations solely from their input context, without any updates to their weights. In this work, we study the emergence of induction heads, a previously identified mechanism in two-layer transformers that is particularly important for in-context learning. We uncover a relatively simple and interpretable structure of the weight matrices implementing the induction head. We theoretically explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture. We give a formal proof that the training dynamics remain constrained to a 19-dimensional subspace of the parameter space. Empirically, we validate this constraint while observing that only 3 dimensions account for the emergence of an induction head. By further studying the training dynamics inside this 3-dimensional subspace, we find that the time until the emergence of an induction head follows a tight asymptotic bound that is quadratic in the input context length.

Comment: Representation Learning/Training Dynamics: theoretical and empirical analysis of induction-head emergence with provable low-dimensional subspace constraints and scaling of emergence time.

Relevance: 10 Novelty: 8

2. Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

ArXiv ID: 2511.02043

Authors: Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Ang\'elica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali

Abstract: Bad charactors when submitting to arXiv: Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.

Comment: High Performance Computing: compiler-native PyTorch extensions that automatically generate fused, FlashAttention-style kernels for diverse attention variants with tiling and fusion.

Relevance: 10 Novelty: 8

3. LongCat-Flash-Omni Technical Report

ArXiv ID: 2511.00279

Authors: Meituan LongCat Team, Bairui Wang, Bayan, Bin Xiao, Bo Zhang, Bolin Rong, Borun Chen, Chang Wan, Chao Zhang, Chen Huang, Chen Chen, Chen Chen, Chengxu Yang, Chengzuo Yang, Cong Han, Dandan Peng, Delian Ruan, Detai Xin, Disong Wang, Dongchao Yang, Fanfan Liu, Fengjiao Chen, Fengyu Yang, Gan Dong, Gang Huang, Gang Xu, Guanglu Wan, Guoqiang Tan, Guoqiao Yu, Haibo Qiu, Hao Lu, Hongbo Liu, Hongyu Xiang, Jiaheng Wu, Jian Yang, Jiaxing Liu, Jing Huang, Jingang Wang, Jinrui Ding, Juchao Jiang, Jun Kuang, Jun Wang, Junhui Mei, Ke Ding, Kefeng Zhang, Lei Chen, Liang Shi, Limeng Qiao, Liming Zheng, Lin Ma, Liuyang Guo, Liya Ma, Luying Sun, Man Gao, Mengshen Zhu, Miao Cao, Minliang Lin, Nuo Xu, Peng Shi, Qi Zhang, Qian Fang, Qian Wang, Qian Yang, Quanxiu Wang, Rongxiang Weng, Rongxin Guo, Ruoxuan Liang, Senbin Yang, Shanbo Xu, Shanglin Lei, Shengze Ye, Shimin Chen, Shuaiqi Chen, Shujie Hu, Shuo Li, Siqi Yang, Siyu Xu, Siyu Ren, Song Li, Songxiang Liu, Tianhao Bai, Tianye Dai, Wei Hong, Wei Wang, Weixiao Zhao, Wengang Cao, Wenlong Zhu, Wenlong He, Xi Su, Xi Nan, Xiaohan Zhao, Xiaohao Wang, Xiaoyu Zhao, Xiaoyu Wang, Xiaoyu Li, Xin Pan, Xin Chen, Xiusong Sun, Xu Xiang, Xudong Xing, Xuezhi Cao, Xunliang Cai, Yang Yang, Yanli Tan, Yao Yao, Yerui Sun, Yi Chen, Yifan Lu, Yin Gong, Yining Zhang, Yitian Chen, Yiyang Gan, Yuchen Tang, Yuchen Xie, Yueqian Wang, Yuewen Zheng, Yufei Zhang, Yufeng Zhong, Yulei Qian, Yuqi Peng, Yuwei Jiang, Zeyang Hu, Zheng Zhang, Zhengkun Tian, Zhiqing Hong, Zhixiong Zeng, Zhuqi Mi, Ziran Li, Ziwen Wang, Ziyi Zhao, Ziyuan Zhuang, Zizhe Zhao

Abstract: We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This innovative approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.

Comment: Model Architecture (MoE) + HPC: Shortcut-connected MoE with zero-computation experts and modality-decoupled parallelism for efficient large-scale multimodal training.

Relevance: 10 Novelty: 8

4. Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining

ArXiv ID: 2511.02237

Authors: Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun

Abstract: An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures where the feed-forward layer is replaced by a pool of experts and each token only activates a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even for moderate batch sizes because the average expert load grows more slowly than in an equivalent dense feedforward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing token-to-expert mapping to lower this number (and thus, the decode latency) while preserving a comparable quality. Our best results use a batch-aware routing that works by having tokens piggyback experts that have already been loaded into memory due to being crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach achieves latency reductions of $39\%$ and $15\%$ in the MoE layer decode latency, respectively.

Comment: MoE Efficiency: batch-aware opportunistic expert activation to reduce activated experts and decode latency without retraining.

Relevance: 10 Novelty: 8

5. FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

ArXiv ID: 2511.02302

Authors: Fengjuan Wang, Zhiyi Su, Xingzhu Hu, Cheng Wang, Mou Sun

Abstract: Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and eliminate explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21\% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and na\"ive FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon.

Comment: High Performance Computing + Low-Precision Training: FP8-centric, quantization-consistent dataflow for MoE with fused operators, eliminating cast overhead and avoiding double quantization error.

Relevance: 10 Novelty: 8

6. Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits

ArXiv ID: 2511.00321

Authors: Dowon Kim, MinJae Lee, Janghyeon Kim, HyuckSung Kwon, Hyeonggyu Jeong, Sang-Soo Park, Minyong Yoon, Si-Dong Roh, Yongsuk Kwon, Jinin So, Jungwook Choi

Abstract: The expansion of context windows in large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. While Compute Express Link (CXL) enables non-eviction frameworks that offload the full KV-cache to scalable external memory, these frameworks still suffer from costly data transfers when recalling non-resident KV tokens to limited GPU memory as context lengths increase. This work proposes scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference, a CXL-enabled KV-cache management system that coordinates memory and computation beyond GPU limits. Our design offloads token page selection to a PNM accelerator within CXL memory, eliminating costly recalls and enabling larger GPU batch sizes. We further introduce a hybrid parallelization strategy and a steady-token selection mechanism to enhance compute efficiency and scalability. Implemented atop a state-of-the-art CXL-PNM system, our solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts. Our PNM-only offloading scheme (PNM-KV) and GPU-PNM hybrid with steady-token execution (PnG-KV) achieve up to 21.9x throughput improvement, up to 60x lower energy per token, and up to 7.3x better total cost efficiency than the baseline, demonstrating that CXL-enabled multi-PNM architectures can serve as a scalable backbone for future long-context LLM inference.

Comment: High Performance Computing/Memory Optimization: CXL-enabled processing-near-memory KV-cache management and hybrid execution for 1M-token LLM inference.

Relevance: 10 Novelty: 8

7. Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials

ArXiv ID: 2511.00833

Authors: Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li

Abstract: Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N N C) to O(N n C) with n << N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.

Comment: Model Architecture and Efficiency: introduces Visual-Contrast Attention that replaces MHSA, reducing complexity from quadratic to O(N n C) with architectural modifications enabling sparse, contrastive interactions.

Relevance: 9 Novelty: 8

8. Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal?

ArXiv ID: 2511.00674

Authors: Weijie Su

Abstract: In this paper, we introduce a model for analyzing deep learning optimization over a single iteration by leveraging the matrix structure of the weights. We derive the model by assuming isotropy of curvature, including the second-order Hessian and higher-order terms, of the loss function across all perturbation directions; hence, we call it the isotropic curvature model. This model is a convex optimization program amenable to analysis, which allows us to understand how an update on the weights in the form of a matrix relates to the change in the total loss function. As an application, we use the isotropic curvature model to analyze the recently introduced Muon optimizer and other matrix-gradient methods for training language models. First, we show that under a general growth condition on the curvature, the optimal update matrix is obtained by making the spectrum of the original gradient matrix more homogeneous -- that is, making its singular values closer in ratio -- which in particular improves the conditioning of the update matrix. Next, we show that the orthogonalized gradient becomes optimal for the isotropic curvature model when the curvature exhibits a phase transition in growth. Taken together, these results suggest that the gradient orthogonalization employed in Muon and other related methods is directionally correct but may not be strictly optimal. Finally, we discuss future research on how to leverage the isotropic curvature model for designing new optimization methods for training deep learning and language models.

Comment: Training Dynamics/Optimization Theory: isotropic curvature model to analyze single-iteration updates and explain when gradient orthogonalization is optimal, informing optimizer design.

Relevance: 9 Novelty: 8

9. MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling

ArXiv ID: 2511.00056

Authors: Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan

Abstract: The substantial memory demands of pre-training and fine-tuning large language models (LLMs) require memory-efficient optimization algorithms. One promising approach is layer-wise optimization, which treats each transformer block as a single layer and optimizes it sequentially, while freezing the other layers to save optimizer states and activations. Although effective, these methods ignore the varying importance of the modules within each layer, leading to suboptimal performance. Moreover, layer-wise sampling provides only limited memory savings, as at least one full layer must remain active during optimization. To overcome these limitations, we propose Module-wise Importance SAmpling (MISA), a novel method that divides each layer into smaller modules and assigns importance scores to each module. MISA uses a weighted random sampling mechanism to activate modules, provably reducing gradient variance compared to layer-wise sampling. Additionally, we establish an (\mathcal{O}(1/\sqrt{K})) convergence rate under non-convex and stochastic conditions, where $K$ is the total number of block updates, and provide a detailed memory analysis showcasing MISA's superiority over existing baseline methods. Experiments on diverse learning tasks validate the effectiveness of MISA. Source code is available at https://github.com/pkumelon/MISA.

Comment: High-Efficiency Training: memory-efficient LLM optimization via module-wise importance sampling with variance reduction, convergence guarantees, and favorable memory analysis.

Relevance: 9 Novelty: 8

10. The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks

ArXiv ID: 2511.00958

Authors: Khoat Than

Abstract: Normalization methods are fundamental components of modern deep neural networks (DNNs). Empirically, they are known to stabilize optimization dynamics and improve generalization. However, the underlying theoretical mechanism by which normalization contributes to both optimization and generalization remains largely unexplained, especially when using many normalization layers in a DNN architecture. In this work, we develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or inputs, implying excessive functional capacity and potential overfitting. Such bad DNNs are uncountably many. In contrast, the insertion of normalization layers provably can reduce the Lipschitz constant at an exponential rate in the number of normalization operations. This exponential reduction yields two fundamental consequences: (1) it smooths the loss landscape at an exponential rate, facilitating faster and more stable optimization; and (2) it constrains the effective capacity of the network, thereby enhancing generalization guarantees on unseen data. Our results thus offer a principled explanation for the empirical success of normalization methods in deep learning.

Comment: Representation Learning/Theory: shows normalization yields exponential Lipschitz reduction, explaining smoother optimization and better generalization (capacity control).

Relevance: 9 Novelty: 8

11. Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving

ArXiv ID: 2511.00101

Authors: Yuchen Zhang, Hanyue Du, Chun Cao, Jingwei Xu

Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning (PEFT) technique for adapting large language models (LLMs) to downstream tasks. While prior work has explored strategies for integrating LLM training and serving, there still remains a gap in unifying fine-tuning and inference for LoRA-based models. We present Loquetier, a virtualized multi-LoRA framework that seamlessly integrates LoRA fine-tuning and serving within a single runtime. Loquetier introduces two key components: (1) a Virtualized Module that isolates PEFT-based modifications and supports multiple adapters on a shared base model, and (2) an optimized computation flow with a kernel design that merges fine-tuning and inference paths in forward propagation, enabling efficient batching and minimizing kernel invocation overhead. Extensive experiments across three task settings show that Loquetier consistently outperforms existing baselines in both performance and flexibility, achieving up to $3.0\times$ the throughput of the state-of-the-art co-serving system on inference-only tasks and $46.4\times$ higher SLO attainment than PEFT on unified fine-tuning and inference tasks. The implementation of Loquetier is publicly available at https://github.com/NJUDeepEngine/Loquetier.

Comment: Compression/Efficiency + Systems: virtualized multi-LoRA unifying fine-tuning and serving with shared base model and merged forward kernels for high-throughput co-serving.

Relevance: 9 Novelty: 8

12. ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks

ArXiv ID: 2511.01581

Authors: Chengzhang Yu, Zening Lu, Chenyang Zheng, Chiyue Wang, Yiming Zhang, Zhanpeng Jin

Abstract: Large language models suffer from knowledge staleness and lack of interpretability due to implicit knowledge storage across entangled network parameters, preventing targeted updates and reasoning transparency. We propose ExplicitLM, a novel architecture featuring a million-scale external memory bank storing human-readable knowledge as token sequences, enabling direct inspection and modification. We design a differentiable two-stage retrieval mechanism with efficient coarse-grained filtering via product key decomposition (reducing complexity from $\mathcal{O}(N \cdot |I|)$ to $\mathcal{O}(\sqrt{N} \cdot |I|)$) and fine-grained Gumbel-Softmax matching for end-to-end training. Inspired by dual-system cognitive theory, we partition knowledge into frozen explicit facts (20%) and learnable implicit patterns (80%), maintained through Exponential Moving Average updates for stability. ExplicitLM achieves up to 43.67% improvement on knowledge-intensive tasks versus standard Transformers, with 3.62$\times$ gains in low-data regimes (10k samples). Analysis shows strong correlations between memory retrieval and performance, with correct predictions achieving 49% higher hit rates. Unlike RAG systems with frozen retrieval, our jointly optimized architecture demonstrates that interpretable, updatable models can maintain competitive performance while providing unprecedented knowledge transparency.

Comment: Model Architecture: external explicit memory bank with differentiable two-stage retrieval and conditional routing (product-key filtering + Gumbel-Softmax) for interpretable, updatable knowledge.

Relevance: 9 Novelty: 8

13. Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs

ArXiv ID: 2511.02168

Authors: Octavian Alexandru Trifan, Karthik Sangaiah, Muhammad Awad, Muhammad Osama, Sumanth Gudaparthi, Alexandru Nicolau, Alexander Veidenbaum, Ganesh Dasika

Abstract: As large language models (LLMs) continue to scale, their workloads increasingly rely on distributed execution across multiple GPUs. However, the conventional bulk synchronous parallel~(BSP) model used in such settings introduces significant performance inefficiencies. To characterize these bottlenecks, we introduce the ''Three Taxes'' (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. By exploiting libraries like Iris for Triton, we gain access to in-kernel communication primitives that enable the design of novel fine-grained programming patterns, offering greater flexibility and performance than traditional BSP-based approaches. These patterns systematically eliminate the three taxes by creating direct, tile-level producer-consumer pipelines and replacing global barriers with fine-grained dataflow synchronization. Applying this methodology to critical kernels, from the foundational All-Gather + general matrix multiplication operation to the complex Flash Decode algorithm, we observe a 10-20% speedup in end-to-end latency over BSP-based approaches, establishing a more programmable and efficient paradigm for distributed LLM workloads.

Comment: HPC/Systems: in-kernel communication and fine-grained dataflow synchronization to eliminate BSP taxes and improve distributed LLM efficiency.

Relevance: 9 Novelty: 8

14. From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models

ArXiv ID: 2511.02248

Authors: Xingqi Cui, Chieh-Jan Mike Liang, Jiarong Xing, Haoran Qiu

Abstract: Serving large generative models such as LLMs and multi- modal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) with provider goals of efficiency and cost reduction. Existing solutions rely on static provisioning or model-level autoscaling, both of which treat the model as a monolith. This coarse-grained resource management leads to degraded performance or significant resource underutilization due to poor adaptability to dynamic inference traffic that is common online. The root cause of this inefficiency lies in the internal structure of generative models: they are executed as graphs of interconnected operators. Through detailed characterization and systematic analysis, we find that operators are heterogeneous in their compute and memory footprints and exhibit diverse sensitivity to workload and resource factors such as batch size, sequence length, and traffic rate. This heterogeneity suggests that the operator, rather than the entire model, is the right granularity for scaling decisions. We propose an operator-level autoscaling framework, which allocates resources at finer (operator)-granularity, optimizing the scaling, batching, and placement based on individual operator profiles. Evaluated on production-scale traces, our approach preserves SLOs with up to 40% fewer GPUs and 35% less energy, or under fixed resources achieves 1.6x higher throughput with 5% less energy. These results show that the operator, rather than the model, is fundamentally a more effective unit for scaling large generative workloads.

Comment: Systems/HPC: operator-level autoscaling for LLM inference with fine-grained scaling, batching, and placement; significant resource and energy savings.

Relevance: 9 Novelty: 8

15. Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering

ArXiv ID: 2511.01694

Authors: Hossein Abdi, Mingfei Sun, Wei Pan

Abstract: Vision-language pre-trained models, such as CLIP, have established new benchmarks in multimodal data mining. In such models, few-shot fine-tuning is a major challenge to achieve optimal performance on both in-distribution (ID) and out-of-distribution (OOD) datasets, especially when labeled data is scarce. Most existing fine-tuning approaches rely on first-order gradient-based optimizers, which typically suffer from slow convergence, sensitivity to step-size hyperparameters, and poor generalization in OOD settings. In contrast, second-order methods utilize local curvature information of the loss landscape to adjust the update step size. This is particularly beneficial for CLIP models, whose non-convex loss functions often contain sharp critical points. In such cases, natural gradient direction can offer more substantial and efficient per-iteration updates when fine-tuning with limited data. Natural Gradient Descent (NGD) is obtained by preconditioning the standard gradient with the inverse Fisher Information Matrix (FIM), which is computationally expensive for large models. To address this, we propose a Bayesian approximation of NGD using a Kalman filter for CLIP models. Our method combines the benefits of second-order optimization with Bayesian inference, which enhances generalization while providing uncertainty quantification. Extensive experiments conducted on diverse image classification datasets demonstrate that our algorithm consistently achieves superior--or comparable--ID performance and improved OOD robustness compared to state-of-the-art baselines. To the best of our knowledge, this work represents the first successful application of Kalman filtering to fine-tuning CLIP-based models, which enables more robust and efficient learning in vision-language tasks.

Comment: Matches Efficiency/Optimization: Bayesian natural gradient fine-tuning via Kalman filtering approximates NGD for large CLIP models, improving training efficiency and robustness.

Relevance: 9 Novelty: 8

16. A new class of Markov random fields enabling lightweight sampling

ArXiv ID: 2511.02373

Authors: Jean-Baptiste Courbot, Hugo Gangloff, Bruno Colicchio

Abstract: This work addresses the problem of efficient sampling of Markov random fields (MRF). The sampling of Potts or Ising MRF is most often based on Gibbs sampling, and is thus computationally expensive. We consider in this work how to circumvent this bottleneck through a link with Gaussian Markov Random fields. The latter can be sampled in several cost-effective ways, and we introduce a mapping from real-valued GMRF to discrete-valued MRF. The resulting new class of MRF benefits from a few theoretical properties that validate the new model. Numerical results show the drastic performance gain in terms of computational efficiency, as we sample at least 35x faster than Gibbs sampling using at least 37x less energy, all the while exhibiting empirical properties close to classical MRFs.

Comment: Matches Efficiency and Model Architecture: introduces a new class of MRFs via mapping from GMRFs enabling lightweight, much faster sampling than Gibbs.

Relevance: 9 Novelty: 8

17. Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes

ArXiv ID: 2511.02681

Authors: Mohammadsajad Alipour, Mohammad Mohammadi Amiri

Abstract: Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recently, studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.

Comment: Model Compression and Efficiency: selectively sparsified low-rank storage of fine-tuning updates leveraging interleaved singular vector importance (sparsity + low-rank).

Relevance: 9 Novelty: 8

18. Bulk-boundary decomposition of neural networks

ArXiv ID: 2511.02003

Authors: Donghee Lee, Hye-Sung Lee, Jaeok Yi

Abstract: We present the bulk-boundary decomposition as a new framework for understanding the training dynamics of deep neural networks. Starting from the stochastic gradient descent formulation, we show that the Lagrangian can be reorganized into a data-independent bulk term and a data-dependent boundary term. The bulk captures the intrinsic dynamics set by network architecture and activation functions, while the boundary reflects stochastic interactions from training samples at the input and output layers. This decomposition exposes the local and homogeneous structure underlying deep networks. As a natural extension, we develop a field-theoretic formulation of neural dynamics based on this decomposition.

Comment: Training Dynamics/Representation Learning: bulk–boundary decomposition and field-theoretic formulation of SGD dynamics, separating architecture-driven and data-driven effects.

Relevance: 9 Novelty: 8

19. Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks

ArXiv ID: 2511.02647

Authors: Xiumei Deng, Zehui Xiong, Binbin Chen, Dong In Kim, Merouane Debbah, H. Vincent Poor

Abstract: Large language models (LLMs) are proliferating rapidly at the edge, delivering intelligent capabilities across diverse application scenarios. However, their practical deployment in collaborative scenarios confronts fundamental challenges: privacy vulnerabilities, communication overhead, and computational bottlenecks. To address these, we propose Federated Attention (FedAttn), which integrates the federated paradigm into the self-attention mechanism, creating a new distributed LLM inference framework that simultaneously achieves privacy protection, communication efficiency, and computational efficiency. FedAttn enables participants to perform local self-attention over their own token representations while periodically exchanging and aggregating Key-Value (KV) matrices across multiple Transformer blocks, collaboratively generating LLM responses without exposing private prompts. Further, we identify a structural duality between contextual representation refinement in FedAttn and parameter optimization in FL across private data, local computation, and global aggregation. This key insight provides a principled foundation for systematically porting federated optimization techniques to collaborative LLM inference. Building on this framework, we theoretically analyze how local self-attention computation within participants and heterogeneous token relevance among participants shape error propagation dynamics across Transformer blocks. Moreover, we characterize the fundamental trade-off between response quality and communication/computation efficiency, which is governed by the synchronization interval and the number of participants. Experimental results validate our theoretical analysis, and reveal significant optimization opportunities through sparse attention and adaptive KV aggregation, highlighting FedAttn's potential to deliver scalability and efficiency in real-world edge deployments.

Comment: High Performance Computing/Distributed Inference: Federated Attention integrates FL principles into self-attention with KV aggregation and analyzes quality–communication trade-offs.

Relevance: 9 Novelty: 8

20. Language Modeling With Factorization Memory

ArXiv ID: 2511.00315

Authors: Lee Xiong, Maksim Tkachenko, Johanes Effendi, Ting Cai

Abstract: We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks while also demonstrating superior generalization in long-context scenarios. Our model builds upon Mamba-2, enabling Factorization Memory to exploit parallel computations during training while preserving constant computational and memory complexity during inference. To further optimize model efficiency and representational capacity, we develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart. To our knowledge, this represents the first RNN architecture that successfully combines sparse memory activation with competitive performance across both short and long-context settings. This work provides a systematic empirical analysis of Factorization Memory in comparison to Transformer and Mamba-2 architectures.

Comment: Model Architecture: Factorization Memory RNN with sparse memory activation; parallel-trainable and O(1) compute/memory at inference for long contexts.

Relevance: 9 Novelty: 8

21. Calibration Across Layers: Understanding Calibration Evolution in LLMs

ArXiv ID: 2511.00280

Authors: Abhinav Joshi, Areeb Ahmad, Ashutosh Modi

Abstract: Large Language Models (LLMs) have demonstrated inherent calibration capabilities, where predicted probabilities align well with correctness, despite prior findings that deep neural networks are often overconfident. Recent studies have linked this behavior to specific components in the final layer, such as entropy neurons and the unembedding matrix null space. In this work, we provide a complementary perspective by investigating how calibration evolves throughout the network depth. Analyzing multiple open-weight models on the MMLU benchmark, we uncover a distinct confidence correction phase in the upper/later layers, where model confidence is actively recalibrated after decision certainty has been reached. Furthermore, we identify a low-dimensional calibration direction in the residual stream whose perturbation significantly improves calibration metrics (ECE and MCE) without harming accuracy. Our findings suggest that calibration is a distributed phenomenon, shaped throughout the network forward pass, not just in its final projection, providing new insights into how confidence-regulating mechanisms operate within LLMs.

Comment: Representation Learning/Training Dynamics: analyzes calibration evolution across depth and identifies a low-dimensional calibration direction in the residual stream improving ECE/MCE.

Relevance: 9 Novelty: 7

22. Efficiently Training A Flat Neural Network Before It has been Quantizated

ArXiv ID: 2511.01462

Authors: Peng Xia, Junbiao Pang, Tianyang Cai

Abstract: Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the relationship between a well-trained NN and the quantized model, leading to considerable quantization error for PTQ. However, it is unclear how to efficiently train a model-agnostic neural network which is tailored for a predefined precision low-bit model. In this paper, we firstly discover that a flat full precision neural network is crucial for low-bit quantization. To achieve this, we propose a framework that proactively pre-conditions the model by measuring and disentangling the error sources. Specifically, both the Activation Quantization Error (AQE) and the Weight Quantization Error (WQE) are statistically modeled as independent Gaussian noises. We study several noise injection optimization methods to obtain a flat minimum. Experimental results attest to the effectiveness of our approach. These results open novel pathways for obtaining low-bit PTQ models.

Comment: Compression/Efficiency: pre-conditioning for PTQ via noise-injection to reach flat minima, modeling activation/weight quantization errors.

Relevance: 9 Novelty: 7

23. From Uniform to Adaptive: General Skip-Block Mechanisms for Efficient PDE Neural Operators

ArXiv ID: 2511.00032

Authors: Lei Liu, Zhongyi Yu, Hong Wang, Huanshuo Dong, Haiyang Xin, Hongwei Zhao, Bin Li

Abstract: In recent years, Neural Operators(NO) have gradually emerged as a popular approach for solving Partial Differential Equations (PDEs). However, their application to large-scale engineering tasks suffers from significant computational overhead. And the fact that current models impose a uniform computational cost while physical fields exhibit vastly different complexities constitutes a fundamental mismatch, which is the root of this inefficiency. For instance, in turbulence flows, intricate vortex regions require deeper network processing compared to stable flows. To address this, we introduce a framework: Skip-Block Routing (SBR), a general framework designed for Transformer-based neural operators, capable of being integrated into their multi-layer architectures. First, SBR uses a routing mechanism to learn the complexity and ranking of tokens, which is then applied during inference. Then, in later layers, it decides how many tokens are passed forward based on this ranking. This way, the model focuses more processing capacity on the tokens that are more complex. Experiments demonstrate that SBR is a general framework that seamlessly integrates into various neural operators. Our method reduces computational cost by approximately 50% in terms of Floating Point Operations (FLOPs), while still delivering up to 2x faster inference without sacrificing accuracy.

Comment: Conditional/Dynamic Networks: skip-block routing that ranks tokens and prunes computation in later layers of Transformer-based neural operators to cut FLOPs.

Relevance: 9 Novelty: 7

24. Apriel-H1: Towards Efficient Enterprise Reasoning Models

ArXiv ID: 2511.02651

Authors: Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra, Sebastien Paquet, Srinivas Sunkara, Val\'erie B\'ecaert, Sathwik Tejaswi Madhusudhan, Torsten Scholak

Abstract: Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising the reasoning quality.

Comment: Model Architecture and Efficiency: hybrid SSM–Transformer via distillation replacing attention layers with Mamba to reduce KV-cache needs and boost throughput.

Relevance: 9 Novelty: 7

25. LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS

ArXiv ID: 2511.02089

Authors: Stefan F. Schouten, Peter Bloem

Abstract: Contrast-Consistent Search (CCS) is an unsupervised probing method able to test whether large language models represent binary features, such as sentence truth, in their internal activations. While CCS has shown promise, its two-term objective has been only partially understood. In this work, we revisit CCS with the aim of clarifying its mechanisms and extending its applicability. We argue that what should be optimized for, is relative contrast consistency. Building on this insight, we reformulate CCS as an eigenproblem, yielding closed-form solutions with interpretable eigenvalues and natural extensions to multiple variables. We evaluate these approaches across a range of datasets, finding that they recover similar performance to CCS, while avoiding problems around sensitivity to random initialization. Our results suggest that relativizing contrast consistency not only improves our understanding of CCS but also opens pathways for broader probing and mechanistic interpretability methods.

Comment: Mechanistic interpretability: reformulates CCS probing as a contrastive eigenproblem with closed-form solutions and multi-variable extension.

Relevance: 9 Novelty: 7

26. Calibrating and Rotating: A Unified Framework for Weight Conditioning in PEFT

ArXiv ID: 2511.00051

Authors: Da Chang, Peng Xue, Yu Li, Yongxiang Liu, Pengxiang Xu, Shixun Zhang

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods are crucial for adapting large pre-trained models. Among these, LoRA is considered a foundational approach. Building on this, the influential DoRA method enhances performance by decomposing weight updates into magnitude and direction. However, its underlying mechanism remains unclear, and it introduces significant computational overhead. In this work, we first identify that DoRA's success stems from its capacity to increase the singular value entropy of the weight update matrix, which promotes a more uniform update distribution akin to full fine-tuning. We then reformulate DoRA into a mathematically equivalent and more efficient matrix form, revealing it as a learnable weight conditioning method. Based on this insight, we propose a unified framework for designing advanced PEFT methods by exploring two orthogonal dimensions: the architectural placement and the transformation type of the conditioning matrix. Within this framework, we introduce two novel methods: (1) \textbf{Pre-Diag}, which applies a diagonal conditioning matrix before the LoRA update to efficiently calibrate the pre-trained weights, thereby enhancing performance while reducing training time; and (2) \textbf{S}kewed \textbf{O}rthogonal \textbf{R}otation \textbf{A}daptation (\textbf{SORA}), which employs a parameter-efficient orthogonal rotation to perform a more powerful, norm-preserving transformation of the feature space. Extensive experiments on natural language understanding and generation tasks demonstrate that our proposed methods achieve superior performance and efficiency compared to both LoRA and DoRA. The code is available at https://github.com/MaeChd/SORA.

Comment: PEFT/Efficiency: reformulates DoRA, links performance to singular value entropy, and introduces Pre-Diag and SORA for more efficient, conditioned LoRA-style updates.

Relevance: 9 Novelty: 7

27. Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects

ArXiv ID: 2511.02132

Authors: Mansi Choudhary, Karthik Sangaiah, Sonali Singh, Muhammad Osama, Lisa Wu Wills, Ganesh Dasika

Abstract: The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA effects distort locality in multi-head attention (MHA) and present Swizzled Head-first Mapping, a spatially-aware scheduling strategy that aligns attention heads with GPU NUMA domains to exploit intra-chiplet cache reuse. On AMD's MI300X architecture, our method achieves up to 50% higher performance over state-of-the-art attention algorithms using conventional scheduling techniques and sustains consistently high L2 cache hit rates of 80-97%. These results demonstrate that NUMA-aware scheduling is now fundamental to achieving full efficiency on next-generation disaggregated GPUs, offering a path forward for scalable AI training and inference.

Comment: High Performance Computing: NUMA-aware attention scheduling (Swizzled Head-first Mapping) exploiting intra-chiplet locality on disaggregated GPUs.

Relevance: 9 Novelty: 7

28. Redundancy Maximization as a Principle of Associative Memory Learning

ArXiv ID: 2511.02584

Authors: Mark Bl\"umel, Andreas C. Schneider, Valentin Neuhaus, David A. Ehrlich, Marcel Graetz, Michael Wibral, Abdullah Makkeh, Viola Priesemann

Abstract: Associative memory, traditionally modeled by Hopfield networks, enables the retrieval of previously stored patterns from partial or noisy cues. Yet, the local computational principles which are required to enable this function remain incompletely understood. To formally characterize the local information processing in such systems, we employ a recent extension of information theory - Partial Information Decomposition (PID). PID decomposes the contribution of different inputs to an output into unique information from each input, redundant information across inputs, and synergistic information that emerges from combining different inputs. Applying this framework to individual neurons in classical Hopfield networks we find that below the memory capacity, the information in a neuron's activity is characterized by high redundancy between the external pattern input and the internal recurrent input, while synergy and unique information are close to zero until the memory capacity is surpassed and performance drops steeply. Inspired by this observation, we use redundancy as an information-theoretic learning goal, which is directly optimized for each neuron, dramatically increasing the network's memory capacity to 1.59, a more than tenfold improvement over the 0.14 capacity of classical Hopfield networks and even outperforming recent state-of-the-art implementations of Hopfield networks. Ultimately, this work establishes redundancy maximization as a new design principle for associative memories and opens pathways for new associative memory models based on information-theoretic goals.

Comment: Representation Learning/Architecture: introduces redundancy maximization as an information-theoretic local learning principle for associative memory, greatly increasing Hopfield capacity.

Relevance: 8 Novelty: 8

29. Efficient Vector Symbolic Architectures from Histogram Recovery

ArXiv ID: 2511.01838

Authors: Zirui Deng, Netanel Raviv

Abstract: Vector symbolic architectures (VSAs) are a family of information representation techniques which enable composition, i.e., creating complex information structures from atomic vectors via binding and superposition, and have recently found wide ranging applications in various neurosymbolic artificial intelligence (AI) systems. Recently, Raviv proposed the use of random linear codes in VSAs, suggesting that their subcode structure enables efficient binding, while preserving the quasi-orthogonality that is necessary for neural processing. Yet, random linear codes are difficult to decode under noise, which severely limits the resulting VSA's ability to support recovery, i.e., the retrieval of information objects and their attributes from a noisy compositional representation. In this work we bridge this gap by utilizing coding theoretic tools. First, we argue that the concatenation of Reed-Solomon and Hadamard codes is suitable for VSA, due to the mutual quasi-orthogonality of the resulting codewords (a folklore result). Second, we show that recovery of the resulting compositional representations can be done by solving a problem we call histogram recovery. In histogram recovery, a collection of $N$ histograms over a finite field is given as input, and one must find a collection of Reed-Solomon codewords of length $N$ whose entry-wise symbol frequencies obey those histograms. We present an optimal solution to the histogram recovery problem by using algorithms related to list-decoding, and analyze the resulting noise resilience. Our results give rise to a noise-resilient VSA with formal guarantees regarding efficient encoding, quasi-orthogonality, and recovery, without relying on any heuristics or training, and while operating at improved parameters relative to similar solutions such as the Hadamard code.

Comment: Representation Learning: coding-theoretic vector symbolic architecture with efficient binding and provable recovery via histogram recovery and list-decoding, enabling robust compositional representations.

Relevance: 8 Novelty: 8

30. Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving

ArXiv ID: 2511.01633

Authors: Chengying Huan, Ziheng Meng, Yongchao Liu, Zhengyi Yang, Yun Zhu, Yue Yun, Shipeng Li, Rong Gu, Xiabao Wu, Haitao Zhang, Chuntao Hong, Shaonan Ma, Guihai Chen, Chen Tian

Abstract: Graph Chain-of-Thought (Graph-CoT) enables large language models (LLMs) to perform step-by-step reasoning over graph-structured knowledge, but existing pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single-agent monolithic prompts, repeated context re-encoding, and inefficient serving execution. We present GLM, the first multi-agent Graph-CoT system co-designed with an optimized LLM serving architecture. GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval, enabling branching and selective context sharing to reduce prompt length and reasoning iterations while preserving reasoning quality, thereby improving accuracy and reducing overall token consumption. To scale inference, we introduce a Graph-CoT-aware LLM inference mechanism with graph-specific KV-cache management, priority-based eviction, and pipelined execution to improve serving efficiency. Experiments demonstrate that GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1x higher throughput compared to state-of-the-art Graph-CoT baselines, enabling efficient adoption for complex real-world reasoning at scale.

Comment: High Performance Computing and Efficiency: multi-agent Graph-CoT decomposition plus LLM serving optimizations (graph-specific KV-cache management, priority eviction, pipelining) for lower tokens/latency and higher throughput.

Relevance: 8 Novelty: 8

31. A Proof of Learning Rate Transfer under $\mu$P

ArXiv ID: 2511.01734

Authors: Soufiane Hayou

Abstract: We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parameterization designed to ``maximize'' feature learning in the infinite-width limit. We show that under $\mu P$, the optimal learning rate converges to a \emph{non-zero constant} as width goes to infinity, providing a theoretical explanation to learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.

Comment: Representation Learning/Training Dynamics: theoretical proof of learning-rate transfer under μP parameterization, contrasting with SP/NTP.

Relevance: 8 Novelty: 8

32. Accelerated Frank-Wolfe Algorithms: Complementarity Conditions and Sparsity

ArXiv ID: 2511.02821

Authors: Dan Garber

Abstract: We develop new accelerated first-order algorithms in the Frank-Wolfe (FW) family for minimizing smooth convex functions over compact convex sets, with a focus on two prominent constraint classes: (1) polytopes and (2) matrix domains given by the spectrahedron and the unit nuclear-norm ball. A key technical ingredient is a complementarity condition that captures solution sparsity -- face dimension for polytopes and rank for matrices. We present two algorithms: (1) a purely linear optimization oracle (LOO) method for polytopes that has optimal worst-case first-order (FO) oracle complexity and, aside of a finite \emph{burn-in} phase and up to a logarithmic factor, has LOO complexity that scales with $r/\sqrt{\epsilon}$, where $\epsilon$ is the target accuracy and $r$ is the solution sparsity $r$ (independently of the ambient dimension), and (2) a hybrid scheme that combines FW with a sparse projection oracle (e.g., low-rank SVDs for matrix domains with low-rank solutions), which also has optimal FO oracle complexity, and after a finite burn-in phase, only requires $O(1/\sqrt{\epsilon})$ sparse projections and LOO calls (independently of both the ambient dimension and the rank of optimal solutions). Our results close a gap on how to accelerate recent advancements in linearly-converging FW algorithms for strongly convex optimization, without paying the price of the dimension.

Comment: Optimization for Sparsity/Low-rank: accelerated Frank–Wolfe with complementarity conditions yielding sparsity- and rank-aware complexity guarantees.

Relevance: 8 Novelty: 8

33. Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering

ArXiv ID: 2511.00617

Authors: Eric Bigelow, Daniel Wurgaft, YingQiao Wang, Noah Goodman, Tomer Ullman, Hidenori Tanaka, Ekdeep Singh Lubana

Abstract: Large language models (LLMs) can be controlled at inference time through prompts (in-context learning) and internal activations (activation steering). Different accounts have been proposed to explain these methods, yet their common goal of controlling model behavior raises the question of whether these seemingly disparate methodologies can be seen as specific instances of a broader framework. Motivated by this, we develop a unifying, predictive account of LLM control from a Bayesian perspective. Specifically, we posit that both context- and activation-based interventions impact model behavior by altering its belief in latent concepts: steering operates by changing concept priors, while in-context learning leads to an accumulation of evidence. This results in a closed-form Bayesian model that is highly predictive of LLM behavior across context- and activation-based interventions in a set of domains inspired by prior work on many-shot in-context learning. This model helps us explain prior empirical phenomena - e.g., sigmoidal learning curves as in-context evidence accumulates - while predicting novel ones - e.g., additivity of both interventions in log-belief space, which results in distinct phases such that sudden and dramatic behavioral shifts can be induced by slightly changing intervention controls. Taken together, this work offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.

Comment: Representation Learning/Training Dynamics: a unified Bayesian account linking in-context learning and activation steering, predictive of internal belief updates.

Relevance: 8 Novelty: 8

34. Real-time Continual Learning on Intel Loihi 2

ArXiv ID: 2511.01553

Authors: Elvin Hajizada, Danielle Rager, Timothy Shea, Leobardo Campos-Macias, Andreas Wild, Eyke H\"ullermeier, Yulia Sandamirskaya, Mike Davies

Abstract: AI systems on edge devices face a critical challenge in open-world environments: adapting when data distributions shift and novel classes emerge. While offline training dominates current paradigms, online continual learning (OCL)--where models learn incrementally from non-stationary streams without catastrophic forgetting--remains challenging in power-constrained settings. We present a neuromorphic solution called CLP-SNN: a spiking neural network architecture for Continually Learning Prototypes and its implementation on Intel's Loihi 2 chip. Our approach introduces three innovations: (1) event-driven and spatiotemporally sparse local learning, (2) a self-normalizing three-factor learning rule maintaining weight normalization, and (3) integrated neurogenesis and metaplasticity for capacity expansion and forgetting mitigation. On OpenLORIS few-shot learning experiments, CLP-SNN achieves accuracy competitive with replay methods while being rehearsal-free. CLP-SNN delivers transformative efficiency gains: 70\times faster (0.33ms vs 23.2ms), and 5,600\times more energy efficient (0.05mJ vs 281mJ) than the best alternative OCL on edge GPU. This demonstrates that co-designed brain-inspired algorithms and neuromorphic hardware can break traditional accuracy-efficiency trade-offs for future edge AI systems.

Comment: Matches Architecture+HPC/Efficiency: SNN with local learning, normalization, neurogenesis on neuromorphic hardware (Loihi 2) enabling real-time, energy-efficient continual learning.

Relevance: 8 Novelty: 8

35. Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold

ArXiv ID: 2511.02773

Authors: Xinghan Li, Haodong Wen, Kaifeng Lyu

Abstract: Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\tr(\mH)$, whereas we prove that Adam minimizes $\tr(\Diag(\mH)^{1/2})$ instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.

Comment: Training Dynamics/Optimization: shows Adam minimizes a distinct sharpness measure via adaptive updates (SDE analysis), contrasting SGD and extending to other adaptive methods.

Relevance: 8 Novelty: 8

36. DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching

ArXiv ID: 2511.00640

Authors: Zicheng Xu, Guanchu Wang, Yu-Neng Chuang, Guangyao Zheng, Alexander S. Szalay, Zirui Liu, Vladimir Braverman

Abstract: Large Reasoning Models (LRMs) demonstrate strong performance on complex reasoning tasks, yet they often suffer from overthinking, producing excessively long chain-of-thought (CoT) traces that increase inference cost and may degrade accuracy. Our analysis reveals a clear anti-correlation between reasoning length and accuracy, where across multiple stochastic decodes, the short reasoning paths consistently achieve the highest correctness, while longer ones accumulate errors and repetitions. These short optimal reasoning paths can be found ideally through full enumeration of the reasoning space. However, the tree-structured reasoning space grows exponentially with sequence length, rendering exhaustive exploration infeasible. To address this, we propose DTS, a model-agnostic decoding framework that sketches the reasoning space by selectively branching at high-entropy tokens and applies early stopping to select the shortest completed reasoning path. This approach approximates the optimal solution that enhances both efficiency and accuracy, without requiring additional training or supervision. Experiments on AIME2024 and AIME2025 datasets with DeepSeek-R1-Distill-Qwen-7B and 1.5B show that DTS improves accuracy by up to 8%, reduces average reasoning length by 23%, and decreases repetition frequency by 12%, demonstrating DTS's ability for scalable and efficient LRM reasoning.

Comment: Model Compression and Efficiency: proposes a training-free decoding framework that selectively branches at high-entropy tokens with early stopping to shorten chain-of-thought while improving accuracy.