Personalized Daily ArXiv Papers 2025-10-21

[gpt-5]	Prompt	Completion	Total
Token	94070	73067	167137
Cost	$0.12	$0.73	$0.85

Total arXiv papers: 1193

Total scanned papers: 742

Total relevant papers: 56

Table of contents with paper titles:

Unbiased Gradient Low-Rank Projection Authors: Rui Pan, Yang Luo, Yuxing Liu, Yang You, Tong Zhang
Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads Authors: Zhoutong Wu, Yuan Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis Authors: Hoang Pham, The-Anh Ta, Tom Jacobs, Rebekka Burkholz, Long Tran-Thanh
Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts Authors: Yongxiang Hua, Haoyu Cao, Zhou Tao, Bocheng Li, Zihao Wu, Chaohu Liu, Linli Xu
TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs Authors: Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang
Expert Merging in Sparse Mixture of Experts with Nash Bargaining Authors: Dung V. Nguyen, Anh T. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Shiqi Jiang, Ethan Fetaya, Linh Duy Tran, Gal Chechik, Tan M. Nguyen
Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior Authors: Fuqun Han, Stanley Osher, Wuchen Li
Accelerating Frontier MoE Training with 3D Integrated Optics Authors: Mikhail Bernadskiy, Peter Carson, Thomas Graham, Taylor Groves, Ho John Lee, Eric Yeh
AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization Authors: Mengtao Lv, Ruiqi Zhu, Xinyu Wang, Yun Li
One-Bit Quantization for Random Features Models Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi
Localist LLMs with Recruitment Learning Authors: Joachim Diederich
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Fighter: Unveiling the Graph Convolutional Nature of Transformers in Time Series Modeling Authors: Chen Zhang, Weixin Bu, Wendong Xu, Runsheng Yu, Yik-Chung Wu, Ngai Wong
CTR-LoRA: Curvature-Aware and Trust-Region Guided Low-Rank Adaptation for Large Language Models Authors: Zhuxuanzi Wang, Mingqiao Mo, Xi Xiao, Chen Liu, Chenrui Ma, Yunbei Zhang, Xiao Wang, Smita Krishnaswamy, Tianyang Wang
Glyph: Scaling Context Windows via Visual-Text Compression Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity Authors: Tuowei Wang, Kun Li, Zixu Hao, Donglin Bai, Ju Ren, Yaoxue Zhang, Ting Cao, Mao Yang
MuonBP: Faster Muon via Block-Periodic Orthogonalization Authors: Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, Youngsuk Park
AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution Authors: Donghyeok Shin, Yeongmin Kim, Suhyeon Jo, Byeonghu Na, Il-Chul Moon
FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern Authors: Ao Shen, Rui Zhang, Junping Zhao
Compressing Many-Shots in In-Context Learning Authors: Devvrit Khatri, Pranamya Kulkarni, Nilesh Gupta, Yerram Varun, Liqian Peng, Jay Yagnik, Praneeth Netrapalli, Cho-Jui Hsieh, Alec Go, Inderjit S Dhillon, Aditya Kusupati, Prateek Jain
Neuronal Group Communication for Efficient Neural representation Authors: Zhengqi Pei, Qingming Huang, Shuhui Wang
Algorithmic Primitives and Compositional Geometry of Reasoning in Language Models Authors: Samuel Lippl, Thomas McGee, Kimberly Lopez, Ziwen Pan, Pierce Zhang, Salma Ziadi, Oliver Eberle, Ida Momennejad
Infinite Neural Operators: Gaussian processes on functions Authors: Daniel Augusto de Souza, Yuchen Zhu, Harry Jake Cunningham, Yuri Saporito, Diego Mesquita, Marc Peter Deisenroth
Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares Authors: Lachlan Ewen MacDonald, Hancheng Min, Leandro Palma, Salma Tarmoun, Ziqing Xu, René Vidal
Diverse Influence Component Analysis: A Geometric Approach to Nonlinear Mixture Identifiability Authors: Hoang-Son Nguyen, Xiao Fu
SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference Authors: Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Yongpan Liu
QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models Authors: Yutong Wang, Haiyu Wang, Sai Qian Zhang
ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models Authors: Pu Zhang, Yuwei Li, Xingyuan Xian, Guoming Tang
Bitwidth-Specific Logarithmic Arithmetic for Future Hardware-Accelerated Training Authors: Hassan Hamad, Yuou Qiu, Peter A. Beerel, Keith M. Chugg
Symmetry and Generalisation in Neural Approximations of Renormalisation Transformations Authors: Cassidy Ashworth, Pietro Liò, Francesco Caso
Just-In-Time Piecewise-Linear Semantics for ReLU-type Networks Authors: Hongyi Duan, Haoyang Liu, Jian'an Zhang, Fengrui Liu, Yiyi Wang
Asymptotically Stable Quaternion-valued Hopfield-structured Neural Network with Periodic Projection-based Supervised Learning Rules Authors: Tianwei Wang, Xinhui Ma, Wei Pang
Deeper with Riemannian Geometry: Overcoming Oversmoothing and Oversquashing for Graph Foundation Models Authors: Li Sun, Zhenhao Huang, Ming Zhang, Philip S. Yu
On the Impossibility of Retrain Equivalence in Machine Unlearning Authors: Jiatong Yu, Yinghui He, Anirudh Goyal, Sanjeev Arora
How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime? Authors: Wei Huang, Andi Han, Yujin Song, Yilan Chen, Denny Wu, Difan Zou, Taiji Suzuki
Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation Authors: Rishi Sonthalia, Raj Rao Nadakuditi
Computational Budget Should Be Considered in Data Selection Authors: Weilin Wan, Weizhong Zhang, Cheng Jin
Atlas-based Manifold Representations for Interpretable Riemannian Machine Learning Authors: Ryan A. Robinett, Sophia A. Madejski, Kyle Ruark, Samantha J. Riesenfeld, Lorenzo Orecchia
All You Need is One: Capsule Prompt Tuning with a Single Vector Authors: Yiyang Liu, James C. Liang, Heng Fan, Wenhao Yang, Yiming Cui, Xiaotian Han, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han
Zeroth-Order Sharpness-Aware Learning with Exponential Tilting Authors: Xuchen Gong, Tian Li
MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models Authors: Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng
NeurIPT: Foundation Model for Neural Interfaces Authors: Zitao Fang, Chenxuan Li, Hongting Zhou, Shuyang Yu, Guodong Du, Ashwaq Qasem, Yang Lu, Jing Li, Junsong Zhang, Sim Kuan Goh
Model Metamers Reveal Invariances in Graph Neural Networks Authors: Wei Xu, Xiaoyi Jiang, Lixiang Xu, Dechao Tang
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs Authors: Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi
Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing Authors: Tianhua Xia, Sai Qian Zhang
Early-stopping for Transformer model training Authors: Jing He, Hua Jiang, Cheng Li, Siqian Xin, Shuzhen Yang
ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification Authors: Athanasios Angelakis, Amne Mousa, Micah L. A. Heldeweg, Laurens A. Biesheuvel, Mark A. Haaksma, Jasper M. Smit, Pieter R. Tuinman, Paul W. G. Elbers
Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization Authors: Tina Behnia, Puneesh Deora, Christos Thrampoulidis
Breaking Memorization Barriers in LLM Code Fine-Tuning via Information Bottleneck for Improved Generalization Authors: Changsheng Wang, Xin Chen, Sijia Liu, Ke Ding
Memorizing Long-tail Data Can Help Generalization Through Composition Authors: Mo Zhou, Haoyang Ma, Rong Ge
Protein Folding with Neural Ordinary Differential Equations Authors: Arielle Sanford, Shuo Sun, Christian B. Mendl
Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness Authors: Longwei Wang, Ifrat Ikhtear Uddin, KC Santosh, Chaowei Zhang, Xiao Qin, Yang Zhou
DFNN: A Deep Fréchet Neural Network Framework for Learning Metric-Space-Valued Responses Authors: Kyum Kim, Yaqing Chen, Paromita Dubey
Local properties of neural networks through the lens of layer-wise Hessians Authors: Maxim Bolshim (ITMO University, Saint Petersburg, Russia), Alexander Kugaevskikh (ITMO University, Saint Petersburg, Russia)
Vector Quantization in the Brain: Grid-like Codes in World Models Authors: Xiangyuan Peng, Xingsi Dong, Si Wu
Mapping Post-Training Forgetting in Language Models at Scale Authors: Jackson Harmon, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu

1. Unbiased Gradient Low-Rank Projection

ArXiv ID: 2510.17802

Authors: Rui Pan, Yang Luo, Yuxing Liu, Yang You, Tong Zhang

Abstract: Memory-efficient optimization is critical for training increasingly large language models (LLMs). A popular strategy involves gradient low-rank projection, storing only the projected optimizer states, with GaLore being a representative example. However, a significant drawback of many such methods is their lack of convergence guarantees, as various low-rank projection approaches introduce inherent biases relative to the original optimization algorithms, which contribute to performance gaps compared to full-parameter training. Aiming to tackle this problem, this paper investigates the layerwise sampling technique for debiasing low-rank projection mechanisms. In particular, an instantiation of the paradigm gives rise to a novel and unbiased low-rank optimization method built upon GaLore's mechanism and the Muon algorithm, named GaLore Unbiased with Muon (GUM). We theoretically prove our method matches the convergence guarantees of the base Muon algorithm while preserving the memory efficiency of low-rank techniques. Empirical experiments on LLM fine-tuning and pretraining also demonstrate non-trivial improvements over GaLore and even better performance than full-parameter training. Further investigation shows that the improvement of this technique comes from a more uniform distribution of knowledge inside layers, leading to more efficient utilization of the model parameter space and better memorization.

Comment: Model Compression and Efficiency: unbiased low-rank gradient projection (GUM) with convergence guarantees, preserving memory savings while matching/improving full-parameter training.

Relevance: 10 Novelty: 9
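
The debiasing idea in this abstract can be illustrated with a generic construction. This is a minimal sketch, not the paper's GUM algorithm: it assumes a fixed projector `proj` with orthonormal columns and a sampling probability `q`, both hypothetical, and it leaves out Muon and the layerwise details. The low-rank part P P^T G is always kept, and the residual G - P P^T G is added with probability `q`, rescaled by 1/q, so the estimate equals G in expectation.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def low_rank_part(grad, proj):
    """Return P P^T G, where the columns of proj (m x r) are orthonormal."""
    proj_t = [list(col) for col in zip(*proj)]
    return matmul(proj, matmul(proj_t, grad))

def compensated(grad, low, q):
    """Low-rank part plus the residual G - P P^T G rescaled by 1/q."""
    return [[l + (g - l) / q for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(grad, low)]

def debiased_gradient(grad, proj, q, rng=random):
    """With probability q return the compensated estimate, otherwise the low-rank one."""
    low = low_rank_part(grad, proj)
    return compensated(grad, low, q) if rng.random() < q else low

def expected_estimate(grad, proj, q):
    """Exact expectation of debiased_gradient over the sampling step."""
    low = low_rank_part(grad, proj)
    full = compensated(grad, low, q)
    return [[q * f + (1 - q) * l for f, l in zip(f_row, l_row)]
            for f_row, l_row in zip(full, low)]

grad = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy 3 x 2 gradient
proj = [[1.0], [0.0], [0.0]]                  # rank-1 projector onto the first row
print(expected_estimate(grad, proj, q=0.25))  # matches grad up to rounding
```

A plain low-rank projection would return only `low_rank_part(grad, proj)`, whose expectation differs from the gradient wherever the residual is non-zero. That gap is the bias the abstract refers to.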

2. Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads

ArXiv ID: 2510.16807

Authors: Zhoutong Wu, Yuan Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin

Abstract: Transformer models have driven breakthroughs across various language tasks by their strong capability to learn rich contextual representations. Scaling them to improve representation, however, often demands substantial memory and compute costs, such as the Key-Value (KV) cache used during auto-regressive decoding. Skip connections offer a promising way to improve representation without bloating resource usage, yet most prior works either improve expressivity while leaving KV costs unchanged, or reduce memory at the cost of weaker representation. In this work, we propose SkipV1Former, a Transformer variant that uses skip connections from the first layer's Value heads to strengthen model representation and reduce KV cache. Specifically, from the second block onward, each layer reuses half of its Value heads from the very first layer, while computing the other half as usual, cutting Value projections and V cache by nearly 50%. Theoretically, we show that routing uncompressed first-layer Values into deeper layers restores information lost to compression and accelerates the model's implicit mesa-optimization, a key pattern of Transformer in auto-regressive tasks. Empirically, across different model scales, SkipV1Former delivers consistent reductions of approximately 25% in KV cache while improving perplexity relative to standard Multi-Head Attention (MHA) Transformers and some advanced variants. Moreover, we propose a recipe for uptraining existing MHA Transformer checkpoints to SkipV1Former with only 10-15% additional compute. Finally, SkipV1Former can seamlessly combine advanced methods like Group-Query Attention and Multi-Latent Attention to achieve further KV cache savings and performance improvement. When combined with YOCO, it cuts KV cache size by nearly 50% while still improving performance.

Comment: Model Architecture + Efficiency: SkipV1Former reuses first-layer Value heads to cut V projections/KV cache (~25–50%) while improving perplexity; KV-cache reduction.

Relevance: 10 Novelty: 9
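
The abstract quotes two figures: V cache cut by nearly 50%, and total KV cache cut by about 25%. The short calculation below shows how they fit together. It is a back-of-the-envelope sketch under stated assumptions: keys and values cost the same per layer, the first layer stores both in full, and the reused first-layer Value heads need no extra storage in deeper layers. The function name and the `shared_value_fraction` parameter are hypothetical.

```python
def kv_cache_fraction(num_layers: int, shared_value_fraction: float = 0.5) -> float:
    """KV cache size relative to standard MHA (1.0 means no savings)."""
    baseline = 2.0 * num_layers                      # one K unit and one V unit per layer
    first_layer = 2.0                                # layer 1 keeps full K and full V
    per_later_layer = 1.0 + (1.0 - shared_value_fraction)  # full K, partial V
    return (first_layer + (num_layers - 1) * per_later_layer) / baseline

# Savings for a 32-layer model under these assumptions.
print(round(1 - kv_cache_fraction(32), 3))  # prints 0.242
```

As the layer count grows the fraction approaches 0.75, which matches the roughly 25% reduction quoted for the whole KV cache.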

3. The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis

ArXiv ID: 2510.17515

Authors: Hoang Pham, The-Anh Ta, Tom Jacobs, Rebekka Burkholz, Long Tran-Thanh

Abstract: Sparse neural networks promise efficiency, yet training them effectively remains a fundamental challenge. Despite advances in pruning methods that create sparse architectures, why some sparse structures are more trainable than others at the same level of sparsity remains poorly understood. Aiming to develop a systematic approach to this fundamental problem, we propose a novel theoretical framework based on the theory of graph limits, particularly graphons, that characterizes sparse neural networks in the infinite-width regime. Our key insight is that connectivity patterns of sparse neural networks induced by pruning methods converge to specific graphons as networks' width tends to infinity, which encodes implicit structural biases of different pruning methods. We postulate the Graphon Limit Hypothesis and provide empirical evidence to support it. Leveraging this graphon representation, we derive a Graphon Neural Tangent Kernel (Graphon NTK) to study the training dynamics of sparse networks in the infinite width limit. Graphon NTK provides a general framework for the theoretical analysis of sparse networks. We empirically show that the spectral analysis of Graphon NTK correlates with observed training dynamics of sparse networks, explaining the varying convergence behaviours of different pruning methods. Our framework provides theoretical insights into the impact of connectivity patterns on the trainability of various sparse network architectures.

Comment: Matches Model Compression and Sparsity Theory: introduces a graphon-based infinite-width framework and Graphon NTK to analyze pruning and sparse network trainability.

Relevance: 10 Novelty: 9

4. Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts

ArXiv ID: 2510.16448

Authors: Yongxiang Hua, Haoyu Cao, Zhou Tao, Bocheng Li, Zihao Wu, Chaohu Liu, Linli Xu

Abstract: Sparse Mixture of Experts (sMoE) has become a pivotal approach for scaling large vision-language models, offering substantial capacity while maintaining computational efficiency through dynamic, sparse activation of experts. However, existing routing mechanisms, typically based on similarity scoring, struggle to effectively capture the underlying input structure. This limitation leads to a trade-off between expert specialization and balanced computation, hindering both scalability and performance. We propose Input Domain Aware MoE, a novel routing framework that leverages a probabilistic mixture model to better partition the input space. By modeling routing probabilities as a mixture of distributions, our method enables experts to develop clear specialization boundaries while achieving balanced utilization. Unlike conventional approaches, our routing mechanism is trained independently of task-specific objectives, allowing for stable optimization and decisive expert assignments. Empirical results on vision-language tasks demonstrate that our method consistently outperforms existing sMoE approaches, achieving higher task performance and improved expert utilization balance.

Comment: Model Architecture (MoE): probabilistic input-domain-aware routing decoupled from task optimization for expert specialization and balanced utilization.

Relevance: 10 Novelty: 8
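
To make "routing probabilities as a mixture of distributions" concrete, here is a minimal sketch that routes a token by its posterior under a mixture model. The isotropic Gaussian components, the `centers` and `priors` arguments, and the hard argmax assignment are all assumptions for illustration; the abstract does not specify the mixture family or how it is fitted.

```python
import math

def routing_posteriors(token, centers, priors, variance=1.0):
    """Posterior p(expert | token) under an isotropic Gaussian mixture."""
    log_scores = []
    for center, prior in zip(centers, priors):
        sq_dist = sum((t - c) ** 2 for t, c in zip(token, center))
        log_scores.append(math.log(prior) - sq_dist / (2.0 * variance))
    top = max(log_scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

def route(token, centers, priors, variance=1.0):
    """Index of the expert with the highest posterior."""
    probs = routing_posteriors(token, centers, priors, variance)
    return max(range(len(probs)), key=probs.__getitem__)

centers = [[0.0, 0.0], [4.0, 4.0]]   # one hypothetical component per expert
priors = [0.5, 0.5]
print(route([0.1, -0.2], centers, priors), route([3.9, 4.2], centers, priors))  # prints 0 1
```

Because the posteriors depend only on the input and the mixture parameters, such a router could be fitted on inputs alone, which is one way to read the abstract's claim that routing is trained independently of the task objective.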

5. TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs

ArXiv ID: 2510.15926

Authors: Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang

Abstract: With the emergence of wearable devices and other embedded systems, deploying large language models (LLMs) on edge platforms has become an urgent need. However, this is challenging because of their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as low as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected long latency of the prefill stage. We present TeLLMe, the first table-lookup-based ternary LLM accelerator for low-power edge FPGAs that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. TeLLMe incorporates several novel techniques, including (1) a table-lookup-based ternary matrix multiplication (TLMM) engine utilizing grouped activations and online precomputation for low resource utilization and high throughput; (2) a fine-grained analytic URAM-based weight buffer management scheme for efficient loading and compute engine access; (3) a streaming dataflow architecture that fuses floating-point element-wise operations with linear computations to hide latency; (4) a reversed-reordered prefill stage attention with fused attention operations for high memory efficiency; and (5) a resource-efficient specialized decoding stage attention. Under a 5 W power budget, TeLLMe delivers up to 25 tokens/s decoding throughput and 0.45–0.96 s time-to-first-token (TTFT) for 64–128 token prompts, marking a significant energy-efficiency advancement in LLM inference on edge FPGAs.

Comment: High Performance Computing / Compression: ternary (1.58-bit) LLM accelerator with table-lookup matmul, fused attention, and prefill/decoding optimizations on edge FPGAs.

Relevance: 10 Novelty: 8
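
The table-lookup idea in technique (1) can be sketched in software. This is an illustration only, not the TLMM hardware design: the group size of 3, the base-3 weight encoding, and all function names are assumptions. For each group of activations, every possible signed sum over ternary weights is precomputed once; each weight row then needs only one table read per group instead of per-element multiplies.

```python
GROUP = 3  # activations per lookup group (hypothetical choice)

def build_table(acts):
    """All 3**len(acts) signed sums of one activation group, indexed by weight code."""
    table = []
    for code in range(3 ** len(acts)):
        rest, total = code, 0
        for a in acts:
            total += (rest % 3 - 1) * a   # base-3 digit 0/1/2 encodes weight -1/0/+1
            rest //= 3
        table.append(total)
    return table

def encode_row(weights, group=GROUP):
    """Pack each group of ternary weights into one base-3 code (done offline)."""
    return [sum((w + 1) * 3 ** i for i, w in enumerate(weights[s:s + group]))
            for s in range(0, len(weights), group)]

def tl_matvec(weight_rows, acts, group=GROUP):
    """Ternary matrix-vector product using one table lookup per row and group."""
    tables = [build_table(acts[s:s + group]) for s in range(0, len(acts), group)]
    encoded = [encode_row(row, group) for row in weight_rows]
    return [sum(t[c] for t, c in zip(tables, codes)) for codes in encoded]

weights = [[1, 0, -1, 1], [-1, -1, 0, 0]]
acts = [2, 5, -3, 7]
print(tl_matvec(weights, acts))  # prints [12, -7]
```

The tables depend only on the activations, so they are built once per input vector and shared by every weight row. This is where the saving comes from when the matrix has many rows.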

6. Expert Merging in Sparse Mixture of Experts with Nash Bargaining

ArXiv ID: 2510.16138

Authors: Dung V. Nguyen, Anh T. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Shiqi Jiang, Ethan Fetaya, Linh Duy Tran, Gal Chechik, Tan M. Nguyen

Abstract: Existing expert merging strategies for Sparse Mixture of Experts (SMoE) typically rely on input-dependent or input-independent averaging of expert parameters, but often lack a principled weighting mechanism. In this work, we reinterpret expert merging through the lens of game theory, revealing cooperative and competitive dynamics among experts. Based on this perspective, we introduce Nash Merging of Experts (NAMEx), a novel framework that incorporates Nash Bargaining into the merging process, enabling more balanced and efficient collaboration among experts. Additionally, we incorporate complex momentum into NAMEx to accelerate expert propagation with theoretical guarantees for convergence. Extensive experiments across language modelling, text classification, image classification, and zero-shot robustness under data corruption show that NAMEx consistently outperforms competing methods while integrating seamlessly with popular MoE architectures. Finally, we demonstrate NAMEx's scalability by applying it to large-scale systems, including Qwen1.5-MoE (14B) and DeepSeek-MoE (16B), where it proves effective in both zero-shot and fine-tuning settings.

Comment: Model Architecture (MoE): principled expert merging for sparse MoE via Nash bargaining with convergence guarantees; improves merging over ad-hoc averaging.

Relevance: 10 Novelty: 8

7. Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior

ArXiv ID: 2510.16356

Authors: Fuqun Han, Stanley Osher, Wuchen Li

Abstract: In this work, we propose a sparse transformer architecture that incorporates prior information about the underlying data distribution directly into the transformer structure of the neural network. The design of the model is motivated by a special optimal transport problem, namely the regularized Wasserstein proximal operator, which admits a closed-form solution and turns out to be a special representation of transformer architectures. Compared with classical flow-based models, the proposed approach improves the convexity properties of the optimization problem and promotes sparsity in the generated samples. Through both theoretical analysis and numerical experiments, including applications in generative modeling and Bayesian inverse problems, we demonstrate that the sparse transformer achieves higher accuracy and faster convergence to the target distribution than classical neural ODE-based methods.

Comment: Model Architecture + Sparsity: proposes a sparse transformer grounded in regularized Wasserstein proximal operator with L1 prior; theoretical and architectural innovation.

Relevance: 10 Novelty: 8

8. Accelerating Frontier MoE Training with 3D Integrated Optics

ArXiv ID: 2510.15893

Authors: Mikhail Bernadskiy, Peter Carson, Thomas Graham, Taylor Groves, Ho John Lee, Eric Yeh

Abstract: The unabated growth in AI workload demands is driving the need for concerted advances in compute, memory, and interconnect performance. As traditional semiconductor scaling slows, high-speed interconnects have emerged as the new scaling engine, enabling the creation of larger logical GPUs by linking many GPUs into a single, low-latency, high-bandwidth compute domain. While initial scale-up fabrics leveraged copper interconnects for their power and cost advantages, the maximum reach of passive electrical interconnects (approximately 1 meter) effectively limits the scale-up domain to within a single rack. The advent of 3D-stacked optics and logic offers a transformative, power-efficient scale-up solution for connecting hundreds of GPU packages (thousands of GPUs) across multiple data center racks. This work explores the design tradeoffs of scale-up technologies and demonstrates how frontier LLMs necessitate novel photonic solutions to achieve aggressive power and performance targets. We model the benefits of 3D CPO (Passage) enabled GPUs and switches within the scale-up domain when training Frontier Mixture of Experts (MoE) models exceeding one trillion parameters. Our results show that the substantial increases in bandwidth and radix enabled by 3D CPO allow for an 8X increase in scale-up capability. This affords new opportunities for multi-dimensional parallelism within the scale-up domain and results in a 2.7X reduction in time-to-train, unlocking unprecedented model scaling.

Comment: High Performance Computing: photonic 3D co-packaged optics to scale MoE training across racks; systems-level innovation enabling larger parallelism and faster training.

Relevance: 10 Novelty: 8

9. AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization

ArXiv ID: 2510.16045

Authors: Mengtao Lv, Ruiqi Zhu, Xinyu Wang, Yun Li

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various kinds of tasks, while the billion or even trillion parameters bring storage and efficiency bottlenecks for inference. Quantization, particularly floating-point quantization, is known to be capable of speeding up LLM inference by reducing memory footprint and data movement during the inference process. For the first time, we advance the floating-point quantization exploration from integer bitwidths to non-integer bit-widths, namely AMS-Quant, to further approach the quantization sweet spot. AMS-Quant incorporates two novel techniques to put it into effect: (1) it proposes Mantissa-bit Sharing, which groups k quantized weights and lets them share the least significant mantissa bit, allowing us to further approach the minimum quantization bit-width without accuracy loss. (2) It introduces Adaptive Searching, which employs an offline optimization strategy to minimize the accuracy degradation introduced by sharing. Moreover, AMS-Quant is also prototyped as efficient CUDA Linear kernels, which translates memory savings into wall-clock latency reduction by reducing memory access. Extensive experiments on large-scale datasets and models show that AMS-Quant can quantize the model to FP-5.33-e2m3 and FP4.25-e2m2, and significantly speed up the LLM decoding over FP16 inference (2.8x and 3.2x), with negligible accuracy loss.

Comment: Matches Model Compression and Efficiency: introduces adaptive mantissa-bit sharing for sub-integer floating-point quantization with CUDA kernels, reducing memory access and latency.

Relevance: 10 Novelty: 8
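
The non-integer bit-widths in this abstract follow from simple arithmetic once k weights share one least-significant mantissa bit. The sketch below assumes one sign bit per weight and a shared bit amortized evenly over the group. The group sizes 3 and 4 are inferred from the reported 5.33 and 4.25 figures, not stated in the abstract, and `effective_bits` is a hypothetical helper.

```python
from fractions import Fraction

def effective_bits(exp_bits: int, man_bits: int, group_size: int) -> Fraction:
    """Average bits per weight when group_size weights share one mantissa LSB."""
    private_bits = 1 + exp_bits + (man_bits - 1)   # sign + exponent + unshared mantissa
    return Fraction(private_bits) + Fraction(1, group_size)

print(effective_bits(2, 3, 3))  # e2m3 with k = 3, prints 16/3 (about 5.33 bits)
print(effective_bits(2, 2, 4))  # e2m2 with k = 4, prints 17/4 (4.25 bits)
```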

10. One-Bit Quantization for Random Features Models

ArXiv ID: 2510.16250

Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi

Abstract: Recent advances in neural networks have led to significant computational and memory demands, spurring interest in one-bit weight compression to enable efficient inference on resource-constrained devices. However, the theoretical underpinnings of such compression remain poorly understood. We address this gap by analyzing one-bit quantization in the Random Features model, a simplified framework that corresponds to neural networks with random representations. We prove that, asymptotically, quantizing weights of all layers except the last incurs no loss in generalization error, compared to the full precision random features model. Our findings offer theoretical insights into neural network compression. We also demonstrate empirically that one-bit quantization leads to significant inference speed ups for the Random Features models even on a laptop GPU, confirming the practical benefits of our work. Additionally, we provide an asymptotically precise characterization of the generalization error for Random Features with an arbitrary number of layers. To the best of our knowledge, our analysis yields more general results than all previous works in the related literature.

Comment: Model Compression and Efficiency: theory for one-bit quantization in Random Features models showing no generalization loss when quantizing all but last layer.

Relevance: 9 Novelty: 8

11. Localist LLMs with Recruitment Learning

ArXiv ID: 2510.17358

Authors: Joachim Diederich

Abstract: We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovations are (1) a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining, (2) an information-theoretic recruitment mechanism that adaptively allocates semantic blocks as needed, eliminating the requirement for complete domain knowledge at initialization, and (3) a hierarchical recruitment framework that extends capacity allocation to entire specialized LLMs, enabling multi-granularity architectural adaptation. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, dynamic rule injection, and principled recruitment criteria based on penalized likelihood with explicit units. We provide rigorous mathematical results establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks at stationary points, with exact bounds on attention entropy and pointer fidelity. The hierarchical recruitment mechanism provides convergence guarantees at both the block level (fine-grained, within-LLM) and the LLM level (coarse-grained, cross-domain), ensuring the system discovers semantic partitions that balance model complexity against data encoding efficiency. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes while adapting architectural capacity at multiple granularities, supporting applications in regulated domains requiring both transparency and capability.

Comment: Model Architecture/Sparsity: introduces a tunable locality dial and information-theoretic recruitment with group sparsity on attention for adaptive interpretable-to-distributed encodings.

Relevance: 9 Novelty: 8

12. Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

ArXiv ID: 2510.17196

Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu

Abstract: Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.

Comment: Model Architecture/Sparsity: analyzes and improves hierarchical sparse attention for extreme length generalization with key design principles and theory for chunk encoding/residual bypass.

Relevance: 9 Novelty: 8
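
The retrieval step shared by chunk-based sparse attention models can be sketched as follows. This is a simplified illustration: mean pooling stands in for the paper's non-linear Chunk Encoder with a CLS token, and the function names, `chunk_size`, and `top_k` are hypothetical. Each chunk is summarized by one landmark vector, and the query attends only to the few chunks whose landmarks score highest.

```python
def chunk_landmarks(keys, chunk_size):
    """Mean-pool each chunk of key vectors into one landmark vector."""
    landmarks = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        dim = len(chunk[0])
        landmarks.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return landmarks

def select_chunks(query, landmarks, top_k):
    """Indices of the top_k chunks whose landmarks score highest against the query."""
    scores = [sum(q * l for q, l in zip(query, lm)) for lm in landmarks]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
landmarks = chunk_landmarks(keys, chunk_size=2)
print(select_chunks([1.0, 0.5], landmarks, top_k=2))  # prints [0, 1]
```

Keeping `top_k` fixed while the context grows is what keeps the per-query cost bounded. The abstract's third principle says the same sparsity should be enforced during pre-training so that training and test conditions match.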

13. Fighter: Unveiling the Graph Convolutional Nature of Transformers in Time Series Modeling

ArXiv ID: 2510.17106

Authors: Chen Zhang, Weixin Bu, Wendong Xu, Runsheng Yu, Yik-Chung Wu, Ngai Wong

Abstract: Transformers have achieved remarkable success in time series modeling, yet their internal mechanisms remain opaque. This work demystifies the Transformer encoder by establishing its fundamental equivalence to a Graph Convolutional Network (GCN). We show that in the forward pass, the attention distribution matrix serves as a dynamic adjacency matrix, and its composition with subsequent transformations performs computations analogous to graph convolution. Moreover, we demonstrate that in the backward pass, the update dynamics of value and feed-forward projections mirror those of GCN parameters. Building on this unified theoretical reinterpretation, we propose Fighter (Flexible Graph Convolutional Transformer), a streamlined architecture that removes redundant linear projections and incorporates multi-hop graph aggregation. This perspective yields an explicit and interpretable representation of temporal dependencies across different scales, naturally expressed as graph edges. Experiments on standard forecasting benchmarks confirm that Fighter achieves competitive performance while providing clearer mechanistic interpretability of its predictions.

Comment: Model Architecture and Analysis: proves equivalence between Transformer attention and GCNs in time series, and introduces a streamlined graph-convolutional Transformer (Fighter).

Relevance: 9 Novelty: 8
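
The reading of attention as a dynamic adjacency matrix can be made concrete with a small sketch. It is an assumption-laden illustration rather than the Fighter architecture: there are no learned projections, and summing the hops in `multi_hop_aggregate` (a hypothetical name) is just one possible way to combine them.

```python
import math


def softmax_rows(scores):
    """Normalize each row so that it sums to one."""
    out = []
    for row in scores:
        top = max(row)
        exps = [math.exp(s - top) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def attention_adjacency(x):
    """Row-stochastic matrix built from scaled dot products between time steps."""
    scale = math.sqrt(len(x[0]))
    scores = [[sum(p * q for p, q in zip(xi, xj)) / scale for xj in x] for xi in x]
    return softmax_rows(scores)


def multi_hop_aggregate(x, hops):
    """Sum A x, A^2 x, ..., A^hops x, i.e. aggregate over 1..hops-step neighbours."""
    adj = attention_adjacency(x)
    propagated = x
    total = [[0.0] * len(x[0]) for _ in x]
    for _ in range(hops):
        propagated = matmul(adj, propagated)
        total = [[t + p for t, p in zip(tr, pr)] for tr, pr in zip(total, propagated)]
    return adj, total


# Three time steps with 2-d features.
series = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency, aggregated = multi_hop_aggregate(series, hops=2)
```

Each row of `adjacency` can be read as the edge weights from one time step to all others, which is the sense in which temporal dependencies become graph edges.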

14. CTR-LoRA: Curvature-Aware and Trust-Region Guided Low-Rank Adaptation for Large Language Models

ArXiv ID: 2510.15962

Authors: Zhuxuanzi Wang, Mingqiao Mo, Xi Xiao, Chen Liu, Chenrui Ma, Yunbei Zhang, Xiao Wang, Smita Krishnaswamy, Tianyang Wang

Abstract: Parameter-efficient fine-tuning (PEFT) has become the standard approach for adapting large language models under limited compute and memory budgets. Although previous methods improve efficiency through low-rank updates, quantization, or heuristic budget reallocation, they often decouple the allocation of capacity from the way updates evolve during training. In this work, we introduce CTR-LoRA, a framework guided by curvature trust region that integrates rank scheduling with stability-aware optimization. CTR-LoRA allocates parameters based on marginal utility derived from lightweight second-order proxies and constrains updates using a Fisher/Hessian-metric trust region. Experiments on multiple open-source backbones (7B-13B), evaluated on both in-distribution and out-of-distribution benchmarks, show consistent improvements over strong PEFT baselines. In addition to increased accuracy, CTR-LoRA enhances training stability, reduces memory requirements, and achieves higher throughput, positioning it on the Pareto frontier of performance and efficiency. These results highlight a principled path toward more robust and deployable PEFT.

Comment: Model Compression/Efficiency: PEFT via curvature-aware trust-region LoRA with adaptive rank scheduling using second-order proxies; stability and throughput gains.

Relevance: 9 Novelty: 8
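
A trust region measured in a Fisher metric can be illustrated with a few lines of code. The sketch assumes a diagonal Fisher approximation, which the abstract does not specify, and the function names are hypothetical; it shows only the constraint on the update, not the rank scheduling.

```python
import math


def fisher_norm(step, fisher_diag):
    """Length of a step under a diagonal Fisher metric: sqrt(sum_i f_i * step_i^2)."""
    return math.sqrt(sum(f * s * s for f, s in zip(fisher_diag, step)))


def clip_to_trust_region(step, fisher_diag, radius):
    """Rescale the step so that its Fisher-metric length does not exceed radius."""
    norm = fisher_norm(step, fisher_diag)
    if norm <= radius or norm == 0.0:
        return list(step)
    scale = radius / norm
    return [scale * s for s in step]


# Illustrative values: a proposed update of length 5 is shrunk to length 1.
proposed = [3.0, 4.0]
fisher = [1.0, 1.0]
clipped = clip_to_trust_region(proposed, fisher, radius=1.0)
```

Directions with large Fisher entries count for more in the norm, so the same radius permits smaller moves along parameters the model is most sensitive to.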

15. Glyph: Scaling Context Windows via Visual-Text Compression

ArXiv ID: 2510.17800

Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang

Abstract: Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.

Comment: Model Compression and Efficiency: compresses long textual context via visual rendering to reduce tokens and compute, yielding faster prefilling/decoding and SFT.

Relevance: 9 Novelty: 8

16. Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity

ArXiv ID: 2510.15964

Authors: Tuowei Wang, Kun Li, Zixu Hao, Donglin Bai, Ju Ren, Yaoxue Zhang, Ting Cao, Mao Yang

Abstract: The adaptation of pre-trained large language models (LLMs) to diverse downstream tasks via fine-tuning is critical for numerous applications. However, the inefficiency of parameter-efficient fine-tuning (PEFT) techniques presents significant challenges in terms of time investments and operational costs. In this paper, we first introduce a nuanced form of sparsity, termed Shadowy Sparsity, which is distinctive in fine-tuning and has not been adequately addressed for acceleration. Under Shadowy Sparsity, we propose Long Exposure, an efficient system to accelerate PEFT for LLMs. Long Exposure comprises three key components: Shadowy-sparsity Exposer employs a prolonged sensing range to capture more sparsity details under shadowy sparsity; Sequence-oriented Predictor provides efficient yet accurate predictions to handle large sequence inputs and constantly-evolving parameters; and Dynamic-aware Operator facilitates more structured computational patterns and coalesced memory accesses, addressing dynamic sparse operations. Extensive evaluations show that Long Exposure outperforms state-of-the-art systems with up to a $2.49\times$ speedup in end-to-end fine-tuning, offering promising advancements in accelerating PEFT for LLMs.

Comment: Model Compression and Efficiency/HPC: exploits fine-tuning-time sparsity with dynamic sparse operators and predictors to accelerate PEFT.

Relevance: 9 Novelty: 8

17. MuonBP: Faster Muon via Block-Periodic Orthogonalization

ArXiv ID: 2510.16981

Authors: Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, Youngsuk Park

Abstract: Gradient orthogonalization is a simple strategy that shows great utility in speeding up gradient descent. The Muon optimizer (Jordan, Jin, et al., 2024) combines gradient orthogonalization with first-order momentum and achieves significant improvement in data efficiency over Adam/AdamW (Loshchilov and Hutter, 2019) for language model training. However, when using model parallelism, gradient orthogonalization introduces additional overhead compared to coordinate-wise optimizers (such as AdamW) due to additional gather and scatter operations on gradient matrix shards from different devices. This additional communication can amount to a throughput hit of 5%-10% compared to Adam/AdamW. To remedy this, we propose Muon with Block-Periodic Orthogonalization (MuonBP), which applies orthogonalization independently to matrix shards on each device and periodically performs full orthogonalization to maintain training stability at scale. We show how to adjust the learning rate from the baseline to MuonBP and give convergence guarantees for this algorithm. Crucially, our theory dictates that we use two stepsizes: one for the blockwise orthogonalization steps, and one for the full orthogonalization steps. Our method is simple, requires minimal hyperparameter adjustments, and achieves competitive iteration complexity compared with baseline Muon while providing per-iteration throughput comparable to coordinate-wise methods such as AdamW. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer state sharding, MuonBP achieves 8% throughput increase compared to Muon with no degradation in performance.

Comment: High Performance Computing: distributed-friendly optimizer (block-periodic orthogonalization) reducing communication with theory and throughput gains.

Relevance: 9 Novelty: 8
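
The alternation between shard-local and full orthogonalization can be sketched as below. Several things here are assumptions and not taken from the paper: the classic cubic Newton-Schulz iteration replaces Muon's tuned polynomial, shards are taken as contiguous row blocks, momentum is omitted, and the step sizes and function names are purely illustrative.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(g, iters=30):
    """Approximate the nearest (semi-)orthogonal matrix to g.

    Uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Assumes g has full row rank and no more rows than columns.
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]  # singular values now lie in (0, 1]
    for _ in range(iters):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rx, rc)] for rx, rc in zip(x, cubic)]
    return x


def block_periodic_direction(g, step, period, num_shards, lr_full, lr_block):
    """Return (update direction, step size) for one optimizer step."""
    if step % period == 0:
        # Periodic full orthogonalization: needs the whole gradient matrix.
        return orthogonalize(g), lr_full
    rows = len(g) // num_shards  # assumes rows divide evenly across shards
    direction = []
    for s in range(num_shards):
        # Shard-local orthogonalization: no cross-device gather required.
        direction.extend(orthogonalize(g[s * rows:(s + 1) * rows]))
    return direction, lr_block


grad = [[4.0, 1.0, 0.0, 0.0],
        [1.0, 3.0, 1.0, 0.0],
        [0.0, 1.0, 3.0, 1.0],
        [0.0, 0.0, 1.0, 5.0]]
full_dir, full_lr = block_periodic_direction(grad, 0, 4, 2, lr_full=0.02, lr_block=0.01)
block_dir, block_lr = block_periodic_direction(grad, 1, 4, 2, lr_full=0.02, lr_block=0.01)
```

On block steps each shard's rows are orthonormal among themselves but not across shards, which is why the abstract pairs them with a separate step size and a periodic full step.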

18. AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution

ArXiv ID: 2510.15982

Authors: Donghyeok Shin, Yeongmin Kim, Suhyeon Jo, Byeonghu Na, Il-Chul Moon

Abstract: Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches implicitly or explicitly incorporating assistant distribution have recently been proposed. However, the past proposals of assistant distributions have been a fragmented approach without a systematic investigation of the interpolation path and the divergence. This paper proposes $\alpha$-mixture assistant distribution, a novel generalized family of assistant distributions, and $\alpha$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $\alpha$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $\alpha$, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space.

Comment: Model Compression and Efficiency: generalized assistant distribution and divergences for KD of LLMs improving stability/performance.

Relevance: 9 Novelty: 8

19. FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern

ArXiv ID: 2510.15882

Authors: Ao Shen, Rui Zhang, Junping Zhao

Abstract: As large language models (LLMs) continue to scale, multi-node deployment has become a necessity. Consequently, communication has become a critical performance bottleneck. Current intra-node communication libraries, like NCCL, typically make use of a single interconnect such as NVLink. This approach creates performance ceilings, especially on hardware like the H800 GPU where the primary interconnect's bandwidth can become a bottleneck, and leaves other hardware resources like PCIe and Remote Direct Memory Access (RDMA)-capable Network Interface Cards (NICs) largely idle during intensive workloads. We propose FlexLink, to the best of our knowledge the first collective communication framework designed to systematically address this by aggregating these heterogeneous links (NVLink, PCIe, and RDMA NICs) into a single, high-performance communication fabric. FlexLink employs an effective two-stage adaptive load balancing strategy that dynamically partitions communication traffic across all available links, ensuring that faster interconnects are not throttled by slower ones. On an 8-GPU H800 server, our design improves the bandwidth of collective operators such as AllReduce and AllGather by up to 26% and 27% over the NCCL baseline, respectively. This gain is achieved by offloading 2-22% of the total communication traffic to the previously underutilized PCIe and RDMA NICs. FlexLink provides these improvements as a lossless, drop-in replacement compatible with the NCCL API, ensuring easy adoption.

Comment: High Performance Computing: novel collective communication fabric aggregating NVLink, PCIe, and RDMA with adaptive load balancing; drop-in replacement for NCCL.

Relevance: 9 Novelty: 8
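
Why offloading a small share of traffic to slower links can raise aggregate bandwidth is easy to see with a static, bandwidth-proportional split. This is a simplified sketch and not FlexLink's two-stage adaptive strategy; the link names and bandwidth figures are made-up placeholders, not measurements.

```python
def split_traffic(total_bytes, bandwidths):
    """Give each link a share proportional to its bandwidth so all finish together."""
    total_bw = sum(bandwidths.values())
    return {name: total_bytes * bw / total_bw for name, bw in bandwidths.items()}


def transfer_time(shares, bandwidths):
    """Completion time is set by the slowest link to finish its share."""
    return max(shares[name] / bandwidths[name] for name in shares)


# Hypothetical bandwidths in arbitrary units (placeholders only).
links = {"fast_link": 3.0, "slow_link": 1.0}
shares = split_traffic(400.0, links)
combined = transfer_time(shares, links)
fast_only = 400.0 / links["fast_link"]
```

With these placeholder numbers the slow link carries a quarter of the bytes and the transfer finishes in 100 time units instead of roughly 133, because no link sits idle while another is still sending.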

20. Compressing Many-Shots in In-Context Learning

ArXiv ID: 2510.16092

Authors: Devvrit Khatri, Pranamya Kulkarni, Nilesh Gupta, Yerram Varun, Liqian Peng, Jay Yagnik, Praneeth Netrapalli, Cho-Jui Hsieh, Alec Go, Inderjit S Dhillon, Aditya Kusupati, Prateek Jain

Abstract: Large Language Models (LLMs) have been shown to be able to learn different tasks without explicit finetuning when given many input-output examples / demonstrations through In-Context Learning (ICL). Increasing the number of examples, called "shots", improves downstream task performance but incurs higher memory and computational costs. In this work, we study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts. Given many shots comprising t tokens, our goal is to generate an m soft-token summary, where m < t. We first show that existing prompt compression methods are ineffective for many-shot compression, and simply using fewer shots as a baseline is surprisingly strong. To achieve effective compression, we find that: (a) a stronger compressor model with more trainable parameters is necessary, and (b) compressing many-shot representations at each transformer layer enables more fine-grained compression by providing each layer with its own compressed representation. Based on these insights, we propose MemCom, a layer-wise compression method. We systematically evaluate various compressor models and training approaches across different model sizes (2B and 7B), architectures (Gemma and Mistral), many-shot sequence lengths (3k-6k tokens), and compression ratios (3x to 8x). MemCom outperforms strong baselines across all compression ratios on multiple classification tasks with large label sets. Notably, while baseline performance degrades sharply at higher compression ratios, often by over 20-30%, MemCom maintains high accuracy with minimal degradation, typically dropping by less than 10%.

Comment: Model Efficiency: compresses many-shot in-context prompts via layer-wise soft-token summaries to cut memory/compute during inference.

Relevance: 9 Novelty: 8

21. Neuronal Group Communication for Efficient Neural representation

ArXiv ID: 2510.16851

Authors: Zhengqi Pei, Qingming Huang, Shuhui Wang

Abstract: The ever-increasing scale of modern neural networks has brought unprecedented performance alongside daunting challenges in efficiency and interpretability. This paper addresses the core question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups rather than a monolithic collection of neural weights. Instead of treating each weight as an independent trainable parameter, NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons. This low-rank, modular representation yields compact models: groups of neurons exchange low-dimensional signals, enabling intra-group specialization and inter-group information sharing while dramatically reducing redundant parameters. By drawing on dynamical systems theory, we introduce a neuronal stability metric (analogous to Lyapunov stability) that quantifies the contraction of neuron activations toward stable patterns during sequence processing. Using this metric, we reveal that emergent reasoning capabilities correspond to an external driving force or "potential", which nudges the neural dynamics away from trivial trajectories while preserving stability. Empirically, we instantiate NGC in large language models (LLMs) and demonstrate improved performance on complex reasoning benchmarks under moderate compression. NGC consistently outperforms standard low-rank approximations and cross-layer basis-sharing methods at comparable compression rates. We conclude by discussing the broader implications of NGC, including how structured neuronal group dynamics might relate to generalization in high-dimensional learning systems.

Comment: Matches Model Architecture and Compression/Efficiency: proposes low-rank, group-based neuronal communication with a stability metric, improving compactness and modularity.

Relevance: 9 Novelty: 8

22. Algorithmic Primitives and Compositional Geometry of Reasoning in Language Models

ArXiv ID: 2510.15987

Authors: Samuel Lippl, Thomas McGee, Kimberly Lopez, Ziwen Pan, Pierce Zhang, Salma Ziadi, Oliver Eberle, Ida Momennejad

Abstract: How do latent and inference time computations enable large language models (LLMs) to solve multi-step reasoning? We introduce a framework for tracing and steering algorithmic primitives that underlie model reasoning. Our approach links reasoning traces to internal activation patterns and evaluates algorithmic primitives by injecting them into residual streams and measuring their effect on reasoning steps and task performance. We consider four benchmarks: Traveling Salesperson Problem (TSP), 3SAT, AIME, and graph navigation. We operationalize primitives by clustering neural activations and labeling their matched reasoning traces. We then apply function vector methods to derive primitive vectors as reusable compositional building blocks of reasoning. Primitive vectors can be combined through addition, subtraction, and scalar operations, revealing a geometric logic in activation space. Cross-task and cross-model evaluations (Phi-4, Phi-4-Reasoning, Llama-3-8B) show both shared and task-specific primitives. Notably, comparing Phi-4 with its reasoning-finetuned variant highlights compositional generalization after finetuning: Phi-4-Reasoning exhibits more systematic use of verification and path-generation primitives. Injecting the associated primitive vectors in Phi-4-Base induces behavioral hallmarks associated with Phi-4-Reasoning. Together, these findings demonstrate that reasoning in LLMs may be supported by a compositional geometry of algorithmic primitives, that primitives transfer cross-task and cross-model, and that reasoning finetuning strengthens algorithmic generalization across domains.

Comment: Matches Representation Learning/Mechanistic Interpretability: identifies and steers compositional activation primitives underlying LLM reasoning via function vectors.

Relevance: 9 Novelty: 8

23. Infinite Neural Operators: Gaussian processes on functions

ArXiv ID: 2510.16675

Authors: Daniel Augusto de Souza, Yuchen Zhu, Harry Jake Cunningham, Yuri Saporito, Diego Mesquita, Marc Peter Deisenroth

Abstract: A variety of infinitely wide neural architectures (e.g., dense NNs, CNNs, and transformers) induce Gaussian process (GP) priors over their outputs. These relationships provide both an accurate characterization of the prior predictive distribution and enable the use of GP machinery to improve the uncertainty quantification of deep neural networks. In this work, we extend this connection to neural operators (NOs), a class of models designed to learn mappings between function spaces. Specifically, we show conditions for when arbitrary-depth NOs with Gaussian-distributed convolution kernels converge to function-valued GPs. Based on this result, we show how to compute the covariance functions of these NO-GPs for two NO parametrizations, including the popular Fourier neural operator (FNO). With this, we compute the posteriors of these GPs in regression scenarios, including PDE solution operators. This work is an important step towards uncovering the inductive biases of current FNO architectures and opens a path to incorporate novel inductive biases for use in kernel-based operator learning methods.

Comment: Matches Model Architecture Theory: establishes GP limits for neural operators (incl. FNO), enabling kernel-based operator learning with computed covariances/posteriors.

Relevance: 9 Novelty: 8

24. Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares

ArXiv ID: 2510.17506

Authors: Lachlan Ewen MacDonald, Hancheng Min, Leandro Palma, Salma Tarmoun, Ziqing Xu, René Vidal

Abstract: Classical optimisation theory guarantees monotonic objective decrease for gradient descent (GD) when employed in a small step size, or "stable", regime. In contrast, gradient descent on neural networks is frequently performed in a large step size regime called the "edge of stability", in which the objective decreases non-monotonically with an observed implicit bias towards flat minima. In this paper, we take a step toward quantifying this phenomenon by providing convergence rates for gradient descent with large learning rates in an overparametrised least squares setting. The key insight behind our analysis is that, as a consequence of overparametrisation, the set of global minimisers forms a Riemannian manifold $M$, which enables the decomposition of the GD dynamics into components parallel and orthogonal to $M$. The parallel component corresponds to Riemannian gradient descent on the objective sharpness, while the orthogonal component is a bifurcating dynamical system. This insight allows us to derive convergence rates in three regimes characterised by the learning rate size: (a) the subcritical regime, in which transient instability is overcome in finite time before linear convergence to a suboptimally flat global minimum; (b) the critical regime, in which instability persists for all time with a power-law convergence toward the optimally flat global minimum; and (c) the supercritical regime, in which instability persists for all time with linear convergence to an orbit of period two centred on the optimally flat global minimum.

Comment: Matches Training Dynamics Theory: convergence rates and regimes for GD at edge of stability via manifold-based decomposition in overparameterized least squares.

Relevance: 9 Novelty: 8

25. Diverse Influence Component Analysis: A Geometric Approach to Nonlinear Mixture Identifiability

ArXiv ID: 2510.17040

Authors: Hoang-Son Nguyen, Xiao Fu

Abstract: Latent component identification from unknown nonlinear mixtures is a foundational challenge in machine learning, with applications in tasks such as disentangled representation learning and causal inference. Prior work in nonlinear independent component analysis (nICA) has shown that auxiliary signals -- such as weak supervision -- can support identifiability of conditionally independent latent components. More recent approaches explore structural assumptions, e.g., sparsity in the Jacobian of the mixing function, to relax such requirements. In this work, we introduce Diverse Influence Component Analysis (DICA), a framework that exploits the convex geometry of the mixing function's Jacobian. We propose a Jacobian Volume Maximization (J-VolMax) criterion, which enables latent component identification by encouraging diversity in their influence on the observed variables. Under reasonable conditions, this approach achieves identifiability without relying on auxiliary information, latent component independence, or Jacobian sparsity assumptions. These results extend the scope of identifiability analysis and offer a complementary perspective to existing methods.

Comment: Matches Representation Learning/Identifiability: introduces Jacobian Volume Maximization to identify nonlinear latent components without auxiliary signals or sparsity assumptions.

Relevance: 9 Novelty: 8

26. SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference

ArXiv ID: 2510.17189

Authors: Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Yongpan Liu

Abstract: Transformers have shown remarkable performance in both natural language processing (NLP) and computer vision (CV) tasks. However, their real-time inference speed and efficiency are limited due to the inefficiency in Softmax and Layer Normalization (LayerNorm). Previous works based on function approximation suffer from inefficient implementation as they place emphasis on computation while disregarding memory overhead concerns. Moreover, such methods rely on retraining to compensate for approximation error which can be costly and inconvenient. In this paper, we present SOLE, a hardware-software co-design for Softmax and LayerNorm which is composed of E2Softmax and AILayerNorm. E2Softmax utilizes log2 quantization of exponent function and log-based division to approximate Softmax while AILayerNorm adopts low-precision statistic calculation. Compared with state-of-the-art designs, we achieve both low-precision calculation and low bit-width storage on Softmax and LayerNorm. Experiments show that SOLE maintains inference accuracy without retraining while offering orders of magnitude speedup and energy savings over GPU, achieving 3.04x, 3.86x energy-efficiency improvements and 2.82x, 3.32x area-efficiency improvements over prior state-of-the-art custom hardware for Softmax and LayerNorm, respectively.

Comment: High Performance Computing / Efficiency: hardware-software co-design of Softmax and LayerNorm (E2Softmax, AILayerNorm) with low-precision arithmetic and no retraining.

Relevance: 9 Novelty: 7
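
The idea of quantizing the exponential to a power of two can be illustrated with a small software sketch. It is an assumption-based approximation of the general technique, not the E2Softmax design: the final normalization uses ordinary floating-point division where the paper describes log-based division, and `softmax_pow2` is a hypothetical name.

```python
import math


def softmax_exact(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_pow2(xs):
    """Softmax with exp(x) replaced by 2 ** round(x * log2(e)).

    Each numerator becomes a power of two, so in hardware it could be
    stored as a small integer exponent and applied with a shift.
    """
    top = max(xs)
    exponents = [round((x - top) * math.log2(math.e)) for x in xs]  # integers <= 0
    powers = [2.0 ** k for k in exponents]
    total = sum(powers)
    return [p / total for p in powers]


logits = [2.0, 1.0, 0.1]
approx = softmax_pow2(logits)
exact = softmax_exact(logits)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
```

The approximation preserves the ordering of the probabilities while introducing a bounded error, which is the kind of trade-off that allows low bit-width storage without retraining.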

27. QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

ArXiv ID: 2510.16292

Authors: Yutong Wang, Haiyu Wang, Sai Qian Zhang

Abstract: Vision-Language Models (VLMs) are integral to tasks such as image captioning and visual question answering, but their high computational cost, driven by large memory footprints and processing time, limits their scalability and real-time applicability. In this work, we propose leveraging Singular-Value Decomposition (SVD) over the joint query (Q), key (K), and value (V) weight matrices to reduce KV cache size and computational overhead. We additionally introduce an efficient rank allocation strategy that dynamically adjusts the SVD rank based on its impact on VLM accuracy, achieving a significant reduction in both memory usage and computational cost. Finally, we extend this approach by applying quantization to both VLM weights and activations, resulting in a highly efficient VLM. Our method outperforms previous approaches that rely solely on quantization or SVD by achieving more than 10% accuracy improvement while consuming less hardware cost, making it better for real-time deployment on resource-constrained devices. We open source our code at https://github.com/SAI-Lab-NYU/QSVD.

Comment: Model Compression and Efficiency: unified low-rank SVD across Q/K/V with rank allocation and joint quantization to reduce KV cache and compute in VLMs.

Relevance: 9 Novelty: 7

28. ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models

ArXiv ID: 2510.17197

Authors: Pu Zhang, Yuwei Li, Xingyuan Xian, Guoming Tang

Abstract: As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.

Comment: Model Compression and Efficiency: zero-shot, prompt-aware visual token pruning for VLMs to reduce inference cost while preserving task-relevant content.

Relevance: 9 Novelty: 7
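
The two-stage selection, relevance first and diversity second, can be sketched as follows. The scoring choices are assumptions made for illustration (cosine similarity to a single prompt vector, then a greedy pick of the token least similar to those already kept) and may differ from the paper's; `prune_tokens` is a hypothetical name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_tokens(tokens, prompt, n_core, n_diverse):
    """Keep n_core prompt-relevant tokens, then add n_diverse dissimilar ones."""
    order = sorted(range(len(tokens)), key=lambda i: cosine(tokens[i], prompt), reverse=True)
    kept = order[:n_core]
    remaining = order[n_core:]
    for _ in range(min(n_diverse, len(remaining))):
        # Pick the remaining token whose closest kept neighbour is least similar.
        best = min(
            remaining,
            key=lambda i: max((cosine(tokens[i], tokens[j]) for j in kept), default=0.0),
        )
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)


# Toy visual tokens and a prompt embedding (illustrative values).
visual_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
prompt_vec = [1.0, 0.0]
kept_indices = prune_tokens(visual_tokens, prompt_vec, n_core=1, n_diverse=1)
```

In this toy case the near-duplicate of the most relevant token is dropped in favour of a token pointing in the opposite direction, which is the intended effect of the diversity stage.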

29. Bitwidth-Specific Logarithmic Arithmetic for Future Hardware-Accelerated Training

ArXiv ID: 2510.17058

Authors: Hassan Hamad, Yuou Qiu, Peter A. Beerel, Keith M. Chugg

Abstract: While advancements in quantization have significantly reduced the computational costs of inference in deep learning, training still predominantly relies on complex floating-point arithmetic. Low-precision fixed-point training presents a compelling alternative. This work introduces a novel enhancement in low-precision logarithmic fixed-point training, geared towards future hardware accelerator designs. We propose incorporating bitwidth in the design of approximations to arithmetic operations. To this end, we introduce a new hardware-friendly, piece-wise linear approximation for logarithmic addition. Using simulated annealing, we optimize this approximation at different precision levels. A C++ bit-true simulation demonstrates training of VGG-11 and VGG-16 models on CIFAR-100 and TinyImageNet, respectively, using 12-bit integer arithmetic with minimal accuracy degradation compared to 32-bit floating-point training. Our hardware study reveals up to 32.5% reduction in area and 53.5% reduction in energy consumption for the proposed LNS multiply-accumulate units compared to that of linear fixed-point equivalents.

Comment: Matches Compression and Efficiency: bitwidth-specific logarithmic arithmetic with hardware-friendly piecewise-linear addition enabling low-precision training.

Relevance: 9 Novelty: 7
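
In a logarithmic number system a value is stored as its base-2 logarithm, so multiplication becomes addition, while addition needs the correction term log2(1 + 2^(-d)) with d the gap between the two exponents. The sketch below approximates that term with a piecewise-linear table. The breakpoints are chosen by hand for illustration, not optimized by simulated annealing as in the paper, and floats are used in place of the paper's 12-bit integer arithmetic.

```python
import math


def delta_exact(d):
    """Exact correction term for LNS addition, with d = |a - b| >= 0."""
    return math.log2(1.0 + 2.0 ** (-d))


def make_pwl(breakpoints):
    """Tabulate delta at the breakpoints and interpolate linearly in between."""
    table = [(d, delta_exact(d)) for d in breakpoints]

    def delta_pwl(d):
        if d >= table[-1][0]:
            return 0.0  # beyond the last breakpoint the smaller operand is ignored
        for (d0, y0), (d1, y1) in zip(table, table[1:]):
            if d0 <= d <= d1:
                return y0 + (y1 - y0) * (d - d0) / (d1 - d0)
        raise ValueError("d must be non-negative")

    return delta_pwl


def lns_add(a, b, delta):
    """Return log2(2**a + 2**b) given the exponents a and b."""
    return max(a, b) + delta(abs(a - b))


delta_pwl = make_pwl([0.0, 1.0, 2.0, 4.0, 8.0])  # hand-picked breakpoints
same = lns_add(3.0, 3.0, delta_pwl)              # 2**3 + 2**3 = 2**4
worst = max(abs(delta_pwl(i / 100) - delta_exact(i / 100)) for i in range(1001))
```

Fewer segments mean a smaller lookup table but a larger worst-case error, which is why tailoring the approximation to the target bitwidth is a natural design knob.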

30. Symmetry and Generalisation in Neural Approximations of Renormalisation Transformations

ArXiv ID: 2510.16591

Authors: Cassidy Ashworth, Pietro Liò, Francesco Caso

Abstract: Deep learning models have proven enormously successful at using multiple layers of representation to learn relevant features of structured data. Encoding physical symmetries into these models can improve performance on difficult tasks, and recent work has motivated the principle of parameter symmetry breaking and restoration as a unifying mechanism underlying their hierarchical learning dynamics. We evaluate the role of parameter symmetry and network expressivity in the generalisation behaviour of neural networks when learning a real-space renormalisation group (RG) transformation, using the central limit theorem (CLT) as a test case map. We consider simple multilayer perceptrons (MLPs) and graph neural networks (GNNs), and vary weight symmetries and activation functions across architectures. Our results reveal a competition between symmetry constraints and expressivity, with overly complex or overconstrained models generalising poorly. We analytically demonstrate this poor generalisation behaviour for certain constrained MLP architectures by recasting the CLT as a cumulant recursion relation and making use of an established framework to propagate cumulants through MLPs. We also empirically validate an extension of this framework from MLPs to GNNs, elucidating the internal information processing performed by these more complex models. These findings offer new insight into the learning dynamics of symmetric networks and their limitations in modelling structured physical transformations.

Comment: Representation Learning/Training Dynamics: analyzes symmetry constraints and expressivity in MLPs/GNNs for learning RG transformations, with theoretical and empirical insights.

Relevance: 8 Novelty: 8

31. Just-In-Time Piecewise-Linear Semantics for ReLU-type Networks

ArXiv ID: 2510.17622

Authors: Hongyi Duan, Haoyang Liu, Jian'an Zhang, Fengrui Liu, Yiyi Wang

Abstract: We present a JIT PL semantics for ReLU-type networks that compiles models into a guarded CPWL transducer with shared guards. The system adds hyperplanes only when operands are affine on the current cell, maintains global lower/upper envelopes, and uses a budgeted branch-and-bound. We obtain anytime soundness, exactness on fully refined cells, monotone progress, guard-linear complexity (avoiding global $\binom{k}{2}$), dominance pruning, and decidability under finite refinement. The shared carrier supports region extraction, decision complexes, Jacobians, exact/certified Lipschitz, LP/SOCP robustness, and maximal causal influence. A minimal prototype returns certificates or counterexamples with cost proportional to visited subdomains.

Comment: Model Analysis/Verification: JIT piecewise-linear semantics for ReLU networks enabling exact/approx certificates, Lipschitz, robustness—foundational network semantics.

Relevance: 8 Novelty: 8

32. Asymptotically Stable Quaternion-valued Hopfield-structured Neural Network with Periodic Projection-based Supervised Learning Rules

ArXiv ID: 2510.16607

Authors: Tianwei Wang, Xinhui Ma, Wei Pang

Abstract: Motivated by the geometric advantages of quaternions in representing rotations and postures, we propose a quaternion-valued supervised learning Hopfield-structured neural network (QSHNN) with a fully connected structure inspired by the classic Hopfield neural network (HNN). Starting from a continuous-time dynamical model of HNNs, we extend the formulation to the quaternionic domain and establish the existence and uniqueness of fixed points with asymptotic stability. For the learning rules, we introduce a periodic projection strategy that modifies standard gradient descent by periodically projecting each 4×4 block of the weight matrix onto the closest quaternionic structure in the least-squares sense. This approach preserves both convergence and quaternionic consistency throughout training. Benefiting from this rigorous mathematical foundation, the experimental model implementation achieves high accuracy, fast convergence, and strong reliability across randomly generated target sets. Moreover, the evolution trajectories of the QSHNN exhibit well-bounded curvature, i.e., sufficient smoothness, which is crucial for applications such as control systems or path planning modules in robotic arms, where joint postures are parameterized by quaternion neurons. Beyond these application scenarios, the proposed model offers a practical implementation framework and a general mathematical methodology for designing neural networks under hypercomplex or non-commutative algebraic structures.

Comment: Model Architecture: quaternion-valued Hopfield-type network with projection-based learning and stability guarantees.

Relevance: 8 Novelty: 8
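
Illustrative sketch (not the authors' code): the projection step described in the abstract above can be pictured as an orthogonal projection of each 4×4 real block onto the four-dimensional subspace of matrices that represent quaternion multiplication. The left-multiplication convention used below is an assumption, as are the names `quat_matrix`, `project_block`, and `project_weights`; the paper may use a different but equivalent layout.

```python
import numpy as np


def quat_matrix(a, b, c, d):
    """Real 4x4 matrix of left multiplication by q = a + b*i + c*j + d*k."""
    return np.array(
        [[a, -b, -c, -d],
         [b,  a, -d,  c],
         [c,  d,  a, -b],
         [d, -c,  b,  a]],
        dtype=float,
    )


# The four basis matrices have disjoint supports, so they are mutually
# orthogonal under the Frobenius inner product, each with squared norm 4.
BASIS = [quat_matrix(*e) for e in np.eye(4)]


def project_block(block):
    """Least-squares (Frobenius) projection of a 4x4 block onto the basis span."""
    coeffs = [np.sum(block * e) / 4.0 for e in BASIS]
    return quat_matrix(*coeffs)


def project_weights(weights):
    """Project every 4x4 block of a weight matrix whose sides are multiples of 4."""
    out = np.empty_like(weights, dtype=float)
    for i in range(0, weights.shape[0], 4):
        for j in range(0, weights.shape[1], 4):
            out[i:i + 4, j:j + 4] = project_block(weights[i:i + 4, j:j + 4])
    return out
```

In a training loop, `project_weights` would be called every few gradient steps rather than after each one, which is how the "periodic" part of the strategy might be realised.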

33. Deeper with Riemannian Geometry: Overcoming Oversmoothing and Oversquashing for Graph Foundation Models

ArXiv ID: 2510.17457

Authors: Li Sun, Zhenhao Huang, Ming Zhang, Philip S. Yu

Abstract: Message Passing Neural Networks (MPNNs) are the building blocks of graph foundation models, but they fundamentally suffer from oversmoothing and oversquashing. There has recently been a surge of interest in fixing both issues. Existing efforts primarily adopt global approaches, which may be beneficial in some regions but detrimental in others, ultimately leading to suboptimal expressiveness. In this paper, we begin by revisiting oversquashing through a global measure -- spectral gap $\lambda$ -- and prove that the increase of $\lambda$ leads to gradient vanishing with respect to the input features, thereby undermining the effectiveness of message passing. Motivated by such theoretical insights, we propose a local approach that adaptively adjusts message passing based on local structures. To achieve this, we connect local Riemannian geometry with MPNNs, and establish a novel nonhomogeneous boundary condition to address both oversquashing and oversmoothing. Building on the Robin condition, we design a GBN network with local bottleneck adjustment, coupled with theoretical guarantees. Extensive experiments on homophilic and heterophilic graphs show the expressiveness of GBN. Furthermore, GBN does not exhibit performance degradation even when the network depth exceeds $256$ layers.

Comment: Model Architecture/Representation Learning: local Riemannian approach addressing oversmoothing/oversquashing with theoretical guarantees for deep MPNNs.

Relevance: 8 Novelty: 8
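
Illustrative sketch (not from the paper): the abstract above uses the spectral gap $\lambda$ as a global measure of oversquashing. The snippet below assumes the common definition of $\lambda$ as the second-smallest eigenvalue of the symmetric normalized Laplacian; the paper may normalise differently. The function name `spectral_gap` is hypothetical.

```python
import numpy as np


def spectral_gap(adj):
    """Second-smallest eigenvalue of L = I - D^{-1/2} A D^{-1/2}.

    Assumes an undirected graph (symmetric adjacency) with no isolated nodes.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(lap)  # returned in ascending order
    return eigvals[1]


# A 4-node path (a bottlenecked graph) versus the 4-node complete graph.
path4 = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
complete4 = np.ones((4, 4)) - np.eye(4)
gap_path, gap_complete = spectral_gap(path4), spectral_gap(complete4)
```

The path graph has the smaller gap, matching the usual reading that a small $\lambda$ signals a structural bottleneck.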

34. On the Impossibility of Retrain Equivalence in Machine Unlearning

ArXiv ID: 2510.16629

Authors: Jiatong Yu, Yinghui He, Anirudh Goyal, Sanjeev Arora

Abstract: Machine unlearning seeks to selectively remove the "influence" of specific training data on a model's outputs. The ideal goal is Retrain Equivalence--behavior identical to a model trained from scratch on only the retained data. This goal was formulated for models trained on i.i.d. data batches, but modern pipelines often involve multi-stage training, with each stage having a distinct data distribution and objective. Examples include LLM fine-tuning for alignment, reasoning ability, etc. Our study shows via theory and experiments that this shift to multi-stage training introduces a fundamental barrier for machine unlearning. The theory indicates that the outcome of local unlearning--methods that only use gradients computed on the forget set--is path-dependent. That is, a model's behavior during unlearning is influenced by the order of its training stages during learning, making it impossible for path-oblivious algorithms to universally achieve Retrain Equivalence. We empirically demonstrate the same phenomenon in LLM post-training across Llama and Qwen models (1B to 14B) with gradient ascent, NPO, and SimNPO local unlearning algorithms. Models fine-tuned via different orderings of identical training stages diverge in behavior during unlearning, with the degradation in GSM8K accuracy after unlearning varying by over 20% across paths. We also observe that some learning paths consistently produce models that unlearn slowly. During unlearning, whether the probability mass gets squeezed into paraphrasing or alternative concepts is also path-dependent. These results consistently show that Retrain Equivalence is an ill-posed target for local unlearning algorithms, so long as the target models are trained in stages. In situations where access to models' training histories is hard, the current work calls for rethinking the definition and desiderata of machine unlearning.

Comment: Representation Learning/Training Dynamics: theoretical impossibility result for retrain equivalence in multi-stage training, highlighting path dependence of local unlearning.

Relevance: 8 Novelty: 8

35. How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime?

ArXiv ID: 2510.17526

Authors: Wei Huang, Andi Han, Yujin Song, Yilan Chen, Denny Wu, Difan Zou, Taiji Suzuki

Abstract: The capacity of deep learning models is often large enough to both learn the underlying statistical signal and overfit to noise in the training set. This noise memorization can be harmful especially for data with a low signal-to-noise ratio (SNR), leading to poor generalization. Inspired by prior observations that label noise provides implicit regularization that improves generalization, in this work, we investigate whether introducing label noise to the gradient updates can enhance the test performance of neural networks (NNs) in the low SNR regime. Specifically, we consider training a two-layer NN with a simple label noise gradient descent (GD) algorithm, in an idealized signal-noise data setting. We prove that adding label noise during training suppresses noise memorization, preventing it from dominating the learning process; consequently, label noise GD enjoys rapid signal growth while the overfitting remains controlled, thereby achieving good generalization despite the low SNR. In contrast, we also show that an NN trained with standard GD tends to overfit to noise in the same low SNR setting and establish a non-vanishing lower bound on its test error, thus demonstrating the benefit of introducing label noise in gradient-based training.

Comment: Matches Representation Learning/Training Dynamics Theory: proves label-noise gradient descent suppresses noise memorization and improves generalization in low SNR.

Relevance: 8 Novelty: 8
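
Illustrative sketch (not the authors' algorithm): one way to picture label noise GD from the abstract above is to flip each ±1 label independently with some probability at every step, then take an ordinary gradient step on the noisy labels. The ReLU activation, logistic loss, fixed second layer `a`, and the names `forward` and `label_noise_gd_step` are all assumptions made for this example.

```python
import numpy as np


def forward(W, a, X):
    """Two-layer network f(x) = sum_j a_j * relu(w_j . x); only W is trained."""
    return np.maximum(X @ W.T, 0.0) @ a


def label_noise_gd_step(W, a, X, y, lr, flip_prob, rng):
    """One full-batch GD step on logistic loss with freshly flipped labels."""
    # Resample the label noise at every step: flip each +/-1 label w.p. flip_prob.
    flips = rng.random(y.shape[0]) < flip_prob
    y_noisy = np.where(flips, -y, y)

    pre = X @ W.T                       # pre-activations, shape (n, m)
    f = np.maximum(pre, 0.0) @ a        # network outputs, shape (n,)
    # d/df log(1 + exp(-y f)) = -y / (1 + exp(y f))
    g = -y_noisy / (1.0 + np.exp(y_noisy * f))
    # Chain rule through the ReLU gate and the fixed second layer.
    grad_W = ((g[:, None] * (pre > 0)) * a[None, :]).T @ X / X.shape[0]
    return W - lr * grad_W
```

Setting `flip_prob=0.0` recovers standard GD, which gives a direct baseline for the comparison the abstract describes.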

36. Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation

ArXiv ID: 2510.17120

Authors: Rishi Sonthalia, Raj Rao Nadakuditi

Abstract: We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension × batch size). From the standpoint of free probability and random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random matrix with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricial free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.

Comment: Matches Model Architecture/Regularization: introduces matricial free energy loss from free probability to Gaussianize autoencoder codes.

Relevance: 8 Novelty: 8

37. Computational Budget Should Be Considered in Data Selection

ArXiv ID: 2510.16806

Authors: Weilin Wan, Weizhong Zhang, Cheng Jin

Abstract: Data selection improves computational efficiency by choosing informative subsets of training samples. However, existing methods ignore the compute budget, treating data selection and importance evaluation independently of compute budget constraints. Yet empirical studies show no algorithm can consistently outperform others (or even random selection) across varying budgets. We therefore argue that compute budget must be integral to data-selection strategies, since different budgets impose distinct requirements on data quantity, quality, and distribution for effective training. To this end, we propose a novel Computational budget-Aware Data Selection (CADS) method and naturally formulate it into a bilevel optimization framework, where the inner loop trains the model within the constraints of the computational budget on some selected subset of training data, while the outer loop optimizes data selection based on model evaluation. Our technical contributions lie in addressing two main challenges in solving this bilevel optimization problem: the expensive Hessian matrix estimation for outer-loop gradients and the computational burden of achieving inner-loop optimality during iterations. To solve the first issue, we propose a probabilistic reparameterization strategy and compute the gradient using a Hessian-free policy gradient estimator. To address the second challenge, we transform the inner optimization problem into a penalty term in the outer objective, further discovering that we only need to estimate the minimum of a one-dimensional loss to calculate the gradient, significantly improving efficiency. Extensive experiments show that our method achieves performance gains of up to 14.42% over baselines in vision and language benchmarks.

Comment: Matches Efficiency/Data Selection: compute-budget-aware bilevel data selection with Hessian-free gradient estimator and efficient inner-loop relaxation.

Relevance: 8 Novelty: 8
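
Illustrative sketch (not the CADS implementation): the abstract above combines a probabilistic reparameterization of the selection with a Hessian-free policy gradient estimator. A generic version of that idea, assumed here, treats each sample's inclusion as a Bernoulli variable with logit `theta` and estimates the gradient of the expected outer loss with the score-function (REINFORCE) identity. The names `score_function_gradient` and `loss_fn` are hypothetical, and the mean-loss baseline is a common variance-reduction choice rather than something stated in the source.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def score_function_gradient(theta, loss_fn, n_samples, rng):
    """Estimate d/d theta of E_{m ~ Bernoulli(sigmoid(theta))}[loss_fn(m)].

    Only loss values are needed, so no Hessian or backpropagation through
    the inner training run is required.
    """
    probs = sigmoid(theta)
    # Sample binary selection masks, one row per Monte Carlo draw.
    masks = (rng.random((n_samples, theta.size)) < probs).astype(float)
    losses = np.array([loss_fn(m) for m in masks])
    baseline = losses.mean()            # variance-reduction baseline
    scores = masks - probs              # gradient of log-probability w.r.t. logits
    return ((losses - baseline)[:, None] * scores).mean(axis=0)
```

In a data-selection setting, `loss_fn(mask)` would stand in for "train under the compute budget on the masked subset, then evaluate"; here it is left as a black box.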

38. Atlas-based Manifold Representations for Interpretable Riemannian Machine Learning

ArXiv ID: 2510.17772

Authors: Ryan A. Robinett, Sophia A. Madejski, Kyle Ruark, Samantha J. Riesenfeld, Lorenzo Orecchia

Abstract: Despite the popularity of the manifold hypothesis, current manifold-learning methods do not support machine learning directly on the latent $d$-dimensional data manifold, as they primarily aim to perform dimensionality reduction into $\mathbb{R}^D$, losing key manifold features when the embedding dimension $D$ approaches $d$. On the other hand, methods that directly learn the latent manifold as a differentiable atlas have been relatively underexplored. In this paper, we aim to give a proof of concept of the effectiveness and potential of atlas-based methods. To this end, we implement a generic data structure to maintain a differentiable atlas that enables Riemannian optimization over the manifold. We complement this with an unsupervised heuristic that learns a differentiable atlas from point cloud data. We experimentally demonstrate that this approach has advantages in terms of efficiency and accuracy in selected settings. Moreover, in a supervised classification task over the Klein bottle and in RNA velocity analysis of hematopoietic data, we showcase the improved interpretability and robustness of our approach.

Comment: Representation Learning: learns a differentiable atlas for latent manifolds enabling Riemannian optimization and interpretable representations.