Personalized Daily ArXiv Papers 2025-10-15

[gpt-5]	Prompt	Completion	Total
Token	96425	97731	194156
Cost	$0.12	$0.98	$1.1

Total arXiv papers: 894

Total scanned papers: 512

Total relevant papers: 64

Table of contents with paper titles:

Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise Authors: Luca Scimeca, Thomas Jiralerspong, Berton Earnshaw, Jason Hartford, Yoshua Bengio
Deconstructing Attention: Investigating Design Principles for Effective Language Modeling Authors: Huiyin Xue, Nafise Sadat Moosavi, Nikolaos Aletras
AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs Authors: Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee
Dr.LLM: Dynamic Layer Routing in LLMs Authors: Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
Stability of Transformers under Layer Normalization Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Krishna Kumar, Markos A. Katsoulakis
MC#: Mixture Compressor for Mixture-of-Experts Large Models Authors: Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, Xiaojuan Qi
Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling Authors: Hehe Fan, Yi Yang, Mohan Kankanhalli, Fei Wu
PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models Authors: Lancheng Zou, Shuo Yin, Zehua Pei, Tsung-Yi Ho, Farzan Farnia, Bei Yu
ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces Authors: Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar
On the Optimal Representation Efficiency of Barlow Twins: An Information-Geometric Interpretation Authors: Di Zhang
Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers Authors: Sarthak Mittal, Divyat Mahajan, Guillaume Lajoie, Mohammad Pezeshki
Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models Authors: Shihao Ji, Zihui Song, Jiajie Huang
Decomposer Networks: Deep Component Analysis and Synthesis Authors: Mohsen Joneidi
Direct Multi-Token Decoding Authors: Xuan Luo, Weizhi Wang, Xifeng Yan
Adversarial Attacks Leverage Interference Between Features in Superposition Authors: Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal
Differentiable Fast Top-K Selection for Large-Scale Recommendation Authors: Yanjie Zhu, Zhen Zhang, Yunli Wang, Zhiqiang Wang, Yu Li, Rufan Zhou, Shiyang Wen, Peng Jiang, Chenhao Lin, Jian Yang
Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs Authors: Jo\~ao Paulo Cardoso de Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan
LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences Authors: Wenbo Wu, Qingyi Si, Xiurui Pan, Ye Wang, Jie Zhang
Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent Authors: Sara Dragutinovi\'c, Andrew M. Saxe, Aaditya K. Singh
Chimera: State Space Models Beyond Sequences Authors: Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu
In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning Authors: Tomoya Wakayama, Taiji Suzuki
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs Authors: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism Authors: Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, Chuan Wu
Rademacher Meets Colors: More Expressivity, but at What Cost ? Authors: Martin Carrasco, Caio Deberaldini Netto, Vahan A. Martirosyan, Aneeqa Mehrab, Ehimare Okoyomon, Caterina Graziani
Understanding Self-supervised Contrastive Learning through Supervised Objectives Authors: Byeongchan Lee
Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding Authors: Payel Bhattacharjee, Fengwei Tian, Meiyu Zhong, Guangyi Zhang, Osvaldo Simeone, Ravi Tandon
What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably) Authors: Zixuan Gong, Jiaye Teng, Yong Liu
Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors Authors: Quentin Fruytier, Akshay Malhotra, Shahab Hamidi-Rad, Aditya Sant, Aryan Mokhtari, Sujay Sanghavi
Compositional Symmetry as Compression: Lie Pseudogroup Structure in Algorithmic Agents Authors: Giulio Ruffini
Vanishing Contributions: A Unified Approach to Smoothly Transition Neural Models into Compressed Form Authors: Lorenzo Nikiforos, Charalampos Antoniadis, Luciano Prono, Fabio Pareschi, Riccardo Rovatti, Gianluca Setti
Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware Authors: Lion Mueller, Alberto Garcia-Ortiz, Ardalan Najafi, Adam Fuks, Lennart Bamberg
Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models Authors: Junhyuck Kim, Ethan Ewer, Taehong Moon, Jongho Park, Dimitris Papailiopoulos
ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning Authors: Jinyang Zhang, Yue Fang, Hongxin Ding, Weibin Liao, Muyang Ye, Xu Chu, Junfeng Zhao, Yasha Wang
Neural Weight Compression for Language Models Authors: Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference Authors: Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang
Hierarchical LoRA MoE for Efficient CTR Model Scaling Authors: Zhichen Zeng, Mengyue Hang, Xiaolong Liu, Xiaoyi Liu, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Zhining Liu, Siyang Yuan, Chaofei Yang, Yiqun Liu, Hang Yin, Jiyan Yang, Hanghang Tong
CauchyNet: Compact and Data-Efficient Learning using Holomorphic Activation Functions Authors: Hong-Kun Zhang, Xin Li, Sikun Yang, Zhihong Xia
CacheClip: Accelerating RAG with Effective KV Cache Reuse Authors: Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu
ICL-Router: In-Context Learned Model Representations for LLM Routing Authors: Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, Shuyue Hu
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks Authors: Xuan Tang, Han Zhang, Yuan Cao, Difan Zou
Redundancy as a Structural Information Principle for Learning and Generalization Authors: Yuda Bi, Ying Zhu, Vince D Calhoun
Y-shaped Generative Flows Authors: Arip Asadulaev, Semyon Semenov, Abduragim Shtanchaev, Eric Moulines, Fakhri Karray, Martin Takac
Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits Authors: Abdurakhmon Sadiev, Peter Richt\'arik, Ilyas Fatkhullin
Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance Authors: Mamoona Ghafoor, Tatsuya Akutsu
HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization Authors: Ziyi Han, Huanyu Wang, Zeyu Zhang, Xiangxiang Dai, Xutong Liu, John C. S. Lui
The Geometry of Reasoning: Flowing Logics in Representation Space Authors: Yufa Zhou, Yixiao Wang, Xunjian Yin, Shuyan Zhou, Anru R. Zhang
Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony Authors: Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus Authors: Zhaolin Hu, Kun Li, Hehe Fan, Yi Yang
EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling Authors: Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet \"Ust\"un
MoRA: On-the-fly Molecule-aware Low-Rank Adaptation Framework for LLM-based Multi-Modal Molecular Assistant Authors: Tao Yin, Xiaohong Zhang, Jiacheng Zhang, Li Huang, Zhibin Zhang, Yuansong Zeng, Jin Xie, Meng Yan
Multi-View Graph Learning with Graph-Tuple Authors: Shiyu Chen, Ningyuan Huang, Soledad Villar
Scaling Laws and Symmetry, Evidence from Neural Force Fields Authors: Khang Ngo, Siamak Ravanbakhsh
Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space Authors: Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, Liqiang Nie
Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs Authors: Nikolaus Salvatore, Hao Wang, Qiong Zhang
Topological Alignment of Shared Vision-Language Embedding Space Authors: Junwon You, Dasol Kang, Jae-Hun Jung
Rethinking Knowledge Distillation: A Data Dependent Regulariser With a Negative Asymmetric Payoff Authors: Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis
LightSAE: Parameter-Efficient and Heterogeneity-Aware Embedding for IoT Multivariate Time Series Forecasting Authors: Yi Ren, Xinjie Yu
Multitask Learning with Learned Task Relationships Authors: Zirui Wan, Stefan Vlaski
An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants Authors: Michael Crawshaw, Chirag Modi, Mingrui Liu, Robert M. Gower
Test-Time Adaptation by Causal Trimming Authors: Yingnan Liu, Rui Qiao, Mong Li Lee, Wynne Hsu
Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models Authors: Yukun Zhang, Qi Dong
BioOSS: A Bio-Inspired Oscillatory State System with Spatio-Temporal Dynamics Authors: Zhongju Yuan, Geraint Wiggins, Dick Botteldooren
Why Do Transformers Fail to Forecast Time Series In-Context? Authors: Yufa Zhou, Yixiao Wang, Surbhi Goel, Anru R. Zhang
Discursive Circuits: How Do Language Models Understand Discourse Relations? Authors: Yisong Miao, Min-Yen Kan

1. Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise

ArXiv ID: 2510.09660

Authors: Luca Scimeca, Thomas Jiralerspong, Berton Earnshaw, Jason Hartford, Yoshua Bengio

Abstract: Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as spectrally anisotropic Gaussian diffusion (SAGD). In this work, we derive the score relation for anisotropic covariances and show that, under full support, the learned score converges to the true data score as $t!\to!0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.

Comment: Author match

2. Deconstructing Attention: Investigating Design Principles for Effective Language Modeling

ArXiv ID: 2510.11602

Authors: Huiyin Xue, Nafise Sadat Moosavi, Nikolaos Aletras

Abstract: The success of Transformer language models is widely credited to their dot-product attention mechanism, which interweaves a set of key design principles: mixing information across positions (enabling multi-token interactions), sequence-dependent activations (where attention weights adapt to each input), a specific mathematical form (dot-product similarities plus softmax weighting), and coupling of queries and keys to evolving hidden states (grounding attention in the current layer). However, the necessity of each of these principles remains largely untested. In this work, we systematically deconstruct attention by designing controlled variants that selectively relax these principles, applied both uniformly across all layers and in hybrid architectures where only some layers retain standard attention. Our empirical analysis reveals that mechanisms for mixing tokens are indispensable, as their absence collapses models to near-random behavior, while the exact mathematical form and sequence dependency can be substantially relaxed, especially when preserved in just a subset of layers. Surprisingly, even variants that fail in isolation can achieve robust performance when interleaved with standard attention, highlighting a cooperative effect. These findings deepen our understanding of what truly underpins attention's effectiveness and open new avenues for simplifying language models without sacrificing performance.

Comment: Systematic analysis and relaxation of attention design principles in Transformers — Model Architecture (attention mechanism).

Relevance: 10 Novelty: 8

3. AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs

ArXiv ID: 2510.10467

Authors: Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

Abstract: The deployment of large language models (LLMs) is increasingly constrained by memory and latency bottlenecks, motivating the need for quantization techniques that flexibly balance accuracy and efficiency. Recent work has introduced multi-precision models, which enable inference at multiple precisions within a single model depending on runtime constraints. To support such flexibility, quantized weights are often stored as bit-planes, where hardware efficiency improves when the compute operates directly at the bit-plane level and activates only the precision required by each request. In this work, we present AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded Quantization (BCQ) that supports direct bit-plane operations. By representing weights as binary bit-planes with corresponding scale factors, AnyBCQ enables bit-plane-level computation and maps naturally to accelerator-friendly, bit-parallel arithmetic. Our progressive precision expansion mechanism incrementally refines scaling factors while reusing previously assigned binary codes, yielding monotonic improvements in accuracy as additional bits are enabled. We further co-design a specialized kernel that exploits the BCQ structure to support dynamic per-request precision selection with negligible overhead. Experiments on recent LLMs demonstrate that AnyBCQ significantly narrows the accuracy drop in the low-bit regime (e.g. 2-bit), remains competitive at higher precision, and achieves throughput gains of up to 3.0x over half precision and 1.2x over state-of-the-art multi-precision methods. By aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a practical foundation for multi-precision LLM deployment across diverse service-level objectives.

Comment: Direct hit on Model Compression and Efficiency: multi-precision quantization with bit-plane compute and hardware–algorithm co-design for LLMs.

Relevance: 10 Novelty: 8

4. Dr.LLM: Dynamic Layer Routing in LLMs

ArXiv ID: 2510.12773

Authors: Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh

Abstract: Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.

Comment: Direct match to Model Architecture: conditional/dynamic networks via per-layer routers for skip/execute/repeat, supervised by MCTS for adaptive-depth inference.

Relevance: 10 Novelty: 8

5. Stability of Transformers under Layer Normalization

ArXiv ID: 2510.09904

Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Krishna Kumar, Markos A. Katsoulakis

Abstract: Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.

Comment: Direct match to architectural analysis/training stability: principled theory on Transformer stability under different LayerNorm placements and residual scaling.

Relevance: 10 Novelty: 8

6. MC#: Mixture Compressor for Mixture-of-Experts Large Models

ArXiv ID: 2510.10962

Authors: Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, Xiaojuan Qi

Abstract: Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. However, preloading all experts into memory and activating multiple experts per input introduces significant computational and memory overhead, making the expert module a major contributor to model size and inference cost. To address this, we propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning by leveraging the significance of experts and tokens for aggressive compression of MoE-LLMs/VLMs. To reduce storage and loading costs, we introduce Pre-Loading Mixed-Precision Quantization (PMQ), which optimizes bit allocation via linear programming, balancing expert importance and quantization error for a Pareto-optimal trade-off between size and performance. To reduce runtime computation, Online Top-any Pruning (OTP) uses Gumbel-Softmax sampling to dynamically select a subset of experts per token, enabling fine-grained control over activation. By combining PMQ's static bit-width optimization with OTP's dynamic routing, MC# achieves extreme compression with minimal accuracy loss. On DeepSeek-VL2, MC# achieves a 6.2 times weight reduction at 2.57 average bits with only a 1.7% accuracy drop across five multimodal benchmarks. Additionally, OTP reduces expert activation over 20% with less than 1% performance degradation, demonstrating strong potential for efficient MoE-based model deployment.

Comment: MoE compression via mixed-precision quantization and dynamic expert pruning/routing (quantization + sparsity/pruning for MoE efficiency).

Relevance: 10 Novelty: 8

7. Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling

ArXiv ID: 2510.10060

Authors: Hehe Fan, Yi Yang, Mohan Kankanhalli, Fei Wu

Abstract: When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these tokens effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet their fixed kernel size limits their ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named {\alpha}-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including {\alpha}-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.

Comment: Matches Model Architecture: proposes Translution unifying self-attention and convolution with a lightweight alpha-Translution variant for adaptive relative modeling.

Relevance: 10 Novelty: 8

8. PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models

ArXiv ID: 2510.10136

Authors: Lancheng Zou, Shuo Yin, Zehua Pei, Tsung-Yi Ho, Farzan Farnia, Bei Yu

Abstract: Channel permutation is a powerful technique for enhancing the accuracy of N:M sparse models by reordering the channels of weight matrices to prioritize the retention of important weights. However, traditional channel permutation methods rely on handcrafted quality metrics, which often fail to accurately capture the true impact of pruning on model performance. To address this limitation, we propose PermLLM, a novel post-training pruning framework that introduces learnable channel permutation (LCP) for N:M sparsity. LCP leverages Sinkhorn normalization to transform discrete permutation matrices into differentiable soft permutation matrices, enabling end-to-end optimization. Additionally, PermLLM incorporates an efficient block-wise channel permutation strategy, which significantly reduces the number of learnable parameters and computational complexity. PermLLM seamlessly integrates with existing one-shot pruning methods to adaptively optimize channel permutations, effectively mitigating pruning-induced errors. Extensive experiments on the LLaMA series, Qwen, and OPT models demonstrate that PermLLM achieves superior performance in optimizing N:M sparse models. The code is available at https://github.com/lanchengzou/PermLLM.

Comment: Matches Model Compression and Efficiency: N:M sparsity with learnable channel permutation via differentiable Sinkhorn normalization and block-wise optimization.

Relevance: 10 Novelty: 8

9. ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces

ArXiv ID: 2510.11168

Authors: Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Abstract: Large output spaces, also referred to as Extreme multilabel classification (XMC), is a setting that arises, e.g., in large-scale tagging and product-to-product recommendation, and is characterized by the number of labels ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver for compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable, and inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations -- gradient fusion and chunking -- enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method, Renee without compromising accuracy.

Comment: Pure low-precision (BF16/Float8) training with Kahan summation, stochastic rounding, and memory optimizations (gradient fusion/chunking) — Model Compression/Efficiency for large output spaces.

Relevance: 9 Novelty: 8

10. On the Optimal Representation Efficiency of Barlow Twins: An Information-Geometric Interpretation

ArXiv ID: 2510.10980

Authors: Di Zhang

Abstract: Self-supervised learning (SSL) has achieved remarkable success by learning meaningful representations without labeled data. However, a unified theoretical framework for understanding and comparing the efficiency of different SSL paradigms remains elusive. In this paper, we introduce a novel information-geometric framework to quantify representation efficiency. We define representation efficiency $\eta$ as the ratio between the effective intrinsic dimension of the learned representation space and its ambient dimension, where the effective dimension is derived from the spectral properties of the Fisher Information Matrix (FIM) on the statistical manifold induced by the encoder. Within this framework, we present a theoretical analysis of the Barlow Twins method. Under specific but natural assumptions, we prove that Barlow Twins achieves optimal representation efficiency ($\eta = 1$) by driving the cross-correlation matrix of representations towards the identity matrix, which in turn induces an isotropic FIM. This work provides a rigorous theoretical foundation for understanding the effectiveness of Barlow Twins and offers a new geometric perspective for analyzing SSL algorithms.

Comment: Information-geometric theory of Barlow Twins showing optimal representation efficiency via isotropic FIM — Representation Learning theory.

Relevance: 9 Novelty: 8

11. Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers

ArXiv ID: 2510.11471

Authors: Sarthak Mittal, Divyat Mahajan, Guillaume Lajoie, Mohammad Pezeshki

Abstract: Modern learning systems increasingly rely on amortized learning - the idea of reusing computation or inductive biases shared across tasks to enable rapid generalization to novel problems. This principle spans a range of approaches, including meta-learning, in-context learning, prompt tuning, learned optimizers and more. While motivated by similar goals, these approaches differ in how they encode and leverage task-specific information, often provided as in-context examples. In this work, we propose a unified framework which describes how such methods differ primarily in the aspects of learning they amortize - such as initializations, learned updates, or predictive mappings - and how they incorporate task data at inference. We introduce a taxonomy that categorizes amortized models into parametric, implicit, and explicit regimes, based on whether task adaptation is externalized, internalized, or jointly modeled. Building on this view, we identify a key limitation in current approaches: most methods struggle to scale to large datasets because their capacity to process task data at inference (e.g., context length) is often limited. To address this, we propose iterative amortized inference, a class of models that refine solutions step-by-step over mini-batches, drawing inspiration from stochastic optimization. Our formulation bridges optimization-based meta-learning with forward-pass amortization in models like LLMs, offering a scalable and extensible foundation for general-purpose task adaptation.

Comment: Unified framework for amortized learning (ICL, learned optimizers) with iterative amortized inference — Representation Learning/training dynamics and adaptation.

Relevance: 9 Novelty: 8

12. Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models

ArXiv ID: 2510.12137

Authors: Shihao Ji, Zihui Song, Jiajie Huang

Abstract: Large Language Models (LLMs) hallucinate, generating factually incorrect yet confident assertions. We argue this stems from the Transformer's Softmax function, which creates "Artificial Certainty" by collapsing ambiguous attention scores into a single probability distribution, discarding uncertainty information at each layer. To fix this, we introduce the Credal Transformer, which replaces standard attention with a Credal Attention Mechanism (CAM) based on evidential theory. CAM produces a "credal set" (a set of distributions) instead of a single attention vector, with the set's size directly measuring model uncertainty. We implement this by re-conceptualizing attention scores as evidence masses for a Dirichlet distribution: sufficient evidence recovers standard attention, while insufficient evidence yields a diffuse distribution, representing ambiguity. Empirically, the Credal Transformer identifies out-of-distribution inputs, quantifies ambiguity, and significantly reduces confident errors on unanswerable questions by abstaining. Our contribution is a new architecture to mitigate hallucinations and a design paradigm that integrates uncertainty quantification directly into the model, providing a foundation for more reliable AI.

Comment: Model Architecture: replaces softmax attention with a Credal Attention Mechanism yielding credal sets for uncertainty-aware Transformers; integrates uncertainty directly into the attention mechanism.

Relevance: 9 Novelty: 8

13. Decomposer Networks: Deep Component Analysis and Synthesis

ArXiv ID: 2510.09825

Authors: Mohsen Joneidi

Abstract: We propose the Decomposer Networks (DecompNet), a semantic autoencoder that factorizes an input into multiple interpretable components. Unlike classical autoencoders that compress an input into a single latent representation, the Decomposer Network maintains N parallel branches, each assigned a residual input defined as the original signal minus the reconstructions of all other branches. By unrolling a Gauss--Seidel style block-coordinate descent into a differentiable network, DecompNet enforce explicit competition among components, yielding parsimonious, semantically meaningful representations. We situate our model relative to linear decomposition methods (PCA, NMF), deep unrolled optimization, and object-centric architectures (MONet, IODINE, Slot Attention), and highlight its novelty as the first semantic autoencoder to implement an all-but-one residual update rule.

Comment: Matches Model Architecture and Representation Learning: semantic autoencoder with Gauss–Seidel-style unrolled competition among components for interpretable factorization.

Relevance: 9 Novelty: 8

14. Direct Multi-Token Decoding

ArXiv ID: 2510.11958

Authors: Xuan Luo, Weizhi Wang, Xifeng Yan

Abstract: Decoder-only transformers have become the standard architecture for large language models (LLMs) due to their strong performance. Recent studies suggest that, in pre-trained LLMs, early, middle, and late layers may serve distinct roles: Early layers focus on understanding the input context, middle layers handle task-specific processing, and late layers convert abstract representations into output tokens. We hypothesize that once representations have been processed by the early and middle layers, the resulting hidden states may encapsulate sufficient information to support the generation of multiple tokens using only the late layers, eliminating the need to repeatedly traverse the early and middle layers. We refer to this inference paradigm as Direct Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces no additional parameters, auxiliary routines, or post-generation verification. Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model has already demonstrated promising results, achieving up to a 2x speedup with only minor performance loss. Moreover, as shown in our scaling analysis, its performance is expected to further improve with larger training datasets.

Comment: Model Efficiency/HPC — Direct Multi-Token Decoding uses late layers to emit multiple tokens per step without auxiliary models, reducing repeated forward passes.

Relevance: 9 Novelty: 8

15. Adversarial Attacks Leverage Interference Between Features in Superposition

ArXiv ID: 2510.11709

Authors: Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal

Abstract: Fundamental questions remain about when and why adversarial examples arise in neural networks, with competing views characterising them either as artifacts of the irregularities in the decision landscape or as products of sensitivity to non-robust input features. In this paper, we instead argue that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, we show how superposition - where networks represent more features than they have dimensions - creates arrangements of latent representations that adversaries can exploit. We demonstrate that adversarial perturbations leverage interference between superposed features, making attack patterns predictable from feature arrangements. Our framework provides a mechanistic explanation for two known phenomena: adversarial attack transferability between models with similar training regimes and class-specific vulnerability patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. We then demonstrate that these findings persist in a ViT trained on CIFAR-10. These findings reveal adversarial vulnerability can be a byproduct of networks' representational compression, rather than flaws in the learning process or non-robust inputs.

Comment: Provides mechanistic representation-learning insight via superposition explaining adversarial vulnerability (representation learning/training dynamics).

Relevance: 9 Novelty: 8

16. Differentiable Fast Top-K Selection for Large-Scale Recommendation

ArXiv ID: 2510.11472

Authors: Yanjie Zhu, Zhen Zhang, Yunli Wang, Zhiqiang Wang, Yu Li, Rufan Zhou, Shiyang Wen, Peng Jiang, Chenhao Lin, Jian Yang

Abstract: Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, hindering end-to-end training. Existing methods include Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics like NDCG and suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing the use of soft permutation matrices. However, even state-of-the-art differentiable Top-K operator (e.g., LapSum) require $O(n \log n)$ complexity due to their dependence on sorting for solving the threshold. Thus, we propose DFTopK, a novel differentiable Top-K operator achieving optimal $O(n)$ time complexity. By relaxing normalization constraints, DFTopK admits a closed-form solution and avoids sorting. DFTopK also avoids the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on both the public benchmark RecFLow and an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which enables us to scale up training samples more efficiently. In the online A/B test, DFTopK yielded a +1.77\% revenue lift with the same computational budget compared to the baseline. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.

Comment: Designs a differentiable Top-K operator with O(n) complexity for end-to-end training (algorithmic efficiency breakthrough).

Relevance: 9 Novelty: 8

17. Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs

ArXiv ID: 2510.11192

Authors: Jo\~ao Paulo Cardoso de Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan

Abstract: Structured sparsity enables deploying large language models (LLMs) on resource-constrained systems. Approaches like dense-to-sparse fine-tuning are particularly compelling, achieving remarkable structured sparsity by reducing the model size by over 6.7x, while still maintaining acceptable accuracy. Despite this reduction, LLM inference, especially the decode stage being inherently memory-bound, is extremely expensive on conventional Von-Neumann architectures. Compute-in-memory (CIM) architectures mitigate this by performing computations directly in memory, and when paired with sparse LLMs, enable storing and computing the entire model in memory, eliminating the data movement on the off-chip bus and improving efficiency. Nonetheless, naively mapping sparse matrices onto CIM arrays leads to poor array utilization and diminished computational efficiency. In this paper, we present an automated framework with novel mapping and scheduling strategies to accelerate sparse LLM inference on CIM accelerators. By exploiting block-diagonal sparsity, our approach improves CIM array utilization by over 50%, achieving more than 4x reduction in both memory footprint and the number of required floating-point operations.

Comment: Matches HPC and Compression/Efficiency: automated mapping and scheduling for block-diagonal sparse LLMs on compute-in-memory accelerators to boost array utilization and reduce memory/compute.

Relevance: 9 Novelty: 8

18. LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences

ArXiv ID: 2510.11292

Authors: Wenbo Wu, Qingyi Si, Xiurui Pan, Ye Wang, Jie Zhang

Abstract: While Key-Value (KV) cache succeeds in reducing redundant computations in auto-regressive models, it introduces significant memory overhead, limiting its practical deployment in long-sequence scenarios. Existing KV retrieval methods mitigate this by dynamically retaining only a subset of KV entries on the GPU. However, they still suffer from notable efficiency and accuracy bottlenecks due to per-token retrieval and coarse-grained page-level KV management, especially in long-output reasoning scenarios. With the emergence of large reasoning models, efficiently handling such scenarios has become increasingly important. To address this issue, we present two key observations: (1) critical KVs exhibit strong temporal locality during decoding, and (2) these KVs exhibit distinct distribution patterns across the input prompt and generated output. Building on these observations, we propose LouisKV, an efficient KV cache retrieval framework designed for various long-sequence scenarios. Specifically, LouisKV introduces a semantic-aware retrieval strategy leveraging temporal locality to trigger retrieval only at semantic boundaries, drastically reducing computation and data transfer overhead. LouisKV also designs a decoupled, fine-grained management scheme that tailors differentiated strategies for input and output sequences to create retrieval units that better match the model's attention patterns, enabling precise identification of critical KVs. Furthermore, to boost efficiency, LouisKV incorporates several kernel-level optimizations, including custom Triton and CUDA kernels to accelerate the KV clustering and retrieval. Evaluations show that LouisKV achieves up to 4.7$\times$ speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks, including long-input short-output, short-input long-output, and long-input long-output scenarios.

Comment: Matches Compression/Efficiency and HPC: semantic-aware KV retrieval and fine-grained decoupled management with custom kernels to accelerate long-sequence LLM decoding while preserving accuracy.

Relevance: 9 Novelty: 8

19. Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent

ArXiv ID: 2510.10425

Authors: Sara Dragutinovi\'c, Andrew M. Saxe, Aaditya K. Singh

Abstract: The remarkable ability of transformers to learn new concepts solely by reading examples within the input prompt, termed in-context learning (ICL), is a crucial aspect of intelligent behavior. Here, we focus on understanding the learning algorithm transformers use to learn from context. Existing theoretical work, often based on simplifying assumptions, has primarily focused on linear self-attention and continuous regression tasks, finding transformers can learn in-context by gradient descent. Given that transformers are typically trained on discrete and complex tasks, we bridge the gap from this existing work to the setting of classification, with non-linear (importantly, softmax) activation. We find that transformers still learn to do gradient descent in-context, though on functionals in the kernel feature space and with a context-adaptive learning rate in the case of softmax transformer. These theoretical findings suggest a greater adaptability to context for softmax attention, which we empirically verify and study through ablations. Overall, we hope this enhances theoretical understanding of in-context learning algorithms in more realistic settings, pushes forward our intuitions and enables further theory bridging to larger models.

Comment: Representation Learning: theoretical analysis of in-context learning dynamics in transformers with softmax attention (kernel gradient descent, context-adaptive rates).

Relevance: 9 Novelty: 8

20. Chimera: State Space Models Beyond Sequences

ArXiv ID: 2510.12111

Authors: Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu

Abstract: Transformer-based deep learning methods have become the standard approach for modeling diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires inductive biases--such as position embeddings in sequences and images, or random walks in graphs--to incorporate topology. However, designing such task-specific biases requires significant effort and can introduce side effects that hinder generalization. We introduce Chimera, a unified model that directly incorporates data topology in a principled way, removing the need for domain-specific biases. The key idea is that state space models--which naturally do not require position embeddings--can be generalized to capture any graph topology. Our experiments show that Chimera achieves strong performance across language, vision, and graph domains, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on the Long Range Graph Benchmark. We further propose algorithmic optimizations to improve Chimera's efficiency: (1) for Directed Acyclic Graphs, Chimera can be implemented as a linear-time recurrence; (2) for general graphs, a simple mathematical relaxation achieves Transformer's quadratic complexity without domain-specific heuristics. These results validate Chimera's core contribution and support the idea that data topology is a powerful inductive bias across modalities.

Comment: Model Architecture and Efficiency: generalizes state space models to arbitrary topologies; provides linear-time DAG recurrence and quadratic-time relaxation for general graphs.

Relevance: 9 Novelty: 8

21. In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning

ArXiv ID: 2510.10981

Authors: Tomoya Wakayama, Taiji Suzuki

Abstract: This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.

Comment: Representation Learning/Training Dynamics: finite-sample generalization theory for ICL in Transformers with risk decomposition and non-asymptotic bounds.

Relevance: 9 Novelty: 8

22. QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

ArXiv ID: 2510.11696

Authors: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen

Abstract: We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration, and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.

Comment: Model Compression/Efficiency and HPC: NVFP4 quantization + LoRA to accelerate RL training of LLMs, with adaptive quantization noise for exploration.

Relevance: 9 Novelty: 8

23. DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism

ArXiv ID: 2510.10620

Authors: Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, Chuan Wu

Abstract: Context parallelism has emerged as a key technique to support long-context training, a growing trend in generative AI for modern large models. However, existing context parallel methods rely on static parallelization configurations that overlook the dynamic nature of training data, specifically, the variability in sequence lengths and token relationships (i.e., attention patterns) across samples. As a result, these methods often suffer from unnecessary communication overhead and imbalanced computation. In this paper, we present DCP, a dynamic context parallel training framework that introduces fine-grained blockwise partitioning of both data and computation. By enabling flexible mapping of data and computation blocks to devices, DCP can adapt to varying sequence characteristics, effectively reducing communication and improving memory and computation balance. Micro-benchmarks demonstrate that DCP accelerates attention by 1.19x~2.45x under causal masks and 2.15x~3.77x under sparse attention patterns. Additionally, we observe up to 0.94x~1.16x end-to-end training speed-up for causal masks, and 1.00x~1.46x for sparse masks.

Comment: Matches High Performance Computing: dynamic context parallelism with fine-grained blockwise partitioning for long-context training, reducing communication and improving balance.

Relevance: 9 Novelty: 8

24. Rademacher Meets Colors: More Expressivity, but at What Cost ?

ArXiv ID: 2510.10101

Authors: Martin Carrasco, Caio Deberaldini Netto, Vahan A. Martirosyan, Aneeqa Mehrab, Ehimare Okoyomon, Caterina Graziani

Abstract: The expressive power of graph neural networks (GNNs) is typically understood through their correspondence with graph isomorphism tests such as the Weisfeiler-Leman (WL) hierarchy. While more expressive GNNs can distinguish a richer set of graphs, they are also observed to suffer from higher generalization error. This work provides a theoretical explanation for this trade-off by linking expressivity and generalization through the lens of coloring algorithms. Specifically, we show that the number of equivalence classes induced by WL colorings directly bounds the GNNs Rademacher complexity -- a key data-dependent measure of generalization. Our analysis reveals that greater expressivity leads to higher complexity and thus weaker generalization guarantees. Furthermore, we prove that the Rademacher complexity is stable under perturbations in the color counts across different samples, ensuring robustness to sampling variability across datasets. Importantly, our framework is not restricted to message-passing GNNs or 1-WL, but extends to arbitrary GNN architectures and expressivity measures that partition graphs into equivalence classes. These results unify the study of expressivity and generalization in GNNs, providing a principled understanding of why increasing expressive power often comes at the cost of generalization.

Comment: Matches Representation Learning/Theory: links GNN expressivity (WL colorings) to Rademacher complexity, explaining generalization–expressivity trade-offs.

Relevance: 9 Novelty: 8

25. Understanding Self-supervised Contrastive Learning through Supervised Objectives

ArXiv ID: 2510.10572

Authors: Byeongchan Lee

Abstract: Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.

Comment: Strongly matches Representation Learning by providing a theoretical formulation linking self-supervised contrastive objectives to supervised ones, yielding insights into InfoNCE and balanced contrastive losses.

Relevance: 9 Novelty: 8

26. Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding

ArXiv ID: 2510.09942

Authors: Payel Bhattacharjee, Fengwei Tian, Meiyu Zhong, Guangyi Zhang, Osvaldo Simeone, Ravi Tandon

Abstract: Edge-cloud speculative decoding (SD) accelerates inference by having a cloud-based large language model (LLM) that verifies draft tokens generated by a resource-constrained small language model (SLM) at the edge. A central bottleneck is the limited bandwidth of the edge-cloud link, which necessitates efficient compression of draft token distributions. We first derive an information-theoretic bound that decomposes the token rejection rate into contributions from SLM-LLM distribution mismatch and from quantization distortion. Guided by this analysis, we propose the Sparse Quantize-and-Sample SD (SQS-SD) framework, which exploits distributional sparsity through structured sparsification and lattice-based quantization. Within this framework, K-SQS applies fixed top-K truncation, while C-SQS adaptively adjusts the retained token set via online conformal prediction to ensure bounded deviation from the dense distribution. Empirical results confirm that both approaches improve end-to-end latency and rejection rates in complimentary operating regimes.

Comment: Strongly matches Model Compression and Efficiency by leveraging structured sparsification, conformal prediction, and lattice quantization to compress token distributions for speculative decoding; systems-level bandwidth optimization aligns with efficiency goals.

Relevance: 9 Novelty: 8

27. What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)

ArXiv ID: 2510.10089

Authors: Zixuan Gong, Jiaye Teng, Yong Liu

Abstract: While looped transformers (termed as Looped-Attn) often outperform standard transformers (termed as Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee a better loss convergence along the river due to valley hopping, and further encourage learning about complex patterns compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training process of Looped-Attn while achieving comparable performances.

Comment: Strongly matches Model Architecture by analyzing looped-attention Transformers vs single-pass Transformers via loss-landscape theory and proposing a staged training framework (SHIFT), touching training dynamics as well.

Relevance: 9 Novelty: 8

28. Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors

ArXiv ID: 2510.11953

Authors: Quentin Fruytier, Akshay Malhotra, Shahab Hamidi-Rad, Aditya Sant, Aryan Mokhtari, Sujay Sanghavi

Abstract: Learning disentangled representations, where distinct factors of variation are captured by independent latent variables, is a central goal in machine learning. The dominant approach has been the Variational Autoencoder (VAE) framework, which uses a Kullback-Leibler (KL) divergence penalty to encourage the latent space to match a factorized Gaussian prior. In this work, however, we provide direct evidence that this KL-based regularizer is an unreliable mechanism, consistently failing to enforce the target distribution on the aggregate posterior. We validate this and quantify the resulting entanglement using our novel, unsupervised Latent Predictability Score (LPS). To address this failure, we introduce the Programmable Prior Framework, a method built on the Maximum Mean Discrepancy (MMD). Our framework allows practitioners to explicitly sculpt the latent space, achieving state-of-the-art mutual independence on complex datasets like CIFAR-10 and Tiny ImageNet without the common reconstruction trade-off. Furthermore, we demonstrate how this programmability can be used to engineer sophisticated priors that improve alignment with semantically meaningful features. Ultimately, our work provides a foundational tool for representation engineering, opening new avenues for model identifiability and causal reasoning.

Comment: Representation Learning: VAE-style disentanglement with MMD-based programmable priors; introduces an unsupervised Latent Predictability Score and achieves independence without reconstruction loss trade-off.

Relevance: 9 Novelty: 8

29. Compositional Symmetry as Compression: Lie Pseudogroup Structure in Algorithmic Agents

ArXiv ID: 2510.10586

Authors: Giulio Ruffini

Abstract: In the algorithmic (Kolmogorov) view, agents are programs that track and compress sensory streams using generative programs. We propose a framework where the relevant structural prior is simplicity (Solomonoff) understood as \emph{compositional symmetry}: natural streams are well described by (local) actions of finite-parameter Lie pseudogroups on geometrically and topologically complex low-dimensional configuration manifolds (latent spaces). Modeling the agent as a generic neural dynamical system coupled to such streams, we show that accurate world-tracking imposes (i) \emph{structural constraints} -- equivariance of the agent's constitutive equations and readouts -- and (ii) \emph{dynamical constraints}: under static inputs, symmetry induces conserved quantities (Noether-style labels) in the agent dynamics and confines trajectories to reduced invariant manifolds; under slow drift, these manifolds move but remain low-dimensional. This yields a hierarchy of reduced manifolds aligned with the compositional factorization of the pseudogroup, providing a geometric account of the ``blessing of compositionality'' in deep models. We connect these ideas to the Spencer formalism for Lie pseudogroups and formulate a symmetry-based, self-contained version of predictive coding in which higher layers receive only \emph{coarse-grained residual transformations} (prediction-error coordinates) along symmetry directions unresolved at lower layers.

Comment: Representation Learning: theoretical framework linking compositional symmetry/equivariance to manifold reductions and predictive coding, offering principles for compressive, hierarchical representations.

Relevance: 8 Novelty: 9

30. Vanishing Contributions: A Unified Approach to Smoothly Transition Neural Models into Compressed Form

ArXiv ID: 2510.09696

Authors: Lorenzo Nikiforos, Charalampos Antoniadis, Luciano Prono, Fabio Pareschi, Riccardo Rovatti, Gianluca Setti

Abstract: The increasing scale of deep neural networks has led to a growing need for compression techniques such as pruning, quantization, and low-rank decomposition. While these methods are very effective in reducing memory, computation and energy consumption, they often introduce severe accuracy degradation when applied directly. We introduce Vanishing Contributions (VCON), a general approach for smoothly transitioning neural models into compressed form. Rather than replacing the original network directly with its compressed version, VCON executes the two in parallel during fine-tuning. The contribution of the original (uncompressed) model is progressively reduced, while that of the compressed model is gradually increased. This smooth transition allows the network to adapt over time, improving stability and mitigating accuracy degradation. We evaluate VCON across computer vision and natural language processing benchmarks, in combination with multiple compression strategies. Across all scenarios, VCON leads to consistent improvements: typical gains exceed 3%, while some configuration exhibits accuracy boosts of 20%. VCON thus provides a generalizable method that can be applied to the existing compression techniques, with evidence of consistent gains across multiple benchmarks.

Comment: Matches Model Compression and Efficiency: proposes a general training scheme (VCON) to smoothly transition models to compressed forms (pruning/quantization/low-rank) to mitigate accuracy loss.

Relevance: 9 Novelty: 7

31. Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware

ArXiv ID: 2510.11484

Authors: Lion Mueller, Alberto Garcia-Ortiz, Ardalan Najafi, Adam Fuks, Lennart Bamberg

Abstract: Integer AI inference significantly reduces computational complexity in embedded systems. Quantization-aware training (QAT) helps mitigate accuracy degradation associated with post-training quantization but still overlooks the impact of integer rescaling during inference, which is a hardware costly operation in integer-only AI inference. This work shows that rescaling cost can be dramatically reduced post-training, by applying a stronger quantization to the rescale multiplicands at no model-quality loss. Furthermore, we introduce Rescale-Aware Training, a fine tuning method for ultra-low bit-width rescaling multiplicands. Experiments show that even with 8x reduced rescaler widths, the full accuracy is preserved through minimal incremental retraining. This enables more energy-efficient and cost-efficient AI inference for resource-constrained embedded systems.

Comment: Matches Model Compression and Efficiency: quantization- and rescale-aware training for integer-only inference; reduces rescaler bitwidth post-training with minimal retraining.

Relevance: 9 Novelty: 7

32. Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

ArXiv ID: 2510.10964

Authors: Junhyuck Kim, Ethan Ewer, Taehong Moon, Jongho Park, Dimitris Papailiopoulos

Abstract: While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.

Comment: Compression/Efficiency: scale-dependent guidelines for allocating memory between weights, KV cache, and generation length; compares KV eviction vs quantization for reasoning models.

Relevance: 9 Novelty: 7

33. ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning

ArXiv ID: 2510.10071

Authors: Jinyang Zhang, Yue Fang, Hongxin Ding, Weibin Liao, Muyang Ye, Xu Chu, Junfeng Zhao, Yasha Wang

Abstract: Conventional continual pretraining (CPT) for large language model (LLM) domain adaptation often suffers from catastrophic forgetting and limited domain capacity. Existing strategies adopt layer expansion, introducing additional trainable parameters to accommodate new knowledge. However, the uniform expansion and updates still entangle general and domain learning, undermining its effectiveness. Our pilot studies reveal that LLMs exhibit functional specialization, where layers and units differentially encode general-critical capabilities, suggesting that parameter expansion and optimization should be function-aware. We then propose ADEPT, Adaptive Expansion and Dynamic Decoupled Tuning for continual pretraining, a two-stage framework for domain-adaptive CPT. ADEPT first performs General-Competence Guided Selective Layer Expansion, duplicating layers least critical for the general domain to increase representational capacity while minimizing interference with general knowledge. It then applies Adaptive Unit-Wise Decoupled Tuning, disentangling parameter units within expanded layers according to their general-domain importance and assigning asymmetric learning rates to balance knowledge injection and retention. Experiments on mathematical and medical benchmarks show that ADEPT outperforms full-parameter CPT by up to 5.76% on the general domain and 5.58% on the target domain with only 15% of parameters tuned and less than 50% training time. Ablation studies, theoretical analysis, and extended investigations further demonstrate the necessity of targeted expansion and decoupled optimization, providing new principles for efficient and robust domain-adaptive CPT. Our code is open-sourced at https://github.com/PuppyKnightUniversity/ADEPT

Comment: Matches Model Architecture/Efficiency: selective layer expansion and unit-wise decoupled tuning for parameter-efficient continual pretraining of LLMs.

Relevance: 9 Novelty: 7

34. Neural Weight Compression for Language Models

ArXiv ID: 2510.11234

Authors: Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee

Abstract: The efficient storage and transmission of language model weights is becoming increasingly important, as their scale and adoption continue to grow. However, as our understanding of this new data modality is limited, designing a good compression algorithm for language model weights heavily relies on manual, trial-and-error approaches. In this paper, we propose a learned compression framework that trains neural codecs directly from pretrained language model weights. Unlike conventional data (e.g., images), language model weights pose unique challenges: the sizes and shapes of weight tensors vary significantly, and the reconstruction quality must be judged by downstream model predictions rather than na\"ive MSE loss. To address this, we introduce Neural Weight Compression (NWC), a novel autoencoder-based neural codec tailored to model weight compression. The proposed method inherits the advantages of autoencoder-based codecs while incorporating three technical components: (1) column-wise tensor chunking and normalization; (2) an importance-aware training loss; (3) an inference-time error compensation mechanism guided by model outputs. Experiments on open-weight language models show that NWC achieves competitive or state-of-the-art accuracy-compression tradeoffs, with particularly strong results at 4-6 bit precisions where accuracy remains nearly on par with FP16 models.

Comment: Model Compression and Efficiency — learned autoencoder codec for LM weight compression with importance-aware loss and inference-time error compensation.

Relevance: 9 Novelty: 7

35. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

ArXiv ID: 2510.09665

Authors: Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang

Abstract: Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant computation by reusing KV caches across queries and to increase GPU utilization by disaggregating a single query to different engines, their promises cannot be realized without efficiently offloading and communicating KV cache across LLM inference engines and queries. We present LMCache, the first and so far the most efficient open-source KV caching solution, which extracts and stores KV caches generated by modern LLM engines (vLLM and SGLang) and shares the KV caches across engines and queries. LMCache exposes KV caches in the LLM engine interface, effectively transforming LLM engines from individual token processors to a collection of engines with KV cache as the storage and communication medium. In particular, it supports both cache offloading (prefix reuse across queries) and prefill-decode disaggregation (cross-engine cache transfer). LMCache's high performance and wide adoption stem from the following contributions: highly optimized KV cache data movement with performance optimizations including batched data movement operations, compute and I/O pipelining; a modular KV cache connector component, decoupling LMCache from the rapid evolution of inference engines; a first-class control API, such as pinning, lookup, cleanup, movement, and compression, for flexible cache orchestration across GPU, CPU, storage, and network layers. Evaluation shows that combining LMCache with vLLM achieves up to 15x improvement in throughput across diverse workloads. With a growing community, LMCache has seen dramatic growth in adoption by enterprise inference systems, which provides valuable lessons for future KV caching solutions. The source code of LMCache is at: https://github.com/LMCache/LMCache.

Comment: High Performance Computing — systems-level KV-cache offloading and cross-engine sharing with pipelined data movement and a control API for enterprise-scale LLM inference.

Relevance: 9 Novelty: 7

36. Hierarchical LoRA MoE for Efficient CTR Model Scaling

ArXiv ID: 2510.10432

Authors: Zhichen Zeng, Mengyue Hang, Xiaolong Liu, Xiaoyi Liu, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Zhining Liu, Siyang Yuan, Chaofei Yang, Yiqun Liu, Hang Yin, Jiyan Yang, Hanghang Tong

Abstract: Deep models have driven significant advances in click-through rate (CTR) prediction. While vertical scaling via layer stacking improves model expressiveness, the layer-by-layer sequential computation poses challenges to efficient scaling. Conversely, horizontal scaling through Mixture of Experts (MoE) achieves efficient scaling by activating a small subset of experts in parallel, but flat MoE layers may struggle to capture the hierarchical structure inherent in recommendation tasks. To push the Return-On-Investment (ROI) boundary, we explore the complementary strengths of both directions and propose HiLoMoE, a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner. Specifically, HiLoMoE employs lightweight rank-1 experts for parameter-efficient horizontal scaling, and stacks multiple MoE layers with hierarchical routing to enable combinatorially diverse expert compositions. Unlike conventional stacking, HiLoMoE routes based on prior layer scores rather than outputs, allowing all layers to execute in parallel. A principled three-stage training framework ensures stable optimization and expert diversity. Experiments on four public datasets show that HiLoMoE achieving better performance-efficiency tradeoff, achieving an average AUC improvement of 0.20\% in AUC and 18.5\% reduction in FLOPs compared to the non-MoE baseline.

Comment: Introduces a hierarchical LoRA-based MoE with routing-by-scores enabling parallel layer execution (model architecture + parameter-efficient scaling).

Relevance: 9 Novelty: 7

37. CauchyNet: Compact and Data-Efficient Learning using Holomorphic Activation Functions

ArXiv ID: 2510.10195

Authors: Hong-Kun Zhang, Xin Li, Sikun Yang, Zhihong Xia

Abstract: A novel neural network inspired by Cauchy's integral formula, is proposed for function approximation tasks that include time series forecasting, missing data imputation, etc. Hence, the novel neural network is named CauchyNet. By embedding real-valued data into the complex plane, CauchyNet efficiently captures complex temporal dependencies, surpassing traditional real-valued models in both predictive performance and computational efficiency. Grounded in Cauchy's integral formula and supported by the universal approximation theorem, CauchyNet offers strong theoretical guarantees for function approximation. The architecture incorporates complex-valued activation functions, enabling robust learning from incomplete data while maintaining a compact parameter footprint and reducing computational overhead. Through extensive experiments in diverse domains, including transportation, energy consumption, and epidemiological data, CauchyNet consistently outperforms state-of-the-art models in predictive accuracy, often achieving a 50% lower mean absolute error with fewer parameters. These findings highlight CauchyNet's potential as an effective and efficient tool for data-driven predictive modeling, particularly in resource-constrained and data-scarce environments.

Comment: Model Architecture: complex-valued holomorphic activation functions (Cauchy-inspired) enabling compact, data-efficient networks with theoretical guarantees.

Relevance: 9 Novelty: 7

38. CacheClip: Accelerating RAG with Effective KV Cache Reuse

ArXiv ID: 2510.10129

Authors: Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is finetuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) grouping strategy to maintain local coherence during partial KV cache updates. Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with reomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92x in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.

Comment: Model Compression/Efficiency: selective KV-cache recomputation and shared-prefix reuse for faster LLM inference in RAG (algorithmic cache optimization).

Relevance: 9 Novelty: 7

39. ICL-Router: In-Context Learned Model Representations for LLM Routing

ArXiv ID: 2510.09719

Authors: Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, Shuyue Hu

Abstract: Large language models (LLMs) often exhibit complementary strengths. Model routing harnesses these strengths by dynamically directing each query to the most suitable model, given a candidate model pool. However, routing performance relies on accurate model representations, and adding new models typically requires retraining, limiting scalability. To address these challenges, we propose a novel routing method using in-context vectors to represent model capabilities. The method proceeds in two stages. First, queries are embedded and projected into vectors, with a projector and LLM-based router trained to reconstruct the original queries, aligning vector representations with the router's semantic space. Second, each candidate model is profiled on a query set, and the router learns -- based on in-context vectors of query and model performance -- to predict whether each model can correctly answer new queries. Extensive experiments demonstrate that our method achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks. Moreover, our method allows for seamless integration of new models without retraining the router. The code is available at https://github.com/lalalamdbf/ICL-Router.

Comment: Matches Model Architecture (Dynamic Routing/MoE-style): learns in-context model representations to route queries across LLMs, enabling scalable routing without retraining.

Relevance: 9 Novelty: 7

40. Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

ArXiv ID: 2510.11354

Authors: Xuan Tang, Han Zhang, Yuan Cao, Difan Zou

Abstract: Adam is a popular and widely used adaptive gradient method in deep learning, which has also received tremendous focus in theoretical research. However, most existing theoretical work primarily analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while both Adam and AdamW with proper weight decay $\lambda$ converge to poor test error solutions, their mini-batch variants can achieve near-zero test error. We further prove Adam has a strictly smaller effective weight decay bound than AdamW, theoretically explaining why Adam requires more sensitive $\lambda$ tuning. Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam's generalization performance.

Comment: Matches training dynamics/generalization criterion: theoretical analysis of stochastic Adam vs full-batch and role of batch size/weight decay in neural nets.

Relevance: 8 Novelty: 8

41. Redundancy as a Structural Information Principle for Learning and Generalization

ArXiv ID: 2510.10938

Authors: Yuda Bi, Ying Zhu, Vince D Calhoun

Abstract: We present a theoretical framework that extends classical information theory to finite and structured systems by redefining redundancy as a fundamental property of information organization rather than inefficiency. In this framework, redundancy is expressed as a general family of informational divergences that unifies multiple classical measures, such as mutual information, chi-squared dependence, and spectral redundancy, under a single geometric principle. This reveals that these traditional quantities are not isolated heuristics but projections of a shared redundancy geometry. The theory further predicts that redundancy is bounded both above and below, giving rise to an optimal equilibrium that balances over-compression (loss of structure) and over-coupling (collapse). While classical communication theory favors minimal redundancy for transmission efficiency, finite and structured systems, such as those underlying real-world learning, achieve maximal stability and generalization near this equilibrium. Experiments with masked autoencoders are used to illustrate and verify this principle: the model exhibits a stable redundancy level where generalization peaks. Together, these results establish redundancy as a measurable and tunable quantity that bridges the asymptotic world of communication and the finite world of learning.

Comment: Representation Learning — theoretical framework redefining redundancy as a structural principle, predicting generalization equilibria and validated with masked autoencoders.

Relevance: 8 Novelty: 8

42. Y-shaped Generative Flows

ArXiv ID: 2510.11955

Authors: Arip Asadulaev, Semyon Semenov, Abduragim Shtanchaev, Eric Moulines, Fakhri Karray, Martin Takac

Abstract: Modern continuous-time generative models often induce V-shaped transport: each sample travels independently along nearly straight trajectories from prior to data, overlooking shared structure. We introduce Y-shaped generative flows, which move probability mass together along shared pathways before branching to target-specific endpoints. Our formulation is based on novel velocity-powered transport cost with a sublinear exponent (between zero and one). this concave dependence rewards joint and fast mass movement. Practically, we instantiate the idea in a scalable neural ODE training objective. On synthetic, image, and biology datasets, Y-flows recover hierarchy-aware structure, improve distributional metrics over strong flow-based baselines, and reach targets with fewer integration steps.

Comment: Matches Model Architecture: introduces Y-shaped generative flows with a new sublinear velocity-powered transport cost and scalable neural ODE training objective.

Relevance: 8 Novelty: 8

43. Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits

ArXiv ID: 2510.10690

Authors: Abdurakhmon Sadiev, Peter Richt\'arik, Ilyas Fatkhullin

Abstract: Heavy-tailed noise is pervasive in modern machine learning applications, arising from data heterogeneity, outliers, and non-stationary stochastic environments. While second-order methods can significantly accelerate convergence in light-tailed or bounded-noise settings, such algorithms are often brittle and lack guarantees under heavy-tailed noise -- precisely the regimes where robustness is most critical. In this work, we take a first step toward a theoretical understanding of second-order optimization under heavy-tailed noise. We consider a setting where stochastic gradients and Hessians have only bounded $p$-th moments, for some $p\in (1,2]$, and establish tight lower bounds on the sample complexity of any second-order method. We then develop a variant of normalized stochastic gradient descent that leverages second-order information and provably matches these lower bounds. To address the instability caused by large deviations, we introduce a novel algorithm based on gradient and Hessian clipping, and prove high-probability upper bounds that nearly match the fundamental limits. Our results provide the first comprehensive sample complexity characterization for second-order optimization under heavy-tailed noise. This positions Hessian clipping as a robust and theoretically sound strategy for second-order algorithm design in heavy-tailed regimes.

Comment: Optimization/Training Dynamics: robust second-order method with gradient/Hessian clipping under heavy-tailed noise and tight sample complexity bounds.

Relevance: 8 Novelty: 8

44. Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance

ArXiv ID: 2510.10706

Authors: Mamoona Ghafoor, Tatsuya Akutsu

Abstract: The generation of trees with a specified tree edit distance has significant applications across various fields, including computational biology, structured data analysis, and image processing. Recently, generative networks have been increasingly employed to synthesize new data that closely resembles the original datasets. However, the appropriate size and depth of generative networks required to generate data with a specified tree edit distance remain unclear. In this paper, we theoretically establish the existence and construction of generative networks capable of producing trees similar to a given tree with respect to the tree edit distance. Specifically, for a given rooted, ordered, and vertex-labeled tree T of size n + 1 with labels from an alphabet \Sigma, and a non-negative integer d, we prove that all rooted, ordered, and vertex-labeled trees over \Sigma with tree edit distance at most d from T can be generated using a ReLU-based generative network with size O(n^3 ) and constant depth. The proposed networks were implemented and evaluated for generating trees with up to 21 nodes. Due to their deterministic architecture, the networks successfully generated all valid trees within the specified tree edit distance. In contrast, state-of-the-art graph generative models GraphRNN and GraphGDP, which rely on non-deterministic mechanisms, produced significantly fewer valid trees, achieving validation rates of only up to 35% and 48%, respectively. These findings provide a theoretical foundation towards construction of compact generative models and open new directions for exact and valid tree-structured data generation. An implementation of the proposed networks is available at https://github.com/MGANN-KU/TreeGen_ReLUNetworks.

Comment: Model Architecture/Theory: constructs constant-depth ReLU generative networks (O(n^3)) to exactly enumerate tree-structured outputs by edit distance.

Relevance: 8 Novelty: 8

45. HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization

ArXiv ID: 2510.12266

Authors: Ziyi Han, Huanyu Wang, Zeyu Zhang, Xiangxiang Dai, Xutong Liu, John C. S. Lui

Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely used technique for adapting large language models (LLMs) to new domains, due to its modular design and broad availability on platforms such as HuggingFace. This availability has motivated efforts to reuse existing LoRAs for domain generalization. However, existing methods often rely on explicit task labels or additional training, which are impractical for deployment. Moreover, they typically activate a fixed number of entire LoRA modules, leading to parameter redundancy or insufficiency that degrade performance. In this paper, we propose \texttt{HiLoRA}, a training-free framework that performs adaptive hierarchical routing over LoRA pools. Drawing on structural properties of LoRA, we define rank-one components (ROCs), in which each rank parameter is regarded as an independent unit. For a given input sequence, \texttt{HiLoRA} first adaptively selects a subset of LoRAs and determines their ROC allocation based on Gaussian likelihoods at the sequence level. At the token level, it further refines routing by activating only the most informative ROCs. We further provide theoretical guarantees that \texttt{HiLoRA} selects the most relevant LoRAs with high probability. Extensive experiments show that \texttt{HiLoRA} achieves substantial improvements in domain generalization, with accuracy gains of up to {\small $55\%$} over state-of-the-art baselines, while maintaining comparable inference throughput.

Comment: Matches Model Architecture: adaptive, conditional routing over LoRA modules at rank-one granularity (MoE-like), enabling dynamic composition without training.