Personalized Daily ArXiv Papers 2025-10-23

[gpt-5]	Prompt	Completion	Total
Token	44031	36622	80653
Cost	$0.06	$0.37	$0.42

Total arXiv papers: 569

Total scanned papers: 302

Total relevant papers: 28

Table of contents with paper titles:

Transformers are Inherently Succinct Authors: Pascal Bergsträßer, Ryan Cotterell, Anthony W. Lin
HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission Authors: Weihao Yang, Hao Huang, Donglei Wu, Ningke Li, Yanqi Pan, Qiyang Zheng, Wen Xia, Shiyi Li, Qiang Wang
ARA: Adaptive Rank Allocation for Efficient Large Language Model SVD Compression Authors: Lin Xv, Jingsheng Gao, Xian Gao, Ting Liu, Yuzhuo Fu
Transformers are almost optimal metalearners for linear classification Authors: Roey Magen, Gal Vardi
MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs Authors: Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li
NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning Authors: Zhi Zhang, Yixian Shen, Congfeng Cao, Ekaterina Shutova
Fast Inference via Hierarchical Speculative Decoding Authors: Amir Globerson, Haim Kaplan, Yishay Mansour, Clara Mohri, Tal Schuster
A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond Authors: Nikos Tsikouras, Yorgos Pantis, Ioannis Mitliagkas, Christos Tzamos
ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices Authors: Xin Nie, Liang Dong, HaiCheng Zhang, JiaWang Xiao, G. Sun
When Do Transformers Learn Heuristics for Graph Connectivity? Authors: Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan
Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation Authors: Chenyu Wang, Zhanglu Yan, Zhi Zhou, Xu Chen, Weng-Fai Wong
Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge Authors: Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, Kai Wang
CPSVD: Enhancing Large Language Model Compression via Column-Preserving Singular Value Decomposition Authors: Lin Xv, Jingsheng Gao, Xian Gao, Ting Li, Yuzhuo Fu
GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters Authors: Anand Choudhary, Yasser Sulaıman, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, Antoine Bosselut
AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders Authors: Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
Feature Space Adaptation for Robust Model Fine-Tuning Authors: Peng Wang, Minghao Gu, Qiang Huang
Latent Space Factorization in LoRA Authors: Shashi Kumar, Yacouba Kaloga, John Mitros, Petr Motlicek, Ina Kodrasi
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning Authors: Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network Authors: Matthew Raffel, Adwaith Renjith, Lizhong Chen
Study of Training Dynamics for Memory-Constrained Fine-Tuning Authors: Aël Quélennec, Nour Hezbri, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione
RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs Authors: Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, Ion Stoica
Category learning in deep neural networks: Information content and geometry of internal representations Authors: Laurent Bonnasse-Gahot, Jean-Pierre Nadal
Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning Authors: Gunshi Gupta, Karmesh Yadav, Zsolt Kira, Yarin Gal, Rahaf Aljundi
Understanding the Implicit Biases of Design Choices for Time Series Foundation Models Authors: Annan Yu, Danielle C. Maddix, Boran Han, Xiyuan Zhang, Abdul Fatir Ansari, Oleksandr Shchur, Christos Faloutsos, Andrew Gordon Wilson, Michael W. Mahoney, Yuyang Wang
Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning Authors: M. H. I. Abdalla, Zhipin Wang, Christian Frey, Steffen Eger, Josif Grabocka
Weight Decay may matter more than muP for Learning Rate Transfer in Practice Authors: Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen
Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks Authors: Changbo Wu, Zhuolong Yu, Gongming Zhao, Hongli Xu
Knowledge Distillation of Uncertainty using Deep Latent Factor Model Authors: Sehyun Park, Jongjin Lee, Yunseop Shin, Ilsang Ohn, Yongdai Kim

1. Transformers are Inherently Succinct

ArXiv ID: 2510.19315

Authors: Pascal Bergsträßer, Ryan Cotterell, Anthony W. Lin

Abstract: We propose succinctness as a measure of the expressive power of a transformer in describing a concept. To this end, we prove that transformers are highly expressive in that they can represent formal languages substantially more succinctly than standard representations of formal languages like finite automata and Linear Temporal Logic (LTL) formulas. As a by-product of this expressivity, we show that verifying properties of transformers is provably intractable (i.e. EXPSPACE-complete).

Comment: Model Architecture Theory: proves transformers’ high succinctness vs automata/LTL and EXPSPACE-complete verification.

Relevance: 10 Novelty: 8

2. HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission

ArXiv ID: 2510.19470

Authors: Weihao Yang, Hao Huang, Donglei Wu, Ningke Li, Yanqi Pan, Qiyang Zheng, Wen Xia, Shiyi Li, Qiang Wang

Abstract: Mixture-of-Experts (MoE) has become a popular architecture for scaling large models. However, the rapidly growing scale outpaces model training on a single DC, driving a shift toward a more flexible, cross-DC training paradigm. Under this, Expert Parallelism (EP) of MoE faces significant scalability issues due to the limited cross-DC bandwidth. Specifically, existing EP optimizations attempt to overlap data communication and computation, which has little benefit in low-bandwidth scenarios due to a much longer data communication time. Therefore, the trend of cross-DC EP scaling is fast becoming a critical roadblock to the continued growth of MoE models. To address this, we propose HybridEP, a modeling-guided framework to optimize EP under constrained bandwidth. Our key idea is to dynamically transform the spatial placement of experts to reduce data communication traffic and frequency, thereby minimizing EP's communication overheads. However, it is non-trivial to find the optimal solution because it complicates the original communication pattern by mixing data and expert communication. We therefore build a stream-based model to determine the optimal transmission ratio. Guided by this, we incorporate two techniques: (1) domain-based partition to construct the mapping between hybrid patterns and specific communication topology at GPU level, and (2) parameter-efficient migration to further refine this topology by reducing expert transmission overhead and enlarging the domain size. Combining all these designs, HybridEP can be considered as a more general EP with better scalability. Experimental results show that HybridEP outperforms existing state-of-the-art MoE training systems by up to 5.6x under constrained bandwidth. We further compare HybridEP and EP on large-scale simulations. HybridEP achieves up to 1.45x speedup with 1k DCs under different bandwidths.

Comment: Matches HPC and MoE scaling: HybridEP introduces modeling-guided hybrid expert/data transmission and topology/domain partitioning to scale Expert Parallelism across datacenters under bandwidth constraints.

Relevance: 10 Novelty: 8

3. ARA: Adaptive Rank Allocation for Efficient Large Language Model SVD Compression

ArXiv ID: 2510.19389

Authors: Lin Xv, Jingsheng Gao, Xian Gao, Ting Liu, Yuzhuo Fu

Abstract: In the field of large language model (LLM) compression, singular value decomposition (SVD) is a widely studied and adopted low-rank decomposition technique. Since SVD operates exclusively on linear modules, and these modules in LLMs are separated by nonlinear components, SVD can only be applied independently to each linear module. Under a global compression ratio constraint, determining the appropriate rank for different linear modules becomes a critical problem. Existing approaches, such as heuristic algorithms and mask-based training, have made progress in addressing this challenge. However, these methods still suffer from several limitations: heuristic algorithms explore the solution space within restricted regions, while mask-based training struggles to efficiently capture the relationship between singular value spectra and trainable parameters. More importantly, current methods overlook the key property that the gain function is non-smooth at a compression ratio of 1, which often leads the training process to suboptimal local minima. To address these issues, we propose an Adaptive Rank Allocation (ARA) method. Specifically, (1) ARA introduces a dedicated mask design that enables efficient mapping and updating between retained ranks and trainable parameters; and (2) it employs an additional loss function to guide parameter selection toward globally optimal solutions. Experimental results demonstrate that ARA achieves state-of-the-art performance. On the LLaMA2-7B model with an 80% compression ratio, ARA reduces perplexity on WikiText2 from 8.38 to 6.42 and improves average zero-shot task accuracy by 9.72 percentage points compared with uniform compression. These results highlight the effectiveness of our method for rank allocation in SVD-based LLM compression.

Comment: Matches Compression/Efficiency: Adaptive Rank Allocation for SVD-based LLM compression with a new mask design and loss to optimize per-layer ranks under global constraints.

Relevance: 10 Novelty: 7
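
The rank-allocation constraint described in the ARA abstract can be made concrete with a small parameter count for a single linear module. This is only a sketch: it assumes the compression ratio means the fraction of parameters retained, and that a rank-r factorization of an m x n matrix stores r(m + n) values. The function names are hypothetical and not taken from the paper.

```python
def factorized_params(m: int, n: int, rank: int) -> int:
    """Values stored by a rank-`rank` factorization U (m x rank) @ V (rank x n)."""
    return rank * (m + n)


def max_rank_for_ratio(m: int, n: int, ratio: float) -> int:
    """Largest rank whose factors keep at most `ratio` of the dense m*n values.

    Assumption: `ratio` is the retained-parameter fraction, in (0, 1].
    """
    budget = int(ratio * m * n)
    return budget // (m + n)


def break_even_rank(m: int, n: int) -> int:
    """Largest rank at which the factors are no larger than the dense matrix."""
    return (m * n) // (m + n)


if __name__ == "__main__":
    m = n = 4096  # illustrative square projection, not a value from the paper
    rank = max_rank_for_ratio(m, n, 0.8)
    print("rank under budget:", rank)
    print("retained fraction:", factorized_params(m, n, rank) / (m * n))
    print("break-even rank:", break_even_rank(m, n))
```

Above the break-even rank a factorization stores more values than the dense matrix, so keeping the dense weights is the cheaper choice for that module.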

4. Transformers are almost optimal metalearners for linear classification

ArXiv ID: 2510.19797

Authors: Roey Magen, Gal Vardi

Abstract: Transformers have demonstrated impressive in-context learning (ICL) capabilities, raising the question of whether they can serve as metalearners that adapt to new tasks using only a small number of in-context examples, without any further training. While recent theoretical work has studied transformers' ability to perform ICL, most of these analyses do not address the formal metalearning setting, where the objective is to solve a collection of related tasks more efficiently than would be possible by solving each task individually. In this paper, we provide the first theoretical analysis showing that a simplified transformer architecture trained via gradient descent can act as a near-optimal metalearner in a linear classification setting. We consider a natural family of tasks where each task corresponds to a class-conditional Gaussian mixture model, with the mean vectors lying in a shared $k$-dimensional subspace of $\mathbb{R}^d$. After training on a sufficient number of such tasks, we show that the transformer can generalize to a new task using only $O(k / R^4)$ in-context examples, where $R$ denotes the signal strength at test time. This performance (almost) matches that of an optimal learner that knows exactly the shared subspace and significantly outperforms any learner that only has access to the in-context data, which requires $\Omega(d / R^4)$ examples to generalize. Importantly, our bounds on the number of training tasks and examples per task needed to achieve this result are independent of the ambient dimension $d$.

Comment: Representation Learning/Architecture Theory: theoretical proof that (simplified) transformers are near-optimal metalearners for linear classification.

Relevance: 9 Novelty: 8
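
The gap between the two sample-complexity bounds quoted in this abstract can be read off by dividing them. The worked ratio below ignores constants and logarithmic factors, and the values of $d$ and $k$ are hypothetical, chosen only for illustration.

```latex
\[
\frac{n_{\text{in-context only}}}{n_{\text{metalearner}}}
\;\sim\; \frac{d / R^4}{k / R^4}
\;=\; \frac{d}{k},
\qquad
\text{e.g. } d = 1000,\; k = 10
\;\Longrightarrow\; \frac{d}{k} = 100 .
\]
```

The signal strength $R$ cancels, so under these bounds the saving depends only on how much smaller the shared subspace is than the ambient space.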

5. MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs

ArXiv ID: 2510.19366

Authors: Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li

Abstract: Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a "quality cliff", offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an Offline Refactoring Engine systematically deconstructs monolithic experts into fine-grained "sub-experts." This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons, preserving functional locality without requiring retraining. Second, an Online Scheduling Engine leverages this new elasticity through QoS-aware scheduling. It implements specialized policies to solve complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9% under a strict latency budget or reduce latency by up to 10.36% under limited resources. MoE-Prism provides the critical "control knob" to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.

Comment: Model Architecture and Systems Efficiency: MoE expert partitioning into fine-grained sub-experts plus QoS-aware scheduling for elastic inference.

Relevance: 9 Novelty: 8

6. NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning

ArXiv ID: 2510.18940

Authors: Zhi Zhang, Yixian Shen, Congfeng Cao, Ekaterina Shutova

Abstract: Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model finetuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During finetuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with $\leq 0.02\%$ trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: https://github.com/FightingFighting/NeuroAda.git.

Comment: Model Compression/Efficiency: PEFT via bypass connections on selected parameters enabling ≤0.02% trainable weights.

Relevance: 9 Novelty: 8
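
The bypass idea in the NeuroAda abstract (freeze the original weights, train only small additive terms on selected connections) can be sketched in plain Python. The selection rule used here, the top-k largest-magnitude inputs per neuron, is an assumption made for illustration, since the abstract does not state the criterion. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Index = Tuple[int, int]


def select_per_neuron(weight: List[List[float]], k: int) -> List[Index]:
    """Assumed rule: for each output neuron (row), pick the k inputs with largest |w|."""
    selected: List[Index] = []
    for i, row in enumerate(weight):
        top = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)[:k]
        selected.extend((i, j) for j in sorted(top))
    return selected


def forward(weight: List[List[float]], bypass: Dict[Index, float], x: List[float]) -> List[float]:
    """Linear layer output using frozen weights plus the trainable bypass terms."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    for (i, j), delta in bypass.items():
        out[i] += delta * x[j]
    return out


def update_bypass(bypass: Dict[Index, float], grad_out: List[float],
                  x: List[float], lr: float) -> Dict[Index, float]:
    """One SGD step on the bypass terms only; d out[i] / d delta_ij = x[j]."""
    return {(i, j): d - lr * grad_out[i] * x[j] for (i, j), d in bypass.items()}


if __name__ == "__main__":
    weight = [[0.1, -2.0, 0.3], [1.5, 0.2, -0.4]]  # frozen, never modified
    bypass = {idx: 0.0 for idx in select_per_neuron(weight, k=1)}
    x = [1.0, 2.0, 3.0]
    print(forward(weight, bypass, x))
    bypass = update_bypass(bypass, grad_out=[1.0, -1.0], x=x, lr=0.1)
    print(forward(weight, bypass, x))
```

Because the bypass terms start at zero, the layer initially reproduces the pretrained output, and the number of trainable values is k per neuron rather than the full weight matrix.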

7. Fast Inference via Hierarchical Speculative Decoding

ArXiv ID: 2510.19705

Authors: Amir Globerson, Haim Kaplan, Yishay Mansour, Clara Mohri, Tal Schuster

Abstract: Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models, ranging from faster but less accurate to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.

Comment: High-Performance Inference: hierarchical speculative decoding with latency-optimal hierarchy selection.

Relevance: 9 Novelty: 8
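
The propose-then-verify chain described in the HSD abstract can be illustrated with a toy greedy variant. This is a simplification under stated assumptions: each model is a deterministic next-token function, verification accepts the longest prefix that matches the verifier's own greedy choice, and the three toy models are hypothetical. The paper's actual acceptance rule and latency model are not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Tuple[int, ...]], int]


def target(prefix: Tuple[int, ...]) -> int:
    """Toy target model (hypothetical): deterministic next token in 0..9."""
    return (sum(prefix) * 7 + len(prefix)) % 10


def mid(prefix: Tuple[int, ...]) -> int:
    """Toy mid-size draft: disagrees with the target at every 5th position."""
    return target(prefix) if len(prefix) % 5 else (target(prefix) + 1) % 10


def small(prefix: Tuple[int, ...]) -> int:
    """Toy small draft: disagrees with the target at every 3rd position."""
    return target(prefix) if len(prefix) % 3 else (target(prefix) + 1) % 10


def draft(model: Model, context: Sequence[int], k: int) -> List[int]:
    """Propose k tokens autoregressively with the smallest model."""
    out = list(context)
    for _ in range(k):
        out.append(model(tuple(out)))
    return out[len(context):]


def verify(model: Model, context: Sequence[int], proposal: Sequence[int]) -> List[int]:
    """Keep the longest agreeing prefix, then append the verifier's own token.

    A real system scores all positions in one forward pass; this loop is sequential.
    """
    out = list(context)
    for tok in proposal:
        want = model(tuple(out))
        if tok != want:
            out.append(want)
            return out[len(context):]
        out.append(tok)
    out.append(model(tuple(out)))
    return out[len(context):]


def hierarchical_generate(models: Sequence[Model], prompt: Sequence[int],
                          n_new: int, k: int) -> List[int]:
    """models are ordered smallest to largest; the last one is the target."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        proposal = draft(models[0], out, k)
        for verifier in models[1:]:
            proposal = verify(verifier, out, proposal)
        out.extend(proposal)
    return out[:goal]


def greedy_decode(model: Model, prompt: Sequence[int], n_new: int) -> List[int]:
    """Reference: plain autoregressive greedy decoding."""
    return list(prompt) + draft(model, prompt, n_new)


if __name__ == "__main__":
    print(hierarchical_generate([small, mid, target], [1, 2], n_new=12, k=4))
```

Since the final verifier is the target, every emitted token equals the target's own greedy choice, so the output matches plain greedy decoding of the target; the hierarchy only changes how many target calls are needed.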

8. A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond

ArXiv ID: 2510.19382

Authors: Nikos Tsikouras, Yorgos Pantis, Ioannis Mitliagkas, Christos Tzamos

Abstract: Understanding the dynamics of feature learning in neural networks (NNs) remains a significant challenge. The work of (Mousavi-Hosseini et al., 2023) analyzes a multiple index teacher-student setting and shows that a two-layer student attains a low-rank structure in its first-layer weights when trained with stochastic gradient descent (SGD) and a strong regularizer. This structural property is known to reduce sample complexity of generalization. Indeed, in a second step, the same authors establish algorithm-specific learning guarantees under additional assumptions. In this paper, we focus exclusively on the structure discovery aspect and study it under weaker assumptions, more specifically: we allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) tiny regularization, and (e) trained by any method that attains a second-order stationary point (SOSP), e.g. perturbed gradient descent (PGD). At the core of our approach is a key derandomization lemma, which states that optimizing the function $\mathbb{E}_{\mathbf{x}} \left[g(\mathbf{W}\mathbf{x} + \mathbf{b})\right]$ converges to a point where $\mathbf{W} = \mathbf{0}$, under mild conditions. The fundamental nature of this lemma directly explains structure discovery and has immediate applications in other domains including an end-to-end approximation for MAXCUT, and computing Johnson-Lindenstrauss embeddings.

Comment: Representation Learning/Training Dynamics: derandomization lemma explaining structure discovery (low-rank) in neural networks under broad conditions.

Relevance: 9 Novelty: 8

9. ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices

ArXiv ID: 2510.19482

Authors: Xin Nie, Liang Dong, HaiCheng Zhang, JiaWang Xiao, G. Sun

Abstract: The deployment of Large Language Models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. However, it remains challenging due to limited memory and computational resources. During edge inference, memory usage and latency are the primary bottlenecks. Although weight quantization can effectively reduce memory consumption, existing hardware-friendly approaches often rely on uniform quantization, which poorly fits weight distributions and incurs high dequantization overhead at low bit widths. To address these limitations, we propose ELUTQ, an efficient quantization framework introducing a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing the computational cost of Bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. It is orthogonal to existing quantization algorithms and can be seamlessly integrated into various quantization pipelines. For efficient on-device deployment, ELUTQ provides optimized CPU kernels for end-to-end inference. Experiments show that for LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and 85% at 2-bit precision under post-training quantization, completing quantization within one hour. With efficient finetuning, HLQ further improves 2-bit performance within two hours. In terms of inference efficiency, our 2-bit LLaMA2-7B achieves over 25 tokens/s on an Apple M2 chip (4 threads, batch size = 1).

Comment: Model Compression/Efficiency: LUT-aware hierarchical linear quantization (HLQ) and optimized CPU kernels for LLM edge deployment.

Relevance: 9 Novelty: 8

10. When Do Transformers Learn Heuristics for Graph Connectivity?

ArXiv ID: 2510.19753

Authors: Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan

Abstract: Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an $L$-layer model has capacity to solve for graphs with diameters up to exactly $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training dynamics and show that the learned strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of a correct algorithmic solution while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data within a model's capacity leads to both standard and disentangled transformers learning the exact algorithm rather than the degree-based heuristic.

Comment: Matches Model Architecture and Representation Learning: theoretical and empirical analysis of when Transformers learn correct algorithms vs heuristics on graph connectivity, tied to depth/diameter capacity and training dynamics.

Relevance: 9 Novelty: 8
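
The link between depth and the $3^L$ diameter bound in this abstract can be illustrated with boolean powers of the adjacency matrix. The sketch below assumes each layer cubes the current reachability matrix, which is one simple way to obtain a reach of $3^L$ hops after $L$ layers. It is an illustration only, not the paper's construction, and the function names are hypothetical.

```python
from typing import List

BoolMatrix = List[List[bool]]


def bool_matmul(a: BoolMatrix, b: BoolMatrix) -> BoolMatrix:
    """Boolean matrix product: entry (i, j) is True if some k links i to j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reachability_after_layers(adj: List[List[int]], num_layers: int) -> BoolMatrix:
    """Start from A + I (paths of length <= 1) and cube once per layer.

    Assumption: one cubing per layer, so L layers cover paths of length <= 3**L.
    """
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for _ in range(num_layers):
        reach = bool_matmul(bool_matmul(reach, reach), reach)
    return reach


def path_graph(n: int) -> List[List[int]]:
    """Undirected path 0 - 1 - ... - (n-1); its diameter is n - 1."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1
    return adj


if __name__ == "__main__":
    adj = path_graph(10)  # diameter 9 = 3**2
    print(reachability_after_layers(adj, 1)[0][9])
    print(reachability_after_layers(adj, 2)[0][9])
```

On a 10-node path the endpoints are 9 hops apart, so one cubing step cannot connect them while two can, mirroring the within-capacity versus beyond-capacity distinction in the abstract.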

11. Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation

ArXiv ID: 2510.19498

Authors: Chenyu Wang, Zhanglu Yan, Zhi Zhou, Xu Chen, Weng-Fai Wong

Abstract: In the era of large language models (LLMs), weight-activation quantization helps fit models on edge devices by reducing memory and compute bit-widths. However, three challenges persist for energy-constrained hardware: (1) even after quantization, multiply-accumulate (MAC) operations remain unavoidable and continue to dominate energy consumption; (2) dequantization (or per-tensor/channel rescaling) introduces extra arithmetic and data movement, increasing latency and energy; (3) uniform parameter bit-widths clip salient values, while intra-channel mixed precision is generally impractical on current matrix hardware and memory. In contrast, brain-inspired Spiking Neural Networks (SNNs), owing to their binary spike-based information representation and the Integrate-and-Fire (IF) paradigm, naturally support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal accumulate operations (ACCs). Motivated by this property, we propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts, thereby enabling dynamic mixed storage of different bit-widths. Furthermore, by embedding the quantization scale into the threshold of the IF mechanism, our approach performs energy-efficient linear transformations on weights and activations while avoiding explicit dequantization. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods, highlighting its effectiveness for accurate and energy-efficient LLM deployment.

Comment: Model Compression and Efficiency: proposes dequantization-free mixed-precision quantization for LLMs via SNN-style spike encoding, reducing MAC energy.

Relevance: 9 Novelty: 7

12. Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge

ArXiv ID: 2510.19266

Authors: Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, Kai Wang

Abstract: State-space models (SSMs) have emerged as efficient alternatives to Transformers for sequence modeling, offering superior scalability through recurrent structures. However, their training remains costly and the ecosystem around them is far less mature than that of Transformers. Moreover, the structural heterogeneity between SSMs and Transformers makes it challenging to efficiently distill knowledge from pretrained attention models. In this work, we propose Cross-architecture distillation via Attention Bridge (CAB), a novel data-efficient distillation framework that efficiently transfers attention knowledge from Transformer teachers to state-space student models. Unlike conventional knowledge distillation that transfers knowledge only at the output level, CAB enables token-level supervision via a lightweight bridge and flexible layer-wise alignment, improving both efficiency and transferability. We further introduce flexible layer-wise alignment strategies to accommodate architectural discrepancies between teacher and student. Extensive experiments across vision and language domains demonstrate that our method consistently improves the performance of state-space models, even under limited training data, outperforming both standard and cross-architecture distillation methods. Our findings suggest that attention-based knowledge can be efficiently transferred to recurrent models, enabling rapid utilization of Transformer expertise for building a stronger SSM community.

Comment: Model Compression and Efficiency: cross-architecture distillation from Transformers to SSMs via an attention bridge with token-level supervision and layer-wise alignment.

Relevance: 9 Novelty: 7

13. CPSVD: Enhancing Large Language Model Compression via Column-Preserving Singular Value Decomposition

ArXiv ID: 2510.19385

Authors: Lin Xv, Jingsheng Gao, Xian Gao, Ting Li, Yuzhuo Fu

Abstract: The rapid advancement of Large Language Models (LLMs) faces a critical bottleneck in their immense size, necessitating efficient compression techniques. While Singular Value Decomposition (SVD) is a promising approach, existing SVD-based methods treat the entire parameter matrix uniformly, overlooking that SVD approximation errors vary significantly across different matrix parts, which often leads to suboptimal compression. To address this, we propose Column-Preserving Singular Value Decomposition (CPSVD), a novel method that refines SVD-based LLM compression by intelligently segmenting the parameter matrix. Unlike traditional SVD, CPSVD identifies and directly preserves matrix columns with high decomposition errors, applying SVD only to columns with low decomposition errors, while precisely determining the optimal balance point between these two strategies to minimize error. Furthermore, leveraging the inherent heterogeneity in decomposition errors across different matrices within an LLM, CPSVD adaptively allocates non-uniform compression rates to modules within that layer, while adhering to a target layer-wise compression ratio, thereby further enhancing compression performance. Extensive experiments demonstrate that CPSVD consistently outperforms state-of-the-art SVD-based LLM compression methods, achieving lower perplexity and higher accuracy on zero-shot tasks.

Comment: Model Compression and Efficiency: column-preserving SVD with adaptive per-module compression for LLMs (low-rank plus selective column retention).

Relevance: 9 Novelty: 7

14. GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters

ArXiv ID: 2510.19778

Authors: Anand Choudhary, Yasser Sulaıman, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, Antoine Bosselut

Abstract: Sparse fine-tuning techniques adapt LLMs to downstream tasks by only tuning a sparse subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which fine-tunes only those model parameters which have the largest gradient magnitudes on downstream tasks and the smallest pre-trained magnitudes, intuitively prioritizing parameters that are highly task-relevant, but minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3 8B and Gemma 2B as base models shows that GaLLoP consistently improves or matches the in-distribution as well as out-of-distribution performance obtained via the usage of other leading parameter-efficient fine-tuning techniques, including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates catastrophic forgetting and memorization of task data, as important pre-trained parameters remain unchanged, and stabilizes performance relative to other fine-tuning techniques, robustly generalizing across most random seeds.

Comment: Model Compression and Efficiency: sparse fine-tuning by selecting parameters with large gradients and low pre-trained magnitudes to preserve knowledge.

Relevance: 9 Novelty: 7
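
The selection rule described in the GaLLoP abstract (large gradient magnitude, small pre-trained magnitude) can be sketched as a scoring-and-masking step. Combining the two criteria as the ratio |gradient| / |weight| is an assumption made here for illustration, since the abstract does not give the exact formula. The function names are hypothetical.

```python
from typing import List


def select_sparse_indices(weights: List[float], grads: List[float],
                          k: int, eps: float = 1e-8) -> List[int]:
    """Pick k parameters with a high gradient magnitude and a low weight magnitude.

    Assumption: the two criteria are combined as |grad| / (|weight| + eps).
    """
    scores = [abs(g) / (abs(w) + eps) for w, g in zip(weights, grads)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_sgd_step(weights: List[float], grads: List[float],
                    selected: List[int], lr: float) -> List[float]:
    """Update only the selected parameters; all others keep their pre-trained values."""
    chosen = set(selected)
    return [w - lr * g if i in chosen else w
            for i, (w, g) in enumerate(zip(weights, grads))]


if __name__ == "__main__":
    weights = [0.01, 2.0, 0.5, 0.02]   # toy pre-trained values
    grads = [0.5, 0.6, 0.01, 0.001]    # toy downstream-task gradients
    selected = select_sparse_indices(weights, grads, k=2)
    print(selected)
    print(sparse_sgd_step(weights, grads, selected, lr=0.1))
```

Parameters outside the selected set are returned unchanged, which corresponds to the abstract's point that important pre-trained parameters are left untouched.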

15. AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

ArXiv ID: 2510.19779

Authors: Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao

Abstract: Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.

Comment: High-Performance Inference: selective knowledge distillation tailored to maximize token acceptance in speculative decoding.

Relevance: 9 Novelty: 7

16. Feature Space Adaptation for Robust Model Fine-Tuning

ArXiv ID: 2510.19155

Authors: Peng Wang, Minghao Gu, Qiang Huang

Abstract: Catastrophic forgetting is a common issue in model fine-tuning, especially when the downstream domain contains limited labeled data or differs greatly from the pre-training distribution. Existing parameter-efficient fine-tuning methods operate in the weight space by modifying or augmenting the pre-trained model's parameters, which can yield models overly specialized to the available downstream data. To mitigate the risk of overwriting pre-trained knowledge and enhance robustness, we propose to fine-tune the pre-trained model in the feature space. Two new fine-tuning methods are proposed: LoRFA (Low-Rank Feature Adaptation) and VeFA (Vector-Based Feature Adaptation). Feature space adaptation is inspired by the idea of effect equivalence modeling (EEM) of downstream lurking variables causing distribution shifts, which posits that unobserved factors can be represented as the total equivalent amount on observed features. By compensating for the effects of downstream lurking variables via a lightweight feature-level transformation, the pre-trained representations can be preserved, which improves model generalization under distribution shift. We evaluate LoRFA and VeFA versus LoRA on image classification, NLU, and NLG, covering both standard fine-tuning metrics and robustness. Feature space adaptation achieves comparable fine-tuning results and consistently stronger robustness.

Comment: Model Compression/Efficiency: PEFT in feature space (LoRFA/VeFA) with low-rank/vector transformations to preserve pretrained representations under shift.

Relevance: 9 Novelty: 7
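
To make the contrast with weight-space methods concrete, the sketch below applies a small trainable transformation to the input features of a frozen layer. The abstract does not give the exact form of LoRFA or VeFA, so the residual low-rank map `x + up @ (down @ x)`, the element-wise scaling vector, and every name and value here are assumptions, not the paper's definitions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def low_rank_feature_adapter(x, down, up):
    """Assumed low-rank form: x + up @ (down @ x), a rank-r correction to the features."""
    correction = matvec(up, matvec(down, x))
    return [xi + ci for xi, ci in zip(x, correction)]

def vector_feature_adapter(x, scale):
    """Assumed vector form: element-wise rescaling of the features."""
    return [s * xi for s, xi in zip(scale, x)]

# Pre-trained layer weights: frozen, never updated during fine-tuning.
frozen_weight = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
x = [1.0, 2.0, 3.0]

# Identity initialisation: zero "up" matrix and all-ones scale vector,
# so fine-tuning starts exactly at the pre-trained model's behaviour.
down = [[0.5, 0.5, 0.5]]         # shape (1, 3), rank 1
up = [[0.0], [0.0], [0.0]]       # shape (3, 1)
y_pretrained = matvec(frozen_weight, x)
y_low_rank = matvec(frozen_weight, low_rank_feature_adapter(x, down, up))
y_vector = matvec(frozen_weight, vector_feature_adapter(x, [1.0, 1.0, 1.0]))

# After a (hypothetical) training step only the adapter changes.
up_trained = [[0.1], [0.0], [0.0]]
y_adapted = matvec(frozen_weight, low_rank_feature_adapter(x, down, up_trained))
```

Only `down`, `up`, and the scale vector would receive gradients in such a setup; `frozen_weight` stays untouched, which is the property the abstract credits for preserving the pre-trained representations.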

17. Latent Space Factorization in LoRA

ArXiv ID: 2510.19640

Authors: Shashi Kumar, Yacouba Kaloga, John Mitros, Petr Motlicek, Ina Kodrasi

Abstract: Low-rank adaptation (LoRA) is a widely used method for parameter-efficient finetuning. However, existing LoRA variants lack mechanisms to explicitly disambiguate task-relevant information within the learned low-rank subspace, potentially limiting downstream performance. We propose Factorized Variational Autoencoder LoRA (FVAE-LoRA), which leverages a VAE to learn two distinct latent spaces. Our novel Evidence Lower Bound formulation explicitly promotes factorization between the latent spaces, dedicating one latent space to task-salient features and the other to residual information. Extensive experiments on text, audio, and image tasks demonstrate that FVAE-LoRA consistently outperforms standard LoRA. Moreover, spurious correlation evaluations confirm that FVAE-LoRA better isolates task-relevant signals, leading to improved robustness under distribution shifts. Our code is publicly available at: https://github.com/idiap/FVAE-LoRA

Comment: Matches Compression/Efficiency and Model Architecture: a LoRA variant (FVAE-LoRA) that factorizes task-salient vs residual latent spaces via a new ELBO for parameter-efficient finetuning with improved robustness.

Relevance: 9 Novelty: 7

18. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

ArXiv ID: 2510.19338

Authors: Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou

Abstract: In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.

Comment: Matches Model Architecture and Efficiency: proposes a hybrid linear+softmax attention architecture for long-context with FP8 operator support, reducing compute/I-O while maintaining reasoning performance.

Relevance: 9 Novelty: 7

19. MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network

ArXiv ID: 2510.19105

Authors: Matthew Raffel, Adwaith Renjith, Lizhong Chen

Abstract: Kolmogorov-Arnold Networks (KANs) replace scalar weights with per-edge vectors of basis coefficients, thereby boosting expressivity and accuracy but at the same time resulting in a multiplicative increase in parameters and memory. We propose MetaCluster, a framework that makes KANs highly compressible without sacrificing accuracy. Specifically, a lightweight meta-learner, trained jointly with the KAN, is used to map low-dimensional embeddings to coefficient vectors, shaping them to lie on a low-dimensional manifold that is amenable to clustering. We then run K-means in coefficient space and replace per-edge vectors with shared centroids. Afterwards, the meta-learner can be discarded, and a brief fine-tuning of the centroid codebook recovers any residual accuracy loss. The resulting model stores only a small codebook and per-edge indices, exploiting the vector nature of KAN parameters to amortize storage across multiple coefficients. On MNIST, CIFAR-10, and CIFAR-100, across standard KANs and ConvKANs using multiple basis functions, MetaCluster achieves a reduction of up to 80x in parameter storage, with no loss in accuracy. Code will be released upon publication.
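
The clustering step described in this abstract (K-means in coefficient space, then a shared codebook plus per-edge indices) can be sketched as follows. This covers only that step, not the meta-learner or the codebook fine-tuning; the toy coefficient values, the deterministic initialisation from the first `k` vectors, and all names are assumptions for illustration.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)), key=lambda k: squared_distance(vector, centroids[k]))

def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's algorithm; the first k vectors serve as initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        assignments = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    assignments = [nearest(v, centroids) for v in vectors]
    return centroids, assignments

# Six edges, each carrying a vector of 3 basis coefficients (toy values).
edge_coefficients = [
    [1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.1, 0.1, 0.0],
    [0.1, 0.9, 1.0], [0.9, 0.0, 0.1], [0.0, 1.1, 0.9],
]
codebook, indices = kmeans(edge_coefficients, k=2)

# Each edge now looks up its coefficients through a small integer index.
reconstructed = [codebook[i] for i in indices]
original_floats = sum(len(v) for v in edge_coefficients)  # 18
stored_floats = sum(len(c) for c in codebook)             # 6, plus one index per edge
```

Because every index points to a whole coefficient vector rather than a single scalar, the cost of storing the index is spread across all coefficients on that edge, which is the amortization the abstract refers to.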

Comment: Model Compression and Efficiency: codebook-based weight sharing for KANs via meta-learner-induced clustering enables up to 80x parameter compression.