Personalized Daily Arxiv Papers 02/21/2025

[gpt-4o]	Prompt	Completion	Total
Token	60143	8255	68398
Cost	$0.15	$0.08	$0.23

Total ArXiv papers: 524

Total scanned papers: 325

Total relevant papers: 46

Table of contents with paper titles:

OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes Authors: Félix Therrien, Jamal Abou Haibeh, Divya Sharma, Rhiannon Hendley, Alex Hernández-García, Sun Sun, Alain Tchagang, Jiang Su, Samuel Huberman, Yoshua Bengio, Hongyu Guo, Homin Shin
Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models Authors: Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun
Zero loss guarantees and explicit minimizers for generic overparametrized Deep Learning networks Authors: Thomas Chen, Andrew G. Moore
Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing Authors: Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu
Weighted Low-rank Approximation via Stochastic Gradient Descent on Manifolds Authors: Conglong Xu, Peiqi Yang, Hao Wu
Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling Authors: Eric Egli, Matteo Manica, Jannis Born
Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
Which Attention Heads Matter for In-Context Learning? Authors: Kayo Yin, Jacob Steinhardt
Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression Authors: Xuheng Li, Quanquan Gu
Fundamental Limitations in Defending LLM Finetuning APIs Authors: Xander Davies, Eric Winsor, Tomek Korbak, Alexandra Souly, Robert Kirk, Christian Schroeder de Witt, Yarin Gal
Towards a Learning Theory of Representation Alignment Authors: Francesco Insulla, Shuo Huang, Lorenzo Rosasco
Towards Efficient Automatic Self-Pruning of Large Language Models Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
PEARL: Towards Permutation-Resilient LLMs Authors: Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong
MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures Authors: Jiayu Qin, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Wei Wang
Ray-Tracing for Conditionally Activated Neural Networks Authors: Claudio Gallicchio, Giuseppe Nuti
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression Authors: Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs Authors: Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
Dynamic Low-Rank Sparse Adaptation for Large Language Models Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Yang Liu, Jing Lin, Yiwu Yao, Rongrong Ji
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Authors: Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
Confidence Estimation via Sequential Likelihood Mixing Authors: Johannes Kirschner, Andreas Krause, Michele Meziu, Mojmir Mutny
Generalization Error of $f$-Divergence Stabilized Algorithms via Duality Authors: Francisco Daunas, Iñaki Esnaola, Samir M. Perlaza, Gholamali Aminian
A Theory for Conditional Generative Modeling on Multiple Data Sources Authors: Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu
A Non-Asymptotic Theory of Seminorm Lyapunov Stability: From Deterministic to Stochastic Iterative Algorithms Authors: Zaiwei Chen, Sheng Zhang, Zhe Zhang, Shaan Ul Haque, Siva Theja Maguluri
Generalization Certificates for Adversarially Robust Bayesian Linear Regression Authors: Mahalakshmi Sabanayagam, Russell Tsuchida, Cheng Soon Ong, Debarghya Ghoshdastidar
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling Authors: Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
seqKAN: Sequence processing with Kolmogorov-Arnold Networks Authors: Tatiana Boura, Stasinos Konstantopoulos
Data-Efficient Pretraining with Group-Level Data Influence Modeling Authors: Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong
Multi-Faceted Studies on Data Poisoning can Advance LLM Development Authors: Pengfei He, Yue Xing, Han Xu, Zhen Xiang, Jiliang Tang
Prompt-to-Leaderboard Authors: Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica
CER: Confidence Enhanced Reasoning in LLMs Authors: Ali Razghandi, Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah
Affinity and Diversity: A Unified Metric for Demonstration Selection via Internal Representations Authors: Mariko Kato, Hakaze Cho, Yoshihiro Sakai, Naoya Inoue
Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models Authors: Yanggan Gu, Junzhuo Li, Sirui Huang, Xin Zou, Zhenghua Li, Xuming Hu
PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models Authors: Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang
From RAG to Memory: Non-Parametric Continual Learning for Large Language Models Authors: Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su
Disentangled Latent Spaces for Reduced Order Models using Deterministic Autoencoders Authors: Henning Schwarz, Pyei Phyo Lin, Jens-Peter M. Zemke, Thomas Rung
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information Authors: Yein Park, Chanwoong Yoon, Jungwoo Park, Minbyul Jeong, Jaewoo Kang
Reward Models Identify Consistency, Not Causality Authors: Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li
HPS: Hard Preference Sampling for Human Preference Alignment Authors: Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou
ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification Authors: Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack
Rectified Lagrangian for Out-of-Distribution Detection in Modern Hopfield Networks Authors: Ryo Moriai, Nakamasa Inoue, Masayuki Tanaka, Rei Kawakami, Satoshi Ikehata, Ikuro Sato
EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts Authors: Subhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan, Georgios Kollias, Elliot Nelson, Khushbu Pahwa, Tejaswini Pedapati, Igor Melnyk, Matthew Riemer
Temporal Misalignment and Probabilistic Neurons Authors: Velibor Bojković, Xiaofeng Wu, Bin Gu
Dynamic Activation with Knowledge Distillation for Energy-Efficient Spiking NN Ensembles Authors: Orestis Konstantaropoulos, Theodoris Mallios, Maria Papadopouli
On Theoretical Limits of Learning with Label Differential Privacy Authors: Puning Zhao, Chuan Ma, Li Shen, Shaowei Wang, Rongfei Fan
General Uncertainty Estimation with Delta Variances Authors: Simon Schmitt, John Shawe-Taylor, Hado van Hasselt
Revealing and Mitigating Over-Attention in Knowledge Editing Authors: Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang

1. OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes

ArXiv ID: 2502.14234

Authors: Félix Therrien, Jamal Abou Haibeh, Divya Sharma, Rhiannon Hendley, Alex Hernández-García, Sun Sun, Alain Tchagang, Jiang Su, Samuel Huberman, Yoshua Bengio, Hongyu Guo, Homin Shin

Abstract: Solid-state electrolyte batteries are expected to replace liquid electrolyte lithium-ion batteries in the near future thanks to their higher theoretical energy density and improved safety. However, their adoption is currently hindered by their lower effective ionic conductivity, a quantity that governs charge and discharge rates. Identifying highly ion-conductive materials using conventional theoretical calculations and experimental validation is both time-consuming and resource-intensive. While machine learning holds the promise to expedite this process, relevant ionic conductivity and structural data are scarce. Here, we present OBELiX, a domain-expert-curated database of ~600 synthesized solid electrolyte materials and their experimentally measured room temperature ionic conductivities gathered from literature. Each material is described by its measured composition, space group and lattice parameters. A full-crystal description in the form of a crystallographic information file (CIF) is provided for ~320 structures for which atomic positions were available. We discuss various statistics and features of the dataset and provide training and testing splits that avoid data leakage. Finally, we benchmark seven existing ML models on the task of predicting ionic conductivity and discuss their performance. The goal of this work is to facilitate the use of machine learning for solid-state electrolyte materials discovery.

Comment: Author match

2. Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

ArXiv ID: 2502.14819

Authors: Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun

Abstract: A long-standing goal in AI is to build agents that can solve a variety of tasks across different environments, including previously unseen ones. Two dominant approaches tackle this challenge: (i) reinforcement learning (RL), which learns policies through trial and error, and (ii) optimal control, which plans actions using a learned or known dynamics model. However, their relative strengths and weaknesses remain underexplored in the setting where agents must learn from offline trajectories without reward annotations. In this work, we systematically analyze the performance of different RL and control-based methods under datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot approaches. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and use it for planning. We study how dataset properties, such as data diversity, trajectory quality, and environment variability, affect the performance of these approaches. Our results show that model-free RL excels when abundant, high-quality data is available, while model-based planning excels in generalization to novel environment layouts, trajectory stitching, and data-efficiency. Notably, planning with a latent dynamics model emerges as a promising approach for zero-shot generalization from suboptimal data.

Comment: Author match

3. Zero loss guarantees and explicit minimizers for generic overparametrized Deep Learning networks

ArXiv ID: 2502.14114

Authors: Thomas Chen, Andrew G. Moore

Abstract: We determine sufficient conditions for overparametrized deep learning (DL) networks to guarantee the attainability of zero loss in the context of supervised learning, for the $\mathcal{L}^2$ cost and generic training data. We present an explicit construction of the zero loss minimizers without invoking gradient descent. On the other hand, we point out that an increase of depth can deteriorate the efficiency of cost minimization using a gradient descent algorithm by analyzing the conditions for rank loss of the training Jacobian. Our results clarify key aspects of the dichotomy between zero loss reachability in underparametrized versus overparametrized DL.

Comment: The paper provides theoretical insights into overparameterized deep learning networks, focusing on zero loss guarantees and training dynamics, which aligns with the Representation Learning criterion.

Relevance: 9 Novelty: 8

4. Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing

ArXiv ID: 2502.14458

Authors: Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu

Abstract: We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.

Comment: The paper presents a recurrent language model architecture optimized for efficiency, which aligns with the Model Architecture and Model Compression criteria.

Relevance: 9 Novelty: 8

5. Weighted Low-rank Approximation via Stochastic Gradient Descent on Manifolds

ArXiv ID: 2502.14174

Authors: Conglong Xu, Peiqi Yang, Hao Wu

Abstract: We solve a regularized weighted low-rank approximation problem by a stochastic gradient descent on a manifold. To guarantee the convergence of our stochastic gradient descent, we establish a convergence theorem on manifolds for retraction-based stochastic gradient descents admitting confinements. On sample data from the Netflix Prize training dataset, our algorithm outperforms the existing stochastic gradient descent on Euclidean spaces. We also compare the accelerated line search on this manifold to the existing accelerated line search on Euclidean spaces.

Comment: The paper addresses weighted low-rank approximation using stochastic gradient descent on manifolds, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

6. Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling

ArXiv ID: 2502.14553

Authors: Eric Egli, Matteo Manica, Jannis Born

Abstract: Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of $5$M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q&A tasks and find that, despite serializing images and the absence of an encoder, an MBLM with pure next-token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: https://github.com/ai4sd/multiscale-byte-lm

Comment: The paper introduces a hierarchical architecture for byte-level sequence modeling, which aligns with foundational research in model architecture and efficiency.

Relevance: 9 Novelty: 8

7. Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective

ArXiv ID: 2502.14770

Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji

Abstract: In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of "reconstruction error explosion" in existing LLM sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively.

Comment: This paper addresses layer-wise sparsity in LLMs, providing a theoretical perspective and a novel sparsity allocation method. It directly aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
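
Illustration (not from the paper): the abstract describes layer-wise sparsity rates that follow a monotonically increasing arithmetic progression controlled by a single common difference. A minimal sketch of that idea follows. The function name `layerwise_sparsity`, the choice to centre the progression so that its mean equals the target sparsity, and the assumption of equally sized layers are all hypothetical and may differ from the authors' formulation.

```python
def layerwise_sparsity(num_layers, target, diff):
    """Return per-layer sparsity rates as an increasing arithmetic progression.

    Assumption (not from the paper): the progression is centred so that
    the mean over layers equals `target`, treating all layers as equal in size.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if diff < 0:
        raise ValueError("diff must be non-negative for an increasing progression")
    # First term chosen so that the average of the progression is `target`.
    first = target - diff * (num_layers - 1) / 2
    rates = [first + i * diff for i in range(num_layers)]
    if rates[0] < 0.0 or rates[-1] > 1.0:
        raise ValueError("diff is too large: rates fall outside [0, 1]")
    return rates


if __name__ == "__main__":
    # Hypothetical example: 4 layers, 70% overall sparsity, common difference 0.02.
    for layer, rate in enumerate(layerwise_sparsity(4, 0.7, 0.02)):
        print(f"layer {layer}: sparsity {rate:.2f}")
```

Under these assumptions, tuning reduces to trying a handful of values for `diff`, which matches the abstract's claim that only a few trials are needed.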

8. Which Attention Heads Matter for In-Context Learning?

ArXiv ID: 2502.14010

Authors: Kayo Yin, Jacob Steinhardt

Abstract: Large language models (LLMs) exhibit impressive in-context learning (ICL) capability, enabling them to perform new tasks using only a few demonstrations in the prompt. Two different mechanisms have been proposed to explain ICL: induction heads that find and copy relevant tokens, and function vector (FV) heads whose activations compute a latent encoding of the ICL task. To better understand which of the two distinct mechanisms drives ICL, we study and compare induction heads and FV heads in 12 language models. Through detailed ablations, we discover that few-shot ICL performance depends primarily on FV heads, especially in larger models. In addition, we uncover that FV and induction heads are connected: many FV heads start as induction heads during training before transitioning to the FV mechanism. This leads us to speculate that induction facilitates learning the more complex FV mechanism that ultimately drives ICL.

Comment: This paper investigates the mechanisms behind in-context learning in LLMs, focusing on the role of specific attention heads. It provides theoretical insights into LLM behavior and training dynamics.

Relevance: 9 Novelty: 8

9. Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression

ArXiv ID: 2502.14123

Authors: Xuheng Li, Quanquan Gu

Abstract: Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.

Comment: The paper provides theoretical insights into the effectiveness of Exponential Moving Average (EMA) in SGD, which aligns with the training dynamics in neural networks under representation learning.

Relevance: 9 Novelty: 8
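
Illustration (not from the paper): the setting in the abstract, online SGD on linear regression with an exponential moving average of the iterates, can be sketched in a few lines of NumPy. The function name `sgd_with_ema`, the zero initialisation, and the default step size and decay are assumptions chosen for readability and do not reproduce the paper's analysis or hyperparameters.

```python
import numpy as np


def sgd_with_ema(X, y, lr=0.05, decay=0.99):
    """One pass of online SGD on squared loss, tracking an EMA of the iterates.

    Returns the final SGD iterate and the EMA iterate.
    """
    d = X.shape[1]
    w = np.zeros(d)       # SGD iterate
    w_ema = np.zeros(d)   # exponential moving average of the iterates
    for x_t, y_t in zip(X, y):
        grad = (x_t @ w - y_t) * x_t   # gradient of 0.5 * (x_t . w - y_t)^2
        w = w - lr * grad
        w_ema = decay * w_ema + (1.0 - decay) * w
    return w, w_ema


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=5)
    X = rng.normal(size=(5000, 5))
    y = X @ w_true + 0.1 * rng.normal(size=5000)   # noisy labels
    w, w_ema = sgd_with_ema(X, y)
    print("SGD error:", np.linalg.norm(w - w_true))
    print("EMA error:", np.linalg.norm(w_ema - w_true))
```

A decay close to 1 averages over a longer window of recent iterates, while a decay of 0 recovers plain SGD.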

10. Fundamental Limitations in Defending LLM Finetuning APIs

ArXiv ID: 2502.14828

Authors: Xander Davies, Eric Winsor, Tomek Korbak, Alexandra Souly, Robert Kirk, Christian Schroeder de Witt, Yarin Gal

Abstract: LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.

Comment: The paper discusses fundamental limitations in defending LLM fine-tuning APIs, providing theoretical insights into LLM security and robustness, which aligns with foundational research in LLM behavior.

Relevance: 9 Novelty: 8

11. Towards a Learning Theory of Representation Alignment

ArXiv ID: 2502.14047

Authors: Francesco Insulla, Shuo Huang, Lorenzo Rosasco

Abstract: It has recently been argued that AI models' representations are becoming aligned as their scale and performance increase. Empirical analyses have been designed to support this idea and conjecture the possible alignment of different representations toward a shared statistical model of reality. In this paper, we propose a learning-theoretic perspective to representation alignment. First, we review and connect different notions of alignment based on metric, probabilistic, and spectral ideas. Then, we focus on stitching, a particular approach to understanding the interplay between different representations in the context of a task. Our main contribution here is relating properties of stitching to the kernel alignment of the underlying representation. Our results can be seen as a first step toward casting representation alignment as a learning-theoretic problem.

Comment: The paper provides a learning-theoretic perspective on representation alignment, which aligns closely with the 'Representation Learning' criterion, particularly in understanding how representations are encoded and aligned.

Relevance: 9 Novelty: 8

12. Towards Efficient Automatic Self-Pruning of Large Language Models

ArXiv ID: 2502.14413

Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji

Abstract: Despite exceptional capabilities, Large Language Models (LLMs) still face deployment challenges due to their enormous size. Post-training structured pruning is a promising solution that prunes LLMs without the need for retraining, reducing computational overhead, and it is hardware-deployment friendly. However, the training-free nature of post-training structured pruning leads to significant performance degradation. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. Meanwhile, we find that LLMs may have prior knowledge about their own redundancy. Based on this insight, we introduce Self-Pruner, an end-to-end automatic self-pruning framework for LLMs, which efficiently searches for layer-wise pruning rates. Specifically, Self-Pruner leverages LLMs to autonomously execute the entire evolutionary search process to search for pruning rate configurations. In this process, LLMs are used to generate populations, select parent solutions from the current population, and perform crossover and mutation operations to produce offspring solutions. In this way, LLMs automatically generate and evaluate a large number of candidate solutions, effectively converging to find the pruning rate configurations with minimal human intervention. Extensive experiments demonstrate Self-Pruner's better performance compared to existing state-of-the-art methods. Notably, Self-Pruner prunes LLaMA-2-70B to 49B level with only 0.80% drop in accuracy across seven commonsense reasoning tasks, achieving a 1.39$\times$ speedup on NVIDIA A100 80GB GPU. Further pruning to 35B level resulted in only a 3.80% decrease in accuracy while obtaining a 1.70$\times$ speedup.

Comment: The paper introduces an automatic self-pruning framework for LLMs, which aligns closely with the 'Model Compression' criterion, particularly in pruning and efficiency improvements.

Relevance: 9 Novelty: 8

13. PEARL: Towards Permutation-Resilient LLMs

ArXiv ID: 2502.14628

Authors: Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong

Abstract: The in-context learning (ICL) capability of large language models (LLMs) enables them to perform challenging tasks using provided demonstrations. However, ICL is highly sensitive to the ordering of demonstrations, leading to instability in predictions. This paper shows that this vulnerability can be exploited to design a natural attack - difficult for model providers to detect - that achieves nearly 80% success rate on LLaMA-3 by simply permuting the demonstrations. Existing mitigation methods primarily rely on post-processing and fail to enhance the model's inherent robustness to input permutations, raising concerns about safety and reliability of LLMs. To address this issue, we propose Permutation-resilient learning (PEARL), a novel framework based on distributionally robust optimization (DRO), which optimizes model performance against the worst-case input permutation. Specifically, PEARL consists of a permutation-proposal network (P-Net) and the LLM. The P-Net generates the most challenging permutations by treating it as an optimal transport problem, which is solved using an entropy-constrained Sinkhorn algorithm. Through minimax optimization, the P-Net and the LLM iteratively optimize against each other, progressively improving the LLM's robustness. Experiments on synthetic pre-training and real-world instruction tuning tasks demonstrate that PEARL effectively mitigates permutation attacks and enhances performance. Notably, despite being trained on fewer shots and shorter contexts, PEARL achieves performance gains of up to 40% when scaled to many-shot and long-context scenarios, highlighting its efficiency and generalization capabilities.

Comment: The paper introduces PEARL, a novel framework for improving LLM robustness to input permutations using distributionally robust optimization. This aligns with foundational research in LLM behavior and training dynamics.

Relevance: 9 Novelty: 8
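
Illustration (not from the paper): the abstract says the P-Net casts permutation generation as an optimal transport problem solved with an entropy-constrained Sinkhorn algorithm. The sketch below shows only the generic Sinkhorn normalisation step, which turns a square score matrix into an approximately doubly stochastic "soft permutation". The function names, the temperature parameter, and the iteration count are assumptions; the paper's P-Net, its scoring function, and its entropy constraint are not reproduced here.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis (dimension kept)."""
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))


def sinkhorn(scores, temperature=1.0, n_iters=100):
    """Alternately normalise rows and columns in log space.

    `scores` is a square matrix; a lower temperature yields a result
    closer to a hard permutation matrix.
    """
    log_p = np.asarray(scores, dtype=float) / temperature
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)   # rows sum to 1
        log_p = log_p - _logsumexp(log_p, axis=0)   # columns sum to 1
    return np.exp(log_p)


if __name__ == "__main__":
    # Hypothetical scores: entry (i, j) rates placing demonstration i at position j.
    scores = np.array([[0.0, 1.0, 2.0],
                       [1.0, 0.0, 1.0],
                       [2.0, 1.0, 0.0]])
    print(np.round(sinkhorn(scores, temperature=0.5), 3))
```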

14. MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures

ArXiv ID: 2502.14008

Authors: Jiayu Qin, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Wei Wang

Abstract: The remarkable performance of large language models (LLMs) in various language tasks has attracted considerable attention. However, the ever-increasing size of these models presents growing challenges for deployment and inference. Structured pruning, an effective model compression technique, is gaining increasing attention due to its ability to enhance inference efficiency. Nevertheless, most previous optimization-based structured pruning methods sacrifice the uniform structure across layers for greater flexibility to maintain performance. The heterogeneous structure hinders the effective utilization of off-the-shelf inference acceleration techniques and impedes efficient configuration for continued training. To address this issue, we propose a novel masking learning paradigm based on minimax optimization to obtain the uniform pruned structure by optimizing the masks under sparsity regularization. Extensive experimental results demonstrate that our method can maintain high performance while ensuring the uniformity of the pruned model structure, thereby outperforming existing SOTA methods.

Comment: The MaskPrune method introduces a novel structured pruning approach for LLMs, focusing on uniformity across layers, which is highly relevant to model compression.

Relevance: 9 Novelty: 8

15. Ray-Tracing for Conditionally Activated Neural Networks

ArXiv ID: 2502.14788

Authors: Claudio Gallicchio, Giuseppe Nuti

Abstract: In this paper, we introduce a novel architecture for conditionally activated neural networks combining a hierarchical construction of multiple Mixture of Experts (MoEs) layers with a sampling mechanism that progressively converges to an optimized configuration of expert activation. This methodology enables the dynamic unfolding of the network's architecture, facilitating efficient path-specific training. Experimental results demonstrate that this approach achieves competitive accuracy compared to conventional baselines while significantly reducing the parameter count required for inference. Notably, this parameter reduction correlates with the complexity of the input patterns, a property naturally emerging from the network's operational dynamics without necessitating explicit auxiliary penalty functions.

Comment: This paper introduces a novel hierarchical Mixture of Experts (MoE) architecture with dynamic activation, which is highly relevant to model architecture innovations and efficiency improvements.

Relevance: 9 Novelty: 8

16. RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

ArXiv ID: 2502.14051

Authors: Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov

Abstract: Transformer-based Large Language Models rely critically on KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce both memory bandwidth and capacity demand of KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain KV cache eviction on the input sequence tokens with SnapKV++, a method improved upon SnapKV by introducing adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensional reductions. Combining these two stages, RocketKV achieves significant KV cache fetching bandwidth and storage savings while maintaining comparable accuracy to full KV cache attention. We show that RocketKV provides end-to-end speedup by up to 3$\times$ as well as peak memory reduction by up to 31% in the decode phase on an NVIDIA H100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks.

Comment: The paper introduces a two-stage KV cache compression strategy for LLMs, which is highly relevant to model compression and efficiency improvements in large language models.

Relevance: 9 Novelty: 8
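
The second stage described in the RocketKV abstract (top-k sparse attention over the KV cache) can be illustrated with a small sketch. This is not the authors' implementation: it selects the top-k entries using exact query-key scores, whereas the paper approximates the scores, and the function names and toy data are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values, indices=None):
    """Single-query attention restricted to the given cache positions."""
    if indices is None:
        indices = range(len(keys))
    indices = list(indices)
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in indices]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]

def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k cache entries with the largest query-key scores.

    Assumption: exact scores are used for selection (a simplification).
    """
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    top = sorted(top)  # keep cache order for readability
    return attention(query, keys, values, top), top

# Toy KV cache with six entries of dimension two (made-up numbers).
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [3.0, 1.0], [-1.0, 0.0], [0.5, 0.5], [-2.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [3.0, 1.0]]

full_out = attention(query, keys, values)
sparse_out, selected = topk_sparse_attention(query, keys, values, k=2)
all_out, _ = topk_sparse_attention(query, keys, values, k=len(keys))
```

Only the selected entries need to be fetched from the cache, which is where the bandwidth saving comes from; with `k` equal to the cache length the result matches full attention.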

17. Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

ArXiv ID: 2502.14837

Authors: Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui

Abstract: Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.

Comment: The paper proposes MHA2MLA, a data-efficient fine-tuning method for converting MHA-based LLMs to DeepSeek's Multi-head Latent Attention (MLA), which aligns with the Model Compression criterion due to its focus on KV cache compression and efficiency.

Relevance: 9 Novelty: 8

18. Dynamic Low-Rank Sparse Adaptation for Large Language Models

ArXiv ID: 2502.14816

Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Yang Liu, Jing Lin, Yiwu Yao, Rongrong Ji

Abstract: Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it incurs significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune the sparse LLMs offers an intuitive approach to counter this predicament, but it has two shortcomings: 1) the inability to integrate LoRA weights into sparse LLMs post-training, and 2) insufficient performance recovery at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing the inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. Besides, LoSA leverages Representation Mutual Information (RMI) as an indicator to determine the importance of layers, thereby efficiently determining the layer-wise sparsity rates during fine-tuning. Predicated on this, LoSA adjusts the rank of the LoRA module based on the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning for each layer to reduce the output discrepancies between dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by 68.73 and increased zero-shot accuracy by 16.32%, achieving a 2.60$\times$ speedup on CPU and 2.23$\times$ speedup on GPU, requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.

Comment: Presents a novel method for integrating low-rank adaptation with sparsity in LLMs, addressing efficiency and performance degradation. This aligns closely with model compression and sparsity criteria.

Relevance: 9 Novelty: 8
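
The first shortcoming named in the LoSA abstract, that a plain LoRA update cannot be merged into a sparse weight without destroying its sparsity, can be seen in a few lines. The sketch below is illustrative only: magnitude pruning, the helper names, and the toy numbers are assumptions, and it shows only the masking idea, not the RMI-based sparsity allocation or rank adjustment.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def magnitude_mask(weight, sparsity):
    """Return a 0/1 mask that drops the smallest-magnitude fraction of weights.

    Assumption: magnitude pruning stands in for whatever pruning method is used.
    """
    flat = sorted(abs(w) for row in weight for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [[1 for _ in row] for row in weight]
    threshold = flat[n_prune - 1]
    return [[1 if abs(w) > threshold else 0 for w in row] for row in weight]

def merge(weight, lora_b, lora_a, mask=None):
    """Merge the low-rank update B @ A into the weight, optionally masking it."""
    delta = matmul(lora_b, lora_a)
    merged = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(weight, delta)]
    if mask is not None:
        merged = [[m * v for m, v in zip(mr, vr)] for mr, vr in zip(mask, merged)]
    return merged

def count_nonzero(matrix):
    return sum(1 for row in matrix for v in row if v != 0)

# Toy 2x3 weight and a rank-1 LoRA update (made-up numbers).
weight = [[0.9, -0.1, 0.5], [0.05, -0.8, 0.2]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.1, 0.2, 0.3]]

mask = magnitude_mask(weight, sparsity=0.5)
sparse_weight = [[m * w for m, w in zip(mr, wr)] for mr, wr in zip(mask, weight)]

naive_merge = merge(sparse_weight, lora_b, lora_a)          # becomes dense again
masked_merge = merge(sparse_weight, lora_b, lora_a, mask)   # keeps the sparsity pattern
```

Masking the low-rank product with the same pattern as the pruned weight keeps the merged matrix at the original sparsity, so no extra adapter computation is needed at inference time.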

19. LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

ArXiv ID: 2502.14866

Authors: Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han

Abstract: Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.

Comment: Proposes a unified sparse attention framework for efficient LLM serving, addressing both computational and memory efficiency. This aligns well with model compression and sparsity criteria.

Relevance: 9 Novelty: 8

20. Confidence Estimation via Sequential Likelihood Mixing

ArXiv ID: 2502.14689

Authors: Johannes Kirschner, Andreas Krause, Michele Meziu, Mojmir Mutny

Abstract: We present a universal framework for constructing confidence sets based on sequential likelihood mixing. Building upon classical results from sequential analysis, we provide a unifying perspective on several recent lines of work, and establish fundamental connections between sequential mixing, Bayesian inference and regret inequalities from online estimation. The framework applies to any realizable family of likelihood functions and allows for non-i.i.d. data and anytime validity. Moreover, the framework seamlessly integrates standard approximate inference techniques, such as variational inference and sampling-based methods, and extends to misspecified model classes, while preserving provable coverage guarantees. We illustrate the power of the framework by deriving tighter confidence sequences for classical settings, including sequential linear regression and sparse estimation, with simplified proofs.

Comment: The paper provides a unified framework for constructing anytime-valid confidence sets via sequential likelihood mixing, a theoretical contribution that aligns with the Emerging Trends criterion.

Relevance: 8 Novelty: 8

21. Generalization Error of $f$-Divergence Stabilized Algorithms via Duality

ArXiv ID: 2502.14544

Authors: Francisco Daunas, Iñaki Esnaola, Samir M. Perlaza, Gholamali Aminian

Abstract: The solution to empirical risk minimization with $f$-divergence regularization (ERM-$f$DR) is extended to constrained optimization problems, establishing conditions for equivalence between the solution and constraints. A dual formulation of ERM-$f$DR is introduced, providing a computationally efficient method to derive the normalization function of the ERM-$f$DR solution. This dual approach leverages the Legendre-Fenchel transform and the implicit function theorem, enabling explicit characterizations of the generalization error for general algorithms under mild conditions, and another for ERM-$f$DR solutions.

Comment: The paper explores generalization error with $f$-divergence regularization, providing theoretical insights into optimization, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 8

22. A Theory for Conditional Generative Modeling on Multiple Data Sources

ArXiv ID: 2502.14583

Authors: Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu

Abstract: The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes the first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation based on the bracketing number. Our result shows that when source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that the number of sources and similarity among source distributions improve the advantage of multi-source training. Simulations and real-world experiments validate our theory. Code is available at: https://github.com/ML-GSAI/Multi-Source-GM.

Comment: The theoretical analysis of multi-source training in conditional generative modeling provides foundational insights into generative model training dynamics.

Relevance: 8 Novelty: 8

23. A Non-Asymptotic Theory of Seminorm Lyapunov Stability: From Deterministic to Stochastic Iterative Algorithms

ArXiv ID: 2502.14208

Authors: Zaiwei Chen, Sheng Zhang, Zhe Zhang, Shaan Ul Haque, Siva Theja Maguluri

Abstract: We study the problem of solving fixed-point equations for seminorm-contractive operators and establish foundational results on the non-asymptotic behavior of iterative algorithms in both deterministic and stochastic settings. Specifically, in the deterministic setting, we prove a fixed-point theorem for seminorm-contractive operators, showing that iterates converge geometrically to the kernel of the seminorm. In the stochastic setting, we analyze the corresponding stochastic approximation (SA) algorithm under seminorm-contractive operators and Markovian noise, providing a finite-sample analysis for various stepsize choices. A benchmark for equation solving is linear systems of equations, where the convergence behavior of fixed-point iteration is closely tied to the stability of linear dynamical systems. In this special case, our results provide a complete characterization of system stability with respect to a seminorm, linking it to the solution of a Lyapunov equation in terms of positive semi-definite matrices. In the stochastic setting, we establish a finite-sample analysis for linear Markovian SA without requiring the Hurwitzness assumption. Our theoretical results offer a unified framework for deriving finite-sample bounds for various reinforcement learning algorithms in the average reward setting, including TD($\lambda$) for policy evaluation (which is a special case of solving a Poisson equation) and Q-learning for control.

Comment: The paper provides a theoretical framework for seminorm-contractive operators and iterative algorithms, which aligns with the Emerging Trends criterion due to its foundational theoretical contributions.

Relevance: 8 Novelty: 8
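
A standard textbook instance of the seminorm-contraction idea in the abstract above may help: for a row-stochastic matrix P with overlapping rows, the map x -> Px contracts the span seminorm (max minus min), whose kernel is the set of constant vectors. The sketch below is not taken from the paper; the matrix, the starting point, and the function names are hypothetical, and the contraction factor used is the classical Dobrushin ergodic coefficient.

```python
def span(x):
    """Span seminorm: zero exactly on constant vectors."""
    return max(x) - min(x)

def apply(P, x):
    """Apply the linear operator x -> P x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]

def dobrushin(P):
    """Dobrushin coefficient: span(P x) <= dobrushin(P) * span(x)."""
    n = len(P)
    overlap = min(sum(min(a, b) for a, b in zip(P[i], P[k]))
                  for i in range(n) for k in range(n))
    return 1.0 - overlap

# Hypothetical row-stochastic matrix (each row sums to 1).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

gamma = dobrushin(P)

# Fixed-point iteration x_{k+1} = P x_k, tracking the seminorm of each iterate.
x = [1.0, 0.0, -2.0]
spans = [span(x)]
for _ in range(20):
    x = apply(P, x)
    spans.append(span(x))
```

The recorded spans shrink by at least the factor `gamma` per step, so the iterates approach the constant vectors geometrically even though P itself is not a contraction in any norm (it has eigenvalue 1).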

24. Generalization Certificates for Adversarially Robust Bayesian Linear Regression

ArXiv ID: 2502.14298

Authors: Mahalakshmi Sabanayagam, Russell Tsuchida, Cheng Soon Ong, Debarghya Ghoshdastidar

Abstract: Adversarial robustness of machine learning models is critical to ensuring reliable performance under data perturbations. Recent progress has been on point estimators, and this paper considers distributional predictors. First, using the link between exponential families and Bregman divergences, we formulate an adversarial Bregman divergence loss as an adversarial negative log-likelihood. Using the geometric properties of Bregman divergences, we compute the adversarial perturbation for such models in closed-form. Second, under such losses, we introduce adversarially robust posteriors, by exploiting the optimization-centric view of generalized Bayesian inference. Third, we derive the first rigorous generalization certificates in the context of an adversarial extension of Bayesian linear regression by leveraging the PAC-Bayesian framework. Finally, experiments on real and synthetic datasets demonstrate the superior robustness of the derived adversarially robust posterior over Bayes posterior, and also validate our theoretical guarantees.

Comment: The paper introduces adversarially robust Bayesian linear regression and provides theoretical guarantees, aligning with the Emerging Trends criterion due to its foundational contributions to robustness in machine learning.

Relevance: 8 Novelty: 8

25. FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

ArXiv ID: 2502.14856

Authors: Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun

Abstract: Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2.

Comment: The paper proposes a speculative sampling framework for LLMs, focusing on efficiency improvements, which aligns with the Model Compression criterion.