Personalized Daily Arxiv Papers 02/25/2025
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 92705 | 14000 | 106705 |
| Cost | $0.23 | $0.14 | $0.37 |
Total ArXiv papers: 1195
Total scanned papers: 745
Total relevant papers: 76
Table of contents with paper titles:
-
Forgotten Polygons: Multimodal Large Language Models are Shape-Blind Authors: William Rudman, Michal Golovanesky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh
-
Fractal Generative Models Authors: Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He
-
Compression Scaling Laws:Unifying Sparsity and Quantization Authors: Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, Dan Alistarh
-
DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance Authors: Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
-
Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression Authors: Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen
-
BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference Authors: Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, Cheng Li
-
Delta Decompression for MoE-based LLMs Compression Authors: Hao Gu, Wei Li, Lujun Li, Qiyuan Zhu, Mark Lee, Shengjie Sun, Wei Xue, Yike Guo
-
Compression Barriers for Autoregressive Transformers Authors: Themistoklis Haris, Krzysztof Onak
-
A General Error-Theoretical Analysis Framework for Constructing Compression Strategies Authors: Boyang Zhang, Daning Cheng, Yunquan Zhang, Meiqi Tu, Fangmin Liu, Jiake Tian
-
Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks Authors: Andrei Chernov
-
Distributional Scaling Laws for Emergent Capabilities Authors: Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra
-
Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization Authors: Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu
-
Reasoning with Latent Thoughts: On the Power of Looped Transformers Authors: Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, Sashank J. Reddi
-
Forecasting Rare Language Model Behaviors Authors: Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma
-
Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation Authors: Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, Shiv Saini
-
UniDyG: A Unified and Effective Representation Learning Approach for Large Dynamic Graphs Authors: Yuanyuan Xu, Wenjie Zhang, Xuemin Lin, Ying Zhang
-
Toward a Flexible Framework for Linear Representation Hypothesis Using Maximum Likelihood Estimation Authors: Trung Nguyen, Yan Leng
-
Linear Attention for Efficient Bidirectional Sequence Modeling Authors: Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, Volkan Cevher
-
DISC: Dynamic Decomposition Improves LLM Inference Scaling Authors: Jonathan Light, Wei Cheng, Wu Yue, Masafumi Oyamada, Mengdi Wang, Santiago Paternain, Haifeng Chen
-
An explainable transformer circuit for compositional generalization Authors: Cheng Tang, Brenden Lake, Mehrdad Jazayeri
-
Neural Attention: A Novel Mechanism for Enhanced Expressive Power in Transformer Models Authors: Andrew DiGiugno, Ausif Mahmood
-
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam Authors: Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
-
LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification Authors: Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An
-
Exact Recovery of Sparse Binary Vectors from Generalized Linear Measurements Authors: Arya Mazumdar, Neha Sangwan
-
When Can We Solve the Weighted Low Rank Approximation Problem in Truly Subquadratic Time? Authors: Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song
-
Signal Collapse in One-Shot Pruning: When Sparse Models Fail to Distinguish Neural Representations Authors: Dhananjay Saikumar, Blesson Varghese
-
Muon is Scalable for LLM Training Authors: Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang
-
The Role of Sparsity for Length Generalization in Transformers Authors: Noah Golowich, Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
-
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs Authors: Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
-
Function-Space Learning Rates Authors: Edward Milsom, Ben Anson, Laurence Aitchison
-
Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems Authors: Maksim Zhdanov, Max Welling, Jan-Willem van de Meent
-
Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer Authors: Euntae Choi, Sumin Song, Woosang Lim, Sungjoo Yoo
-
Entropy-Lens: The Information Signature of Transformer Computations Authors: Riccardo Ali, Francesco Caso, Christopher Irwin, Pietro Li`o
-
Sequence-level Large Language Model Training with Contrastive Preference Optimization Authors: Zhili Feng, Dhananjay Ram, Cole Hawkins, Aditya Rawal, Jinman Zhao, Sheng Zha
-
A Gap Between the Gaussian RKHS and Neural Networks: An Infinite-Center Asymptotic Analysis Authors: Akash Kumar, Rahul Parhi, Mikhail Belkin
-
Low-rank bias, weight decay, and model merging in neural networks Authors: Ilja Kuzborskij, Yasin Abbasi Yadkori
-
Geometric Kolmogorov-Arnold Superposition Theorem Authors: Francesco Alesiani, Takashi Maruyama, Henrik Christiansen, Viktor Zaverkin
-
The Empirical Impact of Reducing Symmetries on the Performance of Deep Ensembles and MoE Authors: Andrei Chernov, Oleg Novitskij
-
Sparsity May Be All You Need: Sparse Random Parameter Adaptation Authors: Jesus Rios, Pierre Dognin, Ronny Luss, Karthikeyan N. Ramamurthy
-
Pruning as a Defense: Reducing Memorization in Large Language Models Authors: Mansi Gupta, Nikhar Waghela, Sarthak Gupta, Shourya Goel, Sanjif Shanmugavelu
-
Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability Authors: Ashhadul Islam, Samir Brahim Belhaouari, Amine Bermak
-
Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks Authors: Yuhang Cai, Kangjie Zhou, Jingfeng Wu, Song Mei, Michael Lindsey, Peter L. Bartlett
-
Learning to Reason from Feedback at Test-Time Authors: Yanyang Li, Michael Lyu, Liwei Wang
-
Category-free Out-of-Distribution Node Detection with Feature Resonance Authors: Shenzhi Yang, Junbo Zhao, Shouqing Yang, Yixuan Li, Dingyu Yang, Xiaofang Zhang, Haobo Wang
-
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence Authors: Tom Wollschl\"ager, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan G\"unnemann, Johannes Gasteiger
-
Since Faithfulness Fails: The Performance Limits of Neural Causal Discovery Authors: Mateusz Olko, Mateusz Gajewski, Joanna Wojciechowska, Miko{\l}aj Morzy, Piotr Sankowski, Piotr Mi{\l}o\'s
-
When to Forget? Complexity Trade-offs in Machine Unlearning Authors: Martin Van Waerebeke, Marco Lorenzi, Giovanni Neglia, Kevin Scaman
-
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter Authors: Yepeng Weng, Dianwen Mei, Huishi Qiu, Xujie Chen, Li Liu, Jiang Tian, Zhongchao Shi
-
Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models Authors: Taj Jones-McCormick, Aukosh Jagannath, Subhabrata Sen
-
Dynamic Parallel Tree Search for Efficient LLM Reasoning Authors: Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao
-
Brain-Model Evaluations Need the NeuroAI Turing Test Authors: Jenelle Feather, Meenakshi Khosla, N. Apurva Ratan Murty, Aran Nayebi
-
A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models Authors: Mengyang Sun, Yihao Wang, Tao Feng, Dan Zhang, Yifan Zhu, Jie Tang
-
R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression Authors: Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu
-
Understanding the Emergence of Multimodal Representation Alignment Authors: Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, Paul Pu Liang
-
Verifying Quantized Graph Neural Networks is PSPACE-complete Authors: Marco S\"alzer, Fran\c{c}ois Schwarzentruber, Nicolas Troquard
-
Subsampling Graphs with GNN Performance Guarantees Authors: Mika Sarkin Jain, Stefanie Jegelka, Ishani Karmarkar, Luana Ruiz, Ellen Vitercik
-
Verification of Bit-Flip Attacks against Quantized Neural Networks Authors: Yedi Zhang, Lei Huang, Pengfei Gao, Fu Song, Jun Sun, Jin Song Dong
-
Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng
-
Hierarchical Residuals Exploit Brain-Inspired Compositionality Authors: Francisco M. L\'opez, Jochen Triesch
-
To Share or Not to Share: Investigating Weight Sharing in Variational Graph Autoencoders Authors: Guillaume Salha-Galvan, Jiaying Xu
-
Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models Authors: Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Bykov Mikhail, Evgeny Burnaev
-
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing Authors: Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
-
Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow Authors: Behrooz Azarkhalili, Maxwell Libbrecht
-
Quantifying Logical Consistency in Transformers via Query-Key Alignment Authors: Eduard Tulchinskii, Anastasia Voznyuk, Laida Kushnareva, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
-
Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations Authors: Chunyang Li, Weiqi Wang, Tianshi Zheng, Yangqiu Song
-
CoME: An Unlearning-based Approach to Conflict-free Model Editing Authors: Dahyun Jung, Jaehyung Seo, Jaewook Lee, Chanjun Park, Heuiseok Lim
-
Towards Understanding Gradient Flow Dynamics of Homogeneous Neural Networks Beyond the Origin Authors: Akshay Kumar, Jarvis Haupt
-
FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference Authors: Bingzhe Zhao, Ke Cheng, Aomufei Yuan, Yuxuan Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu
-
CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought Authors: Boxuan Zhang, Ruqi Zhang
-
Graph Self-Supervised Learning with Learnable Structural and Positional Encodings Authors: Asiri Wijesinghe, Hao Zhu, Piotr Koniusz
-
MaxSup: Overcoming Representation Collapse in Label Smoothing Authors: Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Mario Fritz, Margret Keuper
-
Subspace Recovery in Winsorized PCA: Insights into Accuracy and Robustness Authors: Sangil Han, Kyoowon Kim, Sungkyu Jung
-
NeurFlow: Interpreting Neural Networks through Neuron Groups and Functional Interactions Authors: Tue M. Cao, Nhat X. Hoang, Hieu H. Pham, Phi Le Nguyen, My T. Thai
-
Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology Authors: Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, Hua Wei
-
PLS-based approach for fair representation learning Authors: Elena M. De-Diego, Adri\'an Perez-Suay, Paula Gordaliza, Jean-Michel Loubes
-
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations Authors: Md Saidul Hoque Anik, Ariful Azad
1. Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
ArXiv ID: 2502.15969
Authors: William Rudman, Michal Golovanesky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh
Abstract: Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of sides nor effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.
Comment: Author match
2. Fractal Generative Models
ArXiv ID: 2502.17437
Authors: Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He
Abstract: Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.
Comment: Author match
3. Compression Scaling Laws:Unifying Sparsity and Quantization
ArXiv ID: 2502.16440
Authors: Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, Dan Alistarh
Abstract: We investigate how different compression techniques -- such as weight and activation quantization, and weight sparsity -- affect the scaling behavior of large language models (LLMs) during pretraining. Building on previous work showing that weight sparsity acts as a constant multiplier on model size in scaling laws, we demonstrate that this "effective parameter" scaling pattern extends to quantization as well. Specifically, we establish that weight-only quantization achieves strong parameter efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bitwidths. Our results suggest that different compression techniques can be unified under a common scaling law framework, enabling principled comparison and combination of these methods.
Comment: The paper investigates compression scaling laws, unifying sparsity and quantization under a common framework, which directly aligns with model compression and provides theoretical insights.
Relevance: 10 Novelty: 9
4. DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance
ArXiv ID: 2502.16886
Authors: Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
Abstract: To alleviate memory burden during inference of large language models (LLMs), numerous studies have focused on compressing the KV cache by exploring aspects such as attention sparsity. However, these techniques often require a pre-defined cache budget; as the optimal budget varies with different input lengths and task types, it limits their practical deployment accepting open-domain instructions. To address this limitation, we propose a new KV cache compression objective: to always ensure the full-cache performance regardless of specific inputs, while maximizing KV cache pruning as much as possible. To achieve this goal, we introduce a novel KV cache compression method dubbed DBudgetKV, which features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance, then halting the pruning process. Empirical evaluation spanning diverse context lengths, task types, and model sizes suggests that our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average. Furthermore, our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
Comment: The paper proposes a novel KV cache compression method for LLMs, which directly aligns with model compression and efficiency breakthroughs.
Relevance: 10 Novelty: 8
5. Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
ArXiv ID: 2502.16638
Authors: Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen
Abstract: Structured pruning and quantization are fundamental techniques used to reduce the size of deep neural networks (DNNs) and typically are applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNNs. GETA introduces three key innovations: (i) a quantization-aware dependency graph (QADG) that constructs a pruning search space for generic quantization-aware DNN, (ii) a partially projected stochastic gradient method that guarantees layerwise bit constraints are satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures that show that our approach achieves competitive (often superior) performance compared to existing joint pruning and quantization methods.
Comment: The paper introduces a framework for joint structured pruning and quantization, which aligns with foundational research in model compression.
Relevance: 10 Novelty: 8
6. BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference
ArXiv ID: 2502.16927
Authors: Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, Cheng Li
Abstract: The Mixture-of-Experts (MoE) structure scales the Transformer-based large language models (LLMs) and improves their performance with only the sub-linear increase in computation resources. Recently, a fine-grained DeepSeekMoE structure is proposed, which can further improve the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically involves and activates more experts, hence contributing to heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but with high communication efficiency. The innovation of BigMac is mainly due to that we abandon the \textbf{c}ommunicate-\textbf{d}escend-\textbf{a}scend-\textbf{c}ommunicate (CDAC) manner used by fine-grained MoE, which leads to the All-to-All communication always taking place at the highest dimension. Instead, BigMac designs an efficient \textbf{d}escend-\textbf{c}ommunicate-\textbf{c}ommunicate-\textbf{a}scend (DCCA) manner. Specifically, we add a descending and ascending projection at the entrance and exit of the expert, respectively, which enables the communication to perform at a very low dimension. Furthermore, to adapt to DCCA, we re-design the structure of small experts, ensuring that the expert in BigMac has enough complexity to address tokens. Experimental results show that BigMac achieves comparable or even better model quality than fine-grained MoEs with the same number of experts and a similar number of total parameters. Equally importantly, BigMac reduces the end-to-end latency by up to 3.09$\times$ for training and increases the throughput by up to 3.11$\times$ for inference on state-of-the-art AI computing frameworks including Megatron, Tutel, and DeepSpeed-Inference.
Comment: BigMac introduces a communication-efficient MoE structure, directly aligning with architectural innovations in MoE and efficiency improvements.
Relevance: 10 Novelty: 8
7. Delta Decompression for MoE-based LLMs Compression
ArXiv ID: 2502.17298
Authors: Hao Gu, Wei Li, Lujun Li, Qiyuan Zhu, Mark Lee, Shengjie Sun, Wei Xue, Yike Guo
Abstract: Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present $D^2$-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our $D^2$-MoE successfully compact MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13% performance gains than other compressors on Mixtral|Phi-3.5|DeepSeek|Qwen2 MoE LLMs at 40$\sim$60% compression rates. Codes are available in https://github.com/lliai/D2MoE.
Comment: The paper focuses on a novel compression method for MoE-based LLMs, aligning with the 'Model Compression' criterion. It introduces delta decompression and low-rank SVD techniques, which are foundational contributions.
Relevance: 10 Novelty: 8
8. Compression Barriers for Autoregressive Transformers
ArXiv ID: 2502.15955
Authors: Themistoklis Haris, Krzysztof Onak
Abstract: A key limitation of autoregressive Transformers is the large memory needed at inference-time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache, but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d = \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Linderstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(d\cdot e^d)$ space and prove, using tight bounds on covering numbers, that SubGen, proposed by Zandieh, Han, Mirrokni and Karbasi, matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation's time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.
Comment: The paper provides theoretical insights into the compression barriers for autoregressive Transformers, directly addressing model compression and efficiency.
Relevance: 9 Novelty: 9
9. A General Error-Theoretical Analysis Framework for Constructing Compression Strategies
ArXiv ID: 2502.15802
Authors: Boyang Zhang, Daning Cheng, Yunquan Zhang, Meiqi Tu, Fangmin Liu, Jiake Tian
Abstract: The exponential growth in parameter size and computational complexity of deep models poses significant challenges for efficient deployment. The core problem of existing compression methods is that different layers of the model have significant differences in their tolerance to compression levels. For instance, the first layer of a model can typically sustain a higher compression level compared to the last layer without compromising performance. Thus, the key challenge lies in how to allocate compression levels across layers in a way that minimizes performance loss while maximizing parameter reduction. To address this challenge, we propose a Compression Error Theory (CET) framework, designed to determine the optimal compression level for each layer. Taking quantization as an example, CET leverages differential expansion and algebraic geometry to reconstruct the quadratic form of quantization error as ellipsoids and hyperbolic paraboloids, and utilizes their geometric structures to define an error subspace. To identify the error subspace with minimal performance loss, by performing orthogonal decomposition of the geometric space, CET transforms the optimization process of the error subspace into a complementary problem. The final theoretical analysis shows that constructing the quantization subspace along the major axis results in minimal performance degradation. Through experimental verification of the theory, CET can greatly retain performance while compressing. Specifically, on the ResNet-34 model, CET achieves nearly 11$\times$ parameter compression while even surpassing performance comparable to the original model.
Comment: The paper introduces a theoretical framework for constructing compression strategies, which aligns with foundational research in model compression and efficiency.
Relevance: 9 Novelty: 9
10. Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks
ArXiv ID: 2502.17187
Authors: Andrei Chernov
Abstract: Recently, Large Language Models (LLMs) with Mixture of Experts (MoE) layers have gained significant attention. Currently, state-of-the-art LLMs utilize this architecture. There is a substantial amount of research on how to train such models and how to select hyperparameters for this architecture. However, there is a lack of studies focusing on post-evaluation analysis of MoE layer properties. In this paper, we take a first step toward closing this gap by evaluating expert contributions on the quiz-based MMLU benchmark. We show that most experts were never activated during inference on this benchmark. Additionally, the output distribution of gating networks is much closer to uniform than sparse. Finally, we demonstrate that the average performance of some experts within the same layer varies significantly.
Comment: The paper evaluates MoE LLMs, focusing on expert contributions and gating network behavior, which directly aligns with the model architecture topic, particularly MoE analysis.
Relevance: 10 Novelty: 7
11. Distributional Scaling Laws for Emergent Capabilities
ArXiv ID: 2502.17356
Authors: Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra
Abstract: In this paper, we explore the nature of sudden breakthroughs in language model performance at scale, which stands in contrast to smooth improvements governed by scaling laws. While advocates of "emergence" view abrupt performance gains as capabilities unlocking at specific scales, others have suggested that they are produced by thresholding effects and alleviated by continuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes, particularly when performance is bimodally distributed across random seeds. In synthetic length generalization tasks, we show that different random seeds can produce either highly linear or emergent scaling trends. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. Furthermore, we provide a case study of inverse scaling and show that even as the probability of a successful run declines, the average performance of a successful run continues to increase monotonically. We validate our distributional scaling framework on realistic settings by measuring MMLU performance in LLM populations. These insights emphasize the role of random variation in the effect of scale on LLM capabilities.
Comment: The paper explores emergent capabilities in LLMs and provides theoretical insights into scaling laws and random seed effects, aligning with the 'Large Language Models' criterion.
Relevance: 9 Novelty: 8
12. Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization
ArXiv ID: 2502.17024
Authors: Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu
Abstract: Large language models (LLMs) have demonstrated remarkable in-context learning (ICL) abilities. However, existing theoretical analysis of ICL primarily exhibits two limitations: (a) Limited i.i.d. Setting. Most studies focus on supervised function learning tasks where prompts are constructed with i.i.d. input-label pairs. This i.i.d. assumption diverges significantly from real language learning scenarios where prompt tokens are interdependent. (b) Lack of Emergence Explanation. Most literature answers what ICL does from an implicit optimization perspective but falls short in elucidating how ICL emerges and the impact of pre-training phase on ICL. In our paper, to extend (a), we adopt a more practical paradigm, auto-regressive next-token prediction (AR-NTP), which closely aligns with the actual training of language models. Specifically, within AR-NTP, we emphasize prompt token-dependency, which involves predicting each subsequent token based on the preceding sequence. To address (b), we formalize a systematic pre-training and ICL framework, highlighting the layer-wise structure of sequences and topics, alongside a two-level expectation. In conclusion, we present data-dependent, topic-dependent and optimization-dependent PAC-Bayesian generalization bounds for pre-trained LLMs, investigating that ICL emerges from the generalization of sequences and topics. Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
Comment: The paper provides theoretical insights into in-context learning and generalization in LLMs, aligning with the 'Large Language Models' criterion.
Relevance: 9 Novelty: 8
13. Reasoning with Latent Thoughts: On the Power of Looped Transformers
ArXiv ID: 2502.17416
Authors: Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, Sashank J. Reddi
Abstract: Large language models have shown remarkable reasoning abilities and scaling laws suggest that large parameter count, especially along the depth axis, is the primary driver. In this work, we make a stronger claim -- many reasoning problems require a large depth but not necessarily many parameters. This unlocks a novel application of looped models for reasoning. Firstly, we show that for many synthetic reasoning problems like addition, $p$-hop induction, and math problems, a $k$-layer transformer looped $L$ times nearly matches the performance of a $kL$-layer non-looped model, and is significantly better than a $k$-layer model. This is further corroborated by theoretical results showing that many such reasoning problems can be solved via iterative algorithms, and thus, can be solved effectively using looped models with nearly optimal depth. Perhaps surprisingly, these benefits also translate to practical settings of language modeling -- on many downstream reasoning tasks, a language model with $k$-layers looped $L$ times can be competitive to, if not better than, a $kL$-layer language model. In fact, our empirical analysis reveals an intriguing phenomenon: looped and non-looped models exhibit scaling behavior that depends on their effective depth, akin to the inference-time scaling of chain-of-thought (CoT) reasoning. We further elucidate the connection to CoT reasoning by proving that looped models implicitly generate latent thoughts and can simulate $T$ steps of CoT with $T$ loops. Inspired by these findings, we also present an interesting dichotomy between reasoning and memorization, and design a looping-based regularization that is effective on both fronts.
Comment: The paper introduces looped transformers for reasoning tasks and connects them to CoT reasoning, aligning with 'Model Architecture' and 'Large Language Models' criteria.
Relevance: 9 Novelty: 8
14. Forecasting Rare Language Model Behaviors
ArXiv ID: 2502.16797
Authors: Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma
Abstract: Standard language model evaluations can fail to capture risks that emerge only at deployment scale. For example, a model may produce safe responses during a small-scale beta test, yet reveal dangerous information when processing billions of requests at deployment. To remedy this, we introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation. We make forecasts by studying each query's elicitation probability -- the probability the query produces a target behavior -- and demonstrate that the largest observed elicitation probabilities predictably scale with the number of queries. We find that our forecasts can predict the emergence of diverse undesirable behaviors -- such as assisting users with dangerous chemical synthesis or taking power-seeking actions -- across up to three orders of magnitude of query volume. Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments.
Comment: The paper introduces a method to forecast rare LLM behaviors, which provides theoretical insights into LLM behavior and interpretability.
Relevance: 9 Novelty: 8
15. Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation
ArXiv ID: 2502.15734
Authors: Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, Shiv Saini
Abstract: Retrieval-Augmented Generation (RAG) is often used with Large Language Models (LLMs) to infuse domain knowledge or user-specific information. In RAG, given a user query, a retriever extracts chunks of relevant text from a knowledge base. These chunks are sent to an LLM as part of the input prompt. Typically, any given chunk is repeatedly retrieved across user questions. However, currently, for every question, attention-layers in LLMs fully compute the key values (KVs) repeatedly for the input chunks, as state-of-the-art methods cannot reuse KV-caches when chunks appear at arbitrary locations with arbitrary contexts. Naive reuse leads to output quality degradation. This leads to potentially redundant computations on expensive GPUs and increases latency. In this work, we propose Cache-Craft, a system for managing and reusing precomputed KVs corresponding to the text chunks (we call chunk-caches) in RAG-based systems. We present how to identify chunk-caches that are reusable, how to efficiently perform a small fraction of recomputation to fix the cache to maintain output quality, and how to efficiently store and evict chunk-caches in the hardware for maximizing reuse while masking any overheads. With real production workloads as well as synthetic datasets, we show that Cache-Craft reduces redundant computation by 51% over SOTA prefix-caching and 75% over full recomputation. Additionally, with continuous batching on a real production workload, we get a 1.6X speed up in throughput and a 2X reduction in end-to-end response latency over prefix-caching while maintaining quality, for both the LLaMA-3-8B and LLaMA-3-70B models.
Comment: The paper proposes Cache-Craft, a system for managing and reusing KV caches in RAG-based systems, which aligns with the model compression criterion by addressing efficiency and computational redundancy.
Relevance: 9 Novelty: 8
16. UniDyG: A Unified and Effective Representation Learning Approach for Large Dynamic Graphs
ArXiv ID: 2502.16431
Authors: Yuanyuan Xu, Wenjie Zhang, Xuemin Lin, Ying Zhang
Abstract: Dynamic graphs are formulated in continuous-time or discrete-time dynamic graphs. They differ in temporal granularity: Continuous-Time Dynamic Graphs (CTDGs) exhibit rapid, localized changes, while Discrete-Time Dynamic Graphs (DTDGs) show gradual, global updates. This difference leads to isolated developments in representation learning for each type. To advance representation learning, recent research attempts to design a unified model capable of handling both CTDGs and DTDGs. However, it typically focuses on local dynamic propagation for temporal structure learning in the time domain, failing to accurately capture the structural evolution associated with each temporal granularity. In addition, existing works-whether specific or unified-often overlook the issue of temporal noise, compromising the model robustness and effectiveness. To better model both types of dynamic graphs, we propose UniDyG, a unified and effective representation learning approach, which scales to large dynamic graphs. We first propose a novel Fourier Graph Attention (FGAT) mechanism that can model local and global structural correlations based on recent neighbors and complex-number selective aggregation, while theoretically ensuring consistent representations of dynamic graphs over time. Based on approximation theory, we demonstrate that FGAT is well-suited to capture the underlying structures in CTDGs and DTDGs. We further enhance FGAT to resist temporal noise by designing an energy-gated unit, which adaptively filters out high-frequency noise according to the energy. Last, we leverage our FGAT mechanisms for temporal structure learning and employ the frequency-enhanced linear function for node-level dynamic updates, facilitating the generation of high-quality temporal embeddings. Extensive experiments show that our UniDyG achieves an average improvement of 14.4% over sixteen baselines across nine dynamic graphs.
Comment: The paper proposes UniDyG, a unified representation learning approach for dynamic graphs, which aligns with representation learning and introduces a novel Fourier Graph Attention mechanism.
Relevance: 9 Novelty: 8
17. Toward a Flexible Framework for Linear Representation Hypothesis Using Maximum Likelihood Estimation
ArXiv ID: 2502.16385
Authors: Trung Nguyen, Yan Leng
Abstract: Linear representation hypothesis posits that high-level concepts are encoded as linear directions in the representation spaces of LLMs. Park et al. (2024) formalize this notion by unifying multiple interpretations of linear representation, such as 1-dimensional subspace representation and interventions, using a causal inner product. However, their framework relies on single-token counterfactual pairs and cannot handle ambiguous contrasting pairs, limiting its applicability to complex or context-dependent concepts. We introduce a new notion of binary concepts as unit vectors in a canonical representation space, and utilize LLMs' (neural) activation differences along with maximum likelihood estimation (MLE) to compute concept directions (i.e., steering vectors). Our method, Sum of Activation-base Normalized Difference (SAND), formalizes the use of activation differences modeled as samples from a von Mises-Fisher (vMF) distribution, providing a principled approach to derive concept directions. We extend the applicability of Park et al. (2024) by eliminating the dependency on unembedding representations and single-token pairs. Through experiments with LLaMA models across diverse concepts and benchmarks, we demonstrate that our lightweight approach offers greater flexibility, superior performance in activation engineering tasks like monitoring and manipulation.
Comment: The paper introduces a flexible framework for linear representation hypothesis using maximum likelihood estimation, which aligns with representation learning and provides a principled approach to concept directions.
Relevance: 9 Novelty: 8
18. Linear Attention for Efficient Bidirectional Sequence Modeling
ArXiv ID: 2502.16249
Authors: Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, Volkan Cevher
Abstract: Transformers with linear attention enable fast and parallel training. Moreover, they can be formulated as Recurrent Neural Networks (RNNs), for efficient linear-time inference. While extensively evaluated in causal sequence modeling, they have yet to be extended to the bidirectional setting. This work introduces the LION framework, establishing new theoretical foundations for linear transformers in bidirectional sequence modeling. LION constructs a bidirectional RNN equivalent to full Linear Attention. This extends the benefits of linear transformers: parallel training, and efficient inference, into the bidirectional setting. Using LION, we cast three linear transformers to their bidirectional form: LION-LIT, the bidirectional variant corresponding to (Katharopoulos et al., 2020); LION-D, extending RetNet (Sun et al., 2023); and LION-S, a linear transformer with a stable selective mask inspired by selectivity of SSMs (Dao & Gu, 2024). Replacing the attention block with LION (-LIT, -D, -S) achieves performance on bidirectional tasks that approaches that of Transformers and State-Space Models (SSMs), while delivering significant improvements in training speed. Our implementation is available in http://github.com/LIONS-EPFL/LION.
Comment: The paper introduces LION, a framework for linear attention in bidirectional sequence modeling, which aligns with model architecture innovations and provides theoretical foundations for efficient transformers.
Relevance: 9 Novelty: 8
19. DISC: Dynamic Decomposition Improves LLM Inference Scaling
ArXiv ID: 2502.16706
Authors: Jonathan Light, Wei Cheng, Wu Yue, Masafumi Oyamada, Mengdi Wang, Santiago Paternain, Haifeng Chen
Abstract: Many inference scaling methods work by breaking a problem into smaller steps (or groups of tokens), then sampling and choosing the best next step. However, these steps and their sizes are usually predetermined based on human intuition or domain knowledge. This paper introduces dynamic decomposition, a method that automatically and adaptively splits solution and reasoning traces into steps during inference. This approach improves computational efficiency by focusing more resources on difficult steps, breaking them down further and prioritizing their sampling. Experiments on coding and math benchmarks (APPS, MATH, and LiveCodeBench) show that dynamic decomposition performs better than static methods, which rely on fixed steps like token-level, sentence-level, or single-step decompositions. These results suggest that dynamic decomposition can enhance many inference scaling techniques.
Comment: The paper introduces a novel method for dynamic decomposition in LLM inference, which aligns with foundational research in efficiency and scaling techniques for LLMs.
Relevance: 9 Novelty: 8
20. An explainable transformer circuit for compositional generalization
ArXiv ID: 2502.15801
Authors: Cheng Tang, Brenden Lake, Mehrdad Jazayeri
Abstract: Compositional generalization-the systematic combination of known components into novel structures-remains a core challenge in cognitive science and machine learning. Although transformer-based large language models can exhibit strong performance on certain compositional tasks, the underlying mechanisms driving these abilities remain opaque, calling into question their interpretability. In this work, we identify and mechanistically interpret the circuit responsible for compositional induction in a compact transformer. Using causal ablations, we validate the circuit and formalize its operation using a program-like description. We further demonstrate that this mechanistic understanding enables precise activation edits to steer the model's behavior predictably. Our findings advance the understanding of complex behaviors in transformers and highlight such insights can provide a direct pathway for model control.
Comment: The paper provides mechanistic insights into compositional generalization in transformers, which aligns with understanding and interpretability of model architectures.
Relevance: 9 Novelty: 8
21. Neural Attention: A Novel Mechanism for Enhanced Expressive Power in Transformer Models
ArXiv ID: 2502.17206
Authors: Andrew DiGiugno, Ausif Mahmood
Abstract: Transformer models typically calculate attention matrices using dot products, which have limitations when capturing nonlinear relationships between embedding vectors. We propose Neural Attention, a technique that replaces dot products with feed-forward networks, enabling a more expressive representation of relationships between tokens. This approach modifies only the attention matrix calculation while preserving the matrix dimensions, making it easily adaptable to existing transformer-based architectures. We provide a detailed mathematical justification for why Neural Attention increases representational capacity and conduct controlled experiments to validate this claim. When comparing Neural Attention and Dot-Product Attention, NLP experiments on WikiText-103 show a reduction in perplexity of over 5 percent. Similarly, experiments on CIFAR-10 and CIFAR-100 show comparable improvements for image classification tasks. While Neural Attention introduces higher computational demands, we develop techniques to mitigate these challenges, ensuring practical usability without sacrificing the increased expressivity it provides. This work establishes Neural Attention as an effective means of enhancing the predictive capabilities of transformer models across a variety of applications.
Comment: The paper proposes Neural Attention as an enhancement to transformer models, which aligns with architectural innovations in transformers.
Relevance: 9 Novelty: 8
22. Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
ArXiv ID: 2502.17055
Authors: Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
Abstract: This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical $l_2$-norm statistics; and $(3)$ inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to $2$ perplexity. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
Comment: The paper focuses on 4-bit training stability and introduces Stable-SPAM, which aligns with model compression and efficiency breakthroughs.
Relevance: 9 Novelty: 8
23. LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification
ArXiv ID: 2502.17421
Authors: Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An
Abstract: Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models (LLMs). Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges: the increasing memory demands of the draft model, the distribution shift between the short-training corpora and long-context inference, and inefficiencies in attention implementation. In this work, we enhance the performance of speculative decoding in long-context settings by addressing these challenges. First, we propose a memory-efficient draft model with a constant-sized Key-Value (KV) cache. Second, we introduce novel position indices for short-training data, enabling seamless adaptation from short-context training to long-context inference. Finally, we present an innovative attention aggregation method that combines fast implementations for prefix computation with standard attention for tree mask handling, effectively resolving the latency and memory inefficiencies of tree decoding. Our approach achieves strong results on various long-context tasks, including repository-level code completion, long-context summarization, and o1-like long reasoning tasks, demonstrating significant improvements in latency reduction. The code is available at https://github.com/sail-sg/LongSpec.
Comment: The paper introduces a memory-efficient draft model with a constant-sized KV cache and novel attention methods, which aligns with the model compression and efficiency criteria.
Relevance: 9 Novelty: 8
24. Exact Recovery of Sparse Binary Vectors from Generalized Linear Measurements
ArXiv ID: 2502.16008
Authors: Arya Mazumdar, Neha Sangwan
Abstract: We consider the problem of exact recovery of a $k$-sparse binary vector from generalized linear measurements (such as logistic regression). We analyze the linear estimation algorithm (Plan, Vershynin, Yudovina, 2017), and also show information theoretic lower bounds on the number of required measurements. As a consequence of our results, for noisy one bit quantized linear measurements ($\mathsf{1bCSbinary}$), we obtain a sample complexity of $O((k+\sigma^2)\log{n})$, where $\sigma^2$ is the noise variance. This is shown to be optimal due to the information theoretic lower bound. We also obtain tight sample complexity characterization for logistic regression. Since $\mathsf{1bCSbinary}$ is a strictly harder problem than noisy linear measurements ($\mathsf{SparseLinearReg}$) because of added quantization, the same sample complexity is achievable for $\mathsf{SparseLinearReg}$. While this sample complexity can be obtained via the popular lasso algorithm, linear estimation is computationally more efficient. Our lower bound holds for any set of measurements for $\mathsf{SparseLinearReg}$, (similar bound was known for Gaussian measurement matrices) and is closely matched by the maximum-likelihood upper bound. For $\mathsf{SparseLinearReg}$, it was conjectured in Gamarnik and Zadik, 2017 that there is a statistical-computational gap and the number of measurements should be at least $(2k+\sigma^2)\log{n}$ for efficient algorithms to exist. It is worth noting that our results imply that there is no such statistical-computational gap for $\mathsf{1bCSbinary}$ and logistic regression.
Comment: The paper addresses sparse binary vector recovery with theoretical guarantees, which aligns with the sparsity and efficiency criteria in model compression.
Relevance: 9 Novelty: 8
25. When Can We Solve the Weighted Low Rank Approximation Problem in Truly Subquadratic Time?
ArXiv ID: 2502.16912
Authors: Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Abstract: The weighted low-rank approximation problem is a fundamental numerical linear algebra problem and has many applications in machine learning. Given a $n \times n$ weight matrix $W$ and a $n \times n$ matrix $A$, the goal is to find two low-rank matrices $U, V \in \mathbb{R}^{n \times k}$ such that the cost of $| W \circ (U V^\top - A) |_F^2$ is minimized. Previous work has to pay $\Omega(n^2)$ time when matrices $A$ and $W$ are dense, e.g., having $\Omega(n^2)$ non-zero entries. In this work, we show that there is a certain regime, even if $A$ and $W$ are dense, we can still hope to solve the weighted low-rank approximation problem in almost linear $n^{1+o(1)}$ time.
Comment: The paper focuses on weighted low-rank approximation, which aligns with the model compression topic, particularly low-rank approaches.
Relevance: 9 Novelty: 8
26. Signal Collapse in One-Shot Pruning: When Sparse Models Fail to Distinguish Neural Representations
ArXiv ID: 2502.15790
Authors: Dhananjay Saikumar, Blesson Varghese
Abstract: Neural network pruning is essential for reducing model complexity to enable deployment on resource constrained hardware. While performance loss of pruned networks is often attributed to the removal of critical parameters, we identify signal collapse a reduction in activation variance across layers as the root cause. Existing one shot pruning methods focus on weight selection strategies and rely on computationally expensive second order approximations. In contrast, we demonstrate that mitigating signal collapse, rather than optimizing weight selection, is key to improving accuracy of pruned networks. We propose REFLOW that addresses signal collapse without updating trainable weights, revealing high quality sparse sub networks within the original parameter space. REFLOW enables magnitude pruning to achieve state of the art performance, restoring ResNeXt101 accuracy from under 4.1% to 78.9% on ImageNet with only 20% of the weights retained, surpassing state of the art approaches.
Comment: The paper identifies signal collapse in one-shot pruning and proposes a novel method to address it, aligning with model compression and sparsity topics.
Relevance: 9 Novelty: 8
27. Muon is Scalable for LLM Training
ArXiv ID: 2502.16982
Authors: Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang
Abstract: Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need of hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim!2\times$ computational efficiency compared to AdamW with compute optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
Comment: The paper introduces Muon, a scalable optimizer for LLM training, and demonstrates its application in training a Mixture-of-Experts (MoE) model. This aligns closely with the 'Model Architecture' and 'Model Compression' criteria due to its focus on MoE and computational efficiency.
Relevance: 9 Novelty: 8
28. The Role of Sparsity for Length Generalization in Transformers
ArXiv ID: 2502.16792
Authors: Noah Golowich, Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
Abstract: Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call $k$-sparse planted correlation distributions, and show that an idealized model of transformers which generalize attention heads successfully length-generalize on such tasks. As a bonus, our theoretical model justifies certain techniques to modify positional embeddings which have been introduced to improve length generalization, such as position coupling. We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is a ``sparse'' dependency structure of each token on the previous ones. Inspired by our theory, we introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which position coupling can successfully be applied to achieve length generalization.
Comment: The paper investigates the role of sparsity in length generalization for transformers, which aligns with 'Representation Learning' and 'Model Architecture' criteria due to its theoretical insights into transformer behavior.
Relevance: 9 Novelty: 8
29. Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
ArXiv ID: 2502.15938
Authors: Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
Abstract: LLMs are commonly trained with a learning rate (LR) warmup, followed by cosine decay to 10% of the maximum (10x decay). In a large-scale empirical study, we show that under an optimal peak LR, a simple linear decay-to-zero (D2Z) schedule consistently outperforms other schedules when training at compute-optimal dataset sizes. D2Z is superior across a range of model sizes, batch sizes, datasets, and vocabularies. Benefits increase as dataset size increases. Leveraging a novel interpretation of AdamW as an exponential moving average of weight updates, we show how linear D2Z optimally balances the demands of early training (moving away from initial conditions) and late training (averaging over more updates in order to mitigate gradient noise). In experiments, a 610M-parameter model trained for 80 tokens-per-parameter (TPP) using D2Z achieves lower loss than when trained for 200 TPP using 10x decay, corresponding to an astonishing 60% compute savings. Models such as Llama2-7B, trained for 286 TPP with 10x decay, could likely have saved a majority of compute by training with D2Z.
Comment: The paper demonstrates that a linear decay-to-zero learning rate schedule outperforms other schedules for LLM training, aligning with 'Large Language Models' and 'Training Dynamics' criteria.
Relevance: 9 Novelty: 8
30. Function-Space Learning Rates
ArXiv ID: 2502.17405
Authors: Edward Milsom, Ben Anson, Laurence Aitchison
Abstract: We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, requiring only minimal computational overhead through a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts parameter-space layerwise learning rates when training larger models to maintain consistent function-space updates. FLeRM gives hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in various architectures including MLPs with residual connections and transformers with different layer normalisation schemes.
Comment: The paper introduces a novel concept of function-space learning rates and proposes FLeRM, a method for hyperparameter transfer across model scales. This aligns with the Representation Learning and Model Architecture criteria, as it provides insights into training dynamics and scaling behavior.
Relevance: 9 Novelty: 8
31. Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems
ArXiv ID: 2502.17019
Authors: Maksim Zhdanov, Max Welling, Jan-Willem van de Meent
Abstract: Large-scale physical systems defined on irregular grids pose significant scalability challenges for deep learning methods, especially in the presence of long-range interactions and multi-scale coupling. Traditional approaches that compute all pairwise interactions, such as attention, become computationally prohibitive as they scale quadratically with the number of nodes. We present Erwin, a hierarchical transformer inspired by methods from computational many-body physics, which combines the efficiency of tree-based algorithms with the expressivity of attention mechanisms. Erwin employs ball tree partitioning to organize computation, which enables linear-time attention by processing nodes in parallel within local neighborhoods of fixed size. Through progressive coarsening and refinement of the ball tree structure, complemented by a novel cross-ball interaction mechanism, it captures both fine-grained local details and global features. We demonstrate Erwin's effectiveness across multiple domains, including cosmology, molecular dynamics, and particle fluid dynamics, where it consistently outperforms baseline methods both in accuracy and computational efficiency.
Comment: The paper presents Erwin, a hierarchical transformer for large-scale physical systems, combining tree-based algorithms with attention mechanisms. This aligns with the Model Architecture criterion, particularly in architectural innovations for scalability.
Relevance: 9 Novelty: 8
32. Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer
ArXiv ID: 2502.15779
Authors: Euntae Choi, Sumin Song, Woosang Lim, Sungjoo Yoo
Abstract: We propose Rotate, Clip, and Partition (RCP), a quantization-aware training (QAT) approach that first realizes extreme compression of LLMs with W2A4KV4(2-bit weight, 4-bit activation, and 4-bit KV cache) configuration. RCP integrates recent rotation techniques with a novel non-uniform weight quantizer design, by quantitatively analyzing the impact of random rotation on 2-bit weight quantization. Our weight quantizer features Learnable Direct Partitioning (LDP), which introduces learnable parameters to directly learn non-uniform intervals jointly with LLM weights. We also present a specialized GPU kernel that supports GEMV on non-uniform W2A4. Experiments show that RCP can compress LLaMA-2-7B to W2A4KV4 with a loss of only 2.84 WikiText2 ppl and 5.29 times reduced memory footprint. Furthermore, RCP can quantize challenging mobile-targeted LLaMA-3.2 models and domain-specific WizardCoder-7B and MetaMath-7B with no critical problems such as convergence failure and repetition. Code will be made available at blind_review.
Comment: The paper introduces RCP, a QAT approach for extreme compression of LLMs, including W2A4KV4 quantization. This aligns with the Model Compression criterion, particularly in advancing quantization techniques.
Relevance: 9 Novelty: 8
33. Entropy-Lens: The Information Signature of Transformer Computations
ArXiv ID: 2502.16570
Authors: Riccardo Ali, Francesco Caso, Christopher Irwin, Pietro Li`o
Abstract: Transformer models have revolutionized fields from natural language processing to computer vision, yet their internal computational dynamics remain poorly understood raising concerns about predictability and robustness. In this work, we introduce Entropy-Lens, a scalable, model-agnostic framework that leverages information theory to interpret frozen, off-the-shelf large-scale transformers. By quantifying the evolution of Shannon entropy within intermediate residual streams, our approach extracts computational signatures that distinguish model families, categorize task-specific prompts, and correlate with output accuracy. We further demonstrate the generality of our method by extending the analysis to vision transformers. Our results suggest that entropy-based metrics can serve as a principled tool for unveiling the inner workings of modern transformer architectures.
Comment: The paper introduces an entropy-based framework to analyze transformer computations, aligning with foundational research in understanding LLM behavior and interpretability.
Relevance: 9 Novelty: 8
34. Sequence-level Large Language Model Training with Contrastive Preference Optimization
ArXiv ID: 2502.16433
Authors: Zhili Feng, Dhananjay Ram, Cole Hawkins, Aditya Rawal, Jinman Zhao, Sheng Zha
Abstract: The next token prediction loss is the dominant self-supervised training objective for large language models and has achieved promising results in a variety of downstream tasks. However, upon closer investigation of this objective, we find that it lacks an understanding of sequence-level signals, leading to a mismatch between training and inference processes. To bridge this gap, we introduce a contrastive preference optimization (CPO) procedure that can inject sequence-level information into the language model at any training stage without expensive human labeled data. Our experiments show that the proposed objective surpasses the next token prediction in terms of win rate in the instruction-following and text generation tasks.
Comment: The paper introduces a contrastive preference optimization procedure for sequence-level LLM training, which aligns with foundational research in LLM training dynamics.
Relevance: 9 Novelty: 8
35. A Gap Between the Gaussian RKHS and Neural Networks: An Infinite-Center Asymptotic Analysis
ArXiv ID: 2502.16331
Authors: Akash Kumar, Rahul Parhi, Mikhail Belkin
Abstract: Recent works have characterized the function-space inductive bias of infinite-width bounded-norm single-hidden-layer neural networks as a kind of bounded-variation-type space. This novel neural network Banach space encompasses many classical multivariate function spaces including certain Sobolev spaces and the spectral Barron spaces. Notably, this Banach space also includes functions that exhibit less classical regularity such as those that only vary in a few directions. On bounded domains, it is well-established that the Gaussian reproducing kernel Hilbert space (RKHS) strictly embeds into this Banach space, demonstrating a clear gap between the Gaussian RKHS and the neural network Banach space. It turns out that when investigating these spaces on unbounded domains, e.g., all of $\mathbb{R}^d$, the story is fundamentally different. We establish the following fundamental result: Certain functions that lie in the Gaussian RKHS have infinite norm in the neural network Banach space. This provides a nontrivial gap between kernel methods and neural networks by the exhibition of functions in which kernel methods can do strictly better than neural networks.
Comment: The paper investigates the gap between Gaussian RKHS and neural networks, providing theoretical insights into function spaces. This aligns with 'Representation Learning' and foundational research.
Relevance: 9 Novelty: 8
36. Low-rank bias, weight decay, and model merging in neural networks
ArXiv ID: 2502.17340
Authors: Ilja Kuzborskij, Yasin Abbasi Yadkori
Abstract: We explore the low-rank structure of the weight matrices in neural networks originating from training with Gradient Descent (GD) and Gradient Flow (GF) with $L2$ regularization (also known as weight decay). We show several properties of GD-trained deep neural networks, induced by $L2$ regularization. In particular, for a stationary point of GD we show alignment of the parameters and the gradient, norm preservation across layers, and low-rank bias: properties previously known in the context of GF solutions. Experiments show that the assumptions made in the analysis only mildly affect the observations. In addition, we investigate a multitask learning phenomenon enabled by $L2$ regularization and low-rank bias. In particular, we show that if two networks are trained, such that the inputs in the training set of one network are approximately orthogonal to the inputs in the training set of the other network, the new network obtained by simply summing the weights of the two networks will perform as well on both training sets as the respective individual networks. We demonstrate this for shallow ReLU neural networks trained by GD, as well as deep linear and deep ReLU networks trained by GF.
Comment: The paper explores low-rank structures in neural networks induced by weight decay, which aligns with foundational research in model compression and efficiency.
Relevance: 9 Novelty: 8
37. Geometric Kolmogorov-Arnold Superposition Theorem
ArXiv ID: 2502.16664
Authors: Francesco Alesiani, Takashi Maruyama, Henrik Christiansen, Viktor Zaverkin
Abstract: The Kolmogorov-Arnold Theorem (KAT), or more generally, the Kolmogorov Superposition Theorem (KST), establishes that any non-linear multivariate function can be exactly represented as a finite superposition of non-linear univariate functions. Unlike the universal approximation theorem, which provides only an approximate representation without guaranteeing a fixed network size, KST offers a theoretically exact decomposition. The Kolmogorov-Arnold Network (KAN) was introduced as a trainable model to implement KAT, and recent advancements have adapted KAN using concepts from modern neural networks. However, KAN struggles to effectively model physical systems that require inherent equivariance or invariance to $E(3)$ transformations, a key property for many scientific and engineering applications. In this work, we propose a novel extension of KAT and KAN to incorporate equivariance and invariance over $O(n)$ group actions, enabling accurate and efficient modeling of these systems. Our approach provides a unified approach that bridges the gap between mathematical theory and practical architectures for physical systems, expanding the applicability of KAN to a broader class of problems.
Comment: The paper extends the Kolmogorov-Arnold Superposition Theorem to incorporate equivariance and invariance, which is a significant theoretical contribution relevant to model architecture.
Relevance: 8 Novelty: 9
38. The Empirical Impact of Reducing Symmetries on the Performance of Deep Ensembles and MoE
ArXiv ID: 2502.17391
Authors: Andrei Chernov, Oleg Novitskij
Abstract: Recent studies have shown that reducing symmetries in neural networks enhances linear mode connectivity between networks without requiring parameter space alignment, leading to improved performance in linearly interpolated neural networks. However, in practical applications, neural network interpolation is rarely used; instead, ensembles of networks are more common. In this paper, we empirically investigate the impact of reducing symmetries on the performance of deep ensembles and Mixture of Experts (MoE) across five datasets. Additionally, to explore deeper linear mode connectivity, we introduce the Mixture of Interpolated Experts (MoIE). Our results show that deep ensembles built on asymmetric neural networks achieve significantly better performance as ensemble size increases compared to their symmetric counterparts. In contrast, our experiments do not provide conclusive evidence on whether reducing symmetries affects both MoE and MoIE architectures.
Comment: The paper investigates the impact of reducing symmetries on MoE and introduces a novel architecture (MoIE), aligning with the 'Model Architecture' criterion.
Relevance: 9 Novelty: 7
39. Sparsity May Be All You Need: Sparse Random Parameter Adaptation
ArXiv ID: 2502.15975
Authors: Jesus Rios, Pierre Dognin, Ronny Luss, Karthikeyan N. Ramamurthy
Abstract: Full fine-tuning of large language models for alignment and task adaptation has become prohibitively expensive as models have grown in size. Parameter-Efficient Fine-Tuning (PEFT) methods aim at significantly reducing the computational and memory resources needed for fine-tuning these models by only training on a small number of parameters instead of all model parameters. Currently, the most popular PEFT method is the Low-Rank Adaptation (LoRA), which freezes the parameters of the model to be fine-tuned and introduces a small set of trainable parameters in the form of low-rank matrices. We propose simply reducing the number of trainable parameters by randomly selecting a small proportion of the model parameters to train on. In this paper, we compare the efficiency and performance of our proposed approach with PEFT methods, including LoRA, as well as full parameter fine-tuning.
Comment: The paper proposes a sparse parameter adaptation method for LLM fine-tuning, which aligns with sparsity and efficiency in model compression.
Relevance: 9 Novelty: 7
40. Pruning as a Defense: Reducing Memorization in Large Language Models
ArXiv ID: 2502.15796
Authors: Mansi Gupta, Nikhar Waghela, Sarthak Gupta, Shourya Goel, Sanjif Shanmugavelu
Abstract: Large language models have been shown to memorize significant portions of their training data, which they can reproduce when appropriately prompted. This work investigates the impact of simple pruning techniques on this behavior. Our findings reveal that pruning effectively reduces the extent of memorization in LLMs, demonstrating its potential as a foundational approach for mitigating membership inference attacks.
Comment: The paper investigates pruning as a method to reduce memorization in LLMs, which aligns with the model compression and sparsity criteria.
Relevance: 9 Novelty: 7
41. Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability
ArXiv ID: 2502.17071
Authors: Ashhadul Islam, Samir Brahim Belhaouari, Amine Bermak
Abstract: The exponential growth of large language models (LLMs) like ChatGPT has revolutionized artificial intelligence, offering unprecedented capabilities in natural language processing. However, the extensive computational resources required for training these models have significant environmental implications, including high carbon emissions, energy consumption, and water usage. This research presents a novel approach to LLM pruning, focusing on the systematic evaluation of individual weight importance throughout the training process. By monitoring parameter evolution over time, we propose a method that effectively reduces model size without compromising performance. Extensive experiments with both a scaled-down LLM and a large multimodal model reveal that moderate pruning enhances efficiency and reduces loss, while excessive pruning drastically deteriorates model performance. These findings highlight the critical need for optimized AI models to ensure sustainable development, balancing technological advancement with environmental responsibility.
Comment: The paper proposes a systematic weight evaluation for pruning LLMs, aligning with 'Model Compression'. It emphasizes sustainability and efficiency, which are relevant contributions.
Relevance: 9 Novelty: 7
42. Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks
ArXiv ID: 2502.16075
Authors: Yuhang Cai, Kangjie Zhou, Jingfeng Wu, Song Mei, Michael Lindsey, Peter L. Bartlett
Abstract: We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by Ji and Telgarsky (2020).
Comment: The paper provides theoretical insights into the implicit bias of gradient descent for non-homogeneous deep networks, aligning with 'Representation Learning' and training dynamics.
Relevance: 8 Novelty: 8
43. Learning to Reason from Feedback at Test-Time
ArXiv ID: 2502.15771
Authors: Yanyang Li, Michael Lyu, Liwei Wang
Abstract: Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce FTTT, a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.
Comment: The paper introduces a novel test-time optimization paradigm for LLMs to utilize feedback effectively, which aligns with foundational research in LLM behavior and reasoning.
Relevance: 8 Novelty: 8
44. Category-free Out-of-Distribution Node Detection with Feature Resonance
ArXiv ID: 2502.16076
Authors: Shenzhi Yang, Junbo Zhao, Shouqing Yang, Yixuan Li, Dingyu Yang, Xiaofang Zhang, Haobo Wang
Abstract: Detecting out-of-distribution (OOD) nodes in the graph-based machine-learning field is challenging, particularly when in-distribution (ID) node multi-category labels are unavailable. Thus, we focus on feature space rather than label space and find that, ideally, during the optimization of known ID samples, unknown ID samples undergo more significant representation changes than OOD samples, even if the model is trained to fit random targets, which we called the Feature Resonance phenomenon. The rationale behind it is that even without gold labels, the local manifold may still exhibit smooth resonance. Based on this, we further develop a novel graph OOD framework, dubbed Resonance-based Separation and Learning (RSL), which comprises two core modules: (i) a more practical micro-level proxy of feature resonance that measures the movement of feature vectors in one training step. (ii) integrate with synthetic OOD nodes strategy to train an effective OOD classifier. Theoretically, we derive an error bound showing the superior separability of OOD nodes during the resonance period. Empirically, RSL achieves state-of-the-art performance, reducing the FPR95 metric by an average of 18.51% across five real-world datasets.
Comment: The paper proposes a novel framework for OOD node detection in graphs using feature resonance, which aligns with representation learning and introduces a theoretically grounded approach.
Relevance: 8 Novelty: 8
45. The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
ArXiv ID: 2502.17420
Authors: Tom Wollschl\"ager, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan G\"unnemann, Johannes Gasteiger
Abstract: The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.
Comment: The paper explores refusal mechanisms in LLMs using gradient-based representation engineering, which provides insights into LLM behavior and interpretability.
Relevance: 8 Novelty: 8
46. Since Faithfulness Fails: The Performance Limits of Neural Causal Discovery
ArXiv ID: 2502.16056
Authors: Mateusz Olko, Mateusz Gajewski, Joanna Wojciechowska, Miko{\l}aj Morzy, Piotr Sankowski, Piotr Mi{\l}o\'s
Abstract: Neural causal discovery methods have recently improved in terms of scalability and computational efficiency. However, our systematic evaluation highlights significant room for improvement in their accuracy when uncovering causal structures. We identify a fundamental limitation: neural networks cannot reliably distinguish between existing and non-existing causal relationships in the finite sample regime. Our experiments reveal that neural networks, as used in contemporary causal discovery approaches, lack the precision needed to recover ground-truth graphs, even for small graphs and relatively large sample sizes. Furthermore, we identify the faithfulness property as a critical bottleneck: (i) it is likely to be violated across any reasonable dataset size range, and (ii) its violation directly undermines the performance of neural discovery methods. These findings lead us to conclude that progress within the current paradigm is fundamentally constrained, necessitating a paradigm shift in this domain.
Comment: The paper critiques neural causal discovery methods, identifying fundamental limitations, which aligns with emerging trends in challenging established assumptions.
Relevance: 8 Novelty: 8
47. When to Forget? Complexity Trade-offs in Machine Unlearning
ArXiv ID: 2502.17323
Authors: Martin Van Waerebeke, Marco Lorenzi, Giovanni Neglia, Kevin Scaman
Abstract: Machine Unlearning (MU) aims at removing the influence of specific data points from a trained model, striving to achieve this at a fraction of the cost of full model retraining. In this paper, we analyze the efficiency of unlearning methods and establish the first upper and lower bounds on minimax computation times for this problem, characterizing the performance of the most efficient algorithm against the most difficult objective function. Specifically, for strongly convex objective functions and under the assumption that the forget data is inaccessible to the unlearning method, we provide a phase diagram for the unlearning complexity ratio -- a novel metric that compares the computational cost of the best unlearning method to full model retraining. The phase diagram reveals three distinct regimes: one where unlearning at a reduced cost is infeasible, another where unlearning is trivial because adding noise suffices, and a third where unlearning achieves significant computational advantages over retraining. These findings highlight the critical role of factors such as data dimensionality, the number of samples to forget, and privacy constraints in determining the practical feasibility of unlearning.
Comment: The paper analyzes machine unlearning and provides theoretical bounds on unlearning complexity. This aligns with the Emerging Trends criterion, as it challenges established assumptions in model retraining.
Relevance: 8 Novelty: 8
48. CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
ArXiv ID: 2502.16880
Authors: Yepeng Weng, Dianwen Mei, Huishi Qiu, Xujie Chen, Li Liu, Jiang Tian, Zhongchao Shi
Abstract: Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffers in performance due to misalignment between training and inference. Recent methods have tried to solve this issue by adopting a multi-step training strategy, but the complex inputs of different training steps make it harder for the draft model to converge. To address this, we propose CORAL, a novel framework that improves both accuracy and efficiency in speculative drafting. CORAL introduces Cross-Step Representation Alignment, a method that enhances consistency across multiple training steps, significantly improving speculative drafting performance. Additionally, we identify the LM head as a major bottleneck in the inference speed of the draft model. We introduce a weight-grouping mechanism that selectively activates a subset of LM head parameters during inference, substantially reducing the latency of the draft model. We evaluate CORAL on three LLM families and three benchmark datasets, achieving speedup ratios of 2.50x-4.07x, outperforming state-of-the-art methods such as EAGLE-2 and HASS. Our results demonstrate that CORAL effectively mitigates training-inference misalignment and delivers significant speedup for modern LLMs with large vocabularies.
Comment: The paper introduces CORAL, a framework for speculative decoding in LLMs, addressing training-inference misalignment and efficiency. This aligns with the Model Compression criterion, particularly in improving inference efficiency.
Relevance: 8 Novelty: 8
49. Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models
ArXiv ID: 2502.16849
Authors: Taj Jones-McCormick, Aukosh Jagannath, Subhabrata Sen
Abstract: Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.
Comment: The paper provides theoretical insights into the benefits of unsupervised pre-training and transfer learning, aligning with foundational research in representation learning.
Relevance: 8 Novelty: 8
50. Dynamic Parallel Tree Search for Efficient LLM Reasoning
ArXiv ID: 2502.16235
Authors: Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao
Abstract: Tree of Thoughts (ToT) enhances Large Language Model (LLM) reasoning by structuring problem-solving as a spanning tree. However, recent methods focus on search accuracy while overlooking computational efficiency. The challenges of accelerating the ToT lie in the frequent switching of reasoning focus, and the redundant exploration of suboptimal solutions. To alleviate this dilemma, we propose Dynamic Parallel Tree Search (DPTS), a novel parallelism framework that aims to dynamically optimize the reasoning path in inference. It includes the Parallelism Streamline in the generation phase to build up a flexible and adaptive parallelism with arbitrary paths by fine-grained cache management and alignment. Meanwhile, the Search and Transition Mechanism filters potential candidates to dynamically maintain the reasoning focus on more possible solutions and have less redundancy. Experiments on Qwen-2.5 and Llama-3 with Math500 and GSM8K datasets show that DPTS significantly improves efficiency by 2-4x on average while maintaining or even surpassing existing reasoning algorithms in accuracy, making ToT-based reasoning more scalable and computationally efficient.
Comment: The paper introduces a novel parallelism framework for efficient LLM reasoning, which aligns with foundational research in efficiency and LLM behavior.
Relevance: 8 Novelty: 8
51. Brain-Model Evaluations Need the NeuroAI Turing Test
ArXiv ID: 2502.16238
Authors: Jenelle Feather, Meenakshi Khosla, N. Apurva Ratan Murty, Aran Nayebi
Abstract: What makes an artificial system a good model of intelligence? The classical test proposed by Alan Turing focuses on behavior, requiring that an artificial agent's behavior be indistinguishable from that of a human. While behavioral similarity provides a strong starting point, two systems with very different internal representations can produce the same outputs. Thus, in modeling biological intelligence, the field of NeuroAI often aims to go beyond behavioral similarity and achieve representational convergence between a model's activations and the measured activity of a biological system. This position paper argues that the standard definition of the Turing Test is incomplete for NeuroAI, and proposes a stronger framework called the ``NeuroAI Turing Test'', a benchmark that extends beyond behavior alone and \emph{additionally} requires models to produce internal neural representations that are empirically indistinguishable from those of a brain up to measured individual variability, i.e. the differences between a computational model and the brain is no more than the difference between one brain and another brain. While the brain is not necessarily the ceiling of intelligence, it remains the only universally agreed-upon example, making it a natural reference point for evaluating computational models. By proposing this framework, we aim to shift the discourse from loosely defined notions of brain inspiration to a systematic and testable standard centered on both behavior and internal representations, providing a clear benchmark for neuroscientific modeling and AI development.
Comment: The paper proposes a NeuroAI Turing Test framework, which introduces a novel paradigm for evaluating models based on representational convergence, aligning with foundational research in representation learning.
Relevance: 8 Novelty: 8
52. A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models
ArXiv ID: 2502.15828
Authors: Mengyang Sun, Yihao Wang, Tao Feng, Dan Zhang, Yifan Zhu, Jie Tang
Abstract: In order to streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been substantially adopted across various fields, including instruction tuning and domain adaptation. The underlying concept of LoRA involves decomposing a full-rank matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates the training process. Furthermore, to address the limited expressive capacity of LoRA, the Mixture-of-Expert (MoE) has been introduced for incorporating multiple LoRA adapters. The integration of LoRA experts leads to a visible improvement across several downstream scenes. However, the mixture of LoRAs (MoE-LoRA) still exhibits its low robustness during tuning and inferring. Inspired by the Riemannian Preconditioners which train LoRA as a sub-space projector, we propose a new training strategy for MoE-LoRA, to stabilize and boost its feature learning procedure by multi-space projections. Examinations on SGD and AdamW optimizers demonstrate the effectiveness of our methodology. Source code is available at https://github.com/THUDM/MoELoRA_Riemannian.
Comment: The paper introduces a stronger mixture of low-rank experts for fine-tuning foundation models, aligning with 'Model Compression' and 'Model Architecture' criteria.
Relevance: 8 Novelty: 7
53. R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression
ArXiv ID: 2502.15957
Authors: Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu
Abstract: Memory plays a key role in enhancing LLMs' performance when deployed to real-world applications. Existing solutions face trade-offs: explicit memory designs based on external storage require complex management and incur storage overhead, while implicit memory designs that store information via parameters struggle with reliable retrieval. In this paper, we propose R$^3$Mem, a memory network that optimizes both information Retention and Retrieval through Reversible context compression. Specifically, R$^3$Mem employs virtual memory tokens to compress and encode infinitely long histories, further enhanced by a hierarchical compression strategy that refines information from document- to entity-level for improved assimilation across granularities. For retrieval, R$^3$Mem employs a reversible architecture, reconstructing raw data by invoking the model backward with compressed information. Implemented via parameter-efficient fine-tuning, it can integrate seamlessly with any Transformer-based model. Experiments demonstrate that our memory design achieves state-of-the-art performance in long-context language modeling and retrieval-augmented generation tasks. It also significantly outperforms conventional memory modules in long-horizon interaction tasks like conversational agents, showcasing its potential for next-generation retrieval systems.
Comment: The paper introduces a memory network for LLMs with reversible compression, aligning with 'Model Compression' and 'Large Language Models' criteria.
Relevance: 8 Novelty: 7
54. Understanding the Emergence of Multimodal Representation Alignment
ArXiv ID: 2502.16282
Authors: Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, Paul Pu Liang
Abstract: Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become implicitly aligned with each other. These findings raise fundamental questions regarding the emergence of aligned representations in multimodal learning. Specifically: (1) when and why does alignment emerge implicitly? and (2) is alignment a reliable indicator of performance? Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. These include, but are not necessarily limited to, the degree of similarity between the modalities and the balance between redundant and unique information they provide for the task. Our findings suggest that alignment may not be universally beneficial; rather, its impact on performance varies depending on the dataset and task. These insights can help practitioners determine whether increasing alignment between modalities is advantageous or, in some cases, detrimental to achieving optimal performance. Code is released at https://github.com/MeganTj/multimodal_alignment.
Comment: The paper investigates the emergence of multimodal representation alignment, which aligns with representation learning and provides insights into training dynamics.
Relevance: 8 Novelty: 7
55. Verifying Quantized Graph Neural Networks is PSPACE-complete
ArXiv ID: 2502.16244
Authors: Marco S\"alzer, Fran\c{c}ois Schwarzentruber, Nicolas Troquard
Abstract: In this paper, we investigate verification of quantized Graph Neural Networks (GNNs), where some fixed-width arithmetic is used to represent numbers. We introduce the linear-constrained validity (LVP) problem for verifying GNNs properties, and provide an efficient translation from LVP instances into a logical language. We show that LVP is in PSPACE, for any reasonable activation functions. We provide a proof system. We also prove PSPACE-hardness, indicating that while reasoning about quantized GNNs is feasible, it remains generally computationally challenging.
Comment: The paper investigates the verification of quantized GNNs, which aligns with model compression and theoretical insights into quantization.
Relevance: 8 Novelty: 7
56. Subsampling Graphs with GNN Performance Guarantees
ArXiv ID: 2502.16703
Authors: Mika Sarkin Jain, Stefanie Jegelka, Ishani Karmarkar, Luana Ruiz, Ellen Vitercik
Abstract: How can we subsample graph data so that a graph neural network (GNN) trained on the subsample achieves performance comparable to training on the full dataset? This question is of fundamental interest, as smaller datasets reduce labeling costs, storage requirements, and computational resources needed for training. Selecting an effective subset is challenging: a poorly chosen subsample can severely degrade model performance, and empirically testing multiple subsets for quality obviates the benefits of subsampling. Therefore, it is critical that subsampling comes with guarantees on model performance. In this work, we introduce new subsampling methods for graph datasets that leverage the Tree Mover's Distance to reduce both the number of graphs and the size of individual graphs. To our knowledge, our approach is the first that is supported by rigorous theoretical guarantees: we prove that training a GNN on the subsampled data results in a bounded increase in loss compared to training on the full dataset. Unlike existing methods, our approach is both model-agnostic, requiring minimal assumptions about the GNN architecture, and label-agnostic, eliminating the need to label the full training set. This enables subsampling early in the model development pipeline (before data annotation, model selection, and hyperparameter tuning) reducing costs and resources needed for storage, labeling, and training. We validate our theoretical results with experiments showing that our approach outperforms existing subsampling methods across multiple datasets.
Comment: The paper introduces a graph subsampling method with theoretical guarantees, which aligns with representation learning and efficiency in GNNs.
Relevance: 8 Novelty: 7
57. Verification of Bit-Flip Attacks against Quantized Neural Networks
ArXiv ID: 2502.16286
Authors: Yedi Zhang, Lei Huang, Pengfei Gao, Fu Song, Jun Sun, Jin Song Dong
Abstract: In the rapidly evolving landscape of neural network security, the resilience of neural networks against bit-flip attacks (i.e., an attacker maliciously flips an extremely small amount of bits within its parameter storage memory system to induce harmful behavior), has emerged as a relevant area of research. Existing studies suggest that quantization may serve as a viable defense against such attacks. Recognizing the documented susceptibility of real-valued neural networks to such attacks and the comparative robustness of quantized neural networks (QNNs), in this work, we introduce BFAVerifier, the first verification framework designed to formally verify the absence of bit-flip attacks or to identify all vulnerable parameters in a sound and rigorous manner. BFAVerifier comprises two integral components: an abstraction-based method and an MILP-based method. Specifically, we first conduct a reachability analysis with respect to symbolic parameters that represent the potential bit-flip attacks, based on a novel abstract domain with a sound guarantee. If the reachability analysis fails to prove the resilience of such attacks, then we encode this verification problem into an equivalent MILP problem which can be solved by off-the-shelf solvers. Therefore, BFAVerifier is sound, complete, and reasonably efficient. We conduct extensive experiments, which demonstrate its effectiveness and efficiency across various network architectures, quantization bit-widths, and adversary capabilities.
Comment: The paper introduces a verification framework for bit-flip attacks on quantized neural networks, which aligns with model compression and theoretical insights into quantization.
Relevance: 8 Novelty: 7
58. Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation
ArXiv ID: 2502.17380
Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng
Abstract: Language diversity presents a significant challenge in speech-to-text (S2T) tasks, such as automatic speech recognition and translation. Traditional multi-task training approaches aim to address this by jointly optimizing multiple speech recognition and translation tasks across various languages. While models like Whisper, built on these strategies, demonstrate strong performance, they still face issues of high computational cost, language interference, suboptimal training configurations, and limited extensibility. To overcome these challenges, we introduce LoRS-Merging (low-rank and sparse model merging), a novel technique designed to efficiently integrate models trained on different languages or tasks while preserving performance and reducing computational overhead. LoRS-Merging combines low-rank and sparse pruning to retain essential structures while eliminating redundant parameters, mitigating language and task interference, and enhancing extensibility. Experimental results across a range of languages demonstrate that LoRS-Merging significantly outperforms conventional multi-lingual multi-task training baselines. Our findings suggest that model merging, particularly LoRS-Merging, is a scalable and effective complement to traditional multi-lingual training strategies for S2T applications.
Comment: The paper introduces a low-rank and sparse model merging technique for multi-lingual speech tasks, which aligns with sparsity and efficiency in model compression.
Relevance: 8 Novelty: 7
59. Hierarchical Residuals Exploit Brain-Inspired Compositionality
ArXiv ID: 2502.16003
Authors: Francisco M. L\'opez, Jochen Triesch
Abstract: We present Hierarchical Residual Networks (HiResNets), deep convolutional neural networks with long-range residual connections between layers at different hierarchical levels. HiResNets draw inspiration on the organization of the mammalian brain by replicating the direct connections from subcortical areas to the entire cortical hierarchy. We show that the inclusion of hierarchical residuals in several architectures, including ResNets, results in a boost in accuracy and faster learning. A detailed analysis of our models reveals that they perform hierarchical compositionality by learning feature maps relative to the compressed representations provided by the skip connections.
Comment: The introduction of Hierarchical Residual Networks (HiResNets) provides architectural innovation inspired by biological systems, aligning with the model architecture criterion.
Relevance: 8 Novelty: 7
60. To Share or Not to Share: Investigating Weight Sharing in Variational Graph Autoencoders
ArXiv ID: 2502.16724
Authors: Guillaume Salha-Galvan, Jiaying Xu
Abstract: This paper investigates the understudied practice of weight sharing (WS) in variational graph autoencoders (VGAE). WS presents both benefits and drawbacks for VGAE model design and node embedding learning, leaving its overall relevance unclear and the question of whether it should be adopted unresolved. We rigorously analyze its implications and, through extensive experiments on a wide range of graphs and VGAE variants, demonstrate that the benefits of WS consistently outweigh its drawbacks. Based on our findings, we recommend WS as an effective approach to optimize, regularize, and simplify VGAE models without significant performance loss.
Comment: The investigation of weight sharing in variational graph autoencoders (VGAE) aligns with representation learning and architectural analysis.
Relevance: 8 Novelty: 7
61. Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models
ArXiv ID: 2502.15799
Authors: Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Bykov Mikhail, Evgeny Burnaev
Abstract: Large Language Models (LLMs) have emerged as powerful tools for addressing modern challenges and enabling practical applications. However, their computational expense remains a significant barrier to widespread adoption. Quantization has emerged as a promising technique to democratize access and enable low resource device deployment. Despite these advancements, the safety and trustworthiness of quantized models remain underexplored, as prior studies often overlook contemporary architectures and rely on overly simplistic benchmarks and evaluations. To address this gap, we introduce OpenSafetyMini, a novel open-ended safety dataset designed to better distinguish between models. We evaluate 4 state-of-the-art quantization techniques across LLaMA and Mistral models using 4 benchmarks, including human evaluations. Our findings reveal that the optimal quantization method varies for 4-bit precision, while vector quantization techniques deliver the best safety and trustworthiness performance at 2-bit precision, providing foundation for future research.
Comment: The paper evaluates quantization methods for LLMs with a focus on safety and reliability, which aligns with the model compression and efficiency criteria.
Relevance: 8 Novelty: 7
62. Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
ArXiv ID: 2502.16681
Authors: Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
Abstract: Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a ground truth for the concepts used by an LLM, and a growing number of works have presented problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Due to the difficulty of detecting concepts in these challenging settings, we hypothesize that SAEs' basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to design ensemble methods combining SAEs with baselines that consistently outperform ensemble methods solely using baselines. Additionally, although SAEs initially appear promising for identifying spurious correlations, detecting poor dataset quality, and training multi-token probes, we are able to achieve similar results with simple non-SAE baselines as well. Though we cannot discount SAEs' utility on other tasks, our findings highlight the shortcomings of current SAEs and the need to rigorously evaluate interpretability methods on downstream tasks with strong baselines.
Comment: The paper evaluates sparse autoencoders for probing LLM activations, which aligns with representation learning and interpretability criteria.
Relevance: 8 Novelty: 7
63. Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow
ArXiv ID: 2502.15765
Authors: Behrooz Azarkhalili, Maxwell Libbrecht
Abstract: This paper introduces Generalized Attention Flow (GAF), a novel feature attribution method for Transformer-based models to address the limitations of current approaches. By extending Attention Flow and replacing attention weights with the generalized Information Tensor, GAF integrates attention weights, their gradients, the maximum flow problem, and the barrier method to enhance the performance of feature attributions. The proposed method exhibits key theoretical properties and mitigates the shortcomings of prior techniques that rely solely on simple aggregation of attention weights. Our comprehensive benchmarking on sequence classification tasks demonstrates that a specific variant of GAF consistently outperforms state-of-the-art feature attribution methods in most evaluation settings, providing a more reliable interpretation of Transformer model outputs.
Comment: The paper introduces a novel feature attribution method for Transformers, which aligns with the model architecture topic, specifically analysis of existing architectures.
Relevance: 8 Novelty: 7
64. Quantifying Logical Consistency in Transformers via Query-Key Alignment
ArXiv ID: 2502.17017
Authors: Eduard Tulchinskii, Anastasia Voznyuk, Laida Kushnareva, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
Abstract: Large language models (LLMs) have demonstrated impressive performance in various natural language processing tasks, yet their ability to perform multi-step logical reasoning remains an open challenge. Although Chain-of-Thought prompting has improved logical reasoning by enabling models to generate intermediate steps, it lacks mechanisms to assess the coherence of these logical transitions. In this paper, we propose a novel, lightweight evaluation strategy for logical reasoning that uses query-key alignments inside transformer attention heads. By computing a single forward pass and extracting a "QK-score" from carefully chosen heads, our method reveals latent representations that reliably separate valid from invalid inferences, offering a scalable alternative to traditional ablation-based techniques. We also provide an empirical validation on multiple logical reasoning benchmarks, demonstrating improved robustness of our evaluation method against distractors and increased reasoning depth. The experiments were conducted on a diverse set of models, ranging from 1.5B to 70B parameters.
Comment: The paper proposes a novel evaluation strategy for logical reasoning in Transformers, which is relevant to understanding LLM behavior and interpretability.
Relevance: 8 Novelty: 7
65. Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations
ArXiv ID: 2502.16169
Authors: Chunyang Li, Weiqi Wang, Tianshi Zheng, Yangqiu Song
Abstract: Inductive reasoning, a cornerstone of human cognition, enables generalization from limited data but hasn't yet been fully achieved by large language models (LLMs). While modern LLMs excel at reasoning tasks, their ability to maintain stable and consistent rule abstraction under imperfect observations remains underexplored. To fill this gap, in this work, we introduce Robust Rule Induction, a task that evaluates LLMs' capability in inferring rules from data that are fused with noisy examples. To address this task, we further propose Sample-steered Rule Refinement (SRR), a method enhancing reasoning stability via observation diversification and execution-guided feedback. Experiments across arithmetic, cryptography, and list functions reveal: (1) SRR outperforms other methods with minimal performance degradation under noise; (2) Despite slight accuracy variation, LLMs exhibit instability under noise (e.g., 0% accuracy change with only 70% consistent score); (3) Counterfactual task gaps highlight LLMs' reliance on memorized patterns over genuine abstraction. Our findings challenge LLMs' reasoning robustness, revealing susceptibility to hypothesis drift and pattern overfitting, while providing empirical evidence critical for developing human-like inductive systems. Code and data are available at \href{https://github.com/lcy2723/Robust-Rule-Induction}{https://github.com/lcy2723/Robust-Rule-Induction}.
Comment: The paper evaluates inductive reasoning in LLMs under noisy observations, which provides insights into LLM behavior and interpretability, aligning with the LLM topic.
Relevance: 8 Novelty: 7
66. CoME: An Unlearning-based Approach to Conflict-free Model Editing
ArXiv ID: 2502.15826
Authors: Dahyun Jung, Jaehyung Seo, Jaewook Lee, Chanjun Park, Heuiseok Lim
Abstract: Large language models (LLMs) often retain outdated or incorrect information from pre-training, which undermines their reliability. While model editing methods have been developed to address such errors without full re-training, they frequently suffer from knowledge conflicts, where outdated information interferes with new knowledge. In this work, we propose Conflict-free Model Editing (CoME), a novel framework that enhances the accuracy of knowledge updates in LLMs by selectively removing outdated knowledge. CoME leverages unlearning to mitigate knowledge interference, allowing new information to be integrated without compromising relevant linguistic features. Through experiments on GPT-J and LLaMA-3 using Counterfact and ZsRE datasets, we demonstrate that CoME improves both editing accuracy and model reliability when applied to existing editing methods. Our results highlight that the targeted removal of outdated knowledge is crucial for enhancing model editing effectiveness and maintaining the model's generative performance.
Comment: The paper introduces a model editing framework for LLMs, which aligns with 'Model Compression' and 'Representation Learning' due to its focus on unlearning and knowledge updates.
Relevance: 8 Novelty: 7
67. Towards Understanding Gradient Flow Dynamics of Homogeneous Neural Networks Beyond the Origin
ArXiv ID: 2502.15952
Authors: Akshay Kumar, Jarvis Haupt
Abstract: Recent works exploring the training dynamics of homogeneous neural network weights under gradient flow with small initialization have established that in the early stages of training, the weights remain small and near the origin, but converge in direction. Building on this, the current paper studies the gradient flow dynamics of homogeneous neural networks with locally Lipschitz gradients, after they escape the origin. Insights gained from this analysis are used to characterize the first saddle point encountered by gradient flow after escaping the origin. Also, it is shown that for homogeneous feed-forward neural networks, under certain conditions, the sparsity structure emerging among the weights before the escape is preserved after escaping the origin and until reaching the next saddle point.
Comment: The paper provides insights into gradient flow dynamics of homogeneous neural networks, which aligns with 'Representation Learning' due to its focus on training dynamics and sparsity structure.
Relevance: 8 Novelty: 7
68. FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
ArXiv ID: 2502.15804
Authors: Bingzhe Zhao, Ke Cheng, Aomufei Yuan, Yuxuan Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu
Abstract: KV cache techniques in Transformer models aim to reduce redundant computations at the expense of substantially increased memory usage, making KV cache compression an important and popular research topic. Recently, state-of-the-art KV cache compression methods implement imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget for each attention head, achieving excellent performance in single-GPU scenarios. However, we observe that such imbalanced compression leads to significant load imbalance when deploying multi-GPU inference, as some GPUs become overburdened while others remain underutilized. In this paper, we propose FairKV, a method designed to ensure fair memory usage among attention heads in systems employing imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which replicates a small subset of memory-intensive attention heads across GPUs using data parallelism to mitigate load imbalance. Our experiments on popular models, including LLaMA 70b and Mistral 24b model, demonstrate that FairKV increases throughput by 1.66x compared to standard tensor parallelism inference. Our code will be released as open source upon acceptance.
Comment: The paper focuses on KV cache compression and introduces FairKV, which addresses load imbalance in multi-GPU inference. This aligns with the Model Compression criterion, particularly in the context of efficiency improvements.
Relevance: 8 Novelty: 7
69. CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought
ArXiv ID: 2502.17214
Authors: Boxuan Zhang, Ruqi Zhang
Abstract: Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which incurs high computational costs. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. CoT-UQ captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. This key reasoning information is then aggregated to produce a final uncertainty estimate. We conduct extensive experiments based on LLaMA Family with model sizes varying from 8B to 13B across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average improvement of 5.9% AUROC compared to current UQ methods. The code is available at: https://github.com/ZBox1005/CoT-UQ.
Comment: The paper proposes a framework for uncertainty quantification in LLMs using Chain-of-Thought reasoning, which aligns with foundational research in LLM behavior and interpretability.
Relevance: 8 Novelty: 7
70. Graph Self-Supervised Learning with Learnable Structural and Positional Encodings
ArXiv ID: 2502.16233
Authors: Asiri Wijesinghe, Hao Zhu, Piotr Koniusz
Abstract: Traditional Graph Self-Supervised Learning (GSSL) struggles to capture complex structural properties well. This limitation stems from two main factors: (1) the inadequacy of conventional Graph Neural Networks (GNNs) in representing sophisticated topological features, and (2) the focus of self-supervised learning solely on final graph representations. To address these issues, we introduce \emph{GenHopNet}, a GNN framework that integrates a $k$-hop message-passing scheme, enhancing its ability to capture local structural information without explicit substructure extraction. We theoretically demonstrate that \emph{GenHopNet} surpasses the expressiveness of the classical Weisfeiler-Lehman (WL) test for graph isomorphism. Furthermore, we propose a structural- and positional-aware GSSL framework that incorporates topological information throughout the learning process. This approach enables the learning of representations that are both sensitive to graph topology and invariant to specific structural and feature augmentations. Comprehensive experiments on graph classification datasets, including those designed to test structural sensitivity, show that our method consistently outperforms the existing approaches and maintains computational efficiency. Our work significantly advances GSSL's capability in distinguishing graphs with similar local structures but different global topologies.
Comment: The paper proposes a GNN framework with enhanced structural and positional encodings, which aligns with foundational research in representation learning.
Relevance: 8 Novelty: 7
71. MaxSup: Overcoming Representation Collapse in Label Smoothing
ArXiv ID: 2502.15798
Authors: Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Mario Fritz, Margret Keuper
Abstract: Label Smoothing (LS) is widely adopted to curb overconfidence in neural network predictions and enhance generalization. However, previous research shows that LS can force feature representations into excessively tight clusters, eroding intra-class distinctions. More recent findings suggest that LS also induces overconfidence in misclassifications, yet the precise mechanism remained unclear. In this work, we decompose the loss term introduced by LS, revealing two key components: (i) a regularization term that functions only when the prediction is correct, and (ii) an error-enhancement term that emerges under misclassifications. This latter term compels the model to reinforce incorrect predictions with exaggerated certainty, further collapsing the feature space. To address these issues, we propose Max Suppression (MaxSup), which uniformly applies the intended regularization to both correct and incorrect predictions by penalizing the top-1 logit instead of the ground-truth logit. Through feature analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Extensive experiments on image classification and downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization.
Comment: The paper addresses representation collapse in label smoothing, which is relevant to 'Representation Learning'. The proposed MaxSup method offers a novel approach to regularization.
Relevance: 8 Novelty: 7
72. Subspace Recovery in Winsorized PCA: Insights into Accuracy and Robustness
ArXiv ID: 2502.16391
Authors: Sangil Han, Kyoowon Kim, Sungkyu Jung
Abstract: In this paper, we explore the theoretical properties of subspace recovery using Winsorized Principal Component Analysis (WPCA), utilizing a common data transformation technique that caps extreme values to mitigate the impact of outliers. Despite the widespread use of winsorization in various tasks of multivariate analysis, its theoretical properties, particularly for subspace recovery, have received limited attention. We provide a detailed analysis of the accuracy of WPCA, showing that increasing the number of samples while decreasing the proportion of outliers guarantees the consistency of the sample subspaces from WPCA with respect to the true population subspace. Furthermore, we establish perturbation bounds that ensure the WPCA subspace obtained from contaminated data remains close to the subspace recovered from pure data. Additionally, we extend the classical notion of breakdown points to subspace-valued statistics and derive lower bounds for the breakdown points of WPCA. Our analysis demonstrates that WPCA exhibits strong robustness to outliers while maintaining consistency under mild assumptions. A toy example is provided to numerically illustrate the behavior of the upper bounds for perturbation bounds and breakdown points, emphasizing winsorization's utility in subspace recovery.
Comment: The paper explores Winsorized PCA for robust subspace recovery, which aligns with 'Representation Learning' through its focus on theoretical robustness and accuracy in feature space.
Relevance: 8 Novelty: 7
73. NeurFlow: Interpreting Neural Networks through Neuron Groups and Functional Interactions
ArXiv ID: 2502.16105
Authors: Tue M. Cao, Nhat X. Hoang, Hieu H. Pham, Phi Le Nguyen, My T. Thai
Abstract: Understanding the inner workings of neural networks is essential for enhancing model performance and interpretability. Current research predominantly focuses on examining the connection between individual neurons and the model's final predictions. Which suffers from challenges in interpreting the internal workings of the model, particularly when neurons encode multiple unrelated features. In this paper, we propose a novel framework that transitions the focus from analyzing individual neurons to investigating groups of neurons, shifting the emphasis from neuron-output relationships to functional interaction between neurons. Our automated framework, NeurFlow, first identifies core neurons and clusters them into groups based on shared functional relationships, enabling a more coherent and interpretable view of the network's internal processes. This approach facilitates the construction of a hierarchical circuit representing neuron interactions across layers, thus improving interpretability while reducing computational costs. Our extensive empirical studies validate the fidelity of our proposed NeurFlow. Additionally, we showcase its utility in practical applications such as image debugging and automatic concept labeling, thereby highlighting its potential to advance the field of neural network explainability.
Comment: The paper proposes a framework for interpreting neural networks through neuron groups and functional interactions, which aligns with foundational research in representation learning and interpretability.
Relevance: 8 Novelty: 7
74. Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology
ArXiv ID: 2502.17026
Authors: Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, Hua Wei
Abstract: Understanding the uncertainty in large language model (LLM) explanations is important for evaluating their faithfulness and reasoning consistency, and thus provides insights into the reliability of LLM's output regarding a question. In this work, we propose a novel framework that quantifies uncertainty in LLM explanations through a reasoning topology perspective. By designing a structural elicitation strategy, we guide the LLMs to frame the explanations of an answer into a graph topology. This process decomposes the explanations into the knowledge related sub-questions and topology-based reasoning structures, which allows us to quantify uncertainty not only at the semantic level but also from the reasoning path. It further brings convenience to assess knowledge redundancy and provide interpretable insights into the reasoning process. Our method offers a systematic way to interpret the LLM reasoning, analyze limitations, and provide guidance for enhancing robustness and faithfulness. This work pioneers the use of graph-structured uncertainty measurement in LLM explanations and demonstrates the potential of topology-based quantification.
Comment: The paper introduces a novel framework for quantifying uncertainty in LLM explanations using a reasoning topology perspective. While it provides insights into LLM behavior, it focuses on interpretability and reasoning rather than foundational breakthroughs in LLM training or architecture.
Relevance: 7 Novelty: 7
75. PLS-based approach for fair representation learning
ArXiv ID: 2502.16263
Authors: Elena M. De-Diego, Adri\'an Perez-Suay, Paula Gordaliza, Jean-Michel Loubes
Abstract: We revisit the problem of fair representation learning by proposing Fair Partial Least Squares (PLS) components. PLS is widely used in statistics to efficiently reduce the dimension of the data by providing representation tailored for the prediction. We propose a novel method to incorporate fairness constraints in the construction of PLS components. This new algorithm provides a feasible way to construct such features both in the linear and the non linear case using kernel embeddings. The efficiency of our method is evaluated on different datasets, and we prove its superiority with respect to standard fair PCA method.
Comment: The paper proposes a PLS-based approach for fair representation learning, introducing fairness constraints in dimensionality reduction. This aligns with the Representation Learning criterion, particularly in the context of fair feature learning.
Relevance: 7 Novelty: 7
76. SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations
ArXiv ID: 2502.16949
Authors: Md Saidul Hoque Anik, Ariful Azad
Abstract: Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embedding can take a significantly long time, especially for larger datasets. Our analysis shows that the gradient computation of embedding is one of the dominant functions in the translation-based KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (Sparse-Dense Matrix Multiplication) kernels. This allows us to unify multiple scatter (and gather) operations as a single operation, reducing training time and memory usage. We create a general framework for training KG models using sparse kernels and implement four models, namely TransE, TransR, TransH, and TorusE. Our sparse implementations exhibit up to 5.3x speedup on the CPU and up to 4.2x speedup on the GPU with a significantly low GPU memory footprint. The speedups are consistent across large and small datasets for a given model. Our proposed sparse approach can also be extended to accelerate other translation-based (such as TransC, TransM, etc.) and non-translational (such as DistMult, ComplEx, RotatE, etc.) models as well.
Comment: The paper focuses on efficient training of knowledge graph embeddings using sparse matrix operations, which is relevant to model compression and sparsity.
Relevance: 7 Novelty: 6
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
-
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
-
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
-
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
-
Large Language Models (LLMs) - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
-
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
-
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application Work: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers like using MoE to solve a problem in the real world.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications on existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.