Personalized Daily Arxiv Papers 02/25/2025

[gpt-4o]	Prompt	Completion	Total
Token	92705	14000	106705
Cost	$0.23	$0.14	$0.37
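
The cost row follows from the token counts once per-token prices are fixed. A minimal sketch, assuming rates of $2.50 per million prompt tokens and $10.00 per million completion tokens; these rates are not stated in the digest and are used here only because they reproduce the figures above:

```python
# Assumed prices in USD per 1M tokens; not stated in the digest,
# chosen because they reproduce the cost row above.
PROMPT_USD_PER_M = 2.50
COMPLETION_USD_PER_M = 10.00


def usage_cost(prompt_tokens: int, completion_tokens: int) -> tuple[float, float, float]:
    """Return (prompt cost, completion cost, total cost) in USD, rounded to cents."""
    prompt_cost = prompt_tokens * PROMPT_USD_PER_M / 1_000_000
    completion_cost = completion_tokens * COMPLETION_USD_PER_M / 1_000_000
    return (
        round(prompt_cost, 2),
        round(completion_cost, 2),
        round(prompt_cost + completion_cost, 2),
    )


prompt_cost, completion_cost, total_cost = usage_cost(92705, 14000)
print(prompt_cost, completion_cost, total_cost)
```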

Total ArXiv papers: 1195

Total scanned papers: 745

Total relevant papers: 76

Table of contents with paper titles:

Forgotten Polygons: Multimodal Large Language Models are Shape-Blind Authors: William Rudman, Michal Golovanesky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh
Fractal Generative Models Authors: Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He
Compression Scaling Laws: Unifying Sparsity and Quantization Authors: Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, Dan Alistarh
DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance Authors: Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression Authors: Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen
BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference Authors: Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, Cheng Li
Delta Decompression for MoE-based LLMs Compression Authors: Hao Gu, Wei Li, Lujun Li, Qiyuan Zhu, Mark Lee, Shengjie Sun, Wei Xue, Yike Guo
Compression Barriers for Autoregressive Transformers Authors: Themistoklis Haris, Krzysztof Onak
A General Error-Theoretical Analysis Framework for Constructing Compression Strategies Authors: Boyang Zhang, Daning Cheng, Yunquan Zhang, Meiqi Tu, Fangmin Liu, Jiake Tian
Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks Authors: Andrei Chernov
Distributional Scaling Laws for Emergent Capabilities Authors: Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra
Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization Authors: Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu
Reasoning with Latent Thoughts: On the Power of Looped Transformers Authors: Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, Sashank J. Reddi
Forecasting Rare Language Model Behaviors Authors: Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma
Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation Authors: Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, Shiv Saini
UniDyG: A Unified and Effective Representation Learning Approach for Large Dynamic Graphs Authors: Yuanyuan Xu, Wenjie Zhang, Xuemin Lin, Ying Zhang
Toward a Flexible Framework for Linear Representation Hypothesis Using Maximum Likelihood Estimation Authors: Trung Nguyen, Yan Leng
Linear Attention for Efficient Bidirectional Sequence Modeling Authors: Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, Volkan Cevher
DISC: Dynamic Decomposition Improves LLM Inference Scaling Authors: Jonathan Light, Wei Cheng, Wu Yue, Masafumi Oyamada, Mengdi Wang, Santiago Paternain, Haifeng Chen
An explainable transformer circuit for compositional generalization Authors: Cheng Tang, Brenden Lake, Mehrdad Jazayeri
Neural Attention: A Novel Mechanism for Enhanced Expressive Power in Transformer Models Authors: Andrew DiGiugno, Ausif Mahmood
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam Authors: Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification Authors: Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An
Exact Recovery of Sparse Binary Vectors from Generalized Linear Measurements Authors: Arya Mazumdar, Neha Sangwan
When Can We Solve the Weighted Low Rank Approximation Problem in Truly Subquadratic Time? Authors: Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Signal Collapse in One-Shot Pruning: When Sparse Models Fail to Distinguish Neural Representations Authors: Dhananjay Saikumar, Blesson Varghese
Muon is Scalable for LLM Training Authors: Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang
The Role of Sparsity for Length Generalization in Transformers Authors: Noah Golowich, Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs Authors: Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
Function-Space Learning Rates Authors: Edward Milsom, Ben Anson, Laurence Aitchison
Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems Authors: Maksim Zhdanov, Max Welling, Jan-Willem van de Meent
Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer Authors: Euntae Choi, Sumin Song, Woosang Lim, Sungjoo Yoo
Entropy-Lens: The Information Signature of Transformer Computations Authors: Riccardo Ali, Francesco Caso, Christopher Irwin, Pietro Liò
Sequence-level Large Language Model Training with Contrastive Preference Optimization Authors: Zhili Feng, Dhananjay Ram, Cole Hawkins, Aditya Rawal, Jinman Zhao, Sheng Zha
A Gap Between the Gaussian RKHS and Neural Networks: An Infinite-Center Asymptotic Analysis Authors: Akash Kumar, Rahul Parhi, Mikhail Belkin
Low-rank bias, weight decay, and model merging in neural networks Authors: Ilja Kuzborskij, Yasin Abbasi Yadkori
Geometric Kolmogorov-Arnold Superposition Theorem Authors: Francesco Alesiani, Takashi Maruyama, Henrik Christiansen, Viktor Zaverkin
The Empirical Impact of Reducing Symmetries on the Performance of Deep Ensembles and MoE Authors: Andrei Chernov, Oleg Novitskij
Sparsity May Be All You Need: Sparse Random Parameter Adaptation Authors: Jesus Rios, Pierre Dognin, Ronny Luss, Karthikeyan N. Ramamurthy
Pruning as a Defense: Reducing Memorization in Large Language Models Authors: Mansi Gupta, Nikhar Waghela, Sarthak Gupta, Shourya Goel, Sanjif Shanmugavelu
Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability Authors: Ashhadul Islam, Samir Brahim Belhaouari, Amine Bermak
Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks Authors: Yuhang Cai, Kangjie Zhou, Jingfeng Wu, Song Mei, Michael Lindsey, Peter L. Bartlett
Learning to Reason from Feedback at Test-Time Authors: Yanyang Li, Michael Lyu, Liwei Wang
Category-free Out-of-Distribution Node Detection with Feature Resonance Authors: Shenzhi Yang, Junbo Zhao, Shouqing Yang, Yixuan Li, Dingyu Yang, Xiaofang Zhang, Haobo Wang
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence Authors: Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger
Since Faithfulness Fails: The Performance Limits of Neural Causal Discovery Authors: Mateusz Olko, Mateusz Gajewski, Joanna Wojciechowska, Mikołaj Morzy, Piotr Sankowski, Piotr Miłoś
When to Forget? Complexity Trade-offs in Machine Unlearning Authors: Martin Van Waerebeke, Marco Lorenzi, Giovanni Neglia, Kevin Scaman
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter Authors: Yepeng Weng, Dianwen Mei, Huishi Qiu, Xujie Chen, Li Liu, Jiang Tian, Zhongchao Shi
Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models Authors: Taj Jones-McCormick, Aukosh Jagannath, Subhabrata Sen
Dynamic Parallel Tree Search for Efficient LLM Reasoning Authors: Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao
Brain-Model Evaluations Need the NeuroAI Turing Test Authors: Jenelle Feather, Meenakshi Khosla, N. Apurva Ratan Murty, Aran Nayebi
A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models Authors: Mengyang Sun, Yihao Wang, Tao Feng, Dan Zhang, Yifan Zhu, Jie Tang
R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression Authors: Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu
Understanding the Emergence of Multimodal Representation Alignment Authors: Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, Paul Pu Liang
Verifying Quantized Graph Neural Networks is PSPACE-complete Authors: Marco Sälzer, François Schwarzentruber, Nicolas Troquard
Subsampling Graphs with GNN Performance Guarantees Authors: Mika Sarkin Jain, Stefanie Jegelka, Ishani Karmarkar, Luana Ruiz, Ellen Vitercik
Verification of Bit-Flip Attacks against Quantized Neural Networks Authors: Yedi Zhang, Lei Huang, Pengfei Gao, Fu Song, Jun Sun, Jin Song Dong
Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng
Hierarchical Residuals Exploit Brain-Inspired Compositionality Authors: Francisco M. López, Jochen Triesch
To Share or Not to Share: Investigating Weight Sharing in Variational Graph Autoencoders Authors: Guillaume Salha-Galvan, Jiaying Xu
Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models Authors: Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Bykov Mikhail, Evgeny Burnaev
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing Authors: Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow Authors: Behrooz Azarkhalili, Maxwell Libbrecht
Quantifying Logical Consistency in Transformers via Query-Key Alignment Authors: Eduard Tulchinskii, Anastasia Voznyuk, Laida Kushnareva, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations Authors: Chunyang Li, Weiqi Wang, Tianshi Zheng, Yangqiu Song
CoME: An Unlearning-based Approach to Conflict-free Model Editing Authors: Dahyun Jung, Jaehyung Seo, Jaewook Lee, Chanjun Park, Heuiseok Lim
Towards Understanding Gradient Flow Dynamics of Homogeneous Neural Networks Beyond the Origin Authors: Akshay Kumar, Jarvis Haupt
FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference Authors: Bingzhe Zhao, Ke Cheng, Aomufei Yuan, Yuxuan Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu
CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought Authors: Boxuan Zhang, Ruqi Zhang
Graph Self-Supervised Learning with Learnable Structural and Positional Encodings Authors: Asiri Wijesinghe, Hao Zhu, Piotr Koniusz
MaxSup: Overcoming Representation Collapse in Label Smoothing Authors: Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Mario Fritz, Margret Keuper
Subspace Recovery in Winsorized PCA: Insights into Accuracy and Robustness Authors: Sangil Han, Kyoowon Kim, Sungkyu Jung
NeurFlow: Interpreting Neural Networks through Neuron Groups and Functional Interactions Authors: Tue M. Cao, Nhat X. Hoang, Hieu H. Pham, Phi Le Nguyen, My T. Thai
Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology Authors: Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, Hua Wei
PLS-based approach for fair representation learning Authors: Elena M. De-Diego, Adrián Perez-Suay, Paula Gordaliza, Jean-Michel Loubes
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations Authors: Md Saidul Hoque Anik, Ariful Azad

1. Forgotten Polygons: Multimodal Large Language Models are Shape-Blind

ArXiv ID: 2502.15969

Authors: William Rudman, Michal Golovanesky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh

Abstract: Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of sides nor effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.

Comment: Author match

2. Fractal Generative Models

ArXiv ID: 2502.17437

Authors: Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He

Abstract: Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.

Comment: Author match

3. Compression Scaling Laws: Unifying Sparsity and Quantization

ArXiv ID: 2502.16440

Authors: Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, Dan Alistarh

Abstract: We investigate how different compression techniques -- such as weight and activation quantization, and weight sparsity -- affect the scaling behavior of large language models (LLMs) during pretraining. Building on previous work showing that weight sparsity acts as a constant multiplier on model size in scaling laws, we demonstrate that this "effective parameter" scaling pattern extends to quantization as well. Specifically, we establish that weight-only quantization achieves strong parameter efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bitwidths. Our results suggest that different compression techniques can be unified under a common scaling law framework, enabling principled comparison and combination of these methods.

Comment: The paper investigates compression scaling laws, unifying sparsity and quantization under a common framework, which directly aligns with model compression and provides theoretical insights.

Relevance: 10 Novelty: 9
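
Read as a formula, the "effective parameter" view in this abstract amounts to rescaling model size inside a standard scaling law. A sketch, assuming a Chinchilla-style loss in parameters $N$ and training tokens $D$; the functional form, the symbol $\mathrm{eff}(C)$ for the multiplier of a compression scheme $C$, and the constants $E, A, B, \alpha, \beta$ are assumptions made here and are not given in the abstract:

$$
L(N, D) = E + \frac{A}{\left(\mathrm{eff}(C)\cdot N\right)^{\alpha}} + \frac{B}{D^{\beta}}
$$

Under this form, a compressed model with $N$ parameters is predicted to reach the same loss as a dense model with $\mathrm{eff}(C)\cdot N$ parameters trained on the same data. For an illustrative value $\mathrm{eff}(C) = 0.5$, a 2B-parameter compressed model would match a 1B-parameter dense one.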

4. DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance

ArXiv ID: 2502.16886

Authors: Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li

Abstract: To alleviate the memory burden during inference of large language models (LLMs), numerous studies have focused on compressing the KV cache by exploring aspects such as attention sparsity. However, these techniques often require a pre-defined cache budget; because the optimal budget varies with input length and task type, this limits their practical deployment in settings that accept open-domain instructions. To address this limitation, we propose a new KV cache compression objective: to always ensure the full-cache performance regardless of specific inputs, while pruning the KV cache as much as possible. To achieve this goal, we introduce a novel KV cache compression method dubbed DBudgetKV, which features an attention-based metric that signals when the remaining KV cache is unlikely to match the full-cache performance and then halts the pruning process. Empirical evaluation spanning diverse context lengths, task types, and model sizes suggests that our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average. Furthermore, our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.

Comment: The paper proposes a novel KV cache compression method for LLMs, which directly aligns with model compression and efficiency breakthroughs.

Relevance: 10 Novelty: 8

5. Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression

ArXiv ID: 2502.16638

Authors: Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen

Abstract: Structured pruning and quantization are fundamental techniques used to reduce the size of deep neural networks (DNNs) and are typically applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNN. GETA introduces three key innovations: (i) a quantization-aware dependency graph (QADG) that constructs a pruning search space for generic quantization-aware DNNs, (ii) a partially projected stochastic gradient method that guarantees layerwise bit constraints are satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures showing that our approach achieves competitive (often superior) performance compared to existing joint pruning and quantization methods.

Comment: The paper introduces a framework for joint structured pruning and quantization, which aligns with foundational research in model compression.

Relevance: 10 Novelty: 8

6. BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference

ArXiv ID: 2502.16927

Authors: Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, Cheng Li

Abstract: The Mixture-of-Experts (MoE) structure scales Transformer-based large language models (LLMs) and improves their performance with only a sub-linear increase in computation resources. Recently, a fine-grained DeepSeekMoE structure was proposed, which can further improve the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically involves and activates more experts and hence incurs heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but has high communication efficiency. The main innovation of BigMac is that we abandon the communicate-descend-ascend-communicate (CDAC) manner used by fine-grained MoE, in which the All-to-All communication always takes place at the highest dimension. Instead, BigMac adopts an efficient descend-communicate-communicate-ascend (DCCA) manner. Specifically, we add a descending and an ascending projection at the entrance and exit of the expert, respectively, which enables the communication to be performed at a very low dimension. Furthermore, to adapt to DCCA, we re-design the structure of small experts, ensuring that each expert in BigMac has enough complexity to process tokens. Experimental results show that BigMac achieves comparable or even better model quality than fine-grained MoEs with the same number of experts and a similar number of total parameters. Equally importantly, BigMac reduces the end-to-end latency by up to 3.09$\times$ for training and increases the throughput by up to 3.11$\times$ for inference on state-of-the-art AI computing frameworks including Megatron, Tutel, and DeepSpeed-Inference.

Comment: BigMac introduces a communication-efficient MoE structure, directly aligning with architectural innovations in MoE and efficiency improvements.

Relevance: 10 Novelty: 8
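
The communication argument in the BigMac abstract can be checked with back-of-the-envelope arithmetic: the All-to-All payload grows with the dimension at which tokens are exchanged. A rough sketch under stated assumptions; the cost model (one dispatch plus one combine per routed token copy), the function name, and all sizes below are hypothetical and not taken from the paper:

```python
def all_to_all_bytes(tokens: int, top_k: int, comm_dim: int, bytes_per_value: int = 2) -> int:
    """Payload of dispatch plus combine when each routed token copy carries comm_dim values.

    Assumption: a simplified cost model that ignores padding, capacity
    factors, and network topology.
    """
    return 2 * tokens * top_k * comm_dim * bytes_per_value


# Hypothetical sizes: communicate at the full hidden width (CDAC-style)
# versus at a projected-down width (DCCA-style).
hidden_dim, low_dim = 4096, 512
cdac = all_to_all_bytes(tokens=8192, top_k=8, comm_dim=hidden_dim)
dcca = all_to_all_bytes(tokens=8192, top_k=8, comm_dim=low_dim)
print(cdac // dcca)
```

In this model the saving is simply the ratio of the two dimensions, which is why moving the projections outside the communication step matters most when many experts are activated per token.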

7. Delta Decompression for MoE-based LLMs Compression

ArXiv ID: 2502.17298

Authors: Hao Gu, Wei Li, Lujun Li, Qiyuan Zhu, Mark Lee, Shengjie Sun, Wei Xue, Yike Guo

Abstract: Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present $D^2$-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our $D^2$-MoE successfully compacts MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with performance gains of over 13% relative to other compressors on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs at 40$\sim$60% compression rates. Code is available at https://github.com/lliai/D2MoE.

Comment: The paper focuses on a novel compression method for MoE-based LLMs, aligning with the 'Model Compression' criterion. It introduces delta decompression and low-rank SVD techniques, which are foundational contributions.

Relevance: 10 Novelty: 8
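
The base-plus-delta decomposition described in the $D^2$-MoE abstract can be sketched in a few lines. This is a simplified illustration, not the paper's method: the base here is a plain average (the paper merges experts with Fisher-information weighting), the pruning of the base weights is omitted, and the function names, shapes, and rank are hypothetical:

```python
import numpy as np


def compress_experts(expert_weights: list[np.ndarray], rank: int):
    """Split experts into a shared base plus rank-limited deltas.

    Assumption: the base is a plain average; the paper instead merges
    experts using Fisher-information weighting.
    """
    base = np.mean(expert_weights, axis=0)
    factors = []
    for weight in expert_weights:
        delta = weight - base
        # Truncated SVD keeps the `rank` largest singular directions.
        u, s, vt = np.linalg.svd(delta, full_matrices=False)
        factors.append((u[:, :rank] * s[:rank], vt[:rank, :]))
    return base, factors


def reconstruct(base: np.ndarray, factor: tuple[np.ndarray, np.ndarray]) -> np.ndarray:
    """Rebuild one expert as base + low-rank delta."""
    left, right = factor
    return base + left @ right


rng = np.random.default_rng(0)
shared = rng.normal(size=(64, 32))
# Hypothetical experts: a common component plus a rank-2 expert-specific part.
experts = [shared + rng.normal(size=(64, 2)) @ rng.normal(size=(2, 32)) for _ in range(4)]
base, factors = compress_experts(experts, rank=8)
errors = [
    np.linalg.norm(reconstruct(base, f) - w) / np.linalg.norm(w)
    for f, w in zip(factors, experts)
]
print(max(errors))
```

In this toy setup each delta has rank at most 8, so the rank-8 factors reconstruct the experts up to floating-point error; real expert deltas are only approximately low-rank, so truncation trades accuracy for size.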

8. Compression Barriers for Autoregressive Transformers

ArXiv ID: 2502.15955

Authors: Themistoklis Haris, Krzysztof Onak

Abstract: A key limitation of autoregressive Transformers is the large memory needed at inference time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache, but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d = \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(d\cdot e^d)$ space and prove, using tight bounds on covering numbers, that SubGen, proposed by Zandieh, Han, Mirrokni and Karbasi, matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation's time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.

Comment: The paper provides theoretical insights into the compression barriers for autoregressive Transformers, directly addressing model compression and efficiency.

Relevance: 9 Novelty: 9
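
To put the $\Theta(nd)$ bound from this abstract in concrete terms, the uncompressed cache for a single attention head stores one key and one value vector per token, so its size grows linearly in both $n$ and $d$. A minimal sketch; the function name and the example sizes (100,000 tokens, dimension 128, 2 bytes per value) are illustrative assumptions:

```python
def kv_cache_bytes(n_tokens: int, dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store uncompressed keys and values for one attention head."""
    # One key vector and one value vector of length `dim` per token.
    return 2 * n_tokens * dim * bytes_per_value


single = kv_cache_bytes(100_000, 128)
doubled = kv_cache_bytes(200_000, 128)
print(single, doubled)
```

Doubling the context doubles the footprint; the lower bound in the abstract says that, absent structural assumptions on the embeddings and with $d = \Omega(\log n)$, no algorithm can improve on this linear growth asymptotically.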

9. A General Error-Theoretical Analysis Framework for Constructing Compression Strategies

ArXiv ID: 2502.15802

Authors: Boyang Zhang, Daning Cheng, Yunquan Zhang, Meiqi Tu, Fangmin Liu, Jiake Tian

Abstract: The exponential growth in parameter size and computational complexity of deep models poses significant challenges for efficient deployment. The core problem of existing compression methods is that different layers of the model have significant differences in their tolerance to compression levels. For instance, the first layer of a model can typically sustain a higher compression level compared to the last layer without compromising performance. Thus, the key challenge lies in how to allocate compression levels across layers in a way that minimizes performance loss while maximizing parameter reduction. To address this challenge, we propose a Compression Error Theory (CET) framework, designed to determine the optimal compression level for each layer. Taking quantization as an example, CET leverages differential expansion and algebraic geometry to reconstruct the quadratic form of quantization error as ellipsoids and hyperbolic paraboloids, and utilizes their geometric structures to define an error subspace. To identify the error subspace with minimal performance loss, CET performs an orthogonal decomposition of the geometric space and thereby transforms the optimization of the error subspace into a complementary problem. The final theoretical analysis shows that constructing the quantization subspace along the major axis results in minimal performance degradation. Experimental verification of the theory shows that CET largely retains performance under compression. Specifically, on the ResNet-34 model, CET achieves nearly 11$\times$ parameter compression with performance comparable to, or even surpassing, that of the original model.

Comment: The paper introduces a theoretical framework for constructing compression strategies, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 9

10. Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks

ArXiv ID: 2502.17187

Authors: Andrei Chernov

Abstract: Recently, Large Language Models (LLMs) with Mixture of Experts (MoE) layers have gained significant attention. Currently, state-of-the-art LLMs utilize this architecture. There is a substantial amount of research on how to train such models and how to select hyperparameters for this architecture. However, there is a lack of studies focusing on post-evaluation analysis of MoE layer properties. In this paper, we take a first step toward closing this gap by evaluating expert contributions on the quiz-based MMLU benchmark. We show that most experts were never activated during inference on this benchmark. Additionally, the output distribution of gating networks is much closer to uniform than sparse. Finally, we demonstrate that the average performance of some experts within the same layer varies significantly.

Comment: The paper evaluates MoE LLMs, focusing on expert contributions and gating network behavior, which directly aligns with the model architecture topic, particularly MoE analysis.

Relevance: 10 Novelty: 7
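
The two measurements named in this abstract, experts that are never activated and gate outputs that are close to uniform, can be computed directly from router probabilities. A minimal sketch; the helper names, the top-k routing rule, and the sample gate values are hypothetical and not taken from the paper:

```python
import math


def top_k_indices(probs: list[float], k: int) -> list[int]:
    """Indices of the k largest gate probabilities for one token."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def never_activated_fraction(gate_probs: list[list[float]], k: int) -> float:
    """Share of experts that no token routes to under top-k selection."""
    n_experts = len(gate_probs[0])
    used = {i for probs in gate_probs for i in top_k_indices(probs, k)}
    return 1 - len(used) / n_experts


def normalized_entropy(probs: list[float]) -> float:
    """Entropy divided by log(n): 1.0 is uniform, 0.0 is one-hot."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# Hypothetical gate outputs for three tokens over four experts.
gates = [
    [0.40, 0.30, 0.20, 0.10],
    [0.35, 0.35, 0.20, 0.10],
    [0.30, 0.40, 0.20, 0.10],
]
print(never_activated_fraction(gates, k=2))
print([round(normalized_entropy(p), 3) for p in gates])
```

The toy data shows how both observations can hold at once: each token's gate distribution is fairly flat (normalized entropy above 0.9), yet top-2 routing only ever selects the same two experts, leaving half of them unused.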

11. Distributional Scaling Laws for Emergent Capabilities

ArXiv ID: 2502.17356

Authors: Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra

Abstract: In this paper, we explore the nature of sudden breakthroughs in language model performance at scale, which stands in contrast to smooth improvements governed by scaling laws. While advocates of "emergence" view abrupt performance gains as capabilities unlocking at specific scales, others have suggested that they are produced by thresholding effects and alleviated by continuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes, particularly when performance is bimodally distributed across random seeds. In synthetic length generalization tasks, we show that different random seeds can produce either highly linear or emergent scaling trends. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. Furthermore, we provide a case study of inverse scaling and show that even as the probability of a successful run declines, the average performance of a successful run continues to increase monotonically. We validate our distributional scaling framework on realistic settings by measuring MMLU performance in LLM populations. These insights emphasize the role of random variation in the effect of scale on LLM capabilities.

Comment: The paper explores emergent capabilities in LLMs and provides theoretical insights into scaling laws and random seed effects, aligning with the 'Large Language Models' criterion.