Personalized Daily ArXiv Papers 2025-06-02

[gpt-4o]	Prompt	Completion	Total
Token	45926	5505	51431
Cost	$0.11	$0.06	$0.17

Total arXiv papers: 808

Total scanned papers: 463

Total relevant papers: 40

Table of contents with paper titles:

On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks Authors: Mingze Wang, Weinan E
Learning geometry and topology via multi-chart flows Authors: Hanlin Yu, Søren Hauberg, Marcelo Hartmann, Arto Klami, Georgios Arvanitidis
HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts Authors: Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying
Representational Difference Explanations Authors: Neehar Kondapaneni, Oisin Mac Aodha, Pietro Perona
Differential Gated Self-Attention Authors: Elpiniki Maria Lygizou, Mónika Farsang, Radu Grosu
DenseLoRA: Dense Low-Rank Adaptation of Large Language Models Authors: Lin Mu, Xiaoyu Wang, Li Ni, Yang Li, Zhize Wu, Peiquan Jin, Yiwen Zhang
Hyperbolic Dataset Distillation Authors: Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Characterising the Inductive Biases of Neural Networks on Boolean Data Authors: Chris Mingard, Lukas Seier, Niclas Göring, Andrei-Vlad Badelita, Charles London, Ard Louis
TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks Authors: Xiang Meng, Mehdi Makni, Rahul Mazumder
DLP: Dynamic Layerwise Pruning in Large Language Models Authors: Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang
Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution Authors: Qiao Xiao, Alan Ansell, Boqian Wu, Lu Yin, Mykola Pechenizkiy, Shiwei Liu, Decebal Constantin Mocanu
GradPower: Powering Gradients for Faster Language Model Pre-Training Authors: Mingze Wang, Jinbo Wang, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, Lei Wu
SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling Authors: Xiaodong Ji, Hailin Zhang, Fangcheng Fu, Bin Cui
Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis Authors: Junzhuo Li, Bo Wang, Xiuze Zhou, Peijie Jiang, Jia Liu, Xuming Hu
Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws Authors: Hidetaka Kamigaito, Ying Zhang, Jingun Kwon, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration Authors: Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, Xiaokang Yang
A Mathematical Perspective On Contrastive Learning Authors: Ricardo Baptista, Andrew M. Stuart, Son Tran
Disentangling Granularity: An Implicit Inductive Bias in Factorized VAEs Authors: Zihao Chen, Yu Xiang, Wenyong Wang
R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration Authors: Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu
From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching Authors: Ruqi Bai, Yao Ji, Zeyu Zhou, David I. Inouye
Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization Authors: Chengli Tan, Yubo Zhou, Haishan Ye, Guang Dai, Junmin Liu, Zengjie Song, Jiangshe Zhang, Zixiang Zhao, Yunda Hao, Yong Xu
NeuronTune: Towards Self-Guided Spurious Bias Mitigation Authors: Guangtao Zheng, Wenqian Ye, Aidong Zhang
Revisiting Uncertainty Estimation and Calibration of Large Language Models Authors: Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, Chang Xu
Rethinking Continual Learning with Progressive Neural Collapse Authors: Zheng Wang, Wanhao Yu, Li Yang, Sen Lin
Mamba Knockout for Unraveling Factual Information Flow Authors: Nir Endy, Idan Daniel Grosbard, Yuval Ran-Milo, Yonatan Slutzky, Itay Tshuva, Raja Giryes
Conservation-preserved Fourier Neural Operator through Adaptive Correction Authors: Chaoyu Liu, Yangming Li, Zhongying Deng, Chris Budd, Carola-Bibiane Schönlieb
Cross-Attention Speculative Decoding Authors: Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Nikhil Verma, Yipeng Ji, Chul Lee
Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts Authors: Xin He, Xumeng Han, Longhui Wei, Lingxi Xie, Qi Tian
Model Unlearning via Sparse Autoencoder Subspace Guided Projections Authors: Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou
Mixture-of-Experts for Personalized and Semantic-Aware Next Location Prediction Authors: Shuai Liu, Ning Cao, Yile Chen, Yue Jiang, Gao Cong
The Road to Generalizable Neuro-Symbolic Learning Should be Paved with Foundation Models Authors: Adam Stein, Aaditya Naik, Neelay Velingker, Mayur Naik, Eric Wong
Inference Acceleration of Autoregressive Normalizing Flows by Selective Jacobi Decoding Authors: Jiaru Zhang, Juanwu Lu, Ziran Wang, Ruqi Zhang
BIRD: Behavior Induction via Representation-structure Distillation Authors: Galen Pogoncheff, Michael Beyeler
PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations Authors: Benjamin Holzschuh, Qiang Liu, Georg Kohl, Nils Thuerey
Cartan Networks: Group theoretical Hyperbolic Deep Learning Authors: Federico Milanesio, Matteo Santoro, Pietro G. Fré, Guido Sanguinetti
Mind the Gap: A Practical Attack on GGUF Quantization Authors: Kazuki Egashira, Robin Staab, Mark Vero, Jingxuan He, Martin Vechev
Knockoff-Guided Compressive Sensing: A Statistical Machine Learning Framework for Support-Assured Signal Recovery Authors: Xiaochen Zhang, Haoyi Xiong
The Rich and the Simple: On the Implicit Bias of Adam and SGD Authors: Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, Mahdi Soltanolkotabi
Graph Flow Matching: Enhancing Image Generation with Neighbor-Aware Flow Fields Authors: Md Shahriar Rahim Siddiqui, Moshe Eliasof, Eldad Haber
Weisfeiler and Leman Follow the Arrow of Time: Expressive Power of Message Passing in Temporal Event Graphs Authors: Franziska Heeg, Jonas Sauer, Petra Mutzel, Ingo Scholtes

1. On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks

ArXiv ID: 2505.24205

Authors: Mingze Wang, Weinan E

Abstract: Mixture-of-experts networks (MoEs) have demonstrated remarkable efficiency in modern deep learning. Despite their empirical success, the theoretical foundations underlying their ability to model complex tasks remain poorly understood. In this work, we conduct a systematic study of the expressive power of MoEs in modeling complex tasks with two common structural priors: low-dimensionality and sparsity. For shallow MoEs, we prove that they can efficiently approximate functions supported on low-dimensional manifolds, overcoming the curse of dimensionality. For deep MoEs, we show that $\mathcal{O}(L)$-layer MoEs with $E$ experts per layer can approximate piecewise functions comprising $E^L$ pieces with compositional sparsity, i.e., they can exhibit an exponential number of structured tasks. Our analysis reveals the roles of critical architectural components and hyperparameters in MoEs, including the gating mechanism, expert networks, the number of experts, and the number of layers, and offers natural suggestions for MoE variants.

Comment: The paper provides a theoretical study on the expressive power of Mixture-of-Experts (MoE), directly relevant to model architecture.

Relevance: 10 Novelty: 8

2. Learning geometry and topology via multi-chart flows

ArXiv ID: 2505.24665

Authors: Hanlin Yu, Søren Hauberg, Marcelo Hartmann, Arto Klami, Georgios Arvanitidis

Abstract: Real-world data often lie on low-dimensional Riemannian manifolds embedded in high-dimensional spaces. This motivates learning degenerate normalizing flows that map between the ambient space and a low-dimensional latent space. However, if the manifold has a non-trivial topology, it can never be correctly learned using a single flow. Instead, multiple flows must be 'glued together'. In this paper, we first propose a general training scheme for learning such a collection of flows, and second, we develop the first numerical algorithms for computing geodesics on such manifolds. Empirically, we demonstrate that this leads to highly significant improvements in topology estimation.

Comment: The paper introduces a method for learning geometry and topology via multi-chart flows, which is relevant to representation learning and emerging trends in foundational research.

Relevance: 9 Novelty: 9

3. HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

ArXiv ID: 2505.24722

Authors: Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying

Abstract: Large language models (LLMs) have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that not respecting the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We thus propose to operate fully in Hyperbolic space, known for its expansive, scale-free, and low-distortion properties. We thus introduce HELM, a family of HypErbolic Large Language Models, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MICE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MICE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures -- up to 4% -- over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.

Comment: The paper introduces HELM, a hyperbolic LLM with Mixture-of-Curvature Experts, relevant to LLM architecture innovations.

Relevance: 9 Novelty: 9

4. Representational Difference Explanations

ArXiv ID: 2505.23917

Authors: Neehar Kondapaneni, Oisin Mac Aodha, Pietro Perona

Abstract: We propose a method for discovering and visualizing the differences between two learned representations, enabling more direct and interpretable model comparisons. We validate our method, which we call Representational Differences Explanations (RDX), by using it to compare models with known conceptual differences and demonstrate that it recovers meaningful distinctions where existing explainable AI (XAI) techniques fail. Applied to state-of-the-art models on challenging subsets of the ImageNet and iNaturalist datasets, RDX reveals both insightful representational differences and subtle patterns in the data. Although comparison is a cornerstone of scientific analysis, current tools in machine learning, namely post hoc XAI methods, struggle to support model comparison effectively. Our work addresses this gap by introducing an effective and explainable tool for contrasting model representations.

Comment: The paper introduces Representational Differences Explanations (RDX), a method for comparing learned representations, which aligns with representation learning by providing insights into how models encode information.

Relevance: 9 Novelty: 8

5. Differential Gated Self-Attention

ArXiv ID: 2505.24054

Authors: Elpiniki Maria Lygizou, Mónika Farsang, Radu Grosu

Abstract: Transformers excel across a large variety of tasks but remain susceptible to corrupted inputs, since standard self-attention treats all query-key interactions uniformly. Inspired by lateral inhibition in biological neural circuits and building on the Differential Transformer's recent use of subtraction between two parallel softmax maps for noise cancellation, we propose Multihead Differential Gated Self-Attention (M-DGSA) that learns per-head input-dependent gating to dynamically suppress attention noise. Each head splits into excitatory and inhibitory branches whose dual softmax maps are fused by a sigmoid gate predicted from the token embedding, yielding a context-aware contrast enhancement. M-DGSA integrates seamlessly into existing Transformer stacks with minimal computational overhead. We evaluate on both vision and language benchmarks, demonstrating consistent robustness gains over vanilla Transformer, Vision Transformer, and Differential Transformer baselines. Our contributions are (i) a novel input-dependent gating mechanism for self-attention grounded in lateral inhibition, (ii) a principled synthesis of biological contrast-enhancement and self-attention theory, and (iii) comprehensive experiments demonstrating noise resilience and cross-domain applicability.

Comment: The paper proposes a novel input-dependent gating mechanism for self-attention in Transformers, which is relevant to model architecture innovations.

Relevance: 9 Novelty: 8
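
Illustration (not from the paper): a minimal NumPy sketch of the gating idea described in the abstract, for a single head. The abstract does not give the exact fusion rule, so the gate-weighted difference `gate * exc - (1 - gate) * inh` is an assumption, as are the function name and all weight names.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def gated_differential_attention(x, wq_e, wk_e, wq_i, wk_i, wv, w_gate):
    """Single-head sketch: excitatory and inhibitory softmax maps fused by a sigmoid gate.

    x: (tokens, d_model); wq_*, wk_*: (d_model, d_head); wv: (d_model, d_v);
    w_gate: (d_model, 1). The fusion rule below is an ASSUMPTION, not the paper's.
    """
    d_head = wq_e.shape[1]
    exc = softmax((x @ wq_e) @ (x @ wk_e).T / np.sqrt(d_head))  # excitatory map
    inh = softmax((x @ wq_i) @ (x @ wk_i).T / np.sqrt(d_head))  # inhibitory map
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # one gate per query token, in (0, 1)
    attn = gate * exc - (1.0 - gate) * inh      # assumed gated difference
    return attn @ (x @ wv)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))
weights = [rng.standard_normal((6, 3)) for _ in range(4)]
wv = rng.standard_normal((6, 5))
w_gate = rng.standard_normal((6, 1))
out = gated_differential_attention(x, *weights, wv, w_gate)
print(out.shape)
```

Under this assumed rule, a gate near 1 recovers ordinary softmax attention on the excitatory branch, while a gate of 0.5 with identical branches cancels the output entirely.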

6. DenseLoRA: Dense Low-Rank Adaptation of Large Language Models

ArXiv ID: 2505.23808

Authors: Lin Mu, Xiaoyu Wang, Li Ni, Yang Li, Zhize Wu, Peiquan Jin, Yiwen Zhang

Abstract: Low-rank adaptation (LoRA) has been developed as an efficient approach for adapting large language models (LLMs) by fine-tuning two low-rank matrices, thereby reducing the number of trainable parameters. However, prior research indicates that many of the weights in these matrices are redundant, leading to inefficiencies in parameter utilization. To address this limitation, we introduce Dense Low-Rank Adaptation (DenseLoRA), a novel approach that enhances parameter efficiency while achieving superior performance compared to LoRA. DenseLoRA builds upon the concept of representation fine-tuning, incorporating a single Encoder-Decoder to refine and compress hidden representations across all adaptation layers before applying adaptation. Instead of relying on two redundant low-rank matrices as in LoRA, DenseLoRA adapts LLMs through a dense low-rank matrix, improving parameter utilization and adaptation efficiency. We evaluate DenseLoRA on various benchmarks, showing that it achieves 83.8% accuracy with only 0.01% of trainable parameters, compared to LoRA's 80.8% accuracy with 0.70% of trainable parameters on LLaMA3-8B. Additionally, we conduct extensive experiments to systematically assess the impact of DenseLoRA's components on overall model performance. Code is available at https://github.com/mulin-ahu/DenseLoRA.

Comment: DenseLoRA enhances parameter efficiency in low-rank adaptation of LLMs, which is relevant to model compression and large language models.

Relevance: 9 Novelty: 8

7. Hyperbolic Dataset Distillation

ArXiv ID: 2505.24623

Authors: Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Abstract: To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. Notably, HDD is seamlessly compatible with most existing DM methods, and extensive experiments on different datasets validate its effectiveness.

Comment: The paper introduces Hyperbolic Dataset Distillation (HDD), which is relevant to representation learning and model compression by addressing dataset distillation in hyperbolic space.

Relevance: 9 Novelty: 8
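
Illustration (not from the paper): a small sketch of a centroid-distance loss in the Lorentz model with curvature -1. The abstract only states that the discrepancy is the geodesic distance between centroids; lifting features with the exponential map at the origin and using the normalized-sum centroid are assumptions here, and all function names are hypothetical.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v):
    """Lift a Euclidean feature vector onto the hyperboloid <x, x>_L = -1."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]

def lorentz_centroid(points):
    """Normalized sum of hyperboloid points (assumed centroid definition)."""
    total = [sum(coords) for coords in zip(*points)]
    denom = math.sqrt(-lorentz_inner(total, total))
    return [c / denom for c in total]

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); the clamp guards against rounding."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

# Toy features standing in for shallow-network embeddings.
real_features = [[0.3, 0.4], [0.1, -0.2], [0.5, 0.0]]
synthetic_features = [[0.2, 0.1]]

real_centroid = lorentz_centroid([exp_map_origin(v) for v in real_features])
syn_centroid = lorentz_centroid([exp_map_origin(v) for v in synthetic_features])
loss = lorentz_distance(real_centroid, syn_centroid)
print(loss)
```

In a distillation loop this scalar would be minimized with respect to the synthetic samples; the optimization itself is omitted.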

8. Characterising the Inductive Biases of Neural Networks on Boolean Data

ArXiv ID: 2505.24060

Authors: Chris Mingard, Lukas Seier, Niclas Göring, Andrei-Vlad Badelita, Charles London, Ard Louis

Abstract: Deep neural networks are renowned for their ability to generalise well across diverse tasks, even when heavily overparameterized. Existing works offer only partial explanations (for example, the NTK-based task-model alignment explanation neglects feature learning). Here, we provide an end-to-end, analytically tractable case study that links a network's inductive prior, its training dynamics including feature learning, and its eventual generalisation. Specifically, we exploit the one-to-one correspondence between depth-2 discrete fully connected networks and disjunctive normal form (DNF) formulas by training on Boolean functions. Under a Monte Carlo learning algorithm, our model exhibits predictable training dynamics and the emergence of interpretable features. This framework allows us to trace, in detail, how inductive bias and feature formation drive generalisation.

Comment: The paper provides an analytical case study on the inductive biases of neural networks on Boolean data, which is relevant to representation learning and emerging trends.

Relevance: 9 Novelty: 8

9. TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks

ArXiv ID: 2505.23949

Authors: Xiang Meng, Mehdi Makni, Rahul Mazumder

Abstract: Network pruning reduces the computational requirements of large neural networks, with N:M sparsity -- retaining only N out of every M consecutive weights -- offering a compelling balance between compressed model quality and hardware acceleration. However, N:M sparsity only accelerates forward-pass computations, as N:M patterns are not preserved during matrix transposition, limiting efficiency during training where both passes are computationally intensive. While transposable N:M sparsity has been proposed to address this limitation, existing methods for finding transposable N:M sparse masks either fail to scale to large models or are restricted to M=4 which results in suboptimal compression-accuracy trade-off. We introduce an efficient solver for transposable N:M masks that scales to billion-parameter models. We formulate mask generation as optimal transport problems and solve through entropy regularization and Dykstra's algorithm, followed by a rounding procedure. Our tensor-based implementation exploits GPU parallelism, achieving up to 100x speedup with only 1-10% error compared to existing methods. Our approach can be integrated with layer-wise N:M pruning frameworks including Wanda, SparseGPT and ALPS to produce transposable N:M sparse models with arbitrary N:M values. Experiments show that LLaMA3.2-8B with transposable 16:32 sparsity maintains performance close to its standard N:M counterpart and outperforms standard 2:4 sparse model, showing the practical value of our approach.

Comment: The paper presents a novel approach to network pruning with transposable N:M sparsity, which is relevant to model compression.

Relevance: 9 Novelty: 8
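
Illustration (not from the paper): a plain-Python sketch of standard magnitude-based N:M masking, showing why the pattern is generally lost under transposition. This is not the TSENOR solver; the helper names and the toy weight matrix are hypothetical.

```python
def nm_mask(matrix, n, m):
    """Keep the n largest-magnitude weights in every group of m consecutive weights per row."""
    mask = []
    for row in matrix:
        assert len(row) % m == 0, "row length must be a multiple of m"
        mask_row = []
        for start in range(0, len(row), m):
            group = row[start:start + m]
            order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
            keep = set(order[:n])
            mask_row.extend(1 if i in keep else 0 for i in range(m))
        mask.append(mask_row)
    return mask

def is_nm_sparse(mask, n, m):
    """True if every group of m consecutive entries in each row has at most n ones."""
    return all(
        sum(row[start:start + m]) <= n
        for row in mask
        for start in range(0, len(row), m)
    )

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

weights = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.6, 0.1, 0.3],
    [0.5, 0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3, 0.1],
]
mask = nm_mask(weights, n=2, m=4)
print(is_nm_sparse(mask, 2, 4))             # the row-wise 2:4 pattern holds
print(is_nm_sparse(transpose(mask), 2, 4))  # the transposed mask breaks it
```

Every row keeps its first two columns, so the transposed mask has rows that are entirely ones or entirely zeros. A transposable mask must satisfy the N:M constraint along both rows and columns at once.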

10. DLP: Dynamic Layerwise Pruning in Large Language Models

ArXiv ID: 2505.23807

Authors: Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang

Abstract: Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.

Comment: The paper proposes Dynamic Layerwise Pruning for LLMs, relevant to model compression.

Relevance: 9 Novelty: 8

11. Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution

ArXiv ID: 2505.24037

Authors: Qiao Xiao, Alan Ansell, Boqian Wu, Lu Yin, Mykola Pechenizkiy, Shiwei Liu, Decebal Constantin Mocanu

Abstract: Large language models (LLMs) have achieved remarkable success across various tasks but face deployment challenges due to their massive computational demands. While post-training pruning methods like SparseGPT and Wanda can effectively reduce the model size, they struggle to maintain model performance at high sparsity levels, limiting their utility for downstream tasks. Existing fine-tuning methods, such as full fine-tuning and LoRA, fail to preserve sparsity because they require updating the whole dense matrices, making them ill-suited for sparse LLMs. In this paper, we propose Sparsity Evolution Fine-Tuning (SEFT), a novel method designed specifically for sparse LLMs. SEFT dynamically evolves the sparse topology of pruned models during fine-tuning, while preserving the overall sparsity throughout the process. The strengths of SEFT lie in its ability to perform task-specific adaptation through a weight drop-and-grow strategy, enabling the pruned model to self-adapt its sparse connectivity pattern based on the target dataset. Furthermore, a sensitivity-driven pruning criterion is employed to ensure that the desired sparsity level is consistently maintained throughout fine-tuning. Our experiments on various LLMs, including LLaMA families, DeepSeek, and Mistral, across a diverse set of benchmarks demonstrate that SEFT achieves stronger performance while offering superior memory and time efficiency compared to existing baselines. Our code is publicly available at: https://github.com/QiaoXiao7282/SEFT.

Comment: The paper proposes a novel method, SEFT, for fine-tuning sparse LLMs, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

12. GradPower: Powering Gradients for Faster Language Model Pre-Training

ArXiv ID: 2505.24275

Authors: Mingze Wang, Jinbo Wang, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, Lei Wu

Abstract: We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=(\mathrm{sign}(g_i)|g_i|^p)_i$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.

Comment: The paper introduces GradPower, a gradient-transformation technique for accelerating language model pre-training, which is relevant to efficiency improvements in LLMs.

Relevance: 9 Novelty: 8
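
Illustration (not from the paper): the elementwise sign-power transformation from the abstract in plain Python. Plain SGD stands in for the base optimizer to keep the sketch short (the paper applies the transform to Adam and Muon); the function names, learning rate, and the value of p are hypothetical.

```python
import math

def grad_power(grad, p):
    """Elementwise sign-power transform: sign(g_i) * |g_i| ** p, for p > 0."""
    return [math.copysign(abs(g) ** p, g) for g in grad]

def sgd_step(params, grad, lr=0.1, p=1.2):
    """One SGD update; the only change from plain SGD is wrapping the gradient."""
    transformed = grad_power(grad, p)
    return [w - lr * g for w, g in zip(params, transformed)]

gradient = [-4.0, 0.0, 0.25]
print(grad_power(gradient, 0.5))  # [-2.0, 0.0, 0.5]
print(grad_power(gradient, 1.0))  # p = 1 leaves the gradient unchanged
print(sgd_step([1.0, 1.0, 1.0], gradient))
```

The transform preserves the sign of each component and only rescales its magnitude, which is why the base optimizer's own logic needs no modification.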

13. SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling

ArXiv ID: 2505.24179

Authors: Xiaodong Ji, Hailin Zhang, Fangcheng Fu, Bin Cui

Abstract: Many advanced Large Language Model (LLM) applications require long-context processing, but the self-attention module becomes a bottleneck during the prefilling stage of inference due to its quadratic time complexity with respect to sequence length. Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. However, these approaches typically perform coarse-grained inspection of the attention map, rendering considerable loss in model accuracy. In this paper, we propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLM with negligible loss in model accuracy. SALE achieves fast and accurate fine-grained attention weight estimation through 4-bit quantized query-key products, followed by block-sparse attention to accelerate prefilling computations. For importance evaluation for query-key pairs, we adopt our Relative Attention Score metric, which offers significantly higher efficiency within our framework. We implement a custom CUDA kernel optimized for our approach for hardware efficiency, reducing the additional overhead to approximately 11% of the full attention latency. Notably, SALE requires no parameter training and can be seamlessly integrated into existing systems with trivial code modifications. Experiments on long-context benchmarks demonstrate that our method outperforms existing approaches in accuracy-efficiency trade-offs, achieving at least 3.36x speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.

Comment: The paper proposes SALE, a sparse attention method for LLMs, focusing on efficiency improvements through quantization and sparse attention, aligning with the model compression criterion.

Relevance: 9 Novelty: 8

14. Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis

ArXiv ID: 2505.24593

Authors: Junzhuo Li, Bo Wang, Xiuze Zhou, Peijie Jiang, Jia Liu, Xuming Hu

Abstract: The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show MoE models achieve 37% higher per-layer efficiency via a "mid-activation, late-amplification" pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a "basic-refinement" framework: shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen 1.5-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.

Comment: The paper explores the interpretability of Mixture-of-Experts (MoE) models, providing insights into their architecture and efficiency, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 8

15. Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws

ArXiv ID: 2505.24009

Authors: Hidetaka Kamigaito, Ying Zhang, Jingun Kwon, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe

Abstract: Transformers deliver outstanding performance across a wide range of tasks and are now a dominant backbone architecture for large language models (LLMs). Their task-solving performance is improved by increasing parameter size, as shown in recent studies on parameter scaling laws. Although recent mechanistic-interpretability studies have deepened our understanding of the internal behavior of Transformers by analyzing their residual stream, the relationship between these internal mechanisms and the parameter scaling laws remains unclear. To bridge this gap, we focus on layers and their size, which mainly determine the parameter size of Transformers. For this purpose, we first theoretically investigate the layers within the residual stream through a bias-diversity decomposition. The decomposition separates (i) bias, the error of each layer's output from the ground truth, and (ii) diversity, which indicates how much the outputs of each layer differ from each other. Analyzing Transformers under this theory reveals that performance improves when individual layers make predictions close to the correct answer and remain mutually diverse. We show that diversity becomes especially critical when individual layers' outputs are far from the ground truth. Finally, we introduce an information-theoretic diversity and show our main finding: adding layers enhances performance only when those layers behave differently, i.e., are diverse. We also reveal that the performance gains from increasing the number of layers exhibit submodularity: marginal improvements diminish as more layers are added, mirroring the logarithmic convergence predicted by the parameter scaling laws. Experiments on multiple semantic-understanding tasks with various LLMs empirically confirm the theoretical properties derived in this study.

Comment: The paper investigates the diversity of Transformer layers and their impact on parameter scaling laws, providing insights into model architecture and theoretical understanding of Transformers.

Relevance: 9 Novelty: 8
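
Illustrative sketch (not the paper's code): the abstract above does not spell out its bias-diversity decomposition, so the snippet below uses the classic ambiguity decomposition for squared error as a stand-in. It assumes that per-layer scalar predictions are combined by a plain average, which may differ from how the paper treats the residual stream. The function name and the sample numbers are hypothetical.

```python
def ambiguity_decomposition(outputs, target):
    """Split the squared error of an averaged prediction into bias and diversity.

    outputs: list of scalar predictions, one per (hypothetical) layer.
    target:  scalar ground truth.
    Identity: ensemble_error == avg_bias - diversity.
    """
    n = len(outputs)
    mean_out = sum(outputs) / n
    # Error of the combined (averaged) prediction.
    ensemble_error = (mean_out - target) ** 2
    # Average error of the individual predictions ("bias" term).
    avg_bias = sum((o - target) ** 2 for o in outputs) / n
    # Spread of the individual predictions around their mean ("diversity" term).
    diversity = sum((o - mean_out) ** 2 for o in outputs) / n
    return ensemble_error, avg_bias, diversity


# Two toy "layers" that are both wrong, but in different ways.
err, bias, div = ambiguity_decomposition([1.0, 3.0], target=0.0)
```

Under these assumptions the combined error equals the average individual error minus the diversity term, which is one way to read the abstract's claim that diversity matters most when individual outputs are far from the ground truth.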

16. ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration

ArXiv ID: 2505.24357

Authors: Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, Xiaokang Yang

Abstract: Large language models (LLMs) have achieved remarkable performance, yet their capability on long-context reasoning is often constrained by the excessive memory required to store the Key-Value (KV) cache. This makes KV cache compression an essential step toward enabling efficient long-context reasoning. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers or suffer from significant performance degradation under high compression ratios. To address these challenges, we propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache. We develop distinct compression strategies for Keys and Values based on their different roles and varying importance in the attention mechanism. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters similar heads and applies grouped SVD to the key projection matrix, reducing additional computation while preserving accuracy. For Values, we propose Offline Calibration and Matrix Fusion (OCMF) to preserve accuracy without extra computational overhead. Experiments show that ReCalKV outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. Code is available at: https://github.com/XIANGLONGYAN/ReCalKV.

Comment: The paper focuses on KV cache compression, a relevant topic under model compression, introducing a novel method for low-rank compression with minimal performance loss.

Relevance: 9 Novelty: 8
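
Illustrative sketch (not the ReCalKV implementation): the abstract above says values are compressed "without extra computational overhead" through matrix fusion, but gives no details. The snippet below shows one common way such fusion can work, under the assumption that the value projection factors as W_v = A B and that the factor B is folded into the output projection offline. All matrices and names are hypothetical toy values, and the calibration step is not modeled.

```python
def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical low-rank factors of the value projection: W_v = A @ B (rank 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]   # d_model=4 x r=2
B = [[1.0, 2.0, 0.0, -1.0], [0.0, 1.0, 3.0, 1.0]]       # r=2 x d_v=4
W_v = matmul(A, B)

# Hypothetical output projection, token activations, and attention weights.
W_o = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0],
       [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
X = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [2.0, 0.0, 1.0, 1.0]]
P = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.25, 0.25, 0.5]]  # rows sum to 1

# Baseline: cache full values V = X W_v (3 tokens x 4 dims).
full = matmul(matmul(P, matmul(X, W_v)), W_o)

# Compressed: cache only the latent X A (3 tokens x 2 dims) and fold B
# into the output projection once, offline.
latent = matmul(X, A)
fused_W_o = matmul(B, W_o)
compressed = matmul(matmul(P, latent), fused_W_o)

max_diff = max(abs(f - c) for fr, cr in zip(full, compressed) for f, c in zip(fr, cr))
```

In this toy setting the cached tensor shrinks from 4 to 2 dimensions per token while the attention output is unchanged up to rounding, because W_v is exactly low-rank here. For real weights the factorization would only be approximate.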

17. A Mathematical Perspective On Contrastive Learning

ArXiv ID: 2505.24134

Authors: Ricardo Baptista, Andrew M. Stuart, Son Tran

Abstract: Multimodal contrastive learning is a methodology for linking different data modalities; the canonical example is linking image and text data. The methodology is typically framed as the identification of a set of encoders, one for each modality, that align representations within a common latent space. In this work, we focus on the bimodal setting and interpret contrastive learning as the optimization of (parameterized) encoders that define conditional probability distributions, for each modality conditioned on the other, consistent with the available data. This provides a framework for multimodal algorithms such as crossmodal retrieval, which identifies the mode of one of these conditional distributions, and crossmodal classification, which is similar to retrieval but includes a fine-tuning step to make it task specific. The framework we adopt also gives rise to crossmodal generative models. This probabilistic perspective suggests two natural generalizations of contrastive learning: the introduction of novel probabilistic loss functions, and the use of alternative metrics for measuring alignment in the common latent space. We study these generalizations of the classical approach in the multivariate Gaussian setting. In this context we view the latent space identification as a low-rank matrix approximation problem. This allows us to characterize the capabilities of loss functions and alignment metrics to approximate natural statistics, such as conditional means and covariances; doing so yields novel variants on contrastive learning algorithms for specific mode-seeking and for generative tasks. The framework we introduce is also studied through numerical experiments on multivariate Gaussians, the labeled MNIST dataset, and on a data assimilation application arising in oceanography.

Comment: The paper provides a mathematical perspective on contrastive learning, focusing on representation learning and introducing novel probabilistic loss functions and metrics.

Relevance: 9 Novelty: 8
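
Illustrative sketch (not the paper's code): the abstract above reads contrastive learning as fitting conditional distributions of each modality given the other, with retrieval picking the mode of such a conditional. The snippet below assumes the standard softmax form over a precomputed similarity matrix and a symmetric negative log-likelihood loss; the function names and the toy similarity matrices are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def conditionals(sim):
    """sim[i][j] scores item i of modality U against item j of modality V.

    Returns (p_v_given_u, p_u_given_v); each inner list sums to 1.
    """
    p_v_given_u = [softmax(row) for row in sim]
    p_u_given_v = [softmax(list(col)) for col in zip(*sim)]
    return p_v_given_u, p_u_given_v


def contrastive_loss(sim):
    """Symmetric negative log-likelihood of the matched pairs (i, i)."""
    p_vu, p_uv = conditionals(sim)
    n = len(sim)
    loss_u = -sum(math.log(p_vu[i][i]) for i in range(n)) / n
    loss_v = -sum(math.log(p_uv[i][i]) for i in range(n)) / n
    return 0.5 * (loss_u + loss_v)


def retrieve(sim, i):
    """Crossmodal retrieval: mode of p(v | u_i)."""
    row = conditionals(sim)[0][i]
    return max(range(len(row)), key=row.__getitem__)


# Toy similarities: aligned encoders vs. misaligned encoders.
sim_good = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
sim_bad = [[0.0, 0.0, 4.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
```

With these toy inputs, `retrieve(sim_good, 1)` returns the matched index, and `contrastive_loss` is lower for the aligned matrix than for the misaligned one.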

18. Disentangling Granularity: An Implicit Inductive Bias in Factorized VAEs

ArXiv ID: 2505.24684

Authors: Zihao Chen, Yu Xiang, Wenyong Wang

Abstract: Despite the success in learning semantically meaningful, unsupervised disentangled representations, variational autoencoders (VAEs) and their variants face a fundamental theoretical challenge: substantial evidence indicates that unsupervised disentanglement is unattainable without implicit inductive bias, yet such bias remains elusive. In this work, we focus on exploring the implicit inductive bias that drives disentanglement in VAEs with factorization priors. By analyzing the total correlation in β-TCVAE, we uncover a crucial implicit inductive bias called disentangling granularity, which leads to the discovery of an interesting "V"-shaped optimal Evidence Lower Bound (ELBO) trajectory within the parameter space. This finding is validated through over 100K experiments using factorized VAEs and our newly proposed model, β-STCVAE. Notably, experimental results reveal that conventional factorized VAEs, constrained by fixed disentangling granularity, inherently tend to disentangle low-complexity features, whereas appropriately tuning disentangling granularity, as enabled by β-STCVAE, broadens the range of disentangled representations, allowing for the disentanglement of high-complexity features. Our findings reveal that disentangling granularity, as an implicit inductive bias in factorized VAEs, influences both disentanglement performance and the inference of the ELBO, offering fresh insights into the interpretability and inherent biases of VAEs.

Comment: The paper explores implicit inductive bias in factorized VAEs, contributing to representation learning by uncovering disentangling granularity as a bias influencing disentanglement performance.

Relevance: 9 Novelty: 8

19. R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

ArXiv ID: 2505.24133

Authors: Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu

Abstract: Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.

Comment: The paper proposes a novel KV cache compression method for reasoning models, relevant to model compression.

Relevance: 8 Novelty: 8

20. From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching

ArXiv ID: 2505.24843

Authors: Ruqi Bai, Yao Ji, Zeyu Zhou, David I. Inouye

Abstract: Spurious correlations can cause model performance to degrade in new environments. Prior causality-inspired works aim to learn invariant representations (e.g., IRM) but typically underperform empirical risk minimization (ERM). Recent alternatives improve robustness by leveraging test-time data, but such data may be unavailable in practice. To address these issues, we take a data-centric approach by leveraging invariant data pairs, pairs of samples that would have the same prediction with the optimally robust classifier. We prove that certain counterfactual pairs will naturally satisfy this invariance property and introduce noisy counterfactual matching (NCM), a simple constraint-based method for leveraging invariant pairs for enhanced robustness, even with a small set of noisy pairs; in the ideal case, each pair can eliminate one spurious feature. For linear causal models, we prove that the test domain error can be upper bounded by the in-domain error and a term that depends on the counterfactuals' diversity and quality. We validate on a synthetic dataset and demonstrate on real-world benchmarks that linear probing on a pretrained backbone improves robustness.

Comment: The paper introduces a method for robustness to spurious correlations using invariant data pairs, which relates to representation learning by focusing on invariant representations.

Relevance: 8 Novelty: 8

21. Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization

ArXiv ID: 2505.23866

Authors: Chengli Tan, Yubo Zhou, Haishan Ye, Guang Dai, Junmin Liu, Zengjie Song, Jiangshe Zhang, Zixiang Zhao, Yunda Hao, Yong Xu

Abstract: Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, we show that, unlike standard training such as stochastic gradient descent, the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined as CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches.

Comment: The paper discusses Sharpness-Aware Minimization (SAM) and its variant CSAM, which are relevant to representation learning as they provide insights into training dynamics and model calibration.