Previous Day 2025-08-04
Monthly Overview 2025-08
Next Day 2025-08-06

Personalized Daily ArXiv Papers 2025-08-05

[gpt-4o] Prompt Completion Total
Token 57367 7731 65098
Cost $0.14 $0.08 $0.22

Total arXiv papers: 1006

Total scanned papers: 654

Total relevant papers: 40

Table of contents with paper titles:

  1. EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models Authors: Yuanteng Chen, Yuantian Shao, Peisong Wang, Jian Cheng

  2. Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Authors: Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, Gongyi Wang

  3. CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation Authors: Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang

  4. Kronecker-LoRA: hybrid Kronecker-LoRA adapters for scalable, sustainable fine-tuning Authors: Yixin Shen

  5. Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules Authors: Yilun Liu, Yunpu Ma, Yuetian Lu, Shuo Chen, Zifeng Ding, Volker Tresp

  6. LOST: Low-rank and Sparse Pre-training for Large Language Models Authors: Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang

  7. Noosemia: toward a Cognitive and Phenomenological Account of Intentionality Attribution in Human-Generative AI Interaction Authors: Enrico De Santis, Antonello Rizzi

  8. LeanK: Learnable K Cache Channel Pruning for Efficient Decoding Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu

  9. Trainable Dynamic Mask Sparse Attention Authors: Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo

  10. From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model Authors: Yeong-Joon Ju, Seong-Whan Lee

  11. CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis Authors: Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

  12. The Geometry of Machine Learning Models Authors: Pawel Gajer, Jacques Ravel

  13. Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning Authors: En Yu, Jie Lu, Kun Wang, Xiaoyu Yang, Guangquan Zhang

  14. Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact Authors: Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

  15. Flexible Automatic Identification and Removal (FAIR)-Pruner: An Efficient Neural Network Pruning Method Authors: Chenqing Lin, Mostafa Hussien, Chengyao Yu, Mohamed Cheriet, Osama Abdelrahman, Ruixing Ming

  16. CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge Authors: Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan

  17. Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation Authors: Hongze Sun, Wuque Cai, Duo Chen, Shifeng Mao, Jiayi He, Zhenxing Wang, Dezhong Yao, Daqing Guo

  18. Trustworthy scientific inference for inverse problems with generative models Authors: James Carzon, Luca Masserano, Joshua D. Ingram, Alex Shen, Antonio Carlos Herling Ribeiro Junior, Tommaso Dorigo, Michele Doro, Joshua S. Speagle, Rafael Izbicki, Ann B. Lee

  19. CellForge: Agentic Design of Virtual Cell Models Authors: Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein

  20. Superior resilience to poisoning and amenability to unlearning in quantum machine learning Authors: Yu-Qin Chen, Shi-Xin Zhang

  21. Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis Authors: Anzhe Cheng, Chenzhong Yin, Mingxi Cheng, Shukai Duan, Shahin Nazarian, Paul Bogdan

  22. What are you sinking? A geometric approach on attention sink Authors: Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

  23. LetheViT: Selective Machine Unlearning for Vision Transformers via Attention-Guided Contrastive Learning Authors: Yujia Tong, Tian Zhang, Jingling Yuan, Yuze Wang, Chuang Hu

  24. Compression-Induced Communication-Efficient Large Model Training and Inferencing Authors: Sudip K. Seal, Maksudul Alam, Jorge Ramirez, Sajal Dash, Hao Lu

  25. Uncertainty Quantification for Large-Scale Deep Networks via Post-StoNet Modeling Authors: Yan Sun, Faming Liang

  26. Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models Authors: Yijun Feng

  27. HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens Authors: Ivan Karpukhin, Andrey Savchenko

  28. DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging Authors: Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma

  29. MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification Authors: Ming Pok Ng, Junqi Jiang, Gabriel Freedman, Antonio Rago, Francesca Toni

  30. Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler Authors: Aleksandr Dremov, Alexander H\"agele, Atli Kosson, Martin Jaggi

  31. Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling Authors: Seyyed Saeid Cheshmi, Azal Ahmad Khan, Xinran Wang, Zirui Liu, Ali Anwar

  32. Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models Authors: Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish

  33. How Does Controllability Emerge In Language Models During Pretraining? Authors: Jianshu She, Xinyue Li, Eric Xing, Zhengzhong Liu, Qirong Ho

  34. Adaptive Riemannian Graph Neural Networks Authors: Xudong Wang, Tongxin Li, Chris Ding, Jicong Fan

  35. FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models Authors: Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai Li

  36. Effects of Feature Correlations on Associative Memory Capacity Authors: Stefan Bielmeier, Gerald Friedland

  37. Graph Embedding in the Graph Fractional Fourier Transform Domain Authors: Changjie Sheng, Zhichao Zhang, Wei Yao

  38. ProCut: LLM Prompt Compression via Attribution Estimation Authors: Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang

  39. Expressive Power of Graph Transformers via Logic Authors: Veeti Ahvonen, Maurice Funk, Damian Heiman, Antti Kuusisto, Carsten Lutz

  40. Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations Authors: Dahee Kwon, Sehyun Lee, Jaesik Choi


1. EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models

ArXiv ID: 2508.01625

Authors: Yuanteng Chen, Yuantian Shao, Peisong Wang, Jian Cheng

Abstract: Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) low activated parameters cannot be equivalently translated into inference acceleration effects. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are not crucial for the corresponding tasks, yet causing inference latency. Therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning less frequently used experts for current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation.

Comment: The paper proposes EAC-MoE, a method for compressing Mixture-of-Experts models using quantization and pruning, which aligns with the model compression and MoE criteria.

Relevance: 9 Novelty: 8


2. Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models

ArXiv ID: 2508.02128

Authors: Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, Gongyi Wang

Abstract: In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further enhance generality and efficiency, we propose Outstanding-sparse, a unified framework that integrates Amber Pruner with post-training W8A8 quantization. Our approach preserves strong performance across a range of downstream tasks, with notable advantages in generative tasks. This work pioneers a new frontier in activation sparsity, providing foundational insights that are poised to guide the co-evolution of algorithms and architectures in the design of next-generation AI systems.

Comment: The paper introduces Amber Pruner, a training-free N:M activation sparsity method for LLMs, which aligns with the model compression criterion focusing on sparsity and efficiency breakthroughs.

Relevance: 9 Novelty: 8


3. CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation

ArXiv ID: 2508.02401

Authors: Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang

Abstract: Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This method ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrades the performance of LLMs. To address the issue above, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in the previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic context. Afterwards, we exploit such heads to determine the important tokens and retain their corresponding KV cache pairs. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer-adaptive KV cache allocation strategy. Experimental results demonstrate the proposed CompressKV consistently outperforms state-of-the-art approaches under various memory budgets on LongBench and Needle-in-a-Haystack benchmarks. Our code is publicly available at: https://github.com/TUDa-HWAI/CompressKV.git.

Comment: The paper presents CompressKV, a method for KV cache compression in LLMs, aligning with the model compression criterion focusing on efficiency breakthroughs.

Relevance: 9 Novelty: 8


4. Kronecker-LoRA: hybrid Kronecker-LoRA adapters for scalable, sustainable fine-tuning

ArXiv ID: 2508.01961

Authors: Yixin Shen

Abstract: Fine-tuning massive pre-trained language models across many tasks demands adapters that are both parameter-efficient and highly expressive. We introduce \textbf{Kron-LoRA}, a two-stage adapter that first factorizes each frozen linear update as a Kronecker product [ \Delta W = A \otimes B ] and then compresses [ B \in \mathbb{R}^{d_{B2}\times d_{B1}} ] via an (r)-rank LoRA decomposition (B \approx B_{1}B_{2}). By leveraging [ \mathrm{rank}(A \otimes B) \;=\; \mathrm{rank}(A)\,\mathrm{rank}(B), ] Kron-LoRA retains the expressivity of the update while using up to $4!\times!$ fewer parameters than a standard rank-8 LoRA adapter. Its compact adapter matrices also quantize to 8- or 4-bit with less accuracy degradation than LoRA, enabling further memory and storage savings for on-device deployment. We benchmark on DistilBERT and Mistral-7B across five tasks (PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge) over multiple epochs of adapter-only tuning: on DistilBERT, an 840 K-parameter Kron-LoRA matches LoRA-16's performance, and on Mistral-7B, a 5.7 M-parameter Kron-LoRA rivals LoRA-8 with modest memory savings and only a 3-8\% speed overhead. In sequential fine-tuning from ARC-Challenge to ARC-Easy, Kron-LoRA retains 55.18\% accuracy versus 53.17\% for LoRA-8-despite using only one-quarter of the adapter parameters-underscoring its competitive cross-task transfer performance. By uniting Kronecker structure, low-rank compression, quantization-friendliness, and by providing transparent trade-off analysis, Kron-LoRA offers a scalable, sustainable, and continual-learning-ready solution for multi-task adaptation of large language models.

Comment: The paper introduces Kron-LoRA, a novel approach combining Kronecker product and low-rank decomposition for efficient fine-tuning of large language models, which is relevant to model compression.

Relevance: 9 Novelty: 8


5. Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules

ArXiv ID: 2508.02587

Authors: Yilun Liu, Yunpu Ma, Yuetian Lu, Shuo Chen, Zifeng Ding, Volker Tresp

Abstract: Mixture-of-Experts (MoE) benefits from a dynamic routing mechanism among their specialized experts, which existing Parameter- Efficient Fine-Tuning (PEFT) strategies fail to leverage. This motivates us to investigate whether adaptation modules themselves should incorporate routing mechanisms to align with MoE's multi-expert architecture. We analyze dynamics of core components when applying PEFT to MoE language models and examine how different routing strategies affect adaptation effectiveness. Extensive experiments adapting OLMoE-1B-7B and Mixtral-8x7B on various commonsense and math reasoning tasks validate the performance and efficiency of our routed approach. We identify the optimal configurations for different scenarios and provide empirical analyses with practical insights to facilitate better PEFT and MoE applications.

Comment: The paper explores parameter-efficient fine-tuning for Mixture-of-Experts models, which is relevant to both model architecture and efficiency.

Relevance: 9 Novelty: 8


6. LOST: Low-rank and Sparse Pre-training for Large Language Models

ArXiv ID: 2508.02668

Authors: Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang

Abstract: While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in undesirable performance degradation compared to full-rank training. In this paper, we propose \textbf{LO}w-rank and \textbf{S}parse pre-\textbf{T}raining (\textbf{LOST}) for LLMs, a novel method that ingeniously integrates low-rank and sparse structures to enable effective training of LLMs from scratch under strict efficiency constraints. LOST applies singular value decomposition to weight matrices, preserving the dominant low-rank components, while allocating the remaining singular values to construct channel-wise sparse components to complement the expressiveness of low-rank training. We evaluate LOST on LLM pretraining ranging from 60M to 7B parameters. Our experiments show that LOST achieves competitive or superior performance compared to full-rank models, while significantly reducing both memory and compute overhead. Moreover, Code is available at \href{https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models}{LOST Repo}

Comment: The paper proposes LOST, a method integrating low-rank and sparse structures for efficient pre-training of large language models, which is relevant to model compression.

Relevance: 9 Novelty: 8


7. Noosemia: toward a Cognitive and Phenomenological Account of Intentionality Attribution in Human-Generative AI Interaction

ArXiv ID: 2508.02622

Authors: Enrico De Santis, Antonello Rizzi

Abstract: This paper introduces and formalizes Noosemia, a novel cognitive-phenomenological phenomenon emerging from human interaction with generative AI systems, particularly those enabling dialogic or multimodal exchanges. We propose a multidisciplinary framework to explain how, under certain conditions, users attribute intentionality, agency, and even interiority to these systems - a process grounded not in physical resemblance, but in linguistic performance, epistemic opacity, and emergent technological complexity. By linking an LLM declination of meaning holism to our technical notion of the LLM Contextual Cognitive Field, we clarify how LLMs construct meaning relationally and how coherence and a simulacrum of agency arise at the human-AI interface. The analysis situates noosemia alongside pareidolia, animism, the intentional stance and the uncanny valley, distinguishing its unique characteristics. We also introduce a-noosemia to describe the phenomenological withdrawal of such projections. The paper concludes with reflections on the broader philosophical, epistemological, and social implications of noosemic dynamics and directions for future research.

Comment: The paper introduces a novel cognitive-phenomenological phenomenon related to LLMs, which could provide theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8


8. LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

ArXiv ID: 2508.02215

Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu

Abstract: Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns channel-wise static mask that could satisfy specific sparsity ratio and hardware alignment requirement. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. Custom decoding kernel enables 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.

Comment: LeanK proposes a learning-based method for pruning key cache channels in LLMs, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8


9. Trainable Dynamic Mask Sparse Attention

ArXiv ID: 2508.02124

Authors: Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo

Abstract: In large language models, the demand for modeling long contexts is constantly increasing, but the quadratic complexity of the standard self-attention mechanism often becomes a bottleneck. Although existing sparse attention mechanisms have improved efficiency, they may still encounter issues such as static patterns or information loss. We introduce a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention, which effectively utilizes content-aware and position-aware sparsity. DMA achieves this through two key innovations: First, it dynamically generates content-aware sparse masks from value representations, enabling the model to identify and focus on critical information adaptively. Second, it implements position-aware sparse attention computation that effectively skips unnecessary calculation regions. This dual-sparsity design allows the model to significantly reduce the computational complexity of important information while retaining complete information, achieving an excellent balance between information fidelity and computational efficiency. We have verified the performance of DMA through comprehensive experiments. Comparative studies show that DMA outperforms multi-head attention, sliding window attention, multi-head latent attention, and native sparse attention in terms of perplexity under Chinchilla Scaling Law settings. Moreover, in challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B parameter model, DMA significantly outperforms multi-head attention in both standard benchmark performance and the challenging needle-in-a-haystack task. These experimental results highlight its capability to balance model efficiency and long-context modeling ability effectively.

Comment: The paper introduces a trainable dynamic mask sparse attention mechanism, relevant to model architecture and efficiency improvements.

Relevance: 9 Novelty: 8


10. From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

ArXiv ID: 2508.00955

Authors: Yeong-Joon Ju, Seong-Whan Lee

Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic, instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model's own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process by lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. We further boost the performance via our self-aware hard negative sampling, achieving the state-of-the-art performance without the contrative pre-training. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.

Comment: The paper proposes a framework for adapting multimodal LLMs for universal embedding tasks, which aligns with foundational research in representation learning and LLMs.

Relevance: 9 Novelty: 8


11. CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

ArXiv ID: 2508.02322

Authors: Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

Comment: The paper introduces CAMERA, a framework for MoE model compression, which is relevant to model compression and MoE architectures.

Relevance: 9 Novelty: 8


12. The Geometry of Machine Learning Models

ArXiv ID: 2508.02080

Authors: Pawel Gajer, Jacques Ravel

Abstract: This paper presents a mathematical framework for analyzing machine learning models through the geometry of their induced partitions. By representing partitions as Riemannian simplicial complexes, we capture not only adjacency relationships but also geometric properties including cell volumes, volumes of faces where cells meet, and dihedral angles between adjacent cells. For neural networks, we introduce a differential forms approach that tracks geometric structure through layers via pullback operations, making computations tractable by focusing on data-containing cells. The framework enables geometric regularization that directly penalizes problematic spatial configurations and provides new tools for model refinement through extended Laplacians and simplicial splines. We also explore how data distribution induces effective geometric curvature in model partitions, developing discrete curvature measures for vertices that quantify local geometric complexity and statistical Ricci curvature for edges that captures pairwise relationships between cells. While focused on mathematical foundations, this geometric perspective offers new approaches to model interpretation, regularization, and diagnostic tools for understanding learning dynamics.

Comment: The paper presents a mathematical framework for analyzing machine learning models through geometry, offering foundational insights into model interpretation and regularization.

Relevance: 9 Novelty: 8


13. Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning

ArXiv ID: 2508.01598

Authors: En Yu, Jie Lu, Kun Wang, Xiaoyu Yang, Guangquan Zhang

Abstract: Learning from multiple data streams in real-world scenarios is fundamentally challenging due to intrinsic heterogeneity and unpredictable concept drifts. Existing methods typically assume homogeneous streams and employ static architectures with indiscriminate knowledge fusion, limiting generalizability in complex dynamic environments. To tackle this gap, we propose CAMEL, a dynamic \textbf{C}ollaborative \textbf{A}ssistance \textbf{M}ixture of \textbf{E}xperts \textbf{L}earning framework. It addresses heterogeneity by assigning each stream an independent system with a dedicated feature extractor and task-specific head. Meanwhile, a dynamic pool of specialized private experts captures stream-specific idiosyncratic patterns. Crucially, collaboration across these heterogeneous streams is enabled by a dedicated assistance expert. This expert employs a multi-head attention mechanism to distill and integrate relevant context autonomously from all other concurrent streams. It facilitates targeted knowledge transfer while inherently mitigating negative transfer from irrelevant sources. Furthermore, we propose an Autonomous Expert Tuner (AET) strategy, which dynamically manages expert lifecycles in response to drift. It instantiates new experts for emerging concepts (freezing prior ones to prevent catastrophic forgetting) and prunes obsolete ones. This expert-level plasticity provides a robust and efficient mechanism for online model capacity adaptation. Extensive experiments demonstrate CAMEL's superior generalizability across diverse multistreams and exceptional resilience against complex concept drifts.

Comment: The paper introduces a dynamic mixture of experts framework for multistream learning, which aligns with the interest in mixture-of-experts architectures.

Relevance: 9 Novelty: 8


14. Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact

ArXiv ID: 2508.00903

Authors: Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

Abstract: We investigate the phenomenon of neuron universality in independently trained GPT-2 Small models, examining how these universal neurons-neurons with consistently correlated activations across models-emerge and evolve throughout training. By analyzing five GPT-2 models at three checkpoints (100k, 200k, 300k steps), we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Ablation experiments reveal significant functional impacts of universal neurons on model predictions, measured via loss and KL divergence. Additionally, we quantify neuron persistence, demonstrating high stability of universal neurons across training checkpoints, particularly in deeper layers. These findings suggest stable and universal representational structures emerge during neural network training.

Comment: The paper investigates universal neurons in GPT-2 models, providing insights into representation learning and neural network training dynamics.

Relevance: 9 Novelty: 7


15. Flexible Automatic Identification and Removal (FAIR)-Pruner: An Efficient Neural Network Pruning Method

ArXiv ID: 2508.02291

Authors: Chenqing Lin, Mostafa Hussien, Chengyao Yu, Mohamed Cheriet, Osama Abdelrahman, Ruixing Ming

Abstract: Neural network pruning is a critical compression technique that facilitates the deployment of large-scale neural networks on resource-constrained edge devices, typically by identifying and eliminating redundant or insignificant parameters to reduce computational and memory overhead. This paper proposes the Flexible Automatic Identification and Removal (FAIR)-Pruner, a novel method for neural network structured pruning. Specifically, FAIR-Pruner first evaluates the importance of each unit (e.g., neuron or channel) through the Utilization Score quantified by the Wasserstein distance. To reflect the performance degradation after unit removal, it then introduces the Reconstruction Error, which is computed via the Taylor expansion of the loss function. Finally, FAIR-Pruner identifies superfluous units with negligible impact on model performance by controlling the proposed Tolerance of Difference, which measures differences between unimportant units and those that cause performance degradation. A major advantage of FAIR-Pruner lies in its capacity to automatically determine the layer-wise pruning rates, which yields a more efficient subnetwork structure compared to applying a uniform pruning rate. Another advantage of the FAIR-Pruner is its great one-shot performance without post-pruning fine-tuning. Furthermore, with utilization scores and reconstruction errors, users can flexibly obtain pruned models under different pruning ratios. Comprehensive experimental validation on diverse benchmark datasets (e.g., ImageNet) and various neural network architectures (e.g., VGG) demonstrates that FAIR-Pruner achieves significant model compression while maintaining high accuracy.

Comment: FAIR-Pruner introduces a novel method for neural network pruning, focusing on structured pruning and automatic determination of layer-wise pruning rates, relevant to model compression.

Relevance: 9 Novelty: 7


16. CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge

ArXiv ID: 2508.02583

Authors: Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan

Abstract: Large Language Models (LLMs) have demonstrated strong performance across a wide range of tasks, yet they still struggle with complex mathematical reasoning, a challenge fundamentally rooted in deep structural dependencies. To address this challenge, we propose \textbf{CA}usal \textbf{MA}thematician (\textbf{CAMA}), a two-stage causal framework that equips LLMs with explicit, reusable mathematical structure. In the learning stage, CAMA first constructs the \textbf{M}athematical \textbf{C}ausal \textbf{G}raph (\textbf{MCG}), a high-level representation of solution strategies, by combining LLM priors with causal discovery algorithms applied to a corpus of question-solution pairs. The resulting MCG encodes essential knowledge points and their causal dependencies. To better align the graph with downstream reasoning tasks, CAMA further refines the MCG through iterative feedback derived from a selected subset of the question-solution pairs. In the reasoning stage, given a new question, CAMA dynamically extracts a task-relevant subgraph from the MCG, conditioned on both the question content and the LLM's intermediate reasoning trace. This subgraph, which encodes the most pertinent knowledge points and their causal dependencies, is then injected back into the LLM to guide its reasoning process. Empirical results on real-world datasets show that CAMA significantly improves LLM performance on challenging mathematical problems. Furthermore, our experiments demonstrate that structured guidance consistently outperforms unstructured alternatives, and that incorporating asymmetric causal relationships yields greater improvements than using symmetric associations alone.

Comment: The paper focuses on enhancing mathematical reasoning in LLMs using a causal framework, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 7


17. Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation

ArXiv ID: 2508.01992

Authors: Hongze Sun, Wuque Cai, Duo Chen, Shifeng Mao, Jiayi He, Zhenxing Wang, Dezhong Yao, Daqing Guo

Abstract: As a foundational architecture of artificial intelligence models, Transformer has been recently adapted to spiking neural networks with promising performance across various tasks. However, existing spiking Transformer (ST)-based models require a substantial number of parameters and incur high computational costs, thus limiting their deployment in resource-constrained environments. To address these challenges, we propose combining synapse pruning with a synergistic learning-based compensation strategy to derive lightweight ST-based models. Specifically, two types of tailored pruning strategies are introduced to reduce redundancy in the weight matrices of ST blocks: an unstructured $\mathrm{L_{1}P}$ method to induce sparse representations, and a structured DSP method to induce low-rank representations. In addition, we propose an enhanced spiking neuron model, termed the synergistic leaky integrate-and-fire (sLIF) neuron, to effectively compensate for model pruning through synergistic learning between synaptic and intrinsic plasticity mechanisms. Extensive experiments on benchmark datasets demonstrate that the proposed methods significantly reduce model size and computational overhead while maintaining competitive performance. These results validate the effectiveness of the proposed pruning and compensation strategies in constructing efficient and high-performing ST-based models.

Comment: The paper proposes synapse pruning and synergistic learning for efficient spiking Transformers, contributing to model compression and efficiency.

Relevance: 9 Novelty: 7


18. Trustworthy scientific inference for inverse problems with generative models

ArXiv ID: 2508.02602

Authors: James Carzon, Luca Masserano, Joshua D. Ingram, Alex Shen, Antonio Carlos Herling Ribeiro Junior, Tommaso Dorigo, Michele Doro, Joshua S. Speagle, Rafael Izbicki, Ann B. Lee

Abstract: Generative artificial intelligence (AI) excels at producing complex data structures (text, images, videos) by learning patterns from training examples. Across scientific disciplines, researchers are now applying generative models to ``inverse problems'' to infer hidden parameters from observed data. While these methods can handle intractable models and large-scale studies, they can also produce biased or overconfident conclusions. We present a solution with Frequentist-Bayes (FreB), a mathematically rigorous protocol that reshapes AI-generated probability distributions into confidence regions that consistently include true parameters with the expected probability, while achieving minimum size when training and target data align. We demonstrate FreB's effectiveness by tackling diverse case studies in the physical sciences: identifying unknown sources under dataset shift, reconciling competing theoretical models, and mitigating selection bias and systematics in observational studies. By providing validity guarantees with interpretable diagnostics, FreB enables trustworthy scientific inference across fields where direct likelihood evaluation remains impossible or prohibitively expensive.

Comment: The paper introduces FreB, a protocol for trustworthy scientific inference with generative models, aligning with AI for Science and emerging trends in foundational research.

Relevance: 8 Novelty: 8


19. CellForge: Agentic Design of Virtual Cell Models

ArXiv ID: 2508.02276

Authors: Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein

Abstract: Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations quantitatively. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework that transforms presented biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and inference. The framework integrates three core modules: Task Analysis for presented dataset characterization and relevant literature retrieval, Method Design, where specialized agents collaboratively develop optimized modeling strategies, and Experiment Execution for automated generation of code. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and have to collaboratively exchange solutions until they achieve a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.

Comment: The paper introduces CellForge, a multi-agent framework for virtual cell modeling, aligning with AI for Science and emerging trends in foundational research.

Relevance: 8 Novelty: 8


20. Superior resilience to poisoning and amenability to unlearning in quantum machine learning

ArXiv ID: 2508.02422

Authors: Yu-Qin Chen, Shi-Xin Zhang

Abstract: The reliability of artificial intelligence hinges on the integrity of its training data, a foundation often compromised by noise and corruption. Here, through a comparative study of classical and quantum neural networks on both classical and quantum data, we reveal a fundamental difference in their response to data corruption. We find that classical models exhibit brittle memorization, leading to a failure in generalization. In contrast, quantum models demonstrate remarkable resilience, which is underscored by a phase transition-like response to increasing label noise, revealing a critical point beyond which the model's performance changes qualitatively. We further establish and investigate the field of quantum machine unlearning, the process of efficiently forcing a trained model to forget corrupting influences. We show that the brittle nature of the classical model forms rigid, stubborn memories of erroneous data, making efficient unlearning challenging, while the quantum model is significantly more amenable to efficient forgetting with approximate unlearning methods. Our findings establish that quantum machine learning can possess a dual advantage of intrinsic resilience and efficient adaptability, providing a promising paradigm for the trustworthy and robust artificial intelligence of the future.

Comment: The paper explores quantum machine learning's resilience and unlearning capabilities, which is an emerging trend in AI with potential foundational implications.

Relevance: 8 Novelty: 8


21. Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis

ArXiv ID: 2508.01219

Authors: Anzhe Cheng, Chenzhong Yin, Mingxi Cheng, Shukai Duan, Shahin Nazarian, Paul Bogdan

Abstract: The remarkable success of Deep Neural Networks(DNN) is driven by gradient-based optimization, yet this process is often undermined by its tendency to produce disordered weight structures, which harms feature clarity and degrades learning dynamics. To address this fundamental representational flaw, we introduced the Eigen Neural Network (ENN), a novel architecture that reparameterizes each layer's weights in a layer-shared, learned orthonormal eigenbasis. This design enforces decorrelated, well-aligned weight dynamics axiomatically, rather than through regularization, leading to more structured and discriminative feature representations. When integrated with standard BP, ENN consistently outperforms state-of-the-art methods on large-scale image classification benchmarks, including ImageNet, and its superior representations generalize to set a new benchmark in cross-modal image-text retrieval. Furthermore, ENN's principled structure enables a highly efficient, backpropagation-free(BP-free) local learning variant, ENN-$\ell$. This variant not only resolves BP's procedural bottlenecks to achieve over 2$\times$ training speedup via parallelism, but also, remarkably, surpasses the accuracy of end-to-end backpropagation. ENN thus presents a new architectural paradigm that directly remedies the representational deficiencies of BP, leading to enhanced performance and enabling a more efficient, parallelizable training regime.

Comment: The paper introduces the Eigen Neural Network, a novel architecture that reparameterizes weights in a learned orthonormal eigenbasis, which aligns with the interest in model architecture innovations.

Relevance: 8 Novelty: 8


22. What are you sinking? A geometric approach on attention sink

ArXiv ID: 2508.02546

Authors: Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

Abstract: Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact, but it is the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types, centralized, distributed, and bidirectional, that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for both architecture design and the relationship with AS.

Comment: The paper provides a geometric analysis of attention mechanisms in transformers, offering insights into architectural components and their effects, aligning with the model architecture criterion.

Relevance: 8 Novelty: 7


23. LetheViT: Selective Machine Unlearning for Vision Transformers via Attention-Guided Contrastive Learning

ArXiv ID: 2508.01569

Authors: Yujia Tong, Tian Zhang, Jingling Yuan, Yuze Wang, Chuang Hu

Abstract: Vision Transformers (ViTs) have revolutionized computer vision tasks with their exceptional performance. However, the introduction of privacy regulations such as GDPR and CCPA has brought new challenges to them. These laws grant users the right to withdraw their data, necessitating not only the deletion of data but also the complete removal of its influence from trained models. Machine unlearning emerges as a critical solution, with exact unlearning being computationally prohibitive and approximate methods offering a more practical approach. This work addresses the particularly challenging scenario of random data forgetting in ViTs, where the model must forget specific samples while retaining others, even within the same class. We first reveal the core characteristics of ViTs through selective masking experiments: when high-attention areas are masked, the model retains its recognition capability but significantly weakens its memorization ability. Based on the above insights, we propose LetheViT, a contrastive unlearning method tailored for ViTs. LetheViT uses masked image inputs to generate positive logits and original image inputs to generate negative logits, guiding the model to forget specific details while retaining the general cl category outlines. Experimental results demonstrate that LetheViT achieves state-of-the-art performance, effectively balancing privacy compliance with model efficacy.

Comment: The paper proposes LetheViT, a method for machine unlearning in vision transformers, which aligns with model architecture and representation learning criteria.

Relevance: 8 Novelty: 7


24. Compression-Induced Communication-Efficient Large Model Training and Inferencing

ArXiv ID: 2508.00960

Authors: Sudip K. Seal, Maksudul Alam, Jorge Ramirez, Sajal Dash, Hao Lu

Abstract: Energy efficiency of training and inferencing with large neural network models is a critical challenge facing the future of sustainable large-scale machine learning workloads. This paper introduces an alternative strategy, called phantom parallelism, to minimize the net energy consumption of traditional tensor (model) parallelism, the most energy-inefficient component of large neural network training. The approach is presented in the context of feed-forward network architectures as a preliminary, but comprehensive, proof-of-principle study of the proposed methodology. We derive new forward and backward propagation operators for phantom parallelism, implement them as custom autograd operations within an end-to-end phantom parallel training pipeline and compare its parallel performance and energy-efficiency against those of conventional tensor parallel training pipelines. Formal analyses that predict lower bandwidth and FLOP counts are presented with supporting empirical results on up to 256 GPUs that corroborate these gains. Experiments are shown to deliver ~50% reduction in the energy consumed to train FFNs using the proposed phantom parallel approach when compared with conventional tensor parallel methods. Additionally, the proposed approach is shown to train smaller phantom models to the same model loss on smaller GPU counts as larger tensor parallel models on larger GPU counts offering the possibility for even greater energy savings.

Comment: The paper introduces phantom parallelism for energy-efficient training, aligning with the model compression criterion focusing on efficiency breakthroughs.

Relevance: 8 Novelty: 7


25. Uncertainty Quantification for Large-Scale Deep Networks via Post-StoNet Modeling

ArXiv ID: 2508.01217

Authors: Yan Sun, Faming Liang

Abstract: Deep learning has revolutionized modern data science. However, how to accurately quantify the uncertainty of predictions from large-scale deep neural networks (DNNs) remains an unresolved issue. To address this issue, we introduce a novel post-processing approach. This approach feeds the output from the last hidden layer of a pre-trained large-scale DNN model into a stochastic neural network (StoNet), then trains the StoNet with a sparse penalty on a validation dataset and constructs prediction intervals for future observations. We establish a theoretical guarantee for the validity of this approach; in particular, the parameter estimation consistency for the sparse StoNet is essential for the success of this approach. Comprehensive experiments demonstrate that the proposed approach can construct honest confidence intervals with shorter interval lengths compared to conformal methods and achieves better calibration compared to other post-hoc calibration techniques. Additionally, we show that the StoNet formulation provides us with a platform to adapt sparse learning theory and methods from linear models to DNNs.

Comment: The paper introduces a novel post-processing approach for uncertainty quantification in DNNs, aligning with representation learning and model architecture analysis.

Relevance: 8 Novelty: 7


26. Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models

ArXiv ID: 2508.01862

Authors: Yijun Feng

Abstract: Large Language Models have demonstrated remarkable capabilities across diverse tasks, yet they frequently generate hallucinations outputs that are fluent but factually incorrect or unsupported. We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in LLM outputs. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations. We hypothesize that genuine knowledge exhibits robustness to counterfactual variations, while hallucinated content shows inconsistent confidence patterns when confronted with plausible alternatives. Our comprehensive evaluation on TruthfulQA, factual statement datasets, and curated hallucination examples demonstrates that counterfactual probing achieves superior detection performance compared to baseline methods, while our adaptive mitigation strategies reduce hallucination scores by an average of 24.5%. The approach requires no model retraining and can be integrated into existing LLM pipelines as a realtime verification mechanism.

Comment: The paper proposes a novel method for detecting and mitigating hallucinations in large language models, which is relevant to theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


27. HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens

ArXiv ID: 2508.01474

Authors: Ivan Karpukhin, Andrey Savchenko

Abstract: Deep learning has achieved remarkable success in modeling sequential data, including event sequences, temporal point processes, and irregular time series. Recently, transformers have largely replaced recurrent networks in these tasks. However, transformers often underperform RNNs in classification tasks where the objective is to predict future targets. The reason behind this performance gap remains largely unexplored. In this paper, we identify a key limitation of transformers: the absence of a single state vector that provides a compact and effective representation of the entire sequence. Additionally, we show that contrastive pretraining of embedding vectors fails to capture local context, which is crucial for accurate prediction. To address these challenges, we introduce history tokens, a novel concept that facilitates the accumulation of historical information during next-token prediction pretraining. Our approach significantly improves transformer-based models, achieving impressive results in finance, e-commerce, and healthcare tasks. The code is publicly available on GitHub.

Comment: The paper proposes a novel transformer-based model with history tokens, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


28. DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging

ArXiv ID: 2508.01148

Authors: Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma

Abstract: Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years. However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored. In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them. Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address this issue, we propose DisTaC (Distillation for Task vector Conditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector's norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models exhibiting the harmful traits -- where they would otherwise fail -- achieving significant performance gains.

Comment: The paper introduces a novel method for robust model merging using distillation, relevant to model architecture and efficiency.

Relevance: 8 Novelty: 7


29. MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification

ArXiv ID: 2508.02584

Authors: Ming Pok Ng, Junqi Jiang, Gabriel Freedman, Antonio Rago, Francesca Toni

Abstract: Leveraging outputs from multiple large language models (LLMs) is emerging as a method for harnessing their power across a wide range of tasks while mitigating their capacity for making errors, e.g., hallucinations. However, current approaches to combining insights from multiple LLMs often involve unstructured interactions (e.g., free debate), resulting in model generations that are not faithfully justifiable. In this work, we introduce MArgE, a novel framework to provide formal structure to the evidence from each LLM, in the form of a tree of extracted arguments, for the task of claim verification. We use a variant of Argumentative LLMs (ArgLLMs), i.e. LLMs driven by frameworks and semantics from the field of computational argumentation, to construct structured argument trees for given claims. This process creates an inspectable pathway from the initial arguments to the final claim verification decisions, providing a faithful justification thereof. We show experimentally that MArgE can significantly outperform single LLMs, including three open-source models (4B to 8B parameters), GPT-4o-mini and existing ArgLLMs, as well as prior methods for unstructured multi-LLM debates. We thus demonstrate the advantages of incorporating formal, argumentative reasoning mechanisms when combining multiple LLM outputs.

Comment: The paper introduces a framework for combining outputs from multiple LLMs using argumentative reasoning, relevant to theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


30. Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler

ArXiv ID: 2508.01483

Authors: Aleksandr Dremov, Alexander H\"agele, Atli Kosson, Martin Jaggi

Abstract: Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in getting the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis reveals that different cooldown shapes reveal a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations $\unicode{x2013}$ comparable to those from cooldown shape selection $\unicode{x2013}$ when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a loss landscape perspective, we provide visualizations of the landscape during cooldown, supporting the river valley loss perspective empirically. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.

Comment: The paper provides insights into the training dynamics of learning rate schedulers in transformers, relevant to model architecture and training dynamics.

Relevance: 8 Novelty: 7


31. Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling

ArXiv ID: 2508.01969

Authors: Seyyed Saeid Cheshmi, Azal Ahmad Khan, Xinran Wang, Zirui Liu, Ali Anwar

Abstract: Large Language Models (LLMs) are increasingly relied upon for solving complex reasoning tasks in domains such as mathematics, logic, and multi-step question answering. A growing line of work seeks to improve reasoning quality by scaling inference time compute particularly through Process Reward Models (PRMs), used to reward the reasoning at intermediate steps. While effective, these methods introduce substantial computational overhead, especially when generating large numbers of solutions in parallel. In this paper, we investigate whether PRMs can be used mid-generation to provide early signals that enable the rejection of suboptimal candidates before full generation of step is complete. We introduce the hypothesis that PRMs are also Partial Reward Models, meaning that the scores they assign to partially completed reasoning step are predictive of final output quality. This allows for principled early rejection based on intermediate token-level signals. We support this hypothesis both theoretically, by proving that the risk of discarding optimal beams decreases exponentially with generation length and empirically, by demonstrating a strong correlation between partial and final rewards across multiple reward models. On math reasoning benchmarks, our method achieves up to 1.4$\times$-9$\times$ reduction in inference FLOPs without degrading final performance. These results suggest that early rejection is a powerful mechanism for improving the compute-efficiency of reasoning in LLMs.

Comment: The paper introduces a method for accelerating LLM reasoning via early rejection, relevant to model efficiency and theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


32. Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

ArXiv ID: 2508.01908

Authors: Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish

Abstract: Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pre-training and propose an efficient implementation of meta-experience replay (MER) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.

Comment: The paper explores continual pre-training of LLMs with techniques like experience replay and gradient alignment, relevant to understanding LLM behavior and efficiency.

Relevance: 8 Novelty: 7


33. How Does Controllability Emerge In Language Models During Pretraining?

ArXiv ID: 2508.01892

Authors: Jianshu She, Xinyue Li, Eric Xing, Zhengzhong Liu, Qirong Ho

Abstract: Language models can be steered by modifying their internal representations to control concepts such as emotion, style, or truthfulness in generation. However, the conditions for an effective intervention remain unclear and are often validated through heuristics and trial-and-error. To fill this gap, we demonstrate that intervention efficacy, measured by linear steerability (i.e., the ability to adjust output via linear transformations of hidden states), emerges during intermediate stages of training. Moreover, even closely related concepts (e.g., anger and sadness) exhibit steerability emergence at distinct stages of training. To better interpret the dynamics of steerability during training, we adapt existing intervention techniques into a unified framework, referred to as the "Intervention Detector" (ID), which is designed to reveal how linear steerability evolves over the course of training through hidden state and representation analysis. ID reveals that concepts become increasingly linearly separable in the hidden space as training progresses, which strongly correlates with the emergence of linear steerability. We further introduce ID-based metrics, such as heatmaps, entropy trends, and cosine similarity, to help interpret how linear steerability evolves throughout training. In addition, we apply ID across different model families to ensure the generality of our findings on steerability dynamics.

Comment: The paper investigates how controllability emerges in LLMs during pretraining, relevant to understanding LLM behavior and interpretability.

Relevance: 8 Novelty: 7


34. Adaptive Riemannian Graph Neural Networks

ArXiv ID: 2508.02600

Authors: Xudong Wang, Tongxin Li, Chris Ding, Jicong Fan

Abstract: Graph data often exhibits complex geometric heterogeneity, where structures with varying local curvature, such as tree-like hierarchies and dense communities, coexist within a single network. Existing geometric GNNs, which embed graphs into single fixed-curvature manifolds or discrete product spaces, struggle to capture this diversity. We introduce Adaptive Riemannian Graph Neural Networks (ARGNN), a novel framework that learns a continuous and anisotropic Riemannian metric tensor field over the graph. It allows each node to determine its optimal local geometry, enabling the model to fluidly adapt to the graph's structural landscape. Our core innovation is an efficient parameterization of the node-wise metric tensor, specializing to a learnable diagonal form that captures directional geometric information while maintaining computational tractability. To ensure geometric regularity and stable training, we integrate a Ricci flow-inspired regularization that smooths the learned manifold. Theoretically, we establish the rigorous geometric evolution convergence guarantee for ARGNN and provide a continuous generalization that unifies prior fixed or mixed-curvature GNNs. Empirically, our method demonstrates superior performance on both homophilic and heterophilic benchmark datasets with the ability to capture diverse structures adaptively. Moreover, the learned geometries both offer interpretable insights into the underlying graph structure and empirically corroborate our theoretical analysis.

Comment: The paper introduces Adaptive Riemannian Graph Neural Networks, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


35. FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models

ArXiv ID: 2508.01506

Authors: Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai Li

Abstract: Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language models (LLMs) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal accuracy loss. Previous SVD-based approaches have focused primarily on reducing the memory footprint of model weights, largely overlooking the additional activation memory overhead incurred during inference when applying truncated factors via standard dense CUDA kernels. Our experiments demonstrate that this activation overhead, scaling with sequence length and hidden dimension, prevents current SVD compression techniques from achieving any reduction in peak inference memory, thereby limiting their viability for real-world, on-device deployments. We introduce FlashSVD, a novel, end-to-end rank-aware streaming inference framework specifically designed for SVD-compressed large language models. FlashSVD can be seamlessly integrated with any model that employs SVD-based methods for parameter reduction. By fusing low-rank projection kernels directly into both the self-attention and feed-forward network (FFN) pipelines, FlashSVD avoid materializing full-size activation buffers. Instead, small tiles of the truncated factors are loaded into on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy and adding no extra latency. On standard encoder benchmarks (e.g., BERT-Base), FlashSVD cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%, all while incur no accuracy loss with upstreaming compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs.

Comment: The paper introduces FlashSVD, a framework for memory-efficient inference with low-rank models, which is relevant to model compression.

Relevance: 8 Novelty: 7


36. Effects of Feature Correlations on Associative Memory Capacity

ArXiv ID: 2508.01395

Authors: Stefan Bielmeier, Gerald Friedland

Abstract: We investigate how feature correlations influence the capacity of Dense Associative Memory (DAM), a Transformer attention-like model. Practical machine learning scenarios involve feature-correlated data and learn representations in the input space, but current capacity analyses do not account for this. We develop an empirical framework to analyze the effects of data structure on capacity dynamics. Specifically, we systematically construct datasets that vary in feature correlation and pattern separation using Hamming distance from information theory, and compute the model's corresponding storage capacity using a simple binary search algorithm. Our experiments confirm that memory capacity scales exponentially with increasing separation in the input space. Feature correlations do not alter this relationship fundamentally, but reduce capacity slightly at constant separation. This effect is amplified at higher polynomial degrees in the energy function, suggesting that Associative Memory is more limited in depicting higher-order interactions between features than patterns. Our findings bridge theoretical work and practical settings for DAM, and might inspire more data-centric methods.

Comment: The paper investigates the effects of feature correlations on associative memory capacity, contributing to representation learning by analyzing capacity dynamics.

Relevance: 8 Novelty: 7


37. Graph Embedding in the Graph Fractional Fourier Transform Domain

ArXiv ID: 2508.02383

Authors: Changjie Sheng, Zhichao Zhang, Wei Yao

Abstract: Spectral graph embedding plays a critical role in graph representation learning by generating low-dimensional vector representations from graph spectral information. However, the embedding space of traditional spectral embedding methods often exhibit limited expressiveness, failing to exhaustively capture latent structural features across alternative transform domains. To address this issue, we use the graph fractional Fourier transform to extend the existing state-of-the-art generalized frequency filtering embedding (GEFFE) into fractional domains, giving birth to the generalized fractional filtering embedding (GEFRFE), which enhances embedding informativeness via the graph fractional domain. The GEFRFE leverages graph fractional domain filtering and a nonlinear composition of eigenvector components derived from a fractionalized graph Laplacian. To dynamically determine the fractional order, two parallel strategies are introduced: search-based optimization and a ResNet18-based adaptive learning. Extensive experiments on six benchmark datasets demonstrate that the GEFRFE captures richer structural features and significantly enhance classification performance. Notably, the proposed method retains computational complexity comparable to GEFFE approaches.

Comment: The paper introduces a novel graph embedding method using the graph fractional Fourier transform, contributing to representation learning with a focus on spectral methods.

Relevance: 8 Novelty: 7


38. ProCut: LLM Prompt Compression via Attribution Estimation

ArXiv ID: 2508.02053

Authors: Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang

Abstract: In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.

Comment: The paper introduces ProCut for LLM prompt compression, focusing on efficiency improvements in LLMs, which aligns with model compression.

Relevance: 8 Novelty: 7


39. Expressive Power of Graph Transformers via Logic

ArXiv ID: 2508.01067

Authors: Veeti Ahvonen, Maurice Funk, Damian Heiman, Antti Kuusisto, Carsten Lutz

Abstract: Transformers are the basis of modern large language models, but relatively little is known about their precise expressive power on graphs. We study the expressive power of graph transformers (GTs) by Dwivedi and Bresson (2020) and GPS-networks by Ramp\'asek et al. (2022), both under soft-attention and average hard-attention. Our study covers two scenarios: the theoretical setting with real numbers and the more practical case with floats. With reals, we show that in restriction to vertex properties definable in first-order logic (FO), GPS-networks have the same expressive power as graded modal logic (GML) with the global modality. With floats, GPS-networks turn out to be equally expressive as GML with the counting global modality. The latter result is absolute, not restricting to properties definable in a background logic. We also obtain similar characterizations for GTs in terms of propositional logic with the global modality (for reals) and the counting global modality (for floats).

Comment: The paper provides theoretical insights into the expressive power of graph transformers, which aligns with the interest in understanding model architectures, particularly transformers.

Relevance: 8 Novelty: 7


40. Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations

ArXiv ID: 2508.01728

Authors: Dahee Kwon, Sehyun Lee, Jaesik Choi

Abstract: Deep vision models have achieved remarkable classification performance by leveraging a hierarchical architecture in which human-interpretable concepts emerge through the composition of individual neurons across layers. Given the distributed nature of representations, pinpointing where specific visual concepts are encoded within a model remains a crucial yet challenging task. In this paper, we introduce an effective circuit discovery method, called Granular Concept Circuit (GCC), in which each circuit represents a concept relevant to a given query. To construct each circuit, our method iteratively assesses inter-neuron connectivity, focusing on both functional dependencies and semantic alignment. By automatically discovering multiple circuits, each capturing specific concepts within that query, our approach offers a profound, concept-wise interpretation of models and is the first to identify circuits tied to specific visual concepts at a fine-grained level. We validate the versatility and effectiveness of GCCs across various deep image classification models.

Comment: The paper introduces a method for discovering concept circuits in deep vision models, which relates to representation learning by offering insights into how models encode information.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.