Personalized Daily ArXiv Papers 2025-08-05

[gpt-4o]	Prompt	Completion	Total
Token	57367	7731	65098
Cost	$0.14	$0.08	$0.22

Total arXiv papers: 1006

Total scanned papers: 654

Total relevant papers: 40

Table of contents with paper titles:

EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models Authors: Yuanteng Chen, Yuantian Shao, Peisong Wang, Jian Cheng
Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models Authors: Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, Gongyi Wang
CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation Authors: Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang
Kronecker-LoRA: hybrid Kronecker-LoRA adapters for scalable, sustainable fine-tuning Authors: Yixin Shen
Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules Authors: Yilun Liu, Yunpu Ma, Yuetian Lu, Shuo Chen, Zifeng Ding, Volker Tresp
LOST: Low-rank and Sparse Pre-training for Large Language Models Authors: Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang
Noosemia: toward a Cognitive and Phenomenological Account of Intentionality Attribution in Human-Generative AI Interaction Authors: Enrico De Santis, Antonello Rizzi
LeanK: Learnable K Cache Channel Pruning for Efficient Decoding Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
Trainable Dynamic Mask Sparse Attention Authors: Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo
From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model Authors: Yeong-Joon Ju, Seong-Whan Lee
CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis Authors: Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
The Geometry of Machine Learning Models Authors: Pawel Gajer, Jacques Ravel
Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning Authors: En Yu, Jie Lu, Kun Wang, Xiaoyu Yang, Guangquan Zhang
Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact Authors: Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien
Flexible Automatic Identification and Removal (FAIR)-Pruner: An Efficient Neural Network Pruning Method Authors: Chenqing Lin, Mostafa Hussien, Chengyao Yu, Mohamed Cheriet, Osama Abdelrahman, Ruixing Ming
CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge Authors: Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan
Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation Authors: Hongze Sun, Wuque Cai, Duo Chen, Shifeng Mao, Jiayi He, Zhenxing Wang, Dezhong Yao, Daqing Guo
Trustworthy scientific inference for inverse problems with generative models Authors: James Carzon, Luca Masserano, Joshua D. Ingram, Alex Shen, Antonio Carlos Herling Ribeiro Junior, Tommaso Dorigo, Michele Doro, Joshua S. Speagle, Rafael Izbicki, Ann B. Lee
CellForge: Agentic Design of Virtual Cell Models Authors: Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein
Superior resilience to poisoning and amenability to unlearning in quantum machine learning Authors: Yu-Qin Chen, Shi-Xin Zhang
Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis Authors: Anzhe Cheng, Chenzhong Yin, Mingxi Cheng, Shukai Duan, Shahin Nazarian, Paul Bogdan
What are you sinking? A geometric approach on attention sink Authors: Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri
LetheViT: Selective Machine Unlearning for Vision Transformers via Attention-Guided Contrastive Learning Authors: Yujia Tong, Tian Zhang, Jingling Yuan, Yuze Wang, Chuang Hu
Compression-Induced Communication-Efficient Large Model Training and Inferencing Authors: Sudip K. Seal, Maksudul Alam, Jorge Ramirez, Sajal Dash, Hao Lu
Uncertainty Quantification for Large-Scale Deep Networks via Post-StoNet Modeling Authors: Yan Sun, Faming Liang
Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models Authors: Yijun Feng
HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens Authors: Ivan Karpukhin, Andrey Savchenko
DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging Authors: Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma
MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification Authors: Ming Pok Ng, Junqi Jiang, Gabriel Freedman, Antonio Rago, Francesca Toni
Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler Authors: Aleksandr Dremov, Alexander Hägele, Atli Kosson, Martin Jaggi
Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling Authors: Seyyed Saeid Cheshmi, Azal Ahmad Khan, Xinran Wang, Zirui Liu, Ali Anwar
Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models Authors: Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish
How Does Controllability Emerge In Language Models During Pretraining? Authors: Jianshu She, Xinyue Li, Eric Xing, Zhengzhong Liu, Qirong Ho
Adaptive Riemannian Graph Neural Networks Authors: Xudong Wang, Tongxin Li, Chris Ding, Jicong Fan
FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models Authors: Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai Li
Effects of Feature Correlations on Associative Memory Capacity Authors: Stefan Bielmeier, Gerald Friedland
Graph Embedding in the Graph Fractional Fourier Transform Domain Authors: Changjie Sheng, Zhichao Zhang, Wei Yao
ProCut: LLM Prompt Compression via Attribution Estimation Authors: Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang
Expressive Power of Graph Transformers via Logic Authors: Veeti Ahvonen, Maurice Funk, Damian Heiman, Antti Kuusisto, Carsten Lutz
Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations Authors: Dahee Kwon, Sehyun Lee, Jaesik Choi

1. EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models

ArXiv ID: 2508.01625

Authors: Yuanteng Chen, Yuantian Shao, Peisong Wang, Jian Cheng

Abstract: Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) the low number of activated parameters does not translate into an equivalent inference speedup. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are not crucial for the corresponding tasks, yet they still add inference latency. Therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning less frequently used experts for the current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation.

Comment: The paper proposes EAC-MoE, a method for compressing Mixture-of-Experts models using quantization and pruning, which aligns with the model compression and MoE criteria.

Relevance: 9 Novelty: 8
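
Illustration (not from the paper): the abstract describes PESF as pruning experts that are rarely selected for the current task. A minimal sketch of that idea, assuming a recorded routing trace from calibration data and a hypothetical `min_share` threshold, might look like this. The function name, the trace format, and the threshold are all assumptions made for the example.

```python
from collections import Counter

def select_experts_to_keep(routing_trace, num_experts, min_share=0.05):
    """Return ids of experts whose share of routed tokens is at least min_share.

    routing_trace: list of expert-id lists, one per token (its top-k experts).
    min_share: hypothetical cutoff, not a value taken from the paper.
    """
    counts = Counter(e for chosen in routing_trace for e in chosen)
    total = sum(counts.values())
    if total == 0:
        return []
    return [e for e in range(num_experts) if counts[e] / total >= min_share]

# Toy calibration trace: 6 tokens, top-2 routing over 4 experts.
trace = [[0, 1], [0, 1], [0, 2], [1, 0], [0, 1], [1, 0]]
kept = select_experts_to_keep(trace, num_experts=4, min_share=0.1)
print(kept)  # experts 2 and 3 fall below the 10% share and are dropped
```

In this toy trace experts 0 and 1 receive 11 of the 12 routing slots, so only they survive the cutoff.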

2. Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models

ArXiv ID: 2508.02128

Authors: Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, Gongyi Wang

Abstract: In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further enhance generality and efficiency, we propose Outstanding-sparse, a unified framework that integrates Amber Pruner with post-training W8A8 quantization. Our approach preserves strong performance across a range of downstream tasks, with notable advantages in generative tasks. This work pioneers a new frontier in activation sparsity, providing foundational insights that are poised to guide the co-evolution of algorithms and architectures in the design of next-generation AI systems.

Comment: The paper introduces Amber Pruner, a training-free N:M activation sparsity method for LLMs, which aligns with the model compression criterion focusing on sparsity and efficiency breakthroughs.

Relevance: 9 Novelty: 8
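
Illustration (not from the paper): N:M sparsity keeps at most N non-zero values in every group of M consecutive values. The sketch below applies a 2:4 pattern to one activation row using a plain magnitude criterion. That criterion is an assumption for the example, since the abstract does not state how Amber Pruner scores activations.

```python
def nm_sparsify(row, n=2, m=4):
    """Keep the n largest-magnitude values in every group of m and zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest |x| within this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        out.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return out

acts = [0.1, -2.0, 0.5, 1.5, 3.0, 0.0, -0.2, 0.4]
sparse = nm_sparsify(acts, n=2, m=4)
print(sparse)
```

Each group of four ends up with exactly two surviving entries, which is the structure that sparse matrix hardware paths typically require.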

3. CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation

ArXiv ID: 2508.02401

Authors: Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang

Abstract: Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This method ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs. To address the issue above, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in the previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic context. Afterwards, we exploit such heads to determine the important tokens and retain their corresponding KV cache pairs. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer-adaptive KV cache allocation strategy. Experimental results demonstrate the proposed CompressKV consistently outperforms state-of-the-art approaches under various memory budgets on LongBench and Needle-in-a-Haystack benchmarks. Our code is publicly available at: https://github.com/TUDa-HWAI/CompressKV.git.

Comment: The paper presents CompressKV, a method for KV cache compression in LLMs, aligning with the model compression criterion focusing on efficiency breakthroughs.

Relevance: 9 Novelty: 8
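
Illustration (not from the paper): the abstract contrasts scoring tokens with all attention heads against scoring them with a selected subset of heads. A minimal sketch of that difference follows. It assumes the relevant heads have already been identified by some other procedure and that attention weights are summed per token, and all names and numbers are hypothetical.

```python
def select_tokens(attn_by_head, heads, budget):
    """Pick `budget` prompt positions to retain in the KV cache.

    attn_by_head: dict head_id -> list of attention weights over prompt tokens.
    heads: head ids whose attention is used for scoring (assumed given).
    """
    num_tokens = len(next(iter(attn_by_head.values())))
    scores = [sum(attn_by_head[h][t] for h in heads) for t in range(num_tokens)]
    top = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)[:budget]
    return sorted(top)

attn = {
    0: [0.7, 0.1, 0.1, 0.1],    # head that mostly attends to the first token
    1: [0.1, 0.1, 0.6, 0.2],
    2: [0.05, 0.05, 0.1, 0.8],
}
all_heads = select_tokens(attn, heads=[0, 1, 2], budget=2)
subset = select_tokens(attn, heads=[1, 2], budget=2)
print(all_heads, subset)
```

With every head voting, position 0 is retained because head 0 dominates the sum. Restricting the vote to heads 1 and 2 keeps positions 2 and 3 instead.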

4. Kronecker-LoRA: hybrid Kronecker-LoRA adapters for scalable, sustainable fine-tuning

ArXiv ID: 2508.01961

Authors: Yixin Shen

Abstract: Fine-tuning massive pre-trained language models across many tasks demands adapters that are both parameter-efficient and highly expressive. We introduce Kron-LoRA, a two-stage adapter that first factorizes each frozen linear update as a Kronecker product $\Delta W = A \otimes B$ and then compresses $B \in \mathbb{R}^{d_{B2}\times d_{B1}}$ via an $r$-rank LoRA decomposition $B \approx B_{1}B_{2}$. By leveraging $\mathrm{rank}(A \otimes B) = \mathrm{rank}(A)\,\mathrm{rank}(B)$, Kron-LoRA retains the expressivity of the update while using up to $4\times$ fewer parameters than a standard rank-8 LoRA adapter. Its compact adapter matrices also quantize to 8- or 4-bit with less accuracy degradation than LoRA, enabling further memory and storage savings for on-device deployment. We benchmark on DistilBERT and Mistral-7B across five tasks (PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge) over multiple epochs of adapter-only tuning: on DistilBERT, an 840 K-parameter Kron-LoRA matches LoRA-16's performance, and on Mistral-7B, a 5.7 M-parameter Kron-LoRA rivals LoRA-8 with modest memory savings and only a 3-8% speed overhead. In sequential fine-tuning from ARC-Challenge to ARC-Easy, Kron-LoRA retains 55.18% accuracy versus 53.17% for LoRA-8, despite using only one-quarter of the adapter parameters, underscoring its competitive cross-task transfer performance. By uniting Kronecker structure, low-rank compression, quantization-friendliness, and by providing transparent trade-off analysis, Kron-LoRA offers a scalable, sustainable, and continual-learning-ready solution for multi-task adaptation of large language models.

Comment: The paper introduces Kron-LoRA, a novel approach combining Kronecker product and low-rank decomposition for efficient fine-tuning of large language models, which is relevant to model compression.

Relevance: 9 Novelty: 8
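
Illustration (not from the paper): the parameter saving claimed in the abstract can be checked with simple arithmetic. The sketch below counts adapter parameters for a standard LoRA update and for a Kronecker update with a dense factor A and a rank-r factored B. The layer size (4096 x 4096) and the 4 x 4 shape of A are hypothetical choices for the example, not settings reported by the author.

```python
def lora_params(d_out, d_in, r):
    """Parameters of a rank-r LoRA update for a d_out x d_in weight."""
    return r * (d_out + d_in)

def kron_lora_params(a_rows, a_cols, b_rows, b_cols, r):
    """A is stored densely; B is stored as (b_rows x r) times (r x b_cols)."""
    return a_rows * a_cols + r * (b_rows + b_cols)

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in a_row for b in b_row] for a_row in A for b_row in B]

# Hypothetical 4096 x 4096 layer: A is 4 x 4, so B must be 1024 x 1024.
lora8 = lora_params(4096, 4096, r=8)
kron8 = kron_lora_params(4, 4, 1024, 1024, r=8)
print(lora8, kron8, round(lora8 / kron8, 2))

# Shape check on a tiny case: (1 x 2) kron (2 x 1) gives a 2 x 2 matrix.
small = kron([[1, 2]], [[1], [3]])
print(small)
```

Under these assumed shapes the Kronecker adapter needs roughly a quarter of the LoRA-8 parameters, in line with the "up to 4x" figure quoted in the abstract.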

5. Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules

ArXiv ID: 2508.02587

Authors: Yilun Liu, Yunpu Ma, Yuetian Lu, Shuo Chen, Zifeng Ding, Volker Tresp

Abstract: Mixture-of-Experts (MoE) benefits from a dynamic routing mechanism among its specialized experts, which existing Parameter-Efficient Fine-Tuning (PEFT) strategies fail to leverage. This motivates us to investigate whether adaptation modules themselves should incorporate routing mechanisms to align with MoE's multi-expert architecture. We analyze dynamics of core components when applying PEFT to MoE language models and examine how different routing strategies affect adaptation effectiveness. Extensive experiments adapting OLMoE-1B-7B and Mixtral-8x7B on various commonsense and math reasoning tasks validate the performance and efficiency of our routed approach. We identify the optimal configurations for different scenarios and provide empirical analyses with practical insights to facilitate better PEFT and MoE applications.

Comment: The paper explores parameter-efficient fine-tuning for Mixture-of-Experts models, which is relevant to both model architecture and efficiency.

Relevance: 9 Novelty: 8

6. LOST: Low-rank and Sparse Pre-training for Large Language Models

ArXiv ID: 2508.02668

Authors: Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang

Abstract: While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in undesirable performance degradation compared to full-rank training. In this paper, we propose LOw-rank and Sparse pre-Training (LOST) for LLMs, a novel method that ingeniously integrates low-rank and sparse structures to enable effective training of LLMs from scratch under strict efficiency constraints. LOST applies singular value decomposition to weight matrices, preserving the dominant low-rank components, while allocating the remaining singular values to construct channel-wise sparse components to complement the expressiveness of low-rank training. We evaluate LOST on LLM pretraining ranging from 60M to 7B parameters. Our experiments show that LOST achieves competitive or superior performance compared to full-rank models, while significantly reducing both memory and compute overhead. Code is available at https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models.

Comment: The paper proposes LOST, a method integrating low-rank and sparse structures for efficient pre-training of large language models, which is relevant to model compression.

Relevance: 9 Novelty: 8

7. Noosemia: toward a Cognitive and Phenomenological Account of Intentionality Attribution in Human-Generative AI Interaction

ArXiv ID: 2508.02622

Authors: Enrico De Santis, Antonello Rizzi

Abstract: This paper introduces and formalizes Noosemia, a novel cognitive-phenomenological phenomenon emerging from human interaction with generative AI systems, particularly those enabling dialogic or multimodal exchanges. We propose a multidisciplinary framework to explain how, under certain conditions, users attribute intentionality, agency, and even interiority to these systems - a process grounded not in physical resemblance, but in linguistic performance, epistemic opacity, and emergent technological complexity. By linking an LLM declination of meaning holism to our technical notion of the LLM Contextual Cognitive Field, we clarify how LLMs construct meaning relationally and how coherence and a simulacrum of agency arise at the human-AI interface. The analysis situates noosemia alongside pareidolia, animism, the intentional stance and the uncanny valley, distinguishing its unique characteristics. We also introduce a-noosemia to describe the phenomenological withdrawal of such projections. The paper concludes with reflections on the broader philosophical, epistemological, and social implications of noosemic dynamics and directions for future research.

Comment: The paper introduces a novel cognitive-phenomenological phenomenon related to LLMs, which could provide theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8

8. LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

ArXiv ID: 2508.02215

Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu

Abstract: Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that can satisfy a specific sparsity ratio and hardware alignment requirement. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.

Comment: LeanK proposes a learning-based method for pruning key cache channels in LLMs, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
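
Illustration (not from the paper): a static channel mask means the same key channels are dropped for every token, so the cache can be stored with fewer columns and the query is sliced the same way before the dot product. The sketch below uses a hand-written mask. In LeanK the mask is learned, and the function names here are hypothetical.

```python
def prune_k_channels(k_cache, mask):
    """Drop K channels whose mask entry is 0. k_cache: list of per-token key vectors."""
    keep = [i for i, m in enumerate(mask) if m]
    return [[k[i] for i in keep] for k in k_cache], keep

def attention_logits(q, k_pruned, keep):
    """Dot products between the query, restricted to kept channels, and the pruned keys."""
    q_kept = [q[i] for i in keep]
    return [sum(a * b for a, b in zip(q_kept, k)) for k in k_pruned]

k_cache = [[1.0, 0.0, 2.0, 0.5],
           [0.0, 1.0, 1.0, 0.5]]
mask = [1, 0, 1, 0]            # hand-set 50% channel sparsity for the example
q = [1.0, 3.0, 0.5, 2.0]

k_pruned, keep = prune_k_channels(k_cache, mask)
logits = attention_logits(q, k_pruned, keep)
print(k_pruned, logits)
```

The pruned cache stores two values per token instead of four. The resulting logits are an approximation of the full dot products, which is why the choice of mask matters.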

9. Trainable Dynamic Mask Sparse Attention

ArXiv ID: 2508.02124

Authors: Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo

Abstract: In large language models, the demand for modeling long contexts is constantly increasing, but the quadratic complexity of the standard self-attention mechanism often becomes a bottleneck. Although existing sparse attention mechanisms have improved efficiency, they may still encounter issues such as static patterns or information loss. We introduce a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention, which effectively utilizes content-aware and position-aware sparsity. DMA achieves this through two key innovations: First, it dynamically generates content-aware sparse masks from value representations, enabling the model to identify and focus on critical information adaptively. Second, it implements position-aware sparse attention computation that effectively skips unnecessary calculation regions. This dual-sparsity design allows the model to significantly reduce the computational complexity of important information while retaining complete information, achieving an excellent balance between information fidelity and computational efficiency. We have verified the performance of DMA through comprehensive experiments. Comparative studies show that DMA outperforms multi-head attention, sliding window attention, multi-head latent attention, and native sparse attention in terms of perplexity under Chinchilla Scaling Law settings. Moreover, in challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B parameter model, DMA significantly outperforms multi-head attention in both standard benchmark performance and the challenging needle-in-a-haystack task. These experimental results highlight its capability to balance model efficiency and long-context modeling ability effectively.

Comment: The paper introduces a trainable dynamic mask sparse attention mechanism, relevant to model architecture and efficiency improvements.

Relevance: 9 Novelty: 8

10. From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

ArXiv ID: 2508.00955

Authors: Yeong-Joon Ju, Seong-Whan Lee

Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic, instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model's own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process by lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. We further boost the performance via our self-aware hard negative sampling, achieving state-of-the-art performance without contrastive pre-training. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.

Comment: The paper proposes a framework for adapting multimodal LLMs for universal embedding tasks, which aligns with foundational research in representation learning and LLMs.

Relevance: 9 Novelty: 8

11. CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

ArXiv ID: 2508.02322

Authors: Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

Comment: The paper introduces CAMERA, a framework for MoE model compression, which is relevant to model compression and MoE architectures.

Relevance: 9 Novelty: 8

12. The Geometry of Machine Learning Models

ArXiv ID: 2508.02080

Authors: Pawel Gajer, Jacques Ravel

Abstract: This paper presents a mathematical framework for analyzing machine learning models through the geometry of their induced partitions. By representing partitions as Riemannian simplicial complexes, we capture not only adjacency relationships but also geometric properties including cell volumes, volumes of faces where cells meet, and dihedral angles between adjacent cells. For neural networks, we introduce a differential forms approach that tracks geometric structure through layers via pullback operations, making computations tractable by focusing on data-containing cells. The framework enables geometric regularization that directly penalizes problematic spatial configurations and provides new tools for model refinement through extended Laplacians and simplicial splines. We also explore how data distribution induces effective geometric curvature in model partitions, developing discrete curvature measures for vertices that quantify local geometric complexity and statistical Ricci curvature for edges that captures pairwise relationships between cells. While focused on mathematical foundations, this geometric perspective offers new approaches to model interpretation, regularization, and diagnostic tools for understanding learning dynamics.

Comment: The paper presents a mathematical framework for analyzing machine learning models through geometry, offering foundational insights into model interpretation and regularization.

Relevance: 9 Novelty: 8

13. Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning

ArXiv ID: 2508.01598

Authors: En Yu, Jie Lu, Kun Wang, Xiaoyu Yang, Guangquan Zhang

Abstract: Learning from multiple data streams in real-world scenarios is fundamentally challenging due to intrinsic heterogeneity and unpredictable concept drifts. Existing methods typically assume homogeneous streams and employ static architectures with indiscriminate knowledge fusion, limiting generalizability in complex dynamic environments. To tackle this gap, we propose CAMEL, a dynamic Collaborative Assistance Mixture of Experts Learning framework. It addresses heterogeneity by assigning each stream an independent system with a dedicated feature extractor and task-specific head. Meanwhile, a dynamic pool of specialized private experts captures stream-specific idiosyncratic patterns. Crucially, collaboration across these heterogeneous streams is enabled by a dedicated assistance expert. This expert employs a multi-head attention mechanism to distill and integrate relevant context autonomously from all other concurrent streams. It facilitates targeted knowledge transfer while inherently mitigating negative transfer from irrelevant sources. Furthermore, we propose an Autonomous Expert Tuner (AET) strategy, which dynamically manages expert lifecycles in response to drift. It instantiates new experts for emerging concepts (freezing prior ones to prevent catastrophic forgetting) and prunes obsolete ones. This expert-level plasticity provides a robust and efficient mechanism for online model capacity adaptation. Extensive experiments demonstrate CAMEL's superior generalizability across diverse multistreams and exceptional resilience against complex concept drifts.

Comment: The paper introduces a dynamic mixture of experts framework for multistream learning, which aligns with the interest in mixture-of-experts architectures.

Relevance: 9 Novelty: 8

14. Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact

ArXiv ID: 2508.00903

Authors: Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

Abstract: We investigate the phenomenon of neuron universality in independently trained GPT-2 Small models, examining how these universal neurons (neurons with consistently correlated activations across models) emerge and evolve throughout training. By analyzing five GPT-2 models at three checkpoints (100k, 200k, 300k steps), we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Ablation experiments reveal significant functional impacts of universal neurons on model predictions, measured via loss and KL divergence. Additionally, we quantify neuron persistence, demonstrating high stability of universal neurons across training checkpoints, particularly in deeper layers. These findings suggest stable and universal representational structures emerge during neural network training.

Comment: The paper investigates universal neurons in GPT-2 models, providing insights into representation learning and neural network training dynamics.

Relevance: 9 Novelty: 7
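
Illustration (not from the paper): the abstract identifies universal neurons through pairwise correlation of activations across models. A minimal sketch of that kind of analysis for two models follows. It assumes Pearson correlation, a best-match rule, and a hypothetical threshold of 0.5, none of which are specified in the abstract.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length activation traces."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant trace carries no correlation signal
    return cov / sqrt(vx * vy)

def universal_neurons(acts_a, acts_b, threshold=0.5):
    """Indices of model-A neurons whose best match in model B reaches the threshold."""
    return [i for i, a in enumerate(acts_a)
            if max(pearson(a, b) for b in acts_b) >= threshold]

# Rows are neurons, columns are activations on the same four tokens.
model_a = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0]]
model_b = [[2.0, 4.0, 6.0, 8.5], [0.0, 1.0, 0.0, 0.0]]
found = universal_neurons(model_a, model_b)
print(found)
```

Neuron 0 of the first model tracks neuron 0 of the second almost perfectly, while neuron 1 has no positively correlated counterpart, so only index 0 is reported.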

15. Flexible Automatic Identification and Removal (FAIR)-Pruner: An Efficient Neural Network Pruning Method

ArXiv ID: 2508.02291

Authors: Chenqing Lin, Mostafa Hussien, Chengyao Yu, Mohamed Cheriet, Osama Abdelrahman, Ruixing Ming

Abstract: Neural network pruning is a critical compression technique that facilitates the deployment of large-scale neural networks on resource-constrained edge devices, typically by identifying and eliminating redundant or insignificant parameters to reduce computational and memory overhead. This paper proposes the Flexible Automatic Identification and Removal (FAIR)-Pruner, a novel method for neural network structured pruning. Specifically, FAIR-Pruner first evaluates the importance of each unit (e.g., neuron or channel) through the Utilization Score quantified by the Wasserstein distance. To reflect the performance degradation after unit removal, it then introduces the Reconstruction Error, which is computed via the Taylor expansion of the loss function. Finally, FAIR-Pruner identifies superfluous units with negligible impact on model performance by controlling the proposed Tolerance of Difference, which measures differences between unimportant units and those that cause performance degradation. A major advantage of FAIR-Pruner lies in its capacity to automatically determine the layer-wise pruning rates, which yields a more efficient subnetwork structure compared to applying a uniform pruning rate. Another advantage of the FAIR-Pruner is its great one-shot performance without post-pruning fine-tuning. Furthermore, with utilization scores and reconstruction errors, users can flexibly obtain pruned models under different pruning ratios. Comprehensive experimental validation on diverse benchmark datasets (e.g., ImageNet) and various neural network architectures (e.g., VGG) demonstrates that FAIR-Pruner achieves significant model compression while maintaining high accuracy.

Comment: FAIR-Pruner introduces a novel method for neural network pruning, focusing on structured pruning and automatic determination of layer-wise pruning rates, relevant to model compression.

Relevance: 9 Novelty: 7
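
Sketch: the abstract names two ingredients, a Wasserstein-distance score and a Taylor-expansion estimate of the loss change after removing a unit. The digest does not give the paper's exact formulas, so the snippet below is only a generic illustration of both quantities under stated assumptions. It assumes a unit's activations are compared between two equally sized groups of inputs, and it uses a first-order Taylor term. The function names `wasserstein_1d` and `taylor_removal_cost` are hypothetical and do not come from the paper.

```python
def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two equal-size empirical samples.

    For equal-size samples this equals the mean absolute difference
    between the sorted values.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def taylor_removal_cost(weights, grads):
    """First-order Taylor estimate of |delta loss| when a unit's weights are zeroed.

    Setting w -> 0 changes the loss by about g . (0 - w) = -(g . w),
    so the magnitude of the change is |g . w|.
    """
    if len(weights) != len(grads):
        raise ValueError("weights and grads must have the same length")
    return abs(sum(g * w for g, w in zip(grads, weights)))


# Hypothetical activations of one unit on two groups of inputs.
score = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
# Hypothetical weights and loss gradients for the same unit.
cost = taylor_removal_cost([0.5, -2.0], [0.1, 0.3])
print(score, cost)
```

A unit with a low score and a low removal cost would be a natural pruning candidate in this toy setting. How FAIR-Pruner actually combines the two through its Tolerance of Difference is not specified here.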

16. CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge

ArXiv ID: 2508.02583

Authors: Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan

Abstract: Large Language Models (LLMs) have demonstrated strong performance across a wide range of tasks, yet they still struggle with complex mathematical reasoning, a challenge fundamentally rooted in deep structural dependencies. To address this challenge, we propose CAusal MAthematician (CAMA), a two-stage causal framework that equips LLMs with explicit, reusable mathematical structure. In the learning stage, CAMA first constructs the Mathematical Causal Graph (MCG), a high-level representation of solution strategies, by combining LLM priors with causal discovery algorithms applied to a corpus of question-solution pairs. The resulting MCG encodes essential knowledge points and their causal dependencies. To better align the graph with downstream reasoning tasks, CAMA further refines the MCG through iterative feedback derived from a selected subset of the question-solution pairs. In the reasoning stage, given a new question, CAMA dynamically extracts a task-relevant subgraph from the MCG, conditioned on both the question content and the LLM's intermediate reasoning trace. This subgraph, which encodes the most pertinent knowledge points and their causal dependencies, is then injected back into the LLM to guide its reasoning process. Empirical results on real-world datasets show that CAMA significantly improves LLM performance on challenging mathematical problems. Furthermore, our experiments demonstrate that structured guidance consistently outperforms unstructured alternatives, and that incorporating asymmetric causal relationships yields greater improvements than using symmetric associations alone.

Comment: The paper focuses on enhancing mathematical reasoning in LLMs using a causal framework, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 7

17. Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation

ArXiv ID: 2508.01992

Authors: Hongze Sun, Wuque Cai, Duo Chen, Shifeng Mao, Jiayi He, Zhenxing Wang, Dezhong Yao, Daqing Guo

Abstract: As a foundational architecture of artificial intelligence models, Transformer has been recently adapted to spiking neural networks with promising performance across various tasks. However, existing spiking Transformer (ST)-based models require a substantial number of parameters and incur high computational costs, thus limiting their deployment in resource-constrained environments. To address these challenges, we propose combining synapse pruning with a synergistic learning-based compensation strategy to derive lightweight ST-based models. Specifically, two types of tailored pruning strategies are introduced to reduce redundancy in the weight matrices of ST blocks: an unstructured L1P method to induce sparse representations, and a structured DSP method to induce low-rank representations. In addition, we propose an enhanced spiking neuron model, termed the synergistic leaky integrate-and-fire (sLIF) neuron, to effectively compensate for model pruning through synergistic learning between synaptic and intrinsic plasticity mechanisms. Extensive experiments on benchmark datasets demonstrate that the proposed methods significantly reduce model size and computational overhead while maintaining competitive performance. These results validate the effectiveness of the proposed pruning and compensation strategies in constructing efficient and high-performing ST-based models.

Comment: The paper proposes synapse pruning and synergistic learning for efficient spiking Transformers, contributing to model compression and efficiency.

Relevance: 9 Novelty: 7
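
Sketch: the abstract describes an unstructured L1-based pruning method that induces sparsity in the weight matrices. The paper's exact procedure is not given in this digest, so the snippet below shows only generic magnitude (L1) pruning as an illustration. The function name `l1_prune` and the `sparsity` parameter are hypothetical.

```python
def l1_prune(matrix, sparsity):
    """Zero out roughly the `sparsity` fraction of smallest-magnitude weights.

    `matrix` is a list of rows. Entries tied with the threshold are also
    zeroed, so slightly more than the requested fraction may be pruned.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    magnitudes = sorted(abs(w) for row in matrix for w in row)
    k = int(len(magnitudes) * sparsity)  # number of weights to drop
    if k == 0:
        return [list(row) for row in matrix]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


# Hypothetical 2x2 weight matrix, pruned to 50% sparsity.
weights = [[0.1, -2.0], [0.5, -0.05]]
print(l1_prune(weights, 0.5))
```

The original matrix is left unmodified and a pruned copy is returned, which makes it easy to compare the model before and after pruning.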

18. Trustworthy scientific inference for inverse problems with generative models

ArXiv ID: 2508.02602

Authors: James Carzon, Luca Masserano, Joshua D. Ingram, Alex Shen, Antonio Carlos Herling Ribeiro Junior, Tommaso Dorigo, Michele Doro, Joshua S. Speagle, Rafael Izbicki, Ann B. Lee

Abstract: Generative artificial intelligence (AI) excels at producing complex data structures (text, images, videos) by learning patterns from training examples. Across scientific disciplines, researchers are now applying generative models to "inverse problems" to infer hidden parameters from observed data. While these methods can handle intractable models and large-scale studies, they can also produce biased or overconfident conclusions. We present a solution with Frequentist-Bayes (FreB), a mathematically rigorous protocol that reshapes AI-generated probability distributions into confidence regions that consistently include true parameters with the expected probability, while achieving minimum size when training and target data align. We demonstrate FreB's effectiveness by tackling diverse case studies in the physical sciences: identifying unknown sources under dataset shift, reconciling competing theoretical models, and mitigating selection bias and systematics in observational studies. By providing validity guarantees with interpretable diagnostics, FreB enables trustworthy scientific inference across fields where direct likelihood evaluation remains impossible or prohibitively expensive.

Comment: The paper introduces FreB, a protocol for trustworthy scientific inference with generative models, aligning with AI for Science and emerging trends in foundational research.

Relevance: 8 Novelty: 8

19. CellForge: Agentic Design of Virtual Cell Models

ArXiv ID: 2508.02276

Authors: Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein

Abstract: Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations quantitatively. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework that transforms presented biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and inference. The framework integrates three core modules: Task Analysis for presented dataset characterization and relevant literature retrieval, Method Design, where specialized agents collaboratively develop optimized modeling strategies, and Experiment Execution for automated generation of code. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and have to collaboratively exchange solutions until they achieve a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.

Comment: The paper introduces CellForge, a multi-agent framework for virtual cell modeling, aligning with AI for Science and emerging trends in foundational research.

Relevance: 8 Novelty: 8

20. Superior resilience to poisoning and amenability to unlearning in quantum machine learning

ArXiv ID: 2508.02422

Authors: Yu-Qin Chen, Shi-Xin Zhang

Abstract: The reliability of artificial intelligence hinges on the integrity of its training data, a foundation often compromised by noise and corruption. Here, through a comparative study of classical and quantum neural networks on both classical and quantum data, we reveal a fundamental difference in their response to data corruption. We find that classical models exhibit brittle memorization, leading to a failure in generalization. In contrast, quantum models demonstrate remarkable resilience, which is underscored by a phase transition-like response to increasing label noise, revealing a critical point beyond which the model's performance changes qualitatively. We further establish and investigate the field of quantum machine unlearning, the process of efficiently forcing a trained model to forget corrupting influences. We show that the brittle nature of the classical model forms rigid, stubborn memories of erroneous data, making efficient unlearning challenging, while the quantum model is significantly more amenable to efficient forgetting with approximate unlearning methods. Our findings establish that quantum machine learning can possess a dual advantage of intrinsic resilience and efficient adaptability, providing a promising paradigm for the trustworthy and robust artificial intelligence of the future.

Comment: The paper explores the resilience and unlearning capabilities of quantum machine learning, an emerging direction in AI with potential foundational implications.

Relevance: 8 Novelty: 8

21. Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis

ArXiv ID: 2508.01219

Authors: Anzhe Cheng, Chenzhong Yin, Mingxi Cheng, Shukai Duan, Shahin Nazarian, Paul Bogdan

Abstract: The remarkable success of Deep Neural Networks (DNN) is driven by gradient-based optimization, yet this process is often undermined by its tendency to produce disordered weight structures, which harms feature clarity and degrades learning dynamics. To address this fundamental representational flaw, we introduce the Eigen Neural Network (ENN), a novel architecture that reparameterizes each layer's weights in a layer-shared, learned orthonormal eigenbasis. This design enforces decorrelated, well-aligned weight dynamics axiomatically, rather than through regularization, leading to more structured and discriminative feature representations. When integrated with standard BP, ENN consistently outperforms state-of-the-art methods on large-scale image classification benchmarks, including ImageNet, and its superior representations generalize to set a new benchmark in cross-modal image-text retrieval. Furthermore, ENN's principled structure enables a highly efficient, backpropagation-free (BP-free) local learning variant, ENN-ℓ. This variant not only resolves BP's procedural bottlenecks to achieve over 2× training speedup via parallelism, but also, remarkably, surpasses the accuracy of end-to-end backpropagation. ENN thus presents a new architectural paradigm that directly remedies the representational deficiencies of BP, leading to enhanced performance and enabling a more efficient, parallelizable training regime.

Comment: The paper introduces the Eigen Neural Network, a novel architecture that reparameterizes weights in a learned orthonormal eigenbasis, which aligns with the interest in model architecture innovations.

Relevance: 8 Novelty: 8
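
Sketch: the abstract says weights are reparameterized in an orthonormal eigenbasis, but this digest does not give the ENN parameterization itself. As a rough illustration of the general idea only, the snippet below assumes a symmetric weight matrix W = Q diag(λ) Qᵀ, with Q obtained by Gram-Schmidt orthonormalization. That form, and the names `gram_schmidt` and `weights_from_eigenbasis`, are assumptions rather than the paper's method.

```python
import math


def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            proj = sum(a * b for a, b in zip(w, q))
            w = [a - proj * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append([a / norm for a in w])
    return basis


def weights_from_eigenbasis(basis, eigenvalues):
    """Build W = sum_k lambda_k * q_k q_k^T from orthonormal vectors q_k."""
    n = len(basis[0])
    return [
        [sum(lam * q[i] * q[j] for lam, q in zip(eigenvalues, basis))
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 2-D example: raw vectors are orthonormalized, then scaled by
# per-direction eigenvalues to form the weight matrix.
Q = gram_schmidt([[1.0, 1.0], [1.0, 0.0]])
W = weights_from_eigenbasis(Q, [3.0, 1.0])
print(W)
```

In this toy form, the eigenvalues control how strongly each orthonormal direction contributes, and the directions stay decorrelated by construction rather than through a regularization term.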

22. What are you sinking? A geometric approach on attention sink

ArXiv ID: 2508.02546

Authors: Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

Abstract: Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact, but it is the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types, centralized, distributed, and bidirectional, that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for both architecture design and the relationship with AS.

Comment: The paper provides a geometric analysis of attention mechanisms in transformers, offering insights into architectural components and their effects, aligning with the model architecture criterion.