Personalized Daily Arxiv Papers 02/24/2025

[gpt-4o]	Prompt	Completion	Total
Token	37854	5611	43465
Cost	$0.09	$0.06	$0.15

Total ArXiv papers: 467

Total scanned papers: 284

Total relevant papers: 29

Table of contents with paper titles:

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? Authors: Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King
Tight Clusters Make Specialized Experts Authors: Stefan K. Nielsen, Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen
A fast convergence algorithm based on binary integer programming for expert load balancing in MoE LLMs Authors: Yuan Sun
Do we really need the Rademacher complexities? Authors: Daniel Bartl, Shahar Mendelson
Approximating Latent Manifolds in Neural Networks via Vanishing Ideals Authors: Nico Pelleriti, Max Zimmer, Elias Wirth, Sebastian Pokutta
Towards Physics-Guided Foundation Models Authors: Majid Farhadloo, Arun Sharma, Mingzhou Yang, Bharat Jayaprakash, William Northrop, Shashi Shekhar
Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing Authors: Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar
SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention Authors: Hong Yankun, Li Xing, Zhen Hui-Ling, Yu Xianzhi, Liu Wulong, Yuan Mingxuan
Generalization Guarantees for Representation Learning via Data-Dependent Gaussian Mixture Priors Authors: Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski
More for Keys, Less for Values: Adaptive KV Cache Quantization Authors: Mohsen Hariri, Lam Nguyen, Sixu Chen, Shaochen Zhong, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans Authors: Masha Fedzechkina, Eleonora Gualdoni, Sinead Williamson, Katherine Metcalf, Skyler Seto, Barry-John Theobald
Machine-generated text detection prevents language model collapse Authors: George Drayson, Vasileios Lampos
Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference Authors: Yaohua Tang, Zhicheng Hu, Kun Cheng, Fan Mo, Qiheng Lv, Hua Wang, Zhi Chen
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers Authors: Anton Razzhigaev, Matvey Mikhalchuk, Temurbek Rahmatullaev, Elizaveta Goncharova, Polina Druzhinina, Ivan Oseledets, Andrey Kuznetsov
Fréchet Cumulative Covariance Net for Deep Nonlinear Sufficient Dimension Reduction with Random Objects Authors: Hang Yuan, Christina Dan Wang, Zhou Yu
LightThinker: Thinking Step-by-Step Compression Authors: Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language? Authors: Ibrahim Alabdulmohsin, Andreas Steiner
Sparks of cognitive flexibility: self-guided context inference for flexible stimulus-response mapping by attentional routing Authors: Rowan Sommers, Sushrut Thorat, Daniel Anthes, Tim C. Kietzmann
Solving Inverse Problems with Deep Linear Neural Networks: Global Convergence Guarantees for Gradient Descent with Weight Decay Authors: Hannah Laus, Suzanna Parkinson, Vasileios Charisopoulos, Felix Krahmer, Rebecca Willett
On the Robustness of Transformers against Context Hijacking for Linear Classification Authors: Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer Authors: Marthe Ballon, Andres Algaba, Vincent Ginis
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps Authors: Yen-Che Hsiao, Abhishek Dutta
Scale-Free Graph-Language Models Authors: Jianglin Lu, Yixuan Liu, Yitian Zhang, Yun Fu
Curvature Corrected Nonnegative Manifold Data Factorization Authors: Joyce Chew, Willem Diepeveen, Deanna Needell
AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms Authors: Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen
PIP-KAG: Mitigating Knowledge Conflicts in Knowledge-Augmented Generation via Parametric Pruning Authors: Pengcheng Huang, Zhenghao Liu, Yukun Yan, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong
The Multi-Faceted Monosemanticity in Multimodal Representations Authors: Hanqi Yan, Xiangxiang Cui, Lu Yin, Paul Pu Liang, Yulan He, Yifei Wang
When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models Authors: Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue
EvoP: Robust LLM Inference via Evolutionary Pruning Authors: Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-wei Kuo, Nan Guan, Chun Jason Xue

1. Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

ArXiv ID: 2502.15657

Authors: Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King

Abstract: The leading AI companies are increasingly focused on building generalist AI agents -- systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.

Comment: Author match

2. Tight Clusters Make Specialized Experts

ArXiv ID: 2502.15315

Authors: Stefan K. Nielsen, Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen

Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.

Comment: The paper proposes an Adaptive Clustering router for Sparse Mixture-of-Experts (MoE), directly addressing foundational aspects of MoE architectures and improving their robustness and performance.

Relevance: 10 Novelty: 9
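
The idea of scaling features by how tightly each expert's cluster sits along them can be illustrated with a small NumPy sketch. This is not the paper's AC router: the inverse-standard-deviation weighting, the function names `feature_weights` and `route`, and the dot-product scoring are all assumptions chosen only to make the intuition concrete.

```python
import numpy as np

def feature_weights(tokens, assignments, num_experts, eps=1e-6):
    """Hypothetical per-expert feature weights: larger where the cluster is tight.

    Assumption: tightness is measured by the inverse per-feature standard
    deviation of the tokens currently assigned to each expert.
    """
    dim = tokens.shape[1]
    weights = np.ones((num_experts, dim))
    for k in range(num_experts):
        members = tokens[assignments == k]
        if len(members) < 2:
            continue  # too few tokens to estimate spread; keep uniform weights
        inverse_spread = 1.0 / (members.std(axis=0) + eps)
        # Normalise so the weights of each expert sum to the feature dimension.
        weights[k] = inverse_spread * dim / inverse_spread.sum()
    return weights

def route(tokens, expert_embeddings, weights):
    """Score each token against each expert in that expert's rescaled space."""
    # scores[n, k] = sum_d weights[k, d] * tokens[n, d] * expert_embeddings[k, d]
    scores = np.einsum("nd,kd,kd->nk", tokens, weights, expert_embeddings)
    return scores.argmax(axis=1), scores

rng = np.random.default_rng(0)
# Cluster 0 is tight along feature 0; cluster 1 is tight along feature 1.
tokens = np.vstack([
    np.column_stack([rng.normal(1.0, 0.01, 50), rng.normal(0.0, 1.0, 50)]),
    np.column_stack([rng.normal(0.0, 1.0, 50), rng.normal(1.0, 0.01, 50)]),
])
assignments = np.array([0] * 50 + [1] * 50)
weights = feature_weights(tokens, assignments, num_experts=2)
chosen, scores = route(tokens, np.eye(2), weights)
```

In this toy setup each expert ends up weighting the feature along which its tokens vary least, so noisy features contribute less to the routing score.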

3. A fast convergence algorithm based on binary integer programming for expert load balancing in MoE LLMs

ArXiv ID: 2502.15451

Authors: Yuan Sun

Abstract: MoE (Mixture-of-Experts) architectures appear frequently in large language models, and the number of experts can now exceed one hundred. However, the expert load imbalance problem always arises in MoE model pre-training, which can cause routing collapse or increased computational overhead. In order to balance loads across experts, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q that can help change the top-K order of s by solving a binary integer program at very small time cost. In simulation experiments, we observe that BIP-Based Balancing makes the imbalance disappear very quickly, while the final sum of routing scores decreases very little. Our algorithm achieves a nearly perfect trade-off between expert load balance and pre-training efficiency in simulation.

Comment: The paper proposes a binary integer programming-based algorithm for expert load balancing in MoE architectures, directly addressing a key challenge in MoE training and efficiency.

Relevance: 10 Novelty: 8
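
To make the role of an offset vector concrete, the sketch below shows how a per-expert vector q can reorder a top-K selection and even out expert loads. It does not solve a binary integer program and is not the paper's algorithm: it swaps in a simple iterative update that raises q for overloaded experts. It also assumes that s denotes a token-by-expert routing score matrix; the names `top_k_experts`, `expert_loads`, and `fit_offsets` are hypothetical.

```python
import numpy as np

def top_k_experts(scores, q, k):
    """Pick the k experts with the highest offset-adjusted score per token."""
    return np.argsort(-(scores - q), axis=1)[:, :k]

def expert_loads(scores, q, k):
    """Count how many token slots each expert receives."""
    chosen = top_k_experts(scores, q, k)
    return np.bincount(chosen.ravel(), minlength=scores.shape[1])

def fit_offsets(scores, k, steps=200, step_size=0.01):
    """Assumed heuristic: raise q for overloaded experts, lower it for idle ones."""
    num_tokens, num_experts = scores.shape
    target = num_tokens * k / num_experts  # perfectly balanced load
    q = np.zeros(num_experts)
    for _ in range(steps):
        load = expert_loads(scores, q, k)
        q += step_size * (load - target) / target
    return q

rng = np.random.default_rng(0)
scores = rng.random((512, 8))
scores[:, 0] += 0.5  # expert 0 is strongly favoured before balancing
before = expert_loads(scores, np.zeros(8), k=2)
q = fit_offsets(scores, k=2)
after = expert_loads(scores, q, k=2)
```

Comparing `before` and `after` shows the favoured expert losing most of its excess load, while every token still selects exactly k experts.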

4. Do we really need the Rademacher complexities?

ArXiv ID: 2502.15118

Authors: Daniel Bartl, Shahar Mendelson

Abstract: We study the fundamental problem of learning with respect to the squared loss in a convex class. The state-of-the-art sample complexity estimates in this setting rely on Rademacher complexities, which are generally difficult to control. We prove that, contrary to prevailing belief and under minimal assumptions, the sample complexity is not governed by the Rademacher complexities but rather by the behaviour of the limiting gaussian process. In particular, all such learning problems that have the same $L_2$-structure -- even those with heavy-tailed distributions -- share the same sample complexity. This constitutes the first universality result for general convex learning problems. The proof is based on a novel learning procedure, and its performance is studied by combining optimal mean estimation techniques for real-valued random variables with Talagrand's generic chaining method.

Comment: The paper challenges the reliance on Rademacher complexities for learning problems and introduces a novel universality result, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 9
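
For readers unfamiliar with the quantity named in the title, the standard textbook definition of the empirical Rademacher complexity is given below. The notation is introduced here and is not taken from the paper: $F$ is the function class, $X_1, \dots, X_n$ is the sample, and $\varepsilon_1, \dots, \varepsilon_n$ are independent random signs.

```latex
\[
  \widehat{\mathcal{R}}_n(F)
  = \mathbb{E}_{\varepsilon}
    \left[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i) \right],
  \qquad
  \mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = \tfrac{1}{2}.
\]
```

The supremum depends on the actual values $f(X_i)$, not only on second moments, which is one reason the quantity can be hard to control for heavy-tailed data.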

5. Approximating Latent Manifolds in Neural Networks via Vanishing Ideals

ArXiv ID: 2502.15051

Authors: Nico Pelleriti, Max Zimmer, Elias Wirth, Sebastian Pokutta

Abstract: Deep neural networks have reshaped modern machine learning by learning powerful latent representations that often align with the manifold hypothesis: high-dimensional data lie on lower-dimensional manifolds. In this paper, we establish a connection between manifold learning and computational algebra by demonstrating how vanishing ideals can characterize the latent manifolds of deep networks. To that end, we propose a new neural architecture that (i) truncates a pretrained network at an intermediate layer, (ii) approximates each class manifold via polynomial generators of the vanishing ideal, and (iii) transforms the resulting latent space into linearly separable features through a single polynomial layer. The resulting models have significantly fewer layers than their pretrained baselines, while maintaining comparable accuracy, achieving higher throughput, and utilizing fewer parameters. Furthermore, drawing on spectral complexity analysis, we derive sharper theoretical guarantees for generalization, showing that our approach can in principle offer tighter bounds than standard deep networks. Numerical experiments confirm the effectiveness and efficiency of the proposed approach.

Comment: The paper connects manifold learning with computational algebra using vanishing ideals, proposing a novel architecture for latent manifold approximation. It aligns well with representation learning and architectural innovation.

Relevance: 9 Novelty: 9
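
As a toy illustration of what a generator of a vanishing ideal looks like, the sketch below recovers a polynomial that vanishes on points sampled from the unit circle. It uses a plain SVD over monomial features, which is a simple stand-in and not the construction used in the paper; the function names are hypothetical.

```python
from itertools import combinations_with_replacement

import numpy as np

def monomial_features(points, degree):
    """Evaluate all monomials up to `degree` at each point (constant term first)."""
    num_points, dim = points.shape
    columns = [np.ones(num_points)]
    for d in range(1, degree + 1):
        for variables in combinations_with_replacement(range(dim), d):
            columns.append(np.prod(points[:, list(variables)], axis=1))
    return np.column_stack(columns)

def approx_vanishing_polynomial(points, degree):
    """Return unit-norm coefficients minimising the polynomial's values on the points."""
    features = monomial_features(points, degree)
    _, singular_values, vt = np.linalg.svd(features, full_matrices=False)
    # The right singular vector with the smallest singular value is the
    # coefficient vector whose polynomial is closest to zero on the data.
    return vt[-1], singular_values[-1]

angles = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
coefficients, residual = approx_vanishing_polynomial(circle, degree=2)
# Monomial order for two variables at degree 2: 1, x, y, x^2, x*y, y^2.
scaled = coefficients / coefficients[3]
```

After rescaling, the coefficients correspond to x^2 + y^2 - 1, the defining equation of the circle. For real latent features the smallest singular value would only be approximately zero, so a tolerance would be needed.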

6. Towards Physics-Guided Foundation Models

ArXiv ID: 2502.15013

Authors: Majid Farhadloo, Arun Sharma, Mingzhou Yang, Bharat Jayaprakash, William Northrop, Shashi Shekhar

Abstract: Traditional foundation models are pre-trained on broad datasets to reduce the training resources (e.g., time, energy, labeled samples) needed for fine-tuning a wide range of downstream tasks. However, traditional foundation models struggle with out-of-distribution prediction and can produce outputs that are unrealistic and physically infeasible. We propose the notion of physics-guided foundation models (PGFM), that is, foundation models integrated with broad or general domain (e.g., scientific) physical knowledge applicable to a wide range of downstream tasks.

Comment: The paper introduces the concept of physics-guided foundation models, which aligns with the 'AI for Science' criterion by proposing a new paradigm integrating physical knowledge into foundation models.

Relevance: 9 Novelty: 9

7. Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing

ArXiv ID: 2502.15618

Authors: Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar

Abstract: We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing, using just 1.5% of FLOPs, can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at https://github.com/Qi-Le1/Probe_Pruning.

Comment: The paper introduces a novel dynamic pruning framework for LLMs, which aligns with model compression and efficiency breakthroughs.