Personalized Daily Arxiv Papers 02/19/2025

gpt-4o   Prompt   Completion   Total
Token    61827    9142         70969
Cost     $0.15    $0.09        $0.24

Total ArXiv papers: 655

Total scanned papers: 402

Total relevant papers: 42

Table of contents with paper titles:

  1. MeMo: Towards Language Models with Associative Memory Mechanisms Authors: Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli

  2. Accurate Expert Predictions in MoE Inference via Cross-Layer Gate Authors: Zhiyuan Fang, Zicong Hong, Yuegui Huang, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

  3. Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models Authors: Gyeongman Kim, Gyouk Chu, Eunho Yang

  4. Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev

  5. Independence Tests for Language Models Authors: Sally Zhu, Ahmed Ahmed, Rohith Kuditipudi, Percy Liang

  6. Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks Authors: Markus J. Buehler

  7. Optimal Brain Iterative Merging: Mitigating Interference in LLM Merging Authors: Zhixiang Wang, Zhenyu Mao, Yixuan Qiao, Yunfang Wu, Biye Li

  8. Stability-based Generalization Bounds for Variational Inference Authors: Yadi Wei, Roni Khardon

  9. GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning Authors: Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang

  10. Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models Authors: Shuqi Liu, Han Wu, Bowei He, Xiongwei Han, Mingxuan Yuan, Linqin Song

  11. Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization Authors: Aditya Ranganath, Mukesh Singhal, Roummel Marcia

  12. MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections Authors: Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

  13. GoRA: Gradient-driven Adaptive Low Rank Adaptation Authors: Haonan He, Peng Ye, Yuchen Ren, Yuan Yuan, Lei Chen

  14. Understanding Generalization in Transformers: Error Bounds and Training Dynamics Under Benign and Harmful Overfitting Authors: Yingying Zhang, Zhenyu Wu, Jian Li, Yong Liu

  15. QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models Authors: Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang

  16. Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs Authors: Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci

  17. Electron flow matching for generative reaction mechanism prediction obeying conservation laws Authors: Joonyoung F. Joung, Mun Hong Fong, Nicholas Casetti, Jordan P. Liles, Ne S. Dassanayake, Connor W. Coley

  18. Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement Authors: Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan

  19. HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading Authors: Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar

  20. Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification Authors: Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, Dylan J. Foster

  21. Keep what you need : extracting efficient subnetworks from large audio representation models Authors: David Genova, Philippe Esling, Tom Hurlin

  22. A Neural Difference-of-Entropies Estimator for Mutual Information Authors: Haoran Ni, Martin Lotz

  23. Stability Bounds for Smooth Optimal Transport Maps and their Statistical Implications Authors: Sivaraman Balakrishnan, Tudor Manole

  24. Efficient Neural SDE Training using Wiener-Space Cubature Authors: Luke Snow, Vikram Krishnamurthy

  25. Scalable Model Merging with Progressive Layer-wise Distillation Authors: Jing Xu, Jiazheng Li, Jingzhao Zhang

  26. Learning the symmetric group: large from small Authors: Max Petschack, Alexandr Garbali, Jan de Gier

  27. SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs Authors: Ahmed F. AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J. Pablo Muñoz, Vui Seng Chua, Nilesh Jain, Mohamed S. Abdelfattah

  28. Unveiling Mode Connectivity in Graph Neural Networks Authors: Bingheng Li, Zhikai Chen, Haoyu Han, Shenglai Zeng, Jingzhe Liu, Jiliang Tang

  29. An Interpretable Automated Mechanism Design Framework with Large Language Models Authors: Jiayuan Liu, Mingyu Guo, Vincent Conitzer

  30. Generalized Kernel Inducing Points by Duality Gap for Dataset Distillation Authors: Tatsuya Aoyama, Hanting Yang, Hiroyuki Hanada, Satoshi Akahane, Tomonari Tanaka, Yoshito Okura, Yu Inatsu, Noriaki Hashimoto, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro Takeuchi

  31. Enhanced uncertainty quantification variational autoencoders for the solution of Bayesian inverse problems Authors: Andrea Tonini, Luca Dedè

  32. Tuning Algorithmic and Architectural Hyperparameters in Graph-Based Semi-Supervised Learning with Provable Guarantees Authors: Ally Yalei Du, Eric Huang, Dravyansh Sharma

  33. Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product Authors: Pengxiang Lan, Haoyu Xu, Enneng Yang, Yuliang Liang, Guibing Guo, Jianzhe Zhao, Xingwei Wang

  34. Asymptotic Optimism of Random-Design Linear and Kernel Regression Models Authors: Hengrui Luo, Yunzhang Zhu

  35. B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability Authors: Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg

  36. Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs Authors: Batu El, Deepro Choudhury, Pietro Liò, Chaitanya K. Joshi

  37. GPU Memory Usage Optimization for Backward Propagation in Deep Network Training Authors: Ding-Yong Hong, Tzu-Hsien Tsai, Ning Wang, Pangfeng Liu, Jan-Jan Wu

  38. Spiking Vision Transformer with Saccadic Attention Authors: Shuai Wang, Malu Zhang, Dehao Zhang, Ammar Belatreche, Yichen Xiao, Yu Liang, Yimeng Shan, Qian Sun, Enqi Zhang, Yang Yang

  39. Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals? Authors: Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi

  40. Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? Authors: Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu

  41. RM-PoT: Reformulating Mathematical Problems and Solving via Program of Thoughts Authors: Yu Zhang, Shujun Peng, Nengwu Wu, Xinhan Lin, Yang Hu, Jie Tang

  42. DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization Authors: Jiaqi Wang, Yuhang Zhou, Zhixiong Zhang, Qiguang Chen, Yongqiang Chen, James Cheng


1. MeMo: Towards Language Models with Associative Memory Mechanisms

ArXiv ID: 2502.12851

Authors: Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli

Abstract: Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model editing, including forgetting texts. We experimented with the MeMo architecture, showing the memorization power of the one-layer and the multi-layer configurations.

Comment: The paper proposes a novel architecture, MeMo, with associative memory mechanisms for LLMs, which aligns with the model architecture criterion by introducing a new paradigm for memorization and transparency.

Relevance: 10 Novelty: 9


2. Accurate Expert Predictions in MoE Inference via Cross-Layer Gate

ArXiv ID: 2502.12224

Authors: Zhiyuan Fang, Zicong Hong, Yuegui Huang, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offload-based methods have been proposed to address this challenge, but they face difficulties with expert prediction. Inaccurate expert predictions can result in prolonged inference delays. To promote the application of MoE models in edge scenarios, we propose Fate, an offloading system designed for MoE models to enable efficient inference in resource-constrained environments. The key insight behind Fate is that gate inputs from adjacent layers can be effectively used for expert prefetching, achieving high prediction accuracy without additional GPU overhead. Furthermore, Fate employs a shallow-favoring expert caching strategy that increases the expert hit rate to 99%. Additionally, Fate integrates tailored quantization strategies for cache optimization and IO efficiency. Experimental results show that, compared to Load on Demand and Expert Activation Path-based method, Fate achieves up to 4.5x and 1.9x speedups in prefill speed and up to 4.1x and 2.2x speedups in decoding speed, respectively, while maintaining inference quality. Moreover, Fate's performance improvements are scalable across different memory budgets.

Comment: The paper focuses on improving MoE inference efficiency through cross-layer gating and caching strategies, which directly aligns with the topic of Mixture-of-Experts and model efficiency.

Relevance: 10 Novelty: 8


3. Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models

ArXiv ID: 2502.12947

Authors: Gyeongman Kim, Gyouk Chu, Eunho Yang

Abstract: With the emergence of Mixture-of-Experts (MoE), the efficient scaling of model size has accelerated the development of large language models in recent years. However, their high memory requirements prevent their use in resource-constrained environments. While knowledge distillation (KD) has been a proven method for model compression, its application to MoE teacher models remains underexplored. Through our investigation, we discover that non-activated experts in MoE models possess valuable knowledge that benefits student models. We further demonstrate that existing KD methods are not optimal for compressing MoE models, as they fail to leverage this knowledge effectively. To address this, we propose two intuitive MoE-specific KD methods for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR), both designed to effectively extract knowledge from all experts. Specifically, KA augments knowledge by sampling experts multiple times, while SAR uses all experts and adjusts the expert weights through router training to provide optimal knowledge. Extensive experiments show that our methods outperform conventional KD methods, demonstrating their effectiveness for MoE teacher models.

Comment: The paper introduces MoE-specific knowledge distillation methods, which directly align with the Mixture-of-Experts (MoE) topic and provide novel insights into leveraging non-activated experts.

Relevance: 10 Novelty: 8


4. Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

ArXiv ID: 2502.13063

Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev

Abstract: A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a gap of two orders of magnitude between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.

Comment: The paper explores the limits of embedding space capacity, which is relevant to representation learning and compression. The focus on theoretical limits and optimization is highly novel.

Relevance: 9 Novelty: 9


5. Independence Tests for Language Models

ArXiv ID: 2502.12292

Authors: Sally Zhu, Ahmed Ahmed, Rohith Kuditipudi, Percy Liang

Abstract: We consider the following problem: given the weights of two models, can we test whether they were trained independently -- i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. These p-values are valid regardless of the composition of either model's training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models versus these copies. We report the p-values from these tests on pairs of 21 open-weight models (210 total pairs) and correctly identify all pairs of non-independent models. Our tests remain effective even if one model was fine-tuned for many tokens. In the unconstrained setting, where we make no assumptions about training procedures, can change model architecture, and allow for adversarial evasion attacks, the previous tests no longer work. Instead, we propose a new test which matches hidden activations between two models, and which is robust to adversarial transformations and to changes in model architecture. The test can also do localized testing: identifying specific non-independent components of models. Though we no longer obtain exact p-values from this, empirically we find it behaves as one and reliably identifies non-independent models. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.

Comment: The paper introduces statistical tests for determining independence between model weights, which is a novel and foundational contribution to understanding model training dynamics.

Relevance: 9 Novelty: 9
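
The constrained-setting procedure in the abstract above (simulate exchangeable copies of a model, then compare a similarity statistic against those copies) has the general shape of a Monte Carlo permutation test. The following is a minimal sketch of that shape only, not the authors' implementation. It assumes a toy "model" that is a list of weight rows, assumes that shuffling rows (hidden units) yields an exchangeable copy under the null hypothesis, and uses cosine similarity as the statistic. All function names are hypothetical.

```python
import random


def cosine(a, b):
    """Cosine similarity between two flat lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def flatten(rows):
    """Concatenate the rows of a weight matrix into one flat list."""
    return [x for row in rows for x in row]


def permutation_p_value(w_a, w_b, n_copies=199, seed=0):
    """Monte Carlo p-value for H0: w_a and w_b were trained independently.

    Assumption (toy stand-in): shuffling the rows of w_a produces an
    exchangeable copy of model A under H0.
    """
    rng = random.Random(seed)
    observed = cosine(flatten(w_a), flatten(w_b))
    exceed = 0
    for _ in range(n_copies):
        copy = w_a[:]          # shallow copy, rows are only reordered
        rng.shuffle(copy)
        if cosine(flatten(copy), flatten(w_b)) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one of the draws.
    return (1 + exceed) / (1 + n_copies)


rng = random.Random(1)
base = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
# "Fine-tuned" model: a small perturbation of the base weights.
finetuned = [[x + 0.05 * rng.gauss(0, 1) for x in row] for row in base]
# Independent model: fresh random weights.
independent = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]

p_dependent = permutation_p_value(base, finetuned)
p_independent = permutation_p_value(base, independent)
```

With 199 simulated copies the smallest attainable p-value is 1/200, which is what the perturbed pair should reach in this toy setup; the independent pair can land anywhere in (0, 1].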


6. Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks

ArXiv ID: 2502.13025

Authors: Markus J. Buehler

Abstract: We present an agentic, autonomous graph expansion framework that iteratively structures and refines knowledge in situ. Unlike conventional knowledge graph construction methods relying on static extraction or single-pass learning, our approach couples a reasoning-native large language model with a continually updated graph representation. At each step, the system actively generates new concepts and relationships, merges them into a global graph, and formulates subsequent prompts based on its evolving structure. Through this feedback-driven loop, the model organizes information into a scale-free network characterized by hub formation, stable modularity, and bridging nodes that link disparate knowledge clusters. Over hundreds of iterations, new nodes and edges continue to appear without saturating, while centrality measures and shortest path distributions evolve to yield increasingly distributed connectivity. Our analysis reveals emergent patterns, such as the rise of highly connected 'hub' concepts and the shifting influence of 'bridge' nodes, indicating that agentic, self-reinforcing graph construction can yield open-ended, coherent knowledge structures. Applied to materials design problems, we present compositional reasoning experiments by extracting node-specific and synergy-level principles to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that transcend rote summarization and strengthen the framework's potential for open-ended scientific discovery. We discuss other applications in scientific discovery and outline future directions for enhancing scalability and interpretability.

Comment: The paper introduces a novel framework for self-organizing knowledge networks using graph reasoning and LLMs, which aligns with emerging trends and foundational research in knowledge representation.

Relevance: 9 Novelty: 8


7. Optimal Brain Iterative Merging: Mitigating Interference in LLM Merging

ArXiv ID: 2502.12217

Authors: Zhixiang Wang, Zhenyu Mao, Yixuan Qiao, Yunfang Wu, Biye Li

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities, but their high computational costs pose challenges for customization. Model merging offers a cost-effective alternative, yet existing methods suffer from interference among parameters, leading to performance degradation. In this work, we propose Optimal Brain Iterative Merging (OBIM), a novel method designed to mitigate both intra-model and inter-model interference. OBIM consists of two key components: (1) A saliency measurement mechanism that evaluates parameter importance based on loss changes induced by individual weight alterations, reducing intra-model interference by preserving only high-saliency parameters. (2) A mutually exclusive iterative merging framework, which incrementally integrates models using a binary mask to avoid direct parameter averaging, thereby mitigating inter-model interference. We validate OBIM through experiments on both Supervised Fine-Tuned (SFT) models and post-pretrained checkpoints. The results show that OBIM significantly outperforms existing merging techniques. Overall, OBIM provides an effective and practical solution for enhancing LLM merging.

Comment: The paper proposes a novel method for mitigating interference in LLM merging, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8
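
The two components named in the abstract above (keeping only high-saliency parameters, then merging models one at a time under a binary mask so that no position is averaged) can be illustrated with a short sketch. This is not the OBIM implementation: the saliency here is simply the magnitude of each weight change, used as a labeled stand-in for the loss-based saliency the abstract describes, and the function names are hypothetical.

```python
def top_k_mask(saliency, keep):
    """Binary mask that keeps the `keep` highest-saliency positions."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(saliency))]


def exclusive_merge(base, finetuned_models, keep):
    """Merge models so that each position is written by at most one model."""
    merged = base[:]
    occupied = [0] * len(base)   # 1 once some model has claimed the position
    for weights in finetuned_models:
        delta = [w - b for w, b in zip(weights, base)]
        # Assumption: |delta| is a cheap proxy for parameter saliency.
        mask = top_k_mask([abs(d) for d in delta], keep)
        for i, m in enumerate(mask):
            if m and not occupied[i]:
                merged[i] = base[i] + delta[i]
                occupied[i] = 1
    return merged, occupied


base = [0.0, 0.0, 0.0, 0.0, 0.0]
model_a = [0.9, 0.1, 0.0, -0.7, 0.0]
model_b = [0.8, 0.0, 0.6, 0.0, -0.05]
merged, occupied = exclusive_merge(base, [model_a, model_b], keep=2)
# merged is [0.9, 0.0, 0.6, -0.7, 0.0]: position 0 stays with model_a
# because it was claimed first, so the two updates are never averaged.
```

In this sketch the merge order decides which model wins a contested position; how the paper orders or resolves such conflicts is not stated in the abstract.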


8. Stability-based Generalization Bounds for Variational Inference

ArXiv ID: 2502.12353

Authors: Yadi Wei, Roni Khardon

Abstract: Variational inference (VI) is widely used for approximate inference in Bayesian machine learning. In addition to this practical success, generalization bounds for variational inference and related algorithms have been developed, mostly through the connection to PAC-Bayes analysis. A second line of work has provided algorithm-specific generalization bounds through stability arguments or using mutual information bounds, and has shown that the bounds are tight in practice, but unfortunately these bounds do not directly apply to approximate Bayesian algorithms. This paper fills this gap by developing algorithm-specific stability-based generalization bounds for a class of approximate Bayesian algorithms that includes VI, specifically when using stochastic gradient descent to optimize their objective. As in the non-Bayesian case, the generalization error is bounded by expected parameter differences on a perturbed dataset. The new approach complements PAC-Bayes analysis and can provide tighter bounds in some cases. An experimental illustration shows that the new approach yields non-vacuous bounds on modern neural network architectures and datasets and that it can shed light on performance differences between variant approximate Bayesian algorithms.

Comment: The paper develops stability-based generalization bounds for variational inference, which aligns with foundational research in representation learning and theoretical insights into training dynamics.

Relevance: 9 Novelty: 8


9. GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning

ArXiv ID: 2502.12913

Authors: Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang

Abstract: Fine-tuning technologies for Large Language Models (LLMs) have achieved remarkable results. However, traditional LLM fine-tuning approaches face significant challenges: they require large Floating Point (FP) computation, raising privacy concerns when handling sensitive data, and are impractical for resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT) techniques reduce trainable parameters, their reliance on floating-point arithmetic creates fundamental incompatibilities with edge hardware. In this work, we introduce a novel framework for on-device LLM fine-tuning that eliminates the need for floating-point operations in both inference and training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer format, which efficiently represents model parameters in integer format using shared exponents among parameter groups. When combined with LoRA-like adapters, this enables fully integer-based fine-tuning that is both memory and compute efficient. We demonstrate that our approach achieves accuracy comparable to FP16-based fine-tuning while significantly reducing memory usage (50%). Moreover, compared to FP8, our method reduces power consumption by 5x and chip area by 11x at the same performance, making large-scale model adaptation feasible on edge devices.

Comment: The paper proposes a fully quantized training framework for LLM fine-tuning, which aligns with model compression and efficiency topics. It introduces a novel integer-based approach for on-device fine-tuning.

Relevance: 9 Novelty: 8
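
The idea of storing integers that share one exponent per parameter group, as described in the abstract above, can be sketched as a generic block-floating-point quantizer. This is an illustration under stated assumptions, not the GSQ-Tuning format itself: the group size, the 8-bit width, the power-of-two scale selection, and all function names are hypothetical choices made for the example.

```python
import math


def quantize_group(values, bits=8):
    """Return (shared_exponent, integers) with value ~= integer * 2**exponent."""
    qmax = 2 ** (bits - 1) - 1          # largest signed integer magnitude
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale for which the largest value still fits.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exponent, ints


def dequantize_group(exponent, ints):
    """Map the stored integers back to floats using the shared exponent."""
    scale = 2.0 ** exponent
    return [q * scale for q in ints]


def quantize(values, group_size=4, bits=8):
    """Quantize a flat list of weights group by group."""
    return [
        quantize_group(values[i:i + group_size], bits)
        for i in range(0, len(values), group_size)
    ]


weights = [0.5, -0.25, 0.125, 1.0, 100.0, -50.0, 25.0, 12.5]
packed = quantize(weights)
restored = [v for exponent, ints in packed for v in dequantize_group(exponent, ints)]
```

The first group gets exponent -6 and is reconstructed exactly, while the second group gets exponent 0, so 12.5 is stored as 12 (Python's `round` rounds halves to the nearest even integer). Each group's rounding error is bounded by half of its own scale, which is the motivation for sharing exponents per group instead of per tensor.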


10. Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models

ArXiv ID: 2502.12420

Authors: Shuqi Liu, Han Wu, Bowei He, Xiongwei Han, Mingxuan Yuan, Linqin Song

Abstract: Recent advances in large language models have led to numerous task-specialized fine-tuned variants, creating a need for efficient model merging techniques that preserve specialized capabilities while avoiding costly retraining. While existing task vector-based merging methods show promise, they typically apply uniform coefficients across all parameters, overlooking varying parameter importance both within and across tasks. We present Sens-Merging, a sensitivity-guided coefficient adjustment method that enhances existing model merging techniques by operating at both task-specific and cross-task levels. Our method analyzes parameter sensitivity within individual tasks and evaluates cross-task transferability to determine optimal merging coefficients. Extensive experiments on Mistral 7B and LLaMA2-7B/13B models demonstrate that Sens-Merging significantly improves performance across general knowledge, mathematical reasoning, and code generation tasks. Notably, when combined with existing merging techniques, our method enables merged models to outperform specialized fine-tuned models, particularly in code generation tasks. Our findings reveal important trade-offs between task-specific and cross-task scalings, providing insights for future model merging strategies.

Comment: The paper introduces a sensitivity-guided method for merging LLMs, which aligns with foundational research in LLM architecture and efficiency improvements.

Relevance: 9 Novelty: 8


11. Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization

ArXiv ID: 2502.12298

Authors: Aditya Ranganath, Mukesh Singhal, Roummel Marcia

Abstract: Stochastic gradient descent and other first-order variants, such as Adam and AdaGrad, are commonly used in the field of deep learning due to their computational efficiency and low-storage memory requirements. However, these methods do not exploit curvature information. Consequently, iterates can converge to saddle points or poor local minima. On the other hand, Quasi-Newton methods compute Hessian approximations which exploit this information with a comparable computational budget. Quasi-Newton methods re-use previously computed iterates and gradients to compute a low-rank structured update. The most widely used quasi-Newton update is the L-BFGS, which guarantees a positive semi-definite Hessian approximation, making it suitable in a line search setting. However, the loss functions in DNNs are non-convex, where the Hessian is potentially non-positive definite. In this paper, we propose using a limited-memory symmetric rank-one quasi-Newton approach which allows for indefinite Hessian approximations, enabling directions of negative curvature to be exploited. Furthermore, we use a modified adaptive regularized cubics approach, which generates a sequence of cubic subproblems that have closed-form solutions with suitable regularization choices. We investigate the performance of our proposed method on autoencoders and feed-forward neural network models and compare our approach to state-of-the-art first-order adaptive stochastic methods as well as other quasi-Newton methods.

Comment: The paper explores a novel quasi-Newton method for deep learning optimization, which aligns with foundational research in training dynamics and representation learning. The use of cubic regularization and indefinite Hessian approximations is a notable theoretical contribution.

Relevance: 9 Novelty: 8
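
The point made in the abstract above, that a symmetric rank-one (SR1) update can produce an indefinite Hessian approximation, is easy to see on a two-dimensional example. The sketch below implements the textbook dense SR1 formula, B_new = B + r r^T / (r^T s) with r = y - B s, and is not the paper's limited-memory variant or its cubic-regularized solver. The skip tolerance and function names are assumptions made for illustration.

```python
def mat_vec(m, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sr1_update(b, s, y, tol=1e-8):
    """Dense SR1 update of the Hessian approximation b.

    s is the step (x_new - x_old) and y is the gradient difference.
    The update is skipped when the denominator is too small to be safe.
    """
    n = len(s)
    bs = mat_vec(b, s)
    r = [yi - bsi for yi, bsi in zip(y, bs)]
    denom = sum(ri * si for ri, si in zip(r, s))
    r_norm = sum(ri * ri for ri in r) ** 0.5
    s_norm = sum(si * si for si in s) ** 0.5
    if abs(denom) <= tol * r_norm * s_norm:
        return [row[:] for row in b]
    return [[b[i][j] + r[i] * r[j] / denom for j in range(n)] for i in range(n)]


identity = [[1.0, 0.0], [0.0, 1.0]]
step = [1.0, 0.0]
grad_diff = [-2.0, 1.0]            # s^T y < 0: negative curvature along the step
new_b = sr1_update(identity, step, grad_diff)
secant = mat_vec(new_b, step)      # should reproduce grad_diff
```

Here the updated matrix satisfies the secant condition and has a negative diagonal entry, so it is not positive semi-definite. A BFGS-style update would normally be skipped or damped on this pair because the curvature condition s^T y > 0 fails.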


12. MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections

ArXiv ID: 2502.12170

Authors: Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

Abstract: We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8X-2.4X compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer .

Comment: The paper introduces MUDD connections to improve Transformers, which is highly relevant to architectural innovations. The dynamic dense connections and their impact on efficiency are novel contributions.

Relevance: 9 Novelty: 8


13. GoRA: Gradient-driven Adaptive Low Rank Adaptation

ArXiv ID: 2502.12171

Authors: Haonan He, Peng Ye, Yuchen Ren, Yuan Yuan, Lei Chen

Abstract: Low-Rank Adaptation (LoRA) is a crucial method for efficiently fine-tuning pretrained large language models (LLMs), with its performance largely influenced by two key factors: rank and initialization strategy. Numerous LoRA variants have been proposed to enhance its performance by addressing these factors. However, these variants often compromise LoRA's usability or efficiency. In this paper, we analyze the fundamental limitations of existing methods and introduce a novel approach, GoRA (Gradient-driven Adaptive Low Rank Adaptation), which adaptively assigns ranks and initializes weights for low-rank adapters simultaneously based on gradient information. Extensive experimental results demonstrate that GoRA significantly improves performance while preserving the high usability and efficiency of LoRA. On the T5 model fine-tuned for the GLUE benchmark, GoRA achieves a 5.88-point improvement over LoRA and slightly surpasses full fine-tuning. Similarly, on the Llama3.1-8B-Base model fine-tuned for GSM8k tasks, GoRA outperforms LoRA with a 5.13-point improvement and exceeds full fine-tuning in high-rank settings by a margin of 2.05 points.

Comment: The paper proposes GoRA, a novel gradient-driven adaptive low-rank adaptation method, which directly aligns with model compression and efficiency topics, particularly low-rank approaches.

Relevance: 9 Novelty: 8


14. Understanding Generalization in Transformers: Error Bounds and Training Dynamics Under Benign and Harmful Overfitting

ArXiv ID: 2502.12508

Authors: Yingying Zhang, Zhenyu Wu, Jian Li, Yong Liu

Abstract: Transformers serve as the foundational architecture for many successful large-scale models, demonstrating the ability to overfit the training data while maintaining strong generalization on unseen data, a phenomenon known as benign overfitting. However, research on how the training dynamics influence error bounds within the context of benign overfitting has been limited. This paper addresses this gap by developing a generalization theory for a two-layer transformer with labeled flip noise. Specifically, we present generalization error bounds for both benign and harmful overfitting under varying signal-to-noise ratios (SNR), where the training dynamics are categorized into three distinct stages, each with its corresponding error bounds. Additionally, we conduct extensive experiments to identify key factors that influence test errors in transformers. Our experimental results align closely with the theoretical predictions, validating our findings.

Comment: The paper develops a generalization theory for transformers, addressing error bounds and training dynamics under overfitting scenarios. This aligns with foundational research on model architecture and training dynamics.

Relevance: 9 Novelty: 8


15. QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models

ArXiv ID: 2502.12346

Authors: Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang

Abstract: Large Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. However, quantization often degrades model performance, thus fine-tuning is required for various down-stream tasks. Traditional fine-tuning methods such as stochastic gradient descent and Adam optimization require backpropagation, which is error-prone in the low-precision settings. To overcome these limitations, we propose the Quantized Zeroth-Order (QuZO) framework, specifically designed for fine-tuning LLMs through low-precision (e.g., 4- or 8-bit) forward passes. Our method can avoid the error-prone low-precision straight-through estimator, and utilizes optimized stochastic rounding to mitigate the increased bias. QuZO simplifies the training process, while achieving results comparable to first-order methods in FP8 and superior accuracy in INT8 and INT4 training. Experiments demonstrate that low-bit training QuZO achieves performance comparable to MeZO optimization on GLUE, Multi-Choice, and Generation tasks, while reducing memory cost by 2.94x in LLaMA2-7B fine-tuning compared to quantized first-order methods.

Comment: The paper introduces a novel quantized zeroth-order fine-tuning framework for LLMs, which aligns with the model compression criterion, specifically addressing low-precision training and optimization.

Relevance: 9 Novelty: 8
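
The two ingredients named in the QuZO abstract, stochastic rounding and forward-only (zeroth-order) gradient estimates, can be illustrated with a minimal sketch. This is a generic two-point estimator, not the authors' implementation: the function names, the choice to quantize the perturbed parameters, and all hyperparameters are assumptions made for illustration.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability equal to
    the fractional remainder, so the result is unbiased in expectation."""
    scaled = x / step
    low = math.floor(scaled)
    round_up = rng.random() < (scaled - low)
    return (low + (1 if round_up else 0)) * step


def zo_gradient(loss, params, eps, step, rng):
    """Two-point zeroth-order gradient estimate using only forward evaluations."""
    u = [rng.gauss(0.0, 1.0) for _ in params]
    # Assumption: the perturbed parameters are quantized before each forward pass.
    plus = [stochastic_round(p + eps * ui, step, rng) for p, ui in zip(params, u)]
    minus = [stochastic_round(p - eps * ui, step, rng) for p, ui in zip(params, u)]
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


def zo_sgd(loss, params, lr=0.05, eps=0.1, step=0.01, iters=200, seed=0):
    """Plain SGD driven by the zeroth-order estimate (no backpropagation)."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(iters):
        grad = zo_gradient(loss, params, eps, step, rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


if __name__ == "__main__":
    target = [1.0, -2.0, 0.5]
    quadratic = lambda w: sum((a - b) ** 2 for a, b in zip(w, target))
    print(quadratic(zo_sgd(quadratic, [0.0, 0.0, 0.0])))
```

On this toy quadratic the loss shrinks using forward evaluations alone; the sketch keeps full-precision master weights, which is a simplification rather than something the abstract specifies.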


16. Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs

ArXiv ID: 2502.12216

Authors: Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci

Abstract: Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full attention. However, these methods overlook variations in the importance of attention across heads, layers, and contexts. To address these limitations, we propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism that dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. By setting a target fraction of total attention scores, Tactic ensures that token selection naturally adapts to variations in attention sparsity. To efficiently approximate this selection, Tactic leverages clustering-based sorting and distribution fitting, allowing it to accurately estimate token importance with minimal computational overhead. We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup. This improvement translates to an overall 1.58x end-to-end inference speedup, making Tactic a practical and effective solution for long-context LLM inference in accuracy-sensitive applications.

Comment: The paper introduces Tactic, a sparse attention mechanism for long-context LLMs, which aligns with the model compression criterion by addressing efficiency in attention mechanisms.

Relevance: 9 Novelty: 8
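
The selection rule described in the Tactic abstract (keep tokens until a target fraction of total attention mass is covered, rather than a fixed count) can be sketched as follows. This is the exact, sort-based version that assumes the attention weights are already known; the clustering and distribution-fitting approximations the paper uses to avoid computing them are not shown, and the function name is hypothetical.

```python
def select_tokens(attn_weights, target_fraction):
    """Return the indices of the smallest set of highest-weight tokens whose
    cumulative attention reaches `target_fraction` of the total."""
    order = sorted(range(len(attn_weights)),
                   key=lambda i: attn_weights[i], reverse=True)
    total = sum(attn_weights)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += attn_weights[i]
        if mass >= target_fraction * total:
            break
    return sorted(selected)


if __name__ == "__main__":
    peaked = [0.90, 0.04, 0.03, 0.02, 0.01]  # one dominant token
    flat = [0.2, 0.2, 0.2, 0.2, 0.2]         # attention spread evenly
    print(select_tokens(peaked, 0.85))  # a single token suffices
    print(select_tokens(flat, 0.85))    # every token is needed
```

The same target fraction yields a budget of one token for the peaked head and five for the flat one, which is the adaptivity a fixed token budget cannot provide.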


17. Electron flow matching for generative reaction mechanism prediction obeying conservation laws

ArXiv ID: 2502.12979

Authors: Joonyoung F. Joung, Mun Hong Fong, Nicholas Casetti, Jordan P. Liles, Ne S. Dassanayake, Connor W. Coley

Abstract: Central to our understanding of chemical reactivity is the principle of mass conservation, which is fundamental for ensuring physical consistency, balancing equations, and guiding reaction design. However, data-driven computational models for tasks such as reaction product prediction rarely abide by this most basic constraint. In this work, we recast the problem of reaction prediction as a problem of electron redistribution using the modern deep generative framework of flow matching. Our model, FlowER, overcomes limitations inherent in previous approaches by enforcing exact mass conservation, thereby resolving hallucinatory failure modes, recovering mechanistic reaction sequences for unseen substrate scaffolds, and generalizing effectively to out-of-domain reaction classes with extremely data-efficient fine-tuning. FlowER additionally enables estimation of thermodynamic or kinetic feasibility and manifests a degree of chemical intuition in reaction prediction tasks. This inherently interpretable framework represents a significant step in bridging the gap between predictive accuracy and mechanistic understanding in data-driven reaction outcome prediction.

Comment: The paper introduces FlowER, a generative framework for reaction mechanism prediction that enforces conservation laws, aligning with AI for Science by addressing foundational challenges in chemical modeling.

Relevance: 9 Novelty: 8


18. Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement

ArXiv ID: 2502.12214

Authors: Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan

Abstract: Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches typically force each layer to assume multiple roles with a predetermined number of iterations, restricting efficiency and adaptability. In this work, we propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method. We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers. Furthermore, we introduce a Zero-Token Mechanism, an internal architectural component rather than an input token, to guide layer-specific computation. At each cycle, the model retrieves a zero token (with trainable key values) from a Zero-Token Pool, integrating it alongside regular tokens in the attention mechanism. The corresponding attention scores not only reflect each layer's computational importance but also enable dynamic early exits without sacrificing overall model accuracy. Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models for enhanced efficiency and adaptability.

Comment: The Zero Token Transformer introduces architectural innovations like parameter cycling and zero-token mechanisms, which align with the model architecture criterion.

Relevance: 9 Novelty: 8


19. HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading

ArXiv ID: 2502.12574

Authors: Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar

Abstract: Transformer-based large language models (LLMs) demonstrate impressive performance in long context generation. Extending the context length has disproportionately shifted the memory footprint of LLMs during inference to the key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise offloading strategy, keeping only the KV cache of selected attention heads on the GPU while computing attention output dynamically. Through roofline analysis, we demonstrate that HEADINFER maintains computational efficiency while significantly reducing memory footprint. We evaluate HEADINFER on the Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory footprint of the KV cache from 128 GB to 1 GB and the total GPU memory usage from 207 GB to 17 GB, achieving a 92% reduction compared to BF16 baseline inference. Notably, HEADINFER enables 4-million-token inference with an 8B model on a single consumer GPU with 24 GB of memory (e.g., NVIDIA RTX 4090) without approximation methods.

Comment: The HEADINFER method introduces a memory-efficient inference strategy for LLMs by offloading KV cache, which aligns with the model compression and efficiency criterion.

Relevance: 9 Novelty: 8
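
The 128 GB KV-cache figure quoted in the HeadInfer abstract can be reproduced with a short back-of-the-envelope calculation. The model dimensions below (32 layers, 8 KV heads, head dimension 128) are taken from the publicly released Llama-3-8B configuration rather than from the abstract, and reading "1 million tokens" as 2^20 is an assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer and KV head.
    The leading factor 2 accounts for storing both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128  # assumed Llama-3-8B configuration
SEQ_LEN = 2 ** 20                        # assumed meaning of "1 million tokens"

full_cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)  # BF16: 2 bytes
one_head_slice = full_cache // (LAYERS * KV_HEADS)

print(full_cache / GIB)      # 128.0
print(one_head_slice / GIB)  # 0.5
```

A single (layer, head) slice is 1/256 of the full cache, which shows why head-wise granularity shrinks the resident footprint so sharply. The abstract does not say how many slices stay on the GPU at once, so the 0.5 GiB value is only an illustration of the granularity, not a derivation of the reported 1 GB.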


20. Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification

ArXiv ID: 2502.12465

Authors: Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, Dylan J. Foster

Abstract: Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification -- where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq 1$ -- we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: (1) Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. (2) Next-token prediction can be made robust so as to achieve $C=\tilde O(H)$, representing moderate error amplification, but this is an inherent barrier: any next-token prediction-style objective must suffer $C=\Omega(H)$. (3) For the natural testbed of autoregressive linear models, no computationally efficient algorithm can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning algorithm generalizes next-token prediction.

Comment: The paper provides theoretical insights into error amplification in next-token prediction and its computational-statistical tradeoffs, aligning with emerging trends in foundational research.

Relevance: 8 Novelty: 9


21. Keep what you need : extracting efficient subnetworks from large audio representation models

ArXiv ID: 2502.12925

Authors: David Genova, Philippe Esling, Tom Hurlin

Abstract: Recently, research on audio foundation models has witnessed notable advances, as illustrated by the ever-improving results on complex downstream tasks. Subsequently, those pretrained networks have quickly been used for various audio applications. These improvements have, however, resulted in a considerable increase in both the size and complexity of these models. Along with the environmental concerns this issue raises, this prevents the deployment of such networks on consumer-level devices and precludes their use for real-time applications. Moreover, this appears contradictory with the specificity of the tasks for which these models are used, which are often simpler than extracting a rich, multi-purpose representation from any type of audio data. In this paper, we address this issue with a simple yet effective method to extract lightweight specialist subnetworks from large foundation models. Specifically, we introduce learnable binary masks between the layers of a pretrained representation model. When training the end-to-end model on a downstream task, we add a sparsity-inducing loss to the overall objective, hence learning a compact subnetwork specialized on a single task. Importantly, the weights of the foundation model are kept frozen, resulting in low additional training costs. Once trained, the masked computational units can then be removed from the network, implying significant performance gains. We assess our method on three widespread audio foundation models, each based on a different backbone architecture, and illustrate its effectiveness on common audio representation evaluation tasks, as well as its versatility across speech, music, and general audio. Code for reproducing the results and a supporting webpage are available at https://github.com/gnvIRCAM/Audio-representation-trimming

Comment: The paper introduces a method for extracting efficient subnetworks from large audio models using sparsity-inducing losses. This aligns with model compression topics like pruning and sparsity.

Relevance: 9 Novelty: 7


22. A Neural Difference-of-Entropies Estimator for Mutual Information

ArXiv ID: 2502.13085

Authors: Haoran Ni, Martin Lotz

Abstract: Estimating Mutual Information (MI), a key measure of dependence of random quantities without specific modelling assumptions, is a challenging problem in high dimensions. We propose a novel mutual information estimator based on parametrizing conditional densities using normalizing flows, a deep generative model that has gained popularity in recent years. This estimator leverages a block autoregressive structure to achieve improved bias-variance trade-offs on standard benchmark tasks.

Comment: The paper introduces a neural difference-of-entropies estimator for mutual information, which is relevant to representation learning and foundational research in information theory.

Relevance: 8 Novelty: 8
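
The "difference of entropies" in the title refers to the standard identity that writes mutual information as a marginal entropy minus a conditional entropy. The display below states that identity and a plug-in estimator built from two learned densities; the symbols $q_\phi$ (marginal model) and $q_\theta$ (conditional model) are notation assumed here, and the abstract does not confirm that the paper's estimator takes exactly this form.

```latex
\begin{align}
I(X;Y) &= h(X) - h(X \mid Y) \\
       &= \mathbb{E}_{p(x,y)}\!\left[\log p(x \mid y)\right]
        - \mathbb{E}_{p(x)}\!\left[\log p(x)\right], \\
\hat{I}_N &= \frac{1}{N} \sum_{i=1}^{N}
        \left[\log q_\theta(x_i \mid y_i) - \log q_\phi(x_i)\right].
\end{align}
```

Normalizing flows are a natural fit for this construction because they give exact, tractable log-densities for both terms.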


23. Stability Bounds for Smooth Optimal Transport Maps and their Statistical Implications

ArXiv ID: 2502.12326

Authors: Sivaraman Balakrishnan, Tudor Manole

Abstract: We study estimators of the optimal transport (OT) map between two probability distributions. We focus on plugin estimators derived from the OT map between estimates of the underlying distributions. We develop novel stability bounds for OT maps which generalize those in past work, and allow us to reduce the problem of optimally estimating the transport map to that of optimally estimating densities in the Wasserstein distance. In contrast, past work provided a partial connection between these problems and relied on regularity theory for the Monge-Ampere equation to bridge the gap, a step which required unnatural assumptions to obtain sharp guarantees. We also provide some new insights into the connections between stability bounds which arise in the analysis of plugin estimators and growth bounds for the semi-dual functional which arise in the analysis of Brenier potential-based estimators of the transport map. We illustrate the applicability of our new stability bounds by revisiting the smooth setting studied by Manole et al., analyzing two of their estimators under more general conditions. Critically, our bounds do not require smoothness or boundedness assumptions on the underlying measures. As an illustrative application, we develop and analyze a novel tuning parameter-free estimator for the OT map between two strongly log-concave distributions.

Comment: The paper provides stability bounds for optimal transport maps, which is a theoretical contribution relevant to foundational research in representation learning and optimization.

Relevance: 8 Novelty: 8


24. Efficient Neural SDE Training using Wiener-Space Cubature

ArXiv ID: 2502.12395

Authors: Luke Snow, Vikram Krishnamurthy

Abstract: A neural stochastic differential equation (SDE) is an SDE with drift and diffusion terms parametrized by neural networks. The training procedure for neural SDEs consists of optimizing the SDE vector field (neural network) parameters to minimize the expected value of an objective functional on infinite-dimensional path-space. Existing training techniques focus on methods to efficiently compute path-wise gradients of the objective functional with respect to these parameters, then pair this with Monte-Carlo simulation to estimate the expectation, and stochastic gradient descent to optimize. In this work we introduce a novel training technique which bypasses and improves upon Monte-Carlo simulation; we extend results in the theory of Wiener-space cubature to approximate the expected objective functional by a weighted sum of deterministic ODE solutions. This allows us to compute gradients by efficient ODE adjoint methods. Furthermore, we exploit a high-order recombination scheme to drastically reduce the number of ODE solutions necessary to achieve a reasonable approximation. We show that this Wiener-space cubature approach can surpass the O(1/sqrt(n)) rate of Monte-Carlo simulation, or the O(log(n)/n) rate of quasi-Monte-Carlo, to achieve a O(1/n) rate under reasonable assumptions.

Comment: The paper introduces a novel training technique for neural SDEs using Wiener-space cubature, which is relevant to efficiency improvements and foundational methods.

Relevance: 8 Novelty: 8
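
The core idea in the abstract, replacing a Monte-Carlo average over random paths with a weighted sum over a few deterministic ODE solutions, can be written schematically as below. The notation ($\lambda_i$ for cubature weights, $x^{(i)}$ for the ODE solution driven by the $i$-th deterministic path, $F$ for the objective functional) is assumed for illustration, and the rates are those quoted in the abstract.

```latex
\begin{align}
\mathbb{E}\!\left[F(X)\right] &\approx \sum_{i=1}^{n} \lambda_i \, F\!\left(x^{(i)}\right),
\qquad \lambda_i > 0, \quad \sum_{i=1}^{n} \lambda_i = 1, \\
\text{Monte Carlo: } & O\!\left(n^{-1/2}\right), \qquad
\text{quasi-Monte Carlo: } O\!\left(\tfrac{\log n}{n}\right), \qquad
\text{cubature (claimed): } O\!\left(\tfrac{1}{n}\right).
\end{align}
```

Because each $x^{(i)}$ solves an ordinary differential equation, its gradient with respect to the network parameters can be computed with standard ODE adjoint methods, as the abstract notes.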


25. Scalable Model Merging with Progressive Layer-wise Distillation

ArXiv ID: 2502.12706

Authors: Jing Xu, Jiazheng Li, Jingzhao Zhang

Abstract: Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with up to 6.14% and 6.61% improvements in vision and NLU tasks. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.

Comment: The paper proposes a novel layer-wise distillation method for scalable model merging, which is relevant to model compression and efficiency. The theoretical insights into data-agnostic algorithms add to its novelty.

Relevance: 8 Novelty: 8


26. Learning the symmetric group: large from small

ArXiv ID: 2502.12717

Authors: Max Petschack, Alexandr Garbali, Jan de Gier

Abstract: Machine learning explorations can make significant inroads into solving difficult problems in pure mathematics. One advantage of this approach is that mathematical datasets do not suffer from noise, but a challenge is the amount of data required to train these models and that this data can be computationally expensive to generate. Key challenges further comprise difficulty in a posteriori interpretation of statistical models and the implementation of deep and abstract mathematical problems. We propose a method for scalable tasks, by which models trained on simpler versions of a task can then generalize to the full task. Specifically, we demonstrate that a transformer neural-network trained on predicting permutations from words formed by general transpositions in the symmetric group $S_{10}$ can generalize to the symmetric group $S_{25}$ with near 100% accuracy. We also show that $S_{10}$ generalizes to $S_{16}$ with similar performance if we only use adjacent transpositions. We employ identity augmentation as a key tool to manage variable word lengths, and partitioned windows for training on adjacent transpositions. Finally we compare variations of the method used and discuss potential challenges with extending the method to other tasks.

Comment: The paper explores generalization in learning symmetric groups, which is an emerging trend in foundational research. The method of scaling tasks from small to large groups is a novel theoretical contribution.

Relevance: 8 Novelty: 8
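
The task in this abstract, mapping a word of transpositions to the permutation it produces, can be made concrete with a small sketch. The composition convention (each transposition swaps two positions in turn) and the padding scheme (appending trivial swaps so all words share one length) are assumptions; the abstract does not specify how the paper's identity augmentation is implemented.

```python
def word_to_permutation(word, n):
    """Apply a word of transpositions (i, j) in order, starting from the
    identity arrangement of n elements."""
    perm = list(range(n))
    for i, j in word:
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def pad_with_identity(word, length, identity=(0, 0)):
    """Hypothetical identity augmentation: append trivial swaps so that every
    word has the same length without changing the resulting permutation."""
    return list(word) + [identity] * (length - len(word))


def is_adjacent_word(word):
    """True if every non-trivial transposition swaps neighbouring positions."""
    return all(i == j or abs(i - j) == 1 for i, j in word)


if __name__ == "__main__":
    word = [(0, 1), (1, 2)]
    print(word_to_permutation(word, 3))                        # [1, 2, 0]
    print(word_to_permutation(pad_with_identity(word, 6), 3))  # [1, 2, 0]
    print(word_to_permutation(word, 5))                        # [1, 2, 0, 3, 4]
```

The last line shows why small-to-large generalization is plausible: the same word acts identically on the first three elements whether it is read in $S_3$ or $S_5$.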


27. SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs

ArXiv ID: 2502.12444

Authors: Ahmed F. AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J. Pablo Muñoz, Vui Seng Chua, Nilesh Jain, Mohamed S. Abdelfattah

Abstract: Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at a lower cost and power consumption. This acceleration potential for CPUs is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is becoming increasingly utilized with reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a $1.42 \times$ reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique in linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all linear layers with our custom sparse implementation. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation achieving a $1.14 \times$ speedup over the current systems without compromising accuracy. Code: https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SparAMX

Comment: The paper focuses on accelerating LLM token generation on CPUs using sparsity and AMX, which aligns with model compression and efficiency topics. The use of unstructured sparsity in attention computation is novel.

Relevance: 8 Novelty: 8


28. Unveiling Mode Connectivity in Graph Neural Networks

ArXiv ID: 2502.12608

Authors: Bingheng Li, Zhikai Chen, Haoyu Han, Shenglai Zeng, Jingzhe Liu, Jiliang Tang

Abstract: A fundamental challenge in understanding graph neural networks (GNNs) lies in characterizing their optimization dynamics and loss landscape geometry, which is critical for improving interpretability and robustness. While mode connectivity, a lens for analyzing geometric properties of loss landscapes, has proven insightful for other deep learning architectures, its implications for GNNs remain unexplored. This work presents the first investigation of mode connectivity in GNNs. We uncover that GNNs exhibit distinct non-linear mode connectivity, diverging from patterns observed in fully-connected networks or CNNs. Crucially, we demonstrate that graph structure, rather than model architecture, dominates this behavior, with graph properties like homophily correlating with mode connectivity patterns. We further establish a link between mode connectivity and generalization, proposing a generalization bound based on loss barriers and revealing its utility as a diagnostic tool. Our findings further bridge theoretical insights with practical implications: they rationalize domain alignment strategies in graph learning and provide a foundation for refining GNN training paradigms.

Comment: The paper investigates mode connectivity in GNNs, which provides theoretical insights into optimization dynamics and loss landscapes, aligning with representation learning and emerging trends.

Relevance: 8 Novelty: 8
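
The "loss barrier" mentioned in the abstract is commonly measured along the straight line between two trained solutions. The sketch below uses one common definition (the largest excess of the loss on the path over the linear interpolation of the endpoint losses); whether the paper uses exactly this definition is not stated in the abstract, and the toy losses are illustrative stand-ins for a real GNN training loss.

```python
def interpolate(theta_a, theta_b, t):
    """Point at fraction t along the segment from theta_a to theta_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss, theta_a, theta_b, steps=21):
    """Largest amount by which the loss on the linear path exceeds the
    interpolated endpoint losses; 0.0 means the two modes are linearly connected."""
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    barrier = 0.0
    for k in range(steps):
        t = k / (steps - 1)
        baseline = (1.0 - t) * loss_a + t * loss_b
        barrier = max(barrier, loss(interpolate(theta_a, theta_b, t)) - baseline)
    return barrier


if __name__ == "__main__":
    double_well = lambda th: (th[0] ** 2 - 1.0) ** 2  # two minima at -1 and +1
    convex_bowl = lambda th: th[0] ** 2               # single basin
    print(loss_barrier(double_well, [-1.0], [1.0]))   # 1.0
    print(loss_barrier(convex_bowl, [-1.0], [1.0]))   # 0.0
```

A non-zero barrier on the straight path, as in the double-well case, is what "non-linear mode connectivity" implies: the two solutions may still be joined by a curved low-loss path, just not a linear one.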


29. An Interpretable Automated Mechanism Design Framework with Large Language Models

ArXiv ID: 2502.12203

Authors: Jiayuan Liu, Mingyu Guo, Vincent Conitzer

Abstract: Mechanism design has long been a cornerstone of economic theory, with traditional approaches relying on mathematical derivations. Recently, automated approaches, including differentiable economics with neural networks, have emerged for designing payments and allocations. While both analytical and automated methods have advanced the field, they each face significant weaknesses: mathematical derivations are not automated and often struggle to scale to complex problems, while automated and especially neural-network-based approaches suffer from limited interpretability. To address these challenges, we introduce a novel framework that reformulates mechanism design as a code generation task. Using large language models (LLMs), we generate heuristic mechanisms described in code and evolve them to optimize over some evaluation metrics while ensuring key design criteria (e.g., strategy-proofness) through a problem-specific fixing process. This fixing process ensures any mechanism violating the design criteria is adjusted to satisfy them, albeit with some trade-offs in performance metrics. These trade-offs are factored in during the LLM-based evolution process. The code generation capabilities of LLMs enable the discovery of novel and interpretable solutions, bridging the symbolic logic of mechanism design and the generative power of modern AI. Through rigorous experimentation, we demonstrate that LLM-generated mechanisms achieve competitive performance while offering greater interpretability compared to previous approaches. Notably, our framework can rediscover existing manually designed mechanisms and provide insights into neural-network based solutions through Programming-by-Example. These results highlight the potential of LLMs to not only automate but also enhance the transparency and scalability of mechanism design, ensuring safe deployment of the mechanisms in society.

Comment: The paper explores mechanism design using LLMs and introduces a novel framework for code generation and interpretability, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 8 Novelty: 8


30. Generalized Kernel Inducing Points by Duality Gap for Dataset Distillation

ArXiv ID: 2502.12607

Authors: Tatsuya Aoyama, Hanting Yang, Hiroyuki Hanada, Satoshi Akahane, Tomonari Tanaka, Yoshito Okura, Yu Inatsu, Noriaki Hashimoto, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro Takeuchi

Abstract: We propose Duality Gap KIP (DGKIP), an extension of the Kernel Inducing Points (KIP) method for dataset distillation. While existing dataset distillation methods often rely on bi-level optimization, DGKIP eliminates the need for such optimization by leveraging duality theory in convex programming. The KIP method has been introduced as a way to avoid bi-level optimization; however, it is limited to the squared loss and does not support other loss functions (e.g., cross-entropy or hinge loss) that are more suitable for classification tasks. DGKIP addresses this limitation by exploiting an upper bound on parameter changes after dataset distillation using the duality gap, enabling its application to a wider range of loss functions. We also characterize theoretical properties of DGKIP by providing upper bounds on the test error and prediction consistency after dataset distillation. Experimental results on standard benchmarks such as MNIST and CIFAR-10 demonstrate that DGKIP retains the efficiency of KIP while offering broader applicability and robust performance.

Comment: The paper extends Kernel Inducing Points for dataset distillation, which aligns with foundational research in representation learning and efficiency improvements.

Relevance: 8 Novelty: 8


31. Enhanced uncertainty quantification variational autoencoders for the solution of Bayesian inverse problems

ArXiv ID: 2502.13105

Authors: Andrea Tonini, Luca Dede'

Abstract: Among other uses, neural networks are a powerful tool for solving deterministic and Bayesian inverse problems in real time. In the Bayesian framework, variational autoencoders, a specialized type of neural network, enable the estimation of model parameters and their distribution based on observational data, making real-time inverse uncertainty quantification possible. In this work, we build upon existing research [Goh, H. et al., Proceedings of Machine Learning Research, 2022] by proposing a novel loss function to train variational autoencoders for Bayesian inverse problems. When the forward map is affine, we provide a theoretical proof of the convergence of the latent states of variational autoencoders to the posterior distribution of the model parameters. We validate this theoretical result through numerical tests and compare the proposed variational autoencoder with the existing one in the literature. Finally, we test the proposed variational autoencoder on the Laplace equation.

Comment: The paper proposes a novel loss function for variational autoencoders in Bayesian inverse problems, which aligns with foundational research in representation learning and generative models.

Relevance: 8 Novelty: 7


32. Tuning Algorithmic and Architectural Hyperparameters in Graph-Based Semi-Supervised Learning with Provable Guarantees

ArXiv ID: 2502.12937

Authors: Ally Yalei Du, Eric Huang, Dravyansh Sharma

Abstract: Graph-based semi-supervised learning is a powerful paradigm in machine learning for modeling and exploiting the underlying graph structure that captures the relationship between labeled and unlabeled data. A large number of classical as well as modern deep learning based algorithms have been proposed for this problem, often having tunable hyperparameters. We initiate a formal study of tuning algorithm hyperparameters from parameterized algorithm families for this problem. We obtain novel $O(\log n)$ pseudo-dimension upper bounds for hyperparameter selection in three classical label propagation-based algorithm families, where $n$ is the number of nodes, implying bounds on the amount of data needed for learning provably good parameters. We further provide matching $\Omega(\log n)$ pseudo-dimension lower bounds, thus asymptotically characterizing the learning-theoretic complexity of the parameter tuning problem. We extend our study to selecting architectural hyperparameters in modern graph neural networks. We bound the Rademacher complexity for tuning the self-loop weighting in recently proposed Simplified Graph Convolution (SGC) networks. We further propose a tunable architecture that interpolates graph convolutional neural networks (GCN) and graph attention networks (GAT) in every layer, and provide Rademacher complexity bounds for tuning the interpolation coefficient.

Comment: The paper studies hyperparameter tuning in graph-based semi-supervised learning with provable guarantees, which aligns with foundational research in graph neural networks and architectural innovations.

Relevance: 8 Novelty: 7


33. Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product

ArXiv ID: 2502.12200

Authors: Pengxiang Lan, Haoyu Xu, Enneng Yang, Yuliang Liang, Guibing Guo, Jianzhe Zhao, Xingwei Wang

Abstract: Prompt tuning (PT) offers a cost-effective alternative to fine-tuning large-scale pre-trained language models (PLMs), requiring only a few parameters in soft prompt tokens added before the input text. However, existing PT approaches face two significant issues: (i) They overlook intrinsic semantic associations between soft prompt tokens, leading to high discreteness and limited interactions, thus reducing the model's comprehension and effectiveness in complex tasks. (ii) Due to the complexity of downstream tasks, long soft prompts are needed to improve performance, but prompt length correlates positively with memory usage and computational costs. Achieving both high efficiency and high performance remains an ongoing challenge. To address these issues, we propose a novel Low-parameters prompt tuning (LAMP) method, which leverages prompt decomposition and compressed outer product. Specifically, the prompt decomposition module employs Truncated SVD to reduce training parameters and significantly lower the dimensionality of the soft prompt parameter space. It then utilizes a compressed outer product module to facilitate multiple interactions among prompt tokens, exploring their intrinsic associations to enhance knowledge representation. Finally, LAMP uses average pooling to reduce memory usage and training/inference time. Extensive experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.

Comment: The paper proposes a novel prompt tuning method using prompt decomposition and compressed outer product, which aligns with model compression and efficiency topics. It introduces a new approach to reduce memory usage and computational costs.

Relevance: 8 Novelty: 7
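
The parameter savings from decomposing a soft prompt with a truncated SVD can be illustrated with simple arithmetic. The sketch assumes a generic rank-r factorization in which both factor matrices and the r singular values are trained; the abstract does not say which factors LAMP actually trains, and the prompt length, hidden size, and rank below are hypothetical values.

```python
def full_prompt_params(prompt_len, hidden_dim):
    """Trainable parameters in an unfactored soft prompt matrix."""
    return prompt_len * hidden_dim


def low_rank_prompt_params(prompt_len, hidden_dim, rank):
    """Trainable parameters in a rank-r factorization U * diag(s) * V^T:
    U is prompt_len x rank, s has rank entries, V is hidden_dim x rank."""
    return rank * (prompt_len + hidden_dim + 1)


if __name__ == "__main__":
    prompt_len, hidden_dim, rank = 100, 768, 8  # hypothetical sizes
    print(full_prompt_params(prompt_len, hidden_dim))            # 76800
    print(low_rank_prompt_params(prompt_len, hidden_dim, rank))  # 6952
```

With these illustrative sizes the factored prompt needs roughly 9% of the parameters of the full prompt, and the gap widens as the prompt gets longer.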


34. Asymptotic Optimism of Random-Design Linear and Kernel Regression Models

ArXiv ID: 2502.12999

Authors: Hengrui Luo, Yunzhang Zhu

Abstract: We derive the closed-form asymptotic optimism of linear regression models under random designs and generalize it to kernel ridge regression. Using scaled asymptotic optimism as a generic predictive model complexity measure, we study the fundamentally different behaviors of linear regression models, neural tangent kernel (NTK) regression models, and three-layer fully connected neural networks (NN). Our contribution is two-fold: we provide theoretical grounds for using scaled optimism as a predictive model complexity measure, and we show empirically that NNs with ReLUs behave differently from kernel models under this measure. With resampling techniques, we can also compute the optimism for regression models with real data.

Comment: The paper provides theoretical insights into model complexity measures and compares neural networks with kernel models, which aligns with representation learning and training dynamics.

Relevance: 8 Novelty: 7


35. B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

ArXiv ID: 2502.12992

Authors: Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg

Abstract: Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural models. Meanwhile, B-cos networks have been introduced to improve model explainability through architectural and computational adaptations, but their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous B-cos methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we provide practical guidelines for effectively building B-cos LMs based on our findings. Our code is available at https://anonymous.4open.science/r/bcos_lm.

Comment: The paper introduces B-cos LMs for improved explainability in language models, which is relevant to representation learning and interpretability. The adaptation of B-cos networks to NLP tasks is a novel contribution.

Relevance: 8 Novelty: 7


36. Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs

ArXiv ID: 2502.12352

Authors: Batu El, Deepro Choudhury, Pietro Liò, Chaitanya K. Joshi

Abstract: We introduce Attention Graphs, a new tool for mechanistic interpretability of Graph Neural Networks (GNNs) and Graph Transformers based on the mathematical equivalence between message passing in GNNs and the self-attention mechanism in Transformers. Attention Graphs aggregate attention matrices across Transformer layers and heads to describe how information flows among input nodes. Through experiments on homophilous and heterophilous node classification tasks, we analyze Attention Graphs from a network science perspective and find that: (1) When Graph Transformers are allowed to learn the optimal graph structure using all-to-all attention among input nodes, the Attention Graphs learned by the model do not tend to correlate with the input/original graph structure; and (2) For heterophilous graphs, different Graph Transformer variants can achieve similar performance while utilising distinct information flow patterns. Open source code: https://github.com/batu-el/understanding-inductive-biases-of-gnns

Comment: The paper introduces Attention Graphs for mechanistic interpretability of Graph Transformers, which aligns with representation learning and interpretability. The network science perspective adds novelty.

Relevance: 8 Novelty: 7
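
The aggregation step described in the abstract can be sketched as follows. This is one plausible reading, assuming heads are averaged within a layer and layers are composed by matrix multiplication; the abstract does not give the exact rule, and the function names are hypothetical.

```python
def mean_over_heads(heads):
    """Element-wise average of a list of n x n attention matrices."""
    n = len(heads[0])
    return [
        [sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Product of two n x n matrices stored as nested lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def attention_graph(layers):
    """layers[l][h] is the n x n attention matrix of head h in layer l.

    Assumed aggregation: average heads per layer, then left-multiply
    later layers onto earlier ones. Entry [i][j] of the result estimates
    how much input node j contributes to output node i.
    """
    graph = mean_over_heads(layers[0])
    for heads in layers[1:]:
        graph = matmul(mean_over_heads(heads), graph)
    return graph


# Toy example: 2 nodes, 2 layers, 1 head per layer.
layer1 = [[[0.5, 0.5], [0.0, 1.0]]]
layer2 = [[[1.0, 0.0], [0.5, 0.5]]]
graph = attention_graph([layer1, layer2])
```

Because each attention matrix is row-stochastic, every row of the aggregated matrix still sums to 1 under this rule.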


37. GPU Memory Usage Optimization for Backward Propagation in Deep Network Training

ArXiv ID: 2502.12499

Authors: Ding-Yong Hong, Tzu-Hsien Tsai, Ning Wang, Pangfeng Liu, Jan-Jan Wu

Abstract: In modern Deep Learning, it has been a trend to design larger Deep Neural Networks (DNNs) for the execution of more complex tasks and better accuracy. On the other hand, Convolutional Neural Networks (CNNs) have become the standard method for most of computer vision tasks. However, the memory allocation for the intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve the problem. Besides hardware-dependent solutions, a general methodology rematerialization can reduce GPU memory usage by trading computation for memory efficiently. The idea is to select a set of intermediate results during the forward phase as checkpoints, and only save them in memory to reduce memory usage. The backward phase recomputes the intermediate data from the closest checkpoints in memory as needed. This recomputation increases execution time but saves memory by not storing all intermediate results in memory during the forward phase. In this paper, we will focus on efficiently finding the optimal checkpoint subset to achieve the least peak memory usage during the model training. We first describe the theoretical background of the training of a neural network using mathematical equations. We use these equations to identify all essential data required during both forward and backward phases to compute the gradient of weights of the model. We first identify the checkpoint selection problem and propose a dynamic programming algorithm with time complexity O(n^3) to solve the problem of finding the optimal checkpoint subset. With extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis and revise the objective function based on the tracing, and propose an O(n)-time algorithm for finding the optimal checkpoint subset.

Comment: The paper focuses on memory optimization during backward propagation, which aligns with model compression and efficiency topics. The dynamic programming algorithm for checkpoint selection is a novel contribution.

Relevance: 8 Novelty: 7
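
The trade-off behind checkpoint selection can be illustrated with a toy cost model. The sketch below is not the paper's O(n^3) or O(n) algorithm: it assumes a linear chain of layers, assumes peak memory equals the sum of all checkpoint sizes plus the largest run of non-checkpoint activations that must be recomputed at once, and searches subsets exhaustively. The function names are hypothetical.

```python
from itertools import combinations


def peak_memory(sizes, checkpoints):
    """Peak memory under the simplified model described above.

    sizes[i] is the activation size of layer i; `checkpoints` is a set
    of layer indices whose activations stay resident.
    """
    kept = sum(sizes[i] for i in checkpoints)
    largest_segment = 0
    current = 0
    for i, size in enumerate(sizes):
        if i in checkpoints:
            largest_segment = max(largest_segment, current)
            current = 0
        else:
            current += size
    largest_segment = max(largest_segment, current)
    return kept + largest_segment


def best_checkpoints(sizes):
    """Exhaustive search over checkpoint subsets (exponential, small n only)."""
    n = len(sizes)
    best = (peak_memory(sizes, set()), ())
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            cost = peak_memory(sizes, set(subset))
            if cost < best[0]:
                best = (cost, subset)
    return best


# Nine layers with equal activation size 4 (illustrative values).
no_checkpoint_cost = peak_memory([4] * 9, set())   # 36
best_cost, best_subset = best_checkpoints([4] * 9)  # 20, (4,)
```

In this toy case a single checkpoint in the middle of the chain cuts the modeled peak from 36 to 20, at the price of recomputing each half during the backward phase.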


38. Spiking Vision Transformer with Saccadic Attention

ArXiv ID: 2502.12677

Authors: Shuai Wang, Malu Zhang, Dehao Zhang, Ammar Belatreche, Yichen Xiao, Yu Liang, Yimeng Shan, Qian Sun, Enqi Zhang, Yang Yang

Abstract: The combination of Spiking Neural Networks (SNNs) and Vision Transformers (ViTs) holds potential for achieving both energy efficiency and high performance, particularly suitable for edge vision applications. However, a significant performance gap still exists between SNN-based ViTs and their ANN counterparts. Here, we first analyze why SNN-based ViTs suffer from limited performance and identify a mismatch between the vanilla self-attention mechanism and spatio-temporal spike trains. This mismatch results in degraded spatial relevance and limited temporal interactions. To address these issues, we draw inspiration from biological saccadic attention mechanisms and introduce an innovative Saccadic Spike Self-Attention (SSSA) method. Specifically, in the spatial domain, SSSA employs a novel spike distribution-based method to effectively assess the relevance between Query and Key pairs in SNN-based ViTs. Temporally, SSSA employs a saccadic interaction module that dynamically focuses on selected visual areas at each timestep and significantly enhances whole scene understanding through temporal interactions. Building on the SSSA mechanism, we develop a SNN-based Vision Transformer (SNN-ViT). Extensive experiments across various visual tasks demonstrate that SNN-ViT achieves state-of-the-art performance with linear computational complexity. The effectiveness and efficiency of the SNN-ViT highlight its potential for power-critical edge vision applications.

Comment: The paper proposes a Spiking Vision Transformer with a novel Saccadic Spike Self-Attention mechanism, which aligns with architectural innovations in Transformers. The focus on spatio-temporal interactions is relevant to foundational research.

Relevance: 8 Novelty: 7


39. Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?

ArXiv ID: 2502.12206

Authors: Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi

Abstract: As large language models (LLMs) continue to evolve, ensuring their alignment with human goals and values remains a pressing challenge. A key concern is instrumental convergence, where an AI system, in optimizing for a given objective, develops unintended intermediate goals that override the ultimate objective and deviate from human-intended goals. This issue is particularly relevant in reinforcement learning (RL)-trained models, which can generate creative but unintended strategies to maximize rewards. In this paper, we explore instrumental convergence in LLMs by comparing models trained with direct RL optimization (e.g., the o1 model) to those trained with reinforcement learning from human feedback (RLHF). We hypothesize that RL-driven models exhibit a stronger tendency for instrumental convergence due to their optimization of goal-directed behavior in ways that may misalign with human intentions. To assess this, we introduce InstrumentalEval, a benchmark for evaluating instrumental convergence in RL-trained LLMs. Initial experiments reveal cases where a model tasked with making money unexpectedly pursues instrumental objectives, such as self-replication, implying signs of instrumental convergence. Our findings contribute to a deeper understanding of alignment challenges in AI systems and the risks posed by unintended model behaviors.

Comment: The paper investigates instrumental convergence in RL-trained LLMs, which provides theoretical insights into LLM behavior and alignment challenges, aligning with foundational research in LLMs.

Relevance: 8 Novelty: 7


40. Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?

ArXiv ID: 2502.12215

Authors: Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu

Abstract: The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors like QwQ, Deepseek-R1 (R1) and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study found that longer CoTs of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows this phenomenon is closely related to models' self-revision capabilities - longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared to conventional majority voting approaches.

Comment: The paper critiques test-time scaling in LLMs and proposes a novel method for improving scalability, which aligns with the LLM behavior/interpretability criterion.

Relevance: 8 Novelty: 7
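
The idea of combining parallel sampling with CoT length can be sketched as below. This is a hypothetical stand-in, not the paper's scoring rule, which the abstract does not give: it assumes each sampled solution is an (answer, CoT length) pair, picks the answer with the most votes, and breaks ties in favor of the answer whose CoTs are shorter on average.

```python
from collections import defaultdict


def shortest_majority_vote(samples):
    """Pick an answer from parallel samples.

    samples: list of (answer, cot_length) pairs.
    Assumed rule: most votes wins; ties go to the answer with the
    smaller mean CoT length.
    """
    groups = defaultdict(list)
    for answer, length in samples:
        groups[answer].append(length)

    def rank(answer):
        lengths = groups[answer]
        return (-len(lengths), sum(lengths) / len(lengths))

    return min(groups, key=rank)


# Illustrative samples: "A" and "B" tie on votes, "B" has shorter CoTs.
samples = [("A", 900), ("A", 1100), ("B", 300), ("B", 400), ("C", 200)]
choice = shortest_majority_vote(samples)
```

With a clear majority the rule reduces to ordinary majority voting; length only matters when vote counts tie.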


41. RM-PoT: Reformulating Mathematical Problems and Solving via Program of Thoughts

ArXiv ID: 2502.12589

Authors: Yu Zhang, Shujun Peng, Nengwu Wu, Xinhan Lin, Yang Hu, Jie Tang

Abstract: Recently, substantial advancements have been made in training language models to carry out step-by-step reasoning for solving intricate numerical reasoning tasks. Beyond the methods used to solve these problems, the structure and formulation of the problems themselves also play a crucial role in determining the performance of large language models. We observe that even small changes in the surface form of mathematical problems can have a profound impact on both the answer distribution and solve rate. This highlights the vulnerability of LLMs to surface-level variations, revealing its limited robustness when reasoning through complex problems. In this paper, we propose RM-PoT, a three-stage framework that integrates problem reformulation (RM), code-aided reasoning (PoT), and domain-aware few-shot learning to address these limitations. Our approach first reformulates the input problem into diverse surface forms to reduce structural bias, then retrieves five semantically aligned examples from a pre-constructed domain-specific question bank to provide contextual guidance, and finally generates executable Python code for precise computation.

Comment: The paper proposes a framework for reformulating mathematical problems to improve LLM reasoning, which aligns with LLM behavior/interpretability.

Relevance: 8 Novelty: 7


42. DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization

ArXiv ID: 2502.12413

Authors: Jiaqi Wang, Yuhang Zhou, Zhixiong Zhang, Qiguang Chen, Yongqiang Chen, James Cheng

Abstract: Out-of-distribution generalization is a common problem that expects the model to perform well in the different distributions even far from the train data. A popular approach to addressing this issue is invariant learning (IL), in which the model is compelled to focus on invariant features instead of spurious features by adding strong constraints during training. However, there are some potential pitfalls of strong invariant constraints. Due to the limited number of diverse environments and over-regularization in the feature space, it may lead to a loss of important details in the invariant features while alleviating the spurious correlations, namely the over-invariance, which can also degrade the generalization performance. We theoretically define the over-invariance and observe that this issue occurs in various classic IL methods. To alleviate this issue, we propose a simple approach Diverse Invariant Learning (DivIL) by adding the unsupervised contrastive learning and the random masking mechanism compensatory for the invariant constraints, which can be applied to various IL methods. Furthermore, we conduct experiments across multiple modalities across 12 datasets and 6 classic models, verifying our over-invariance insight and the effectiveness of our DivIL framework. Our code is available at https://github.com/kokolerk/DivIL.

Comment: The paper proposes a method to address over-invariance in invariant learning, which is relevant to representation learning and training dynamics.

Relevance: 7 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Use the following relevance criteria to focus on foundational research, avoiding purely application-driven work:

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords for Relevant Domains: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.

Hints on Irrelevant Domains: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.

Hints on Application Tasks: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
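
As an illustration of the requested output format, the following minimal sketch parses a JSONL response and checks that each line carries the four keys named above. The helper name `parse_response` and the sample line are assumptions for illustration; only the key names come from the prompt.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_response(text):
    """Parse a JSONL response: one JSON object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        records.append(record)
    return records


# Hypothetical sample line; ID and scores mirror entry 42 of this digest.
sample = (
    '{"ARXIVID": "2502.12413", '
    '"COMMENT": "Addresses over-invariance in invariant learning.", '
    '"RELEVANCE": 7, "NOVELTY": 7}'
)
records = parse_response(sample)
```

Validating keys at parse time means a malformed line fails loudly instead of silently dropping a paper from the digest.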