Personalized Daily Arxiv Papers 02/19/2025

gpt-4o	Prompt	Completion	Total
Token	61827	9142	70969
Cost	$0.15	$0.09	$0.24

Total ArXiv papers: 655

Total scanned papers: 402

Total relevant papers: 42

Table of contents with paper titles:

MeMo: Towards Language Models with Associative Memory Mechanisms Authors: Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli
Accurate Expert Predictions in MoE Inference via Cross-Layer Gate Authors: Zhiyuan Fang, Zicong Hong, Yuegui Huang, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng
Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models Authors: Gyeongman Kim, Gyouk Chu, Eunho Yang
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
Independence Tests for Language Models Authors: Sally Zhu, Ahmed Ahmed, Rohith Kuditipudi, Percy Liang
Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks Authors: Markus J. Buehler
Optimal Brain Iterative Merging: Mitigating Interference in LLM Merging Authors: Zhixiang Wang, Zhenyu Mao, Yixuan Qiao, Yunfang Wu, Biye Li
Stability-based Generalization Bounds for Variational Inference Authors: Yadi Wei, Roni Khardon
GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning Authors: Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models Authors: Shuqi Liu, Han Wu, Bowei He, Xiongwei Han, Mingxuan Yuan, Linqin Song
Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization Authors: Aditya Ranganath, Mukesh Singhal, Roummel Marcia
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections Authors: Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan
GoRA: Gradient-driven Adaptive Low Rank Adaptation Authors: Haonan He, Peng Ye, Yuchen Ren, Yuan Yuan, Lei Chen
Understanding Generalization in Transformers: Error Bounds and Training Dynamics Under Benign and Harmful Overfitting Authors: Yingying Zhang, Zhenyu Wu, Jian Li, Yong Liu
QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models Authors: Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang
Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs Authors: Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci
Electron flow matching for generative reaction mechanism prediction obeying conservation laws Authors: Joonyoung F. Joung, Mun Hong Fong, Nicholas Casetti, Jordan P. Liles, Ne S. Dassanayake, Connor W. Coley
Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement Authors: Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading Authors: Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar
Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification Authors: Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, Dylan J. Foster
Keep what you need: extracting efficient subnetworks from large audio representation models Authors: David Genova, Philippe Esling, Tom Hurlin
A Neural Difference-of-Entropies Estimator for Mutual Information Authors: Haoran Ni, Martin Lotz
Stability Bounds for Smooth Optimal Transport Maps and their Statistical Implications Authors: Sivaraman Balakrishnan, Tudor Manole
Efficient Neural SDE Training using Wiener-Space Cubature Authors: Luke Snow, Vikram Krishnamurthy
Scalable Model Merging with Progressive Layer-wise Distillation Authors: Jing Xu, Jiazheng Li, Jingzhao Zhang
Learning the symmetric group: large from small Authors: Max Petschack, Alexandr Garbali, Jan de Gier
SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs Authors: Ahmed F. AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J. Pablo Muñoz, Vui Seng Chua, Nilesh Jain, Mohamed S. Abdelfattah
Unveiling Mode Connectivity in Graph Neural Networks Authors: Bingheng Li, Zhikai Chen, Haoyu Han, Shenglai Zeng, Jingzhe Liu, Jiliang Tang
An Interpretable Automated Mechanism Design Framework with Large Language Models Authors: Jiayuan Liu, Mingyu Guo, Vincent Conitzer
Generalized Kernel Inducing Points by Duality Gap for Dataset Distillation Authors: Tatsuya Aoyama, Hanting Yang, Hiroyuki Hanada, Satoshi Akahane, Tomonari Tanaka, Yoshito Okura, Yu Inatsu, Noriaki Hashimoto, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro Takeuchi
Enhanced uncertainty quantification variational autoencoders for the solution of Bayesian inverse problems Authors: Andrea Tonini, Luca Dede'
Tuning Algorithmic and Architectural Hyperparameters in Graph-Based Semi-Supervised Learning with Provable Guarantees Authors: Ally Yalei Du, Eric Huang, Dravyansh Sharma
Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product Authors: Pengxiang Lan, Haoyu Xu, Enneng Yang, Yuliang Liang, Guibing Guo, Jianzhe Zhao, Xingwei Wang
Asymptotic Optimism of Random-Design Linear and Kernel Regression Models Authors: Hengrui Luo, Yunzhang Zhu
B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability Authors: Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs Authors: Batu El, Deepro Choudhury, Pietro Liò, Chaitanya K. Joshi
GPU Memory Usage Optimization for Backward Propagation in Deep Network Training Authors: Ding-Yong Hong, Tzu-Hsien Tsai, Ning Wang, Pangfeng Liu, Jan-Jan Wu
Spiking Vision Transformer with Saccadic Attention Authors: Shuai Wang, Malu Zhang, Dehao Zhang, Ammar Belatreche, Yichen Xiao, Yu Liang, Yimeng Shan, Qian Sun, Enqi Zhang, Yang Yang
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals? Authors: Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? Authors: Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu
RM-PoT: Reformulating Mathematical Problems and Solving via Program of Thoughts Authors: Yu Zhang, Shujun Peng, Nengwu Wu, Xinhan Lin, Yang Hu, Jie Tang
DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization Authors: Jiaqi Wang, Yuhang Zhou, Zhixiong Zhang, Qiguang Chen, Yongqiang Chen, James Cheng

1. MeMo: Towards Language Models with Associative Memory Mechanisms

ArXiv ID: 2502.12851

Authors: Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli

Abstract: Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model editing, including forgetting texts. We experimented with the MeMo architecture, showing the memorization power of the one-layer and the multi-layer configurations.

Comment: The paper proposes a novel architecture, MeMo, with associative memory mechanisms for LLMs, which aligns with the model architecture criterion by introducing a new paradigm for memorization and transparency.

Relevance: 10 Novelty: 9

2. Accurate Expert Predictions in MoE Inference via Cross-Layer Gate

ArXiv ID: 2502.12224

Authors: Zhiyuan Fang, Zicong Hong, Yuegui Huang, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offload-based methods have been proposed to address this challenge, but they face difficulties with expert prediction. Inaccurate expert predictions can result in prolonged inference delays. To promote the application of MoE models in edge scenarios, we propose Fate, an offloading system designed for MoE models to enable efficient inference in resource-constrained environments. The key insight behind Fate is that gate inputs from adjacent layers can be effectively used for expert prefetching, achieving high prediction accuracy without additional GPU overhead. Furthermore, Fate employs a shallow-favoring expert caching strategy that increases the expert hit rate to 99%. Additionally, Fate integrates tailored quantization strategies for cache optimization and IO efficiency. Experimental results show that, compared to Load on Demand and Expert Activation Path-based method, Fate achieves up to 4.5x and 1.9x speedups in prefill speed and up to 4.1x and 2.2x speedups in decoding speed, respectively, while maintaining inference quality. Moreover, Fate's performance improvements are scalable across different memory budgets.

Comment: The paper focuses on improving MoE inference efficiency through cross-layer gating and caching strategies, which directly aligns with the topic of Mixture-of-Experts and model efficiency.

Relevance: 10 Novelty: 8
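
Illustration (not from the paper): the abstract's key insight is that the gate input of one layer can be used to prefetch experts for an adjacent layer. The sketch below is one minimal reading of that idea, assuming a plain linear router and top-k selection. The function names, the toy weights, and the hit-rate helper are all hypothetical; Fate's actual predictor, caching policy, and quantization are not described here.

```python
from typing import List


def gate_scores(hidden: List[float], gate_weights: List[List[float]]) -> List[float]:
    """Linear router: one score per expert (dot product of a weight row with the input)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in gate_weights]


def predict_next_experts(hidden: List[float],
                         next_gate_weights: List[List[float]],
                         k: int) -> List[int]:
    """Guess the next layer's top-k experts by feeding it the CURRENT layer's gate input.

    Assumption: hidden states change little between adjacent layers, so the
    next layer's router gives a similar ranking on the current gate input.
    """
    scores = gate_scores(hidden, next_gate_weights)
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:k])


def hit_rate(predicted: List[int], actual: List[int]) -> float:
    """Fraction of the experts actually used that were already prefetched."""
    if not actual:
        return 1.0
    return sum(1 for e in actual if e in predicted) / len(actual)


# Toy example with 4 experts and a 2-dimensional hidden state (made-up numbers).
hidden_now = [1.0, 0.5]
next_layer_gate = [[0.1, 0.0], [0.9, 0.2], [0.5, 0.4], [-0.3, 0.1]]
prefetch = predict_next_experts(hidden_now, next_layer_gate, k=2)
```

In an offloading system, the experts in `prefetch` would be copied to the GPU while the current layer is still computing; a mispredicted expert has to be loaded on demand, which is the delay the abstract refers to.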

3. Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models

ArXiv ID: 2502.12947

Authors: Gyeongman Kim, Gyouk Chu, Eunho Yang

Abstract: With the emergence of Mixture-of-Experts (MoE), the efficient scaling of model size has accelerated the development of large language models in recent years. However, their high memory requirements prevent their use in resource-constrained environments. While knowledge distillation (KD) has been a proven method for model compression, its application to MoE teacher models remains underexplored. Through our investigation, we discover that non-activated experts in MoE models possess valuable knowledge that benefits student models. We further demonstrate that existing KD methods are not optimal for compressing MoE models, as they fail to leverage this knowledge effectively. To address this, we propose two intuitive MoE-specific KD methods for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR), both designed to effectively extract knowledge from all experts. Specifically, KA augments knowledge by sampling experts multiple times, while SAR uses all experts and adjusts the expert weights through router training to provide optimal knowledge. Extensive experiments show that our methods outperform conventional KD methods, demonstrating their effectiveness for MoE teacher models.

Comment: The paper introduces MoE-specific knowledge distillation methods, which directly align with the Mixture-of-Experts (MoE) topic and provide novel insights into leveraging non-activated experts.

Relevance: 10 Novelty: 8

4. Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

ArXiv ID: 2502.13063

Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev

Abstract: A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or a key-value cache. These approaches reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a gap of two orders of magnitude between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.

Comment: The paper explores the limits of embedding space capacity, which is relevant to representation learning and compression. The focus on theoretical limits and optimization is highly novel.

Relevance: 9 Novelty: 9
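
Illustration (not from the paper): the abstract notes that the theoretical information capacity of a real-valued vector far exceeds observed compression rates, even at 16-bit precision. A back-of-envelope version of that argument simply counts raw bits. The sketch assumes every bit pattern is usable and every token costs log2(vocabulary size) bits; the dimension 4096 and vocabulary size 128000 are hypothetical values chosen for the example, and the result is a loose upper bound, not a figure from the paper.

```python
import math


def token_capacity(dim: int, bits_per_value: int, vocab_size: int) -> int:
    """Upper bound on tokens storable in one vector by raw bit counting.

    Assumes uniformly distributed tokens, so each token needs log2(vocab_size) bits.
    """
    total_bits = dim * bits_per_value
    bits_per_token = math.log2(vocab_size)
    return math.floor(total_bits / bits_per_token)


# Hypothetical setting: a 4096-dimensional vector of 16-bit values, 128000-token vocabulary.
capacity = token_capacity(dim=4096, bits_per_value=16, vocab_size=128000)
```

Under these assumptions the bound comes out to a few thousand tokens per vector, which is the kind of headroom the abstract contrasts with the roughly x10 ratios achieved by encoder-based methods.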

5. Independence Tests for Language Models

ArXiv ID: 2502.12292

Authors: Sally Zhu, Ahmed Ahmed, Rohith Kuditipudi, Percy Liang

Abstract: We consider the following problem: given the weights of two models, can we test whether they were trained independently -- i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. These p-values are valid regardless of the composition of either model's training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models versus these copies. We report the p-values from these tests on pairs of 21 open-weight models (210 total pairs) and correctly identify all pairs of non-independent models. Our tests remain effective even if one model was fine-tuned for many tokens. In the unconstrained setting, where we make no assumptions about training procedures, can change model architecture, and allow for adversarial evasion attacks, the previous tests no longer work. Instead, we propose a new test which matches hidden activations between two models, and which is robust to adversarial transformations and to changes in model architecture. The test can also do localized testing: identifying specific non-independent components of models. Though we no longer obtain exact p-values from this, empirically we find it behaves as one and reliably identifies non-independent models. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.

Comment: The paper introduces statistical tests for determining independence between model weights, which is a novel and foundational contribution to understanding model training dynamics.

Relevance: 9 Novelty: 9
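
Illustration (not from the paper): the abstract describes computing exact p-values by simulating exchangeable copies of a model and comparing similarity measures between the original pair and the copies. The sketch below shows the generic Monte Carlo p-value that such a scheme relies on. The choice of copies (random permutations of hidden-unit rows) and the similarity statistic (sum of dot products of aligned rows) are assumptions made for this example; the paper's actual statistics and simulation procedure are not specified in the abstract.

```python
import random
from typing import List, Sequence


def monte_carlo_p_value(observed: float, copy_stats: Sequence[float]) -> float:
    """p = (1 + #copies at least as extreme as observed) / (1 + #copies).

    This is valid when the observed statistic and the copies are exchangeable
    under the null hypothesis.
    """
    exceed = sum(1 for s in copy_stats if s >= observed)
    return (1 + exceed) / (1 + len(copy_stats))


def aligned_similarity(a: List[List[float]], b: List[List[float]]) -> float:
    """Hypothetical statistic: sum of dot products between rows at the same index."""
    return sum(sum(x * y for x, y in zip(ra, rb)) for ra, rb in zip(a, b))


def permutation_test(a: List[List[float]], b: List[List[float]],
                     n_copies: int = 199, seed: int = 0) -> float:
    """Compare (a, b) against copies of b whose rows (hidden units) are shuffled.

    Assumption: under independent random initialization the ordering of hidden
    units carries no information, so shuffled copies are exchangeable with b.
    """
    rng = random.Random(seed)
    observed = aligned_similarity(a, b)
    stats = []
    for _ in range(n_copies):
        shuffled = list(b)      # shallow copy; only the row order changes
        rng.shuffle(shuffled)
        stats.append(aligned_similarity(a, shuffled))
    return monte_carlo_p_value(observed, stats)


# Toy check: a model compared with itself is maximally non-independent.
_rng = random.Random(0)
model = [[_rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
p_dependent = permutation_test(model, model, n_copies=199, seed=1)
```

A small `p_dependent` rejects the null hypothesis of independent initialization. The `1 +` terms in the numerator and denominator keep the p-value valid for a finite number of copies, so it can never be reported as exactly zero.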

6. Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks

ArXiv ID: 2502.13025

Authors: Markus J. Buehler

Abstract: We present an agentic, autonomous graph expansion framework that iteratively structures and refines knowledge in situ. Unlike conventional knowledge graph construction methods relying on static extraction or single-pass learning, our approach couples a reasoning-native large language model with a continually updated graph representation. At each step, the system actively generates new concepts and relationships, merges them into a global graph, and formulates subsequent prompts based on its evolving structure. Through this feedback-driven loop, the model organizes information into a scale-free network characterized by hub formation, stable modularity, and bridging nodes that link disparate knowledge clusters. Over hundreds of iterations, new nodes and edges continue to appear without saturating, while centrality measures and shortest path distributions evolve to yield increasingly distributed connectivity. Our analysis reveals emergent patterns, such as the rise of highly connected 'hub' concepts and the shifting influence of 'bridge' nodes, indicating that agentic, self-reinforcing graph construction can yield open-ended, coherent knowledge structures. Applied to materials design problems, we present compositional reasoning experiments by extracting node-specific and synergy-level principles to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that transcend rote summarization and strengthen the framework's potential for open-ended scientific discovery. We discuss other applications in scientific discovery and outline future directions for enhancing scalability and interpretability.

Comment: The paper introduces a novel framework for self-organizing knowledge networks using graph reasoning and LLMs, which aligns with emerging trends and foundational research in knowledge representation.