Personalized Daily ArXiv Papers 2025-04-18

[gpt-4o]	Prompt	Completion	Total
Token	35479	4816	40295
Cost	$0.09	$0.05	$0.14

Total arXiv papers: 433

Total scanned papers: 254

Total relevant papers: 26

Table of contents with paper titles:

Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models Authors: Yuanbo Tang, Yan Tang, Naifan Zhang, Meixuan Chen, Yang Li
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein
An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research Authors: Patrik Reizinger, Randall Balestriero, David Klindt, Wieland Brendel
Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation Authors: Nairouz Mrabah, Nicolas Richet, Ismail Ben Ayed, Éric Granger
Hierarchical Vector Quantized Graph Autoencoder with Annealing-Based Code Selection Authors: Long Zeng, Jianxiang Yu, Jiapeng Zhu, Qingsong Zhong, Xiang Li
Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation Authors: Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang
On Linear Representations and Pretraining Data Frequency in Language Models Authors: Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar
Memorization: A Close Look at Books Authors: Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes
A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving Authors: Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko
MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models Authors: Junyang Zhang, Tianyi Zhu, Cheng Luo, Anima Anandkumar
A Two-Phase Perspective on Deep Learning Dynamics Authors: Robert de Mello Koch, Animik Ghosh
Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time Authors: Margalit Glasgow, Denny Wu, Joan Bruna
Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models Authors: Liyi Zhang, Veniamin Veselovsky, R. Thomas McCoy, Thomas L. Griffiths
It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
MIB: A Mechanistic Interpretability Benchmark Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
Spectral Algorithms under Covariate Shift Authors: Jun Fan, Zheng-Chu Guo, Lei Shi
Stochastic Gradient Descent in Non-Convex Problems: Asymptotic Convergence with Relaxed Step-Size via Stopping Time Methods Authors: Ruinan Jin, Difei Cheng, Hong Qiao, Xin Shi, Shaodong Liu, Bo Zhang
Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models Authors: Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu
Transferrable Surrogates in Expressive Neural Architecture Search Spaces Authors: Shiwen Qin, Gabriela Kadlecová, Martin Pilát, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson
You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models Authors: Shiwei Ding, Lan Zhang, Zhenlin Wang, Giuseppe Ateniese, Xiaoyong Yuan
Hadamard product in deep learning: Introduction, Advances and Challenges Authors: Grigorios G Chrysos, Yongtao Wu, Razvan Pascanu, Philip Torr, Volkan Cevher
Towards Lossless Token Pruning in Late-Interaction Retrieval Models Authors: Yuxuan Zong, Benjamin Piwowarski
GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs Authors: Kun-Woo Kim, Ji-Hoon Park, Ju-Min Han, Seong-Whan Lee
Simplifying Graph Transformers Authors: Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates
Disentangling Polysemantic Channels in Convolutional Neural Networks Authors: Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth
The Others: Naturally Isolating Out-of-Distribution Samples for Robust Open-Set Semi-Supervised Learning Authors: You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim

1. Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models

ArXiv ID: 2504.12359

Authors: Yuanbo Tang, Yan Tang, Naifan Zhang, Meixuan Chen, Yang Li

Abstract: Mixture-of-Experts based large language models (MoE LLMs) have shown significant promise in multitask adaptability by dynamically routing inputs to specialized experts. Despite their success, the collaborative mechanisms among experts are still not well understood, limiting both the interpretability and optimization of these models. In this paper, we focus on two critical issues: (1) identifying expert collaboration patterns, and (2) optimizing MoE LLMs through expert pruning. To address the first issue, we propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts. For the second issue, we introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts. Our extensive experiments demonstrate that expert collaboration patterns are closely linked to specific input types and exhibit semantic significance across various tasks. Moreover, pruning experiments show that our approach improves overall performance by 2.5% on average, outperforming existing methods. These findings offer valuable insights into enhancing the efficiency and interpretability of MoE LLMs, offering a clearer understanding of expert interactions and improving model optimization.

Comment: The paper explores expert collaboration and pruning in MoE-based LLMs, which is highly relevant to foundational research in model architecture and efficiency.

Relevance: 10 Novelty: 8

2. Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

ArXiv ID: 2504.12463

Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein

Abstract: Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.
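The substitution described in the abstract can be illustrated with a small forward-pass sketch. This is not the authors' implementation (see their repository for that); it is a plain-Python illustration under several assumptions: the class name DefaultMoESketch, the decay value, the softmax router weighting, and the zero initialization of the defaults are all hypothetical choices made here. The sketch shows only how inactive experts contribute their moving-average "default" output, so that every router weight enters the result; it does not compute gradients.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class DefaultMoESketch:
    """Illustrative only: top-k routing with EMA defaults for inactive experts."""

    def __init__(self, experts, dim, top_k=1, decay=0.9):
        self.experts = experts      # list of callables: vector -> vector of length dim
        self.dim = dim
        self.top_k = top_k
        self.decay = decay          # EMA decay (assumed value, not from the paper)
        # One default output per expert, initialized to zeros (an assumption).
        self.defaults = [[0.0] * dim for _ in experts]

    def forward(self, x, router_logits):
        weights = softmax(router_logits)
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        active = set(ranked[: self.top_k])

        out = [0.0] * self.dim
        for i, expert in enumerate(self.experts):
            if i in active:
                # Activated expert: compute its real output and refresh its EMA.
                y = expert(x)
                self.defaults[i] = [
                    self.decay * d + (1.0 - self.decay) * v
                    for d, v in zip(self.defaults[i], y)
                ]
            else:
                # Inactive expert: stand in its EMA default instead of running it.
                y = self.defaults[i]
            # Every expert's router weight appears in the output, which is what
            # would let a router receive a signal from all experts per token.
            out = [o + weights[i] * v for o, v in zip(out, y)]
        return out, active
```

In an autograd framework the same structure would give the router a dense gradient, because each router weight multiplies either a real or a default expert output, while only top_k experts are actually executed.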

Comment: Proposes a method to improve training for sparse Mixture-of-Experts, directly aligning with foundational research in MoE architectures.

Relevance: 10 Novelty: 8

3. An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research

ArXiv ID: 2504.13101

Authors: Patrik Reizinger, Randall Balestriero, David Klindt, Wieland Brendel

Abstract: Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL's empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.

Comment: The paper proposes expanding Identifiability Theory to explain self-supervised learning, which aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 9

4. Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

ArXiv ID: 2504.12436

Authors: Nairouz Mrabah, Nicolas Richet, Ismail Ben Ayed, Éric Granger

Abstract: Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for "local sparsity and global density", which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for "local randomness and global importance", which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
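One way to picture the second paradigm is a single momentum-style update in which the gradient is masked at random and the first moment is then pruned. The sketch below is a loose illustration, not the paper's SO algorithm: the function name sparse_step, the use of absolute magnitude as the importance score, the plain momentum update (no second moment or bias correction), and all default values are assumptions made here.

```python
import random


def sparse_step(params, grads, moment, lr=0.1, beta=0.9,
                keep_grad=0.25, keep_moment=2, rng=random):
    """One illustrative update: random gradient masking, then moment pruning."""
    n = len(params)

    # "Local randomness": keep a random subset of gradient entries this step.
    k = max(1, int(keep_grad * n))
    chosen = set(rng.sample(range(n), k))
    sparse_grad = [g if i in chosen else 0.0 for i, g in enumerate(grads)]

    # Exponential moving average of the (sparsified) gradient.
    moment = [beta * m + (1.0 - beta) * g for m, g in zip(moment, sparse_grad)]

    # "Global importance": keep only the largest-magnitude moment entries.
    # Using |moment| as the importance score is an assumption of this sketch.
    order = sorted(range(n), key=lambda i: abs(moment[i]), reverse=True)
    top = set(order[:keep_moment])
    moment = [m if i in top else 0.0 for i, m in enumerate(moment)]

    # Only parameters with a surviving moment entry are changed in this step.
    new_params = [p - lr * m for p, m in zip(params, moment)]
    return new_params, moment
```

Because a fresh random mask is drawn each step, different parameters can be touched over time even though each individual step changes only a few of them, which is the intuition behind "local sparsity and global density".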

Comment: The paper introduces a sparse optimization framework for few-shot adaptation, which aligns with model compression topics like sparsity and efficiency improvements.