Personalized Daily ArXiv Papers 2025-04-18
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 35479 | 4816 | 40295 |
| Cost | $0.09 | $0.05 | $0.14 |
Total arXiv papers: 433
Total scanned papers: 254
Total relevant papers: 26
Table of contents with paper titles:
- Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models
  Authors: Yuanbo Tang, Yan Tang, Naifan Zhang, Meixuan Chen, Yang Li
- Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
  Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein
- An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research
  Authors: Patrik Reizinger, Randall Balestriero, David Klindt, Wieland Brendel
- Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation
  Authors: Nairouz Mrabah, Nicolas Richet, Ismail Ben Ayed, Éric Granger
- Hierarchical Vector Quantized Graph Autoencoder with Annealing-Based Code Selection
  Authors: Long Zeng, Jianxiang Yu, Jiapeng Zhu, Qingsong Zhong, Xiang Li
- Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation
  Authors: Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang
- On Linear Representations and Pretraining Data Frequency in Language Models
  Authors: Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar
- Memorization: A Close Look at Books
  Authors: Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes
- A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
  Authors: Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko
- MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models
  Authors: Junyang Zhang, Tianyi Zhu, Cheng Luo, Anima Anandkumar
- A Two-Phase Perspective on Deep Learning Dynamics
  Authors: Robert de Mello Koch, Animik Ghosh
- Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time
  Authors: Margalit Glasgow, Denny Wu, Joan Bruna
- Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models
  Authors: Liyi Zhang, Veniamin Veselovsky, R. Thomas McCoy, Thomas L. Griffiths
- It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
  Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
- MIB: A Mechanistic Interpretability Benchmark
  Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
- Spectral Algorithms under Covariate Shift
  Authors: Jun Fan, Zheng-Chu Guo, Lei Shi
- Stochastic Gradient Descent in Non-Convex Problems: Asymptotic Convergence with Relaxed Step-Size via Stopping Time Methods
  Authors: Ruinan Jin, Difei Cheng, Hong Qiao, Xin Shi, Shaodong Liu, Bo Zhang
- Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
  Authors: Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu
- Transferrable Surrogates in Expressive Neural Architecture Search Spaces
  Authors: Shiwen Qin, Gabriela Kadlecová, Martin Pilát, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson
- You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models
  Authors: Shiwei Ding, Lan Zhang, Zhenlin Wang, Giuseppe Ateniese, Xiaoyong Yuan
- Hadamard product in deep learning: Introduction, Advances and Challenges
  Authors: Grigorios G Chrysos, Yongtao Wu, Razvan Pascanu, Philip Torr, Volkan Cevher
- Towards Lossless Token Pruning in Late-Interaction Retrieval Models
  Authors: Yuxuan Zong, Benjamin Piwowarski
- GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs
  Authors: Kun-Woo Kim, Ji-Hoon Park, Ju-Min Han, Seong-Whan Lee
- Simplifying Graph Transformers
  Authors: Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates
- Disentangling Polysemantic Channels in Convolutional Neural Networks
  Authors: Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth
- The Others: Naturally Isolating Out-of-Distribution Samples for Robust Open-Set Semi-Supervised Learning
  Authors: You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim
1. Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models
ArXiv ID: 2504.12359
Authors: Yuanbo Tang, Yan Tang, Naifan Zhang, Meixuan Chen, Yang Li
Abstract: Mixture-of-Experts based large language models (MoE LLMs) have shown significant promise in multitask adaptability by dynamically routing inputs to specialized experts. Despite their success, the collaborative mechanisms among experts are still not well understood, limiting both the interpretability and optimization of these models. In this paper, we focus on two critical issues: (1) identifying expert collaboration patterns, and (2) optimizing MoE LLMs through expert pruning. To address the first issue, we propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts. For the second issue, we introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts. Our extensive experiments demonstrate that expert collaboration patterns are closely linked to specific input types and exhibit semantic significance across various tasks. Moreover, pruning experiments show that our approach improves overall performance by 2.5% on average, outperforming existing methods. These findings offer valuable insights into enhancing the efficiency and interpretability of MoE LLMs, offering a clearer understanding of expert interactions and improving model optimization.
Comment: The paper explores expert collaboration and pruning in MoE-based LLMs, which is highly relevant to foundational research in model architecture and efficiency.
Relevance: 10 Novelty: 8
2. Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
ArXiv ID: 2504.12463
Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein
Abstract: Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.
Comment: Proposes a method to improve training for sparse Mixture-of-Experts, directly aligning with foundational research in MoE architectures.
Relevance: 10 Novelty: 8
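As an illustration of the idea in the abstract above (filling in non-activated experts with an exponential moving average of their past outputs), here is a minimal forward-pass sketch. It is not the authors' implementation: the function `default_moe_step`, the scalar toy experts, and the decay `beta` are all hypothetical, and the backward pass that gives the router its dense gradient is not shown.

```python
def default_moe_step(expert_fns, router_probs, x, ema, k=1, beta=0.9):
    """Route one scalar token through a toy MoE layer (illustrative only).

    expert_fns   -- list of callables, one per expert (hypothetical experts)
    router_probs -- routing weight for each expert for this token
    ema          -- running average of each expert's outputs, updated in place
    """
    n = len(expert_fns)
    # Only the top-k experts by routing weight are actually evaluated.
    top = sorted(range(n), key=lambda i: router_probs[i], reverse=True)[:k]
    outputs = []
    for i in range(n):
        if i in top:
            y = expert_fns[i](x)
            # Track a moving average of what this expert tends to output.
            ema[i] = beta * ema[i] + (1 - beta) * y
            outputs.append(y)
        else:
            # Non-activated expert: substitute its stored default output.
            outputs.append(ema[i])
    # Every expert contributes a term, so every routing weight is used.
    return sum(p * y for p, y in zip(router_probs, outputs)), top
```

Because each routing weight multiplies either a real or a default expert output, the combined output depends on all of them, which is the property the abstract describes as giving the router a signal from every expert.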
3. An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research
ArXiv ID: 2504.13101
Authors: Patrik Reizinger, Randall Balestriero, David Klindt, Wieland Brendel
Abstract: Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL's empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.
Comment: The paper proposes expanding Identifiability Theory to explain self-supervised learning, which aligns with foundational research in representation learning and training dynamics.
Relevance: 9 Novelty: 9
4. Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation
ArXiv ID: 2504.12436
Authors: Nairouz Mrabah, Nicolas Richet, Ismail Ben Ayed, Éric Granger
Abstract: Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for *local sparsity and global density*, which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for *local randomness and global importance*, which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
Comment: The paper introduces a sparse optimization framework for few-shot adaptation, which aligns with model compression topics like sparsity and efficiency improvements.
Relevance: 9 Novelty: 8
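The "local randomness and global importance" paradigm described in the abstract above can be sketched roughly as follows. This is a toy reading of the abstract, not the paper's algorithm: the function `sparse_step`, the use of absolute magnitude as the importance score, and the momentum-only update rule are assumptions made for illustration.

```python
import random

def sparse_step(params, grads, moment, keep_grad, keep_moment,
                lr=0.1, beta=0.9, rng=random):
    """One hypothetical sparse update on flat lists of floats."""
    n = len(params)
    # Local randomness: keep a random subset of gradient entries this step.
    chosen = set(rng.sample(range(n), keep_grad))
    sparse_grad = [g if i in chosen else 0.0 for i, g in enumerate(grads)]
    # First-moment (momentum) update using the sparsified gradient.
    moment = [beta * m + (1 - beta) * g for m, g in zip(moment, sparse_grad)]
    # Global importance: keep only the largest-magnitude moment entries
    # (magnitude is an assumed stand-in for the paper's importance score).
    order = sorted(range(n), key=lambda i: abs(moment[i]), reverse=True)
    top = set(order[:keep_moment])
    moment = [m if i in top else 0.0 for i, m in enumerate(moment)]
    # Only parameters with a surviving moment entry move.
    params = [p - lr * m for p, m in zip(params, moment)]
    return params, moment
```

Each call touches at most `keep_moment` parameters, while different random subsets across iterations let many parameters be reached over the course of training.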
5. Hierarchical Vector Quantized Graph Autoencoder with Annealing-Based Code Selection
ArXiv ID: 2504.12715
Authors: Long Zeng, Jianxiang Yu, Jiapeng Zhu, Qingsong Zhong, Xiang Li
Abstract: Graph self-supervised learning has gained significant attention recently. However, many existing approaches heavily depend on perturbations, and inappropriate perturbations may corrupt the graph's inherent information. The Vector Quantized Variational Autoencoder (VQ-VAE) is a powerful autoencoder extensively used in fields such as computer vision; however, its application to graph data remains underexplored. In this paper, we provide an empirical analysis of vector quantization in the context of graph autoencoders, demonstrating its significant enhancement of the model's capacity to capture graph topology. Furthermore, we identify two key challenges associated with vector quantization when applying in graph data: codebook underutilization and codebook space sparsity. For the first challenge, we propose an annealing-based encoding strategy that promotes broad code utilization in the early stages of training, gradually shifting focus toward the most effective codes as training progresses. For the second challenge, we introduce a hierarchical two-layer codebook that captures relationships between embeddings through clustering. The second layer codebook links similar codes, encouraging the model to learn closer embeddings for nodes with similar features and structural topology in the graph. Our proposed model outperforms 16 representative baseline methods in self-supervised link prediction and node classification tasks across multiple datasets.
Comment: The paper introduces a hierarchical vector quantized graph autoencoder, which aligns with foundational research in representation learning and autoencoders.
Relevance: 9 Novelty: 8
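One common way to realize an annealing-based code selection like the one the abstract above describes is a temperature-controlled softmax over negative code distances. The abstract does not give the mechanism, so everything here is an assumption: the softmax form, the exponential schedule, and the names `code_probs` and `temperature_at` are hypothetical.

```python
import math

def code_probs(distances, temperature):
    """Selection probabilities over codes from embedding-to-code distances."""
    logits = [-d / temperature for d in distances]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def temperature_at(step, total_steps, t_start=5.0, t_end=0.05):
    """Exponential decay from t_start to t_end (an assumed schedule)."""
    frac = step / max(1, total_steps - 1)
    return t_start * (t_end / t_start) ** frac

distances = [0.1, 0.5, 0.9]
early = code_probs(distances, temperature_at(0, 100))   # close to uniform
late = code_probs(distances, temperature_at(99, 100))   # nearly one-hot
```

At a high temperature all codes are selected with similar probability, which encourages broad codebook use; as the temperature falls, selection concentrates on the nearest code.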
6. Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation
ArXiv ID: 2504.12637
Authors: Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang
Abstract: Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.
Comment: The paper introduces a novel synthetic data generation strategy for extending LLM context lengths, which aligns with the 'Large Language Models' criterion, particularly in addressing architectural and efficiency challenges.
Relevance: 9 Novelty: 8
7. On Linear Representations and Pretraining Data Frequency in Language Models
ArXiv ID: 2504.12459
Authors: Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar
Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded 'linearly' in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.
Comment: The paper investigates the relationship between pretraining data frequency and linear representations in LLMs, aligning with the 'Representation Learning' criterion as it provides insights into how LLMs encode information.
Relevance: 9 Novelty: 8
8. Memorization: A Close Look at Books
ArXiv ID: 2504.12549
Authors: Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes
Abstract: To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the "prefix-prompting" extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice's Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.
Comment: The paper explores memorization in LLMs and its connection to pretraining data, aligning with the 'Large Language Models' criterion as it provides theoretical insights into LLM behavior.
Relevance: 9 Novelty: 8
9. A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
ArXiv ID: 2504.12984
Authors: Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko
Abstract: Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computations. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels on their supported types. Compared to existing compilers like Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, our VM achieves performance improvements of 1.75x, 2.61x, 1.29x and 1.03x, respectively.
Comment: Presents a virtual machine for low-precision GPGPU computation, which aligns with foundational research in model compression and efficiency.
Relevance: 9 Novelty: 8
10. MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models
ArXiv ID: 2504.12526
Authors: Junyang Zhang, Tianyi Zhu, Cheng Luo, Anima Anandkumar
Abstract: Long-context language models exhibit impressive performance but remain challenging to deploy due to high GPU memory demands during inference. We propose Memory-efficient Offloaded Mini-sequence Inference (MOM), a method that partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading. Experiments on various Llama, Qwen, and Mistral models demonstrate that MOM reduces peak memory usage by over 50% on average. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU, while keeping outputs identical and not compromising accuracy. MOM also maintains highly competitive throughput due to minimal computational overhead and efficient last-layer processing. Compared to traditional chunked prefill methods, MOM achieves a 35% greater context length extension. More importantly, our method drastically reduces prefill memory consumption, eliminating it as the longstanding dominant memory bottleneck during inference. This breakthrough fundamentally changes research priorities, redirecting future efforts from prefill-stage optimizations to improving decode-stage residual KV cache efficiency.
Comment: Proposes a memory-efficient inference method for long-context LLMs, which is relevant to model compression and efficiency breakthroughs.
Relevance: 9 Novelty: 8
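The reason mini-sequence processing can keep outputs identical while lowering peak memory is easiest to see for a position-wise layer, where each token is handled independently. The sketch below assumes such a layer with made-up weights; `expand`, `contract`, and `run_layer` are hypothetical names, and neither KV cache offloading nor the paper's actual layer partitioning is modeled.

```python
def expand(row):
    """Position-wise up-projection plus ReLU with made-up weights (4x wider)."""
    return [max(0.0, w * v) for v in row for w in (1.0, -1.0, 0.5, 2.0)]

def contract(hidden):
    """Position-wise down-projection, here simply a sum."""
    return sum(hidden)

def run_layer(seq, chunk):
    """Apply the layer chunk by chunk; return outputs and peak hidden size."""
    outputs, peak = [], 0
    for start in range(0, len(seq), chunk):
        # The wide intermediate activations exist only for this mini-sequence.
        hidden = [expand(row) for row in seq[start:start + chunk]]
        peak = max(peak, sum(len(h) for h in hidden))
        outputs.extend(contract(h) for h in hidden)
    return outputs, peak

seq = [[float(i), float(-i)] for i in range(8)]
full_out, full_peak = run_layer(seq, chunk=8)   # whole sequence at once
mini_out, mini_peak = run_layer(seq, chunk=2)   # four mini-sequences
```

Both runs produce the same outputs, but the chunked run holds a quarter as many intermediate values at any one time.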
11. A Two-Phase Perspective on Deep Learning Dynamics
ArXiv ID: 2504.12700
Authors: Robert de Mello Koch, Animik Ghosh
Abstract: We propose that learning in deep neural networks proceeds in two phases: a rapid curve fitting phase followed by a slower compression or coarse graining phase. This view is supported by the shared temporal structure of three phenomena: grokking, double descent and the information bottleneck, all of which exhibit a delayed onset of generalization well after training error reaches zero. We empirically show that the associated timescales align in two rather different settings. Mutual information between hidden layers and input data emerges as a natural progress measure, complementing circuit-based metrics such as local complexity and the linear mapping number. We argue that the second phase is not actively optimized by standard training algorithms and may be unnecessarily prolonged. Drawing on an analogy with the renormalization group, we suggest that this compression phase reflects a principled form of forgetting, critical for generalization.
Comment: Proposes a two-phase perspective on deep learning dynamics, offering insights into training dynamics and representation learning.
Relevance: 9 Novelty: 8
12. Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time
ArXiv ID: 2504.13110
Authors: Margalit Glasgow, Denny Wu, Joan Bruna
Abstract: We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the local Hessian of each particle, defined as the derivative of the particle's velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain "self-concordance" property in these problems -- where the local Hessian of a particle is bounded by a constant times the particle's velocity -- polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.
Comment: The paper provides theoretical insights into the dynamics of neural networks in the mean-field regime, which aligns with the 'Representation Learning' criterion by analyzing training dynamics and approximation gaps.
Relevance: 9 Novelty: 8
13. Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models
ArXiv ID: 2504.12585
Authors: Liyi Zhang, Veniamin Veselovsky, R. Thomas McCoy, Thomas L. Griffiths
Abstract: Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks -- such as counting or forming acronyms -- because the implicit prior distribution they have learned over sequences of tokens influences their responses. In this work, we show that, in at least some cases, LLMs actually compute the information needed to perform these tasks correctly, and we identify some interventions that can allow them to access this information to improve their performance. First, we show that simply prompting the language model to not rely on its prior knowledge leads to dramatic improvements in prior-dominated tasks. We then use mechanistic interpretability techniques to localize the prior within the LLM and manipulate the extent to which that prior influences its responses. Specifically, we show that it is possible to identify layers of the underlying neural network that correlate with the prior probability of a response and that lightweight finetuning of these layers with basic prompts on prior-dominated tasks achieves high performance on held-out answers. These results suggest that the information required to produce a correct response is contained within the representations of the problems formed by the models. Furthermore, we show that this finetuning is significantly more effective for prior-dominated tasks, and that the error after finetuning is no longer correlated with the prior. Our results suggest that it may be possible to define effective methods for manipulating the extent to which LLMs rely upon their priors in solving problems, potentially increasing their performance in settings where LLMs hallucinate for reasons related to the prior probability of token sequences.
Comment: The paper investigates the influence of prior distributions in LLMs and proposes methods to mitigate their effects, aligning with the 'Large Language Models' criterion by providing theoretical insights into LLM behavior.
Relevance: 9 Novelty: 8
14. It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
ArXiv ID: 2504.13173
Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
Abstract: Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias -- the natural tendency to prioritize certain events or stimuli -- we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observed that most existing sequence models leverage either (1) dot-product similarity, or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with their effective approximations to stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on four choices of: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models -- Moneta, Yaad, and Memora -- that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process. Our experiments show different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance in special tasks such as language modeling, commonsense reasoning, and recall intensive tasks, even outperforming Transformers and other modern linear recurrent models.
Comment: The paper proposes a general framework for designing neural architectures inspired by attentional bias, aligning with the 'Model Architecture' criterion by introducing novel architectural insights.
Relevance: 9 Novelty: 8
15. MIB: A Mechanistic Interpretability Benchmark
ArXiv ID: 2504.13151
Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
Abstract: How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of methods, and increases our confidence that there has been real progress in the field.
Comment: Introduces a benchmark for mechanistic interpretability, which aligns with foundational research in understanding LLM behavior.
Relevance: 9 Novelty: 7
16. Spectral Algorithms under Covariate Shift
ArXiv ID: 2504.12625
Authors: Jun Fan, Zheng-Chu Guo, Lei Shi
Abstract: Spectral algorithms leverage spectral regularization techniques to analyze and process data, providing a flexible framework for addressing supervised learning problems. To deepen our understanding of their performance in real-world scenarios where the distributions of training and test data may differ, we conduct a rigorous investigation into the convergence behavior of spectral algorithms under distribution shifts, specifically within the framework of reproducing kernel Hilbert spaces. Our study focuses on the case of covariate shift. In this scenario, the marginal distributions of the input data differ between the training and test datasets, while the conditional distribution of the output given the input remains unchanged. Under this setting, we analyze the generalization error of spectral algorithms and show that they achieve minimax optimality when the density ratios between the training and test distributions are uniformly bounded. However, we also identify a critical limitation: when the density ratios are unbounded, the spectral algorithms may become suboptimal. To address this limitation, we propose a weighted spectral algorithm that incorporates density ratio information into the learning process. Our theoretical analysis shows that this weighted approach achieves optimal capacity-independent convergence rates. Furthermore, by introducing a weight clipping technique, we demonstrate that the convergence rates of the weighted spectral algorithm can approach the optimal capacity-dependent convergence rates arbitrarily closely. This improvement resolves the suboptimality issue in unbounded density ratio scenarios and advances the state-of-the-art by refining existing theoretical results.
Comment: The paper investigates spectral algorithms under covariate shift, providing theoretical insights into generalization, which aligns with foundational research.
Relevance: 8 Novelty: 8
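The weighted, clipped estimator described in the abstract of paper 16 can be illustrated with kernel ridge regression, which is one member of the spectral-algorithm family (the Tikhonov filter). The sketch below is a minimal illustration under stated assumptions, not the authors' method: the density ratio is assumed to be known, the Gaussian kernel, the bandwidth, and the helper names `gaussian_kernel`, `clipped_weights`, `weighted_krr_fit`, and `krr_predict` are all hypothetical, and the paper's actual weighting and clipping scheme may differ.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=0.5):
    """Gaussian (RBF) kernel matrix between 1-D sample arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def clipped_weights(ratio, clip):
    """Clip density-ratio weights at `clip` so no sample dominates the fit."""
    return np.minimum(ratio, clip)

def weighted_krr_fit(x, y, weights, lam, bandwidth=0.5):
    """Minimize (1/n) * sum_i w_i (f(x_i) - y_i)^2 + lam * ||f||^2 over the RKHS.

    With f = sum_j alpha_j k(x_j, .), the coefficients solve
    (W K + n * lam * I) alpha = W y, where W = diag(weights), weights > 0.
    """
    n = len(x)
    K = gaussian_kernel(x, x, bandwidth)
    A = weights[:, None] * K + n * lam * np.eye(n)
    return np.linalg.solve(A, weights * y)

def krr_predict(x_train, alpha, x_new, bandwidth=0.5):
    """Evaluate the fitted function at new inputs."""
    return gaussian_kernel(x_new, x_train, bandwidth) @ alpha

# Toy covariate shift (assumed for illustration): train inputs ~ N(0, 1),
# test inputs ~ N(0, 2^2). The density ratio 0.5 * exp(3 x^2 / 8) is unbounded.
rng = np.random.default_rng(0)
x_train = rng.normal(size=50)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=50)
ratio = 0.5 * np.exp(3.0 * x_train**2 / 8.0)
alpha = weighted_krr_fit(x_train, y_train, clipped_weights(ratio, clip=5.0), lam=1e-2)
predictions = krr_predict(x_train, alpha, np.linspace(-3.0, 3.0, 7))
```

Setting all weights to 1 recovers ordinary kernel ridge regression. The clipping threshold trades the bias introduced by truncating the ratio against the variance caused by very large weights.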
17. Stochastic Gradient Descent in Non-Convex Problems: Asymptotic Convergence with Relaxed Step-Size via Stopping Time Methods
ArXiv ID: 2504.12601
Authors: Ruinan Jin, Difei Cheng, Hong Qiao, Xin Shi, Shaodong Liu, Bo Zhang
Abstract: Stochastic Gradient Descent (SGD) is widely used in machine learning research. Previous convergence analyses of SGD under the vanishing step-size setting typically require Robbins-Monro conditions. However, in practice, a wider variety of step-size schemes are frequently employed, yet existing convergence results remain limited and often rely on strong assumptions. This paper bridges this gap by introducing a novel analytical framework based on a stopping-time method, enabling asymptotic convergence analysis of SGD under more relaxed step-size conditions and weaker assumptions. In the non-convex setting, we prove the almost sure convergence of SGD iterates for step-sizes $\{\epsilon_t\}_{t \geq 1}$ satisfying $\sum_{t=1}^{+\infty} \epsilon_t = +\infty$ and $\sum_{t=1}^{+\infty} \epsilon_t^p < +\infty$ for some $p > 2$. Compared with previous studies, our analysis eliminates the global Lipschitz continuity assumption on the loss function and relaxes the boundedness requirements for higher-order moments of stochastic gradients. Building upon the almost sure convergence results, we further establish $L_2$ convergence. These significantly relaxed assumptions make our theoretical results more general, thereby enhancing their applicability in practical scenarios.
Comment: Provides a novel theoretical framework for SGD convergence under relaxed step-size conditions, contributing to foundational optimization research.
Relevance: 8 Novelty: 8
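As a worked example for paper 17, read its step-size conditions as $\sum_t \epsilon_t = +\infty$ and $\sum_t \epsilon_t^p < +\infty$ for some $p > 2$. The polynomial schedule $\epsilon_t = t^{-\alpha}$ is an assumption chosen here for illustration and is not taken from the paper. It shows which decay rates the relaxed conditions admit that the classical Robbins-Monro conditions (the case $p = 2$) exclude:

```latex
\[
\epsilon_t = t^{-\alpha}: \qquad
\sum_{t=1}^{+\infty} t^{-\alpha} = +\infty \iff \alpha \le 1,
\qquad
\sum_{t=1}^{+\infty} t^{-\alpha p} < +\infty \iff \alpha p > 1 .
\]
\[
\text{Relaxed conditions for a fixed } p > 2:\quad \tfrac{1}{p} < \alpha \le 1,
\qquad
\text{Robbins-Monro } (p = 2):\quad \tfrac{1}{2} < \alpha \le 1 .
\]
\[
\text{Example: } \alpha = \tfrac{2}{5},\ p = 3:\quad
\sum_{t} t^{-2/5} = +\infty,\quad
\sum_{t} t^{-6/5} < +\infty,\quad
\text{but } \sum_{t} t^{-4/5} = +\infty .
\]
```

So a slowly decaying schedule such as $\epsilon_t = t^{-2/5}$ satisfies the relaxed conditions with $p = 3$ while violating the square-summability required by Robbins-Monro.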
18. Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
ArXiv ID: 2504.12898
Authors: Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu
Abstract: Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to the poor generalizability of LLMs. However, due to the diversity of dataset biases and the insufficient nature of bias suppression based on in-context learning, the effectiveness of previous prior knowledge-based debiasing methods and in-context learning based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (IGCIDB) framework. This framework first utilizes an information gain-guided causal intervention method to automatically and autonomously balance the distribution of instruction-tuning dataset. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that IGCIDB can effectively debias LLM to improve its generalizability across different tasks.
Comment: The paper proposes a causal intervention framework for debiasing LLMs, which is relevant to foundational research in LLM behavior and interpretability.
Relevance: 8 Novelty: 7
19. Transferrable Surrogates in Expressive Neural Architecture Search Spaces
ArXiv ID: 2504.12971
Authors: Shiwen Qin, Gabriela Kadlecová, Martin Pilát, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson
Abstract: Neural architecture search (NAS) faces a challenge in balancing the exploration of expressive, broad search spaces that enable architectural innovation with the need for efficient evaluation of architectures to effectively search such spaces. We investigate surrogate model training for improving search in highly expressive NAS search spaces based on context-free grammars. We show that i) surrogate models trained either using zero-cost-proxy metrics and neural graph features (GRAF) or by fine-tuning an off-the-shelf LM have high predictive power for the performance of architectures both within and across datasets, ii) these surrogates can be used to filter out bad architectures when searching on novel datasets, thereby significantly speeding up search and achieving better final performances, and iii) the surrogates can be further used directly as the search objective for huge speed-ups.
Comment: The paper focuses on surrogate models for neural architecture search (NAS), which aligns with the 'Model Architecture' criterion as it explores architectural innovation and efficiency in search spaces.
Relevance: 8 Novelty: 7
20. You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models
ArXiv ID: 2504.12471
Authors: Shiwei Ding, Lan Zhang, Zhenlin Wang, Giuseppe Ateniese, Xiaoyong Yuan
Abstract: Fine-tuning plays a crucial role in adapting models to downstream tasks with minimal training efforts. However, the rapidly increasing size of foundation models poses a daunting challenge for accommodating foundation model fine-tuning in most commercial devices, which often have limited memory bandwidth. Techniques like model sharding and tensor parallelism address this issue by distributing computation across multiple devices to meet memory requirements. Nevertheless, these methods do not fully leverage their foundation nature in facilitating the fine-tuning process, resulting in high computational costs and imbalanced workloads. We introduce a novel Distributed Dynamic Fine-Tuning (D2FT) framework that strategically orchestrates operations across attention modules based on our observation that not all attention modules are necessary for forward and backward propagation in fine-tuning foundation models. Through three innovative selection strategies, D2FT significantly reduces the computational workload required for fine-tuning foundation models. Furthermore, D2FT addresses workload imbalances in distributed computing environments by optimizing these selection strategies via multiple knapsack optimization. Our experimental results demonstrate that the proposed D2FT framework reduces the training computational costs by 40% and training communication costs by 50% with only 1% to 2% accuracy drops on the CIFAR-10, CIFAR-100, and Stanford Cars datasets. Moreover, the results show that D2FT can be effectively extended to recent LoRA, a state-of-the-art parameter-efficient fine-tuning technique. By reducing 40% computational cost or 50% communication cost, D2FT LoRA top-1 accuracy only drops 4% to 6% on Stanford Cars dataset.
Comment: The paper introduces a distributed fine-tuning framework for foundation models, which aligns with the 'Model Compression' criterion by addressing computational efficiency in fine-tuning.
Relevance: 8 Novelty: 7
21. Hadamard product in deep learning: Introduction, Advances and Challenges
ArXiv ID: 2504.13112
Authors: Grigorios G Chrysos, Yongtao Wu, Razvan Pascanu, Philip Torr, Volkan Cevher
Abstract: While convolution and self-attention mechanisms have dominated architectural design in deep learning, this survey examines a fundamental yet understudied primitive: the Hadamard product. Despite its widespread implementation across various applications, the Hadamard product has not been systematically analyzed as a core architectural primitive. We present the first comprehensive taxonomy of its applications in deep learning, identifying four principal domains: higher-order correlation, multimodal data fusion, dynamic representation modulation, and efficient pairwise operations. The Hadamard product's ability to model nonlinear interactions with linear computational complexity makes it particularly valuable for resource-constrained deployments and edge computing scenarios. We demonstrate its natural applicability in multimodal fusion tasks, such as visual question answering, and its effectiveness in representation masking for applications including image inpainting and pruning. This systematic review not only consolidates existing knowledge about the Hadamard product's role in deep learning architectures but also establishes a foundation for future architectural innovations. Our analysis reveals the Hadamard product as a versatile primitive that offers compelling trade-offs between computational efficiency and representational power, positioning it as a crucial component in the deep learning toolkit.
Comment: The paper surveys the Hadamard product in deep learning, which aligns with the 'Model Architecture' criterion by analyzing its role as a fundamental architectural primitive.
Relevance: 8 Novelty: 7
22. Towards Lossless Token Pruning in Late-Interaction Retrieval Models
ArXiv ID: 2504.12778
Authors: Yuxuan Zong, Benjamin Piwowarski
Abstract: Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistical-based techniques to prune tokens from each document. This however doesn't guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses, that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in and out-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.
Comment: The paper proposes a principled approach to token pruning in late-interaction retrieval models, aligning with the 'Model Compression' criterion by addressing efficiency through pruning strategies.
Relevance: 8 Novelty: 7
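To make "pruning without impacting the score" in paper 22 concrete, the sketch below computes the standard ColBERT late-interaction (MaxSim) score and checks, for a single query, that dropping document tokens that never attain a per-query-token maximum leaves the score unchanged. This is only an illustration under assumptions: the embeddings are random, the helper names `maxsim_score` and `tokens_used` are hypothetical, and the paper's criterion, which has to hold for every possible query rather than one observed query, is stronger than this per-query check.

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """Late-interaction score: for each query token take the maximum dot
    product over document tokens, then sum over query tokens."""
    sim = query_emb @ doc_emb.T          # shape (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

def tokens_used(query_emb, doc_emb):
    """Indices of document tokens that attain the maximum for some query token."""
    sim = query_emb @ doc_emb.T
    return np.unique(sim.argmax(axis=1))

# Random embeddings stand in for real ColBERT token representations.
rng = np.random.default_rng(1)
query = rng.normal(size=(4, 8))          # 4 query tokens, dimension 8
doc = rng.normal(size=(10, 8))           # 10 document tokens, dimension 8

kept = tokens_used(query, doc)           # at most 4 of the 10 tokens
full_score = maxsim_score(query, doc)
pruned_score = maxsim_score(query, doc[kept])
```

For this one query, `pruned_score` matches `full_score` up to floating-point error, because every per-token maximum is still attained by a kept document token.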
23. GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs
ArXiv ID: 2504.12681
Authors: Kun-Woo Kim, Ji-Hoon Park, Ju-Min Han, Seong-Whan Lee
Abstract: Large Language Models (LLMs) trained on extensive datasets often learn sensitive information, which raises significant social and legal concerns under principles such as the "Right to be forgotten." Retraining entire models from scratch to remove undesired information is both costly and impractical. Furthermore, existing single-domain unlearning methods fail to address multi-domain scenarios, where knowledge is interwoven across domains such as privacy and copyright, creating overlapping representations that lead to excessive knowledge removal or degraded performance. To tackle these issues, we propose GRAIL (GRadient-based AdaptIve unLearning), a novel multi-domain unlearning framework. GRAIL leverages gradient information from multiple domains to precisely distinguish the unlearning scope from the retention scope, and applies an adaptive parameter-wise localization strategy to selectively remove targeted knowledge while preserving critical parameters for each domain. Experimental results on unlearning benchmarks show that GRAIL achieves unlearning success on par with the existing approaches, while also demonstrating up to 17% stronger knowledge retention success compared to the previous state-of-art method. Our findings establish a new paradigm for effectively managing and regulating sensitive information in large-scale pre-trained language models.
Comment: Introduces a gradient-based unlearning framework for LLMs, which aligns with foundational research in managing sensitive information in large models.
Relevance: 8 Novelty: 7
24. Simplifying Graph Transformers
ArXiv ID: 2504.12588
Authors: Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates
Abstract: Transformers have attained outstanding performance across various modalities, employing scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers are designed with major architectural differences, either integrating message-passing or incorporating sophisticated attention mechanisms. These complexities prevent the easy adoption of Transformer training advances. We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions. Specifically, we advocate for the use of (1) simplified $L_2$ attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder. Significant performance gains across a variety of graph datasets justify the effectiveness of our proposed modifications. Furthermore, empirical evaluation on the expressiveness benchmark reveals noteworthy realized expressiveness in the graph isomorphism.
Comment: The paper proposes architectural simplifications for Graph Transformers, which aligns with the 'Model Architecture' criterion by introducing modifications to make Transformers applicable to graphs without major architectural changes.
Relevance: 8 Novelty: 7
25. Disentangling Polysemantic Channels in Convolutional Neural Networks
ArXiv ID: 2504.12939
Authors: Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth
Abstract: Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, utilizing that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
Comment: The paper focuses on disentangling polysemantic channels in CNNs, which aligns with the 'Representation Learning' criterion by enhancing interpretability and understanding of feature encoding in neural networks.
Relevance: 8 Novelty: 7
26. The Others: Naturally Isolating Out-of-Distribution Samples for Robust Open-Set Semi-Supervised Learning
ArXiv ID: 2504.12569
Authors: You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim
Abstract: Open-Set Semi-Supervised Learning (OSSL) tackles the practical challenge of learning from unlabeled data that may include both in-distribution (ID) and unknown out-of-distribution (OOD) classes. However, existing OSSL methods form suboptimal feature spaces by either excluding OOD samples, interfering with them, or overtrusting their information during training. In this work, we introduce MagMatch, a novel framework that naturally isolates OOD samples through a prototype-based contrastive learning paradigm. Unlike conventional methods, MagMatch does not assign any prototypes to OOD samples; instead, it selectively aligns ID samples with class prototypes using an ID-Selective Magnetic (ISM) module, while allowing OOD samples - the "others" - to remain unaligned in the feature space. To support this process, we propose Selective Magnetic Alignment (SMA) loss for unlabeled data, which dynamically adjusts alignment based on sample confidence. Extensive experiments on diverse datasets demonstrate that MagMatch significantly outperforms existing methods in both closed-set classification accuracy and OOD detection AUROC, especially in generalizing to unseen OOD data.
Comment: The paper introduces a novel framework, MagMatch, for open-set semi-supervised learning using a prototype-based contrastive learning paradigm. This aligns with representation learning, particularly in the context of feature space structuring and contrastive methods.
Relevance: 8 Novelty: 7
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
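A response in the JSONL format requested above could be validated along these lines. This is a minimal sketch, not the pipeline's actual code: the function name `parse_selection` and the choice to reject out-of-range scores with a `ValueError` are assumptions, and the sample line reuses the ID, comment, and scores shown for paper 16 in this digest.

```python
import json

REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")

def parse_selection(jsonl_text):
    """Parse one JSON object per non-empty line and validate the four fields."""
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"missing keys: {missing}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = int(record[key])
            if not 1 <= score <= 10:
                raise ValueError(f"{key} out of range: {score}")
            record[key] = score
        records.append(record)
    return records

sample = (
    '{"ARXIVID": "2504.12625", "COMMENT": "The paper investigates spectral '
    'algorithms under covariate shift, providing theoretical insights into '
    'generalization, which aligns with foundational research.", '
    '"RELEVANCE": 8, "NOVELTY": 8}'
)
records = parse_selection(sample)
```

Validating each line independently means one malformed record can be reported by itself without discarding the rest of the response.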
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
  - Focus: Fully aligned with core topics with no deviation; score the highest if the paper contains relevant keywords.
  - Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
- Relevance 7-8 (Relevant)
  - Focus: Retains a solid link to the main research area, though it may touch on peripheral elements.
  - Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.
- Relevance 5-6 (Borderline)
  - Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  - Examples: Work that references MoE but is centered on reinforcement learning.
- Relevance 3-4 (Irrelevant)
  - Focus: Largely outside our interests with no association to our topics.
  - Examples: Application-focused papers, such as using MoE to solve a real-world problem.
- Relevance 1-2 (Ignore)
  - Focus: Purely unrelated to our topics; a completely different domain.
  - Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9-10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
  - Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  - Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
- Novelty 7-8 (Improvements)
  - Definition: Substantial insights/enhancements, though not a full paradigm shift.
  - Examples: Modifications on existing methods yielding significantly better results.
- Novelty 5-6 (Borderline)
  - Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  - Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
- Novelty 3-4 (Tangential)
  - Definition: Minor or domain-specific improvements with limited broader impact.
  - Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
- Novelty 1-2 (Low)
  - Definition: Minimal originality, applying standard approaches without real innovation.
  - Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
- Representation Learning
  - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
  - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
- Model Architecture
  - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations.
  - Irrelevant: Merely using existing architectures for a certain task without insights into the structures themselves.
- Model Compression
  - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
  - Irrelevant: Straightforward applications of existing compression methods to new tasks.
- Large Language Models (LLMs)
  - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability.
  - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g., hallucination, reasoning, safety).
- AI for Science
  - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
  - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
- Emerging Trends
  - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
  - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.