Previous Day 2025-04-17
Monthly Overview 2025-04
Next Day 2025-04-21

Personalized Daily ArXiv Papers 2025-04-18

[gpt-4o] Prompt Completion Total
Token 35479 4816 40295
Cost $0.09 $0.05 $0.14

Total arXiv papers: 433

Total scanned papers: 254

Total relevant papers: 26

Table of contents with paper titles:

  1. Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models Authors: Yuanbo Tang, Yan Tang, Naifan Zhang, Meixuan Chen, Yang Li

  2. Dense Backpropagation Improves Training for Sparse Mixture-of-Experts Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein

  3. An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research Authors: Patrik Reizinger, Randall Balestriero, David Klindt, Wieland Brendel

  4. Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation Authors: Nairouz Mrabah, Nicolas Richet, Ismail Ben Ayed, \'Eric Granger

  5. Hierarchical Vector Quantized Graph Autoencoder with Annealing-Based Code Selection Authors: Long Zeng, Jianxiang Yu, Jiapeng Zhu, Qingsong Zhong, Xiang Li

  6. Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation Authors: Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang

  7. On Linear Representations and Pretraining Data Frequency in Language Models Authors: Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar

  8. Memorization: A Close Look at Books Authors: Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes

  9. A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving Authors: Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko

  10. MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models Authors: Junyang Zhang, Tianyi Zhu, Cheng Luo, Anima Anandkumar

  11. A Two-Phase Perspective on Deep Learning Dynamics Authors: Robert de Mello Koch, Animik Ghosh

  12. Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time Authors: Margalit Glasgow, Denny Wu, Joan Bruna

  13. Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models Authors: Liyi Zhang, Veniamin Veselovsky, R. Thomas McCoy, Thomas L. Griffiths

  14. It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni

  15. MIB: A Mechanistic Interpretability Benchmark Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iv\'an Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

  16. Spectral Algorithms under Covariate Shift Authors: Jun Fan, Zheng-Chu Guo, Lei Shi

  17. Stochastic Gradient Descent in Non-Convex Problems: Asymptotic Convergence with Relaxed Step-Size via Stopping Time Methods Authors: Ruinan Jin, Difei Cheng, Hong Qiao, Xin Shi, Shaodong Liu, Bo Zhang

  18. Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models Authors: Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu

  19. Transferrable Surrogates in Expressive Neural Architecture Search Spaces Authors: Shiwen Qin, Gabriela Kadlecov\'a, Martin Pil\'at, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson

  20. You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models Authors: Shiwei Ding, Lan Zhang, Zhenlin Wang, Giuseppe Ateniese, Xiaoyong Yuan

  21. Hadamard product in deep learning: Introduction, Advances and Challenges Authors: Grigorios G Chrysos, Yongtao Wu, Razvan Pascanu, Philip Torr, Volkan Cevher

  22. Towards Lossless Token Pruning in Late-Interaction Retrieval Models Authors: Yuxuan Zong, Benjamin Piwowarski

  23. GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs Authors: Kun-Woo Kim, Ji-Hoon Park, Ju-Min Han, Seong-Whan Lee

  24. Simplifying Graph Transformers Authors: Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates

  25. Disentangling Polysemantic Channels in Convolutional Neural Networks Authors: Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth

  26. The Others: Naturally Isolating Out-of-Distribution Samples for Robust Open-Set Semi-Supervised Learning Authors: You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim


1. Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models

ArXiv ID: 2504.12359

Authors: Yuanbo Tang, Yan Tang, Naifan Zhang, Meixuan Chen, Yang Li

Abstract: Mixture-of-Experts based large language models (MoE LLMs) have shown significant promise in multitask adaptability by dynamically routing inputs to specialized experts. Despite their success, the collaborative mechanisms among experts are still not well understood, limiting both the interpretability and optimization of these models. In this paper, we focus on two critical issues: (1) identifying expert collaboration patterns, and (2) optimizing MoE LLMs through expert pruning. To address the first issue, we propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts. For the second issue, we introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts. Our extensive experiments demonstrate that expert collaboration patterns are closely linked to specific input types and exhibit semantic significance across various tasks. Moreover, pruning experiments show that our approach improves overall performance by 2.5\% on average, outperforming existing methods. These findings offer valuable insights into enhancing the efficiency and interpretability of MoE LLMs, offering a clearer understanding of expert interactions and improving model optimization.

Comment: The paper explores expert collaboration and pruning in MoE-based LLMs, which is highly relevant to foundational research in model architecture and efficiency.

Relevance: 10 Novelty: 8


2. Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

ArXiv ID: 2504.12463

Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein

Abstract: Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.

Comment: Proposes a method to improve training for sparse Mixture-of-Experts, directly aligning with foundational research in MoE architectures.

Relevance: 10 Novelty: 8


3. An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research

ArXiv ID: 2504.13101

Authors: Patrik Reizinger, Randall Balestriero, David Klindt, Wieland Brendel

Abstract: Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL's empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.

Comment: The paper proposes expanding Identifiability Theory to explain self-supervised learning, which aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 9


4. Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

ArXiv ID: 2504.12436

Authors: Nairouz Mrabah, Nicolas Richet, Ismail Ben Ayed, \'Eric Granger

Abstract: Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for \textit{local sparsity and global density}, which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for \textit{local randomness and global importance}, which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.

Comment: The paper introduces a sparse optimization framework for few-shot adaptation, which aligns with model compression topics like sparsity and efficiency improvements.

Relevance: 9 Novelty: 8


5. Hierarchical Vector Quantized Graph Autoencoder with Annealing-Based Code Selection

ArXiv ID: 2504.12715

Authors: Long Zeng, Jianxiang Yu, Jiapeng Zhu, Qingsong Zhong, Xiang Li

Abstract: Graph self-supervised learning has gained significant attention recently. However, many existing approaches heavily depend on perturbations, and inappropriate perturbations may corrupt the graph's inherent information. The Vector Quantized Variational Autoencoder (VQ-VAE) is a powerful autoencoder extensively used in fields such as computer vision; however, its application to graph data remains underexplored. In this paper, we provide an empirical analysis of vector quantization in the context of graph autoencoders, demonstrating its significant enhancement of the model's capacity to capture graph topology. Furthermore, we identify two key challenges associated with vector quantization when applying in graph data: codebook underutilization and codebook space sparsity. For the first challenge, we propose an annealing-based encoding strategy that promotes broad code utilization in the early stages of training, gradually shifting focus toward the most effective codes as training progresses. For the second challenge, we introduce a hierarchical two-layer codebook that captures relationships between embeddings through clustering. The second layer codebook links similar codes, encouraging the model to learn closer embeddings for nodes with similar features and structural topology in the graph. Our proposed model outperforms 16 representative baseline methods in self-supervised link prediction and node classification tasks across multiple datasets.

Comment: The paper introduces a hierarchical vector quantized graph autoencoder, which aligns with foundational research in representation learning and autoencoders.

Relevance: 9 Novelty: 8


6. Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation

ArXiv ID: 2504.12637

Authors: Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang

Abstract: Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.

Comment: The paper introduces a novel synthetic data generation strategy for extending LLM context lengths, which aligns with the 'Large Language Models' criterion, particularly in addressing architectural and efficiency challenges.

Relevance: 9 Novelty: 8


7. On Linear Representations and Pretraining Data Frequency in Language Models

ArXiv ID: 2504.12459

Authors: Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar

Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded `linearly' in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.

Comment: The paper investigates the relationship between pretraining data frequency and linear representations in LLMs, aligning with the 'Representation Learning' criterion as it provides insights into how LLMs encode information.

Relevance: 9 Novelty: 8


8. Memorization: A Close Look at Books

ArXiv ID: 2504.12549

Authors: Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes

Abstract: To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the "prefix-prompting" extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice's Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.

Comment: The paper explores memorization in LLMs and its connection to pretraining data, aligning with the 'Large Language Models' criterion as it provides theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8


9. A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving

ArXiv ID: 2504.12984

Authors: Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko

Abstract: Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computations. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels on their supported types. Compared to existing compilers like Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, our VM achieves performance improvements of 1.75x, 2.61x, 1.29x and 1.03x, respectively.

Comment: Presents a virtual machine for low-precision GPGPU computation, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8


10. MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models

ArXiv ID: 2504.12526

Authors: Junyang Zhang, Tianyi Zhu, Cheng Luo, Anima Anandkumar

Abstract: Long-context language models exhibit impressive performance but remain challenging to deploy due to high GPU memory demands during inference. We propose Memory-efficient Offloaded Mini-sequence Inference (MOM), a method that partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading. Experiments on various Llama, Qwen, and Mistral models demonstrate that MOM reduces peak memory usage by over 50\% on average. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU, while keeping outputs identical and not compromising accuracy. MOM also maintains highly competitive throughput due to minimal computational overhead and efficient last-layer processing. Compared to traditional chunked prefill methods, MOM achieves a 35\% greater context length extension. More importantly, our method drastically reduces prefill memory consumption, eliminating it as the longstanding dominant memory bottleneck during inference. This breakthrough fundamentally changes research priorities, redirecting future efforts from prefill-stage optimizations to improving decode-stage residual KV cache efficiency.

Comment: Proposes a memory-efficient inference method for long-context LLMs, which is relevant to model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8


11. A Two-Phase Perspective on Deep Learning Dynamics

ArXiv ID: 2504.12700

Authors: Robert de Mello Koch, Animik Ghosh

Abstract: We propose that learning in deep neural networks proceeds in two phases: a rapid curve fitting phase followed by a slower compression or coarse graining phase. This view is supported by the shared temporal structure of three phenomena: grokking, double descent and the information bottleneck, all of which exhibit a delayed onset of generalization well after training error reaches zero. We empirically show that the associated timescales align in two rather different settings. Mutual information between hidden layers and input data emerges as a natural progress measure, complementing circuit-based metrics such as local complexity and the linear mapping number. We argue that the second phase is not actively optimized by standard training algorithms and may be unnecessarily prolonged. Drawing on an analogy with the renormalization group, we suggest that this compression phase reflects a principled form of forgetting, critical for generalization.

Comment: Proposes a two-phase perspective on deep learning dynamics, offering insights into training dynamics and representation learning.

Relevance: 9 Novelty: 8


12. Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time

ArXiv ID: 2504.13110

Authors: Margalit Glasgow, Denny Wu, Joan Bruna

Abstract: We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the local Hessian of each particle, defined as the derivative of the particle's velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain ``self-concordance'' property in these problems -- where the local Hessian of a particle is bounded by a constant times the particle's velocity -- polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.

Comment: The paper provides theoretical insights into the dynamics of neural networks in the mean-field regime, which aligns with the 'Representation Learning' criterion by analyzing training dynamics and approximation gaps.

Relevance: 9 Novelty: 8


13. Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models

ArXiv ID: 2504.12585

Authors: Liyi Zhang, Veniamin Veselovsky, R. Thomas McCoy, Thomas L. Griffiths

Abstract: Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks -- such as counting or forming acronyms -- because the implicit prior distribution they have learned over sequences of tokens influences their responses. In this work, we show that, in at least some cases, LLMs actually compute the information needed to perform these tasks correctly, and we identify some interventions that can allow them to access this information to improve their performance. First, we show that simply prompting the language model to not rely on its prior knowledge leads to dramatic improvements in prior-dominated tasks. We then use mechanistic interpretability techniques to localize the prior within the LLM and manipulate the extent to which that prior influences its responses. Specifically, we show that it is possible to identify layers of the underlying neural network that correlate with the prior probability of a response and that lightweight finetuning of these layers with basic prompts on prior-dominated tasks achieves high performance on held-out answers. These results suggest that the information required to produce a correct response is contained within the representations of the problems formed by the models. Furthermore, we show that this finetuning is significantly more effective for prior-dominated tasks, and that the error after finetuning is no longer correlated with the prior. Our results suggest that it may be possible to define effective methods for manipulating the extent to which LLMs rely upon their priors in solving problems, potentially increasing their performance in settings where LLMs hallucinate for reasons related to the prior probability of token sequences.

Comment: The paper investigates the influence of prior distributions in LLMs and proposes methods to mitigate their effects, aligning with the 'Large Language Models' criterion by providing theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8


14. It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

ArXiv ID: 2504.13173

Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni

Abstract: Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias-the natural tendency to prioritize certain events or stimuli-we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observed that most existing sequence models leverage either (1) dot-product similarity, or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with their effective approximations to stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on four choices of: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models-Moneta, Yaad, and Memora-that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process. Our experiments show different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance in special tasks such as language modeling, commonsense reasoning, and recall intensive tasks, even outperforming Transformers and other modern linear recurrent models.

Comment: The paper proposes a general framework for designing neural architectures inspired by attentional bias, aligning with the 'Model Architecture' criterion by introducing novel architectural insights.

Relevance: 9 Novelty: 8


15. MIB: A Mechanistic Interpretability Benchmark

ArXiv ID: 2504.13151

Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iv\'an Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

Abstract: How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of methods, and increases our confidence that there has been real progress in the field.

Comment: Introduces a benchmark for mechanistic interpretability, which aligns with foundational research in understanding LLM behavior.

Relevance: 9 Novelty: 7


16. Spectral Algorithms under Covariate Shift

ArXiv ID: 2504.12625

Authors: Jun Fan, Zheng-Chu Guo, Lei Shi

Abstract: Spectral algorithms leverage spectral regularization techniques to analyze and process data, providing a flexible framework for addressing supervised learning problems. To deepen our understanding of their performance in real-world scenarios where the distributions of training and test data may differ, we conduct a rigorous investigation into the convergence behavior of spectral algorithms under distribution shifts, specifically within the framework of reproducing kernel Hilbert spaces. Our study focuses on the case of covariate shift. In this scenario, the marginal distributions of the input data differ between the training and test datasets, while the conditional distribution of the output given the input remains unchanged. Under this setting, we analyze the generalization error of spectral algorithms and show that they achieve minimax optimality when the density ratios between the training and test distributions are uniformly bounded. However, we also identify a critical limitation: when the density ratios are unbounded, the spectral algorithms may become suboptimal. To address this limitation, we propose a weighted spectral algorithm that incorporates density ratio information into the learning process. Our theoretical analysis shows that this weighted approach achieves optimal capacity-independent convergence rates. Furthermore, by introducing a weight clipping technique, we demonstrate that the convergence rates of the weighted spectral algorithm can approach the optimal capacity-dependent convergence rates arbitrarily closely. This improvement resolves the suboptimality issue in unbounded density ratio scenarios and advances the state-of-the-art by refining existing theoretical results.

Comment: The paper investigates spectral algorithms under covariate shift, providing theoretical insights into generalization, which aligns with foundational research.

Relevance: 8 Novelty: 8


17. Stochastic Gradient Descent in Non-Convex Problems: Asymptotic Convergence with Relaxed Step-Size via Stopping Time Methods

ArXiv ID: 2504.12601

Authors: Ruinan Jin, Difei Cheng, Hong Qiao, Xin Shi, Shaodong Liu, Bo Zhang

Abstract: Stochastic Gradient Descent (SGD) is widely used in machine learning research. Previous convergence analyses of SGD under the vanishing step-size setting typically require Robbins-Monro conditions. However, in practice, a wider variety of step-size schemes are frequently employed, yet existing convergence results remain limited and often rely on strong assumptions. This paper bridges this gap by introducing a novel analytical framework based on a stopping-time method, enabling asymptotic convergence analysis of SGD under more relaxed step-size conditions and weaker assumptions. In the non-convex setting, we prove the almost sure convergence of SGD iterates for step-sizes $ { \epsilon_t }{t \geq 1} $ satisfying $\sum \epsilon_t^p 2$. Compared with previous studies, our analysis eliminates the global Lipschitz continuity assumption on the loss function and relaxes the boundedness requirements for higher-order moments of stochastic gradients. Building upon the almost sure convergence results, we further establish $L_2$ convergence. These significantly relaxed assumptions make our theoretical results more general, thereby enhancing their applicability in practical scenarios.}^{+\infty} \epsilon_t = +\infty$ and $\sum_{t=1}^{+\infty

Comment: Provides a novel theoretical framework for SGD convergence under relaxed step-size conditions, contributing to foundational optimization research.

Relevance: 8 Novelty: 8


18. Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models

ArXiv ID: 2504.12898

Authors: Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu

Abstract: Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to the poor generalizability of LLMs. However, due to the diversity of dataset biases and the insufficient nature of bias suppression based on in-context learning, the effectiveness of previous prior knowledge-based debiasing methods and in-context learning based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (IGCIDB) framework. This framework first utilizes an information gain-guided causal intervention method to automatically and autonomously balance the distribution of instruction-tuning dataset. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that IGCIDB can effectively debias LLM to improve its generalizability across different tasks.

Comment: The paper proposes a causal intervention framework for debiasing LLMs, which is relevant to foundational research in LLM behavior and interpretability.

Relevance: 8 Novelty: 7


19. Transferrable Surrogates in Expressive Neural Architecture Search Spaces

ArXiv ID: 2504.12971

Authors: Shiwen Qin, Gabriela Kadlecov\'a, Martin Pil\'at, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson

Abstract: Neural architecture search (NAS) faces a challenge in balancing the exploration of expressive, broad search spaces that enable architectural innovation with the need for efficient evaluation of architectures to effectively search such spaces. We investigate surrogate model training for improving search in highly expressive NAS search spaces based on context-free grammars. We show that i) surrogate models trained either using zero-cost-proxy metrics and neural graph features (GRAF) or by fine-tuning an off-the-shelf LM have high predictive power for the performance of architectures both within and across datasets, ii) these surrogates can be used to filter out bad architectures when searching on novel datasets, thereby significantly speeding up search and achieving better final performances, and iii) the surrogates can be further used directly as the search objective for huge speed-ups.

Comment: The paper focuses on surrogate models for neural architecture search (NAS), which aligns with the 'Model Architecture' criterion as it explores architectural innovation and efficiency in search spaces.

Relevance: 8 Novelty: 7


20. You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models

ArXiv ID: 2504.12471

Authors: Shiwei Ding, Lan Zhang, Zhenlin Wang, Giuseppe Ateniese, Xiaoyong Yuan

Abstract: Fine-tuning plays a crucial role in adapting models to downstream tasks with minimal training efforts. However, the rapidly increasing size of foundation models poses a daunting challenge for accommodating foundation model fine-tuning in most commercial devices, which often have limited memory bandwidth. Techniques like model sharding and tensor parallelism address this issue by distributing computation across multiple devices to meet memory requirements. Nevertheless, these methods do not fully leverage their foundation nature in facilitating the fine-tuning process, resulting in high computational costs and imbalanced workloads. We introduce a novel Distributed Dynamic Fine-Tuning (D2FT) framework that strategically orchestrates operations across attention modules based on our observation that not all attention modules are necessary for forward and backward propagation in fine-tuning foundation models. Through three innovative selection strategies, D2FT significantly reduces the computational workload required for fine-tuning foundation models. Furthermore, D2FT addresses workload imbalances in distributed computing environments by optimizing these selection strategies via multiple knapsack optimization. Our experimental results demonstrate that the proposed D2FT framework reduces the training computational costs by 40% and training communication costs by 50% with only 1% to 2% accuracy drops on the CIFAR-10, CIFAR-100, and Stanford Cars datasets. Moreover, the results show that D2FT can be effectively extended to recent LoRA, a state-of-the-art parameter-efficient fine-tuning technique. By reducing 40% computational cost or 50% communication cost, D2FT LoRA top-1 accuracy only drops 4% to 6% on Stanford Cars dataset.

Comment: The paper introduces a distributed fine-tuning framework for foundation models, which aligns with the 'Model Compression' criterion by addressing computational efficiency in fine-tuning.

Relevance: 8 Novelty: 7


21. Hadamard product in deep learning: Introduction, Advances and Challenges

ArXiv ID: 2504.13112

Authors: Grigorios G Chrysos, Yongtao Wu, Razvan Pascanu, Philip Torr, Volkan Cevher

Abstract: While convolution and self-attention mechanisms have dominated architectural design in deep learning, this survey examines a fundamental yet understudied primitive: the Hadamard product. Despite its widespread implementation across various applications, the Hadamard product has not been systematically analyzed as a core architectural primitive. We present the first comprehensive taxonomy of its applications in deep learning, identifying four principal domains: higher-order correlation, multimodal data fusion, dynamic representation modulation, and efficient pairwise operations. The Hadamard product's ability to model nonlinear interactions with linear computational complexity makes it particularly valuable for resource-constrained deployments and edge computing scenarios. We demonstrate its natural applicability in multimodal fusion tasks, such as visual question answering, and its effectiveness in representation masking for applications including image inpainting and pruning. This systematic review not only consolidates existing knowledge about the Hadamard product's role in deep learning architectures but also establishes a foundation for future architectural innovations. Our analysis reveals the Hadamard product as a versatile primitive that offers compelling trade-offs between computational efficiency and representational power, positioning it as a crucial component in the deep learning toolkit.

Comment: The paper surveys the Hadamard product in deep learning, which aligns with the 'Model Architecture' criterion by analyzing its role as a fundamental architectural primitive.

Relevance: 8 Novelty: 7


22. Towards Lossless Token Pruning in Late-Interaction Retrieval Models

ArXiv ID: 2504.12778

Authors: Yuxuan Zong, Benjamin Piwowarski

Abstract: Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistical-based techniques to prune tokens from each document. This however doesn't guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses, that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in and out-domain), showing that we can preserve ColBERT's performance while using only 30\% of the tokens.

Comment: The paper proposes a principled approach to token pruning in late-interaction retrieval models, aligning with the 'Model Compression' criterion by addressing efficiency through pruning strategies.

Relevance: 8 Novelty: 7


ArXiv ID: 2504.12681

Authors: Kun-Woo Kim, Ji-Hoon Park, Ju-Min Han, Seong-Whan Lee

Abstract: Large Language Models (LLMs) trained on extensive datasets often learn sensitive information, which raises significant social and legal concerns under principles such as the "Right to be forgotten." Retraining entire models from scratch to remove undesired information is both costly and impractical. Furthermore, existing single-domain unlearning methods fail to address multi-domain scenarios, where knowledge is interwoven across domains such as privacy and copyright, creating overlapping representations that lead to excessive knowledge removal or degraded performance. To tackle these issues, we propose GRAIL (GRadient-based AdaptIve unLearning), a novel multi-domain unlearning framework. GRAIL leverages gradient information from multiple domains to precisely distinguish the unlearning scope from the retention scope, and applies an adaptive parameter-wise localization strategy to selectively remove targeted knowledge while preserving critical parameters for each domain. Experimental results on unlearning benchmarks show that GRAIL achieves unlearning success on par with the existing approaches, while also demonstrating up to 17% stronger knowledge retention success compared to the previous state-of-art method. Our findings establish a new paradigm for effectively managing and regulating sensitive information in large-scale pre-trained language models.

Comment: Introduces a gradient-based unlearning framework for LLMs, which aligns with foundational research in managing sensitive information in large models.

Relevance: 8 Novelty: 7


24. Simplifying Graph Transformers

ArXiv ID: 2504.12588

Authors: Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates

Abstract: Transformers have attained outstanding performance across various modalities, employing scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers are designed with major architectural differences, either integrating message-passing or incorporating sophisticated attention mechanisms. These complexities prevent the easy adoption of Transformer training advances. We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions. Specifically, we advocate for the use of (1) simplified $L_2$ attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder. Significant performance gains across a variety of graph datasets justify the effectiveness of our proposed modifications. Furthermore, empirical evaluation on the expressiveness benchmark reveals noteworthy realized expressiveness in the graph isomorphism.

Comment: The paper proposes architectural simplifications for Graph Transformers, which aligns with the 'Model Architecture' criterion by introducing modifications to make Transformers applicable to graphs without major architectural changes.

Relevance: 8 Novelty: 7


25. Disentangling Polysemantic Channels in Convolutional Neural Networks

ArXiv ID: 2504.12939

Authors: Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth

Abstract: Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, utilizing that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.

Comment: The paper focuses on disentangling polysemantic channels in CNNs, which aligns with the 'Representation Learning' criterion by enhancing interpretability and understanding of feature encoding in neural networks.

Relevance: 8 Novelty: 7


26. The Others: Naturally Isolating Out-of-Distribution Samples for Robust Open-Set Semi-Supervised Learning

ArXiv ID: 2504.12569

Authors: You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim

Abstract: Open-Set Semi-Supervised Learning (OSSL) tackles the practical challenge of learning from unlabeled data that may include both in-distribution (ID) and unknown out-of-distribution (OOD) classes. However, existing OSSL methods form suboptimal feature spaces by either excluding OOD samples, interfering with them, or overtrusting their information during training. In this work, we introduce MagMatch, a novel framework that naturally isolates OOD samples through a prototype-based contrastive learning paradigm. Unlike conventional methods, MagMatch does not assign any prototypes to OOD samples; instead, it selectively aligns ID samples with class prototypes using an ID-Selective Magnetic (ISM) module, while allowing OOD samples - the "others" - to remain unaligned in the feature space. To support this process, we propose Selective Magnetic Alignment (SMA) loss for unlabeled data, which dynamically adjusts alignment based on sample confidence. Extensive experiments on diverse datasets demonstrate that MagMatch significantly outperforms existing methods in both closed-set classification accuracy and OOD detection AUROC, especially in generalizing to unseen OOD data.

Comment: The paper introduces a novel framework, MagMatch, for open-set semi-supervised learning using a prototype-based contrastive learning paradigm. This aligns with representation learning, particularly in the context of feature space structuring and contrastive methods.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: