Personalized Daily Arxiv Papers 03/13/2025

[gpt-4o]	Prompt	Completion	Total
Token	43136	6079	49215
Cost	$0.1	$0.06	$0.16

Total ArXiv papers: 492

Total scanned papers: 290

Total relevant papers: 34

Table of contents with paper titles:

Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization Authors: Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Authors: Yuhang Liu, Dong Gong, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi
Why LLMs Cannot Think and How to Fix It Authors: Marius Jahrens, Thomas Martinetz
Cost-Optimal Grouped-Query Attention for Long-Context LLMs Authors: Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun
GRU: Mitigating the Trade-off between Unlearning and Retention for Large Language Models Authors: Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, Bo Han
Towards Interpretable Protein Structure Prediction with Sparse Autoencoders Authors: Nithin Parsan, David J. Yang, John J. Yang
Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference Authors: Mohammad Siavashi, Faezeh Keshmiri Dindarloo, Dejan Kostic, Marco Chiesa
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment Authors: Xiaoda Yang, JunYu Lu, Hongshun Qiu, Sijing Li, Hao Li, Shengpeng Ji, Xudong Tang, Jiayang Xu, Jiaqi Duan, Ziyue Jiang, Cong Lin, Sihang Cai, Zejian Xie, Zhuoyang Song, Songxin Zhang
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference Authors: Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker
Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States Authors: Xin Wei Chia, Jonathan Pan
Discovering Influential Neuron Path in Vision Transformers Authors: Yifan Wang, Yifei Liu, Yingdong Shi, Changming Li, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Online multidimensional dictionary learning Authors: Ferdaous Ait Addi, Abdeslem Hafid Bentbib, Khalide Jbilou
Implicit Contrastive Representation Learning with Guided Stop-gradient Authors: Byeongchan Lee, Sehyun Lee
Interpreting the Repeated Token Phenomenon in Large Language Models Authors: Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, Yossi Gandelsman
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Authors: Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
Robust Multi-Objective Controlled Decoding of Large Language Models Authors: Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
Training Plug-n-Play Knowledge Modules with Deep Context Distillation Authors: Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni
Is CLIP ideal? No. Can we fix it? Yes! Authors: Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach Authors: Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong
Quantitative Analysis of Deeply Quantized Tiny Neural Networks Robust to Adversarial Attacks Authors: Idris Zakariyya, Ferheen Ayaz, Mounia Kharbouche-Harrari, Jeremy Singer, Sye Loong Keoh, Danilo Pau, José Cano
SO(3)-Equivariant Neural Networks for Learning Vector Fields on Spheres Authors: Francesco Ballerin, Nello Blaser, Erlend Grong
Learning Spatially Adaptive $\ell_1$-Norms Weights for Convolutional Synthesis Regularization Authors: Andreas Kofler, Luca Calatroni, Christoph Kolbitsch, Kostas Papafitsoros
Adaptive Temperature Based on Logits Correlation in Knowledge Distillation Authors: Kazuhiro Matsuyama, Usman Anjum, Satoko Matsuyama, Tetsuo Shoda, Justin Zhan
A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation Authors: Forough Fazeliasl, Michael Minyi Zhang, Bei Jiang, Linglong Kong
Neural Normalized Cut: A Differential and Generalizable Approach for Spectral Clustering Authors: Wei He, Shangzhi Zhang, Chun-Guang Li, Xianbiao Qi, Rong Xiao, Jun Guo
PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs Authors: Oskar van der Wal, Pietro Lesci, Max Muller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, Stella Biderman
Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation Authors: Yu Wang, Jiaxin Zhang, Xiang Gao, Wendi Cui, Peng Li, Kamalika Das
Learning Pareto manifolds in high dimensions: How can regularization help? Authors: Tobias Wegel, Filip Kovačević, Alexandru Țifrea, Fanny Yang
Neurosymbolic Decision Trees Authors: Matthias Möller, Arvid Norlander, Pedro Zuidberg Dos Martires, Luc De Raedt
Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization Authors: Amit Attia, Tomer Koren
Exploiting Unstructured Sparsity in Fully Homomorphic Encrypted DNNs Authors: Aidan Ferguson, Perry Gibson, Lara D'Agata, Parker McLeod, Ferhat Yaman, Amitabh Das, Ian Colbert, José Cano
Towards Robust Multimodal Representation: A Unified Approach with Adaptive Experts and Alignment Authors: Nazanin Moradinasab, Saurav Sengupta, Jiebei Liu, Sana Syed, Donald E. Brown
Towards Graph Foundation Models: A Transferability Perspective Authors: Yuxiang Wang, Wenqi Fan, Suhang Wang, Yao Ma
The Shape of Attraction in UMAP: Exploring the Embedding Forces in Dimensionality Reduction Authors: Mohammad Tariqul Islam, Jason W. Fleischer

1. Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization

ArXiv ID: 2503.09565

Authors: Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu

Abstract: Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.

Comment: The paper provides theoretical insights into training dynamics and feature learning in infinite-width neural networks, aligning strongly with representation learning and training dynamics.

Relevance: 10 Novelty: 9

2. I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?

ArXiv ID: 2503.08980

Authors: Yuhang Liu, Dong Gong, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi

Abstract: The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This is as opposed to explanations of their capabilities based on their ability to perform relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also strongly reinforces the linear representation hypothesis, which posits that LLMs learn linear representations of human-interpretable concepts. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.

Comment: The paper provides theoretical insights into how LLMs learn human-interpretable concepts, aligning with foundational research in representation learning and LLM behavior.

Relevance: 9 Novelty: 9

3. Why LLMs Cannot Think and How to Fix It

ArXiv ID: 2503.09211

Authors: Marius Jahrens, Thomas Martinetz

Abstract: This paper elucidates that current state-of-the-art Large Language Models (LLMs) are fundamentally incapable of making decisions or developing "thoughts" within the feature space due to their architectural constraints. We establish a definition of "thought" that encompasses traditional understandings of that term and adapt it for application to LLMs. We demonstrate that the architectural design and language modeling training methodology of contemporary LLMs inherently preclude them from engaging in genuine thought processes. Our primary focus is on this theoretical realization rather than practical insights derived from experimental data. Finally, we propose solutions to enable thought processes within the feature space and discuss the broader implications of these architectural modifications.

Comment: The paper critiques the architectural limitations of LLMs and proposes solutions to enable 'thought processes,' aligning with foundational research on LLM architecture.

Relevance: 9 Novelty: 9

4. Cost-Optimal Grouped-Query Attention for Long-Context LLMs

ArXiv ID: 2503.09579

Authors: Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun

Abstract: Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, requiring maximizing model language capabilities and minimizing training and deployment costs. Existing efforts have primarily described complex relationships among model performance, parameter size, and data size, as well as searched for the optimal compute allocation to train LLMs. However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.

Comment: The paper explores cost-optimal grouped-query attention for long-context LLMs, which aligns with foundational research in model architecture and efficiency. It provides insights into attention head configurations and scaling laws.

Relevance: 9 Novelty: 8
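
Sketch: The memory side of the head-configuration trade-off in the abstract above can be illustrated with the standard KV-cache size accounting. This is a minimal sketch, not the paper's cost model; the function name, parameter names, and the example configuration are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    Standard accounting, assumed here for illustration only.
    """
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration: 32 layers, head_dim 128, 4096 cached tokens,
# 16-bit values. Only the number of key-value heads differs between the two.
full_mha = kv_cache_bytes(32, 32, 128, 4096)  # one KV head per query head
gqa = kv_cache_bytes(32, 8, 128, 4096)        # 8 shared KV heads
```

Under these assumptions the cache grows linearly with both sequence length and the number of key-value heads, which is why fewer key-value heads matter more as contexts get longer.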

5. GRU: Mitigating the Trade-off between Unlearning and Retention for Large Language Models

ArXiv ID: 2503.09117

Authors: Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, Bo Han

Abstract: Large language model (LLM) unlearning has demonstrated its essential role in removing privacy and copyright-related responses, crucial for their legal and safe applications. However, the pursuit of complete unlearning often comes with substantial costs due to its compromises in their general functionality, leading to a notorious trade-off between unlearning and retention. In examining the update process for unlearning dynamically, we find gradients hold essential information for revealing this trade-off. In particular, we look at the varying relationship between retention performance and directional disparities between gradients during unlearning. It motivates the sculpting of an update mechanism derived from gradients from two sources, i.e., harmful for retention and useful for unlearning. Accordingly, we propose Gradient Rectified Unlearning (GRU), an enhanced unlearning framework controlling the updating gradients in a geometry-focused and optimization-driven manner such that their side impacts on other, unrelated responses can be minimized. Specifically, GRU derives a closed-form solution to project the unlearning gradient onto the orthogonal space of that gradient harmful for retention, ensuring minimal deviation from its original direction under the condition that overall performance is retained. Comprehensive experiments are conducted to demonstrate that GRU, as a general framework, is straightforward to implement and efficiently enhances a range of baseline methods through its adaptable and compatible characteristics. Additionally, experimental results show its broad effectiveness across a diverse set of benchmarks for LLM unlearning.

Comment: The paper introduces Gradient Rectified Unlearning (GRU) for LLMs, focusing on unlearning while retaining general functionality. This aligns with foundational advancements in LLM behavior and optimization.

Relevance: 9 Novelty: 8
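
Sketch: The projection step described in the abstract above (removing from the unlearning gradient its component along the retention-harmful gradient) can be illustrated on plain vectors. This is a minimal sketch under stated assumptions: the function name is hypothetical, gradients are flat Python lists, and the projection is applied unconditionally, whereas the paper's closed-form solution may include conditions not shown here.

```python
def project_orthogonal(g_unlearn, g_harm, eps=1e-12):
    """Return g_unlearn with its component along g_harm removed."""
    dot = sum(u * h for u, h in zip(g_unlearn, g_harm))
    norm_sq = sum(h * h for h in g_harm)
    if norm_sq < eps:
        # A (near-)zero harmful gradient gives no direction to remove.
        return list(g_unlearn)
    scale = dot / norm_sq
    return [u - scale * h for u, h in zip(g_unlearn, g_harm)]


# Toy 2-D example: the harmful direction is the first axis.
rectified = project_orthogonal([1.0, 1.0], [1.0, 0.0])
```

The rectified vector has zero inner product with the harmful direction and is the closest such vector to the original unlearning gradient in Euclidean distance.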

6. Towards Interpretable Protein Structure Prediction with Sparse Autoencoders

ArXiv ID: 2503.08764

Authors: Nithin Parsan, David J. Yang, John J. Yang

Abstract: Protein language models have revolutionized structure prediction, but their nonlinear nature obscures how sequence representations inform structure prediction. While sparse autoencoders (SAEs) offer a path to interpretability here by learning linear representations in high-dimensional space, their application has been limited to smaller protein language models unable to perform structure prediction. In this work, we make two key advances: (1) we scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time, and (2) we adapt Matryoshka SAEs for protein language models, which learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We demonstrate that our Matryoshka SAEs achieve comparable or better performance than standard architectures. Through comprehensive evaluations, we show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction. Finally, we present an initial case study demonstrating how our approach enables targeted steering of ESMFold predictions, increasing structure solvent accessibility while fixing the input sequence. To facilitate further investigation by the broader community, we open-source our code, dataset, pretrained models https://github.com/johnyang101/reticular-sae , and visualizer https://sae.reticular.ai .

Comment: The paper scales sparse autoencoders to large protein language models, enabling interpretability in protein structure prediction. This aligns with foundational research in representation learning and AI for science.

Relevance: 9 Novelty: 8
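
Sketch: The abstract above describes Matryoshka SAEs as forcing nested groups of latents to reconstruct inputs independently. One way to read that is a reconstruction loss summed over nested prefixes of the latent vector. This is a minimal sketch assuming a linear decoder without bias and an unweighted sum over prefixes; the function name, the prefix sizes, and the omission of the encoder and sparsity penalty are all simplifications, not the authors' implementation.

```python
def nested_reconstruction_loss(x, latents, decoder_rows, prefix_sizes):
    """Sum of squared reconstruction errors over nested latent prefixes.

    decoder_rows[i] is the (hypothetical) decoder direction of latent i.
    """
    total = 0.0
    for m in prefix_sizes:
        # Reconstruct the input using only the first m latents.
        recon = [0.0] * len(x)
        for z, row in zip(latents[:m], decoder_rows[:m]):
            for j, w in enumerate(row):
                recon[j] += z * w
        total += sum((xj - rj) ** 2 for xj, rj in zip(x, recon))
    return total


# Toy example: two latents with an identity decoder, prefixes of size 1 and 2.
loss = nested_reconstruction_loss(
    x=[1.0, 2.0],
    latents=[1.0, 2.0],
    decoder_rows=[[1.0, 0.0], [0.0, 1.0]],
    prefix_sizes=[1, 2],
)
```

In the toy example the first prefix alone misses the second coordinate, so it contributes all of the loss, while the full prefix reconstructs the input exactly.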

7. Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference

ArXiv ID: 2503.09304

Authors: Mohammad Siavashi, Faezeh Keshmiri Dindarloo, Dejan Kostic, Marco Chiesa

Abstract: Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference systems employ iteration-level first-come-first-served scheduling, causing head-of-line blocking when BE jobs delay LS jobs. We introduce QLLM, a novel inference system designed for Mixture of Experts (MoE) models, featuring a fine-grained, priority-aware preemptive scheduler. QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT). Our approach removes iteration-level scheduling constraints, enabling the scheduler to preempt jobs at any layer based on priority. Evaluations on an Nvidia A100 GPU show that QLLM significantly improves performance. It reduces LS TTFT by an average of $65.5\times$ and meets the SLO at up to $7$ requests/sec, whereas the baseline fails to do so under the tested workload. Additionally, it cuts LS turnaround time by up to $12.8\times$ without impacting throughput. QLLM is modular, extensible, and seamlessly integrates with Hugging Face MoE models.

Comment: The paper introduces a priority-aware preemptive scheduling system for MoE inference, which aligns with architectural innovations in MoE models.

Relevance: 9 Novelty: 8

8. Astrea: A MOE-based Visual Understanding Model with Progressive Alignment

ArXiv ID: 2503.09445

Authors: Xiaoda Yang, JunYu Lu, Hongshun Qiu, Sijing Li, Hao Li, Shengpeng Ji, Xudong Tang, Jiayang Xu, Jiaqi Duan, Ziyue Jiang, Cong Lin, Sihang Cai, Zejian Xie, Zhuoyang Song, Songxin Zhang

Abstract: Vision-Language Models (VLMs) based on Mixture-of-Experts (MoE) architectures have emerged as a pivotal paradigm in multimodal understanding, offering a powerful framework for integrating visual and linguistic information. However, the increasing complexity and diversity of tasks present significant challenges in coordinating load balancing across heterogeneous visual experts, where optimizing one specialist's performance often compromises others' capabilities. To address task heterogeneity and expert load imbalance, we propose Astrea, a novel multi-expert collaborative VLM architecture based on progressive pre-alignment. Astrea introduces three key innovations: 1) A heterogeneous expert coordination mechanism that integrates four specialized models (detection, segmentation, classification, captioning) into a comprehensive expert matrix covering essential visual comprehension elements; 2) A dynamic knowledge fusion strategy featuring progressive pre-alignment to harmonize experts within the VLM latent space through contrastive learning, complemented by probabilistically activated stochastic residual connections to preserve knowledge continuity; 3) An enhanced optimization framework utilizing momentum contrastive learning for long-range dependency modeling and adaptive weight allocators for real-time expert contribution calibration. Extensive evaluations across 12 benchmark tasks spanning VQA, image captioning, and cross-modal retrieval demonstrate Astrea's superiority over state-of-the-art models, achieving an average performance gain of +4.7%. This study provides the first empirical demonstration that progressive pre-alignment strategies enable VLMs to overcome task heterogeneity limitations, establishing new methodological foundations for developing general-purpose multimodal agents.

Comment: The paper introduces a MoE-based visual understanding model, which aligns with the model architecture criterion, particularly focusing on MoE innovations.

Relevance: 9 Novelty: 8

9. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference

ArXiv ID: 2503.08879

Authors: Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker

Abstract: Efficient long-context inference is critical as large language models (LLMs) adopt context windows ranging from 128K to 1M tokens. However, the growing key-value (KV) cache and the high computational complexity of attention create significant bottlenecks in memory usage and latency. In this paper, we find that attention in diverse long-context tasks exhibits sparsity, and LLMs implicitly "know" which tokens can be dropped or evicted at the head level after the pre-filling stage. Based on this insight, we propose Self-Attention Guided Eviction (SAGE-KV), a simple and effective KV cache eviction method for long-context inference. After prefilling, our method performs a one-time top-k selection at both the token and head levels to compress the KV cache, enabling efficient inference with the reduced cache. Evaluations on LongBench and three long-context LLMs (Llama3.1-8B-Instruct-128k, Llama3-8B-Prolong-512k-Instruct, and Qwen2.5-7B-Instruct-128k) show that SAGE-KV maintains accuracy comparable to full attention while significantly improving efficiency. Specifically, SAGE-KV achieves 4x higher memory efficiency with improved accuracy over the static KV cache selection method StreamLLM, and 2x higher memory efficiency with better accuracy than the dynamic KV cache selection method Quest.

Comment: The paper introduces a KV cache eviction method for efficient long-context inference in LLMs, aligning with the 'Model Compression' criterion due to its focus on memory efficiency and sparsity.

Relevance: 9 Novelty: 8
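
Sketch: The one-time top-k selection mentioned in the abstract above can be illustrated at the token level. This is a minimal sketch assuming each head scores every cached position once after prefilling and keeps its k highest-scoring positions; the function name and score layout are hypothetical, and the head-level selection the abstract also mentions is not shown.

```python
def select_topk_tokens(attn_scores, k):
    """For each head, return the sorted positions of its k largest scores.

    attn_scores[h][i] is the (hypothetical) attention score that head h
    assigns to cached position i after prefilling.
    """
    kept = []
    for head_scores in attn_scores:
        ranked = sorted(
            range(len(head_scores)),
            key=lambda i: head_scores[i],
            reverse=True,
        )
        # Restore positional order so the retained cache stays in sequence order.
        kept.append(sorted(ranked[:k]))
    return kept


# Toy example: two heads, four cached positions, keep two positions per head.
scores = [
    [0.10, 0.50, 0.05, 0.35],
    [0.40, 0.10, 0.30, 0.20],
]
kept_positions = select_topk_tokens(scores, k=2)
```

Because the selection runs once, decoding afterwards attends only to the retained positions, which is where the memory saving in this sketch would come from.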

10. Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States

ArXiv ID: 2503.09066

Authors: Xin Wei Chia, Jonathan Pan

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to adversarial manipulations such as jailbreaking via prompt injection attacks. These attacks bypass safety mechanisms to generate restricted or harmful content. In this study, we investigated the underlying latent subspaces of safe and jailbroken states by extracting hidden activations from an LLM. Inspired by attractor dynamics in neuroscience, we hypothesized that LLM activations settle into semi-stable states that can be identified and perturbed to induce state transitions. Using dimensionality reduction techniques, we projected activations from safe and jailbroken responses to reveal latent subspaces in lower dimensional spaces. We then derived a perturbation vector that, when applied to safe representations, shifted the model towards a jailbreak state. Our results demonstrate that this causal intervention results in statistically significant jailbreak responses in a subset of prompts. Next, we probed how these perturbations propagate through the model's layers, testing whether the induced state change remains localized or cascades throughout the network. Our findings indicate that targeted perturbations induced distinct shifts in activations and model responses. Our approach paves the way for potential proactive defenses, shifting from traditional guardrail-based methods to preemptive, model-agnostic techniques that neutralize adversarial states at the representation level.

Comment: The paper explores latent subspaces in LLMs for adversarial state manipulation, aligning with the 'Large Language Models' criterion due to its focus on interpretability and theoretical insights.

Relevance: 9 Novelty: 8

11. Discovering Influential Neuron Path in Vision Transformers

ArXiv ID: 2503.09046

Authors: Yifan Wang, Yifei Liu, Yingdong Shi, Changming Li, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren

Abstract: Vision Transformer models exhibit immense power yet remain opaque to human understanding, posing challenges and risks for practical applications. While prior research has attempted to demystify these models through input attribution and neuron role analysis, there's been a notable gap in considering layer-level information and the holistic path of information flow across layers. In this paper, we investigate the significance of influential neuron paths within vision Transformers, where an influential neuron path is a path of neurons from the model input to output that impacts the model inference most significantly. We first propose a joint influence measure to assess the contribution of a set of neurons to the model outcome. We further provide a layer-progressive neuron locating approach that efficiently selects the most influential neuron at each layer to discover the crucial neuron path from input to output within the target model. Our experiments demonstrate the superiority of our method in finding the most influential neuron path along which the information flows, over the existing baseline solutions. Additionally, the neuron paths have illustrated that vision Transformers exhibit some specific inner working mechanism for processing the visual information within the same image category. We further analyze the key effects of these neurons on the image classification task, showcasing that the found neuron paths have already preserved the model capability on downstream tasks, which may also shed some light on real-world applications like model pruning. The project website including implementation code is available at https://foundation-model-research.github.io/NeuronPath/.

Comment: The paper investigates influential neuron paths in Vision Transformers, which aligns with understanding model architecture and interpretability. It provides insights into the inner workings of Transformers.

Relevance: 9 Novelty: 8

12. Online multidimensional dictionary learning

ArXiv ID: 2503.09337

Authors: Ferdaous Ait Addi, Abdeslem Hafid Bentbib, Khalide Jbilou

Abstract: Dictionary learning is a widely used technique in signal processing and machine learning that aims to represent data as a linear combination of a few elements from an overcomplete dictionary. In this work, we propose a generalization of the dictionary learning technique using the t-product framework, enabling efficient handling of multidimensional tensor data. We address the dictionary learning problem through online methods suitable for tensor structures. To effectively address the sparsity problem, we utilize an accelerated Iterative Shrinkage-Thresholding Algorithm (ISTA) enhanced with an extrapolation technique known as Anderson acceleration. This approach significantly improves signal reconstruction results. Extensive experiments prove that our proposed method outperforms existing acceleration techniques, particularly in applications such as data completion. These results suggest that our approach can be highly beneficial for large-scale tensor data analysis in various domains.

Comment: The paper focuses on online multidimensional dictionary learning, which is directly relevant to representation learning and sparse methods. It introduces a novel acceleration technique.

Relevance: 9 Novelty: 8
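
Sketch: The sparse-coding step named in the abstract above relies on the Iterative Shrinkage-Thresholding Algorithm (ISTA). The basic iteration can be illustrated for ordinary vectors and a fixed dictionary. This is a minimal sketch of plain ISTA only: it does not use the t-product tensor framework, Anderson acceleration, or online dictionary updates from the paper, and all names are hypothetical.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t, the proximal step for the l1 penalty."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, y, lam, step, num_iters=50):
    """Minimize 0.5 * ||D x - y||^2 + lam * ||x||_1 by plain ISTA.

    D is a list of rows. step should not exceed 1 / L, where L is the
    largest eigenvalue of D^T D.
    """
    n_rows, n_cols = len(D), len(D[0])
    x = [0.0] * n_cols
    for _ in range(num_iters):
        # Residual r = D x - y.
        r = [sum(D[i][j] * x[j] for j in range(n_cols)) - y[i] for i in range(n_rows)]
        # Gradient of the smooth term: D^T r.
        grad = [sum(D[i][j] * r[i] for i in range(n_rows)) for j in range(n_cols)]
        # Gradient step followed by shrinkage.
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n_cols)]
    return x


# Toy example: with an identity dictionary, ISTA reduces to soft-thresholding y.
code = ista(D=[[1.0, 0.0], [0.0, 1.0]], y=[3.0, -0.5], lam=1.0, step=1.0)
```

The small second coefficient is shrunk exactly to zero, which is the sparsity-inducing behavior that acceleration schemes aim to reach in fewer iterations.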

13. Implicit Contrastive Representation Learning with Guided Stop-gradient

ArXiv ID: 2503.09058

Authors: Byeongchan Lee, Sehyun Lee

Abstract: In self-supervised representation learning, Siamese networks are a natural architecture for learning transformation-invariance by bringing representations of positive pairs closer together. But they are prone to collapse into a degenerate solution. To address the issue, in contrastive learning, a contrastive loss is used to prevent collapse by moving representations of negative pairs away from each other. But it is known that algorithms with negative sampling are not robust to a reduction in the number of negative samples. So, on the other hand, there are algorithms that do not use negative pairs. Many positive-only algorithms adopt an asymmetric network architecture consisting of source and target encoders as a key factor in coping with collapse. By exploiting the asymmetric architecture, we introduce a methodology to implicitly incorporate the idea of contrastive learning. As its implementation, we present a novel method, guided stop-gradient. We apply our method to benchmark algorithms SimSiam and BYOL and show that our method stabilizes training and boosts performance. We also show that the algorithms with our method work well with small batch sizes and do not collapse even when there is no predictor. The code is available at https://github.com/bych-lee/gsg.

Comment: The paper introduces a novel method for implicit contrastive representation learning, which aligns with representation learning and training dynamics. It provides methodological advancements.

Relevance: 9 Novelty: 8

14. Interpreting the Repeated Token Phenomenon in Large Language Models

ArXiv ID: 2503.08908

Authors: Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, Yossi Gandelsman

Abstract: Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when prompted to, and instead output unrelated text. This unexplained failure mode represents a vulnerability, allowing even end-users to diverge models away from their intended behavior. We aim to explain the causes for this phenomenon and link it to the concept of "attention sinks", an emergent LLM behavior crucial for fluency, in which the initial token receives disproportionately high attention scores. Our investigation identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt this circuit. We extend this finding to other non-repeating sequences that exhibit similar circuit disruptions. To address this, we propose a targeted patch that effectively resolves the issue without negatively impacting the model's overall performance. This study provides a mechanistic explanation for an LLM vulnerability, demonstrating how interpretability can diagnose and address issues, and offering insights that pave the way for more secure and reliable models.

Comment: The paper provides a mechanistic explanation for a specific failure mode in LLMs and proposes a targeted patch, aligning with the criterion of theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8

15. SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

ArXiv ID: 2503.09532

Authors: Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda

Abstract: Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at: https://saebench.xyz

Comment: The paper introduces a benchmark for sparse autoencoders, which aligns with the representation learning criterion. The focus on interpretability and feature disentanglement is relevant.

Relevance: 9 Novelty: 7

16. Robust Multi-Objective Controlled Decoding of Large Language Models

ArXiv ID: 2503.08796

Authors: Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

Abstract: Test-time alignment of Large Language Models (LLMs) to human preferences offers a flexible way to generate responses aligned to diverse objectives without extensive retraining of LLMs. Existing methods achieve alignment to multiple objectives simultaneously (e.g., instruction-following, helpfulness, conciseness) by optimizing their corresponding reward functions. However, they often rely on predefined weights or optimize for averages, sacrificing one objective for another and leading to unbalanced outcomes. To address this, we introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that optimizes for improving worst-case rewards. RMOD formalizes the robust decoding problem as a maximin two-player game between reward weights and the sampling policy, solving for the Nash equilibrium. We show that the game reduces to a convex optimization problem to find the worst-case weights, while the best response policy can be computed analytically. We also introduce a practical RMOD variant designed for efficient decoding with contemporary LLMs, incurring minimal computational overhead compared to non-robust Multi-Objective Decoding (MOD) methods. Our experimental results showcase the effectiveness of RMOD in generating responses equitably aligned with diverse objectives, outperforming baselines by up to 20%.

Comment: The paper proposes a novel inference-time algorithm for multi-objective decoding in LLMs, which aligns with the 'Large Language Models' criterion due to its focus on theoretical improvements in decoding strategies.

Relevance: 8 Novelty: 8
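
Illustrative sketch (not from the paper): one way to read the maximin game described in the abstract, assuming a KL-regularized objective over a small, fixed set of candidate responses. Under that assumption, the best-response policy for given weights is a reweighted softmax over the reference policy, and the worst-case weights minimize a convex log-partition function over the probability simplex. The function names, the temperature `beta`, the toy rewards, and the exponentiated-gradient solver are all hypothetical choices made for this example; the paper's actual algorithm may differ.

```python
import math

def best_response(ref_probs, rewards, weights, beta):
    """Closed-form policy for fixed weights: pi(y) ∝ pi_ref(y) * exp(w·r(y) / beta)."""
    scores = [sum(w * r for w, r in zip(weights, rs)) / beta for rs in rewards]
    top = max(scores)  # subtract the max for numerical stability
    unnorm = [p * math.exp(s - top) for p, s in zip(ref_probs, scores)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def worst_case_weights(ref_probs, rewards, beta=1.0, steps=500, lr=0.5):
    """Minimize F(w) = beta * log sum_y pi_ref(y) exp(w·r(y)/beta) over the simplex.

    F is convex in w, and its gradient in coordinate k is the expected
    reward E_pi[r_k] under the best-response policy. An exponentiated-gradient
    step keeps the weights non-negative and summing to one.
    """
    num_objectives = len(rewards[0])
    w = [1.0 / num_objectives] * num_objectives
    for _ in range(steps):
        pi = best_response(ref_probs, rewards, w, beta)
        grad = [sum(p * rs[k] for p, rs in zip(pi, rewards))
                for k in range(num_objectives)]
        w = [wk * math.exp(-lr * gk) for wk, gk in zip(w, grad)]
        norm = sum(w)
        w = [wk / norm for wk in w]
    return w

# Toy data (assumed): three candidate responses scored on two objectives.
rewards = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]
ref_probs = [1 / 3, 1 / 3, 1 / 3]
weights = worst_case_weights(ref_probs, rewards, beta=0.1)
policy = best_response(ref_probs, rewards, weights, beta=0.1)
```

In this toy setup the first two candidates each excel on one objective and fail on the other, so the robust policy is expected to favor the balanced third candidate.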

17. Training Plug-n-Play Knowledge Modules with Deep Context Distillation

ArXiv ID: 2503.08727

Authors: Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni

Abstract: Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KM parameters so as to simulate the hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques, across two datasets. Finally, we highlight synergies between KMs and retrieval-augmented generation.

Comment: The paper proposes a novel method for modularizing knowledge in LLMs using parameter-efficient LoRA modules, which aligns with the 'Large Language Models' criterion due to its focus on foundational improvements in knowledge integration.

Relevance: 8 Novelty: 8
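
Illustrative sketch (not from the paper): the abstract says the student is trained to match both the logits and the hidden states of a teacher that sees the document in context. The snippet below shows one plausible shape for such a loss on a single token position, assuming a KL term on the output distributions plus a mean absolute difference on per-layer hidden states. The function names, the choice of distance, and the `hidden_weight` factor are assumptions made for this example, not the authors' exact objective.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - lse for x in logits]

def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) between the two next-token distributions."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(t, s))

def hidden_state_gap(teacher_hidden, student_hidden):
    """Mean absolute difference across all layers and hidden dimensions."""
    total, count = 0.0, 0
    for t_layer, s_layer in zip(teacher_hidden, student_hidden):
        for a, b in zip(t_layer, s_layer):
            total += abs(a - b)
            count += 1
    return total / count

def distillation_loss(teacher_logits, student_logits,
                      teacher_hidden, student_hidden, hidden_weight=1.0):
    """Combine the logit term and the hidden-state term (weighting is assumed)."""
    return (kl_teacher_student(teacher_logits, student_logits)
            + hidden_weight * hidden_state_gap(teacher_hidden, student_hidden))

# Toy values (assumed): a 3-token vocabulary and two layers of width 2.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, 0.0]
teacher_hidden = [[0.1, 0.2], [0.3, 0.4]]
student_hidden = [[0.0, 0.2], [0.5, 0.4]]
loss = distillation_loss(teacher_logits, student_logits,
                         teacher_hidden, student_hidden)
perfect = distillation_loss(teacher_logits, teacher_logits,
                            teacher_hidden, teacher_hidden)
```

In a real setup the teacher would run with the document in its context and the student (base model plus KM) would run without it; only the KM parameters would receive gradients.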

18. Is CLIP ideal? No. Can we fix it? Yes!

ArXiv ID: 2503.08723

Authors: Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona

Abstract: Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP

Comment: The paper critiques the geometric limitations of CLIP's latent space and proposes a novel scoring method, aligning with representation learning and foundational model analysis.

Relevance: 8 Novelty: 8
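
Illustrative sketch (not from the paper): the abstract contrasts a single joint-embedding score with a dense map that keeps the layout of image patches and text tokens. The snippet below only computes such a map as a patch-by-token matrix of cosine similarities. The function names and toy embeddings are assumptions, and how the paper turns the map into a final score is not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def dense_cosine_similarity_map(patch_embeddings, token_embeddings):
    """Row i, column j holds the similarity of image patch i and text token j."""
    return [[cosine(p, t) for t in token_embeddings] for p in patch_embeddings]

# Toy embeddings (assumed): two image patches and two text tokens in 2-D.
patches = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [1.0, 1.0]]
sim_map = dense_cosine_similarity_map(patches, tokens)
```

Unlike a single pooled cosine score, the map keeps which patch matched which token, which is the kind of information attribute binding and spatial relations depend on.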

19. Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

ArXiv ID: 2503.09357

Authors: Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong

Abstract: As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies (data, model, sequence, and pipeline) have been successfully implemented for popular neural networks on mainstream hardware, optimizing the distributed deployment schedule requires extensive expertise and manual effort. Furthermore, while existing frameworks handle most simple chain-like structures, they struggle with complex non-linear architectures. Mixture-of-experts and multi-modal models feature intricate MIMO and branch-rich topologies that require fine-grained operator-level parallelization beyond the capabilities of existing frameworks. We propose formulating parallelism planning as a scheduling optimization problem using mixed-integer programming. We propose a bi-level solution framework balancing optimality with computational efficiency, automatically generating effective distributed plans that capture both the heterogeneous structure of modern neural networks and the underlying hardware constraints. In experiments comparing against expert-designed strategies like DeepSeek's DualPipe, our framework achieves comparable or superior performance, reducing computational bubbles by half under the same memory constraints. The framework's versatility extends beyond throughput optimization to incorporate hardware utilization maximization, memory capacity constraints, and other considerations or potential strategies. Such capabilities position our solution as both a valuable research tool for exploring optimal parallelization strategies and a practical industrial solution for large-scale AI deployment.

Comment: The paper focuses on distributed deep learning and operator-level parallelism planning, which is relevant to model efficiency and scalability. It introduces a novel mixed-integer programming approach, aligning with foundational research in model compression and efficiency.