Personalized Daily Arxiv Papers 03/13/2025
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 43136 | 6079 | 49215 |
| Cost | $0.1 | $0.06 | $0.16 |
Total ArXiv papers: 492
Total scanned papers: 290
Total relevant papers: 34
Table of contents with paper titles:
-
Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization Authors: Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu
-
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Authors: Yuhang Liu, Dong Gong, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi
-
Why LLMs Cannot Think and How to Fix It Authors: Marius Jahrens, Thomas Martinetz
-
Cost-Optimal Grouped-Query Attention for Long-Context LLMs Authors: Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun
-
GRU: Mitigating the Trade-off between Unlearning and Retention for Large Language Models Authors: Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, Bo Han
-
Towards Interpretable Protein Structure Prediction with Sparse Autoencoders Authors: Nithin Parsan, David J. Yang, John J. Yang
-
Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference Authors: Mohammad Siavashi, Faezeh Keshmiri Dindarloo, Dejan Kostic, Marco Chiesa
-
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment Authors: Xiaoda Yang, JunYu Lu, Hongshun Qiu, Sijing Li, Hao Li, Shengpeng Ji, Xudong Tang, Jiayang Xu, Jiaqi Duan, Ziyue Jiang, Cong Lin, Sihang Cai, Zejian Xie, Zhuoyang Song, Songxin Zhang
-
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference Authors: Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker
-
Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States Authors: Xin Wei Chia, Jonathan Pan
-
Discovering Influential Neuron Path in Vision Transformers Authors: Yifan Wang, Yifei Liu, Yingdong Shi, Changming Li, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
-
Online multidimensional dictionary learning Authors: Ferdaous Ait Addi, Abdeslem Hafid Bentbib, Khalide Jbilou
-
Implicit Contrastive Representation Learning with Guided Stop-gradient Authors: Byeongchan Lee, Sehyun Lee
-
Interpreting the Repeated Token Phenomenon in Large Language Models Authors: Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, Yossi Gandelsman
-
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Authors: Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
-
Robust Multi-Objective Controlled Decoding of Large Language Models Authors: Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
-
Training Plug-n-Play Knowledge Modules with Deep Context Distillation Authors: Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vuli\'c, Alessandro Sordoni
-
Is CLIP ideal? No. Can we fix it? Yes! Authors: Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
-
Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach Authors: Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong
-
Quantitative Analysis of Deeply Quantized Tiny Neural Networks Robust to Adversarial Attacks Authors: Idris Zakariyya, Ferheen Ayaz, Mounia Kharbouche-Harrari, Jeremy Singer, Sye Loong Keoh, Danilo Pau, Jos\'e Cano
-
SO(3)-Equivariant Neural Networks for Learning Vector Fields on Spheres Authors: Francesco Ballerin, Nello Blaser, Erlend Grong
-
Learning Spatially Adaptive $\ell_1$-Norms Weights for Convolutional Synthesis Regularization Authors: Andreas Kofler, Luca Calatroni, Christoph Kolbitsch, Kostas Papafitsoros
-
Adaptive Temperature Based on Logits Correlation in Knowledge Distillation Authors: Kazuhiro Matsuyama, Usman Anjum, Satoko Matsuyama, Tetsuo Shoda, Justin Zhan
-
A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation Authors: Forough Fazeliasl, Michael Minyi Zhang, Bei Jiang, Linglong Kong
-
Neural Normalized Cut: A Differential and Generalizable Approach for Spectral Clustering Authors: Wei He, Shangzhi Zhang, Chun-Guang Li, Xianbiao Qi, Rong Xiao, Jun Guo
-
PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs Authors: Oskar van der Wal, Pietro Lesci, Max Muller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, Stella Biderman
-
Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation Authors: Yu Wang, Jiaxin Zhang, Xiang Gao, Wendi Cui, Peng Li, Kamalika Das
-
Learning Pareto manifolds in high dimensions: How can regularization help? Authors: Tobias Wegel, Filip Kova\v{c}evi\'c, Alexandru \c{T}ifrea, Fanny Yang
-
Neurosymbolic Decision Trees Authors: Matthias M\"oller, Arvid Norlander, Pedro Zuidberg Dos Martires, Luc De Raedt
-
Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization Authors: Amit Attia, Tomer Koren
-
Exploiting Unstructured Sparsity in Fully Homomorphic Encrypted DNNs Authors: Aidan Ferguson, Perry Gibson, Lara D'Agata, Parker McLeod, Ferhat Yaman, Amitabh Das, Ian Colbert, Jos\'e Cano
-
Towards Robust Multimodal Representation: A Unified Approach with Adaptive Experts and Alignment Authors: Nazanin Moradinasab, Saurav Sengupta, Jiebei Liu, Sana Syed, Donald E. Brown
-
Towards Graph Foundation Models: A Transferability Perspective Authors: Yuxiang Wang, Wenqi Fan, Suhang Wang, Yao Ma
-
The Shape of Attraction in UMAP: Exploring the Embedding Forces in Dimensionality Reduction Authors: Mohammad Tariqul Islam, Jason W. Fleischer
1. Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization
ArXiv ID: 2503.09565
Authors: Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu
Abstract: Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
Comment: The paper provides theoretical insights into training dynamics and feature learning in infinite-width neural networks, aligning strongly with representation learning and training dynamics.
Relevance: 10 Novelty: 9
2. I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
ArXiv ID: 2503.08980
Authors: Yuhang Liu, Dong Gong, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi
Abstract: The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This is as opposed to explanations of their capabilities based on their ability to perform relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also strongly reinforces the linear representation hypothesis, which posits that LLMs learn linear representations of human-interpretable concepts. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.
Comment: The paper provides theoretical insights into how LLMs learn human-interpretable concepts, aligning with foundational research in representation learning and LLM behavior.
Relevance: 9 Novelty: 9
3. Why LLMs Cannot Think and How to Fix It
ArXiv ID: 2503.09211
Authors: Marius Jahrens, Thomas Martinetz
Abstract: This paper elucidates that current state-of-the-art Large Language Models (LLMs) are fundamentally incapable of making decisions or developing "thoughts" within the feature space due to their architectural constraints. We establish a definition of "thought" that encompasses traditional understandings of that term and adapt it for application to LLMs. We demonstrate that the architectural design and language modeling training methodology of contemporary LLMs inherently preclude them from engaging in genuine thought processes. Our primary focus is on this theoretical realization rather than practical insights derived from experimental data. Finally, we propose solutions to enable thought processes within the feature space and discuss the broader implications of these architectural modifications.
Comment: The paper critiques the architectural limitations of LLMs and proposes solutions to enable 'thought processes,' aligning with foundational research on LLM architecture.
Relevance: 9 Novelty: 9
4. Cost-Optimal Grouped-Query Attention for Long-Context LLMs
ArXiv ID: 2503.09579
Authors: Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract: Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, requiring maximizing model language capabilities and minimizing training and deployment costs. Existing efforts have primarily described complex relationships among model performance, parameter size, and data size, as well as searched for the optimal compute allocation to train LLMs. However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.
Comment: The paper explores cost-optimal grouped-query attention for long-context LLMs, which aligns with foundational research in model architecture and efficiency. It provides insights into attention head configurations and scaling laws.
Relevance: 9 Novelty: 8
5. GRU: Mitigating the Trade-off between Unlearning and Retention for Large Language Models
ArXiv ID: 2503.09117
Authors: Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, Bo Han
Abstract: Large language model (LLM) unlearning has demonstrated its essential role in removing privacy and copyright-related responses, crucial for their legal and safe applications. However, the pursuit of complete unlearning often comes with substantial costs due to its compromises in their general functionality, leading to a notorious trade-off between unlearning and retention. In examining the update process for unlearning dynamically, we find gradients hold essential information for revealing this trade-off. In particular, we look at the varying relationship between retention performance and directional disparities between gradients during unlearning. It motivates the sculpting of an update mechanism derived from gradients from two sources, i.e., harmful for retention and useful for unlearning. Accordingly, we propose Gradient Rectified Unlearning (GRU), an enhanced unlearning framework controlling the updating gradients in a geometry-focused and optimization-driven manner such that their side impacts on other, unrelated responses can be minimized. Specifically, GRU derives a closed-form solution to project the unlearning gradient onto the orthogonal space of that gradient harmful for retention, ensuring minimal deviation from its original direction under the condition that overall performance is retained. Comprehensive experiments are conducted to demonstrate that GRU, as a general framework, is straightforward to implement and efficiently enhances a range of baseline methods through its adaptable and compatible characteristics. Additionally, experimental results show its broad effectiveness across a diverse set of benchmarks for LLM unlearning.
Comment: The paper introduces Gradient Rectified Unlearning (GRU) for LLMs, focusing on unlearning while retaining general functionality. This aligns with foundational advancements in LLM behavior and optimization.
Relevance: 9 Novelty: 8
6. Towards Interpretable Protein Structure Prediction with Sparse Autoencoders
ArXiv ID: 2503.08764
Authors: Nithin Parsan, David J. Yang, John J. Yang
Abstract: Protein language models have revolutionized structure prediction, but their nonlinear nature obscures how sequence representations inform structure prediction. While sparse autoencoders (SAEs) offer a path to interpretability here by learning linear representations in high-dimensional space, their application has been limited to smaller protein language models unable to perform structure prediction. In this work, we make two key advances: (1) we scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time, and (2) we adapt Matryoshka SAEs for protein language models, which learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We demonstrate that our Matryoshka SAEs achieve comparable or better performance than standard architectures. Through comprehensive evaluations, we show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction. Finally, we present an initial case study demonstrating how our approach enables targeted steering of ESMFold predictions, increasing structure solvent accessibility while fixing the input sequence. To facilitate further investigation by the broader community, we open-source our code, dataset, pretrained models https://github.com/johnyang101/reticular-sae , and visualizer https://sae.reticular.ai .
Comment: The paper scales sparse autoencoders to large protein language models, enabling interpretability in protein structure prediction. This aligns with foundational research in representation learning and AI for science.
Relevance: 9 Novelty: 8
7. Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
ArXiv ID: 2503.09304
Authors: Mohammad Siavashi, Faezeh Keshmiri Dindarloo, Dejan Kostic, Marco Chiesa
Abstract: Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference systems employ iteration-level first-come-first-served scheduling, causing head-of-line blocking when BE jobs delay LS jobs. We introduce QLLM, a novel inference system designed for Mixture of Experts (MoE) models, featuring a fine-grained, priority-aware preemptive scheduler. QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT). Our approach removes iteration-level scheduling constraints, enabling the scheduler to preempt jobs at any layer based on priority. Evaluations on an Nvidia A100 GPU show that QLLM significantly improves performance. It reduces LS TTFT by an average of $65.5\times$ and meets the SLO at up to $7$ requests/sec, whereas the baseline fails to do so under the tested workload. Additionally, it cuts LS turnaround time by up to $12.8\times$ without impacting throughput. QLLM is modular, extensible, and seamlessly integrates with Hugging Face MoE models.
Comment: The paper introduces a priority-aware preemptive scheduling system for MoE inference, which aligns with architectural innovations in MoE models.
Relevance: 9 Novelty: 8
8. Astrea: A MOE-based Visual Understanding Model with Progressive Alignment
ArXiv ID: 2503.09445
Authors: Xiaoda Yang, JunYu Lu, Hongshun Qiu, Sijing Li, Hao Li, Shengpeng Ji, Xudong Tang, Jiayang Xu, Jiaqi Duan, Ziyue Jiang, Cong Lin, Sihang Cai, Zejian Xie, Zhuoyang Song, Songxin Zhang
Abstract: Vision-Language Models (VLMs) based on Mixture-of-Experts (MoE) architectures have emerged as a pivotal paradigm in multimodal understanding, offering a powerful framework for integrating visual and linguistic information. However, the increasing complexity and diversity of tasks present significant challenges in coordinating load balancing across heterogeneous visual experts, where optimizing one specialist's performance often compromises others' capabilities. To address task heterogeneity and expert load imbalance, we propose Astrea, a novel multi-expert collaborative VLM architecture based on progressive pre-alignment. Astrea introduces three key innovations: 1) A heterogeneous expert coordination mechanism that integrates four specialized models (detection, segmentation, classification, captioning) into a comprehensive expert matrix covering essential visual comprehension elements; 2) A dynamic knowledge fusion strategy featuring progressive pre-alignment to harmonize experts within the VLM latent space through contrastive learning, complemented by probabilistically activated stochastic residual connections to preserve knowledge continuity; 3) An enhanced optimization framework utilizing momentum contrastive learning for long-range dependency modeling and adaptive weight allocators for real-time expert contribution calibration. Extensive evaluations across 12 benchmark tasks spanning VQA, image captioning, and cross-modal retrieval demonstrate Astrea's superiority over state-of-the-art models, achieving an average performance gain of +4.7\%. This study provides the first empirical demonstration that progressive pre-alignment strategies enable VLMs to overcome task heterogeneity limitations, establishing new methodological foundations for developing general-purpose multimodal agents.
Comment: The paper introduces a MoE-based visual understanding model, which aligns with the model architecture criterion, particularly focusing on MoE innovations.
Relevance: 9 Novelty: 8
9. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
ArXiv ID: 2503.08879
Authors: Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker
Abstract: Efficient long-context inference is critical as large language models (LLMs) adopt context windows of ranging from 128K to 1M tokens. However, the growing key-value (KV) cache and the high computational complexity of attention create significant bottlenecks in memory usage and latency. In this paper, we find that attention in diverse long-context tasks exhibits sparsity, and LLMs implicitly "know" which tokens can be dropped or evicted at the head level after the pre-filling stage. Based on this insight, we propose Self-Attention Guided Eviction~(SAGE-KV), a simple and effective KV eviction cache method for long-context inference. After prefilling, our method performs a one-time top-k selection at both the token and head levels to compress the KV cache, enabling efficient inference with the reduced cache. Evaluations on LongBench and three long-context LLMs (Llama3.1-8B-Instruct-128k, Llama3-8B-Prolong-512k-Instruct, and Qwen2.5-7B-Instruct-128k) show that SAGE-KV maintains accuracy comparable to full attention while significantly improving efficiency. Specifically, SAGE-KV achieves 4x higher memory efficiency with improved accuracy over the static KV cache selection method StreamLLM, and 2x higher memory efficiency with better accuracy than the dynamic KV cache selection method Quest.
Comment: The paper introduces a KV cache eviction method for efficient long-context inference in LLMs, aligning with the 'Model Compression' criterion due to its focus on memory efficiency and sparsity.
Relevance: 9 Novelty: 8
10. Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States
ArXiv ID: 2503.09066
Authors: Xin Wei Chia, Jonathan Pan
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to adversarial manipulations such as jailbreaking via prompt injection attacks. These attacks bypass safety mechanisms to generate restricted or harmful content. In this study, we investigated the underlying latent subspaces of safe and jailbroken states by extracting hidden activations from a LLM. Inspired by attractor dynamics in neuroscience, we hypothesized that LLM activations settle into semi stable states that can be identified and perturbed to induce state transitions. Using dimensionality reduction techniques, we projected activations from safe and jailbroken responses to reveal latent subspaces in lower dimensional spaces. We then derived a perturbation vector that when applied to safe representations, shifted the model towards a jailbreak state. Our results demonstrate that this causal intervention results in statistically significant jailbreak responses in a subset of prompts. Next, we probed how these perturbations propagate through the model's layers, testing whether the induced state change remains localized or cascades throughout the network. Our findings indicate that targeted perturbations induced distinct shifts in activations and model responses. Our approach paves the way for potential proactive defenses, shifting from traditional guardrail based methods to preemptive, model agnostic techniques that neutralize adversarial states at the representation level.
Comment: The paper explores latent subspaces in LLMs for adversarial state manipulation, aligning with the 'Large Language Models' criterion due to its focus on interpretability and theoretical insights.
Relevance: 9 Novelty: 8
11. Discovering Influential Neuron Path in Vision Transformers
ArXiv ID: 2503.09046
Authors: Yifan Wang, Yifei Liu, Yingdong Shi, Changming Li, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Abstract: Vision Transformer models exhibit immense power yet remain opaque to human understanding, posing challenges and risks for practical applications. While prior research has attempted to demystify these models through input attribution and neuron role analysis, there's been a notable gap in considering layer-level information and the holistic path of information flow across layers. In this paper, we investigate the significance of influential neuron paths within vision Transformers, which is a path of neurons from the model input to output that impacts the model inference most significantly. We first propose a joint influence measure to assess the contribution of a set of neurons to the model outcome. And we further provide a layer-progressive neuron locating approach that efficiently selects the most influential neuron at each layer trying to discover the crucial neuron path from input to output within the target model. Our experiments demonstrate the superiority of our method finding the most influential neuron path along which the information flows, over the existing baseline solutions. Additionally, the neuron paths have illustrated that vision Transformers exhibit some specific inner working mechanism for processing the visual information within the same image category. We further analyze the key effects of these neurons on the image classification task, showcasing that the found neuron paths have already preserved the model capability on downstream tasks, which may also shed some lights on real-world applications like model pruning. The project website including implementation code is available at https://foundation-model-research.github.io/NeuronPath/.
Comment: The paper investigates influential neuron paths in Vision Transformers, which aligns with understanding model architecture and interpretability. It provides insights into the inner workings of Transformers.
Relevance: 9 Novelty: 8
12. Online multidimensional dictionary learning
ArXiv ID: 2503.09337
Authors: Ferdaous Ait Addi, Abdeslem Hafid Bentbib, Khalide Jbilou
Abstract: Dictionary learning is a widely used technique in signal processing and machine learning that aims to represent data as a linear combination of a few elements from an overcomplete dictionary. In this work, we propose a generalization of the dictionary learning technique using the t-product framework, enabling efficient handling of multidimensional tensor data. We address the dictionary learning problem through online methods suitable for tensor structures. To effectively address the sparsity problem, we utilize an accelerated Iterative Shrinkage-Thresholding Algorithm (ISTA) enhanced with an extrapolation technique known as Anderson acceleration. This approach significantly improves signal reconstruction results. Extensive experiments prove that our proposed method outperforms existing acceleration techniques, particularly in applications such as data completion. These results suggest that our approach can be highly beneficial for large-scale tensor data analysis in various domains.
Comment: The paper focuses on online multidimensional dictionary learning, which is directly relevant to representation learning and sparse methods. It introduces a novel acceleration technique.
Relevance: 9 Novelty: 8
13. Implicit Contrastive Representation Learning with Guided Stop-gradient
ArXiv ID: 2503.09058
Authors: Byeongchan Lee, Sehyun Lee
Abstract: In self-supervised representation learning, Siamese networks are a natural architecture for learning transformation-invariance by bringing representations of positive pairs closer together. But it is prone to collapse into a degenerate solution. To address the issue, in contrastive learning, a contrastive loss is used to prevent collapse by moving representations of negative pairs away from each other. But it is known that algorithms with negative sampling are not robust to a reduction in the number of negative samples. So, on the other hand, there are algorithms that do not use negative pairs. Many positive-only algorithms adopt asymmetric network architecture consisting of source and target encoders as a key factor in coping with collapse. By exploiting the asymmetric architecture, we introduce a methodology to implicitly incorporate the idea of contrastive learning. As its implementation, we present a novel method guided stop-gradient. We apply our method to benchmark algorithms SimSiam and BYOL and show that our method stabilizes training and boosts performance. We also show that the algorithms with our method work well with small batch sizes and do not collapse even when there is no predictor. The code is available at https://github.com/bych-lee/gsg.
Comment: The paper introduces a novel method for implicit contrastive representation learning, which aligns with representation learning and training dynamics. It provides methodological advancements.
Relevance: 9 Novelty: 8
14. Interpreting the Repeated Token Phenomenon in Large Language Models
ArXiv ID: 2503.08908
Authors: Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, Yossi Gandelsman
Abstract: Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when prompted to, and instead output unrelated text. This unexplained failure mode represents a vulnerability, allowing even end-users to diverge models away from their intended behavior. We aim to explain the causes for this phenomenon and link it to the concept of ``attention sinks'', an emergent LLM behavior crucial for fluency, in which the initial token receives disproportionately high attention scores. Our investigation identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt this circuit. We extend this finding to other non-repeating sequences that exhibit similar circuit disruptions. To address this, we propose a targeted patch that effectively resolves the issue without negatively impacting the model's overall performance. This study provides a mechanistic explanation for an LLM vulnerability, demonstrating how interpretability can diagnose and address issues, and offering insights that pave the way for more secure and reliable models.
Comment: The paper provides a mechanistic explanation for a specific failure mode in LLMs and proposes a targeted patch, aligning with the criterion of theoretical insights into LLM behavior.
Relevance: 9 Novelty: 8
15. SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
ArXiv ID: 2503.09532
Authors: Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
Abstract: Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at: https://saebench.xyz
Comment: The paper introduces a benchmark for sparse autoencoders, which aligns with the representation learning criterion. The focus on interpretability and feature disentanglement is relevant.
Relevance: 9 Novelty: 7
16. Robust Multi-Objective Controlled Decoding of Large Language Models
ArXiv ID: 2503.08796
Authors: Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
Abstract: Test-time alignment of Large Language Models (LLMs) to human preferences offers a flexible way to generate responses aligned to diverse objectives without extensive retraining of LLMs. Existing methods achieve alignment to multiple objectives simultaneously (e.g., instruction-following, helpfulness, conciseness) by optimizing their corresponding reward functions. However, they often rely on predefined weights or optimize for averages, sacrificing one objective for another and leading to unbalanced outcomes. To address this, we introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that optimizes for improving worst-case rewards. RMOD formalizes the robust decoding problem as a maximin two-player game between reward weights and the sampling policy, solving for the Nash equilibrium. We show that the game reduces to a convex optimization problem to find the worst-case weights, while the best response policy can be computed analytically. We also introduce a practical RMOD variant designed for efficient decoding with contemporary LLMs, incurring minimal computational overhead compared to non-robust Multi-Objective Decoding (MOD) methods. Our experimental results showcase the effectiveness of RMOD in generating responses equitably aligned with diverse objectives, outperforming baselines up to 20%.
Comment: The paper proposes a novel inference-time algorithm for multi-objective decoding in LLMs, which aligns with the 'Large Language Models' criterion due to its focus on theoretical improvements in decoding strategies.
Relevance: 8 Novelty: 8
17. Training Plug-n-Play Knowledge Modules with Deep Context Distillation
ArXiv ID: 2503.08727
Authors: Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vuli\'c, Alessandro Sordoni
Abstract: Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KMs parameters such as to simulate hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques, across two datasets. Finally, we highlight synergies between KMs and retrieval-augmented generation.
Comment: The paper proposes a novel method for modularizing knowledge in LLMs using parameter-efficient LoRA modules, which aligns with the 'Large Language Models' criterion due to its focus on foundational improvements in knowledge integration.
Relevance: 8 Novelty: 8
18. Is CLIP ideal? No. Can we fix it? Yes!
ArXiv ID: 2503.08723
Authors: Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
Abstract: Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP
Comment: The paper critiques the geometric limitations of CLIP's latent space and proposes a novel scoring method, aligning with representation learning and foundational model analysis.
Relevance: 8 Novelty: 8
19. Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach
ArXiv ID: 2503.09357
Authors: Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong
Abstract: As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and pipeline-have been successfully implemented for popular neural networks on main-stream hardware, optimizing the distributed deployment schedule requires extensive expertise and manual effort. Further more, while existing frameworks with most simple chain-like structures, they struggle with complex non-linear architectures. Mixture-of-experts and multi-modal models feature intricate MIMO and branch-rich topologies that require fine-grained operator-level parallelization beyond the capabilities of existing frameworks. We propose formulating parallelism planning as a scheduling optimization problem using mixed-integer programming. We propose a bi-level solution framework balancing optimality with computational efficiency, automatically generating effective distributed plans that capture both the heterogeneous structure of modern neural networks and the underlying hardware constraints. In experiments comparing against expert-designed strategies like DeepSeek's DualPipe, our framework achieves comparable or superior performance, reducing computational bubbles by half under the same memory constraints. The framework's versatility extends beyond throughput optimization to incorporate hardware utilization maximization, memory capacity constraints, and other considerations or potential strategies. Such capabilities position our solution as both a valuable research tool for exploring optimal parallelization strategies and a practical industrial solution for large-scale AI deployment.
Comment: The paper focuses on distributed deep learning and operator-level parallelism planning, which is relevant to model efficiency and scalability. It introduces a novel mixed-integer programming approach, aligning with foundational research in model compression and efficiency.
Relevance: 8 Novelty: 7
20. Quantitative Analysis of Deeply Quantized Tiny Neural Networks Robust to Adversarial Attacks
ArXiv ID: 2503.08973
Authors: Idris Zakariyya, Ferheen Ayaz, Mounia Kharbouche-Harrari, Jeremy Singer, Sye Loong Keoh, Danilo Pau, Jos\'e Cano
Abstract: Reducing the memory footprint of Machine Learning (ML) models, especially Deep Neural Networks (DNNs), is imperative to facilitate their deployment on resource-constrained edge devices. However, a notable drawback of DNN models lies in their susceptibility to adversarial attacks, wherein minor input perturbations can deceive them. A primary challenge revolves around the development of accurate, resilient, and compact DNN models suitable for deployment on resource-constrained edge devices. This paper presents the outcomes of a compact DNN model that exhibits resilience against both black-box and white-box adversarial attacks. This work has achieved this resilience through training with the QKeras quantization-aware training framework. The study explores the potential of QKeras and an adversarial robustness technique, Jacobian Regularization (JR), to co-optimize the DNN architecture through per-layer JR methodology. As a result, this paper has devised a DNN model employing this co-optimization strategy based on Stochastic Ternary Quantization (STQ). Its performance was compared against existing DNN models in the face of various white-box and black-box attacks. The experimental findings revealed that, the proposed DNN model had small footprint and on average, it exhibited better performance than Quanos and DS-CNN MLCommons/TinyML (MLC/T) benchmarks when challenged with white-box and black-box attacks, respectively, on the CIFAR-10 image and Google Speech Commands audio datasets.
Comment: The paper explores quantization-aware training and adversarial robustness in tiny neural networks, which aligns with model compression and efficiency breakthroughs.
Relevance: 8 Novelty: 7
21. SO(3)-Equivariant Neural Networks for Learning Vector Fields on Spheres
ArXiv ID: 2503.09456
Authors: Francesco Ballerin, Nello Blaser, Erlend Grong
Abstract: Analyzing vector fields on the sphere, such as wind speed and direction on Earth, is a difficult task. Models should respect both the rotational symmetries of the sphere and the inherent symmetries of the vector fields. In this paper, we introduce a deep learning architecture that respects both symmetry types using novel techniques based on group convolutions in the 3-dimensional rotation group. This architecture is suitable for scalar and vector fields on the sphere as they can be described as equivariant signals on the 3-dimensional rotation group. Experiments show that our architecture achieves lower prediction and reconstruction error when tested on rotated data compared to both standard CNNs and spherical CNNs.
Comment: The paper introduces SO(3)-equivariant neural networks for learning vector fields on spheres, which involves architectural innovation and symmetry-aware modeling.
Relevance: 8 Novelty: 7
22. Learning Spatially Adaptive $\ell_1$-Norms Weights for Convolutional Synthesis Regularization
ArXiv ID: 2503.09483
Authors: Andreas Kofler, Luca Calatroni, Christoph Kolbitsch, Kostas Papafitsoros
Abstract: We propose an unrolled algorithm approach for learning spatially adaptive parameter maps in the framework of convolutional synthesis-based $\ell_1$ regularization. More precisely, we consider a family of pre-trained convolutional filters and estimate deeply parametrized spatially varying parameters applied to the sparse feature maps by means of unrolling a FISTA algorithm to solve the underlying sparse estimation problem. The proposed approach is evaluated for image reconstruction of low-field MRI and compared to spatially adaptive and non-adaptive analysis-type procedures relying on Total Variation regularization and to a well-established model-based deep learning approach. We show that the proposed approach produces visually and quantitatively comparable results with the latter approaches and at the same time remains highly interpretable. In particular, the inferred parameter maps quantify the local contribution of each filter in the reconstruction, which provides valuable insight into the algorithm mechanism and could potentially be used to discard unsuited filters.
Comment: The paper proposes an unrolled algorithm for learning spatially adaptive parameters in convolutional synthesis regularization, which aligns with representation learning and sparse methods.
Relevance: 8 Novelty: 7
23. Adaptive Temperature Based on Logits Correlation in Knowledge Distillation
ArXiv ID: 2503.09030
Authors: Kazuhiro Matsuyama, Usman Anjum, Satoko Matsuyama, Tetsuo Shoda, Justin Zhan
Abstract: Knowledge distillation is a technique to imitate a performance that a deep learning model has, but reduce the size on another model. It applies the outputs of a model to train another model having comparable accuracy. These two distinct models are similar to the way information is delivered in human society, with one acting as the "teacher" and the other as the "student". Softmax plays a role in comparing logits generated by models with each other by converting probability distributions. It delivers the logits of a teacher to a student with compression through a parameter named temperature. Tuning this variable reinforces the distillation performance. Although only this parameter helps with the interaction of logits, it is not clear how temperatures promote information transfer. In this paper, we propose a novel approach to calculate the temperature. Our method only refers to the maximum logit generated by a teacher model, which reduces computational time against state-of-the-art methods. Our method shows a promising result in different student and teacher models on a standard benchmark dataset. Algorithms using temperature can obtain the improvement by plugging in this dynamic approach. Furthermore, the approximation of the distillation process converges to a correlation of logits by both models. This reinforces the previous argument that the distillation conveys the relevance of logits. We report that this approximating algorithm yields a higher temperature compared to the commonly used static values in testing.
Comment: The paper proposes an adaptive temperature method for knowledge distillation, which aligns with model compression and efficiency improvements.
Relevance: 8 Novelty: 7
24. A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation
ArXiv ID: 2503.08902
Authors: Forough Fazeliasl, Michael Minyi Zhang, Bei Jiang, Linglong Kong
Abstract: Mutual Information (MI) is a crucial measure for capturing dependencies between variables, but exact computation is challenging in high dimensions with intractable likelihoods, impacting accuracy and robustness. One idea is to use an auxiliary neural network to train an MI estimator; however, methods based on the empirical distribution function (EDF) can introduce sharp fluctuations in the MI loss due to poor out-of-sample performance, destabilizing convergence. We present a Bayesian nonparametric (BNP) solution for training an MI estimator by constructing the MI loss with a finite representation of the Dirichlet process posterior to incorporate regularization in the training process. With this regularization, the MI loss integrates both prior knowledge and empirical data to reduce the loss sensitivity to fluctuations and outliers in the sample data, especially in small sample settings like mini-batches. This approach addresses the challenge of balancing accuracy and low variance by effectively reducing variance, leading to stabilized and robust MI loss gradients during training and enhancing the convergence of the MI approximation while offering stronger theoretical guarantees for convergence. We explore the application of our estimator in maximizing MI between the data space and the latent space of a variational autoencoder. Experimental results demonstrate significant improvements in convergence over EDF-based methods, with applications across synthetic and real datasets, notably in 3D CT image generation, yielding enhanced structure discovery and reduced overfitting in data synthesis. While this paper focuses on generative models in application, the proposed estimator is not restricted to this setting and can be applied more broadly in various BNP learning procedures.
Comment: The paper introduces a Bayesian nonparametric framework for mutual information estimation, which aligns with representation learning and theoretical insights into training dynamics.
Relevance: 8 Novelty: 7
25. Neural Normalized Cut: A Differential and Generalizable Approach for Spectral Clustering
ArXiv ID: 2503.09260
Authors: Wei He, Shangzhi Zhang, Chun-Guang Li, Xianbiao Qi, Rong Xiao, Jun Guo
Abstract: Spectral clustering, as a popular tool for data clustering, requires an eigen-decomposition step on a given affinity to obtain the spectral embedding. Nevertheless, such a step suffers from the lack of generalizability and scalability. Moreover, the obtained spectral embeddings can hardly provide a good approximation to the ground-truth partition and thus a k-means step is adopted to quantize the embedding. In this paper, we propose a simple yet effective scalable and generalizable approach, called Neural Normalized Cut (NeuNcut), to learn the clustering membership for spectral clustering directly. In NeuNcut, we properly reparameterize the unknown cluster membership via a neural network, and train the neural network via stochastic gradient descent with a properly relaxed normalized cut loss. As a result, our NeuNcut enjoys a desired generalization ability to directly infer clustering membership for out-of-sample unseen data and hence brings us an efficient way to handle clustering task with ultra large-scale data. We conduct extensive experiments on both synthetic data and benchmark datasets and experimental results validate the effectiveness and the superiority of our approach. Our code is available at: https://github.com/hewei98/NeuNcut.
Comment: The paper proposes a neural approach to spectral clustering, which aligns with representation learning and introduces a novel method for clustering membership.
Relevance: 8 Novelty: 7
26. PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs
ArXiv ID: 2503.09543
Authors: Oskar van der Wal, Pietro Lesci, Max Muller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, Stella Biderman
Abstract: The stability of language model pre-training and its effects on downstream performance are still understudied. Prior work shows that the training process can yield significantly different results in response to slight variations in initial conditions, e.g., the random seed. Crucially, the research community still lacks sufficient resources and tools to systematically investigate pre-training stability, particularly for decoder-only language models. We introduce the PolyPythias, a set of 45 new training runs for the Pythia model suite: 9 new seeds across 5 model sizes, from 14M to 410M parameters, resulting in about 7k new checkpoints that we release. Using these new 45 training runs, in addition to the 5 already available, we study the effects of different initial conditions determined by the seed -- i.e., parameters' initialisation and data order -- on (i) downstream performance, (ii) learned linguistic representations, and (iii) emergence of training phases. In addition to common scaling behaviours, our analyses generally reveal highly consistent training dynamics across both model sizes and initial conditions. Further, the new seeds for each model allow us to identify outlier training runs and delineate their characteristics. Our findings show the potential of using these methods to predict training stability.
Comment: The paper studies the stability of language model pre-training, which aligns with foundational research on LLM behavior and training dynamics.
Relevance: 8 Novelty: 7
27. Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation
ArXiv ID: 2503.08963
Authors: Yu Wang, Jiaxin Zhang, Xiang Gao, Wendi Cui, Peng Li, Kamalika Das
Abstract: In tasks like summarization and open-book question answering (QA), Large Language Models (LLMs) often encounter "contextual hallucination", where they produce irrelevant or incorrect responses despite having access to accurate source information. This typically occurs because these models tend to prioritize self-generated content over the input context, causing them to disregard pertinent details. To address this challenge, we introduce a novel method called "Guided Attention Map Editing" (GAME), which dynamically adjusts attention maps to improve contextual relevance. During inference, GAME employs a trained classifier to identify attention maps prone to inducing hallucinations and executes targeted interventions. These interventions, guided by gradient-informed "edit directions'', strategically redistribute attention weights across various heads to effectively reduce hallucination. Comprehensive evaluations on challenging summarization and open-book QA tasks show that GAME consistently reduces hallucinations across a variety of open-source models. Specifically, GAME reduces hallucinations by 10% in the XSum summarization task while achieving a 7X speed-up in computational efficiency compared to the state-of-the-art baselines.
Comment: The paper proposes a method to mitigate contextual hallucination in LLMs by dynamically adjusting attention maps, aligning with the 'Large Language Models' criterion due to its focus on improving model behavior.
Relevance: 8 Novelty: 7
28. Learning Pareto manifolds in high dimensions: How can regularization help?
ArXiv ID: 2503.08849
Authors: Tobias Wegel, Filip Kova\v{c}evi\'c, Alexandru \c{T}ifrea, Fanny Yang
Abstract: Simultaneously addressing multiple objectives is becoming increasingly important in modern machine learning. At the same time, data is often high-dimensional and costly to label. For a single objective such as prediction risk, conventional regularization techniques are known to improve generalization when the data exhibits low-dimensional structure like sparsity. However, it is largely unexplored how to leverage this structure in the context of multi-objective learning (MOL) with multiple competing objectives. In this work, we discuss how the application of vanilla regularization approaches can fail, and propose a two-stage MOL framework that can successfully leverage low-dimensional structure. We demonstrate its effectiveness experimentally for multi-distribution learning and fairness-risk trade-offs.
Comment: The paper discusses leveraging low-dimensional structure in multi-objective learning, which aligns with representation learning and training dynamics. It introduces a novel framework for regularization in high-dimensional settings.
Relevance: 8 Novelty: 7
29. Neurosymbolic Decision Trees
ArXiv ID: 2503.08762
Authors: Matthias M\"oller, Arvid Norlander, Pedro Zuidberg Dos Martires, Luc De Raedt
Abstract: Neurosymbolic (NeSy) AI studies the integration of neural networks (NNs) and symbolic reasoning based on logic. Usually, NeSy techniques focus on learning the neural, probabilistic and/or fuzzy parameters of NeSy models. Learning the symbolic or logical structure of such models has, so far, received less attention. We introduce neurosymbolic decision trees (NDTs), as an extension of decision trees together with a novel NeSy structure learning algorithm, which we dub NeuID3. NeuID3 adapts the standard top-down induction of decision tree algorithms and combines it with a neural probabilistic logic representation, inherited from the DeepProbLog family of models. The key advantage of learning NDTs with NeuID3 is the support of both symbolic and subsymbolic data (such as images), and that they can exploit background knowledge during the induction of the tree structure, In our experimental evaluation we demonstrate the benefits of NeSys structure learning over more traditonal approaches such as purely data-driven learning with neural networks.
Comment: The paper introduces neurosymbolic decision trees, which is a novel approach combining symbolic reasoning and neural networks. It aligns with emerging trends in model architecture.
Relevance: 8 Novelty: 7
30. Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization
ArXiv ID: 2503.09411
Authors: Amit Attia, Tomer Koren
Abstract: The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $\rho$ (i.e., the grid resolution), achieving a rate of $O(\rho^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(\rho/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $\rho$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, that has significant implications for the computational overhead of hyperparameter search in practical training scenarios.
Comment: The paper analyzes learning rate annealing for tuning robustness, which aligns with training dynamics in neural networks. It provides theoretical insights into optimization.
Relevance: 8 Novelty: 7
31. Exploiting Unstructured Sparsity in Fully Homomorphic Encrypted DNNs
ArXiv ID: 2503.09184
Authors: Aidan Ferguson, Perry Gibson, Lara D'Agata, Parker McLeod, Ferhat Yaman, Amitabh Das, Ian Colbert, Jos\'e Cano
Abstract: The deployment of deep neural networks (DNNs) in privacy-sensitive environments is constrained by computational overheads in fully homomorphic encryption (FHE). This paper explores unstructured sparsity in FHE matrix multiplication schemes as a means of reducing this burden while maintaining model accuracy requirements. We demonstrate that sparsity can be exploited in arbitrary matrix multiplication, providing runtime benefits compared to a baseline naive algorithm at all sparsity levels. This is a notable departure from the plaintext domain, where there is a trade-off between sparsity and the overhead of the sparse multiplication algorithm. In addition, we propose three sparse multiplication schemes in FHE based on common plaintext sparse encodings. We demonstrate the performance gain is scheme-invariant; however, some sparse schemes vastly reduce the memory storage requirements of the encrypted matrix at high sparsity values. Our proposed sparse schemes yield an average performance gain of 2.5x at 50% unstructured sparsity, with our multi-threading scheme providing a 32.5x performance increase over the equivalent single-threaded sparse computation when utilizing 64 cores.
Comment: The paper explores unstructured sparsity in FHE DNNs, which aligns with the model compression criterion. The focus on sparsity and performance gains in encrypted environments is relevant.
Relevance: 8 Novelty: 7
32. Towards Robust Multimodal Representation: A Unified Approach with Adaptive Experts and Alignment
ArXiv ID: 2503.09498
Authors: Nazanin Moradinasab, Saurav Sengupta, Jiebei Liu, Sana Syed, Donald E. Brown
Abstract: Healthcare relies on multiple types of data, such as medical images, genetic information, and clinical records, to improve diagnosis and treatment. However, missing data is a common challenge due to privacy restrictions, cost, and technical issues, making many existing multi-modal models unreliable. To address this, we propose a new multi-model model called Mixture of Experts, Symmetric Aligning, and Reconstruction (MoSARe), a deep learning framework that handles incomplete multimodal data while maintaining high accuracy. MoSARe integrates expert selection, cross-modal attention, and contrastive learning to improve feature representation and decision-making. Our results show that MoSARe outperforms existing models in situations when the data is complete. Furthermore, it provides reliable predictions even when some data are missing. This makes it especially useful in real-world healthcare settings, including resource-limited environments. Our code is publicly available at https://github.com/NazaninMn/MoSARe.
Comment: The paper introduces a Mixture of Experts (MoE) framework for handling incomplete multimodal data, which aligns with the 'Model Architecture' criterion. However, the focus on healthcare applications makes it partially relevant.
Relevance: 7 Novelty: 7
33. Towards Graph Foundation Models: A Transferability Perspective
ArXiv ID: 2503.09363
Authors: Yuxiang Wang, Wenqi Fan, Suhang Wang, Yao Ma
Abstract: In recent years, Graph Foundation Models (GFMs) have gained significant attention for their potential to generalize across diverse graph domains and tasks. Some works focus on Domain-Specific GFMs, which are designed to address a variety of tasks within a specific domain, while others aim to create General-Purpose GFMs that extend the capabilities of domain-specific models to multiple domains. Regardless of the type, transferability is crucial for applying GFMs across different domains and tasks. However, achieving strong transferability is a major challenge due to the structural, feature, and distributional variations in graph data. To date, there has been no systematic research examining and analyzing GFMs from the perspective of transferability. To bridge the gap, we present the first comprehensive taxonomy that categorizes and analyzes existing GFMs through the lens of transferability, structuring GFMs around their application scope (domain-specific vs. general-purpose) and their approaches to knowledge acquisition and transfer. We provide a structured perspective on current progress and identify potential pathways for advancing GFM generalization across diverse graph datasets and tasks. We aims to shed light on the current landscape of GFMs and inspire future research directions in GFM development.
Comment: The paper provides a taxonomy of Graph Foundation Models (GFMs) with a focus on transferability, which aligns with the 'Emerging Trends' criterion due to its foundational perspective on graph models.
Relevance: 7 Novelty: 7
34. The Shape of Attraction in UMAP: Exploring the Embedding Forces in Dimensionality Reduction
ArXiv ID: 2503.09101
Authors: Mohammad Tariqul Islam, Jason W. Fleischer
Abstract: Uniform manifold approximation and projection (UMAP) is among the most popular neighbor embedding methods. The method relies on attractive and repulsive forces among high-dimensional data points to obtain a low-dimensional embedding. In this paper, we analyze the forces to reveal their effects on cluster formations and visualization. Repulsion emphasizes differences, controlling cluster boundaries and inter-cluster distance. Attraction is more subtle, as attractive tension between points can manifest simultaneously as attraction and repulsion in the lower-dimensional mapping. This explains the need for learning rate annealing and motivates the different treatments between attractive and repulsive terms. Moreover, by modifying attraction, we improve the consistency of cluster formation under random initialization. Overall, our analysis makes UMAP and similar embedding methods more interpretable, more robust, and more accurate.
Comment: The paper provides an analysis of UMAP's embedding forces, offering insights into dimensionality reduction methods, which aligns with representation learning.
Relevance: 7 Novelty: 6
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
-
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
-
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
-
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
-
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
-
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
-
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers like using MoE to solve a problem in the real world.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications on existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.