Personalized Daily Arxiv Papers 02/05/2025

	Prompt	Completion	Total
Token	109444	10209	119653
Cost	$2.74	$1.02	$3.76

Total scanned papers: 382

Total relevant papers: 51

Table of contents with paper titles:

Layer by Layer: Uncovering Hidden Representations in Language Models Authors: Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv
Constrained belief updates explain geometric structures in transformer representations Authors: Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam S. Shai
Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers Authors: Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen
Toward Neurosymbolic Program Comprehension Authors: Alejandro Velasco, Aya Garryyeva, David N. Palacio, Antonio Mastropaolo, Denys Poshyvanyk
Choose Your Model Size: Any Compression by a Single Gradient Descent Authors: Martin Genzel, Patrick Putzky, Pengfei Zhao, Sebastian Schulze, Mattes Mollenhauer, Robert Seidel, Stefan Dietzel, Thomas Wollmann
How Memory in Optimization Algorithms Implicitly Modifies the Loss Authors: Matias D. Cattaneo, Boris Shigida
Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques Authors: Nathaniel Tomczak, Sanmukh Kuppannagari
Optimal Spectral Transitions in High-Dimensional Multi-Index Models Authors: Leonardo Defilippis, Yatin Dandi, Pierre Mergny, Florent Krzakala, Bruno Loureiro
Enhancing Generalization via Sharpness-Aware Trajectory Matching for Dataset Condensation Authors: Boyan Gao, Bo Zhao, Shreyank N Gowda, Xingrun Xing, Yibo Yang, Timothy Hospedales, David A. Clifton
When Dimensionality Hurts: The Role of LLM Embedding Compression for Noisy Regression Tasks Authors: Felix Drinkall, Janet B. Pierrehumbert, Stefan Zohren
A Periodic Bayesian Flow for Material Generation Authors: Hanlin Wu, Yuxuan Song, Jingjing Gong, Ziyao Cao, Yawen Ouyang, Jianbing Zhang, Hao Zhou, Wei-Ying Ma, Jingjing Liu
Reasoning Bias of Next Token Prediction Training Authors: Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu
EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization Authors: Yize Wu, Ke Gao, Yanjun Wu
Discovering Chunks in Neural Embeddings for Interpretability Authors: Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata
Multi-level Supervised Contrastive Learning Authors: Naghmeh Ghanooni, Barbod Pajoum, Harshit Rawal, Sophie Fellenz, Vo Nguyen Le Duy, Marius Kloft
BRIDLE: Generalized Self-supervised Learning with Quantization Authors: Hoang M. Nguyen, Satya N. Shukla, Qiang Zhang, Hanchao Yu, Sreya D. Roy, Taipeng Tian, Lingjiong Zhu, Yuchen Liu
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing Authors: Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao
Al-Khwarizmi: Discovering Physical Laws with Foundation Models Authors: Christopher E. Mower, Haitham Bou-Ammar
Do Graph Diffusion Models Accurately Capture and Generate Substructure Distributions? Authors: Xiyuan Wang, Yewei Liu, Lexi Pang, Siwei Chen, Muhan Zhang
ContinuouSP: Generative Model for Crystal Structure Prediction with Invariance and Continuity Authors: Yuji Tone, Masatoshi Hanai, Mitsuaki Kawamura, Kenjiro Taura, Toyotaro Suzumura
Local minima of the empirical risk in high dimension: General theorems and convex examples Authors: Kiana Asgari, Andrea Montanari, Basil Saeed
Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction Authors: Alexei Figueroa, Justus Westerhoff, Atefi Golzar, Dennis Fast, Benjamin Winter, Felix Alexander Gers, Alexander Löser, Wolfgang Nejdl
mPOLICE: Provable Enforcement of Multi-Region Affine Constraints in Deep Neural Networks Authors: Mohammadmehdi Ataei, Hyunmin Cheong, Adrian Butscher
Self-supervised Subgraph Neural Network With Deep Reinforcement Walk Exploration Authors: Jianming Huang, Hiroyuki Kasai
MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving Authors: Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen
Learning the RoPEs: Better 2D and 3D Position Encodings with STRING Authors: Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Avinava Dubey, Ayzaan Wahid, Sumeet Singh, Rene Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey Gritsenko, Matthias Minderer, Dmitry Kalashnikov, Jonathan Tompson, Vikas Sindhwani, Krzysztof Choromanski
Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers Authors: Alireza Amiri, Xinting Huang, Mark Rofin, Michael Hahn
Modular Training of Neural Networks aids Interpretability Authors: Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
Can LLMs Maintain Fundamental Abilities under KV Cache Compression? Authors: Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu
ReMiDi: Reconstruction of Microstructure Using a Differentiable Diffusion MRI Simulator Authors: Prathamesh Pradeep Khole, Zahra Kais Petiwala, Shri Prathaa Magesh, Ehsan Mirafzali, Utkarsh Gupta, Jing-Rebecca Li, Andrada Ianus, Razvan Marinescu
Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge Authors: Daniel Tamayo, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas
A Revisit of Total Correlation in Disentangled Variational Auto-Encoder with Partial Disentanglement Authors: Chengrui Li, Yunmiao Wang, Yule Wang, Weihan Li, Dieter Jaeger, Anqi Wu
BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation Authors: Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
Poisson Hierarchical Indian Buffet Processes for Within and Across Group Sharing of Latent Features-With Indications for Microbiome Species Sampling Models Authors: Lancelot F. James, Juho Lee, Abhinav Pandey
LIBRA: Measuring Bias of Large Language Model from a Local Context Authors: Bo Pang, Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh
Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation Authors: Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki
Activation-Informed Merging of Large Language Models Authors: Amin Heyrani Nobari, Kaveh Alimohammadi, Ali ArjomandBigdeli, Akash Srivastava, Faez Ahmed, Navid Azizan
VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation Authors: Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, Chang Xu
On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach Authors: Edo Cohen-Karlik, Itamar Zimerman, Liane Galanti, Ido Atad, Amir Globerson, Lior Wolf
Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants Authors: Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade
Avoiding spurious sharpness minimization broadens applicability of SAM Authors: Sidak Pal Singh, Hossein Mobahi, Atish Agarwala, Yann Dauphin
T-SCEND: Test-time Scalable MCTS-enhanced Diffusion Model Authors: Tao Zhang, Jia-Shu Pan, Ruiqi Feng, Tailin Wu
Hierarchical Sparse Bayesian Multitask Model with Scalable Inference for Microbiome Analysis Authors: Haonan Zhu, Andre R. Goncalves, Camilo Valdes, Hiranmayi Ranganathan, Boya Zhang, Jose Manuel Martí, Car Reen Kok, Monica K. Borucki, Nisha J. Mulakken, James B. Thissen, Crystal Jaing, Alfred Hero, Nicholas A. Be
Learning Hyperparameters via a Data-Emphasized Variational Objective Authors: Ethan Harvey, Mikhail Petrov, Michael C. Hughes
Transolver++: An Accurate Neural Solver for PDEs on Million-Scale Geometries Authors: Huakun Luo, Haixu Wu, Hang Zhou, Lanxiang Xing, Yichen Di, Jianmin Wang, Mingsheng Long
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models Authors: Tzu-Tao Chang, Shivaram Venkataraman
Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration Authors: Younan Zhu, Linwei Tao, Minjing Dong, Chang Xu
Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA Authors: Shuangyi Chen, Yuanxin Guo, Yue Ju, Harik Dalal, Ashish Khisti
Distributionally Robust Direct Preference Optimization Authors: Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran
Analyzing Similarity Metrics for Data Selection for Language Model Pretraining Authors: Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar

1. Layer by Layer: Uncovering Hidden Representations in Language Models

ArXiv ID: 2502.02013

Authors: Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv

Abstract: From extracting features to generating text, the outputs of large language models (LLMs) typically rely on their final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a wide range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each model layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer's performance. Through extensive experiments on 32 text-embedding tasks and comparisons across model architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features. These findings challenge the standard focus on final-layer embeddings and open new directions for model analysis and optimization, including strategic use of mid-layer representations for more robust and accurate AI systems.

Comment: Author match

2. Constrained belief updates explain geometric structures in transformer representations

ArXiv ID: 2502.01954

Authors: Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam S. Shai

Abstract: What computational structures emerge in transformers trained on next-token prediction? In this work, we provide evidence that transformers implement constrained Bayesian belief updating -- a parallelized version of partial Bayesian inference shaped by architectural constraints. To do this, we integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models that generate rich geometric patterns in neural activations. We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail -- including the attention pattern, OV-vectors, and embedding vectors -- by modifying the equations for optimal future token predictions to account for the architectural constraints of attention. Our approach provides a principled lens on how gradient descent resolves the tension between optimal prediction and architectural design.

Comment: Touches on representation learning by analyzing the geometric structures and constrained Bayesian belief updates in transformer representations, providing foundational insights into how attention heads implement prediction under architectural constraints.

Relevance: 10 Novelty: 9

3. Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers

ArXiv ID: 2502.01770

Authors: Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen

Abstract: Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences. Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference. We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just 1.78% performance losses on GLUE compared to 9.08% in state-of-the-art binarization work, and 2.5% performance losses on ImageNet compared to 12.14%, all while targeting custom hardware with a 79% area reduction and 87% power reduction compared to its standard attention counterpart.

Comment: The paper introduces a novel framework for binarizing keys and queries in transformer attention, focusing on compression and efficiency improvements.

Relevance: 9 Novelty: 9
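
To illustrate the scoring idea in the HAD abstract above, here is a minimal sketch showing that, for vectors with entries in {-1, +1}, the dot product can be recovered from the Hamming distance as d - 2 * hamming. The sign-based `binarize` helper and the example vectors are assumptions made for illustration; the abstract does not specify how the paper binarizes or distills, so this is not the authors' implementation.

```python
def binarize(vec):
    """Map each real entry to +1 or -1 by sign (assumption: zero maps to +1)."""
    return [1 if x >= 0 else -1 for x in vec]


def hamming_distance(a, b):
    """Count the positions where two equal-length vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def dot(a, b):
    """Plain dot product, used here only as the reference value."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors.
q = binarize([0.3, -1.2, 0.7, -0.1])
k = binarize([0.5, 0.4, -0.9, -2.0])
d = len(q)

# Agreeing positions contribute +1 and differing positions contribute -1,
# so the dot product equals (d - hamming) - hamming.
score_via_dot = dot(q, k)
score_via_hamming = d - 2 * hamming_distance(q, k)
```

In hardware, the Hamming distance of bit-packed vectors is typically an XOR followed by a population count, which is where the efficiency argument comes from.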

4. Toward Neurosymbolic Program Comprehension

ArXiv ID: 2502.01806

Authors: Alejandro Velasco, Aya Garryyeva, David N. Palacio, Antonio Mastropaolo, Denys Poshyvanyk

Abstract: Recent advancements in Large Language Models (LLMs) have paved the way for Large Code Models (LCMs), enabling automation in complex software engineering tasks, such as code generation, software testing, and program comprehension, among others. Tools like GitHub Copilot and ChatGPT have shown substantial benefits in supporting developers across various practices. However, the ambition to scale these models to trillion-parameter sizes, exemplified by GPT-4, poses significant challenges that limit the usage of Artificial Intelligence (AI)-based systems powered by large Deep Learning (DL) models. These include rising computational demands for training and deployment and issues related to trustworthiness, bias, and interpretability. Such factors can make managing these models impractical for many organizations, while their "black-box" nature undermines key aspects, including transparency and accountability. In this paper, we question the prevailing assumption that increasing model parameters is always the optimal path forward, provided there is sufficient new data to learn additional patterns. In particular, we advocate for a Neurosymbolic research direction that combines the strengths of existing DL techniques (e.g., LLMs) with traditional symbolic methods--renowned for their reliability, speed, and determinism. To this end, we outline the core features and present preliminary results for our envisioned approach, aimed at establishing the first Neurosymbolic Program Comprehension (NsPC) framework to aid in identifying defective code components.

Comment: The paper advocates for neurosymbolic research blending DL and symbolic methods, introducing an emerging trend challenging the parameter-heavy model paradigm.

Relevance: 9 Novelty: 9

5. Choose Your Model Size: Any Compression by a Single Gradient Descent

ArXiv ID: 2502.01717

Authors: Martin Genzel, Patrick Putzky, Pengfei Zhao, Sebastian Schulze, Mattes Mollenhauer, Robert Seidel, Stefan Dietzel, Thomas Wollmann

Abstract: The adoption of Foundation Models in resource-constrained environments remains challenging due to their large size and inference costs. A promising way to overcome these limitations is post-training compression, which aims to balance reduced model size against performance degradation. This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off from a single stochastic gradient descent run. To ensure parameter efficiency, we use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty. The resulting pruning order gives rise to a global parameter ranking that allows us to materialize models of any target size. Importantly, the compressed models exhibit strong predictive downstream performance without the need for costly fine-tuning. We evaluate ACIP on a large selection of open-weight LLMs and tasks, and demonstrate state-of-the-art results compared to existing factorisation-based compression methods. We also show that ACIP seamlessly complements common quantization-based compression techniques.

Comment: ACIP derives a full compression-performance trade-off from a single gradient descent run, combining an SVD-based low-rank reparametrization with sparsity-inducing pruning, which directly matches the model compression criterion.

Relevance: 10 Novelty: 8
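
The ACIP abstract above says a global ranking of singular values lets one materialize a model of any target size. The sketch below shows that selection step only. The importance scores, layer names, and the per-direction cost of m + n parameters are all hypothetical; in the paper the ranking comes from the pruning order of a penalized gradient descent run, which is not reproduced here.

```python
def select_ranks(layers, budget):
    """Keep the globally best-ranked singular directions that fit the budget.

    layers: dict name -> (m, n, scores), where scores[i] is a hypothetical
            importance of the i-th singular direction of an m x n layer.
    budget: maximum number of parameters to keep.
    Returns (kept, used): kept[name] is the retained rank of each layer.
    """
    candidates = []
    for name, (m, n, scores) in layers.items():
        for score in scores:
            # Assumption: one direction costs one column of U plus one of V.
            candidates.append((score, name, m + n))
    candidates.sort(key=lambda c: c[0], reverse=True)

    kept = {name: 0 for name in layers}
    used = 0
    for score, name, cost in candidates:
        if used + cost > budget:
            break  # keep a strict prefix of the global ranking
        kept[name] += 1
        used += cost
    return kept, used


# Hypothetical two-layer model.
layers = {
    "attn": (4, 4, [3.0, 1.5, 0.2, 0.1]),
    "mlp": (4, 8, [2.5, 0.9, 0.4, 0.05]),
}
kept, used = select_ranks(layers, budget=30)
```

Because the ranking is computed once, changing `budget` yields a different model size without rerunning any optimization, which is the property the abstract emphasizes.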

6. How Memory in Optimization Algorithms Implicitly Modifies the Loss

ArXiv ID: 2502.02132

Authors: Matias D. Cattaneo, Boris Shigida

Abstract: In modern optimization methods used in deep learning, each update depends on the history of previous iterations, often referred to as memory, and this dependence decays fast as the iterates go further into the past. For example, gradient descent with momentum has exponentially decaying memory through exponentially averaged past gradients. We introduce a general technique for identifying a memoryless algorithm that approximates an optimization algorithm with memory. It is obtained by replacing all past iterates in the update by the current one, and then adding a correction term arising from memory (also a function of the current iterate). This correction term can be interpreted as a perturbation of the loss, and the nature of this perturbation can inform how memory implicitly (anti-)regularizes the optimization dynamics. As an application of our theory, we find that Lion does not have the kind of implicit anti-regularization induced by memory that AdamW does, providing a theory-based explanation for Lion's better generalization performance recently documented.

Comment: This work analyzes how memory in optimization algorithms implicitly modifies the loss landscape, providing new insights into optimization dynamics, which aligns strongly with representation learning and training dynamics.

Relevance: 9 Novelty: 9
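
The abstract above describes momentum as exponentially decaying memory over past gradients. As a small worked example, the sketch below checks that the recursive exponential moving average equals an explicit weighted sum in which the gradient from `age` steps ago has weight (1 - beta) * beta**age. The gradient values and beta are made up, and the paper's correction term is not reproduced here.

```python
def ema_recursive(grads, beta):
    """Exponential moving average of gradients, computed step by step."""
    m = 0.0
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    return m


def memory_weights(t, beta):
    """Weight placed on the gradient from `age` steps ago, for age = 0..t-1."""
    return [(1.0 - beta) * beta ** age for age in range(t)]


def ema_unrolled(grads, beta):
    """The same average written as an explicit sum over the history."""
    weights = memory_weights(len(grads), beta)
    # reversed(grads) puts the most recent gradient (age 0) first.
    return sum(w * g for w, g in zip(weights, reversed(grads)))


grads = [1.0, -2.0, 0.5, 3.0]  # hypothetical scalar gradients
beta = 0.9
weights = memory_weights(len(grads), beta)
recursive_value = ema_recursive(grads, beta)
unrolled_value = ema_unrolled(grads, beta)
```

The weights sum to 1 - beta**t, so if every past gradient were replaced by the current one, the average would collapse to (1 - beta**t) times the current gradient. That replacement is the first step of the memoryless approximation the abstract describes.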

7. Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

ArXiv ID: 2502.01659

Authors: Nathaniel Tomczak, Sanmukh Kuppannagari

Abstract: Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate pairwise interactions between individual tokens of sequential data. However, the primary limitation of this operation is its quadratic memory and time complexity in relation to the input's context length - the length of a sequence over which the interactions need to be captured. This significantly limits the length of sequences that can be inferred upon by these models. Extensive research has been conducted to reduce the number of pairwise interactions to sub-quadratic in relation to the context length by introducing sparsity into the attention mechanism through the development of sparse attention masks. However, efficient implementations that achieve "true sparsity" are lacking. In this work, we address this issue by proposing a graph computing view of attention where tokens are perceived as nodes of the graph and the attention mask determines the edges of the graph. Using this view, we develop graph processing algorithms to implement the attention mechanism. Both theoretically and empirically, we demonstrate that our algorithms only perform the needed computations, i.e., they are work optimal. We also perform extensive experimentation using popular attention masks to explore the impact of sparsity on execution time and achievable context length. Our experiments demonstrate significant speedups in execution times compared to state-of-the-art attention implementations such as FlashAttention for large sequence lengths. We also demonstrate that our algorithms are able to achieve extremely long sequence lengths of as high as 160 million on a single NVIDIA A100 GPU (SXM4 80GB).

Comment: Proposes sparse graph processing techniques to increase Transformer context length, aligning well with efficiency breakthroughs and sparsity advancements in transformers.

Relevance: 9 Novelty: 9
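
The graph view in the abstract above (tokens as nodes, attention mask as edges) can be sketched on the CPU as follows. Each token computes a softmax only over its neighbors, so the work is proportional to the number of edges, not to the square of the sequence length. The causal sliding-window mask, function names, and toy inputs are assumptions for illustration; the paper's GPU algorithms are not reproduced.

```python
import math


def causal_window_edges(n_tokens, window):
    """Adjacency list: token i attends to itself and up to `window` earlier tokens."""
    return [list(range(max(0, i - window), i + 1)) for i in range(n_tokens)]


def graph_attention(q, k, v, edges):
    """Scaled dot-product attention evaluated only along the given edges."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, nbrs in enumerate(edges):
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in nbrs]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, nbrs)) / total
            for c in range(len(v[0]))
        ])
    return out


# Hypothetical toy inputs: 4 tokens, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[0.2, 0.1], [0.4, -0.3], [-0.1, 0.6], [0.3, 0.3]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

edges = causal_window_edges(4, window=1)
n_edges = sum(len(nbrs) for nbrs in edges)  # pairs actually computed
out = graph_attention(q, k, v, edges)
```

With a window of 1 this toy case evaluates 7 query-key pairs where dense attention would evaluate 16; the gap widens as the sequence grows while the window stays fixed.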

8. Optimal Spectral Transitions in High-Dimensional Multi-Index Models

ArXiv ID: 2502.02545

Authors: Leonardo Defilippis, Yatin Dandi, Pierre Mergny, Florent Krzakala, Bruno Loureiro

Abstract: We consider the problem of how many samples from a Gaussian multi-index model are required to weakly reconstruct the relevant index subspace. Despite its increasing popularity as a testbed for investigating the computational complexity of neural networks, results beyond the single-index setting remain elusive. In this work, we introduce spectral algorithms based on the linearization of a message passing scheme tailored to this problem. Our main contribution is to show that the proposed methods achieve the optimal reconstruction threshold. Leveraging a high-dimensional characterization of the algorithms, we show that above the critical threshold the leading eigenvector correlates with the relevant index subspace, a phenomenon reminiscent of the Baik-Ben Arous-Peche (BBP) transition in spiked models arising in random matrix theory. Supported by numerical experiments and a rigorous theoretical framework, our work bridges critical gaps in the computational limits of weak learnability in multi-index models.

Comment: Introduces spectral methods for a theoretical problem rooted in high-dimensional reconstruction, closely aligning with Representation Learning and fundamental computational limits.

Relevance: 9 Novelty: 9

9. Enhancing Generalization via Sharpness-Aware Trajectory Matching for Dataset Condensation

ArXiv ID: 2502.01865

Authors: Boyan Gao, Bo Zhao, Shreyank N Gowda, Xingrun Xing, Yibo Yang, Timothy Hospedales, David A. Clifton

Abstract: Dataset condensation aims to synthesize datasets with a few representative samples that can effectively represent the original datasets. This enables efficient training and produces models with performance close to those trained on the original sets. Most existing dataset condensation methods conduct dataset learning under the bilevel (inner- and outer-loop) based optimization. However, the preceding methods perform with limited dataset generalization due to the notoriously complicated loss landscape and expensive time-space complexity of the inner-loop unrolling of bilevel optimization. These issues deteriorate when the datasets are learned via matching the trajectories of networks trained on the real and synthetic datasets with a long horizon inner-loop. To address these issues, we introduce Sharpness-Aware Trajectory Matching (SATM), which enhances the generalization capability of learned synthetic datasets by optimising the sharpness of the loss landscape and objective simultaneously. Moreover, our approach is coupled with an efficient hypergradient approximation that is mathematically well-supported and straightforward to implement along with controllable computational overhead. Empirical evaluations of SATM demonstrate its effectiveness across various applications, including in-domain benchmarks and out-of-domain settings. Moreover, its easy-to-implement properties afford flexibility, allowing it to integrate with other advanced sharpness-aware minimizers. Our code will be released.

Comment: This proposes a novel sharpness-aware trajectory matching method for dataset condensation aligning with fundamental principles of representation learning. The approach shows promise for enhancing generalization.

Relevance: 9 Novelty: 8

10. When Dimensionality Hurts: The Role of LLM Embedding Compression for Noisy Regression Tasks

ArXiv ID: 2502.02199

Authors: Felix Drinkall, Janet B. Pierrehumbert, Stefan Zohren

Abstract: Large language models (LLMs) have shown remarkable success in language modelling due to scaling laws found in model size and the hidden dimension of the model's text representation. Yet, we demonstrate that compressed representations of text can yield better performance in LLM-based regression tasks. In this paper, we compare the relative performance of embedding compression in three different signal-to-noise contexts: financial return prediction, writing quality assessment and review scoring. Our results show that compressing embeddings, in a minimally supervised manner using an autoencoder's hidden representation, can mitigate overfitting and improve performance on noisy tasks, such as financial return prediction; but that compression reduces performance on tasks that have high causal dependencies between the input and target data. Our results suggest that the success of interpretable compressed representations such as sentiment may be due to a regularising effect.

Comment: Explores embedding compression in LLMs via autoencoders, addressing sparsity and efficiency in noisy tasks, which ties into representation learning and compression strategies.

Relevance: 9 Novelty: 8

11. A Periodic Bayesian Flow for Material Generation

ArXiv ID: 2502.02016

Authors: Hanlin Wu, Yuxuan Song, Jingjing Gong, Ziyao Cao, Yawen Ouyang, Jianbing Zhang, Hao Zhou, Wei-Ying Ma, Jingjing Liu

Abstract: Generative modeling of crystal data distribution is an important yet challenging task due to the unique periodic physical symmetry of crystals. Diffusion-based methods have shown early promise in modeling crystal distribution. More recently, Bayesian Flow Networks were introduced to aggregate noisy latent variables, resulting in a variance-reduced parameter space that has been shown to be advantageous for modeling Euclidean data distributions with structural constraints (Song et al., 2023). Inspired by this, we seek to unlock its potential for modeling variables located in non-Euclidean manifolds, e.g., those within crystal structures, by overcoming challenging theoretical issues. We introduce CrysBFN, a novel crystal generation method by proposing a periodic Bayesian flow, which essentially differs from the original Gaussian-based BFN by exhibiting non-monotonic entropy dynamics. To successfully realize the concept of periodic Bayesian flow, CrysBFN integrates a new entropy conditioning mechanism and empirically demonstrates its significance compared to time-conditioning. Extensive experiments over both crystal ab initio generation and crystal structure prediction tasks demonstrate the superiority of CrysBFN, which consistently achieves new state-of-the-art on all benchmarks. Surprisingly, we found that CrysBFN enjoys a significant improvement in sampling efficiency, e.g., a ~100x speedup (10 vs. 2000 network forward steps) compared with previous diffusion-based methods on the MP-20 dataset. Code is available at https://github.com/wu-han-lin/CrysBFN.

Comment: Introduces a periodic Bayesian flow for generative modeling of crystal structures, incorporating foundational elements of generative paradigms in material science.

Relevance: 8 Novelty: 9

12. Reasoning Bias of Next Token Prediction Training

ArXiv ID: 2502.02007

Authors: Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu

Abstract: Since the inception of Large Language Models (LLMs), the quest to efficiently train them for superior reasoning capabilities has been a pivotal challenge. The dominant training paradigm for LLMs is based on next token prediction (NTP). Alternative methodologies, called Critical Token Prediction (CTP), focused exclusively on specific critical tokens (such as the answer in Q&A dataset), aiming to reduce the overfitting of extraneous information and noise. Contrary to initial assumptions, our research reveals that despite NTP's exposure to noise during training, it surpasses CTP in reasoning ability. We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics. Our empirical analysis shows that NTP-trained models exhibit enhanced generalization and robustness across various benchmark reasoning datasets, demonstrating greater resilience to perturbations and achieving flatter loss minima. These findings illuminate that NTP is instrumental in fostering reasoning abilities during pretraining, whereas CTP is more effective for finetuning, thereby enriching our comprehension of optimal training strategies in LLM development.

Comment: This study explores the reasoning biases in next-token prediction training and contrasts it with other methodologies, providing insights into LLM training strategies.

Relevance: 9 Novelty: 8

13. EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

ArXiv ID: 2502.02493

Authors: Yize Wu, Ke Gao, Yanjun Wu

Abstract: Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU systems, inference latency can be further reduced through tensor parallelism (TP), while the optimal TP size of the draft model is typically smaller than that of the base model, leading to GPU idling during the drafting stage. To solve this problem, we propose EasySpec, a layer-parallel speculation strategy that optimizes the efficiency of multi-GPU utilization. EasySpec breaks the sequential execution order of layers in the drafting model, enabling multi-layer parallelization across devices, albeit with some induced approximation errors. After each drafting-and-verification iteration, the draft model's key-value (KV) cache is calibrated in a single forward pass, preventing long-term error accumulation at minimal additional latency. We evaluated EasySpec on several mainstream open-source LLMs, using smaller versions of models from the same series as drafters. The results demonstrate that EasySpec can achieve a peak speedup of 4.17x compared to vanilla decoding, while preserving the original distribution of the base LLMs. Specifically, the drafting stage can be accelerated by up to 1.62x with a maximum accuracy drop of only 7%, requiring no training or fine-tuning on the draft models.

Comment: The paper proposes EasySpec, which includes innovations in speculative decoding and optimizes multi-GPU utilization through layer-parallelism and KV cache calibration. This aligns with the topic of model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
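
The draft-then-verify loop that the EasySpec abstract builds on can be sketched in its simplest greedy form, as below. The two toy "models" are hypothetical deterministic functions from a token sequence to the next token, and verification is written as a sequential loop for clarity (real systems verify all drafted tokens in one base-model pass). EasySpec's layer-parallel drafting and KV cache calibration are not modeled here.

```python
def greedy_decode(model, prompt, n_new):
    """Reference decoding: call the model once per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding: accept drafted tokens while they match."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # Drafting stage: the small model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # Verification stage: keep the target's token at every position and
        # stop at the first disagreement with the draft.
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)
            if expected != tok or len(seq) >= goal:
                break
    return seq


def toy_target(seq):
    """Hypothetical base model over a 7-token vocabulary."""
    return (3 * seq[-1] + 1) % 7


def toy_draft(seq):
    """Hypothetical drafter: agrees with the base model except every third position."""
    guess = toy_target(seq)
    return (guess + 1) % 7 if len(seq) % 3 == 0 else guess


prompt = [2]
reference = greedy_decode(toy_target, prompt, n_new=12)
speculated = speculative_decode(toy_target, toy_draft, prompt, n_new=12, k=4)
```

Because every emitted token is the target's own choice, the output matches plain greedy decoding of the target; the drafter only affects how many tokens each verification round can accept.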

14. Discovering Chunks in Neural Embeddings for Interpretability

ArXiv ID: 2502.01803

Authors: Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata

Abstract: Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.

Comment: Introduces a novel framework for interpreting neural embeddings by identifying 'chunks', contributing to representation learning and interpretability of networks.

Relevance: 9 Novelty: 8

15. Multi-level Supervised Contrastive Learning

ArXiv ID: 2502.02202

Authors: Naghmeh Ghanooni, Barbod Pajoum, Harshit Rawal, Sophie Fellenz, Vo Nguyen Le Duy, Marius Kloft

Abstract: Contrastive learning is a well-established paradigm in representation learning. The standard framework of contrastive learning minimizes the distance between "similar" instances and maximizes the distance between dissimilar ones in the projection space, disregarding the various aspects of similarity that can exist between two samples. Current methods rely on a single projection head, which fails to capture the full complexity of different aspects of a sample, leading to suboptimal performance, especially in scenarios with limited training data. In this paper, we present a novel supervised contrastive learning method in a unified framework called multilevel contrastive learning (MLCL), that can be applied to both multi-label and hierarchical classification tasks. The key strength of the proposed method is the ability to capture similarities between samples across different labels and/or hierarchies using multiple projection heads. Extensive experiments on text and image datasets demonstrate that the proposed approach outperforms state-of-the-art contrastive learning methods.

Comment: The paper introduces a novel supervised contrastive learning method, which is directly aligned with foundational research in representation learning.

Relevance: 9 Novelty: 8

16. BRIDLE: Generalized Self-supervised Learning with Quantization

ArXiv ID: 2502.02118

Authors: Hoang M. Nguyen, Satya N. Shukla, Qiang Zhang, Hanchao Yu, Sreya D. Roy, Taipeng Tian, Lingjiong Zhu, Yuchen Liu

Abstract: Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.

Comment: Proposes a framework combining residual quantization with self-supervised learning, very relevant to Representation Learning and training methodologies.

Relevance: 9 Novelty: 8
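
To make the residual quantization (RQ) idea concrete, here is a minimal sketch of the generic technique: each codebook in a stack encodes the residual left over by the previous levels. It is not BRIDLE's tokenizer or training procedure. The two toy codebooks, the input vector, and the function names are hypothetical.

```python
def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)),
    )


def residual_quantize(v, codebooks):
    """Quantize v level by level; each level encodes the remaining residual."""
    residual = list(v)
    approx = [0.0] * len(v)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Add the chosen code vector to the reconstruction ...
        approx = [a + c for a, c in zip(approx, codebook[idx])]
        # ... and pass what is still unexplained to the next level.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, approx


# Hypothetical two-level hierarchy: a coarse codebook and a fine one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
codes, approx = residual_quantize([1.2, 0.8], codebooks)
print(codes, approx)  # [1, 1] [1.25, 0.75]
```

With a single codebook the input would be mapped to `[1.0, 1.0]`; the second level refines this to `[1.25, 0.75]`, which is the finer-grained discretization the abstract attributes to using multiple hierarchical codebooks.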

17. CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing

ArXiv ID: 2502.01976

Authors: Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao

Abstract: Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel CITER (Collaborative Inference with Token-lEvel Routing) framework that enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut that significantly reduces the cost of reward estimation and improves the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.

Comment: Introduces a token-level collaborative inference framework for LLMs aiming to optimize inference efficiency, aligning well with the Model Compression criteria.

Relevance: 9 Novelty: 8

18. Al-Khwarizmi: Discovering Physical Laws with Foundation Models

ArXiv ID: 2502.01702

Authors: Christopher E. Mower, Haitham Bou-Ammar

Abstract: Inferring physical laws from data is a central challenge in science and engineering, including but not limited to healthcare, physical sciences, biosciences, social sciences, sustainability, climate, and robotics. Deep networks offer high-accuracy results but lack interpretability, prompting interest in models built from simple components. The Sparse Identification of Nonlinear Dynamics (SINDy) method has become the go-to approach for building such modular and interpretable models. SINDy leverages sparse regression with L1 regularization to identify key terms from a library of candidate functions. However, SINDy's choice of candidate library and optimization method requires significant technical expertise, limiting its widespread applicability. This work introduces Al-Khwarizmi, a novel agentic framework for physical law discovery from data, which integrates foundational models with SINDy. Leveraging LLMs, VLMs, and Retrieval-Augmented Generation (RAG), our approach automates physical law discovery, incorporating prior knowledge and iteratively refining candidate solutions via reflection. Al-Khwarizmi operates in two steps: it first summarizes system observations (textual descriptions, raw data, and plots), and then generates candidate feature libraries and optimizer configurations to identify hidden physics laws correctly. Evaluating our algorithm on over 198 models, we demonstrate state-of-the-art performance compared to alternatives, reaching a 20 percent increase against the best-performing alternative.

Comment: The paper introduces Al-Khwarizmi for automated physical law discovery with foundational models, aligning with AI for Science innovations and novel generative paradigms.

Relevance: 8 Novelty: 9
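
For readers unfamiliar with SINDy, the sparse regression mentioned in the abstract is commonly written as below. This is the generic textbook formulation under assumed notation, not the specific configuration Al-Khwarizmi generates: $X$ stacks the observed states over time, $\dot{X}$ their time derivatives, $\Theta(X)$ is the library of candidate functions evaluated on the data, $\Xi$ holds the coefficients, and $\lambda$ controls sparsity. None of these symbols appear in the abstract.

```latex
\dot{X} \approx \Theta(X)\,\Xi,
\qquad
\hat{\Xi} = \arg\min_{\Xi}\;
  \bigl\lVert \dot{X} - \Theta(X)\,\Xi \bigr\rVert_F^2
  + \lambda\,\lVert \Xi \rVert_1
```

The nonzero entries of $\hat{\Xi}$ select which library terms appear in the recovered law, which is why the choice of $\Theta$ and of the optimizer matters so much in practice.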

19. Do Graph Diffusion Models Accurately Capture and Generate Substructure Distributions?

ArXiv ID: 2502.02488

Authors: Xiyuan Wang, Yewei Liu, Lexi Pang, Siwei Chen, Muhan Zhang

Abstract: Diffusion models have gained popularity in graph generation tasks; however, the extent of their expressivity concerning the graph distributions they can learn is not fully understood. Unlike models in other domains, popular backbones for graph diffusion models, such as Graph Transformers, do not possess universal expressivity to accurately model the distribution scores of complex graph data. Our work addresses this limitation by focusing on the frequency of specific substructures as a key characteristic of target graph distributions. When evaluating existing models using this metric, we find that they fail to maintain the distribution of substructure counts observed in the training set when generating new graphs. To address this issue, we establish a theoretical connection between the expressivity of Graph Neural Networks (GNNs) and the overall performance of graph diffusion models, demonstrating that more expressive GNN backbones can better capture complex distribution patterns. By integrating advanced GNNs into the backbone architecture, we achieve significant improvements in substructure generation.

Comment: Investigates expressivity limits in graph diffusion models, which ties into foundational representation learning with implications for architecture analysis.

Relevance: 8 Novelty: 9

20. ContinuouSP: Generative Model for Crystal Structure Prediction with Invariance and Continuity

ArXiv ID: 2502.02026

Authors: Yuji Tone, Masatoshi Hanai, Mitsuaki Kawamura, Kenjiro Taura, Toyotaro Suzumura

Abstract: The discovery of new materials using crystal structure prediction (CSP) based on generative machine learning models has become a significant research topic in recent years. In this paper, we study invariance and continuity in generative machine learning for CSP. We propose a new model, called ContinuouSP, which effectively handles symmetry and periodicity in crystals. We clearly formulate the invariance and the continuity, and construct the model as an energy-based model. Our preliminary evaluation demonstrates the effectiveness of this model on the CSP task.

Comment: Proposes a novel generative model for crystal structure prediction with invariance and continuity, potentially relevant under AI for Science with foundational elements.

Relevance: 8 Novelty: 8

21. Local minima of the empirical risk in high dimension: General theorems and convex examples

ArXiv ID: 2502.01953

Authors: Kiana Asgari, Andrea Montanari, Basil Saeed

Abstract: We consider a general model for high-dimensional empirical risk minimization whereby the data $\mathbf{x}_i$ are $d$-dimensional isotropic Gaussian vectors, the model is parametrized by $\mathbf{\Theta}\in\mathbb{R}^{d\times k}$, and the loss depends on the data via the projection $\mathbf{\Theta}^\mathsf{T}\mathbf{x}_i$. This setting covers as special cases classical statistics methods (e.g. multinomial regression and other generalized linear models), but also two-layer fully connected neural networks with $k$ hidden neurons. We use the Kac-Rice formula from Gaussian process theory to derive a bound on the expected number of local minima of this empirical risk, under the proportional asymptotics in which $n,d\to\infty$, with $n\asymp d$. Via Markov's inequality, this bound allows us to determine the positions of these minimizers (with exponential deviation bounds) and hence derive sharp asymptotics on the estimation and prediction error. In this paper, we apply our characterization to convex losses, where high-dimensional asymptotics were not (in general) rigorously established for $k\ge 2$. We show that our approach is tight and allows us to prove previously conjectured results. In addition, we characterize the spectrum of the Hessian at the minimizer. A companion paper applies our general result to non-convex examples.

Comment: The paper provides insights into the geometry of empirical risk landscapes, particularly for two-layer neural networks. This aligns with foundational training dynamics and high-dimensional learning principles.

Relevance: 8 Novelty: 8

22. Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction

ArXiv ID: 2502.01706

Authors: Alexei Figueroa, Justus Westerhoff, Atefi Golzar, Dennis Fast, Benjamin Winter, Felix Alexander Gers, Alexander Löser, Wolfgang Nejdl

Abstract: Biologically inspired neural networks offer alternative avenues to model data distributions. FlyVec is a recent example that draws inspiration from the fruit fly's olfactory circuit to tackle the task of learning word embeddings. Surprisingly, this model performs competitively even against deep learning approaches specifically designed to encode text, and it does so with the highest degree of computational efficiency. We pose the question of whether this performance can be improved further. For this, we introduce Comply. By incorporating positional information through complex weights, we enable a single-layer neural network to learn sequence representations. Our experiments show that Comply not only supersedes FlyVec but also performs on par with significantly larger state-of-the-art models. We achieve this without additional parameters. Comply yields sparse contextual representations of sentences that can be interpreted explicitly from the neuron weights.

Comment: Biologically inspired neural network for encoding, focusing on sentence representation learning with sparse contextual embeddings.

Relevance: 8 Novelty: 8

23. mPOLICE: Provable Enforcement of Multi-Region Affine Constraints in Deep Neural Networks

ArXiv ID: 2502.02434

Authors: Mohammadmehdi Ataei, Hyunmin Cheong, Adrian Butscher

Abstract: Deep neural networks are increasingly employed in fields such as climate modeling, robotics, and industrial control, where strict output constraints must be upheld. Although prior methods like the POLICE algorithm can enforce affine constraints in a single convex region by adjusting network parameters, they struggle with multiple disjoint regions, often leading to conflicts or unintended affine extensions. We present mPOLICE, a new method that extends POLICE to handle constraints imposed on multiple regions. mPOLICE assigns a distinct activation pattern to each constrained region, preserving exact affine behavior locally while avoiding overreach into other parts of the input domain. We formulate a layer-wise optimization problem that adjusts both the weights and biases to assign unique activation patterns to each convex region, ensuring that constraints are met without conflicts, while maintaining the continuity and smoothness of the learned function. Our experiments show the enforcement of multi-region constraints for multiple scenarios, including regression and classification, function approximation, and non-convex regions through approximation. Notably, mPOLICE adds zero inference overhead and minimal training overhead.

Comment: Introduces mPOLICE to handle multi-region constraints in neural networks, relevant to model architecture and training efficiency.

Relevance: 8 Novelty: 8

24. Self-supervised Subgraph Neural Network With Deep Reinforcement Walk Exploration

ArXiv ID: 2502.01809

Authors: Jianming Huang, Hiroyuki Kasai

Abstract: Graph data, with its structurally variable nature, represents complex real-world phenomena like chemical compounds, protein structures, and social networks. Traditional Graph Neural Networks (GNNs) primarily utilize the message-passing mechanism, but their expressive power is limited and their predictions lack explainability. To address these limitations, researchers have focused on graph substructures. Subgraph neural networks (SGNNs) and GNN explainers have emerged as potential solutions, but each has its limitations. SGNNs compute graph representations based on bags of subgraphs to enhance expressive power. However, they often rely on predefined algorithm-based sampling strategies, which are inefficient. GNN explainers adopt data-driven approaches to generate important subgraphs that provide explanations. Nevertheless, their explanations are difficult to translate into practical improvements on GNNs. To overcome these issues, we propose a novel self-supervised framework that integrates SGNNs with the generation approach of GNN explainers, named the Reinforcement Walk Exploration SGNN (RWE-SGNN). Our approach features a sampling model trained in an explainer fashion, optimizing subgraphs to enhance model performance. To achieve a data-driven sampling approach, unlike traditional subgraph generation approaches, we propose a novel walk exploration process, which efficiently extracts important substructures, simplifying the embedding process and avoiding isomorphism problems. Moreover, we prove that our proposed walk exploration process has equivalent generation capability to the traditional subgraph generation process. Experimental results on various graph datasets validate the effectiveness of our proposed method, demonstrating significant improvements in performance and precision.

Comment: Proposes self-supervised SGNNs which utilize reinforcement for exploring subgraph structures, relevant to representation learning through graph methods.

Relevance: 8 Novelty: 8

25. MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving

ArXiv ID: 2502.01960

Authors: Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen

Abstract: Prevailing serving platforms currently employ context caching to accelerate Multimodal Large Language Model (MLLM) inference. However, this approach merely reuses the Key-Value (KV) cache of the initial sequence of the prompt, resulting in full KV cache recomputation even if the prefix differs slightly. This becomes particularly inefficient in the context of interleaved text and images, as well as multimodal retrieval-augmented generation. This paper proposes position-independent caching as a more effective approach for multimodal information management. We have designed and implemented a caching system, named MPIC, to address both system-level and algorithm-level challenges. MPIC stores the KV cache on local or remote disks when receiving multimodal data, and calculates and loads the KV cache in parallel during inference. To mitigate accuracy degradation, we have incorporated integrated reuse and recompute mechanisms within the system. The experimental results demonstrate that MPIC can achieve up to 54% reduction in response time compared to existing context caching systems, while maintaining negligible or no accuracy loss.

Comment: The paper introduces a position-independent caching system for Multimodal Large Language Model (MLLM) inference, specifically addressing efficiency in KV cache management. This aligns well with the model compression and efficiency domain.

Relevance: 9 Novelty: 7

26. Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

ArXiv ID: 2502.02562

Authors: Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Avinava Dubey, Ayzaan Wahid, Sumeet Singh, Rene Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey Gritsenko, Matthias Minderer, Dmitry Kalashnikov, Jonathan Tompson, Vikas Sindhwani, Krzysztof Choromanski

Abstract: We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods.

Comment: The paper significantly extends Rotary Position Encodings (RoPEs) into the domain of 2D and 3D position encodings using STRING. This aligns directly with foundational model architectures and contributes theoretical advancements.

Relevance: 8 Novelty: 8

27. Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

ArXiv ID: 2502.02393

Authors: Alireza Amiri, Xinting Huang, Mark Rofin, Michael Hahn

Abstract: Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. While theoretical results show that polynomial-length scratchpads can extend transformers' expressivity from $TC^0$ to $PTIME$, their required length remains poorly understood. Empirical evidence suggests that transformers need scratchpads even for many problems in $TC^0$, such as Parity or Multiplication, challenging optimistic bounds derived from circuit complexity. In this work, we initiate the study of systematic lower bounds for the number of CoT steps across different algorithmic problems, in the hard-attention regime. We study a variety of algorithmic problems, and provide bounds that are tight up to logarithmic factors. Overall, these results contribute to an emerging understanding of the power and limitations of chain-of-thought reasoning.

Comment: The work presents theoretical bounds on scratchpad lengths in chain-of-thought reasoning for transformers, providing fundamental insights into LLM training dynamics and architectural limitations.

Relevance: 8 Novelty: 8

28. Modular Training of Neural Networks aids Interpretability

ArXiv ID: 2502.02470

Authors: Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots

Abstract: An approach to improve neural network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We define a measure for clusterability and show that pre-trained models form highly enmeshed clusters via spectral graph clustering. We thus train models to be more modular using a "clusterability loss" function that encourages the formation of non-interacting clusters. Using automated interpretability techniques, we show that our method can help train models that are more modular and learn different, disjoint, and smaller circuits. We investigate CNNs trained on MNIST and CIFAR, small transformers trained on modular addition, and language models. Our approach provides a promising direction for training neural networks that learn simpler functions and are easier to interpret.

Comment: Presents modular training to improve interpretability and simplifies learned functions, aligning with foundational aspects of representation learning.

Relevance: 9 Novelty: 7

29. Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

ArXiv ID: 2502.01804

Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier

Abstract: Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a model size constraint.

Comment: Proposes 'Soup-of-Experts,' a model architecture leveraging expert combinations via parameter averaging, which might be an interesting take on MoE-like approaches.

Relevance: 8 Novelty: 8
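
The instantiation step described in the abstract (a bank of expert parameters, linearly combined with coefficients that depend on the domain weights) can be illustrated as follows. This is a minimal sketch under strong simplifications: the paper learns the coefficient function, whereas here a fixed linear map `mix` stands in for it, and the parameter bank, its sizes, and all names are hypothetical.

```python
def coefficients(domain_weights, mix):
    """Hypothetical stand-in for the learned map from domain weights to
    combination coefficients: a fixed linear map (one row per expert)."""
    return [sum(m * w for m, w in zip(row, domain_weights)) for row in mix]


def instantiate(experts, coeffs):
    """Linearly combine the expert parameter vectors into one model."""
    n_params = len(experts[0])
    return [
        sum(a * expert[j] for a, expert in zip(coeffs, experts))
        for j in range(n_params)
    ]


# Two hypothetical experts, each a flat vector of three parameters.
experts = [
    [1.0, 0.0, 2.0],
    [0.0, 4.0, 2.0],
]
mix = [
    [2.0, 0.0],
    [0.0, 1.0],
]

domain_weights = [0.25, 0.75]          # desired mixture over two domains
coeffs = coefficients(domain_weights, mix)
params = instantiate(experts, coeffs)
print(coeffs, params)  # [0.5, 0.75] [0.5, 3.0, 2.5]
```

Instantiating a specialist for new domain weights costs only one weighted sum over the bank, which is consistent with the abstract's claim of minimal test-time cost without re-training.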

30. Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

ArXiv ID: 2502.01941

Authors: Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu

Abstract: This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent KV cache compression methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals that KV cache compression methods exhibit task-specific performance degradation. Arithmetic reasoning tasks prove particularly sensitive to aggressive compression, with different methods showing performance drops of $17.4\%$-$43.3\%$. Notably, the DeepSeek R1 Distill model exhibits more robust compression tolerance compared to instruction-tuned models, showing only $9.67\%$-$25.53\%$ performance degradation. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves $9\%$-$18\%$ performance improvements on long-context generation tasks under aggressive compression ratios.

Comment: This directly explores the effects of KV cache compression on LLM capabilities, aligning with the topic of model compression and efficiency research.

Relevance: 9 Novelty: 7

31. ReMiDi: Reconstruction of Microstructure Using a Differentiable Diffusion MRI Simulator

ArXiv ID: 2502.01988

Authors: Prathamesh Pradeep Khole, Zahra Kais Petiwala, Shri Prathaa Magesh, Ehsan Mirafzali, Utkarsh Gupta, Jing-Rebecca Li, Andrada Ianus, Razvan Marinescu

Abstract: We propose ReMiDi, a novel method for inferring neuronal microstructure as arbitrary 3D meshes using a differentiable diffusion Magnetic Resonance Imaging (dMRI) simulator. We first implemented in PyTorch a differentiable dMRI simulator that simulates the forward diffusion process using a finite-element method on an input 3D microstructure mesh. To achieve significantly faster simulations, we solve the differential equation semi-analytically using a matrix formalism approach. Given a reference dMRI signal $S_{ref}$, we use the differentiable simulator to iteratively update the input mesh such that it matches $S_{ref}$ using gradient-based learning. Since directly optimizing the 3D coordinates of the vertices is challenging, particularly due to ill-posedness of the inverse problem, we instead optimize a lower-dimensional latent space representation of the mesh. The mesh is first encoded into spectral coefficients, which are further encoded into a latent $\textbf{z}$ using an auto-encoder, and are then decoded back into the true mesh. We present an end-to-end differentiable pipeline that simulates signals that can be tuned to match a reference signal by iteratively updating the latent representation $\textbf{z}$. We demonstrate the ability to reconstruct microstructures of arbitrary shapes represented by finite-element meshes, with a focus on axonal geometries found in the brain white matter, including bending, fanning and beading fibers. Our source code will be made available online.

Comment: The paper focuses on a novel reconstruction method with representation encoding using autoencoders, which aligns with the Representation Learning and Model Architecture criteria.

Relevance: 8 Novelty: 8

32. Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge

ArXiv ID: 2502.02173

Authors: Daniel Tamayo, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas

Abstract: Recent research has explored methods for updating and modifying factual knowledge in large language models, often focusing on specific multi-layer perceptron blocks. This study expands on this work by examining the effectiveness of existing knowledge editing methods across languages and delving into the role of attention mechanisms in this process. Drawing from the insights gained, we propose Mass-Editing Memory with Attention in Transformers (MEMAT), a method that achieves significant improvements in all metrics while requiring minimal parameter modifications. MEMAT delivers a remarkable 10% increase in magnitude metrics, benefits languages not included in the training data and also demonstrates a high degree of portability. Our code and data are at https://github.com/dtamayo-nlp/MEMAT.

Comment: Explores knowledge editing in multilingual LLMs with attention mechanisms, advancing architectural insights in LLMs and their interpretability.

Relevance: 8 Novelty: 8

33. A Revisit of Total Correlation in Disentangled Variational Auto-Encoder with Partial Disentanglement

ArXiv ID: 2502.02279

Authors: Chengrui Li, Yunmiao Wang, Yule Wang, Weihan Li, Dieter Jaeger, Anqi Wu

Abstract: A fully disentangled variational auto-encoder (VAE) aims to identify disentangled latent components from observations. However, enforcing full independence between all latent components may be too strict for certain datasets. In some cases, multiple factors may be entangled together in a non-separable manner, or a single independent semantic meaning could be represented by multiple latent components within a higher-dimensional manifold. To address such scenarios with greater flexibility, we develop the Partially Disentangled VAE (PDisVAE), which generalizes the total correlation (TC) term in fully disentangled VAEs to a partial correlation (PC) term. This framework can handle group-wise independence and can naturally reduce to either the standard VAE or the fully disentangled VAE. Validation through three synthetic experiments demonstrates the correctness and practicality of PDisVAE. When applied to real-world datasets, PDisVAE discovers valuable information that is difficult to find using fully disentangled VAEs, implying its versatility and effectiveness.

Comment: Addresses foundational aspects of representation learning by proposing a partially disentangled VAE through novel extensions like the Partial Correlation term. This directly aligns with insights into how latent variables are structured and encoded, a key topic in representation learning.

Relevance: 9 Novelty: 7

34. BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation

ArXiv ID: 2502.01697

Authors: Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia

Abstract: As the demand for high-quality data in model training grows, researchers and developers are increasingly generating synthetic data to tune and train LLMs. A common assumption about synthetic data is that sampling from instruct-tuned models is sufficient; however, these models struggle to produce diverse outputs, a key requirement for generalization. Despite various prompting methods, in this work we show that achieving meaningful diversity from instruct-tuned models remains challenging. In contrast, we find base models without post-training exhibit greater diversity, but are less capable at instruction following and hence of lower quality. Leveraging this insight, we propose Base-Refine (BARE), a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models through a two-stage process. With minimal few-shot examples and curation, BARE generates diverse and high-quality datasets, improving downstream task performance. We show that fine-tuning with as few as 1,000 BARE-generated samples can reach performance comparable to the best similarly sized models on LiveCodeBench tasks. Furthermore, fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement over SOTA methods on RAFT.

Comment: Presents a method (BARE) to improve synthetic data generation by combining base and post-tuned models, potentially relevant to insights into foundation models and representation learning.

Relevance: 8 Novelty: 7
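
The two-stage process described in the BARE abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `base_generate` and `refine` are hypothetical callables standing in for a base model and an instruction-tuned model, and the prompt layout is an assumption.

```python
from typing import Callable, List


def bare_generate(
    base_generate: Callable[[str], str],   # hypothetical: samples text from a base model
    refine: Callable[[str], str],          # hypothetical: rewrites a draft with an instruct model
    few_shot_examples: List[str],
    n_samples: int,
) -> List[str]:
    """Two-stage sketch: diverse drafts from a base model, then per-draft refinement."""
    # Stage 1: build a few-shot prompt and sample raw drafts from the base model.
    prompt = "\n\n".join(few_shot_examples) + "\n\n"
    drafts = [base_generate(prompt) for _ in range(n_samples)]
    # Stage 2: ask the instruction-tuned model to clean up each draft individually.
    return [refine(draft) for draft in drafts]
```

Refining each draft separately is one way to keep the diversity that comes from the first stage; whether the paper refines drafts individually is not stated in the abstract.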

ArXiv ID: 2502.01919

Authors: Lancelot F. James, Juho Lee, Abhinav Pandey

Abstract: In this work, we present a comprehensive Bayesian posterior analysis of what we term Poisson Hierarchical Indian Buffet Processes, designed for complex random sparse count species sampling models that allow for the sharing of information across and within groups. This analysis covers a potentially infinite number of species and unknown parameters, which, within a Bayesian machine learning context, we are able to learn from as more information is sampled. To achieve our refined results, we employ a range of methodologies drawn from Bayesian latent feature models, random occupancy models, and excursion theory. Despite this complexity, our goal is to make our findings accessible to practitioners, including those who may not be familiar with these areas. To facilitate understanding, we adopt a pseudo-expository style that emphasizes clarity and practical utility. We aim to express our findings in a language that resonates with experts in microbiome and ecological studies, addressing gaps in modeling capabilities while acknowledging that we are not experts ourselves in these fields. This approach encourages the use of our models as basic components of more sophisticated frameworks employed by domain experts, embodying the spirit of the seminal work on the Dirichlet Process. Ultimately, our refined posterior analysis not only yields tractable computational procedures but also enables practical statistical implementation and provides a clear mapping to relevant quantities in microbiome analysis.

Comment: The paper discusses a model for latent feature sharing (relevant to representation learning) with an emphasis on sparse methods and training dynamics. However, its focus on applications like microbiome analysis slightly dilutes its relevance to foundational research.

Relevance: 7 Novelty: 8

36. LIBRA: Measuring Bias of Large Language Model from a Local Context

ArXiv ID: 2502.01679

Authors: Bo Pang, Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh

Abstract: Large Language Models (LLMs) have significantly advanced natural language processing applications, yet their widespread use raises concerns regarding inherent biases that may reduce utility or harm for particular social groups. Despite the advancement in addressing LLM bias, existing research has two major limitations. First, existing LLM bias evaluation focuses on the U.S. cultural context, making it challenging to reveal stereotypical biases of LLMs toward other cultures, leading to unfair development and use of LLMs. Second, current bias evaluation often assumes models are familiar with the target social groups. When LLMs encounter words beyond their knowledge boundaries that are unfamiliar in their training data, they produce irrelevant results in the local context due to hallucinations and overconfidence, which are not necessarily indicative of inherent bias. This research addresses these limitations with a Local Integrated Bias Recognition and Assessment Framework (LIBRA) for measuring bias using datasets sourced from local corpora without crowdsourcing. Implementing this framework, we develop a dataset comprising over 360,000 test cases in the New Zealand context. Furthermore, we propose the Enhanced Idealized CAT Score (EiCAT), integrating the iCAT score with a beyond knowledge boundary score (bbs) and a distribution divergence-based bias measurement to tackle the challenge of LLMs encountering words beyond knowledge boundaries. Our results show that the BERT family, GPT-2, and Llama-3 models seldom understand local words in different contexts. While Llama-3 exhibits larger bias, it responds better to different cultural contexts. The code and dataset are available at: https://github.com/ipangbo/LIBRA.

Comment: The study introduces a framework for measuring biases in LLMs, focusing on local context, revealing new insights into LLM behavior beyond application.

Relevance: 8 Novelty: 7

37. Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation

ArXiv ID: 2502.01694

Authors: Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki

Abstract: A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be utilized to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time compute by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed metastable representation of the reasoning dynamics can be distilled into a smaller, more efficient model.

Comment: Provides theoretical benefits of reasoning paradigms in LLMs, examining metastable dynamics of CoT reasoning, relevant to understanding LLM inference processes.

Relevance: 7 Novelty: 8
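
To make the metastable picture in the abstract above concrete, here is a toy sketch (not the paper's construction). A cluster of "easy" states mixes uniformly, and each step crosses a sparse edge to a target state with a small probability `eps`. The chain layout, the function names, and the choice to model search as a larger `eps` are all assumptions made for illustration.

```python
def two_cluster_chain(n_easy, eps):
    """Build a transition matrix: n_easy well-mixed states plus one absorbing target."""
    target = n_easy
    P = []
    for _ in range(n_easy):
        # Stay inside the easy cluster with prob 1 - eps, cross the sparse edge with prob eps.
        P.append([(1.0 - eps) / n_easy] * n_easy + [eps])
    P.append([0.0] * n_easy + [1.0])  # the target state is absorbing
    return P, target


def expected_hitting_times(P, target, iters=5000):
    """Iterate h[i] = 1 + sum_j P[i][j] * h[j] with h[target] = 0 until (near) convergence."""
    n = len(P)
    h = [0.0] * n
    for _ in range(iters):
        h = [
            0.0 if i == target else 1.0 + sum(P[i][j] * h[j] for j in range(n))
            for i in range(n)
        ]
    return h
```

In this toy chain the expected number of steps to leave the cluster is 1/eps, so raising the probability of the sparse edge from 0.01 to 0.1 cuts the expected hitting time from about 100 steps to about 10.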

38. Activation-Informed Merging of Large Language Models

ArXiv ID: 2502.02421

Authors: Amin Heyrani Nobari, Kaveh Alimohammadi, Ali ArjomandBigdeli, Akash Srivastava, Faez Ahmed, Navid Azizan

Abstract: Model merging, a method that combines the parameters and embeddings of multiple fine-tuned large language models (LLMs), offers a promising approach to enhance model performance across various tasks while maintaining computational efficiency. This paper introduces Activation-Informed Merging (AIM), a technique that integrates the information from the activation space of LLMs into the merging process to improve performance and robustness. AIM is designed as a flexible, complementary solution that is applicable to any existing merging method. It aims to preserve critical weights from the base model, drawing on principles from continual learning (CL) and model compression. Utilizing a task-agnostic calibration set, AIM selectively prioritizes essential weights during merging. We empirically demonstrate that AIM significantly enhances the performance of merged models across multiple benchmarks. Our findings suggest that considering the activation-space information can provide substantial advancements in the model merging strategies for LLMs with up to 40% increase in benchmark performance.

Comment: AIM introduces an activation-informed merging strategy for LLMs and incorporates principles from model compression, aligning well with efficiency and foundational innovation criteria.

Relevance: 8 Novelty: 7

39. VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation

ArXiv ID: 2502.02175

Authors: Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, Chang Xu

Abstract: Vision-Language-Action (VLA) models can process instructions and visual perception to directly generate actions as output in an end-to-end fashion, owing to their strong multi-modal reasoning capabilities. While the performance of VLA models is promising, their computational cost can be substantial. This raises a challenge for applying them to robotics tasks, which require real-time decision-making to respond quickly to environmental changes. Since robotic control involves sequential decision-making, the visual input often exhibits minimal variation between successive steps. A natural idea is to reuse the computational results of unchanged visual tokens from the last step. Motivated by this idea, we propose VLA-Cache, an efficient vision-language-action model. VLA-Cache incorporates a token-selection mechanism that compares the visual input at each step with the input from the previous step, adaptively identifying visual tokens with minimal changes. The computational results for these unchanged tokens are then reused in subsequent steps via KV-cache, thereby significantly improving the efficiency of the VLA-Cache model. Experimental results in both simulation (e.g., the LIBERO benchmark and SIMPLER) and on a real-world robot validate that VLA-Cache can achieve practical acceleration with minimal sacrifice in success rate.

Comment: Proposes KV-cache optimizations for vision-language-action models in robotic manipulation, relevant to model compression and efficiency.

Relevance: 8 Novelty: 7
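
The token-selection idea in the VLA-Cache abstract above (compare each visual token with its counterpart from the previous step and reuse the ones that barely changed) can be sketched as below. The abstract does not give the actual criterion, so the cosine-similarity test on raw patch vectors, the `threshold` value, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_reusable_tokens(prev_patches, curr_patches, threshold=0.99):
    """Return indices of patches that are nearly unchanged since the previous step."""
    return [
        i
        for i, (prev, curr) in enumerate(zip(prev_patches, curr_patches))
        if cosine(prev, curr) >= threshold
    ]
```

The returned indices are the candidates whose cached keys and values could be reused; every other token would be recomputed as usual.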

40. On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach

ArXiv ID: 2502.02209

Authors: Edo Cohen-Karlik, Itamar Zimerman, Liane Galanti, Ido Atad, Amir Globerson, Lior Wolf

Abstract: Recent advances in efficient sequence modeling have introduced selective state-space layers, a key component of the Mamba architecture, which have demonstrated remarkable success in a wide range of NLP and vision tasks. While Mamba's empirical performance has matched or surpassed SoTA transformers on such diverse benchmarks, the theoretical foundations underlying its powerful representational capabilities remain less explored. In this work, we investigate the expressivity of selective state-space layers using multivariate polynomials, and prove that they surpass linear transformers in expressiveness. Consequently, our findings reveal that Mamba offers superior representational power over linear attention-based models for long sequences, while not sacrificing their generalization. Our theoretical insights are validated by a comprehensive set of empirical experiments on various datasets.

Comment: The paper analyzes selective state-space layers, contributing theoretical insights into efficient sequence modeling, which has relevance to foundational architectural research.

Relevance: 8 Novelty: 7

41. Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

ArXiv ID: 2502.02431

Authors: Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade

Abstract: Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS and Lion which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150m language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available on the repository: https://github.com/DepenM/Simplified-AdEMAMix/.

Comment: The paper establishes connections between advanced optimizers and SGD variants, which contributes novel theoretical insights into training dynamics.

Relevance: 7 Novelty: 8
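
The phrase "decoupling the momentum coefficient from the current gradient's weight" in the abstract above can be illustrated with a generic momentum step. This is not the paper's Simplified-AdEMAMix update. It is a plain-SGD sketch in which `grad_weight` (a hypothetical parameter name) sets how much the current gradient contributes, independently of the momentum coefficient `beta`.

```python
def momentum_step(theta, m, grad, lr, beta, grad_weight):
    """One SGD step whose current-gradient weight is chosen independently of beta."""
    # Exponential moving average of gradients, controlled only by beta.
    m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, grad)]
    # The update mixes the average with an extra, separately weighted gradient term.
    theta = [ti - lr * (mi + grad_weight * gi) for ti, mi, gi in zip(theta, m, grad)]
    return theta, m
```

With `grad_weight = 0` this reduces to ordinary EMA momentum, where the current gradient's weight is tied to `1 - beta`. A positive `grad_weight` lets `beta` sit close to 1 without muting the newest gradient.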

42. Avoiding spurious sharpness minimization broadens applicability of SAM

ArXiv ID: 2502.02407

Authors: Sidak Pal Singh, Hossein Mobahi, Atish Agarwala, Yann Dauphin

Abstract: Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance -- even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics -- instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional-SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).

Comment: Proposes Functional-SAM, which refines sharpness minimization to improve applicability across NLP and LLM domains. This contributes to training dynamics and generalization in large models, aligning with insights into optimization techniques for foundational models.

Relevance: 8 Novelty: 7
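
For readers unfamiliar with the baseline that the abstract above builds on, here is a minimal sketch of a standard SAM step: perturb the weights along the normalized gradient, then descend using the gradient taken at the perturbed point. It does not implement Functional-SAM or the preconditioned variant. `grad_fn` is a hypothetical callable that returns the loss gradient at a given weight vector.

```python
import math


def sam_step(w, grad_fn, lr, rho):
    """One vanilla SAM update on a list of floats."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point; nothing to do
    # Ascent step: move a distance rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient measured at the perturbed weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

On a quadratic loss such as 0.5 * (10 * w0**2 + w1**2), the perturbed gradient is larger along the sharp direction w0, so the SAM step moves further along that direction than a plain gradient step with the same learning rate.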

43. T-SCEND: Test-time Scalable MCTS-enhanced Diffusion Model

ArXiv ID: 2502.01989

Authors: Tao Zhang, Jia-Shu Pan, Ruiqi Feng, Tailin Wu

Abstract: We introduce Test-time Scalable MCTS-enhanced Diffusion Model (T-SCEND), a novel framework that significantly improves diffusion model's reasoning capabilities with better energy-based training and scaling up test-time computation. We first show that naïvely scaling up inference budget for diffusion models yields marginal gain. To address this, the training of T-SCEND consists of a novel linear-regression negative contrastive learning objective to improve the performance-energy consistency of the energy landscape, and a KL regularization to reduce adversarial sampling. During inference, T-SCEND integrates the denoising process with a novel hybrid Monte Carlo Tree Search (hMCTS), which sequentially performs best-of-N random search and MCTS as denoising proceeds. On challenging reasoning tasks of Maze and Sudoku, we demonstrate the effectiveness of T-SCEND's training objective and scalable inference method. In particular, trained with Maze sizes of up to 6×6, our T-SCEND solves 88% of Maze problems with much larger sizes of 15×15, while standard diffusion completely fails. Code to reproduce the experiments can be found at https://github.com/AI4Science-WestlakeU/t_scend.

Comment: The T-SCEND framework targets reasoning tasks using enhanced diffusion models with better energy-based training. It offers methodological innovations related to tuning and optimization processes for complex tasks.

Relevance: 7 Novelty: 7

44. Hierarchical Sparse Bayesian Multitask Model with Scalable Inference for Microbiome Analysis

ArXiv ID: 2502.02552

Authors: Haonan Zhu, Andre R. Goncalves, Camilo Valdes, Hiranmayi Ranganathan, Boya Zhang, Jose Manuel Martí, Car Reen Kok, Monica K. Borucki, Nisha J. Mulakken, James B. Thissen, Crystal Jaing, Alfred Hero, Nicholas A. Be

Abstract: This paper proposes a hierarchical Bayesian multitask learning model that is applicable to the general multi-task binary classification learning problem where the model assumes a shared sparsity structure across different tasks. We derive a computationally efficient inference algorithm based on variational inference to approximate the posterior distribution. We demonstrate the potential of the new approach on various synthetic datasets and for predicting human health status based on microbiome profile. Our analysis incorporates data pooled from multiple microbiome studies, along with a comprehensive comparison with other benchmark methods. Results in synthetic datasets show that the proposed approach has superior support recovery property when the underlying regression coefficients share a common sparsity structure across different tasks. Our experiments on microbiome classification demonstrate the utility of the method in extracting informative taxa while providing well-calibrated predictions with uncertainty quantification and achieving competitive performance in terms of prediction metrics. Notably, despite the heterogeneity of the pooled datasets (e.g., different experimental objectives, laboratory setups, sequencing equipment, patient demographics), our method delivers robust results.

Comment: The paper focuses on a hierarchical Bayesian multitask learning model and sparsity, offering insights into representation learning, specifically shared sparsity structures across tasks.

Relevance: 8 Novelty: 6

45. Learning Hyperparameters via a Data-Emphasized Variational Objective

ArXiv ID: 2502.01861

Authors: Ethan Harvey, Mikhail Petrov, Michael C. Hughes

Abstract: When training large flexible models, practitioners often rely on grid search to select hyperparameters that control over-fitting. This grid search has several disadvantages: the search is computationally expensive, requires carving out a validation set that reduces the available data for training, and requires users to specify candidate values. In this paper, we propose an alternative: directly learning regularization hyperparameters on the full training set via the evidence lower bound ("ELBo") objective from variational methods. For deep neural networks with millions of parameters, we recommend a modified ELBo that upweights the influence of the data likelihood relative to the prior. Our proposed technique overcomes all three disadvantages of grid search. In a case study on transfer learning of image classifiers, we show how our method reduces the 88+ hour grid search of past work to under 3 hours while delivering comparable accuracy. We further demonstrate how our approach enables efficient yet accurate approximations of Gaussian processes with learnable length-scale kernels.

Comment: The paper proposes learning hyperparameters via a variational objective, touching on theoretical insights in model training dynamics which aligns with representation learning.

Relevance: 7 Novelty: 7
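
The modified objective mentioned in the abstract above can be written schematically as follows. The notation is an assumption made for illustration: weights w, data y, a variational posterior q(w), a prior p_η(w) whose hyperparameters η are learned, and a scalar κ that upweights the likelihood. The abstract does not say how κ is chosen, so it is left as a free constant here.

```latex
\begin{align}
\mathcal{L}_{\text{ELBo}}(q, \eta)
  &= \mathbb{E}_{q(w)}\!\left[\log p(y \mid w)\right]
   - \mathrm{KL}\!\left(q(w) \,\|\, p_{\eta}(w)\right), \\
\mathcal{L}_{\kappa}(q, \eta)
  &= \kappa \, \mathbb{E}_{q(w)}\!\left[\log p(y \mid w)\right]
   - \mathrm{KL}\!\left(q(w) \,\|\, p_{\eta}(w)\right),
  \qquad \kappa > 1 .
\end{align}
```

Setting κ = 1 recovers the standard ELBo. A larger κ gives the data term more influence relative to the KL term, which is the reweighting the abstract recommends for networks with millions of parameters.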

46. Transolver++: An Accurate Neural Solver for PDEs on Million-Scale Geometries

ArXiv ID: 2502.02414

Authors: Huakun Luo, Haixu Wu, Hang Zhou, Lanxiang Xing, Yichen Di, Jianmin Wang, Mingsheng Long

Abstract: Although deep models have been widely explored in solving partial differential equations (PDEs), previous works are primarily limited to data only with up to tens of thousands of mesh points, far from the million-point scale required by industrial simulations that involve complex geometries. In the spirit of advancing neural PDE solvers to real industrial applications, we present Transolver++, a highly parallel and efficient neural solver that can accurately solve PDEs on million-scale geometries. Building upon previous advancements in solving PDEs by learning physical states via Transolver, Transolver++ is further equipped with an extremely optimized parallelism framework and a local adaptive mechanism to efficiently capture eidetic physical states from massive mesh points, successfully tackling the thorny challenges in computation and physics learning when scaling up input mesh size. Transolver++ increases the single-GPU input capacity to million-scale points for the first time and is capable of continuously scaling input size in linear complexity by increasing GPUs. Experimentally, Transolver++ yields 13% relative promotion across six standard PDE benchmarks and achieves over 20% performance gain in million-scale high-fidelity industrial simulations, whose sizes are 100× larger than previous benchmarks, covering car and 3D aircraft designs.

Comment: Transolver++ enhances PDE solving with scalable architectures, touching upon efficient model architecture and parallelism.

Relevance: 7 Novelty: 7

47. LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models

ArXiv ID: 2502.02406

Authors: Tzu-Tao Chang, Shivaram Venkataraman

Abstract: Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique enabling support for longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 5.58× end-to-end speedup compared to existing approaches.

Comment: LV-XAttn introduces cross-attention mechanisms for LLMs to handle long visual inputs efficiently, focusing on distributed attention and model architecture.

Relevance: 7 Novelty: 7

48. Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration

ArXiv ID: 2502.01969

Authors: Younan Zhu, Linwei Tao, Minjing Dong, Chang Xu

Abstract: Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning capabilities but remain highly susceptible to object hallucination, where models generate responses that are not factually aligned with the visual content. Recent works attribute this issue to an inherent bias of LVLMs where the vision token attention map has a fixed correlation with spatial position, and propose to mitigate this issue by reordering visual tokens. However, we find that different LVLMs exhibit different correlations between attention and spatial position, which makes the existing solution difficult to generalize to other LVLMs. To address this issue, we first introduce a training-free solution, Uniform Attention Calibration (UAC), that estimates the bias from a single meaningless input image and applies a calibration matrix to rectify attention imbalances. To further alleviate the bias, we relax the assumption of a single meaningless input in UAC and introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), that enforces consistent outputs wherever the object is located in the image via a plug-and-play module. Comprehensive experiments across multiple benchmarks demonstrate that UAC and DAC significantly reduce object hallucination while improving general multimodal alignment. Our methods achieve state-of-the-art performance across diverse LVLM architectures on various metrics.

Comment: Addresses object hallucination in vision-language models using attention calibration techniques relevant to the interpretability of LLMs but does not explore foundational architectural transformations.

Relevance: 7 Novelty: 6

49. Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA

ArXiv ID: 2502.01755

Authors: Shuangyi Chen, Yuanxin Guo, Yue Ju, Harik Dalal, Ashish Khisti

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) optimize federated training by reducing computational and communication costs. We propose RoLoRA, a federated framework using alternating optimization to fine-tune LoRA adapters. Our approach emphasizes the importance of learning up and down projection matrices to enhance expressiveness and robustness. We use both theoretical analysis and extensive experiments to demonstrate the advantages of RoLoRA over prior approaches that either generate imperfect model updates or limit expressiveness of the model. We present theoretical analysis on a simplified linear model to demonstrate the importance of learning both down-projection and up-projection matrices in LoRA. We provide extensive experimental evaluations on a toy neural network on MNIST as well as large language models including RoBERTa-Large, Llama-2-7B on diverse tasks to demonstrate the advantages of RoLoRA over other methods.

Comment: Focuses on LoRA, a parameter-efficient fine-tuning method, which aligns with model compression techniques like low-rank adaptation. However, it is applied in a federated learning setup, slightly diluting relevance.

Relevance: 7 Novelty: 6
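
The alternating optimization named in the RoLoRA abstract above can be illustrated on a toy rank-1 problem: fit a target matrix with an outer product of an up-projection vector `b` and a down-projection vector `a`, updating one factor per round while the other stays frozen. This is a single-machine sketch with hypothetical names. It leaves out the federated aggregation and is not the authors' algorithm.

```python
def residual(b, a, target):
    """Entry-wise difference between the rank-1 product b a^T and the target matrix."""
    return [[b[i] * a[j] - target[i][j] for j in range(len(a))] for i in range(len(b))]


def loss(b, a, target):
    """Half the squared Frobenius norm of the residual."""
    return 0.5 * sum(r * r for row in residual(b, a, target) for r in row)


def alternating_fit(b, a, target, lr=0.02, rounds=200):
    """Gradient steps that alternate between the two low-rank factors."""
    b, a = list(b), list(a)
    for t in range(rounds):
        r = residual(b, a, target)
        if t % 2 == 0:
            # Even rounds: update the up-projection b while a stays frozen.
            grad_b = [sum(r[i][j] * a[j] for j in range(len(a))) for i in range(len(b))]
            b = [bi - lr * gi for bi, gi in zip(b, grad_b)]
        else:
            # Odd rounds: update the down-projection a while b stays frozen.
            grad_a = [sum(r[i][j] * b[i] for i in range(len(b))) for j in range(len(a))]
            a = [aj - lr * gj for aj, gj in zip(a, grad_a)]
    return b, a
```

One property that makes freezing a factor attractive in a federated setting: if every client shares the same `a`, averaging the clients' `b` vectors is the same as averaging their products b a^T, which does not hold when both factors differ across clients.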

50. Distributionally Robust Direct Preference Optimization

ArXiv ID: 2502.01930

Authors: Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran

Abstract: A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.

Comment: Addresses alignment of LLMs with human preferences under distribution shifts, which is tangentially related but not specifically foundational to LLM architectural or theoretical breakthroughs.

Relevance: 7 Novelty: 6
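
As background for the abstract above, the standard DPO per-example loss and a generic distributionally robust objective built on it can be written as below. The second line is the usual DRO template, not the paper's exact WDPO or KLDPO formulation. The empirical preference distribution \hat{P}, the divergence D (Wasserstein or KL), and the radius ρ are notation assumed here.

```latex
\begin{align}
\ell_{\theta}(x, y_w, y_l)
  &= -\log \sigma\!\left(
       \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
     - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
     \right), \\
\min_{\theta} \; \sup_{Q \,:\, D(Q, \hat{P}) \le \rho} \;
  & \mathbb{E}_{(x, y_w, y_l) \sim Q}\!\left[\ell_{\theta}(x, y_w, y_l)\right].
\end{align}
```

With ρ = 0 the inner supremum collapses to the empirical distribution and the objective reduces to ordinary DPO. A larger ρ hedges against preference distributions further from the training data.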

51. Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

ArXiv ID: 2502.02494

Authors: Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar

Abstract: Similarity between training examples is used to curate pretraining datasets for language models by many methods -- for diversification and to select examples similar to high-quality data. However, similarity is typically measured with off-the-shelf embedding models that are generic or trained for tasks such as retrieval. This paper introduces a framework to analyze the suitability of embedding models specifically for data curation in the language model pretraining setting. We quantify the correlation between similarity in the embedding space to similarity in pretraining loss between different training examples, and how diversifying in the embedding space affects pretraining quality. We analyze a variety of embedding models in our framework, with experiments using the Pile dataset for pretraining a 1.7B parameter decoder-only language model. We find that the embedding models we consider are all useful for pretraining data curation. Moreover, a simple approach of averaging per-token embeddings proves to be surprisingly competitive with more sophisticated embedding models -- likely because the latter are not designed specifically for pretraining data curation. Indeed, we believe our analysis and evaluation framework can serve as a foundation for the design of embedding models that specifically reason about similarity in pretraining datasets.

Comment: The paper analyzes similarity metrics for data selection in LLM pretraining, examining embedding models and their impact. While not proposing architectural changes, it adds theoretical insights into data curation for large-scale models, subtly aligning with foundational LLM research.

Relevance: 7 Novelty: 6
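
The "simple approach of averaging per-token embeddings" mentioned in the abstract above can be sketched as below: mean-pool the token vectors of each document, then compare documents by cosine similarity. The function names and the use of cosine similarity are assumptions. The abstract does not specify which token embeddings or similarity measure the authors use.

```python
import math


def mean_pool(token_embeddings):
    """Average a non-empty list of equal-length token vectors into one document vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A curation pipeline could then use these similarities either to drop near-duplicates (diversification) or to keep examples close to a trusted high-quality set, the two uses named in the abstract.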

Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Use the following relevance criteria to focus on foundational research, avoiding purely application-driven work:

Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
Large Language Models (LLMs) - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords for Relevant Domains: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.

Hints on Irrelevant Domains: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.

Hints on Application Tasks: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Relevance 9-10 (Completely Relevant)
Focus: Fully aligned with core topics with no deviation; score highest if the title contains topic keywords.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.
Relevance 5-6 (Borderline)
Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work that references MoE but is centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
Focus: Purely unrelated to our topics. Completely a different domain.
Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

Novelty 9-10 (Breakthrough)
Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with unclear motivation; purely engineering work such as a new benchmark or dataset.
Novelty 1-2 (Low)
Definition: Minimal originality, applying standard approaches without real innovation.
Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
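
The two rubrics above pair score ranges with named bands. As a minimal sketch of that mapping, the lookup below copies the band labels from the rubric text; the names `RELEVANCE_BANDS`, `NOVELTY_BANDS`, and `band_label` are hypothetical helpers introduced here for illustration and are not part of the original prompt or pipeline.

```python
# Hypothetical helper: band labels are copied from the rubrics above,
# but the table and function names are illustrative assumptions.
RELEVANCE_BANDS = {
    (9, 10): "Completely Relevant",
    (7, 8): "Relevant",
    (5, 6): "Borderline",
    (3, 4): "Irrelevant",
    (1, 2): "Ignore",
}

NOVELTY_BANDS = {
    (9, 10): "Breakthrough",
    (7, 8): "Improvements",
    (5, 6): "Borderline",
    (3, 4): "Tangential",
    (1, 2): "Low",
}


def band_label(score: int, bands: dict) -> str:
    """Return the band name whose inclusive range contains the score."""
    for (low, high), label in bands.items():
        if low <= score <= high:
            return label
    raise ValueError(f"score must be an integer from 1 to 10, got {score!r}")


# The two axes are looked up independently of each other.
print(band_label(8, RELEVANCE_BANDS), "/", band_label(4, NOVELTY_BANDS))
```

Keeping one table per axis mirrors the instruction that relevance and novelty are scored separately: a paper can sit in a high relevance band and a low novelty band at the same time.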

Papers

[PAPER LIST HERE]

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

ARXIVID: should be the ArXiv ID.
COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
RELEVANCE: should be a score from 1-10.
NOVELTY: should be a score from 1-10.
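
To make the response format above concrete, here is a minimal sketch of how one JSONL line per paper could be parsed and checked. It assumes the scores are emitted as JSON integers, which the instructions do not state explicitly. The function name `parse_scored_papers` and the sample record (including its placeholder ArXiv ID and comment) are hypothetical and not taken from the original pipeline.

```python
import json

# Field names come from the instructions above.
REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_scored_papers(jsonl_text: str) -> list:
    """Parse a JSONL response and validate each record (hypothetical helper)."""
    records = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        for axis in ("RELEVANCE", "NOVELTY"):
            score = record[axis]
            # Assumption: scores are integers in the inclusive range 1-10.
            is_int = isinstance(score, int) and not isinstance(score, bool)
            if not is_int or not 1 <= score <= 10:
                raise ValueError(f"line {line_number}: {axis} must be an integer 1-10")
        records.append(record)
    return records


# Placeholder record for illustration only; the ID and comment are made up.
example = (
    '{"ARXIVID": "0000.00000", '
    '"COMMENT": "Matches the Model Compression criterion (quantization).", '
    '"RELEVANCE": 8, "NOVELTY": 6}'
)
for paper in parse_scored_papers(example):
    print(paper["ARXIVID"], paper["RELEVANCE"], paper["NOVELTY"])
```

Validating each line separately keeps one malformed record from silently corrupting the rest of the batch, which is the main reason to prefer JSONL over a single JSON array here.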

Personalized Daily Arxiv Papers 02/05/2025

1. Layer by Layer: Uncovering Hidden Representations in Language Models

2. Constrained belief updates explain geometric structures in transformer representations

3. Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers

4. Toward Neurosymbolic Program Comprehension

5. Choose Your Model Size: Any Compression by a Single Gradient Descent

6. How Memory in Optimization Algorithms Implicitly Modifies the Loss

7. Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

8. Optimal Spectral Transitions in High-Dimensional Multi-Index Models

9. Enhancing Generalization via Sharpness-Aware Trajectory Matching for Dataset Condensation

10. When Dimensionality Hurts: The Role of LLM Embedding Compression for Noisy Regression Tasks

11. A Periodic Bayesian Flow for Material Generation

12. Reasoning Bias of Next Token Prediction Training

13. EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

14. Discovering Chunks in Neural Embeddings for Interpretability

15. Multi-level Supervised Contrastive Learning

16. BRIDLE: Generalized Self-supervised Learning with Quantization

17. CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing

18. Al-Khwarizmi: Discovering Physical Laws with Foundation Models

19. Do Graph Diffusion Models Accurately Capture and Generate Substructure Distributions?

20. ContinuouSP: Generative Model for Crystal Structure Prediction with Invariance and Continuity

21. Local minima of the empirical risk in high dimension: General theorems and convex examples

22. Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction

23. mPOLICE: Provable Enforcement of Multi-Region Affine Constraints in Deep Neural Networks

24. Self-supervised Subgraph Neural Network With Deep Reinforcement Walk Exploration

25. MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving

26. Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

27. Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

28. Modular Training of Neural Networks aids Interpretability

29. Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

30. Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

31. ReMiDi: Reconstruction of Microstructure Using a Differentiable Diffusion MRI Simulator

32. Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge

33. A Revisit of Total Correlation in Disentangled Variational Auto-Encoder with Partial Disentanglement

34. BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation

35. Poisson Hierarchical Indian Buffet Processes for Within and Across Group Sharing of Latent Features-With Indications for Microbiome Species Sampling Models

36. LIBRA: Measuring Bias of Large Language Model from a Local Context

37. Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation

38. Activation-Informed Merging of Large Language Models

39. VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation

40. On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach

41. Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

42. Avoiding spurious sharpness minimization broadens applicability of SAM

43. T-SCEND: Test-time Scalable MCTS-enhanced Diffusion Model

44. Hierarchical Sparse Bayesian Multitask Model with Scalable Inference for Microbiome Analysis

45. Learning Hyperparameters via a Data-Emphasized Variational Objective

46. Transolver++: An Accurate Neural Solver for PDEs on Million-Scale Geometries

47. LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models

48. Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration

49. Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA

50. Distributionally Robust Direct Preference Optimization

51. Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
