Personalized Daily Arxiv Papers 02/12/2025

	Prompt	Completion	Total
Token	105980	8781	114761
Cost	$0.26	$0.09	$0.35

Total scanned papers: 472

Total relevant papers: 35

Table of contents with paper titles:

Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models Authors: Siddarth Venkatraman, Mohsin Hasan, Minsu Kim, Luca Scimeca, Marcin Sendera, Yoshua Bengio, Glen Berseth, Nikolay Malkin
Monte Carlo Tree Diffusion for System 2 Planning Authors: Jaesik Yoon, Hyeonseo Cho, Doojin Baek, Yoshua Bengio, Sungjin Ahn
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving Authors: Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, Chi Jin
MoENAS: Mixture-of-Expert based Neural Architecture Search for jointly Accurate, Fair, and Robust Edge Deep Neural Networks Authors: Lotfi Abdelkrim Mecharbat, Alberto Marchisio, Muhammad Shafique, Mohammad M. Ghassemi, Tuka Alhanai
Revisiting Non-Acyclic GFlowNets in Discrete Environments Authors: Nikita Morozov, Ian Maksimov, Daniil Tiapkin, Sergey Samsonov
Global Universal Scaling and Ultra-Small Parameterization in Machine Learning Interatomic Potentials with Super-Linearity Authors: Yanxiao Hu, Ye Sheng, Jing Huang, Xiaoxin Xu, Yuyan Yang, Mingqiang Zhang, Yabei Wu, Caichao Ye, Jiong Yang, Wenqing Zhang
Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline Authors: Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension Authors: Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models Authors: Yiming Chen, Yuan Zhang, Yin Liu, Kun Yuan, Zaiwen Wen
LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation Authors: Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen
Model Fusion via Neuron Transplantation Authors: Muhammed Öz, Nicholas Kiefer, Charlotte Debus, Jasmin Hörter, Achim Streit, Markus Götz
Harnessing Language's Fractal Geometry with Recursive Inference Scaling Authors: Ibrahim Alabdulmohsin, Xiaohua Zhai
Online Scheduling for LLM Inference with KV Cache Constraints Authors: Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou
Enabling Autoregressive Models to Fill In Masked Tokens Authors: Daniel Israel, Aditya Grover, Guy Van den Broeck
Exploring Model Invariance with Discrete Search for Ultra-Low-Bit Quantization Authors: Yuqiao Wen, Yanshuai Cao, Lili Mou
Private Low-Rank Approximation for Covariance Matrices, Dyson Brownian Motion, and Eigenvalue-Gap Bounds for Gaussian Perturbations Authors: Oren Mangoubi, Nisheeth K. Vishnoi
HRP: High-Rank Preheating for Superior LoRA Initialization Authors: Yuzhu Chen, Yingjie Wang, Shi Fu, Li Shen, Yongcheng Jing, Xinmei Tian, Dacheng Tao
When More is Less: Understanding Chain-of-Thought Length in LLMs Authors: Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, Yisen Wang
Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents Authors: Mathis Pink, Qinyuan Wu, Vy Ai Vo, Javier Turek, Jianing Mu, Alexander Huth, Mariya Toneva
Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure Authors: Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, Yiqin Gao
Negative Dependence as a toolbox for machine learning : review and new developments Authors: Hoang-Son Tran, Vladimir Petrovic, Remi Bardenet, Subhroshekhar Ghosh
MAGELLAN: Metacognitive predictions of learning progress guide autotelic LLM agents in large goal spaces Authors: Loris Gaven, Thomas Carta, Clément Romac, Cédric Colas, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer
Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification Authors: Zicheng Liu, Siyuan Li, Zhiyuan Chen, Lei Xin, Fang Wu, Chang Yu, Qirong Yang, Yucheng Guo, Yujie Yang, Stan Z. Li
Understanding the Generalization Error of Markov algorithms through Poissonization Authors: Benjamin Dupuis, Maxime Haddouche, George Deligiannidis, Umut Simsekli
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid Authors: Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification Authors: Lin Zhang, Wenshuo Dong, Zhuoran Zhang, Shu Yang, Lijie Hu, Ninghao Liu, Pan Zhou, Di Wang
Quantification of model error for inverse problems in the Weak Neural Variational Inference framework Authors: Vincent C. Scholz, P. S. Koutsourelakis
Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject Authors: Zenghao Duan, Wenbin Duan, Zhiyi Yin, Yinghan Shen, Shaoling Jing, Jie Zhang, Huawei Shen, Xueqi Cheng
Variational Learning Induces Adaptive Label Smoothing Authors: Sin-Han Yang, Zhedong Liu, Gian Maria Marconi, Mohammad Emtiyaz Khan
Does Training on Synthetic Data Make Models Less Robust? Authors: Lingze Zhang, Ellie Pavlick
XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units Authors: Arghadip Das, Arnab Raha, Shamik Kundu, Soumendu Kumar Ghosh, Deepak Mathaikutty, Vijay Raghunathan
Dataset Ownership Verification in Contrastive Pre-trained Models Authors: Yuechen Xie, Jie Song, Mengqi Xue, Haofei Zhang, Xingen Wang, Bingde Hu, Genlang Chen, Mingli Song
Automated Consistency Analysis of LLMs Authors: Aditya Patwardhan, Vivek Vaidya, Ashish Kundu
Auditing Prompt Caching in Language Model APIs Authors: Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto

1. Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models

ArXiv ID: 2502.06999

Authors: Siddarth Venkatraman, Mohsin Hasan, Minsu Kim, Luca Scimeca, Marcin Sendera, Yoshua Bengio, Glen Berseth, Nikolay Malkin

Abstract: Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous ('outsourced') Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_\theta(\mathbf{z})$. In such a model (e.g., a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_\theta(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_\theta(\mathbf{x})r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_\theta(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints of interest, the posterior in the noise space is smoother than the posterior in the data space, making it more amenable to such amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow-based priors, comparing favorably both with current amortized and non-amortized inference methods. We demonstrate the proposed outsourced diffusion sampling in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.

Comment: Author match
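
The noise-space reformulation described in the abstract can be written out explicitly. The following is a sketch under the assumptions that the prior over $\mathbf{z}$ is a standard Gaussian and that $f_\theta$ is measurable; the symbols $q$, $g$, and $Z$ are our own notation and do not come from the paper.

```latex
% Noise-space target (q is our notation for it):
q(\mathbf{z}\mid\mathbf{y}) \;=\; \frac{1}{Z}\,
  \mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})\,
  r\big(f_\theta(\mathbf{z}),\mathbf{y}\big)

% For any test function g, pushing q through f_\theta gives
\mathbb{E}_{\mathbf{z}\sim q}\big[g(f_\theta(\mathbf{z}))\big]
  \;=\; \frac{1}{Z}\,\mathbb{E}_{\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})}
        \big[r(f_\theta(\mathbf{z}),\mathbf{y})\,g(f_\theta(\mathbf{z}))\big]
  \;=\; \frac{1}{Z}\,\mathbb{E}_{\mathbf{x}\sim p_\theta}
        \big[r(\mathbf{x},\mathbf{y})\,g(\mathbf{x})\big]
```

So if $\mathbf{z}\sim q(\mathbf{z}\mid\mathbf{y})$, then $\mathbf{x}=f_\theta(\mathbf{z})$ is distributed according to $p(\mathbf{x}\mid\mathbf{y}) \propto p_\theta(\mathbf{x})r(\mathbf{x},\mathbf{y})$, which is why a sampler trained in the noise space can stand in for one trained in the data space.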

2. Monte Carlo Tree Diffusion for System 2 Planning

ArXiv ID: 2502.07202

Authors: Jaesik Yoon, Hyeonseo Cho, Doojin Baek, Yoshua Bengio, Sungjin Ahn

Abstract: Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS), whose performance naturally improves with additional test-time computation (TTC), standard diffusion-based planners offer only limited avenues for TTC scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS, such as controlling exploration-exploitation trade-offs, within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as TTC increases.

Comment: Author match

3. Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving

ArXiv ID: 2502.07640

Authors: Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, Chi Jin

Abstract: We introduce Goedel-Prover, an open-source large language model (LLM) that achieves state-of-the-art (SOTA) performance in automated formal proof generation for mathematical problems. The key challenge in this field is the scarcity of formalized math statements and proofs, which we tackle in the following ways. We train statement formalizers to translate the natural language math problems from Numina into formal language (Lean 4), creating a dataset of 1.64 million formal statements. LLMs are used to check that the formal statements accurately preserve the content of the original natural language problems. We then iteratively build a large dataset of formal proofs by training a series of provers. Each prover succeeds in proving many statements that the previous ones could not, and these new proofs are added to the training set for the next prover. The final prover outperforms all existing open-source models in whole-proof generation. On the miniF2F benchmark, it achieves a 57.6% success rate (Pass@32), exceeding the previous best open-source model by 7.6%. On PutnamBench, Goedel-Prover successfully solves 7 problems (Pass@512), ranking first on the leaderboard. Furthermore, it generates 29.7K formal proofs for Lean Workbook problems, nearly doubling the 15.7K produced by earlier works.

Comment: The paper introduces Goedel-Prover, a state-of-the-art LLM for automated theorem proving. It aligns with foundational research in LLMs, particularly in advancing their capabilities and training methodologies.

Relevance: 9 Novelty: 9

4. MoENAS: Mixture-of-Expert based Neural Architecture Search for jointly Accurate, Fair, and Robust Edge Deep Neural Networks

ArXiv ID: 2502.07422

Authors: Lotfi Abdelkrim Mecharbat, Alberto Marchisio, Muhammad Shafique, Mohammad M. Ghassemi, Tuka Alhanai

Abstract: There has been a surge in optimizing edge Deep Neural Networks (DNNs) for accuracy and efficiency using traditional optimization techniques such as pruning, and more recently, employing automatic design methodologies. However, these design techniques have often overlooked critical metrics such as fairness, robustness, and generalization. As a result, when evaluating SOTA edge DNNs' performance in image classification using the FACET dataset, we found that they exhibit significant accuracy disparities (14.09%) across 10 different skin tones, alongside issues of non-robustness and poor generalizability. In response to these observations, we introduce Mixture-of-Experts-based Neural Architecture Search (MoENAS), an automatic design technique that navigates through a space of mixture of experts to discover accurate, fair, robust, and general edge DNNs. MoENAS improves the accuracy by 4.02% compared to SOTA edge DNNs and reduces the skin tone accuracy disparities from 14.09% to 5.60%, while enhancing robustness by 3.80% and minimizing overfitting to 0.21%, all while keeping model size close to the average size of state-of-the-art models (+0.4M). With these improvements, MoENAS establishes a new benchmark for edge DNN design, paving the way for the development of more inclusive and robust edge DNNs.

Comment: The paper introduces MoENAS, a Mixture-of-Experts-based NAS method for edge DNNs, which aligns with architectural innovations and MoE research. It also addresses fairness and robustness, adding to its relevance.

Relevance: 10 Novelty: 8

5. Revisiting Non-Acyclic GFlowNets in Discrete Environments

ArXiv ID: 2502.07735

Authors: Nikita Morozov, Ian Maksimov, Daniil Tiapkin, Sergey Samsonov

Abstract: Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects from a given probability distribution, potentially known up to a normalizing constant. Instead of working in the object space, GFlowNets proceed by sampling trajectories in an appropriately constructed directed acyclic graph environment, greatly relying on the acyclicity of the graph. In our paper, we revisit the theory that relaxes the acyclicity assumption and present a simpler theoretical framework for non-acyclic GFlowNets in discrete environments. Moreover, we provide various novel theoretical insights related to training with fixed backward policies, the nature of flow functions, and connections between entropy-regularized RL and non-acyclic GFlowNets, which naturally generalize the respective concepts and theoretical results from the acyclic setting. In addition, we experimentally re-examine the concept of loss stability in non-acyclic GFlowNet training, as well as validate our own theoretical findings.

Comment: The paper revisits non-acyclic GFlowNets and provides theoretical insights, which align with emerging trends and foundational research in generative models.

Relevance: 9 Novelty: 8

6. Global Universal Scaling and Ultra-Small Parameterization in Machine Learning Interatomic Potentials with Super-Linearity

ArXiv ID: 2502.07293

Authors: Yanxiao Hu, Ye Sheng, Jing Huang, Xiaoxin Xu, Yuyan Yang, Mingqiang Zhang, Yabei Wu, Caichao Ye, Jiong Yang, Wenqing Zhang

Abstract: Using machine learning (ML) to construct interatomic interactions and thus the potential energy surface (PES) has become a common strategy for materials design and simulations. However, current machine learning interatomic potential (MLIP) models provide no relevant physical constraints, and thus may suffer from an intrinsic out-of-domain difficulty that underlies the challenges of model generalizability and physical scalability. Here, by incorporating a physics-informed Universal-Scaling law and a nonlinearity-embedded interaction function, we develop a Super-linear MLIP with both Ultra-Small parameterization and greatly expanded expressive capability, named SUS2-MLIP. Because its global scaling is rooted in the universal equation of state (UEOS), SUS2-MLIP not only has significantly fewer parameters, obtained by decoupling the element space from the coordinate space, but also naturally overcomes the out-of-domain difficulty and endows the potentials with inherent generalizability and scalability even with a relatively small training dataset. The nonlinearity-embedding transformation of the interaction function expands the expressive capability and makes the potentials super-linear. SUS2-MLIP outperforms state-of-the-art MLIP models with its exceptional computational efficiency, especially for multiple-element materials, and its physical scalability in property prediction. This work not only presents a highly efficient universal MLIP model but also sheds light on incorporating physical constraints into artificial-intelligence-aided materials simulation.

Comment: The paper introduces a physics-informed MLIP model with ultra-small parameterization and scalability, aligning with foundational research in efficiency and sparsity. It incorporates physical constraints, which is a novel approach.

Relevance: 8 Novelty: 9

7. Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

ArXiv ID: 2502.07490

Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu

Abstract: Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard next-token prediction autoregressively using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.

Comment: The paper introduces a novel training paradigm (MEAP) for LLMs that integrates Masked Language Modeling into Next-Token Prediction, which aligns with foundational research in representation learning and training dynamics of neural networks.

Relevance: 9 Novelty: 8
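
The data preparation step that the MEAP abstract describes (mask a small fraction of input tokens, then run ordinary next-token prediction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_meap_example`, the `mask_id` placeholder, and the choice to keep the original unmasked tokens as targets are all assumptions on our part.

```python
import random


def make_meap_example(tokens, mask_ratio=0.15, mask_id=-1, seed=0):
    """Build (inputs, targets) for next-token prediction with masked inputs.

    Assumption: targets are the original tokens shifted by one, so only the
    inputs are corrupted. The abstract does not state this detail.
    """
    rng = random.Random(seed)
    n_inputs = len(tokens) - 1            # last token is only ever a target
    n_mask = int(n_inputs * mask_ratio)   # number of input positions to hide
    masked = set(rng.sample(range(n_inputs), n_mask))

    # Replace the chosen input positions with the (hypothetical) mask id.
    inputs = [mask_id if i in masked else tok
              for i, tok in enumerate(tokens[:-1])]
    # Standard next-token targets: position i predicts original token i + 1.
    targets = list(tokens[1:])
    return inputs, targets, sorted(masked)


if __name__ == "__main__":
    toy_tokens = list(range(100, 121))
    inp, tgt, pos = make_meap_example(toy_tokens, mask_ratio=0.25)
    print("masked positions:", pos)
    print("inputs :", inp)
    print("targets:", tgt)
```

Because only the inputs change, a decoder-only model with causal attention can be trained on these pairs without architectural changes, which is consistent with the abstract's claim of no added overhead.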

8. Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline

ArXiv ID: 2502.06888

Authors: Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

Abstract: Mixture of Experts (MoE), with its distinctive sparse structure, enables the scaling of language models up to trillions of parameters without significantly increasing computational costs. However, the substantial parameter size presents a challenge for inference, as the expansion in GPU memory cannot keep pace with the growth in parameters. Although offloading techniques utilise memory from the CPU and disk and parallelise the I/O and computation for efficiency, the computation for each expert in MoE models is often less than the I/O, resulting in numerous bubbles in the pipeline. Therefore, we propose Klotski, an efficient MoE inference engine that significantly reduces pipeline bubbles through a novel expert-aware multi-batch pipeline paradigm. The proposed paradigm uses batch processing to extend the computation time of the current layer to overlap with the loading time of the next layer. Although this idea has been effectively applied to dense models, more batches may activate more experts in the MoE, leading to longer loading times and more bubbles. Thus, unlike traditional approaches, we balance computation and I/O time and minimise bubbles by orchestrating their inference orders based on their heterogeneous computation and I/O requirements and activation patterns under different batch numbers. Moreover, to adapt to different hardware environments and models, we design a constraint-sensitive I/O-compute planner and a correlation-aware expert prefetcher for a schedule that minimises pipeline bubbles. Experimental results demonstrate that Klotski achieves a superior throughput-latency trade-off compared to state-of-the-art techniques, with throughput improvements of up to 85.12x.

Comment: The paper proposes Klotski, an efficient MoE inference engine, which aligns with the core topic of Mixture-of-Experts and introduces novel efficiency improvements.

Relevance: 9 Novelty: 8

9. Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension

ArXiv ID: 2502.07752

Authors: Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds

Abstract: Designing efficient optimizers for large language models (LLMs) with low-memory requirements and fast convergence is an important and challenging problem. This paper makes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations of practical efficient optimizers for LLMs, involving the careful selection of structural assumptions to balance generality and efficiency, and enhancing memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam with little memory overhead. Notably, Alice achieves better than 2x faster convergence over Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.

Comment: This paper introduces a low-rank extension framework for efficient optimizers in LLMs, which aligns with the model compression and efficiency criteria. The use of structured Fisher approximation and novel optimizer designs adds significant methodological contributions.

Relevance: 9 Novelty: 8
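
As a worked instance of viewing an optimizer as a structured FIM approximation under the Frobenius norm, consider the simplest structural assumption, a diagonal matrix. This case is our own illustrative choice; the symbols $g$ (per-sample gradient), $F$, and $D$ are our notation, and the abstract does not spell this example out.

```latex
% Fisher information matrix built from gradients g:
F \;=\; \mathbb{E}\big[g\,g^{\top}\big]

% Restrict the approximation to diagonal matrices D:
\min_{D\ \text{diagonal}} \;\|F - D\|_F^2
  \;=\; \min_{D}\; \sum_{i}\big(F_{ii} - D_{ii}\big)^2 \;+\; \sum_{i\neq j} F_{ij}^2

% The off-diagonal sum does not depend on D, so the minimizer is
D^{\star}_{ii} \;=\; F_{ii} \;=\; \mathbb{E}\big[g_i^{2}\big]
```

The minimizer is the vector of expected squared gradients, the quantity that Adam-style second-moment estimates track with a moving average. Richer structural assumptions change the constraint set on $D$ and lead to different optimizers.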

10. A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models

ArXiv ID: 2502.07222

Authors: Yiming Chen, Yuan Zhang, Yin Liu, Kun Yuan, Zaiwen Wen

Abstract: The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well-understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.

Comment: The paper introduces a randomized subspace optimization method for training LLMs, addressing memory efficiency challenges. This aligns with model compression and efficiency criteria and provides strong theoretical contributions.

Relevance: 9 Novelty: 8
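
The idea of optimizing within a freshly drawn random subspace at each iteration can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a simple quadratic objective, uses a coordinate-aligned random subspace for simplicity, and the function name `random_subspace_descent` is hypothetical.

```python
import random


def random_subspace_descent(target, dim_sub=2, steps=50, lr=0.5, seed=0):
    """Minimize 0.5 * ||x - target||^2 by updating a random subspace per step.

    Assumption: the subspace is a random subset of coordinates. The paper's
    subspace construction may differ.
    """
    rng = random.Random(seed)
    d = len(target)
    x = [0.0] * d
    losses = []
    for _ in range(steps):
        # Draw the subspace for this iteration.
        coords = rng.sample(range(d), dim_sub)
        # Gradient step restricted to the chosen coordinates only; all other
        # coordinates (and any state tied to them) are left untouched.
        for i in coords:
            grad_i = x[i] - target[i]
            x[i] -= lr * grad_i
        losses.append(0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target)))
    return x, losses


if __name__ == "__main__":
    tgt = [1.0, -2.0, 3.0, 0.5, -1.5, 2.0]
    x_final, loss_curve = random_subspace_descent(tgt)
    print("first loss:", loss_curve[0])
    print("final loss:", loss_curve[-1])
```

Each iteration touches only `dim_sub` coordinates, so per-step working state scales with the subspace size rather than the full dimension, which is the intuition behind the memory savings the abstract reports.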

11. LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation

ArXiv ID: 2502.07365

Authors: Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen

Abstract: Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation by minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines.

Comment: The paper proposes a method to mitigate performance degradation in LLMs with extended context windows, focusing on theoretical insights into distribution drift and catastrophic forgetting. This aligns with the interest in foundational research on LLM behavior.

Relevance: 9 Novelty: 8

12. Model Fusion via Neuron Transplantation

ArXiv ID: 2502.06849

Authors: Muhammed Öz, Nicholas Kiefer, Charlotte Debus, Jasmin Hörter, Achim Streit, Markus Götz

Abstract: Ensemble learning is a widespread technique to improve the prediction performance of neural networks. However, it comes at the price of increased memory and inference time. In this work we propose a novel model fusion technique called Neuron Transplantation (NT) in which we fuse an ensemble of models by transplanting important neurons from all ensemble members into the vacant space obtained by pruning insignificant neurons. An initial loss in performance post-transplantation can be quickly recovered via fine-tuning, consistently outperforming individual ensemble members of the same model capacity and architecture. Furthermore, NT enables all the ensemble members to be jointly pruned and jointly trained in a combined model. Compared to alignment-based averaging (like Optimal-Transport-fusion), it requires less fine-tuning than the corresponding OT-fused model, the fusion itself is faster and requires less memory, while the resulting model performance is comparable or better. The code is available under the following link: https://github.com/masterbaer/neuron-transplantation.

Comment: The paper introduces a novel model fusion technique called Neuron Transplantation, which aligns with model compression and efficiency breakthroughs by reducing memory and inference costs.

Relevance: 9 Novelty: 8

13. Harnessing Language's Fractal Geometry with Recursive Inference Scaling

ArXiv ID: 2502.07503

Authors: Ibrahim Alabdulmohsin, Xiaohua Zhai

Abstract: Recent research in language modeling reveals two scaling effects: the well-known improvement from increased training compute, and a lesser-known boost from applying more sophisticated or computationally intensive inference methods. Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time. For a given fixed model architecture and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. These advantages are maintained even when compared to state-of-the-art recursive techniques like the "repeat-all-over" (RAO) strategy in Mobile LLM. Finally, stochastic RINS not only can enhance performance further but also provides the flexibility to optionally forgo increased inference computation at test time with minimal performance degradation.

Comment: The paper introduces Recursive Inference Scaling (RINS), which provides theoretical insights into scaling laws and inference methods for LLMs, aligning with foundational research in LLM behavior.

Relevance: 9 Novelty: 8

14. Online Scheduling for LLM Inference with KV Cache Constraints

ArXiv ID: 2502.07115

Authors: Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou

Abstract: Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose novel batching and scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory. We analyze both semi-online and fully online scheduling models, and our results are threefold. First, we provide a polynomial-time algorithm that achieves exact optimality in terms of average latency in the semi-online prompt arrival model. Second, in the fully online case with a stochastic prompt arrival, we introduce an efficient online scheduling algorithm with constant regret. Third, we prove that no algorithm (deterministic or randomized) can achieve a constant competitive ratio in fully online adversarial settings. Our empirical evaluations on a public LLM inference dataset, using the Llama-70B model on A100 GPUs, show that our approach significantly outperforms benchmark algorithms used currently in practice, achieving lower latency while reducing energy consumption. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.

Comment: The paper addresses KV cache constraints in LLM inference with novel scheduling algorithms, contributing to foundational research in model efficiency and compression.

Relevance: 9 Novelty: 8

15. Enabling Autoregressive Models to Fill In Masked Tokens

ArXiv ID: 2502.06901

Authors: Daniel Israel, Aditya Grover, Guy Van den Broeck

Abstract: Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently incapable of masked infilling, which is the ability to predict masked tokens between past and future context. In contrast, MLM models suffer from intrinsic computational inefficiencies during both training and inference that hinder their scalability. This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that leverages the strengths of both paradigms to achieve state-of-the-art masked infilling performance. MARIA combines a pre-trained MLM and AR model by training a linear decoder that takes their concatenated hidden states as input. This minimal modification enables the AR model to perform infilling while retaining its inherent advantages in terms of faster inference with KV caching. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.

Comment: The paper introduces MARIA, a novel architecture combining MLM and AR models for masked infilling, which aligns with foundational research in model architecture and LLM behavior.

Relevance: 9 Novelty: 8

16. Exploring Model Invariance with Discrete Search for Ultra-Low-Bit Quantization

ArXiv ID: 2502.06844

Authors: Yuqiao Wen, Yanshuai Cao, Lili Mou

Abstract: Large language models have been increasing in size due to their success in a wide range of applications. This calls for a pressing need to reduce memory usage to make them more accessible. Post-training quantization is a popular technique which uses fewer bits (e.g., 4-8 bits) to represent the model without retraining it. However, it remains a challenging task to perform quantization in an ultra-low-bit setup (e.g., 2 bits). In this paper, we propose InvarExplore, a unified framework that systematically explores different model invariance at the same time, allowing us to take advantage of the synergy between each type of invariance. Importantly, InvarExplore features a discrete search algorithm that enables us to explore permutation invariance, which is under-studied as it cannot be optimized with gradient-based methods. Results show that InvarExplore is compatible with existing state-of-the-art methods, achieving an add-on performance improvement over strong competing methods.

Comment: The paper proposes InvarExplore, a framework for ultra-low-bit quantization with novel discrete search algorithms, aligning with foundational research in model compression.

Relevance: 9 Novelty: 8
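
Illustration (not from the paper): the permutation invariance mentioned in the abstract can be seen on a toy two-layer ReLU network. Reordering the hidden units, i.e. the rows of the first weight matrix together with the matching columns of the second, leaves the output unchanged, so the reordering is a discrete choice that gradients cannot reach. The weights, shapes, and helper names below are assumptions made for this sketch and are not part of InvarExplore.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def mlp(w1, w2, x):
    """Two-layer network: w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))


def permute_hidden(w1, w2, perm):
    """Reorder hidden units: rows of w1 and the matching columns of w2."""
    w1_p = [w1[i] for i in perm]
    w2_p = [[row[i] for i in perm] for row in w2]
    return w1_p, w2_p


# Hypothetical toy weights: 2 inputs, 3 hidden units, 2 outputs.
W1 = [[0.5, -1.0], [1.5, 2.0], [-0.25, 0.75]]
W2 = [[1.0, -2.0, 0.5], [0.0, 1.0, 3.0]]
X = [1.0, 2.0]

W1_P, W2_P = permute_hidden(W1, W2, [2, 0, 1])
ORIGINAL = mlp(W1, W2, X)
PERMUTED = mlp(W1_P, W2_P, X)
print(ORIGINAL, PERMUTED)
```

The function computed is identical, but the permuted weight matrices group different values together, which is one reason a permutation can matter once weights are quantized in blocks.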

17. Private Low-Rank Approximation for Covariance Matrices, Dyson Brownian Motion, and Eigenvalue-Gap Bounds for Gaussian Perturbations

ArXiv ID: 2502.07657

Authors: Oren Mangoubi, Nisheeth K. Vishnoi

Abstract: We consider the problem of approximating a $d \times d$ covariance matrix $M$ with a rank-$k$ matrix under $(\varepsilon,\delta)$-differential privacy. We present and analyze a complex variant of the Gaussian mechanism and obtain upper bounds on the Frobenius norm of the difference between the matrix output by this mechanism and the best rank-$k$ approximation to $M$. Our analysis provides improvements over previous bounds, particularly when the spectrum of $M$ satisfies natural structural assumptions. The novel insight is to view the addition of Gaussian noise to a matrix as a continuous-time matrix Brownian motion. This viewpoint allows us to track the evolution of eigenvalues and eigenvectors of the matrix, which are governed by stochastic differential equations discovered by Dyson. These equations enable us to upper bound the Frobenius distance between the best rank-$k$ approximation of $M$ and that of a Gaussian perturbation of $M$ as an integral that involves inverse eigenvalue gaps of the stochastically evolving matrix, as opposed to a sum of perturbation bounds obtained via Davis-Kahan-type theorems. Subsequently, again using the Dyson Brownian motion viewpoint, we show that the eigenvalues of the matrix $M$ perturbed by Gaussian noise have large gaps with high probability. These results also contribute to the analysis of low-rank approximations under average-case perturbations, and to an understanding of eigenvalue gaps for random matrices, both of which may be of independent interest.

Comment: The paper introduces a novel approach to low-rank approximation with differential privacy, leveraging Dyson Brownian motion. This aligns with the model compression topic, particularly low-rank approaches, and provides theoretical insights.

Relevance: 9 Novelty: 8

18. HRP: High-Rank Preheating for Superior LoRA Initialization

ArXiv ID: 2502.07739

Authors: Yuzhu Chen, Yingjie Wang, Shi Fu, Li Shen, Yongcheng Jing, Xinmei Tian, Dacheng Tao

Abstract: This paper studies the crucial impact of initialization on the convergence properties of Low-Rank Adaptation (LoRA). We theoretically demonstrate that random initialization, a widely used scheme, will likely lead LoRA to random low-rank results, rather than the best low-rank result. While this issue can be mitigated by adjusting initialization towards a well-informed direction, it relies on prior knowledge of the target, which is typically unknown in real-world scenarios. To approximate this well-informed initial direction, we propose High-Rank Preheating (HRP), which fine-tunes high-rank LoRA for a few steps and uses the singular value decomposition of the preheated result as a superior initialization. HRP initialization is theory-supported to combine the convergence strengths of high-rank LoRA and the generalization strengths of low-rank LoRA. Extensive experiments demonstrate that HRP significantly enhances LoRA's effectiveness across various models and tasks, achieving performance comparable to full-parameter fine-tuning and outperforming other initialization strategies.

Comment: The paper introduces a novel initialization method for LoRA, which directly contributes to low-rank adaptation and aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

19. When More is Less: Understanding Chain-of-Thought Length in LLMs

ArXiv ID: 2502.07266

Authors: Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, Yisen Wang

Abstract: Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs) by breaking complex tasks into smaller, manageable sub-tasks. Researchers have been exploring ways to guide models to generate more complex CoT processes to improve the reasoning ability of LLMs, such as long CoT and the test-time scaling law. However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases. To understand this phenomenon, we provide a piece of evidence that longer reasoning processes are increasingly susceptible to noise. We theoretically prove the existence of an optimal CoT length and derive a scaling law for this optimal length based on model capability and task difficulty. Inspired by our theory, we conduct experiments on both synthetic and real world datasets and propose Length-filtered Vote to alleviate the effects of excessively long or short CoTs. Our findings highlight the critical need to calibrate CoT length to align with model capabilities and task demands, offering a principled framework for optimizing multi-step reasoning in LLMs.

Comment: The paper provides theoretical insights into Chain-of-Thought (CoT) reasoning in LLMs, including optimal CoT length and noise susceptibility. This aligns with the LLM behavior/interpretability criterion.

Relevance: 9 Novelty: 8
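
Illustration (not from the paper): the abstract names Length-filtered Vote but does not spell out the filtering rule. The sketch below shows one plausible reading, in which sampled answers whose chain-of-thought length falls outside a chosen window are dropped before a majority vote. The window bounds, the sample format, and the function name are assumptions; the paper's actual procedure may differ.

```python
from collections import Counter


def length_filtered_vote(samples, min_steps, max_steps):
    """Majority vote over answers whose CoT length lies in [min_steps, max_steps].

    samples: list of (answer, num_reasoning_steps) pairs.
    Falls back to a plain majority vote if the filter removes everything.
    """
    kept = [ans for ans, steps in samples if min_steps <= steps <= max_steps]
    if not kept:
        kept = [ans for ans, _ in samples]
    return Counter(kept).most_common(1)[0][0]


# Hypothetical samples: very short and very long chains agree on "17",
# while the mid-length chains agree on "42".
SAMPLES = [("42", 4), ("42", 5), ("17", 1), ("17", 12), ("17", 14)]

PLAIN_VOTE = Counter(ans for ans, _ in SAMPLES).most_common(1)[0][0]
FILTERED_VOTE = length_filtered_vote(SAMPLES, min_steps=3, max_steps=8)
print(PLAIN_VOTE, FILTERED_VOTE)
```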

20. Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents

ArXiv ID: 2502.06975

Authors: Mathis Pink, Qinyuan Wu, Vy Ai Vo, Javier Turek, Jianing Mu, Alexander Huth, Mariya Toneva

Abstract: As Large Language Models (LLMs) evolve from text-completion tools into fully fledged agents operating in dynamic environments, they must address the challenge of continually learning and retaining long-term knowledge. Many biological systems solve these challenges with episodic memory, which supports single-shot learning of instance-specific contexts. Inspired by this, we present an episodic memory framework for LLM agents, centered around five key properties of episodic memory that underlie adaptive and context-sensitive behavior. With various research efforts already partially covering these properties, this position paper argues that now is the right time for an explicit, integrated focus on episodic memory to catalyze the development of long-term agents. To this end, we outline a roadmap that unites several research directions under the goal to support all five properties of episodic memory for more efficient long-term LLM agents.

Comment: The paper discusses episodic memory for LLMs, which aligns with emerging trends and foundational research in LLM behavior and long-term memory integration.

Relevance: 9 Novelty: 8

21. Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure

ArXiv ID: 2502.06846

Authors: Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, Yiqin Gao

Abstract: Proteins play a pivotal role in living organisms, yet understanding their functions presents significant challenges, including the limited flexibility of classification-based methods, the inability to effectively leverage spatial structural information, and the lack of systematic evaluation metrics for protein Q&A systems. To address these limitations, we propose Prot2Chat, a novel framework that integrates multimodal protein representations with natural language through a unified module, enabling large language model (LLM)-driven answer generation. Our model incorporates a modified ProteinMPNN encoder, which encodes protein sequence and structural information in a unified manner, a protein-text adapter with cross-attention mechanisms, and a LLaMA3 decoder. To optimize training efficiency, we freeze the encoder and employ LoRA techniques for the decoder. We conducted experiments on two datasets; both automated metrics and expert evaluations demonstrate the superior performance of our model. Furthermore, zero-shot prediction results highlight its strong generalization capabilities. This framework offers a promising solution for bridging protein domain knowledge with natural language understanding, paving the way for transformative advancements in protein-related research.

Comment: The paper introduces a protein LLM framework integrating sequence and structure, which aligns with foundational research in AI for science and multimodal representation learning.

Relevance: 8 Novelty: 8

22. Negative Dependence as a toolbox for machine learning : review and new developments

ArXiv ID: 2502.07285

Authors: Hoang-Son Tran, Vladimir Petrovic, Remi Bardenet, Subhroshekhar Ghosh

Abstract: Negative dependence is becoming a key driver in advancing learning capabilities beyond the limits of traditional independence. Recent developments have evidenced support towards negatively dependent systems as a learning paradigm in a broad range of fundamental machine learning challenges including optimization, sampling, dimensionality reduction and sparse signal recovery, often surpassing the performance of current methods based on statistical independence. The most popular negatively dependent model has been that of determinantal point processes (DPPs), which have their origins in quantum theory. However, other models, such as perturbed lattice models, strongly Rayleigh measures, zeros of random functions have gained salience in various learning applications. In this article, we review this burgeoning field of research, as it has developed over the past two decades or so. We also present new results on applications of DPPs to the parsimonious representation of neural networks. In the limited scope of the article, we mostly focus on aspects of this area to which the authors contributed over the recent years, including applications to Monte Carlo methods, coresets and stochastic gradient descent, stochastic networks, signal processing and connections to quantum computation. However, starting from basics of negative dependence for the uninitiated reader, extensive references are provided to a broad swath of related developments which could not be covered within our limited scope. While existing works and reviews generally focus on specific negatively dependent models (e.g. DPPs), a notable feature of this article is that it addresses negative dependence as a machine learning methodology as a whole. In this vein, it covers within its span an array of negatively dependent models and their applications well beyond DPPs, thereby putting forward a very general and rather unique perspective.

Comment: The paper reviews negative dependence as a machine learning methodology and explores its applications, including neural networks, which aligns with emerging trends in foundational research.

Relevance: 8 Novelty: 8

23. MAGELLAN: Metacognitive predictions of learning progress guide autotelic LLM agents in large goal spaces

ArXiv ID: 2502.07709

Authors: Loris Gaven, Thomas Carta, Clément Romac, Cédric Colas, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer

Abstract: Open-ended learning agents must efficiently prioritize goals in vast possibility spaces, focusing on those that maximize learning progress (LP). When such autotelic exploration is achieved by LLM agents trained with online RL in high-dimensional and evolving goal spaces, a key challenge for LP prediction is modeling one's own competence, a form of metacognitive monitoring. Traditional approaches either require extensive sampling or rely on brittle expert-defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and LP online. By capturing semantic relationships between goals, MAGELLAN enables sample-efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, we show that MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method allowing the agent to fully master a large and evolving goal space. These results demonstrate how augmenting LLM agents with a metacognitive ability for LP predictions can effectively scale curriculum learning to open-ended goal spaces.

Comment: The paper proposes a metacognitive framework for goal prioritization in LLM agents, which aligns with emerging trends in LLM behavior and interpretability. The focus on metacognitive learning is novel and impactful.

Relevance: 8 Novelty: 8

24. Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification

ArXiv ID: 2502.07299

Authors: Zicheng Liu, Siyuan Li, Zhiyuan Chen, Lei Xin, Fang Wu, Chang Yu, Qirong Yang, Yucheng Guo, Yujie Yang, Stan Z. Li

Abstract: The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. While modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains under-explored. In this paper, we follow the guidance of the central dogma to redesign both the data and model pipeline and offer a comprehensive framework, Life-Code, that spans different biological functions. As for data flow, we propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. As for the model, we design a codon tokenizer and a hybrid long-sequence architecture to encode the interactions of both coding and non-coding regions with masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. Such designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics with the central dogma. Extensive experiments show that Life-Code achieves state-of-the-art performance on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.

Comment: The paper introduces a multi-omics framework with architectural innovations like codon tokenizers and hybrid long-sequence models, aligning with foundational research in AI for science.

Relevance: 8 Novelty: 8
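
Illustration (not from the paper): the data unification described in the abstract, mapping RNA and amino-acid sequences onto a nucleotide alphabet and then tokenizing by codon, can be sketched as follows. Everything here is a simplifying assumption: RNA is mapped by a plain U-to-T substitution, each amino acid is assigned a single representative codon from the standard genetic code (real reverse translation is one-to-many), and the function names are hypothetical rather than Life-Code's.

```python
# One representative codon per amino acid (standard genetic code).
# Choosing a single codon is an assumption of this sketch.
CODON_FOR_AMINO_ACID = {
    "A": "GCT", "R": "CGT", "N": "AAT", "D": "GAT", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTT", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCT",
    "S": "TCT", "T": "ACT", "W": "TGG", "Y": "TAT", "V": "GTT",
}


def reverse_transcribe(rna):
    """Express an RNA sequence in the DNA alphabet (U -> T)."""
    return rna.upper().replace("U", "T")


def reverse_translate(protein):
    """Map each amino acid to its representative codon."""
    return "".join(CODON_FOR_AMINO_ACID[aa] for aa in protein.upper())


def codon_tokenize(dna):
    """Split a nucleotide sequence into 3-mers, dropping a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


DNA_FROM_RNA = reverse_transcribe("AUGGCU")
DNA_FROM_PROTEIN = reverse_translate("MKV")
TOKENS = codon_tokenize(DNA_FROM_PROTEIN)
print(DNA_FROM_RNA, DNA_FROM_PROTEIN, TOKENS)
```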

25. Understanding the Generalization Error of Markov algorithms through Poissonization

ArXiv ID: 2502.07584

Authors: Benjamin Dupuis, Maxime Haddouche, George Deligiannidis, Umut Simsekli

Abstract: Using continuous-time stochastic differential equation (SDE) proxies to stochastic optimization algorithms has proven fruitful for understanding their generalization abilities. A significant part of these approaches are based on the so-called "entropy flows", which greatly simplify the generalization analysis. Unfortunately, such well-structured entropy flows cannot be obtained for most discrete-time algorithms, and the existing SDE approaches remain limited to specific noise and algorithmic structures. We aim to alleviate this issue by introducing a generic framework for analyzing the generalization error of Markov algorithms through "Poissonization", a continuous-time approximation of discrete-time processes with formal approximation guarantees. Through this approach, we first develop a novel entropy flow, which directly leads to PAC-Bayesian generalization bounds. We then draw novel links to modified versions of the celebrated logarithmic Sobolev inequalities (LSI), identify cases where such LSIs are satisfied, and obtain improved bounds. Beyond its generality, our framework allows exploiting specific properties of learning algorithms. In particular, we incorporate the noise structure of different algorithm types - namely, those with additional noise injections (noisy) and those without (non-noisy) - through various technical tools. This illustrates the capacity of our methods to achieve known (yet, Poissonized) and new generalization bounds.

Comment: The paper provides a theoretical framework for analyzing generalization error in Markov algorithms using Poissonization, which contributes to understanding training dynamics in neural networks.

Relevance: 8 Novelty: 8

26. LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

ArXiv ID: 2502.07563

Authors: Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng

Abstract: Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very long input sequences. Compared to the previous work, LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only one single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements of both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as a part of: https://github.com/OpenSparseLLMs/Linear-MoE.

Comment: The paper introduces LASP-2, a sequence parallelism method for linear attention models, which aligns with the model architecture criterion, particularly for efficiency in transformers.

Relevance: 8 Novelty: 8

27. EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

ArXiv ID: 2502.06852

Authors: Lin Zhang, Wenshuo Dong, Zhuoran Zhang, Shu Yang, Lijie Hu, Ninghao Liu, Pan Zhou, Di Wang

Abstract: Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is either affected by the zero-gradient problem or saturation effects, where edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP). EAP-GP introduces an integration path, starting from the input and adaptively following the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements up to 17.7%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits.

Comment: The paper proposes a method to mitigate saturation effects in gradient-based circuit identification for transformer models, which aligns with interpretability and mechanistic insights into LLMs.

Relevance: 8 Novelty: 7

28. Quantification of model error for inverse problems in the Weak Neural Variational Inference framework

ArXiv ID: 2502.07415

Authors: Vincent C. Scholz, P. S. Koutsourelakis

Abstract: We present a novel extension of the Weak Neural Variational Inference (WNVI) framework for probabilistic material property estimation that explicitly quantifies model errors in PDE-based inverse problems. Traditional approaches assume the correctness of all governing equations, including potentially unreliable constitutive laws, which can lead to biased estimates and misinterpretations. Our proposed framework addresses this limitation by distinguishing between reliable governing equations, such as conservation laws, and uncertain constitutive relationships. By treating all state variables as latent random variables, we enforce these equations through separate sets of residuals, leveraging a virtual likelihood approach with weighted residuals. This formulation not only identifies regions where constitutive laws break down but also improves robustness against model uncertainties without relying on a fully trustworthy forward model. We demonstrate the effectiveness of our approach in the context of elastography, showing that it provides a structured, interpretable, and computationally efficient alternative to traditional model error correction techniques. Our findings suggest that the proposed framework enhances the accuracy and reliability of material property estimation by offering a principled way to incorporate uncertainty in constitutive modeling.

Comment: The paper extends the Weak Neural Variational Inference framework to quantify model errors in PDE-based inverse problems, which aligns with foundational research in AI for Science.

Relevance: 8 Novelty: 7

ArXiv ID: 2502.06868

Authors: Zenghao Duan, Wenbin Duan, Zhiyi Yin, Yinghan Shen, Shaoling Jing, Jie Zhang, Huawei Shen, Xueqi Cheng

Abstract: Knowledge editing has become a promising approach for efficiently and precisely updating knowledge embedded in large language models (LLMs). In this work, we focus on Same-Subject Editing, which involves modifying multiple attributes of a single entity to ensure comprehensive and consistent updates to entity-centric knowledge. Through preliminary observation, we identify a significant challenge: current state-of-the-art editing methods struggle when tasked with editing multiple related knowledge pieces for the same subject. To address the lack of relevant editing data for identical subjects in traditional benchmarks, we introduce the $\text{S}^2\text{RKE}$ (Same-Subject Related Knowledge Editing) benchmark. Our extensive experiments reveal that only mainstream locate-then-edit methods, such as ROME and MEMIT, exhibit "related knowledge perturbation," where subsequent edits interfere with earlier ones. Further analysis reveals that these methods over-rely on subject information, neglecting other critical factors, resulting in reduced editing effectiveness.

Comment: The paper addresses knowledge editing in LLMs and introduces a benchmark for Same-Subject Related Knowledge Editing, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 8 Novelty: 7

30. Variational Learning Induces Adaptive Label Smoothing

ArXiv ID: 2502.07273

Authors: Sin-Han Yang, Zhedong Liu, Gian Maria Marconi, Mohammad Emtiyaz Khan

Abstract: We show that variational learning naturally induces an adaptive label smoothing where label noise is specialized for each example. Such label-smoothing is useful to handle examples with labeling errors and distribution shifts, but designing a good adaptivity strategy is not always easy. We propose to skip this step and simply use the natural adaptivity induced during the optimization of a variational objective. We show empirical results where a variational algorithm called IVON outperforms traditional label smoothing and yields adaptivity strategies similar to those of an existing approach. By connecting Bayesian methods to label smoothing, our work provides a new way to handle overconfident predictions.

Comment: The paper connects variational learning to adaptive label smoothing, providing insights into handling overconfident predictions. This aligns with representation learning and training dynamics in neural networks.

Relevance: 8 Novelty: 7

31. Does Training on Synthetic Data Make Models Less Robust?

ArXiv ID: 2502.07164

Authors: Lingze Zhang, Ellie Pavlick

Abstract: An increasingly common practice is to train large language models (LLMs) using synthetic data. Often this synthetic data is produced by the same or similar LLMs as those it is being used to train. This raises the question of whether the synthetic data might in fact exacerbate certain "blindspots" by reinforcing heuristics that the LLM already encodes. In this paper, we conduct simulated experiments on the natural language inference (NLI) task with Llama-2-7B-hf models. We use MultiNLI as the general task and HANS, a targeted evaluation set designed to measure the presence of specific heuristic strategies for NLI, as our "blindspot" task. Our goal is to determine whether performance disparities between the general and blind spot tasks emerge. Our results indicate that synthetic data does not reinforce blindspots in the way we expected. Specifically, we see that, while fine-tuning with synthetic data doesn't necessarily reduce the use of the heuristic, it also does not make it worse as we hypothesized.

Comment: The paper investigates the robustness of LLMs trained on synthetic data, providing insights into LLM behavior and interpretability. This aligns with the interest in foundational research on LLMs.

Relevance: 8 Novelty: 7

32. XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units

ArXiv ID: 2502.06924

Authors: Arghadip Das, Arnab Raha, Shamik Kundu, Soumendu Kumar Ghosh, Deepak Mathaikutty, Vijay Raghunathan

Abstract: State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks, offering linear or near-linear scalability with sequence length, making them ideal for long-sequence applications in NLP, vision, and edge AI, including real-time transcription, translation, and contextual search. These applications require lightweight, high-performance models for deployment on resource-constrained devices like laptops and PCs. Designing specialized accelerators for every emerging neural network is costly and impractical; instead, optimizing models for existing NPUs in AI PCs provides a scalable solution. To this end, we propose XAMBA, the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs. XAMBA follows a three-step methodology: (1) enabling SSMs on NPUs, (2) optimizing performance to meet KPI requirements, and (3) trading accuracy for additional performance gains. After enabling SSMs on NPUs, XAMBA mitigates key bottlenecks using CumBA and ReduBA, replacing sequential CumSum and ReduceSum operations with matrix-based computations, significantly improving execution speed and memory efficiency. Additionally, ActiBA enhances performance by approximating expensive activation functions (e.g., Swish, Softplus) using piecewise linear mappings, reducing latency with minimal accuracy loss. Evaluations on an Intel Core Ultra Series 2 AI PC show that XAMBA achieves up to 2.6X speed-up over the baseline. Our implementation is available at https://github.com/arghadippurdue/XAMBA.

Comment: The paper introduces XAMBA, a framework for optimizing state-space models on NPUs, which aligns with foundational research in model efficiency and compression.

Relevance: 8 Novelty: 7
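
Illustration (not from the paper): the abstract says CumBA and ReduBA replace sequential CumSum and ReduceSum operations with matrix-based computations. A common way to do this, assumed here rather than taken from the XAMBA code, is to multiply by a lower-triangular matrix of ones for the cumulative sum and by a row of ones for the reduction, so that both become matrix products that map well onto matrix-multiply hardware. Helper names are hypothetical.

```python
from itertools import accumulate


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lower_triangular_ones(n):
    """n x n mask with ones on and below the diagonal."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]


def cumsum_via_matmul(values):
    """Cumulative sum expressed as mask @ values."""
    return matvec(lower_triangular_ones(len(values)), values)


def reducesum_via_matmul(values):
    """Full reduction expressed as (row of ones) @ values."""
    return matvec([[1.0] * len(values)], values)[0]


VALUES = [1.0, 2.0, 3.0, 4.0]
CUMSUM = cumsum_via_matmul(VALUES)
TOTAL = reducesum_via_matmul(VALUES)
REFERENCE = list(accumulate(VALUES))  # sequential baseline for comparison
print(CUMSUM, TOTAL, REFERENCE)
```

The matrix form does more arithmetic than the sequential scan (quadratic rather than linear in the sequence length), so it only pays off where the hardware executes matrix products far more efficiently than element-by-element scans.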

33. Dataset Ownership Verification in Contrastive Pre-trained Models

ArXiv ID: 2502.07276

Authors: Yuechen Xie, Jie Song, Mengqi Xue, Haofei Zhang, Xingen Wang, Bingde Hu, Genlang Chen, Mingli Song

Abstract: High-quality open-source datasets, which necessitate substantial efforts for curation, have become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised pre-trained models by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insights that when models are trained with the target dataset, the unary and binary instance relationships within the embedding space exhibit significant variations compared to models trained without the target dataset. We validate the efficacy of this approach across multiple contrastive pre-trained models including SimCLR, BYOL, SimSiam, MOCO v3, and DINO. The results demonstrate that our method rejects the null hypothesis with a $p$-value markedly below $0.05$, surpassing all previous methodologies. Our code is available at https://github.com/xieyc99/DOV4CL.

Comment: The paper proposes a dataset ownership verification method for contrastive pre-trained models, which aligns with representation learning and provides novel insights into embedding space relationships.

Relevance: 8 Novelty: 7

34. Automated Consistency Analysis of LLMs

ArXiv ID: 2502.07036

Authors: Aditya Patwardhan, Vivek Vaidya, Ashish Kundu

Abstract: Generative AI (Gen AI) with large language models (LLMs) is being widely adopted across industry, academia, and government. Cybersecurity is one of the key sectors where LLMs can be and/or are already being used. There are a number of problems that inhibit the adoption of trustworthy Gen AI and LLMs in cybersecurity and other such critical areas. One of the key challenges to the trustworthiness and reliability of LLMs is how consistent an LLM is in its responses. In this paper, we analyze and formally define the consistency of LLM responses, and then develop a framework for consistency evaluation. The paper proposes two approaches to validate consistency: self-validation, and validation across multiple LLMs. We have carried out extensive experiments for several LLMs such as GPT4oMini, GPT3.5, Gemini, Cohere, and Llama3, on a security benchmark consisting of several cybersecurity questions: informational and situational. Our experiments corroborate the fact that even though these LLMs are being considered and/or already being used for several cybersecurity tasks today, they are often inconsistent in their responses, and thus are untrustworthy and unreliable for cybersecurity.

Comment: The paper focuses on consistency analysis of LLMs, which aligns with interpretability and theoretical insights into LLM behavior, making it relevant to foundational research in LLMs.

Relevance: 8 Novelty: 7

35. Auditing Prompt Caching in Language Model APIs

ArXiv ID: 2502.07776

Authors: Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto

Abstract: Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.

Comment: The paper audits prompt caching in LLM APIs, which provides insights into LLM behavior and architecture, aligning with foundational research in LLMs.

Relevance: 8 Novelty: 7

Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Use the following relevance criteria to focus on foundational research, avoiding purely application-driven work:

Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis of existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structures themselves.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
Large Language Models (LLMs) - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords for Relevant Domains: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.

Hints on Irrelevant Domains: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.

Hints on Application Tasks: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Relevance 9-10 (Completely Relevant)
Focus: Fully aligned with core topics with no deviation; score the highest if the paper contains topic keywords.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.
Relevance 5-6 (Borderline)
Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
Focus: Purely unrelated to our topics. Completely a different domain.
Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

Novelty 9-10 (Breakthrough)
Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
Definition: Minimal originality, applying standard approaches without real innovation.
Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

ARXIVID: should be the ArXiv ID.
COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
RELEVANCE: should be a score from 1-10.
NOVELTY: should be a score from 1-10.
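
As an illustration of the output format requested above, here is a minimal Python sketch of how a JSONL response might be parsed and checked. It is not the pipeline that produced this page. The function names (parse_selection, relevance_band), the upper-case key spelling, the integer-only score check, and the sort order are all assumptions made for the example. The two sample records reuse the arXiv IDs and scores listed above, with shortened comments.

```python
import json

# Field names follow the Instructions above; exact key casing is an assumption.
REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}

# Lower bound of each relevance band, taken from the Relevance Scoring section.
RELEVANCE_BANDS = [
    (9, "Completely Relevant"),
    (7, "Relevant"),
    (5, "Borderline"),
    (3, "Irrelevant"),
    (1, "Ignore"),
]


def relevance_band(score):
    """Map a 1-10 relevance score to its band label."""
    for lower_bound, label in RELEVANCE_BANDS:
        if score >= lower_bound:
            return label
    raise ValueError(f"score out of range: {score}")


def parse_selection(jsonl_text):
    """Parse one JSON object per line and validate the required fields."""
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            if not isinstance(score, int) or not 1 <= score <= 10:
                raise ValueError(f"line {line_no}: {key} must be an integer in 1-10")
        records.append(record)
    # Hypothetical ordering: highest relevance first, then highest novelty.
    return sorted(records, key=lambda r: (r["RELEVANCE"], r["NOVELTY"]), reverse=True)


sample = "\n".join([
    '{"ARXIVID": "2502.07036", "COMMENT": "Consistency analysis of LLMs.", "RELEVANCE": 8, "NOVELTY": 7}',
    '{"ARXIVID": "2502.07776", "COMMENT": "Audits prompt caching in LLM APIs.", "RELEVANCE": 8, "NOVELTY": 7}',
])

for paper in parse_selection(sample):
    print(paper["ARXIVID"], relevance_band(paper["RELEVANCE"]), paper["NOVELTY"])
```

Validating each line separately means one malformed record can be reported by line number without discarding the rest of the response.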
