Personalized Daily Arxiv Papers 02/18/2025
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|---|
| Token | 94745 | 14401 | 109146 |
| Cost | $0.24 | $0.14 | $0.38 |
Total ArXiv papers: 1184
Total scanned papers: 681
Total relevant papers: 90
Table of contents with paper titles:
-
Intuitive physics understanding emerges from self-supervised pretraining on natural videos Authors: Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
-
In-Context Parametric Inference: Point or Distribution Estimators? Authors: Sarthak Mittal, Yoshua Bengio, Nikolay Malkin, Guillaume Lajoie
-
Mixture of Tunable Experts - Behavior Modification of DeepSeek-R1 at Inference Time Authors: Robert Dahlke, Henrik Klagges, Dan Zecha, Benjamin Merkel, Sven Rohr, Fabian Klemm
-
Bitnet.cpp: Efficient Edge Inference for Ternary LLMs Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
-
System Message Generation for User Preferences using Open-Source Models Authors: Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong
-
A Power Transform Authors: Jonathan T. Barron
-
Does Editing Provide Evidence for Localization? Authors: Zihao Wang, Victor Veitch
-
Controlling Neural Collapse Enhances Out-of-Distribution Detection and Transfer Learning Authors: Md Yousuf Harun, Jhair Gallardo, Christopher Kanan
-
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia Authors: Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre Kıcıman, Hamid Palangi, Barun Patra, Robert West
-
An Efficient Row-Based Sparse Fine-Tuning Authors: Cen-Jhih Li, Aditya Bhaskara
-
Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan
-
Large Language-Geometry Model: When LLM meets Equivariance Authors: Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao
-
Sparse Autoencoder Features for Classifications and Transferability Authors: Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman
-
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
-
CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation Authors: Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Zi Yang, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang
-
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws Authors: Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
-
The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval Authors: Ting-Rui Chiang, Dani Yogatama
-
The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training Authors: Matteo Saponati, Pascal Sager, Pau Vilimelis Aceituno, Thilo Stadelmann, Benjamin Grewe
-
CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation Authors: Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na
-
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis Authors: Ge Lei, Samuel J. Cooper
-
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis Authors: Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
-
Weighted quantization using MMD: From mean field to mean shift via gradient flows Authors: Ayoub Belhadji, Daniel Sharp, Youssef Marzouk
-
Neural Interpretable Reasoning Authors: Pietro Barbiero, Giuseppe Marra, Gabriele Ciravegna, David Debot, Francesco De Santis, Michelangelo Diligenti, Mateo Espinosa Zarlenga, Francesco Giannini
-
Atom of Thoughts for Markov LLM Test-Time Scaling Authors: Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
-
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache Authors: Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
-
Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size Authors: Naoki Takeshita, Masaaki Imaizumi
-
Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise Authors: Ilias Diakonikolas, Mingchen Ma, Lisheng Ren, Christos Tzamos
-
Teleportation With Null Space Gradient Projection for Optimization Acceleration Authors: Zihao Wu, Juncheng Dong, Ahmed Aloui, Vahid Tarokh
-
Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning Authors: Libo Wang
-
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding Authors: Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin
-
Exact Upper and Lower Bounds for the Output Distribution of Neural Networks with Random Inputs Authors: Andrey Kofnov, Daniel Kapla, Ezio Bartocci, Efstathia Bura
-
AdaSplash: Adaptive Sparse Flash Attention Authors: Nuno Gonçalves, Marcos Treviso, André F. T. Martins
-
One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu
-
Ansatz-free Hamiltonian learning with Heisenberg-limited scaling Authors: Hong-Ye Hu, Muzhou Ma, Weiyuan Gong, Qi Ye, Yu Tong, Steven T. Flammia, Susanne F. Yelin
-
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It Authors: Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
-
The geometry of BERT Authors: Matteo Bonino, Giorgia Ghione, Giansalvo Cirrincione
-
A Mathematics Framework of Artificial Shifted Population Risk and Its Further Understanding Related to Consistency Regularization Authors: Xiliang Yang, Shenyang Deng, Shicong Liu, Yuanchi Suo, Wing. W. Y NG, Jianjun Zhang
-
From Layers to States: A State Space Model Perspective to Deep Neural Network Layer Dynamics Authors: Qinshuo Liu, Weiqin Zhao, Wei Huang, Yanwen Fang, Lequan Yu, Guodong Li
-
LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model Merging Authors: Zehua Liu, Han Wu, Yuxuan Yao, Ruifeng She, Xiongwei Han, Tao Zhong, Mingxuan Yuan
-
Low-Rank Thinning Authors: Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, Lester Mackey
-
On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning Authors: Álvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, Pierre Vandergheynst
-
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training Authors: Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen
-
On the Query Complexity of Verifier-Assisted Language Generation Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
-
APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs Authors: Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
-
Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens Authors: Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso
-
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models Authors: Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du
-
Generalization of the Gibbs algorithm with high probability at low temperatures Authors: Andreas Maurer
-
Learning the Exact Time Integration Algorithm for Initial Value Problems by Randomized Neural Networks Authors: Suchuan Dong, Naxian Ni
-
Continuous Diffusion Model for Language Modeling Authors: Jaehyeong Jo, Sung Ju Hwang
-
MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models Authors: Zhen Zhang, Yifan Yang, Kai Zhen, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang
-
How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines Authors: Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
-
Meta-Statistical Learning: Supervised Learning of Statistical Inference Authors: Maxime Peyrard, Kyunghyun Cho
-
GRIFFIN: Effective Token Alignment for Faster Speculative Decoding Authors: Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh, Pan Zhou
-
Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning Authors: Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, Yin Wei
-
Diversified Sampling Improves Scaling LLM inference Authors: Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Haifeng Chen, Xiang Zhang, Wei Cheng
-
Uncertainty-Aware Search and Value Models: Mitigating Search Scaling Flaws in LLMs Authors: Fei Yu, Yingru Li, Benyou Wang
-
Towards Reasoning Ability of Small Language Models Authors: Gaurav Srivastava, Shuxiang Cao, Xuan Wang
-
DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning Authors: Huanxuan Liao, Shizhu He, Yupu Hao, Jun Zhao, Kang Liu
-
AdaGC: Improving Training Stability for Large Language Model Pretraining Authors: Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Tao Sun, Yanjun Ma, Dianhai Yu, Li Shen
-
Why is prompting hard? Understanding prompts on binary sequence predictors Authors: Li Kevin Wenliang, Anian Ruoss, Jordi Grau-Moya, Marcus Hutter, Tim Genewein
-
Superpose Singular Features for Model Merging Authors: Haiquan Qiu, You Wu, Quanming Yao
-
Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning Authors: Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
-
Logarithmic Width Suffices for Robust Memorization Authors: Amitsour Egosi, Gilad Yehudai, Ohad Shamir
-
Deep Incomplete Multi-view Learning via Cyclic Permutation of VAEs Authors: Xin Gao, Jian Pu
-
Towards Watermarking of Open-Source LLMs Authors: Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev
-
ReLearn: Unlearning via Learning for Large Language Models Authors: Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
-
Minimal Ranks, Maximum Confidence: Parameter-efficient Uncertainty Quantification for LoRA Authors: Patryk Marszałek, Klaudia Bałazy, Jacek Tabor, Tomasz Kuśmierczyk
-
Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models? Authors: Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke
-
The Relationship between No-Regret Learning and Online Conformal Prediction Authors: Ramya Ramalingam, Shayan Kiyani, Aaron Roth
-
Learning Identifiable Structures Helps Avoid Bias in DNN-based Supervised Causal Learning Authors: Jiaru Zhang, Rui Ding, Qiang Fu, Bojun Huang, Zizhen Deng, Yang Hua, Haibing Guan, Shi Han, Dongmei Zhang
-
K-Edit: Language Model Editing with Contextual Knowledge Awareness Authors: Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
-
Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation Authors: Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman
-
LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization Authors: Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, Robert Tibshirani
-
ReReLRP - Remembering and Recognizing Tasks with LRP Authors: Karolina Bogacka, Maximilian Höfler, Maria Ganzha, Wojciech Samek, Katarzyna Wasielewska-Michniewska
-
On the kernel learning problem Authors: Yang Li, Feng Ruan
-
Error Bound Analysis for the Regularized Loss of Deep Linear Neural Networks Authors: Po Chen, Rujun Jiang, Peng Wang
-
Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL Authors: Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu
-
MixMin: Finding Data Mixtures via Convex Minimization Authors: Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison
-
Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving Authors: Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
-
Neuron Platonic Intrinsic Representation From Dynamics Using Contrastive Learning Authors: Wei Wu, Can Liao, Zizhen Deng, Zhengrui Guo, Jinzhuo Wang
-
Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling Authors: Yanbiao Ma, Bowei Liu, Wei Dai, Jiayi Chen, Shuo Li
-
Cognitive Neural Architecture Search Reveals Hierarchical Entailment Authors: Lukas Kuhn, Sari Saba-Sadiya, Gemma Roig
-
LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search Authors: Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang
-
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection Authors: Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
-
Fast Proxies for LLM Robustness Evaluation Authors: Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann
-
Large Language Models and Mathematical Reasoning Failures Authors: Johan Boye, Birger Moell
-
A recurrent vision transformer shows signatures of primate visual attention Authors: Jonathan Morgan, Badr Albanna, James P. Herman
-
ADO: Automatic Data Optimization for Inputs in LLM Prompts Authors: Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang
-
An Empirical Analysis of Uncertainty in Large Language Model Evaluations Authors: Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang
-
MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training Authors: Hui Huang, Jiaheng Liu, Yancheng He, Shilong Li, Bing Xu, Conghui Zhu, Muyun Yang, Tiejun Zhao
1. Intuitive physics understanding emerges from self-supervised pretraining on natural videos
ArXiv ID: 2502.11831
Authors: Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
Abstract: We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.
Comment: Author match
2. In-Context Parametric Inference: Point or Distribution Estimators?
ArXiv ID: 2502.11617
Authors: Sarthak Mittal, Yoshua Bengio, Nikolay Malkin, Guillaume Lajoie
Abstract: Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.
Comment: Author match
3. Mixture of Tunable Experts - Behavior Modification of DeepSeek-R1 at Inference Time
ArXiv ID: 2502.11096
Authors: Robert Dahlke, Henrik Klagges, Dan Zecha, Benjamin Merkel, Sven Rohr, Fabian Klemm
Abstract: We present the Mixture-of-Tunable-Experts (MoTE), a method that extends the Mixture-of-Experts architecture of Large Language Models (LLMs). Without additional training, MoTE enables meaningful and focused behavior changes in LLMs on-the-fly during inference time. By analyzing the digital LLM brain of DeepSeek-R1 using a technique we dub 'functional Token Resonance Imaging' (fTRI) - inspired by fMRI and using prompts designed to elicit specific behavior (e.g., 'What happened {time}{place}?') - we empirically identify distinctive experts associated with behaviors like refusal responses. Using MoTE we are able to intervene and control such specific behavior. We switched off the top 10 most refusal-relevant experts (0.07% of R1's 14,848 routed experts), achieving a 52% refusal reduction on sensitive reference prompts without performance degradation on MT-Bench. Random expert deactivation resulted in smaller behavioral shifts with increased noise, whereas forced expert activation led to significantly higher refusal rates. Our approach shares similarities with sparse autoencoders (SAEs) in terms of explainability and steerability. Unlike SAEs, MoTE does not require large training efforts, as within MoEs with a vast number of experts, specialization already emerged naturally during pretraining. Our findings suggest that significant functional mechanisms in Mixture-of-Experts architectures can at least partially be localized in a small number of specific experts, rather than being distributed throughout the model's weights. Expert subgroups can be tuned to trigger significant behavior variations, providing insights into the inner workings of LLMs.
Comment: The paper focuses on Mixture-of-Experts (MoE) and provides insights into the behavior and control of specific experts in LLMs, aligning closely with the 'Model Architecture' and 'Representation Learning' criteria.
Relevance: 10 Novelty: 8
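The intervention described in the abstract above (switching off selected routed experts at inference time) can be pictured with a generic top-k router. This is a minimal sketch under stated assumptions, not the MoTE implementation: `route_top_k`, the `disabled` set, and the softmax over the selected logits are hypothetical choices for illustration and are not claimed to match DeepSeek-R1's actual gating function.

```python
import math

def route_top_k(router_logits, k, disabled=frozenset()):
    """Pick the top-k experts by router logit, never selecting a disabled expert.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    Assumption: weights are a softmax over the selected logits only.
    """
    # Drop disabled experts before ranking so they cannot be chosen.
    candidates = [(logit, idx) for idx, logit in enumerate(router_logits)
                  if idx not in disabled]
    if len(candidates) < k:
        raise ValueError("not enough enabled experts for top-k routing")
    top = sorted(candidates, reverse=True)[:k]
    # Numerically stable softmax over the selected logits.
    peak = max(logit for logit, _ in top)
    exps = [math.exp(logit - peak) for logit, _ in top]
    total = sum(exps)
    return [(idx, e / total) for (_, idx), e in zip(top, exps)]

logits = [2.0, 0.5, 1.0, 3.0]
baseline = route_top_k(logits, k=2)                   # experts 3 and 0
ablated = route_top_k(logits, k=2, disabled={3})      # experts 0 and 2
```

With expert 3 disabled, the router falls back to the next-best expert, which is the kind of redirection the abstract attributes to deactivating refusal-relevant experts. The quoted fraction also checks out arithmetically: 10 of 14,848 experts is about 0.067%, which rounds to 0.07%.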
4. Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
ArXiv ID: 2502.11880
Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
Abstract: The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper , offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
Comment: Introduces groundbreaking system Bitnet.cpp enabling efficient inference for ternary LLMs, directly relevant to model compression and efficiency topics.
Relevance: 9 Novelty: 9
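To make the storage idea behind ternary weights concrete, here is a small sketch that quantizes weights to {-1, 0, 1} with one shared scale and packs four values per byte at 2 bits each. Everything here is an assumption for illustration: the mean-absolute-value scale, the function names, and the bit layout are hypothetical, and this does not reproduce Bitnet.cpp's I2_S format, its TL lookup tables, or its mpGEMM kernels.

```python
def ternarize(weights, eps=1e-8):
    """Map floats to {-1, 0, 1} plus one scale (assumed: mean absolute value)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

def pack_2bit(quantized):
    """Pack ternary values into bytes, four values per byte, 2 bits each."""
    packed = bytearray()
    for start in range(0, len(quantized), 4):
        byte = 0
        for slot, value in enumerate(quantized[start:start + 4]):
            byte |= (value + 1) << (2 * slot)  # shift {-1,0,1} to {0,1,2}
        packed.append(byte)
    return bytes(packed)

def unpack_2bit(packed, count):
    """Inverse of pack_2bit; `count` trims the padding in the last byte."""
    values = []
    for byte in packed:
        for slot in range(4):
            values.append(((byte >> (2 * slot)) & 0b11) - 1)
    return values[:count]

q, scale = ternarize([0.9, -0.05, -1.2, 0.4, 0.0])
restored = unpack_2bit(pack_2bit(q), len(q))
```

Five weights fit in two bytes here, and unpacking returns exactly the ternary values that were packed, so the packing step itself loses nothing beyond the quantization.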
5. System Message Generation for User Preferences using Open-Source Models
ArXiv ID: 2502.11330
Authors: Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong
Abstract: System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, and specify various output formats and communication styles. Despite such versatility, publicly available data often lack system messages and are subject to strict license constraints in the industry field. Manual labeling of publicly available data with system messages that align with user instructions demands significant resources. In view of such challenges, our work introduces SysGen, a pipeline for generating system messages with better aligned assistant responses from the supervised fine-tuning dataset without system messages. Training on SysGen data yields substantial improvements in the alignment of model responses with system messages and user instructions, as shown across various open-source models on the Multifacet benchmark, while maintaining minimal impact on other unseen benchmarks such as Open LLM Leaderboard 2. Our qualitative analysis highlights the importance of diverse system messages to ensure better adaptability across different contexts.
Comment: The paper introduces SysGen, a pipeline that generates system messages for supervised fine-tuning data that lack them, aiming to improve how well model responses align with system messages and user instructions.
Relevance: 9 Novelty: 9
6. A Power Transform
ArXiv ID: 2502.10647
Authors: Jonathan T. Barron
Abstract: Power transforms, such as the Box-Cox transform and Tukey's ladder of powers, are a fundamental tool in mathematics and statistics. These transforms are primarily used for normalizing and standardizing datasets, effectively by raising values to a power. In this work I present a novel power transform, and I show that it serves as a unifying framework for a wide family of loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions.
Comment: The novel power transform framework connects across loss functions, activations, and kernels, offering a significant theoretical contribution to foundational methods like representation learning.
Relevance: 9 Novelty: 9
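For readers unfamiliar with the Box-Cox transform that the abstract above cites as background, a minimal sketch follows. It shows only the classical one-parameter Box-Cox form, (x^λ - 1)/λ with the logarithm as the λ = 0 case; the paper's own new transform is not reproduced here, and the function name `box_cox` is chosen for illustration.

```python
import math

def box_cox(x, lam):
    """Classical Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox is defined for positive inputs only")
    if lam == 0:
        return math.log(x)          # limiting case as lam approaches 0
    return (x ** lam - 1.0) / lam

square_root_like = box_cox(4.0, 0.5)   # (sqrt(4) - 1) / 0.5
log_like = box_cox(math.e, 0.0)        # natural log of e
```

The division by λ is what makes the family continuous in λ: as λ shrinks toward zero, the power form approaches the logarithm.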
7. Does Editing Provide Evidence for Localization?
ArXiv ID: 2502.11447
Authors: Zihao Wang, Victor Veitch
Abstract: A basic aspiration for interpretability research in large language models is to "localize" semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To assess localization, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.
Comment: The paper critically examines interpretability in LLMs by analyzing the evidence provided by localized edits, which aligns with foundational research in LLM behavior and interpretability.
Relevance: 9 Novelty: 8
8. Controlling Neural Collapse Enhances Out-of-Distribution Detection and Transfer Learning
ArXiv ID: 2502.10691
Authors: Md Yousuf Harun, Jhair Gallardo, Christopher Kanan
Abstract: Out-of-distribution (OOD) detection and OOD generalization are widely studied in Deep Neural Networks (DNNs), yet their relationship remains poorly understood. We empirically show that the degree of Neural Collapse (NC) in a network layer is inversely related with these objectives: stronger NC improves OOD detection but degrades generalization, while weaker NC enhances generalization at the cost of detection. This trade-off suggests that a single feature space cannot simultaneously achieve both tasks. To address this, we develop a theoretical framework linking NC to OOD detection and generalization. We show that entropy regularization mitigates NC to improve generalization, while a fixed Simplex Equiangular Tight Frame (ETF) projector enforces NC for better detection. Based on these insights, we propose a method to control NC at different DNN layers. In experiments, our method excels at both tasks across OOD datasets and DNN architectures.
Comment: The paper explores Neural Collapse and its impact on OOD detection and generalization, providing theoretical insights into representation learning and training dynamics.
Relevance: 9 Novelty: 8
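The Simplex Equiangular Tight Frame (ETF) named in the abstract above has a standard closed form, sketched here for K classes embedded in K dimensions: sqrt(K/(K-1)) times (I - (1/K) 1 1^T). This illustrates only the geometry (unit-norm class vectors with pairwise cosine -1/(K-1)); how the paper uses a fixed ETF projector inside a network is not shown, and the helper names are hypothetical.

```python
import math

def simplex_etf(num_classes):
    """Rows are the K class vectors of a simplex ETF in R^K."""
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

etf = simplex_etf(4)
self_similarity = dot(etf[0], etf[0])    # unit norm
cross_similarity = dot(etf[0], etf[1])   # equals -1 / (K - 1)
```

Every pair of distinct class vectors sits at the same, maximally separated angle, which is the configuration that class means and classifier weights approach under Neural Collapse.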
9. A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
ArXiv ID: 2312.02073
Authors: Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre Kıcıman, Hamid Palangi, Barun Patra, Robert West
Abstract: Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context. Yet the mechanisms underlying this contextual grounding remain unknown, especially in situations where contextual information contradicts factual knowledge stored in the parameters, which LLMs also excel at recalling. Favoring the contextual information is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify outdated or noisy stored knowledge. We present a novel method to study grounding abilities using Fakepedia, a novel dataset of counterfactual texts constructed to clash with a model's internal parametric knowledge. In this study, we introduce Fakepedia, a counterfactual dataset designed to evaluate grounding abilities when the internal parametric knowledge clashes with the contextual information. We benchmark various LLMs with Fakepedia and conduct a causal mediation analysis of LLM components when answering Fakepedia queries, based on our Masked Grouped Causal Tracing (MGCT) method. Through this analysis, we identify distinct computational patterns between grounded and ungrounded responses. We finally demonstrate that distinguishing grounded from ungrounded responses is achievable through computational analysis alone. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.
Comment: The paper investigates grounding mechanisms in LLMs using a novel dataset, which aligns with foundational research in LLM behavior and interpretability.
Relevance: 9 Novelty: 8
10. An Efficient Row-Based Sparse Fine-Tuning
ArXiv ID: 2502.11439
Authors: Cen-Jhih Li, Aditya Bhaskara
Abstract: Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SFT framework, based on ideas from neural network pruning. At a high level, we first identify "important" neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Using experiments on common language tasks, we demonstrate that our method significantly improves the memory efficiency of SFT without increasing training time complexity and implementation complexity, while achieving accuracy comparable to state-of-the-art methods such as LoRA and its variants.
Comment: The paper proposes a sparse fine-tuning framework based on pruning, which is highly relevant to model compression and efficiency research.
Relevance: 9 Novelty: 8
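The two steps in the abstract above (score neurons, then fine-tune only the weights tied to the selected ones) can be sketched on a plain weight matrix. This is an illustration under assumptions, not the paper's method: the squared row norm stands in for the pruning-based importance metric, and `select_rows` and `sparse_row_update` are hypothetical names.

```python
def select_rows(weight, k):
    """Return indices of the k rows with the largest importance score.

    Assumption: importance is the squared L2 norm of the row, used here
    as a stand-in for a structured-pruning importance metric.
    """
    scores = [sum(w * w for w in row) for row in weight]
    ranked = sorted(range(len(weight)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_row_update(weight, grad, rows, lr):
    """Apply one SGD step to the selected rows only; other rows are copied."""
    chosen = set(rows)
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in chosen else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weight, grad))
    ]

weight = [[1.0, 1.0], [0.1, 0.1], [2.0, 0.0]]
grad = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
rows = select_rows(weight, k=2)                      # rows 0 and 2
updated = sparse_row_update(weight, grad, rows, lr=0.5)
```

Only the selected rows need gradients and optimizer state, which is where a memory saving of the kind the abstract reports would come from.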
11. Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity
ArXiv ID: 2502.11147
Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving reasoning tasks often requires long decoding chains (of thoughts), which incur $O(N)$ time and memory consumption, where $N$ is the chain length. To mitigate $O(N)$ time and memory consumption, existing sparsity-based algorithms propose retaining only the most critical tokens' intermediate data (i.e., key-value cache) and discarding the rest. However, these existing algorithms struggle with the "impossible trinity" of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy with $O(L)$ time but $O(N)$ memory ($L$ is the cache budget, $L \ll N$). To address this issue, in this paper, we identify a new attention pattern during the decode stage of reasoning tasks, where milestone tokens (analogous to lemmas in mathematical proofs) emerge, are utilized, and then become unimportant afterward. Based on this pattern, we propose a new algorithm named RaaS that identifies and retains milestone tokens only until they are no longer needed, achieving high accuracy with $O(L)$ time and $O(L)$ memory complexity.
Comment: The paper proposes a reasoning-aware attention sparsity method for efficient long-decoding inference, which is highly relevant to foundational research in LLM efficiency and sparsity.
Relevance: 9 Novelty: 8
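One plausible reading of "retain milestone tokens only until they are no longer needed" is a budgeted cache that evicts the token whose last meaningful use is oldest. The sketch below illustrates that reading only; it is not the RaaS algorithm, and the function name, the attention threshold, and the toy attention trace are all assumptions.

```python
def update_cache(cache, new_token, attn_scores, step, budget, threshold=0.1):
    """Advance a budgeted KV-cache index by one decoding step.

    cache maps token id -> last step at which the token received attention
    of at least `threshold`. All names here are hypothetical.
    """
    # Tokens that are still being attended to get a fresh timestamp.
    for token, score in attn_scores.items():
        if token in cache and score >= threshold:
            cache[token] = step
    cache[new_token] = step
    # Evict the stalest entries until the cache fits the budget L.
    while len(cache) > budget:
        stalest = min(cache, key=lambda t: (cache[t], t))
        del cache[stalest]
    return cache

# Toy trace: attention weights seen at each decoding step (made-up numbers).
trace = [
    {},                           # step 0: nothing cached yet
    {0: 0.9},                     # step 1: token 0 is in use (a "milestone")
    {0: 0.8, 1: 0.05},            # step 2
    {0: 0.02, 1: 0.01, 2: 0.9},   # step 3: token 0 is no longer needed
    {0: 0.0, 2: 0.5, 3: 0.5},     # step 4
]
cache = {}
for step, scores in enumerate(trace):
    update_cache(cache, new_token=step, attn_scores=scores, step=step, budget=3)
```

The cache never holds more than `budget` entries, so memory stays $O(L)$ regardless of the chain length $N$.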
12. Large Language-Geometry Model: When LLM meets Equivariance
ArXiv ID: 2502.11149
Authors: Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao
Abstract: Accurately predicting 3D structures and dynamics of physical systems is crucial in scientific applications. Existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, but they often fall short in leveraging broader external information. While direct application of Large Language Models (LLMs) can incorporate external knowledge, they lack the capability for spatial reasoning with guaranteed equivariance. In this paper, we propose EquiLLM, a novel framework for representing 3D physical systems that seamlessly integrates $\mathrm{E}(3)$-equivariance with LLM capabilities. Specifically, EquiLLM comprises four key components: geometry-aware prompting, an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the LLM guided by the instructive prompt serves as a sophisticated invariant feature processor, while 3D directional information is exclusively handled by the equivariant encoder and adaptor modules. Experimental results demonstrate that EquiLLM delivers significant improvements over previous methods across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.
Comment: The paper proposes a novel framework integrating E(3)-equivariance with LLM capabilities for handling 3D physical systems. It introduces architectural innovations aligning with foundational AI for Science.
Relevance: 9 Novelty: 8
13. Sparse Autoencoder Features for Classifications and Transferability
ArXiv ID: 2502.11367
Authors: Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman
Abstract: Sparse Autoencoders (SAEs) offer potential for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.
Comment: Sparse autoencoders are explored for feature learning, which relates closely to 'Representation Learning' and 'Model Compression', particularly given the focus on sparsity and transferable features.
Relevance: 9 Novelty: 8
14. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
ArXiv ID: 2502.11089
Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
Comment: The paper discusses a hardware-aligned sparse attention mechanism, relevant to 'Model Compression' due to its sparsity and efficiency focus.
Relevance: 9 Novelty: 8
15. CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation
ArXiv ID: 2502.10940
Authors: Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Zi Yang, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang
Abstract: Large language models (LLMs) are revolutionizing many science and engineering fields. However, their huge model sizes impose extremely demanding computational requirements in the pre-training stage. Although low-rank factorizations can reduce model parameters, their direct application in LLM pre-training often leads to non-negligible performance loss. To address this fundamental challenge, we introduce CoLA and its memory-efficient implementation, CoLA-M. We leverage the low-rank structure observed widely in model activations, enforcing non-linear transformations between factorized weight matrices to reduce model size, boost model capacity and training efficiency. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $2\times$ and improves training throughput by $1.86\times$ while maintaining full-rank level performance. CoLA-M further squeezes memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency. The LLMs produced are also $2\times$ smaller, enabling faster inference with lower memory cost on resource-constrained platforms.
Comment: Introduces a low-rank activation mechanism to pre-train LLMs more efficiently, aligning with the model compression and foundational enhancements for training efficiency.
Relevance: 9 Novelty: 8
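The phrase "non-linear transformations between factorized weight matrices" in the CoLA abstract can be pictured as replacing $y = Wx$ with $y = B\,\sigma(Ax)$, where $A$ is $r \times d_{in}$ and $B$ is $d_{out} \times r$. The sketch below assumes that form; the ReLU activation, the function names, and the layer sizes are illustrative assumptions and not the paper's configuration.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def low_rank_activation_layer(A, B, x):
    """y = B * relu(A * x); ReLU is an assumed choice of non-linearity."""
    return matvec(B, relu(matvec(A, x)))

def param_counts(d_in, d_out, rank):
    """(full-rank parameters, factorized parameters) for one linear layer."""
    return d_in * d_out, rank * (d_in + d_out)

A = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]    # rank r = 2, d_in = 3
B = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]   # d_out = 3
y = low_rank_activation_layer(A, B, [1.0, 2.0, 3.0])

# Hypothetical sizes, chosen only to show the arithmetic.
full, factorized = param_counts(d_in=4096, d_out=4096, rank=1024)
```

With these hypothetical sizes the factorized layer stores $r(d_{in} + d_{out})$ parameters instead of $d_{in} d_{out}$, which is half as many here.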
16. LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
ArXiv ID: 2502.12120
Authors: Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
Abstract: Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
Comment: Provides theoretical insights into loss-to-loss scaling laws for LLMs, deeply relevant for foundational research into training dynamics and scalability.
Relevance: 9 Novelty: 8
17. The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval
ArXiv ID: 2502.11276
Authors: Ting-Rui Chiang, Dani Yogatama
Abstract: The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLM). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. For long context modeling, the range of positions may vary a lot, and thus RoPE rotates some dimensions by a great range of angles. We hypothesize that the wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses on three LLMs also indicate that these dimensions do not help LLMs do long-context question answering.
Comment: The paper investigates the Rotary Position Embedding (RoPE) and its inefficiencies in long-distance retrieval, which aligns with foundational research on LLM behavior and interpretability.
Relevance: 9 Novelty: 8
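For readers who want the rotation described in the abstract above made concrete, here is a minimal sketch of standard RoPE on a single vector. It uses the common base of 10000 and an interleaved pair layout; some implementations pair dimension $i$ with $i + d/2$ instead, so the layout is an assumption.

```python
import math

def rope_angles(position, head_dim, base=10000.0):
    """Rotation angle of each 2-D pair: position * base^(-2i / head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by their angle."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

# At a far-away position the first pair has swept through a much larger
# angle (in radians) than the last pair.
angles_far = rope_angles(position=4096, head_dim=4)
```

Low-index pairs rotate fastest, so over a long context they pass through a wide range of angles; these are the dimensions the abstract's hypothesis concerns.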
18. The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training
ArXiv ID: 2502.10927
Authors: Matteo Saponati, Pascal Sager, Pau Vilimelis Aceituno, Thilo Stadelmann, Benjamin Grewe
Abstract: Self-attention is essential to Transformer architectures, yet how information is embedded in the self-attention matrices and how different objective functions impact this process remains unclear. We present a mathematical framework to analyze self-attention matrices by deriving the structures governing their weight updates. Using this framework, we demonstrate that bidirectional training induces symmetry in the weight matrices, while autoregressive training results in directionality and column dominance. Our theoretical findings are validated across multiple Transformer models - including ModernBERT, GPT, LLaMA3, and Mistral - and input modalities like text, vision, and audio. Finally, we apply these insights by showing that symmetric initialization improves the performance of encoder-only models on language tasks. This mathematical analysis offers a novel theoretical perspective on how information is embedded through self-attention, thereby improving the interpretability of Transformer models.
Comment: The paper provides a mathematical framework to analyze self-attention matrices, which aligns with foundational research on Transformer architectures and their training dynamics.
Relevance: 9 Novelty: 8
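A simple way to quantify how symmetric a square matrix is splits it into its symmetric and skew-symmetric parts and compares their squared Frobenius norms. The score below is a generic illustration and may differ from the measure used in the paper; the function name is hypothetical.

```python
def symmetry_score(M):
    """Return a value in [-1, 1]: 1 if M is symmetric, -1 if skew-symmetric.

    Uses M = S + K with S = (M + M^T) / 2 and K = (M - M^T) / 2, whose
    squared Frobenius norms add up to that of M. Undefined for the zero matrix.
    """
    n = len(M)
    sym = sum(((M[i][j] + M[j][i]) / 2) ** 2 for i in range(n) for j in range(n))
    skew = sum(((M[i][j] - M[j][i]) / 2) ** 2 for i in range(n) for j in range(n))
    return (sym - skew) / (sym + skew)

symmetric = symmetry_score([[1.0, 2.0], [2.0, 3.0]])
skew_symmetric = symmetry_score([[0.0, 1.0], [-1.0, 0.0]])
triangular = symmetry_score([[0.0, 1.0], [0.0, 0.0]])
```

A strictly triangular matrix scores 0 because its symmetric and skew-symmetric parts carry equal weight.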
19. CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation
ArXiv ID: 2502.11101
Authors: Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na
Abstract: Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and sliding window mechanisms, partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs. In this paper, we introduce CacheFocus, a method that enhances length normalization and reduces inference latency without any further training. Our approach leverages query-independent, offline caching to efficiently reuse a Context KV Cache Store. We address the problem of amplified abnormal token distributions by re-positioning cached keys and introducing Layer-Adaptive Cache Pruning to discard low-relevance caches during pre-filling. Additionally, our Adaptive Positional Allocation Strategy dynamically reassigns cache positions to maximize the use of the available positional encoding range. Experiments on the Natural Questions and TriviaQA datasets demonstrate that CacheFocus outperforms alternative methods even when inputs exceed the 4K limit of the LLaMA-2 model, emphasizing its practical effectiveness for long-context LLMs. Moreover, even with the large maximum input length of Qwen2, CacheFocus maintains consistent performance as the number of documents increases, effectively managing long-text generation without degradation.
Comment: The paper introduces a novel cache management approach, addressing foundational challenges in model efficiency for long-context LLMs.
Relevance: 9 Novelty: 8
20. The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis
ArXiv ID: 2502.10871
Authors: Ge Lei, Samuel J. Cooper
Abstract: This study investigates how large language models (LLMs) represent and recall multi-associated attributes across transformer layers. We show that intermediate layers encode factual knowledge by superimposing related attributes in overlapping spaces, along with effective recall even when attributes are not explicitly prompted. In contrast, later layers refine linguistic patterns and progressively separate attribute representations, optimizing task-specific outputs while appropriately narrowing attribute recall. We identify diverse encoding patterns including, for the first time, the observation of 3D spiral structures when exploring information related to the periodic table of elements. Our findings reveal a dynamic transition in attribute representations across layers, contributing to mechanistic interpretability and providing insights for understanding how LLMs handle complex, interrelated knowledge.
Comment: Analyzes multi-layer and geometric encoding patterns in LLMs, offering insights into representation dynamics, which strongly aligns with foundational research criteria.
Relevance: 9 Novelty: 8
21. Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
ArXiv ID: 2502.11812
Authors: Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
Abstract: Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, which are closer to the practical setting. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, in contrast to previous work \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} showing that circuits only add some additional components after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method, which assigns ranks to layers based on edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA algorithm achieves an average performance improvement of 2.46% over standard LoRA with similar parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, providing new insights into the design of such tasks and deepening the understanding of circuit dynamics and fine-tuning mechanisms.
Comment: Provides a mechanistic interpretability analysis of fine-tuning in LLMs and proposes novel circuit-aware LoRA adaptations for performance gains.
Relevance: 9 Novelty: 8
22. Weighted quantization using MMD: From mean field to mean shift via gradient flows
ArXiv ID: 2502.10600
Authors: Ayoub Belhadji, Daniel Sharp, Youssef Marzouk
Abstract: Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a finite weighted mixture of Dirac measures that best approximates the target distribution. While much existing work relies on the Wasserstein distance to quantify approximation errors, maximum mean discrepancy (MMD) has received comparatively less attention, especially when allowing for variable particle weights. We study the quantization problem from the perspective of minimizing MMD via gradient flow in the Wasserstein-Fisher-Rao (WFR) geometry. This gradient flow yields an ODE system from which we further derive a fixed-point algorithm called mean shift interacting particles (MSIP). We show that MSIP extends the (non-interacting) mean shift algorithm, widely used for identifying modes in kernel density estimates. Moreover, we show that MSIP can be interpreted as preconditioned gradient descent, and that it acts as a relaxation of Lloyd's algorithm for clustering. Our numerical experiments demonstrate that MSIP and the WFR ODEs outperform other algorithms for quantization of multi-modal and high-dimensional targets.
Comment: The paper introduces a novel quantization method using MMD and gradient flows, which aligns with the model compression criterion. The proposed MSIP algorithm and its theoretical grounding add significant novelty.
Relevance: 9 Novelty: 8
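The abstract above presents MSIP as an extension of the classical, non-interacting mean shift algorithm. For orientation, the sketch below shows that classical baseline only, in one dimension with a Gaussian kernel; it is not MSIP, and the data and bandwidth are made-up values.

```python
import math

def mean_shift_step(x, data, bandwidth):
    """Move x to the kernel-weighted mean of the data around it."""
    weights = [math.exp(-((x - d) ** 2) / (2 * bandwidth ** 2)) for d in data]
    return sum(w * d for w, d in zip(weights, data)) / sum(weights)

def mean_shift(x, data, bandwidth, iters=100, tol=1e-9):
    """Iterate the fixed-point map until x stops moving (a density mode)."""
    for _ in range(iters):
        nxt = mean_shift_step(x, data, bandwidth)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x

# Two well-separated clusters; each start drifts to the nearer mode.
data = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
left_mode = mean_shift(-1.0, data, bandwidth=0.5)
right_mode = mean_shift(1.0, data, bandwidth=0.5)
```

Each particle here moves independently of the others, which is the "non-interacting" property the abstract contrasts MSIP with.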
23. Neural Interpretable Reasoning
ArXiv ID: 2502.11639
Authors: Pietro Barbiero, Giuseppe Marra, Gabriele Ciravegna, David Debot, Francesco De Santis, Michelangelo Diligenti, Mateo Espinosa Zarlenga, Francesco Giannini
Abstract: We formalize a novel modeling framework for achieving interpretability in deep learning, anchored in the principle of inference equivariance. While the direct verification of interpretability scales exponentially with the number of variables of the system, we show that this complexity can be mitigated by treating interpretability as a Markovian property and employing neural re-parametrization techniques. Building on these insights, we propose a new modeling paradigm -- neural generation and interpretable execution -- that enables scalable verification of equivariance. This paradigm provides a general approach for designing Neural Interpretable Reasoners that are not only expressive but also transparent.
Comment: The paper introduces a novel framework for interpretable reasoning in neural networks, which aligns with representation learning and interpretability. The Markovian property and neural re-parametrization add theoretical depth.
Relevance: 9 Novelty: 8
24. Atom of Thoughts for Markov LLM Test-Time Scaling
ArXiv ID: 2502.12018
Authors: Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
Abstract: Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning progress is often achieved by solving a sequence of independent subquestions, each being self-contained and verifiable. These subquestions are essentially atomic questions, relying primarily on their current state rather than accumulated history, similar to the memoryless transitions in a Markov process. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition in the reasoning process consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a new atomic question state. This iterative decomposition-contraction process continues until reaching directly solvable atomic questions, naturally realizing Markov transitions between question states. Furthermore, these atomic questions can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code will be available at https://github.com/qixucen/atom.
Comment: The paper introduces Atom of Thoughts (AoT) for test-time scaling in LLMs, which aligns with theoretical insights into LLM behavior. The Markovian reasoning framework adds methodological novelty.
Relevance: 9 Novelty: 8
25. QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
ArXiv ID: 2502.10424
Authors: Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Abstract: Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups due to inefficient KV cache optimization strategies and result in low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates (>90%) and provides consistent end-to-end speedups of up to $\sim 2.5\times$, outperforming other self-speculative decoding methods that use sparse KV cache for long-context LLM inference. QuantSpec also reduces the memory requirements by $\sim 1.3\times$ compared to these alternatives.
Comment: This paper introduces a novel framework for efficient KV cache optimization in LLMs, which is relevant to 'Model Compression'.
Relevance: 9 Novelty: 8
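As background for the "4-bit quantized KV cache" in the abstract above, here is a minimal sketch of plain asymmetric uniform 4-bit quantization of one vector. It does not reproduce QuantSpec's hierarchical scheme; the function names and sample values are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to integer codes in 0..15 with a per-vector scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0   # avoid a zero scale for constant vectors
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]

keys = [-1.0, -0.25, 0.0, 0.4, 2.0]
codes, scale, lo = quantize_4bit(keys)
restored = dequantize_4bit(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(keys, restored))
```

The reconstruction error of each entry is bounded by half a quantization step, `scale / 2`.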
26. Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size
ArXiv ID: 2502.11467
Authors: Naoki Takeshita, Masaaki Imaizumi
Abstract: Transformers are a type of neural network that have demonstrated remarkable performance across various domains, particularly in natural language processing tasks. Motivated by this success, research on the theoretical understanding of transformers has garnered significant attention. A notable example is the mathematical analysis of their approximation power, which validates the empirical expressive capability of transformers. In this study, we investigate the ability of transformers to approximate column-symmetric polynomials, an extension of symmetric polynomials that take matrices as input. Consequently, we establish an explicit relationship between the size of the transformer network and its approximation capability, leveraging the parameter efficiency of transformers and their compatibility with symmetry by focusing on the algebraic properties of symmetric polynomials.
Comment: The study explores approximation capabilities of transformers concerning column-symmetric polynomials, advancing theoretical understanding of model expressivity.
Relevance: 9 Novelty: 8
27. Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise
ArXiv ID: 2502.11413
Authors: Ilias Diakonikolas, Mingchen Ma, Lisheng Ren, Christos Tzamos
Abstract: We study the task of Multiclass Linear Classification (MLC) in the distribution-free PAC model with Random Classification Noise (RCN). Specifically, the learner is given a set of labeled examples $(x, y)$, where $x$ is drawn from an unknown distribution on $\mathbb{R}^d$ and the labels are generated by a multiclass linear classifier corrupted with RCN. That is, the label $y$ is flipped from $i$ to $j$ with probability $H_{ij}$ according to a known noise matrix $H$ with non-negative separation $\sigma := \min_{i \neq j} (H_{ii} - H_{ij})$. The goal is to compute a hypothesis with small 0-1 error. For the special case of two labels, prior work has given polynomial-time algorithms achieving the optimal error. Surprisingly, little is known about the complexity of this task even for three labels. As our main contribution, we show that the complexity of MLC with RCN becomes drastically different in the presence of three or more labels. Specifically, we prove super-polynomial Statistical Query (SQ) lower bounds for this problem. In more detail, even for three labels and constant separation, we give a super-polynomial lower bound on the complexity of any SQ algorithm achieving optimal error. For a larger number of labels and smaller separation, we show a super-polynomial SQ lower bound even for the weaker goal of achieving any constant factor approximation to the optimal loss or even beating the trivial hypothesis.
Comment: The paper provides theoretical insights into the complexity of multiclass linear classification with random noise, which aligns with 'Emerging Trends' through its foundational focus.
Relevance: 9 Novelty: 8
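The noise model in the abstract above is easy to simulate. The sketch below computes the separation $\sigma$ of a noise matrix and draws corrupted labels from it; the specific matrix $H$ and the function names are made up for illustration.

```python
import random

def separation(H):
    """sigma = min over i != j of H[i][i] - H[i][j]."""
    n = len(H)
    return min(H[i][i] - H[i][j] for i in range(n) for j in range(n) if i != j)

def corrupt_label(label, H, rng):
    """Replace true label i by j with probability H[i][j]."""
    return rng.choices(range(len(H)), weights=H[label], k=1)[0]

# Hypothetical 3-class noise matrix; each row sums to 1.
H = [
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.25, 0.25, 0.50],
]
sigma = separation(H)

rng = random.Random(0)
noisy = [corrupt_label(0, H, rng) for _ in range(20000)]
kept_fraction = noisy.count(0) / len(noisy)   # should be close to H[0][0]
```

A positive separation means that each true label remains the single most likely observed label after corruption.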
28. Teleportation With Null Space Gradient Projection for Optimization Acceleration
ArXiv ID: 2502.11362
Authors: Zihao Wu, Juncheng Dong, Ahmed Aloui, Vahid Tarokh
Abstract: Optimization techniques have become increasingly critical due to the ever-growing model complexity and data scale. In particular, teleportation has emerged as a promising approach, which accelerates convergence of gradient descent-based methods by navigating within the loss invariant level set to identify parameters with advantageous geometric properties. Existing teleportation algorithms have primarily demonstrated their effectiveness in optimizing Multi-Layer Perceptrons (MLPs), but their extension to more advanced architectures, such as Convolutional Neural Networks (CNNs) and Transformers, remains challenging. Moreover, they often impose significant computational demands, limiting their applicability to complex architectures. To this end, we introduce an algorithm that projects the gradient of the teleportation objective function onto the input null space, effectively preserving the teleportation within the loss invariant level set and reducing computational cost. Our approach is readily generalizable from MLPs to CNNs, transformers, and potentially other advanced architectures. We validate the effectiveness of our algorithm across various benchmark datasets and optimizers, demonstrating its broad applicability.
Comment: The paper introduces a novel optimization technique for advanced architectures like Transformers, aligning with 'Model Architecture' and 'Emerging Trends'.
Relevance: 9 Novelty: 8
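The core linear-algebra step named in the abstract above, projecting a gradient onto the null space of the layer inputs, can be sketched for a single linear layer. This is a simplified illustration under assumptions (Gram-Schmidt on a handful of inputs, hypothetical function names), not the authors' algorithm.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt basis for the span of the input vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(grad_rows, inputs):
    """Remove from each gradient row its component in the span of the inputs."""
    basis = orthonormal_basis(inputs)
    projected = []
    for g in grad_rows:
        p = list(g)
        for b in basis:
            c = dot(p, b)
            p = [pi - c * bi for pi, bi in zip(p, b)]
        projected.append(p)
    return projected

inputs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
grad = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
safe_grad = project_to_null_space(grad, inputs)
```

Because every row of `safe_grad` is orthogonal to every input, an update `W - lr * safe_grad` leaves `W @ x` unchanged on those inputs, so the layer output, and hence the loss on them, is preserved.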
29. Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning
ArXiv ID: 2502.10428
Authors: Libo Wang
Abstract: To reduce the cost and computing-resource consumption caused by computational redundancy and delayed reward assignment in long chain-of-thought (CoT) reasoning, this research proposes dynamic chain-of-thought (D-CoT), which adapts reasoning time and the number of reasoning steps. The researcher ran simulation experiments on the integration of D-CoT using Python 3.13 IDLE combined with a Python simulator based on GPTs. DeepSeek R1 served as the control group for comparing the performance of the D-CoT simulator on MIT OpenCourseWare's linear algebra exam questions. Experimental results show that D-CoT outperforms DeepSeek R1, which relies on long CoT, on three indicators: reasoning time, CoT length (reasoning steps), and token count, achieving a significant reduction in computing resource consumption. In addition, this research has potential value for deep reasoning optimization and can serve as a reference for future dynamic deep reasoning frameworks.
Comment: The paper proposes a dynamic chain-of-thought reasoning framework, which aligns with foundational research in adaptive reasoning and efficiency in LLMs.
Relevance: 9 Novelty: 8
30. Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
ArXiv ID: 2502.11517
Authors: Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin
Abstract: Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against sequential decoding baseline.
Comment: The paper proposes a learning-based system for parallel decoding in LLMs, which aligns with foundational research in efficiency and decoding innovations.
Relevance: 9 Novelty: 8
31. Exact Upper and Lower Bounds for the Output Distribution of Neural Networks with Random Inputs
ArXiv ID: 2502.11672
Authors: Andrey Kofnov, Daniel Kapla, Ezio Bartocci, Efstathia Bura
Abstract: We derive exact upper and lower bounds for the cumulative distribution function (cdf) of the output of a neural network over its entire support subject to noisy (stochastic) inputs. The upper and lower bounds converge to the true cdf over its domain as the resolution increases. Our method applies to any feedforward NN using continuous monotonic piecewise differentiable activation functions (e.g., ReLU, tanh and softmax) and convolutional NNs, which were beyond the scope of competing approaches. The novelty and an instrumental tool of our approach is to bound general NNs with ReLU NNs. The ReLU NN based bounds are then used to derive upper and lower bounds of the cdf of the NN output. Experiments demonstrate that our method delivers guaranteed bounds of the predictive output distribution over its support, thus providing exact error guarantees, in contrast to competing approaches.
Comment: The paper derives exact bounds for the output distribution of neural networks with stochastic inputs, which is foundational in terms of theoretical contributions to neural network behavior.
Relevance: 9 Novelty: 8
32. AdaSplash: Adaptive Sparse Flash Attention
ArXiv ID: 2502.12082
Authors: Nuno Gonçalves, Marcos Treviso, André F. T. Martins
Abstract: The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches, and in some cases surpasses, the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.
Comment: AdaSplash improves sparse attention mechanisms, directly impacting Transformer efficiency and aligning well with topics like sparsity and low-rank adaptations.
Relevance: 9 Novelty: 8
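As a rough illustration of the $\alpha$-entmax transformation named in the AdaSplash abstract above, the sketch below computes it with plain bisection on the threshold. It is not the paper's hybrid Halley-bisection algorithm and has nothing to do with its Triton kernels; the function name `entmax_bisect` and its parameters are hypothetical, and the code assumes alpha > 1.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """alpha-entmax of a list of scores via plain bisection (assumes alpha > 1).

    Finds tau such that sum_i max((alpha - 1) * s_i - tau, 0) ** (1 / (alpha - 1)) = 1.
    Illustrative only: this is the textbook bisection baseline, not AdaSplash.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch assumes alpha > 1")
    scaled = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)
    top = max(scaled)
    # At tau = lo the largest entry alone already has mass 1 (total >= 1);
    # at tau = hi every entry has mass at most 1/n (total <= 1).
    lo = top - 1.0
    hi = top - (1.0 / len(scaled)) ** (alpha - 1.0)

    def probs(tau):
        return [max(x - tau, 0.0) ** exponent for x in scaled]

    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) >= 1.0:
            lo = mid  # total mass too large: raise the threshold
        else:
            hi = mid  # total mass too small: lower the threshold
    p = probs(0.5 * (lo + hi))
    total = sum(p)
    return [v / total for v in p]  # renormalise away the residual bisection error
```

With `alpha=2.0` this reduces to sparsemax, so low-scoring entries receive exactly zero probability. That sparsity is the property the abstract says existing implementations fail to exploit.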
33. One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
ArXiv ID: 2502.10454
Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu
Abstract: Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLM research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, CounterMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that CounterMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives to the community of mathematical LLMs.
Comment: The paper explores foundational aspects of LLMs by introducing a novel benchmark (CounterMATH) and focusing on counterexample-driven reasoning, which aligns with the 'Large Language Models' criterion for theoretical insights into LLM behavior.
Relevance: 9 Novelty: 8
34. Ansatz-free Hamiltonian learning with Heisenberg-limited scaling
ArXiv ID: 2502.11900
Authors: Hong-Ye Hu, Muzhou Ma, Weiyuan Gong, Qi Ye, Yu Tong, Steven T. Flammia, Susanne F. Yelin
Abstract: Learning the unknown interactions that govern a quantum system is crucial for quantum information processing, device benchmarking, and quantum sensing. The problem, known as Hamiltonian learning, is well understood under the assumption that interactions are local, but this assumption may not hold for arbitrary Hamiltonians. Previous methods all require high-order inverse polynomial dependency with precision, unable to surpass the standard quantum limit and reach the gold standard Heisenberg-limited scaling. Whether Heisenberg-limited Hamiltonian learning is possible without prior assumptions about the interaction structures, a challenge we term ansatz-free Hamiltonian learning, remains an open question. In this work, we present a quantum algorithm to learn arbitrary sparse Hamiltonians without any structure constraints using only black-box queries of the system's real-time evolution and minimal digital controls to attain Heisenberg-limited scaling in estimation error. Our method is also resilient to state-preparation-and-measurement errors, enhancing its practical feasibility. Moreover, we establish a fundamental trade-off between total evolution time and quantum control on learning arbitrary interactions, revealing the intrinsic interplay between controllability and total evolution time complexity for any learning algorithm. These results pave the way for further exploration into Heisenberg-limited Hamiltonian learning in complex quantum systems under minimal assumptions, potentially enabling new benchmarking and verification protocols.
Comment: The study introduces Heisenberg-limited precision in Hamiltonian learning, which falls under 'Emerging Trends' for foundational quantum system insights.
Relevance: 8 Novelty: 9
35. The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
ArXiv ID: 2502.11771
Authors: Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
Abstract: The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models' internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads, i.e., attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models' internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why current LLMs struggle to detect even simple arithmetic errors.
Comment: The paper provides a mechanistic analysis of error detection in LLMs, focusing on arithmetic validation, which aligns with interpretability and foundational research in LLM behavior.
Relevance: 9 Novelty: 7
36. The geometry of BERT
ArXiv ID: 2502.12033
Authors: Matteo Bonino, Giorgia Ghione, Giansalvo Cirrincione
Abstract: Transformer neural networks, particularly Bidirectional Encoder Representations from Transformers (BERT), have shown remarkable performance across various tasks such as classification, text summarization, and question answering. However, their internal mechanisms remain mathematically obscure, highlighting the need for greater explainability and interpretability. In this direction, this paper investigates the internal mechanisms of BERT proposing a novel perspective on the attention mechanism of BERT from a theoretical perspective. The analysis encompasses both local and global network behavior. At the local level, the concept of directionality of subspace selection as well as a comprehensive study of the patterns emerging from the self-attention matrix are presented. Additionally, this work explores the semantic content of the information stream through data distribution analysis and global statistical measures including the novel concept of cone index. A case study on the classification of SARS-CoV-2 variants using RNA which resulted in a very high accuracy has been selected in order to observe these concepts in an application. The insights gained from this analysis contribute to a deeper understanding of BERT's classification process, offering potential avenues for future architectural improvements in Transformer models and further analysis in the training process.
Comment: The paper provides a theoretical analysis of BERT's attention mechanism and internal geometry, which aligns with foundational research in Transformer interpretability.
Relevance: 9 Novelty: 7
37. A Mathematics Framework of Artificial Shifted Population Risk and Its Further Understanding Related to Consistency Regularization
ArXiv ID: 2502.10723
Authors: Xiliang Yang, Shenyang Deng, Shicong Liu, Yuanchi Suo, Wing W. Y. Ng, Jianjun Zhang
Abstract: Data augmentation is an important technique in training deep neural networks as it enhances their ability to generalize and remain robust. While data augmentation is commonly used to expand the sample size and act as a consistency regularization term, there is a lack of research on the relationship between them. To address this gap, this paper introduces a more comprehensive mathematical framework for data augmentation. Through this framework, we establish that the expected risk of the shifted population is the sum of the original population risk and a gap term, which can be interpreted as a consistency regularization term. The paper also provides a theoretical understanding of this gap, highlighting its negative effects on the early stages of training. We also propose a method to mitigate these effects. To validate our approach, we conducted experiments using the same data augmentation techniques and computing resources under several scenarios, including standard training, out-of-distribution, and imbalanced classification. The results demonstrate that our methods surpass the compared methods under all scenarios in terms of generalization ability and convergence stability. We provide our code implementation at the following link: https://github.com/ydlsfhll/ASPR.
Comment: The proposed mathematical framework for understanding consistency regularization in data augmentation contributes to theoretical insights and training dynamics, aligning with 'Representation Learning'.
Relevance: 9 Novelty: 7
38. From Layers to States: A State Space Model Perspective to Deep Neural Network Layer Dynamics
ArXiv ID: 2502.10463
Authors: Qinshuo Liu, Weiqin Zhao, Wei Huang, Yanwen Fang, Lequan Yu, Guodong Li
Abstract: The depth of neural networks is a critical factor for their capability, with deeper models often demonstrating superior performance. Motivated by this, significant efforts have been made to enhance layer aggregation, that is, reusing information from previous layers to better extract features at the current layer, in order to improve the representational power of deep neural networks. However, previous works have primarily addressed this problem from a discrete-state perspective, which is not suitable as the number of network layers grows. This paper takes the novel view of treating the outputs from layers as states of a continuous process and considers leveraging the state space model (SSM) to design the aggregation of layers in very deep neural networks. Moreover, inspired by its advancements in modeling long sequences, the Selective State Space Model (S6) is employed to design a new module called Selective State Space Model Layer Aggregation (S6LA). This module aims to combine traditional CNN or transformer architectures within a sequential framework, enhancing the representational capabilities of state-of-the-art vision networks. Extensive experiments show that S6LA delivers substantial improvements in both image classification and detection tasks, highlighting the potential of integrating SSMs with contemporary deep learning techniques.
Comment: This paper introduces a state-space model layer aggregation for deep networks. It aligns with 'Model Architecture', offering insights into layer dynamics and integration of SSM techniques.
Relevance: 9 Novelty: 7
39. LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model Merging
ArXiv ID: 2502.10749
Authors: Zehua Liu, Han Wu, Yuxuan Yao, Ruifeng She, Xiongwei Han, Tao Zhong, Mingxuan Yuan
Abstract: While most current approaches rely on further training techniques, such as fine-tuning or reinforcement learning, to enhance model capacities, model merging stands out for its ability to improve models without requiring any additional training. In this paper, we propose a unified framework for model merging based on low-rank estimation of task vectors without the need for access to the base model, named LoRE-Merging. Our approach is motivated by the observation that task vectors from fine-tuned models frequently exhibit a limited number of dominant singular values, making low-rank estimations less prone to interference. We implement the method by formulating the merging problem as an optimization problem. Extensive empirical experiments demonstrate the effectiveness of our framework in mitigating interference and preserving task-specific information, thereby advancing the state-of-the-art performance in model merging techniques.
Comment: Introduces low-rank estimation techniques for model merging, advancing low-rank methodologies closely tied to compression and scaling topics.
Relevance: 9 Novelty: 7
40. Low-Rank Thinning
ArXiv ID: 2502.12063
Authors: Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, Lester Mackey
Abstract: The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of uniform subsampling while substantially reducing the number of summary points. However, existing guarantees cover only a restricted range of distributions and kernel-based quality measures and suffer from pessimistic dimension dependence. To address these deficiencies, we introduce a new low-rank analysis of sub-Gaussian thinning that applies to any distribution and any kernel, guaranteeing high-quality compression whenever the kernel or data matrix is approximately low-rank. To demonstrate the broad applicability of the techniques, we design practical sub-Gaussian thinning approaches that improve upon the best known guarantees for approximating attention in transformers, accelerating stochastic gradient training through reordering, and distinguishing distributions in near-linear time.
Comment: The paper introduces a low-rank analysis for sub-Gaussian thinning, which has implications for model compression and efficiency, particularly in attention mechanisms.
Relevance: 8 Novelty: 8
41. On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning
ArXiv ID: 2502.10818
Authors: Álvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, Pierre Vandergheynst
Abstract: Graph Neural Networks (GNNs) are models that leverage the graph structure to transmit information between nodes, typically through the message-passing operation. While widely successful, this approach is well known to suffer from the over-smoothing and over-squashing phenomena, which result in representational collapse as the number of layers increases and insensitivity to the information contained at distant and poorly connected nodes, respectively. In this paper, we present a unified view of these problems through the lens of vanishing gradients, using ideas from linear control theory for our analysis. We propose an interpretation of GNNs as recurrent models and empirically demonstrate that a simple state-space formulation of a GNN effectively alleviates over-smoothing and over-squashing at no extra trainable parameter cost. Further, we show theoretically and empirically that (i) GNNs are by design prone to extreme gradient vanishing even after a few layers; (ii) Over-smoothing is directly related to the mechanism causing vanishing gradients; (iii) Over-squashing is most easily alleviated by a combination of graph rewiring and vanishing gradient mitigation. We believe our work will help bridge the gap between the recurrent and graph neural network literature and will unlock the design of new deep and performant GNNs.
Comment: The paper provides a unification of over-smoothing, over-squashing, and vanishing gradients in GNNs with theoretical insights and proposes improvements. Foundational relevance for graph neural networks.
Relevance: 8 Novelty: 8
42. How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
ArXiv ID: 2502.11196
Authors: Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen
Abstract: Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.
Comment: The exploration of knowledge circuit evolution in LLMs aligns with 'Representation Learning', focusing on interpretability and continual pre-training insights.
Relevance: 8 Novelty: 8
43. On the Query Complexity of Verifier-Assisted Language Generation
ArXiv ID: 2502.12123
Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
Abstract: Recently, a plethora of works have proposed inference-time algorithms (e.g. best-of-n), which incorporate verifiers to assist the generation process. Their quality-efficiency trade-offs have been empirically benchmarked on a variety of constrained generation tasks, but the algorithmic design landscape is still poorly understood. In this paper, we develop a mathematical framework for reasoning about constrained generation using a pre-trained language model generator oracle and a process verifier, which can decide whether a prefix can be extended to a string which satisfies the constraints of choice. We show that even in very simple settings, access to a verifier can turn an intractable problem (information-theoretically or computationally) into a tractable one. In fact, we show even simple algorithms, like tokenwise rejection sampling, can enjoy significant benefits from access to a verifier. Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to "backtrack" (i.e., erase the final few generated tokens), has robust and substantive benefits over natural baselines (e.g. (blockwise) rejection sampling, nucleus sampling) in terms of computational efficiency, accuracy, and diversity.
Comment: This paper explores verifier-assisted constrained generation, offering novel mathematical insights and advancing theoretical understanding of inference-time algorithms. Relevant to foundational LLM research.
Relevance: 8 Novelty: 8
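The verifier-assisted abstract above describes tokenwise rejection sampling with backtracking only in words, so a toy sketch may help. Everything here is an assumption made for illustration: the "generator" is a uniform choice between two parenthesis tokens, the constraint is "a balanced string of a given even length", and the rule "on rejection, erase the last `backtrack` tokens" is one simple reading of backtracking, not the paper's exact procedure.

```python
import random


def verifier(prefix, length):
    """Toy process verifier: can `prefix` be extended to a balanced string of `length` parens?"""
    depth = 0
    for tok in prefix:
        depth += 1 if tok == "(" else -1
        if depth < 0:
            return False  # closed more than opened: no extension can repair this
    remaining = length - len(prefix)
    return remaining >= 0 and depth <= remaining  # enough room left to close everything


def sample_with_backtracking(length, backtrack=2, max_queries=100_000, seed=0):
    """Tokenwise rejection sampling; a rejected token erases the last `backtrack` tokens."""
    if length % 2 != 0:
        raise ValueError("a balanced string needs an even length")
    rng = random.Random(seed)
    prefix, queries = [], 0
    while len(prefix) < length:
        if queries >= max_queries:
            raise RuntimeError("verifier query budget exhausted")
        token = rng.choice("()")  # stand-in for sampling from a language model
        queries += 1              # one verifier query per proposed token
        if verifier(prefix + [token], length):
            prefix.append(token)
        else:
            del prefix[max(0, len(prefix) - backtrack):]  # backtrack
    return "".join(prefix), queries
```

Calling `sample_with_backtracking(6)` returns a balanced string of six parentheses together with the number of verifier queries spent, which is the quantity the paper's query-complexity framing counts.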
44. APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
ArXiv ID: 2502.12085
Authors: Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
Abstract: While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB in https://github.com/thunlp/APB.
Comment: The paper introduces APB, a framework for accelerating long-context inference in LLMs, which is relevant to model efficiency and compression with significant speedup contributions.
Relevance: 8 Novelty: 8
45. Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens
ArXiv ID: 2502.11245
Authors: Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso
Abstract: Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring these modules produce interpretable concepts and behave reliably out of distribution is crucial, yet the conditions for achieving this remain unclear. We study this problem by establishing a novel connection between Concept-based Models and reasoning shortcuts (RSs), a common issue where models achieve high accuracy by learning low-quality concepts, even when the inference layer is fixed and provided upfront. Specifically, we first extend RSs to the more complex setting of Concept-based Models and then derive theoretical conditions for identifying both the concepts and the inference layer. Our empirical results highlight the impact of reasoning shortcuts and show that existing methods, even when combined with multiple natural mitigation strategies, often fail to meet these conditions in practice.
Comment: The paper studies concept-based models and reasoning shortcuts, which aligns with representation learning and interpretability. The theoretical conditions for identifiability add significant novelty.
Relevance: 8 Novelty: 8
46. SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
ArXiv ID: 2502.11356
Authors: Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du
Abstract: The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders (SAE) to interpret how instruction following works in these models. We demonstrate how the features we identify can effectively steer model outputs to align with given instructions. Through analysis of SAE latent activations, we identify specific latents responsible for instruction following behavior. Our findings reveal that instruction following capabilities are encoded by a distinct set of instruction-relevant SAE latents. These latents both show semantic proximity to relevant instructions and demonstrate causal effects on model behavior. Our research highlights several crucial factors for achieving effective steering performance: precise feature identification, the role of final layer, and optimal instruction positioning. Additionally, we demonstrate that our methodology scales effectively across SAEs and LLMs of varying sizes.
Comment: This paper uses Sparse Autoencoders (SAEs) to interpret instruction-following in LLMs, connecting both 'Representation Learning' and interpretability of models.
Relevance: 8 Novelty: 8
47. Generalization of the Gibbs algorithm with high probability at low temperatures
ArXiv ID: 2502.11071
Authors: Andreas Maurer
Abstract: The paper gives a bound on the generalization error of the Gibbs algorithm, which recovers known data-independent bounds for the high temperature range and extends to the low-temperature range, where generalization depends critically on the data-dependent loss-landscape. It is shown that, with high probability, the generalization error of a single hypothesis drawn from the Gibbs posterior decreases with the total prior volume of all hypotheses with similar or smaller empirical error. This gives theoretical support to the belief in the benefit of flat minima. The zero temperature limit is discussed and the bound is extended to a class of similar stochastic algorithms.
Comment: The paper addresses generalization bounds for the Gibbs algorithm with a focus on flat minima, relevant to emerging foundational insights in optimization.
Relevance: 8 Novelty: 8
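For readers unfamiliar with the Gibbs algorithm discussed in the entry above, the following minimal sketch shows the standard Gibbs posterior over a finite hypothesis class, with weights proportional to prior times exp(-beta * empirical loss), where beta is the inverse temperature. The finite class, the function name `gibbs_posterior`, and the example numbers are assumptions for illustration; the paper's bound itself is not reproduced.

```python
import math


def gibbs_posterior(prior, empirical_losses, beta):
    """Gibbs posterior weights: prior[h] * exp(-beta * loss[h]), normalised.

    Assumes a finite hypothesis class with strictly positive prior weights.
    beta is the inverse temperature: beta = 0 returns the prior, large beta
    (low temperature) concentrates on the hypotheses with smallest empirical loss.
    """
    log_w = [math.log(p) - beta * loss for p, loss in zip(prior, empirical_losses)]
    shift = max(log_w)                          # subtract the max for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    z = sum(w)
    return [v / z for v in w]
```

At low temperature the mass is shared among near-minimal hypotheses in proportion to their prior weight, which is one way to picture the "prior volume of hypotheses with similar or smaller empirical error" that the abstract refers to.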
48. Learning the Exact Time Integration Algorithm for Initial Value Problems by Randomized Neural Networks
ArXiv ID: 2502.10949
Authors: Suchuan Dong, Naxian Ni
Abstract: We present a method leveraging extreme learning machine (ELM) type randomized neural networks (NNs) for learning the exact time integration algorithm for initial value problems (IVPs). The exact time integration algorithm for non-autonomous systems can be represented by an algorithmic function in higher dimensions, which satisfies an associated system of partial differential equations with corresponding boundary conditions. Our method learns the algorithmic function by solving this associated system using ELM with a physics informed approach. The trained ELM network serves as the learned algorithm and can be used to solve the IVP with arbitrary initial data or step sizes from some domain. When the right hand side of the non-autonomous system exhibits a periodicity with respect to any of its arguments, while the solution itself to the problem is not periodic, we show that the algorithmic function is either periodic, or when it is not, satisfies a well-defined relation for different periods. This property can greatly simplify the algorithm learning in many problems. We consider explicit and implicit NN formulations, leading to explicit or implicit time integration algorithms, and discuss how to train the ELM network by the nonlinear least squares method. Extensive numerical experiments with benchmark problems, including non-stiff, stiff and chaotic systems, show that the learned NN algorithm produces highly accurate solutions in long-time simulations, with its time-marching errors decreasing nearly exponentially with increasing degrees of freedom in the neural network. We compare extensively the computational performance (accuracy vs. cost) between the current NN algorithm and the leading traditional time integration algorithms. The learned NN algorithm is computationally competitive, markedly outperforming the traditional algorithms in many problems.
Comment: The paper introduces a method for learning time integration algorithms using randomized neural networks, which is foundational in terms of algorithmic innovation and efficiency.
Relevance: 8 Novelty: 8
49. Continuous Diffusion Model for Language Modeling
ArXiv ID: 2502.11564
Authors: Jaehyeong Jo, Sung Ju Hwang
Abstract: Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that directly work on discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on the analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Code is available at https://github.com/harryjo97/RDLM.
Comment: This paper introduces a continuous diffusion model for language modeling with connections to statistical manifolds, providing theoretical innovations in generative modeling for discrete data.
Relevance: 8 Novelty: 8
50. MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models
ArXiv ID: 2502.11513
Authors: Zhen Zhang, Yifan Yang, Kai Zhen, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang
Abstract: Large language models have demonstrated exceptional capabilities across diverse tasks, but their fine-tuning demands significant memory, posing challenges for resource-constrained environments. Zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating the need for backpropagation. However, ZO optimization suffers from high gradient variance, and prior research has largely focused on single-task learning, leaving its application to multi-task learning unexplored. Multi-task learning is crucial for leveraging shared knowledge across tasks to improve generalization, yet it introduces unique challenges under ZO settings, such as amplified gradient variance and collinearity. In this paper, we present MaZO, the first framework specifically designed for multi-task LLM fine-tuning under ZO optimization. MaZO tackles these challenges at the parameter level through two key innovations: a weight importance metric to identify critical parameters and a multi-task weight update mask to selectively update these parameters, reducing the dimensionality of the parameter space and mitigating task conflicts. Experiments demonstrate that MaZO achieves state-of-the-art performance, surpassing even multi-task learning methods designed for first-order optimization.
Comment: The paper introduces MaZO, a novel framework for multi-task fine-tuning of LLMs using zeroth-order optimization, which aligns with the 'Model Compression' criterion due to its focus on memory-efficient optimization and parameter-level innovations.
Relevance: 8 Novelty: 8
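To make the idea in the MaZO entry above concrete, here is a minimal sketch of a zeroth-order update restricted by a parameter mask. It uses the common two-point (SPSA-style) estimator, which needs only two forward evaluations of the loss and no backpropagation. The mask is supplied by hand; the paper's weight importance metric and multi-task mask construction are not reproduced, and `masked_zo_step` is a hypothetical name.

```python
import random


def masked_zo_step(params, loss_fn, mask, lr=0.05, eps=1e-3, rng=random):
    """One two-point zeroth-order step that only perturbs and updates masked-in parameters.

    params:  list of floats
    loss_fn: maps a list of floats to a scalar loss (forward pass only)
    mask:    list of booleans, True where a parameter may be updated (assumed given)
    """
    # Random direction that is zero outside the mask, so the search space shrinks.
    z = [rng.gauss(0.0, 1.0) if m else 0.0 for m in mask]
    plus = [p + eps * d for p, d in zip(params, z)]
    minus = [p - eps * d for p, d in zip(params, z)]
    # Finite-difference estimate of the directional derivative along z.
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [p - lr * scale * d for p, d in zip(params, z)]
```

Because the perturbation is zero outside the mask, masked-out parameters are never touched, and the estimator's variance scales with the number of masked-in parameters rather than the full parameter count. That is the dimensionality reduction the abstract describes.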
51. How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines
ArXiv ID: 2502.12051
Authors: Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
Abstract: Neural scaling laws have revolutionized the design and optimization of large-scale AI models by revealing predictable relationships between model size, dataset volume, and computational resources. Early research established power-law relationships in model performance, leading to compute-optimal scaling strategies. However, recent studies highlighted their limitations across architectures, modalities, and deployment contexts. Sparse models, mixture-of-experts, retrieval-augmented learning, and multimodal models often deviate from traditional scaling patterns. Moreover, scaling behaviors vary across domains such as vision, reinforcement learning, and fine-tuning, underscoring the need for more nuanced approaches. In this survey, we synthesize insights from over 50 studies, examining the theoretical foundations, empirical findings, and practical implications of scaling laws. We also explore key challenges, including data efficiency, inference scaling, and architecture-specific constraints, advocating for adaptive scaling strategies tailored to real-world applications. We suggest that while scaling laws provide a useful guide, they do not always generalize across all architectures and training strategies.
Comment: This paper offers a broad survey of scaling laws, which encompass topics such as sparse models and mixture-of-experts. This aligns with the 'Model Architecture' and 'Representation Learning' criteria as it touches on foundational and theoretical aspects of scaling models.
Relevance: 9 Novelty: 6
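As a small illustration of the power-law relationships mentioned in this abstract, the sketch below fits L(N) = a * N^(-b) by least squares in log-log space. The data points are synthetic and the functional form is the simplest possible assumption; published scaling laws often add an irreducible-loss term and a dataset-size term, which this sketch omits.

```python
import math


def fit_power_law(sizes, losses):
    """Fit log L = log a - b * log N by ordinary least squares; return (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, made-up measurements generated from a = 5.0, b = 0.1.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)

if __name__ == "__main__":
    print(f"a = {a:.3f}, b = {b:.3f}")
```

A fit like this only extrapolates safely inside the regime where the power law actually holds, which is the caveat the survey raises for sparse, mixture-of-experts, and multimodal models.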
52. Meta-Statistical Learning: Supervised Learning of Statistical Inference
ArXiv ID: 2502.12088
Authors: Maxime Peyrard, Kyunghyun Cho
Abstract: This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks, where the goal is to predict properties of the data-generating distribution rather than labels for individual datapoints. These tasks encompass statistical inference problems such as parameter estimation, hypothesis testing, or mutual information estimation. Framing these tasks within traditional machine learning pipelines is challenging, as supervision is typically tied to individual datapoints. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems. In this approach, entire datasets are treated as single inputs to neural networks, which predict distribution-level parameters. Transformer-based architectures, without positional encoding, provide a natural fit due to their permutation-invariance properties. By training on large-scale synthetic datasets, meta-statistical models can leverage the scalability and optimization infrastructure of Transformer-based LLMs. We demonstrate the framework's versatility with applications in hypothesis testing and mutual information estimation, showing strong performance, particularly for small datasets where traditional neural methods struggle.
Comment: The paper introduces a novel framework for statistical inference using Transformer-based architectures, which aligns with representation learning and architectural insights. The use of permutation-invariant Transformers is particularly relevant.
Relevance: 8 Novelty: 7
53. GRIFFIN: Effective Token Alignment for Faster Speculative Decoding
ArXiv ID: 2502.11018
Authors: Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh, Pan Zhou
Abstract: Speculative decoding accelerates inference in large language models (LLMs) by generating multiple draft tokens simultaneously. However, existing methods often struggle with token misalignment between the training and decoding phases, limiting their performance. To address this, we propose GRIFFIN, a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model to mitigate misalignment. The training strategy employs a loss masking mechanism to exclude highly misaligned tokens during training, preventing them from negatively impacting the draft model's optimization. The token-alignable draft model introduces input tokens to correct inconsistencies in generated features. Experiments on LLaMA-series and Vicuna models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 7% and a speedup ratio exceeding 8%, outperforming current SoTAs as shown in Fig. 1 (a) and (b).
Comment: The paper proposes a novel token alignment strategy for speculative decoding in LLMs, which is relevant to model efficiency and compression. The improvements in decoding speed and alignment are notable.
Relevance: 8 Novelty: 7
54. Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning
ArXiv ID: 2502.11019
Authors: Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, Yin Wei
Abstract: Catastrophic forgetting (CF) poses a significant challenge in machine learning, where a model forgets previously learned information upon learning new tasks. Despite the advanced capabilities of Large Language Models (LLMs), they continue to face challenges with CF during continual learning. The majority of existing research focuses on analyzing forgetting patterns through a singular training sequence, thereby overlooking the intricate effects that diverse tasks have on model behavior. Our study explores CF across various settings, discovering that model forgetting is influenced by both the specific training tasks and the models themselves. To this end, we interpret forgetting by examining the function vector (FV), a compact representation of functions in LLMs, offering a model-dependent indicator for the occurrence of CF. Through theoretical and empirical analyses, we demonstrated that CF in LLMs primarily stems from biases in function activation rather than the overwriting of task processing functions. Leveraging these insights, we propose a novel function vector guided training methodology, incorporating a regularization technique to stabilize the FV and mitigate forgetting. Empirical tests on four benchmarks confirm the effectiveness of our proposed training method, substantiating our theoretical framework concerning CF and model function dynamics. We plan to make our code publicly accessible in the near future.
Comment: The paper addresses catastrophic forgetting in LLMs using function vectors, which aligns with representation learning and training dynamics, offering theoretical insights.
Relevance: 8 Novelty: 7
55. Diversified Sampling Improves Scaling LLM inference
ArXiv ID: 2502.11027
Authors: Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Haifeng Chen, Xiang Zhang, Wei Cheng
Abstract: While increasing training compute has significantly improved the performance of large language models (LLMs), similar gains have not been observed when scaling inference compute. We hypothesize that the primary issue lies in the uniformity of LLM outputs, which leads to inefficient sampling as models repeatedly generate similar but inaccurate responses. Motivated by an intriguing relationship between solution accuracy (Pass@10) and response diversity, we propose DivSampling, a novel and versatile sampling technique designed to enhance the diversity of candidate solutions by introducing prompt perturbations. DivSampling incorporates two categories of perturbations: task-agnostic approaches, which are general and not tailored to any specific task, and task-specific approaches, which are customized based on task content. Our theoretical analysis demonstrates that, under mild assumptions, the error rates of responses generated from diverse prompts are significantly lower compared to those produced by stationary prompts. Comprehensive evaluations across various tasks, including reasoning, mathematics, and code generation, highlight the effectiveness of DivSampling in improving solution accuracy. This scalable and efficient approach offers a new perspective on optimizing test-time inference, addressing limitations in current sampling strategies.
Comment: The paper introduces a novel sampling technique to improve LLM inference by enhancing diversity, which aligns with foundational research in LLM efficiency and inference optimization.
Relevance: 8 Novelty: 7
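The abstract above measures accuracy with Pass@10. For readers unfamiliar with the metric, here is a minimal sketch of the commonly used unbiased pass@k estimator, computed from n sampled solutions of which c are correct. Whether the paper uses exactly this estimator is an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct).

    n: total samples drawn for a problem
    c: number of those samples that are correct
    k: budget of samples being evaluated (k <= n)
    Uses 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct solution.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=20, c=2, k=10))
```

Because the estimate depends only on whether at least one sample is correct, near-duplicate wrong answers waste budget, which is the intuition behind diversifying the samples.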
56. Uncertainty-Aware Search and Value Models: Mitigating Search Scaling Flaws in LLMs
ArXiv ID: 2502.11155
Authors: Fei Yu, Yingru Li, Benyou Wang
Abstract: Value model-guided search is effective in steering the generation but suffers from scaling flaws: Its superiority diminishes with larger sample sizes, underperforming non-search baselines. This limitation arises from reliability degradation in value models in unseen reasoning paths. To address this, we propose an uncertainty-aware search framework that includes two key components: (1) uncertainty-aware value models that incorporate uncertainty into predictions, and (2) an uncertainty-aware selection process using the proposed efficient Group Thompson Sampling algorithm. Experiments on GSM8K show that our method mitigates search scaling flaws, achieving 90.5% coverage at 16 samples compared to 85.8% for conventional value-guided search. This work establishes the first systematic integration of uncertainty quantification in LLM search paradigms.
Comment: The paper addresses uncertainty-aware search in LLMs, which is relevant to foundational research in LLM behavior and inference optimization.
Relevance: 8 Novelty: 7
57. Towards Reasoning Ability of Small Language Models
ArXiv ID: 2502.11569
Authors: Gaurav Srivastava, Shuxiang Cao, Xuan Wang
Abstract: Reasoning has long been viewed as an emergent property of large language models (LLMs), appearing at or above a certain scale ($\sim$100B parameters). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. SLMs are increasingly favored for their efficiency and deployability. However, there is a lack of systematic study on the reasoning abilities of diverse SLMs, including those trained from scratch or derived from LLMs through quantization, pruning, and distillation. This raises a critical question: Can SLMs achieve reasoning abilities comparable to LLMs? In this work, we systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. For reliable evaluation, we examine four evaluation methods and compare four LLM judges against human evaluations on 800 data points. We repeat all experiments three times to ensure a robust performance assessment. Additionally, we analyze the impact of different prompting strategies in small models. Beyond accuracy, we also evaluate model robustness under adversarial conditions and intermediate reasoning steps. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities can be developed through structured training or post-training compression. They can serve as efficient alternatives to LLMs for reasoning-intensive tasks.
Comment: The paper systematically studies reasoning abilities in small language models, which is relevant to foundational research in LLM behavior and interpretability.
Relevance: 8 Novelty: 7
58. DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning
ArXiv ID: 2502.11482
Authors: Huanxuan Liao, Shizhu He, Yupu Hao, Jun Zhao, Kang Liu
Abstract: Continual learning (CL) is essential for Large Language Models (LLMs) to adapt to evolving real-world demands, yet they are susceptible to catastrophic forgetting (CF). While traditional CF solutions rely on expensive data rehearsal, recent rehearsal-free methods employ model-based and regularization-based strategies to address this issue. However, these approaches often neglect the model's plasticity, which is crucial to achieving optimal performance on newly learned tasks. Consequently, a key challenge in CL is striking a balance between preserving plasticity and mitigating CF. To tackle this challenge, we propose the Decomposed Attention-based Task Adaptation (DATA), which explicitly decouples and learns both task-specific and task-shared knowledge using high-rank and low-rank task adapters (e.g., LoRAs). For new tasks, DATA dynamically adjusts the weights of adapters of different ranks based on their relevance and distinction from previous tasks, allowing the model to acquire new task-specific skills while effectively retaining previously learned knowledge. Specifically, we implement a decomposed component weighting strategy comprising learnable components that collectively generate attention-based weights, allowing the model to integrate and utilize diverse knowledge from each DATA. Extensive experiments on three widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance. Notably, our approach significantly enhances model plasticity and mitigates CF by extending learnable components and employing stochastic restoration during training iterations.
Comment: The paper introduces a decomposed attention-based task adaptation method for continual learning, which is relevant to foundational research in representation learning and model efficiency.
Relevance: 8 Novelty: 7
59. AdaGC: Improving Training Stability for Large Language Model Pretraining
ArXiv ID: 2502.11034
Authors: Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Tao Sun, Yanjun Ma, Dianhai Yu, Li Shen
Abstract: Large Language Models (LLMs) face increasing loss spikes during scaling, undermining training stability and final performance. While gradient clipping mitigates this issue, traditional global approaches poorly handle parameter-specific gradient variations and decaying gradient norms. We propose AdaGC, an adaptive gradient clipping framework that automatically adjusts local thresholds per parameter through exponential moving average of gradient norms. Theoretical analysis proves AdaGC's convergence under non-convex conditions. Extensive experiments demonstrate significant improvements: On Llama-2 7B/13B, AdaGC completely eliminates loss spikes while reducing WikiText perplexity by 3.5% (+0.14pp LAMBADA accuracy) for 7B and achieving 0.65% lower training loss with 1.47% reduced validation perplexity for 13B compared to global clipping. For CLIP ViT-Base, AdaGC converges 25% faster than StableAdamW with full spike elimination. The method shows universal effectiveness across architectures (Llama-2 7B/13B) and modalities (CLIP), with successful integration into diverse optimizers like AdamW and Lion. Source code will be released on GitHub.
Comment: The paper proposes AdaGC, an adaptive gradient clipping framework for LLM pretraining, which is relevant to foundational research in training stability and optimization.
Relevance: 8 Novelty: 7
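The core mechanism described in this abstract, a per-parameter clipping threshold driven by an exponential moving average (EMA) of gradient norms, can be sketched as follows. This is an illustration of the general idea only, not the AdaGC algorithm: the class name, the warm-start rule, and the values of `beta` and `rel_threshold` are assumptions.

```python
import math


class EmaGradClipper:
    """Clip each parameter's gradient against its own EMA of past norms."""

    def __init__(self, beta=0.98, rel_threshold=1.05):
        self.beta = beta                    # EMA decay (assumed value)
        self.rel_threshold = rel_threshold  # allowed norm / EMA ratio (assumed)
        self.ema = {}                       # parameter name -> EMA of grad norm

    def clip(self, name, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        ema = self.ema.get(name)
        if ema is None:
            # First step for this parameter: initialise the EMA, no clipping.
            self.ema[name] = norm
            return list(grad)
        limit = self.rel_threshold * ema
        if norm > limit:
            # Rescale the spike down to the local threshold.
            grad = [g * (limit / norm) for g in grad]
            norm = limit
        # Track the clipped norm so one spike cannot inflate the threshold.
        self.ema[name] = self.beta * ema + (1.0 - self.beta) * norm
        return list(grad)


if __name__ == "__main__":
    clipper = EmaGradClipper(beta=0.9, rel_threshold=2.0)
    print(clipper.clip("w", [3.0, 4.0]))    # warm-up step, returned unchanged
    print(clipper.clip("w", [30.0, 40.0]))  # spike, rescaled to norm 10
```

Unlike a single global threshold, each threshold here follows the gradient scale of its own parameter, so a decaying gradient norm tightens the limit automatically.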
60. Why is prompting hard? Understanding prompts on binary sequence predictors
ArXiv ID: 2502.10760
Authors: Li Kevin Wenliang, Anian Ruoss, Jordi Grau-Moya, Marcus Hutter, Tim Genewein
Abstract: Large language models (LLMs) can be prompted to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We explore these issues by viewing prompting as conditioning a near-optimal sequence predictor (LLM) pretrained on diverse data sources. Through numerous prompt search experiments, we show that the unintuitive patterns in optimal prompts can be better understood given the pretraining distribution, which is often unavailable in practice. Moreover, even using exhaustive search, reliably identifying optimal prompts from practical neural predictors can be difficult. Further, we demonstrate that common prompting methods, such as using intuitive prompts or samples from the targeted task, are in fact suboptimal. Thus, this work takes an initial step towards understanding the difficulties in finding and understanding optimal prompts from a statistical and empirical perspective.
Comment: This paper provides statistical and empirical analysis on prompting and sheds light on LLM behavior and training paradigms, aligning with theoretical insights on LLM interpretability.
Relevance: 8 Novelty: 7
61. Superpose Singular Features for Model Merging
ArXiv ID: 2502.10698
Authors: Haiquan Qiu, You Wu, Quanming Yao
Abstract: Model merging is a critical technique for combining the capabilities of multiple fine-tuned models without requiring additional training. While existing methods treat parameters as vectors, they overlook the intrinsic structure of linear transformation matrices, the core components that comprise the majority of model parameters. These matrices are fundamental to neural networks, mapping input representations to output features through linear combinations. Motivated by the linear representation hypothesis, we introduce the task matrix and propose Superpose Features from Task Matrix (SFTM), a novel approach that superposes features from individual task models into a merged model. SFTM employs singular value decomposition to identify feature bases of linear transformation matrices and solves a linear system to optimally combine them while preserving input-output mappings from individual task models. Extensive experiments on vision transformers and language models demonstrate that our method consistently outperforms existing methods, achieving superior performance and enhanced out-of-distribution generalization.
Comment: The work introduces a novel approach to model merging using singular value decomposition, which may have implications for foundational architecture studies such as model compression techniques.
Relevance: 8 Novelty: 7
62. Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning
ArXiv ID: 2502.11962
Authors: Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
Abstract: Instruction Fine-tuning (IFT) can enhance the helpfulness of Large Language Models (LLMs), but it may lower their truthfulness. This trade-off arises because IFT steers LLMs to generate responses with long-tail knowledge that is not well covered during pre-training, leading to more informative but less truthful answers when generalizing to unseen tasks. In this paper, we empirically demonstrate this helpfulness-truthfulness trade-off in IFT and propose UNIT, a novel IFT paradigm to address it. UNIT teaches LLMs to recognize their uncertainty and explicitly reflect it at the end of their responses. Experimental results show that UNIT-tuned models maintain their helpfulness while distinguishing between certain and uncertain claims, thereby reducing hallucinations.
Comment: The paper examines a novel uncertainty-aware instruction fine-tuning paradigm for LLMs, focusing on balancing helpfulness and truthfulness, providing theoretical insights relevant to LLM behavior.
Relevance: 8 Novelty: 7
63. Logarithmic Width Suffices for Robust Memorization
ArXiv ID: 2502.11162
Authors: Amitsour Egosi, Gilad Yehudai, Ohad Shamir
Abstract: The memorization capacity of neural networks with a given architecture has been thoroughly studied in many works. Specifically, it is well-known that memorizing $N$ samples can be done using a network of constant width, independent of $N$. However, the required constructions are often quite delicate. In this paper, we consider the natural question of how well feedforward ReLU neural networks can memorize robustly, namely while being able to withstand adversarial perturbations of a given radius. We establish both upper and lower bounds on the possible radius for general $l_p$ norms, implying (among other things) that width logarithmic in the number of input samples is necessary and sufficient to achieve robust memorization (with robustness radius independent of $N$).
Comment: The paper provides a theoretical analysis of memorization capability in neural networks with respect to robust conditions. This aligns with 'Representation Learning', as it explores training dynamics and capacity in feedforward networks.
Relevance: 8 Novelty: 7
64. Deep Incomplete Multi-view Learning via Cyclic Permutation of VAEs
ArXiv ID: 2502.11037
Authors: Xin Gao, Jian Pu
Abstract: Multi-View Representation Learning (MVRL) aims to derive a unified representation from multi-view data by leveraging shared and complementary information across views. However, when views are irregularly missing, the incomplete data can lead to representations that lack sufficiency and consistency. To address this, we propose Multi-View Permutation of Variational Auto-Encoders (MVP), which excavates invariant relationships between views in incomplete data. MVP establishes inter-view correspondences in the latent space of Variational Auto-Encoders, enabling the inference of missing views and the aggregation of more sufficient information. To derive a valid Evidence Lower Bound (ELBO) for learning, we apply permutations to randomly reorder variables for cross-view generation and then partition them by views to maintain invariant meanings under permutations. Additionally, we enhance consistency by introducing an informational prior with cyclic permutations of posteriors, which turns the regularization term into a similarity measure across distributions. We demonstrate the effectiveness of our approach on seven diverse datasets with varying missing ratios, achieving superior performance in multi-view clustering and generation tasks.
Comment: The paper introduces a cyclic permutation method for incomplete multi-view data in variational autoencoders, relevant to 'Representation Learning' via multi-view generative modeling.
Relevance: 8 Novelty: 7
65. Towards Watermarking of Open-Source LLMs
ArXiv ID: 2502.10525
Authors: Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev
Abstract: While watermarks for closed LLMs have matured and have been included in large-scale deployments, these methods are not applicable to open-source models, which allow users full control over the decoding process. This setting is understudied yet critical, given the rising performance of open-source models. In this work, we lay the foundation for systematic study of open-source LLM watermarking. For the first time, we explicitly formulate key requirements, including durability against common model modifications such as model merging, quantization, or finetuning, and propose a concrete evaluation setup. Given the prevalence of these modifications, durability is crucial for an open-source watermark to be effective. We survey and evaluate existing methods, showing that they are not durable. We also discuss potential ways to improve their durability and highlight remaining challenges. We hope our work enables future progress on this important problem.
Comment: Explores watermarking for open-source LLMs with considerations of durability and robustness, contributing to foundational security in LLM frameworks.
Relevance: 8 Novelty: 7
66. ReLearn: Unlearning via Learning for Large Language Models
ArXiv ID: 2502.11190
Authors: Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
Abstract: Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts the prediction of subsequent tokens, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.
Comment: The paper proposes a method for unlearning in LLMs, which is relevant to foundational research in LLM behavior and interpretability, particularly in preserving linguistic coherence.
Relevance: 8 Novelty: 7
67. Minimal Ranks, Maximum Confidence: Parameter-efficient Uncertainty Quantification for LoRA
ArXiv ID: 2502.12122
Authors: Patryk Marszałek, Klaudia Bałazy, Jacek Tabor, Tomasz Kuśmierczyk
Abstract: Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large language models by decomposing weight updates into low-rank matrices, significantly reducing storage and computational overhead. While effective, standard LoRA lacks mechanisms for uncertainty quantification, leading to overconfident and poorly calibrated models. Bayesian variants of LoRA address this limitation, but at the cost of a significantly increased number of trainable parameters, partially offsetting the original efficiency gains. Additionally, these models are harder to train and may suffer from unstable convergence. In this work, we propose a novel parameter-efficient Bayesian LoRA, demonstrating that effective uncertainty quantification can be achieved in very low-dimensional parameter spaces. The proposed method achieves strong performance with improved calibration and generalization while maintaining computational efficiency. Our empirical findings show that, with the appropriate projection of the weight space: (1) uncertainty can be effectively modeled in a low-dimensional space, and (2) weight covariances exhibit low ranks.
Comment: The paper proposes a Bayesian variant of LoRA for uncertainty quantification, which aligns with foundational research in model compression and parameter-efficient methods.
Relevance: 8 Novelty: 7
68. Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?
ArXiv ID: 2502.11895
Authors: Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke
Abstract: Large language models (LLMs) require immense resources for training and inference. Quantization, a technique that reduces the precision of model parameters, offers a promising solution for improving LLM efficiency and sustainability. While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs with 1.58 bits per weight parameter from scratch can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. Here, we investigate a training strategy for quantization-aware pre-training, where the models are first trained with 16-bit precision and then transition into 1.58-bit quantization-aware training. Our results on 11 downstream tasks show that this 16-to-1.58-bit training strategy is preferable over full 1.58-bit training and leaves models closer to those which have undergone 16-bit training. We further investigate the effects of retaining the optimizer state at the transition point and gradually phasing in quantization strength -- finding that both techniques alleviate the magnitude of loss spikes, but also that these effects can be compensated through further training.
Comment: This paper focuses on continual quantization-aware pretraining with foundational implications for model compression, aligning well with efficiency improvements in large-scale models.
Relevance: 8 Novelty: 7
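For context on what "1.58-bit" means in this abstract: each weight takes one of three values {-1, 0, +1}, and log2(3) is about 1.58. The sketch below shows absmean ternary quantization as commonly described for BitNet b1.58, plus a hypothetical linear blend as one way to picture "gradually phasing in quantization strength". The `blend` function and its `lam` parameter are assumptions, not the paper's schedule.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization: scale by mean |w|, round, clip to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def blend(weights, lam):
    """Hypothetical phase-in: lam = 0 keeps full precision, lam = 1 is fully ternary."""
    quantized, scale = ternary_quantize(weights)
    return [(1.0 - lam) * w + lam * scale * q for w, q in zip(weights, quantized)]


if __name__ == "__main__":
    weights = [0.9, -0.05, 0.4, -1.2]  # made-up example values
    print(ternary_quantize(weights))
    print(blend(weights, 0.5))
```

In quantization-aware training the rounding step is typically bypassed in the backward pass with a straight-through estimator; that part is outside the scope of this sketch.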
69. The Relationship between No-Regret Learning and Online Conformal Prediction
ArXiv ID: 2502.10947
Authors: Ramya Ramalingam, Shayan Kiyani, Aaron Roth
Abstract: Existing algorithms for online conformal prediction -- guaranteeing marginal coverage in adversarial settings -- are variants of online gradient descent (OGD), but their analyses of worst-case coverage do not follow from the regret guarantee of OGD. What is the relationship between no-regret learning and online conformal prediction? We observe that although standard regret guarantees imply marginal coverage in i.i.d. settings, this connection fails as soon as we either move to adversarial environments or ask for group conditional coverage. On the other hand, we show a tight connection between threshold calibrated coverage and swap-regret in adversarial settings, which extends to group-conditional (multi-valid) coverage. We also show that algorithms in the Follow-the-Perturbed-Leader family of no-regret learning algorithms (which includes online gradient descent) can be used to give group-conditional coverage guarantees in adversarial settings for arbitrary grouping functions. Via this connection we analyze and conduct experiments using a multi-group generalization of the ACI algorithm of Gibbs & Candes [2021] (arXiv:2106.00170).
Comment: Discusses theoretical links between no-regret learning and online conformal prediction, which is well-aligned with foundational research insights into ML algorithms.
Relevance: 8 Novelty: 7
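To illustrate why online conformal prediction is described above as a variant of online gradient descent, here is a minimal sketch of a threshold-tracking update: the threshold q moves up after a miss and down after a cover, which is a gradient step on the pinball (quantile) loss. This is a single-group toy, not the paper's multi-group algorithm; the score sequence, step size `eta`, and start value `q0` are assumptions.

```python
import math


def online_conformal(scores, alpha=0.1, eta=0.05, q0=0.5):
    """Track a threshold q so that long-run miscoverage approaches alpha.

    At each step the prediction set covers the truth iff score <= q.
    Update: q <- q + eta * (err - alpha), with err = 1 on a miss, else 0.
    """
    q = q0
    errors = []
    for score in scores:
        err = 1 if score > q else 0
        errors.append(err)
        q += eta * (err - alpha)
    return errors, q


# Made-up, slowly drifting nonconformity scores in [0, 1] (not i.i.d.).
scores = [0.5 + 0.5 * math.sin(t / 40.0) for t in range(2000)]
errors, final_q = online_conformal(scores)
miscoverage = sum(errors) / len(errors)

if __name__ == "__main__":
    print(f"empirical miscoverage: {miscoverage:.4f}")
```

The coverage bound follows from a telescoping argument rather than from a regret bound: the sum of (err - alpha) equals (q_final - q0) / eta, and q stays bounded when scores are bounded. That gap between the coverage analysis and the regret guarantee is the starting point of the abstract.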
70. Learning Identifiable Structures Helps Avoid Bias in DNN-based Supervised Causal Learning
ArXiv ID: 2502.10883
Authors: Jiaru Zhang, Rui Ding, Qiang Fu, Bojun Huang, Zizhen Deng, Yang Hua, Haibing Guan, Shi Han, Dongmei Zhang
Abstract: Causal discovery is a structured prediction task that aims to predict causal relations among variables based on their data samples. Supervised Causal Learning (SCL) is an emerging paradigm in this field. Existing Deep Neural Network (DNN)-based methods commonly adopt the "Node-Edge approach", in which the model first computes an embedding vector for each variable-node, then uses these variable-wise representations to concurrently and independently predict for each directed causal-edge. In this paper, we first show that this architecture has some systematic bias that cannot be mitigated regardless of model size and data size. We then propose SiCL, a DNN-based SCL method that predicts a skeleton matrix together with a v-tensor (a third-order tensor representing the v-structures). According to the Markov Equivalence Class (MEC) theory, both the skeleton and the v-structures are identifiable causal structures under the canonical MEC setting, so predictions about skeleton and v-structures do not suffer from the identifiability limit in causal discovery, thus SiCL can avoid the systematic bias in Node-Edge architecture, and enable consistent estimators for causal discovery. Moreover, SiCL is also equipped with a specially designed pairwise encoder module with a unidirectional attention layer to model both internal and external relationships of pairs of nodes. Experimental results on both synthetic and real-world benchmarks show that SiCL significantly outperforms other DNN-based SCL approaches.
Comment: Focuses on causal discovery with a bias-free approach to DNN architectures, connecting strongly to representation learning and structured prediction.
Relevance: 8 Novelty: 7
71. K-Edit: Language Model Editing with Contextual Knowledge Awareness
ArXiv ID: 2502.10626
Authors: Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
Abstract: As the world changes, we need to be able to update our models and correct false information without costly retraining. Knowledge-based model editing enables precise modifications to the weights of large language models in order to modify the information encoded within. Recent approaches have seen success in enabling recall of edited information for thousands of edits at once. However, these approaches fail to produce edits that account for associated contextual information. We present K-Edit, an effective approach to generating contextually consistent knowledge edits. By using knowledge graphs, which maintain contextual consistency when an edge is edited, we are able to generate additional contextual edits that ensure consistency of related information in the language model. Our experiments demonstrate significant improvements in multi-hop question answering while maintaining the general effectiveness and scalability of model edits.
Comment: The paper discusses knowledge-based model editing for LLMs, which aligns with the criterion of theoretical insights into LLM behavior. The use of knowledge graphs for contextual consistency adds methodological novelty.
Relevance: 8 Novelty: 7
72. Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation
ArXiv ID: 2502.11306
Authors: Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman
Abstract: Large language models (LLMs) often suffer from hallucination, generating factually incorrect or ungrounded content, which limits their reliability in high-stakes applications. A key factor contributing to hallucination is the use of hard labels during training, which enforce deterministic supervision, encourage overconfidence, and disregard the uncertainty inherent in natural language. To address this, we propose mitigating hallucination through knowledge distillation (KD), where a teacher model provides smoothed soft labels to a student model, reducing overconfidence and improving factual grounding. We apply KD during supervised finetuning on instructional data, evaluating its effectiveness across LLMs from different families. Experimental results on summarization benchmarks demonstrate that KD reduces hallucination compared to standard finetuning while preserving performance on general NLP tasks. These findings highlight KD as a promising approach for mitigating hallucination in LLMs and improving model reliability.
Comment: The paper addresses mitigating hallucination in LLMs via knowledge distillation, which connects to understanding and improving LLM behavior.
Relevance: 8 Novelty: 7
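Illustrative sketch for entry 72 (not taken from the paper): the abstract contrasts hard one-hot labels with smoothed soft labels supplied by a teacher. The snippet below shows the two training targets side by side for a single token position. The logits, the 4-token vocabulary, and the temperature of 2.0 are all made-up values chosen for illustration, and the paper's actual loss may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's softmax."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)

# Hypothetical next-token logits over a 4-token vocabulary.
teacher_logits = [2.0, 1.5, 0.1, -1.0]
student_logits = [1.0, 1.2, 0.0, -0.5]

# Hard label: all probability mass on the reference token (standard finetuning).
hard_label = [1.0, 0.0, 0.0, 0.0]
# Soft label: the teacher's distribution, smoothed with an assumed temperature.
soft_label = softmax(teacher_logits, temperature=2.0)

hard_loss = cross_entropy(hard_label, student_logits)
soft_loss = cross_entropy(soft_label, student_logits)

print("soft label:", [round(p, 3) for p in soft_label])
print("hard-label loss:", round(hard_loss, 4))
print("soft-label loss:", round(soft_loss, 4))
```

With the soft target, plausible alternative tokens keep non-zero probability, so the student is not pushed toward full confidence in a single token.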
73. LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization
ArXiv ID: 2502.10648
Authors: Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, Robert Tibshirani
Abstract: We introduce LLM-Lasso, a novel framework that leverages large language models (LLMs) to guide feature selection in Lasso $\ell_1$ regression. Unlike traditional methods that rely solely on numerical data, LLM-Lasso incorporates domain-specific knowledge extracted from natural language, enhanced through a retrieval-augmented generation (RAG) pipeline, to seamlessly integrate data-driven modeling with contextual insights. Specifically, the LLM generates penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model, while less relevant features are assigned higher penalties, reducing their influence. Importantly, LLM-Lasso has an internal validation step that determines how much to trust the contextual knowledge in our prediction pipeline. Hence it addresses key challenges in robustness, making it suitable for mitigating potential inaccuracies or hallucinations from the LLM. In various biomedical case studies, LLM-Lasso outperforms standard Lasso and existing feature selection baselines, all while ensuring the LLM operates without prior access to the datasets. To our knowledge, this is the first approach to effectively integrate conventional feature selection techniques directly with LLM-based domain-specific reasoning.
Comment: The paper proposes a novel framework for feature selection using LLMs, which aligns with 'Representation Learning' through its focus on integrating domain-specific reasoning into feature selection.
Relevance: 8 Novelty: 7
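Illustrative note for entry 73 (not taken from the paper): the abstract describes per-feature penalty factors that become weights in the Lasso penalty. One standard way to write such a weighted Lasso is shown below. The symbols $n$ (samples), $p$ (features), $\lambda$ (overall regularization strength), and $w_j$ (per-feature weight) are notation assumed here, and the mapping from LLM scores to $w_j$ is left unspecified because the abstract does not give it.

```latex
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \sum_{j=1}^{p} w_j \lvert \beta_j \rvert,
\qquad w_j \ge 0
```

A smaller $w_j$ penalizes feature $j$ less, making it more likely to stay in the model, which matches the behavior the abstract describes for features the LLM rates as relevant. Setting all $w_j = 1$ recovers the standard Lasso.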
74. ReReLRP - Remembering and Recognizing Tasks with LRP
ArXiv ID: 2502.10789
Authors: Karolina Bogacka, Maximilian Höfler, Maria Ganzha, Wojciech Samek, Katarzyna Wasielewska-Michniewska
Abstract: Deep neural networks have revolutionized numerous research fields and applications. Despite their widespread success, a fundamental limitation known as catastrophic forgetting remains, where models fail to retain their ability to perform previously learned tasks after being trained on new ones. This limitation is particularly acute in certain continual learning scenarios, where models must integrate the knowledge from new domains with their existing capabilities. Traditional approaches to mitigate this problem typically rely on memory replay mechanisms, storing either original data samples, prototypes, or activation patterns. Although effective, these methods often introduce significant computational overhead, raise privacy concerns, and require the use of dedicated architectures. In this work we present ReReLRP (Remembering and Recognizing with LRP), a novel solution that leverages Layerwise Relevance Propagation (LRP) to preserve information across tasks. Our contribution provides increased privacy of existing replay-free methods while additionally offering built-in explainability, flexibility of model architecture and deployment, and a new mechanism to increase memory storage efficiency. We validate our approach on a wide variety of datasets, demonstrating results comparable with a well-known replay-based method in selected scenarios.
Comment: The paper addresses catastrophic forgetting using Layerwise Relevance Propagation (LRP), which aligns with 'Representation Learning' and provides insights into memory efficiency and explainability.
Relevance: 8 Novelty: 7
75. On the kernel learning problem
ArXiv ID: 2502.11665
Authors: Yang Li, Feng Ruan
Abstract: The classical kernel ridge regression problem aims to find the best fit for the output $Y$ as a function of the input data $X\in \mathbb{R}^d$, with a fixed choice of regularization term imposed by a given choice of a reproducing kernel Hilbert space, such as a Sobolev space. Here we consider a generalization of the kernel ridge regression problem, by introducing an extra matrix parameter $U$, which aims to detect the scale parameters and the feature variables in the data, and thereby improve the efficiency of kernel ridge regression. This naturally leads to a nonlinear variational problem to optimize the choice of $U$. We study various foundational mathematical aspects of this variational problem, and in particular how this behaves in the presence of multiscale structures in the data.
Comment: The paper addresses kernel learning with a novel variational problem, which aligns with foundational research in representation learning and multiscale data structures.
Relevance: 8 Novelty: 7
76. Error Bound Analysis for the Regularized Loss of Deep Linear Neural Networks
ArXiv ID: 2502.11152
Authors: Po Chen, Rujun Jiang, Peng Wang
Abstract: The optimization foundations of deep linear networks have received significant attention lately. However, due to the non-convexity and hierarchical structure, analyzing the regularized loss of deep linear networks remains a challenging task. In this work, we study the local geometric landscape of the regularized squared loss of deep linear networks, providing a deeper understanding of its optimization properties. Specifically, we characterize the critical point set and establish an error-bound property for all critical points under mild conditions. Notably, we identify the sufficient and necessary conditions under which the error bound holds. To support our theoretical findings, we conduct numerical experiments demonstrating that gradient descent exhibits linear convergence when optimizing the regularized loss of deep linear networks.
Comment: The paper provides theoretical insights into the optimization landscape of deep linear networks, which aligns with foundational research in training dynamics and optimization.
Relevance: 8 Novelty: 7
77. Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL
ArXiv ID: 2502.11107
Authors: Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu
Abstract: As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence (whose mass-covering behavior risks overfitting to imperfect weak signals) with reverse KL divergence. Reverse KL divergence's zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence, establishing that reverse KL achieves at least comparable guarantees to forward KL. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last layer, reverse KL uniquely guarantees that it outperforms its weak supervisor by the magnitude of their disagreement, a guarantee that forward KL cannot provide. Empirically, we demonstrate that reverse KL and reverse cross-entropy enable strong models to consistently outperform those trained with forward KL and standard cross-entropy across most settings, highlighting the practical advantages of these reverse losses.
Comment: The paper explores reverse KL divergence for weak-to-strong generalization, which provides theoretical insights into optimization and generalization in LLMs.
Relevance: 8 Novelty: 7
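Illustrative sketch for entry 77 (not taken from the paper): KL divergence is asymmetric, which is why the choice between forward and reverse KL matters. The snippet computes both directions for two made-up discrete distributions. It assumes the common convention that forward KL takes the expectation under the supervisor (weak) distribution and reverse KL under the student (strong) distribution; the paper's exact definitions may differ.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical class probabilities for a single example.
weak = [0.5, 0.3, 0.2]       # weak supervisor's prediction
strong = [0.8, 0.15, 0.05]   # strong student's prediction

forward_kl = kl(weak, strong)   # expectation under the weak supervisor
reverse_kl = kl(strong, weak)   # expectation under the strong student

print("forward KL(weak || strong):", round(forward_kl, 4))
print("reverse KL(strong || weak):", round(reverse_kl, 4))
```

Forward KL heavily penalizes the student for assigning low probability wherever the supervisor has mass (mass-covering), while reverse KL penalizes the student for placing mass where the supervisor has little (zero-forcing).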
78. MixMin: Finding Data Mixtures via Convex Minimization
ArXiv ID: 2502.10510
Authors: Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison
Abstract: Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pre-training large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting between 1-5% relative improvement to negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements to average precision scores of 0.03-0.15.
Comment: The paper introduces a method (MixMin) for optimizing data mixtures, which aligns with foundational research in data efficiency and representation learning.
Relevance: 8 Novelty: 7
79. Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving
ArXiv ID: 2502.12022
Authors: Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
Abstract: Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model's unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities.
Comment: TATA framework for teaching LLMs adaptive reasoning strategies is relevant to foundational research on LLM behavior and training improvements, particularly its aptitude-aware data selection component.
Relevance: 8 Novelty: 7
80. Neuron Platonic Intrinsic Representation From Dynamics Using Contrastive Learning
ArXiv ID: 2502.10425
Authors: Wei Wu, Can Liao, Zizhen Deng, Zhengrui Guo, Jinzhuo Wang
Abstract: The Platonic Representation Hypothesis suggests a universal, modality-independent reality representation behind different data modalities. Inspired by this, we view each neuron as a system and detect its multi-segment activity data under various peripheral conditions. We assume there's a time-invariant representation for the same neuron, reflecting its intrinsic properties like molecular profiles, location, and morphology. The goal of obtaining these intrinsic neuronal representations has two criteria: (I) segments from the same neuron should have more similar representations than those from different neurons; (II) the representations must generalize well to out-of-domain data. To meet these, we propose the NeurPIR (Neuron Platonic Intrinsic Representation) framework. It uses contrastive learning, with segments from the same neuron as positive pairs and those from different neurons as negative pairs. In implementation, we use VICReg, which focuses on positive pairs and separates dissimilar samples via regularization. We tested our method on Izhikevich model-simulated neuronal population dynamics data. The results accurately identified neuron types based on preset hyperparameters. We also applied it to two real-world neuron dynamics datasets with neuron type annotations from spatial transcriptomics and neuron locations. Our model's learned representations accurately predicted neuron types and locations and were robust on out-of-domain data (from unseen animals). This shows the potential of our approach for understanding neuronal systems and future neuroscience research.
Comment: This work applies contrastive learning to detect intrinsic neuron representations, aligning with 'Representation Learning' and exploring interpretability via neuronal dynamics. However, it is somewhat domain-specific.
Relevance: 7 Novelty: 7
81. Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling
ArXiv ID: 2502.11809
Authors: Yanbiao Ma, Bowei Liu, Wei Dai, Jiayi Chen, Shuo Li
Abstract: Deep neural networks (DNNs) often exhibit biases toward certain categories during object recognition, even under balanced training data conditions. The intrinsic mechanisms underlying these biases remain unclear. Inspired by the human visual system, which decouples object manifolds through hierarchical processing to achieve object recognition, we propose a geometric analysis framework linking the geometric complexity of class-specific perceptual manifolds in DNNs to model bias. Our findings reveal that differences in geometric complexity can lead to varying recognition capabilities across categories, introducing biases. To support this analysis, we present the Perceptual-Manifold-Geometry library, designed for calculating the geometric properties of perceptual manifolds.
Comment: Proposes geometric analysis for bias formation in DNNs, providing insights into representation learning influenced by visual decoupling mechanisms.
Relevance: 7 Novelty: 7
82. Cognitive Neural Architecture Search Reveals Hierarchical Entailment
ArXiv ID: 2502.11141
Authors: Lukas Kuhn, Sari Saba-Sadiya, Gemma Roig
Abstract: Recent research has suggested that the brain is more shallow than previously thought, challenging the traditionally assumed hierarchical structure of the ventral visual pathway. Here, we demonstrate that optimizing convolutional network architectures for brain-alignment via evolutionary neural architecture search results in models with clear representational hierarchies. Despite having random weights, the identified models achieve brain-alignment scores surpassing even those of pretrained classification models - as measured by both regression and representational similarity analysis. Furthermore, through traditional supervised training, architectures optimized for alignment with late ventral regions become competitive classification models. These findings suggest that hierarchical structure is a fundamental mechanism of primate visual processing. Finally, this work demonstrates the potential of neural architecture search as a framework for computational cognitive neuroscience research that could reduce the field's reliance on manually designed convolutional networks.
Comment: The paper explores a neural architecture search optimized for brain-alignment and analyses representational hierarchies, linking it to foundational research on model architecture and representation learning.
Relevance: 7 Novelty: 7
83. LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search
ArXiv ID: 2502.10459
Authors: Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang
Abstract: Graph Neural Architecture Search (GNAS) facilitates the automatic design of Graph Neural Networks (GNNs) tailored to specific downstream graph learning tasks. However, existing GNAS approaches often require manual adaptation to new graph search spaces, necessitating substantial code optimization and domain-specific knowledge. To address this challenge, we present LLM4GNAS, a toolkit for GNAS that leverages the generative capabilities of Large Language Models (LLMs). LLM4GNAS includes an algorithm library for graph neural architecture search algorithms based on LLMs, enabling the adaptation of GNAS methods to new search spaces through the modification of LLM prompts. This approach reduces the need for manual intervention in algorithm adaptation and code modification. The LLM4GNAS toolkit is extensible and robust, incorporating LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture search, and LLM-enhanced hyperparameter optimization. Experimental results indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving both homogeneous and heterogeneous graphs.
Comment: The paper introduces a toolkit for Graph Neural Architecture Search (GNAS) using LLMs, which aligns with 'Model Architecture' through its focus on automating GNN design.
Relevance: 7 Novelty: 7
84. PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection
ArXiv ID: 2502.12119
Authors: Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
Abstract: Visual instruction tuning refines pre-trained Multimodal Large Language Models (MLLMs) to enhance their real-world task performance. However, the rapid expansion of visual instruction datasets introduces significant data redundancy, leading to excessive computational costs. Existing data selection methods predominantly rely on proxy models or loss-based metrics, both of which impose substantial computational overheads due to the necessity of model inference and backpropagation. To address this challenge, we propose PRISM, a novel training-free approach for efficient multimodal data selection. Unlike existing methods, PRISM eliminates the reliance on proxy models, warm-up pretraining, and gradient-based optimization. Instead, it leverages Pearson correlation analysis to quantify the intrinsic visual encoding properties of MLLMs, computing a task-specific correlation score to identify high-value instances. This not only enables data-efficient selection but also maintains the original performance. Empirical evaluations across multiple MLLMs demonstrate that PRISM reduces the overall time required for visual instruction tuning and data selection to just 30% of conventional methods, while surpassing fully fine-tuned models across eight multimodal and three language understanding benchmarks, achieving a 101.7% relative improvement in final performance.
Comment: PRISM introduces a training-free method for data selection, leveraging intrinsic properties of MLLMs, making it somewhat relevant in the context of model efficiency and data pruning.
Relevance: 7 Novelty: 7
85. Fast Proxies for LLM Robustness Evaluation
ArXiv ID: 2502.10487
Authors: Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann
Abstract: Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without requiring runs of the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct prompting. Even though direct prompting in particular does not achieve high ASR, we find that it and embedding-space attacks can predict attack success rates well, achieving $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.
Comment: The paper is about fast proxy metrics for evaluating LLM robustness against adversarial attacks. While the topic of robustness is related to foundational work, this specific contribution seems more empirical and evaluation-focused without deep theoretical insights.
Relevance: 7 Novelty: 6
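Illustrative sketch for entry 85 (not taken from the paper): the abstract reports a linear (Pearson) correlation $r_p$ and a Spearman rank correlation $r_s$ between cheap proxy metrics and the full attack ensemble. The snippet shows how the two coefficients are computed, using only the standard library. The attack success rates below are invented for illustration and are not the paper's data.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-model attack success rates (ASR).
proxy_asr = [0.05, 0.10, 0.20, 0.40, 0.45]  # cheap proxy attack
full_asr = [0.10, 0.30, 0.35, 0.80, 0.85]   # expensive attack ensemble

r_p = pearson(proxy_asr, full_asr)
r_s = spearman(proxy_asr, full_asr)
print("Pearson r_p:", round(r_p, 3))
print("Spearman r_s:", round(r_s, 3))
```

Spearman depends only on the ordering of models, so a proxy can rank models well even when its absolute ASR is much lower than that of the full ensemble.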
86. Large Language Models and Mathematical Reasoning Failures
ArXiv ID: 2502.11574
Authors: Johan Boye, Birger Moell
Abstract: This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models - including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants - we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite possessing broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
Comment: The paper analyzes reasoning failures in LLMs, which aligns with the criterion of theoretical insights into LLM behavior. However, it focuses on empirical evaluation rather than introducing new methods or theories.
Relevance: 7 Novelty: 6
87. A recurrent vision transformer shows signatures of primate visual attention
ArXiv ID: 2502.10955
Authors: Jonathan Morgan, Badr Albanna, James P. Herman
Abstract: Attention is fundamental to both biological and artificial intelligence, yet research on animal attention and AI self-attention remains largely disconnected. We propose a Recurrent Vision Transformer (Recurrent ViT) that integrates self-attention with recurrent memory, allowing both current inputs and stored information to guide attention allocation. Trained solely via sparse reward feedback on a spatially cued orientation change detection task, a paradigm used in primate studies, our model exhibits primate-like signatures of attention, including improved accuracy and faster responses for cued stimuli that scale with cue validity. Analysis of self-attention maps reveals dynamic spatial prioritization with reactivation prior to expected changes, and targeted perturbations produce performance shifts similar to those observed in primate frontal eye fields and superior colliculus. These findings demonstrate that incorporating recurrent feedback into self-attention can capture key aspects of primate visual attention.
Comment: The paper discusses an architectural innovation with a recurrent mechanism in vision transformers, which aligns with the 'Model Architecture' criterion.
Relevance: 7 Novelty: 6
88. ADO: Automatic Data Optimization for Inputs in LLM Prompts
ArXiv ID: 2502.11436
Authors: Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang
Abstract: This study explores a novel approach to enhance the performance of Large Language Models (LLMs) through the optimization of input data within prompts. While previous research has primarily focused on refining instruction components and augmenting input data with in-context examples, our work investigates the potential benefits of optimizing the input data itself. We introduce a two-pronged strategy for input data optimization: content engineering and structural reformulation. Content engineering involves imputing missing values, removing irrelevant attributes, and enriching profiles by generating additional information inferred from existing attributes. Subsequent to content engineering, structural reformulation is applied to optimize the presentation of the modified content to LLMs, given their sensitivity to input format. Our findings suggest that these optimizations can significantly improve the performance of LLMs in various tasks, offering a promising avenue for future research in prompt engineering. The source code is available at https://anonymous.4open.science/r/ADO-6BC5/
Comment: The paper explores input data optimization for LLM prompts, which aligns with 'Representation Learning' through its focus on improving input representations.
Relevance: 7 Novelty: 6
89. An Empirical Analysis of Uncertainty in Large Language Model Evaluations
ArXiv ID: 2502.10709
Authors: Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang
Abstract: As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. With careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance LLM's reliability and detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an uncertainty-aware LLM evaluator named ConfiLM using a human-annotated fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually designed test set sourced from the 2024 Olympics. Experimental results demonstrate that incorporating uncertainty as additional information during the fine-tuning phase can largely improve the model's evaluation performance in OOD scenarios. The code and data are released at: https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty.
Comment: The paper examines uncertainty in LLM evaluations, which touches on interpretability and reliability of foundational models. It remains empirical without groundbreaking theoretical insights.
Relevance: 7 Novelty: 6
90. MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training
ArXiv ID: 2502.11541
Authors: Hui Huang, Jiaheng Liu, Yancheng He, Shilong Li, Bing Xu, Conghui Zhu, Muyun Yang, Tiejun Zhao
Abstract: Complex instruction-following with elaborate constraints is imperative for Large Language Models (LLMs). While existing methods have constructed data for complex instruction alignment, they all rely on a more advanced model, especially GPT-4, limiting their application. In this paper, we propose a Multi-granularity Self-Contrastive Training (MuSC) framework, to improve the complex instruction alignment without relying on a stronger model. Our method is conducted on both coarse and fine granularity. On coarse-granularity, we construct constraint-aware preference data based on instruction decomposition and recombination. On fine-granularity, we perform token-aware preference optimization with dynamic token-level supervision. Our method is evaluated on open-sourced models, and experiment results show our method achieves significant improvement on both complex and general instruction-following benchmarks, surpassing previous self-alignment methods.
Comment: MuSC proposes a novel multi-granularity self-contrastive training regime relevant to LLM instruction alignment, though it leans more on practical enhancements than theoretical innovation.
Relevance: 7 Novelty: 6
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
- Representation Learning
  - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
  - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
- Model Architecture
  - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations.
  - Irrelevant: Merely using existing architectures for a certain task without insights into the structures themselves.
- Model Compression
  - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
  - Irrelevant: Straightforward applications of existing compression methods to new tasks.
- Large Language Models (LLMs)
  - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability.
  - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g., hallucination, reasoning, safety).
- AI for Science
  - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
  - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
- Emerging Trends
  - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
  - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application Work: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
  - Focus: Fully aligned with core topics with no deviation; score highest if it contains relevant keywords.
  - Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
- Relevance 7-8 (Relevant)
  - Focus: Retains a solid link to the main research area, though may touch on peripheral elements.
  - Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.
- Relevance 5-6 (Borderline)
  - Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  - Examples: Work referencing MoE centered on reinforcement learning.
- Relevance 3-4 (Irrelevant)
  - Focus: Largely outside our interests with no association to our topics.
  - Examples: Application-focused papers, such as those using MoE to solve a real-world problem.
- Relevance 1-2 (Ignore)
  - Focus: Purely unrelated to our topics. Completely a different domain.
  - Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
  - Definition: Groundbreaking methods or theory that introduce new directions or solve major challenges.
  - Examples: An entirely new paradigm for foundation models; a novel theory transforming representation learning.
- Novelty 7-8 (Improvements)
  - Definition: Substantial insights or enhancements, though not a full paradigm shift.
  - Examples: Modifications to existing methods that yield significantly better results.
- Novelty 5-6 (Borderline)
  - Definition: Incremental contributions with possible long-term benefits that are not immediately transformative.
  - Examples: A moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
- Novelty 3-4 (Tangential)
  - Definition: Minor or domain-specific improvements with limited broader impact.
  - Examples: Slight modifications to known methods with questionable motivation; purely engineering work such as a new benchmark or dataset.
- Novelty 1-2 (Low)
  - Definition: Minimal originality; applies standard approaches without real innovation.
  - Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies such as fine-tuning a pretrained model with existing methods.
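As a rough illustration of how the two rubrics partition the 1-10 range, the bands can be written as lookup tables. The band labels are copied from the rubric above; the names `RELEVANCE_BANDS`, `NOVELTY_BANDS`, and `band_label` are hypothetical and exist only for this sketch.

```python
# Score bands taken from the rubric above. The table and function names
# are illustrative assumptions, not part of the original prompt.
RELEVANCE_BANDS = {
    (9, 10): "Completely Relevant",
    (7, 8): "Relevant",
    (5, 6): "Borderline",
    (3, 4): "Irrelevant",
    (1, 2): "Ignore",
}

NOVELTY_BANDS = {
    (9, 10): "Breakthrough",
    (7, 8): "Improvements",
    (5, 6): "Borderline",
    (3, 4): "Tangential",
    (1, 2): "Low",
}


def band_label(score, bands):
    """Return the rubric label whose inclusive range contains `score`."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be an integer")
    for (low, high), label in bands.items():
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} is outside the 1-10 range")


# The two axes are looked up independently: a paper can be highly
# relevant and only incrementally novel, or the reverse.
print(band_label(9, RELEVANCE_BANDS), "/", band_label(5, NOVELTY_BANDS))
```

Keeping the two tables separate mirrors the rubric's point that relevance and novelty are scored on independent axes.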
Papers
[PAPER LIST HERE]
Instructions
Write the response in JSONL format, one line per paper, with the keys {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether any criterion matches the paper very closely. Matches should refer to a specific criterion, not to general terms like "language modeling" or "advancements". There is no need to mention non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
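A response in this format can be checked line by line. The sketch below is one possible validator, assuming the scores are returned as JSON integers; the function name `parse_response` and the sample records (including the placeholder IDs `0000.00001` and `0000.00002`) are made up for illustration and do not refer to real papers.

```python
import json

# The four keys named in the instructions above.
REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")


def parse_response(text):
    """Parse a JSONL response and validate each record (hypothetical helper)."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)  # raises a ValueError subclass on bad JSON
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            # Assumption: scores are integers in 1..10 (bool is rejected).
            if isinstance(score, bool) or not isinstance(score, int):
                raise ValueError(f"line {line_number}: {key} must be an integer")
            if not 1 <= score <= 10:
                raise ValueError(f"line {line_number}: {key} must be in 1-10")
        records.append(record)
    return records


# Placeholder records for illustration only.
sample_response = "\n".join([
    '{"ARXIVID": "0000.00001", "COMMENT": "Matches the Model Compression criterion.", "RELEVANCE": 9, "NOVELTY": 7}',
    '{"ARXIVID": "0000.00002", "COMMENT": "Application work; no criterion matches closely.", "RELEVANCE": 3, "NOVELTY": 4}',
])

for record in parse_response(sample_response):
    print(record["ARXIVID"], record["RELEVANCE"], record["NOVELTY"])
```

Validating each line separately means a single malformed record can be reported by line number without discarding the rest of the response by hand.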