Personalized Daily Arxiv Papers 02/18/2025

[gpt-4o]	Prompt	Completion	Total
Token	94745	14401	109146
Cost	$0.24	$0.14	$0.38

Total ArXiv papers: 1184

Total scanned papers: 681

Total relevant papers: 90

Table of contents with paper titles:

Intuitive physics understanding emerges from self-supervised pretraining on natural videos Authors: Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
In-Context Parametric Inference: Point or Distribution Estimators? Authors: Sarthak Mittal, Yoshua Bengio, Nikolay Malkin, Guillaume Lajoie
Mixture of Tunable Experts - Behavior Modification of DeepSeek-R1 at Inference Time Authors: Robert Dahlke, Henrik Klagges, Dan Zecha, Benjamin Merkel, Sven Rohr, Fabian Klemm
Bitnet.cpp: Efficient Edge Inference for Ternary LLMs Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
System Message Generation for User Preferences using Open-Source Models Authors: Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong
A Power Transform Authors: Jonathan T. Barron
Does Editing Provide Evidence for Localization? Authors: Zihao Wang, Victor Veitch
Controlling Neural Collapse Enhances Out-of-Distribution Detection and Transfer Learning Authors: Md Yousuf Harun, Jhair Gallardo, Christopher Kanan
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia Authors: Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre K{\i}c{\i}man, Hamid Palangi, Barun Patra, Robert West
An Efficient Row-Based Sparse Fine-Tuning Authors: Cen-Jhih Li, Aditya Bhaskara
Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan
Large Language-Geometry Model: When LLM meets Equivariance Authors: Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao
Sparse Autoencoder Features for Classifications and Transferability Authors: Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation Authors: Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Zi Yang, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws Authors: Prasanna Mayilvahanan, Thadd\"aus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval Authors: Ting-Rui Chiang, Dani Yogatama
The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training Authors: Matteo Saponati, Pascal Sager, Pau Vilimelis Aceituno, Thilo Stadelmann, Benjamin Grewe
CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation Authors: Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis Authors: Ge Lei, Samuel J. Cooper
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis Authors: Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
Weighted quantization using MMD: From mean field to mean shift via gradient flows Authors: Ayoub Belhadji, Daniel Sharp, Youssef Marzouk
Neural Interpretable Reasoning Authors: Pietro Barbiero, Giuseppe Marra, Gabriele Ciravegna, David Debot, Francesco De Santis, Michelangelo Diligenti, Mateo Espinosa Zarlenga, Francesco Giannini
Atom of Thoughts for Markov LLM Test-Time Scaling Authors: Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache Authors: Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size Authors: Naoki Takeshita, Masaaki Imaizumi
Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise Authors: Ilias Diakonikolas, Mingchen Ma, Lisheng Ren, Christos Tzamos
Teleportation With Null Space Gradient Projection for Optimization Acceleration Authors: Zihao Wu, Juncheng Dong, Ahmed Aloui, Vahid Tarokh
Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning Authors: Libo Wang
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding Authors: Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin
Exact Upper and Lower Bounds for the Output Distribution of Neural Networks with Random Inputs Authors: Andrey Kofnov, Daniel Kapla, Ezio Bartocci, Efstathia Bura
AdaSplash: Adaptive Sparse Flash Attention Authors: Nuno Gon\c{c}alves, Marcos Treviso, Andr\'e F. T. Martins
One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu
Ansatz-free Hamiltonian learning with Heisenberg-limited scaling Authors: Hong-Ye Hu, Muzhou Ma, Weiyuan Gong, Qi Ye, Yu Tong, Steven T. Flammia, Susanne F. Yelin
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It Authors: Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
The geometry of BERT Authors: Matteo Bonino, Giorgia Ghione, Giansalvo Cirrincione
A Mathematics Framework of Artificial Shifted Population Risk and Its Further Understanding Related to Consistency Regularization Authors: Xiliang Yang, Shenyang Deng, Shicong Liu, Yuanchi Suo, Wing. W. Y NG, Jianjun Zhang
From Layers to States: A State Space Model Perspective to Deep Neural Network Layer Dynamics Authors: Qinshuo Liu, Weiqin Zhao, Wei Huang, Yanwen Fang, Lequan Yu, Guodong Li
LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model Merging Authors: Zehua Liu, Han Wu, Yuxuan Yao, Ruifeng She, Xiongwei Han, Tao Zhong, Mingxuan Yuan
Low-Rank Thinning Authors: Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, Lester Mackey
On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning Authors: \'Alvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, Pierre Vandergheynst
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training Authors: Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen
On the Query Complexity of Verifier-Assisted Language Generation Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs Authors: Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens Authors: Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models Authors: Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du
Generalization of the Gibbs algorithm with high probability at low temperatures Authors: Andreas Maurer
Learning the Exact Time Integration Algorithm for Initial Value Problems by Randomized Neural Networks Authors: Suchuan Dong, Naxian Ni
Continuous Diffusion Model for Language Modeling Authors: Jaehyeong Jo, Sung Ju Hwang
MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models Authors: Zhen Zhang, Yifan Yang, Kai Zhen, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang
How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines Authors: Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
Meta-Statistical Learning: Supervised Learning of Statistical Inference Authors: Maxime Peyrard, Kyunghyun Cho
GRIFFIN: Effective Token Alignment for Faster Speculative Decoding Authors: Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh, Pan Zhou
Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning Authors: Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, Yin Wei
Diversified Sampling Improves Scaling LLM inference Authors: Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Haifeng Chen, Xiang Zhang, Wei Cheng
Uncertainty-Aware Search and Value Models: Mitigating Search Scaling Flaws in LLMs Authors: Fei Yu, Yingru Li, Benyou Wang
Towards Reasoning Ability of Small Language Models Authors: Gaurav Srivastava, Shuxiang Cao, Xuan Wang
DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning Authors: Huanxuan Liao, Shizhu He, Yupu Hao, Jun Zhao, Kang Liu
AdaGC: Improving Training Stability for Large Language Model Pretraining Authors: Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Tao Sun, Yanjun Ma, Dianhai Yu, Li Shen
Why is prompting hard? Understanding prompts on binary sequence predictors Authors: Li Kevin Wenliang, Anian Ruoss, Jordi Grau-Moya, Marcus Hutter, Tim Genewein
Superpose Singular Features for Model Merging Authors: Haiquan Qiu, You Wu, Quanming Yao
Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning Authors: Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
Logarithmic Width Suffices for Robust Memorization Authors: Amitsour Egosi, Gilad Yehudai, Ohad Shamir
Deep Incomplete Multi-view Learning via Cyclic Permutation of VAEs Authors: Xin Gao, Jian Pu
Towards Watermarking of Open-Source LLMs Authors: Thibaud Gloaguen, Nikola Jovanovi\'c, Robin Staab, Martin Vechev
ReLearn: Unlearning via Learning for Large Language Models Authors: Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
Minimal Ranks, Maximum Confidence: Parameter-efficient Uncertainty Quantification for LoRA Authors: Patryk Marsza{\l}ek, Klaudia Ba{\l}azy, Jacek Tabor, Tomasz Ku\'smierczyk
Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models? Authors: Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke
The Relationship between No-Regret Learning and Online Conformal Prediction Authors: Ramya Ramalingam, Shayan Kiyani, Aaron Roth
Learning Identifiable Structures Helps Avoid Bias in DNN-based Supervised Causal Learning Authors: Jiaru Zhang, Rui Ding, Qiang Fu, Bojun Huang, Zizhen Deng, Yang Hua, Haibing Guan, Shi Han, Dongmei Zhang
K-Edit: Language Model Editing with Contextual Knowledge Awareness Authors: Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation Authors: Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman
LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization Authors: Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, Robert Tibshirani
ReReLRP - Remembering and Recognizing Tasks with LRP Authors: Karolina Bogacka, Maximilian H\"ofler, Maria Ganzha, Wojciech Samek, Katarzyna Wasielewska-Michniewska
On the kernel learning problem Authors: Yang Li, Feng Ruan
Error Bound Analysis for the Regularized Loss of Deep Linear Neural Networks Authors: Po Chen, Rujun Jiang, Peng Wang
Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL Authors: Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu
MixMin: Finding Data Mixtures via Convex Minimization Authors: Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison
Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving Authors: Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
Neuron Platonic Intrinsic Representation From Dynamics Using Contrastive Learning Authors: Wei Wu, Can Liao, Zizhen Deng, Zhengrui Guo, Jinzhuo Wang
Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling Authors: Yanbiao Ma, Bowei Liu, Wei Dai, Jiayi Chen, Shuo Li
Cognitive Neural Architecture Search Reveals Hierarchical Entailment Authors: Lukas Kuhn, Sari Saba-Sadiya, Gemma Roig
LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search Authors: Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection Authors: Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
Fast Proxies for LLM Robustness Evaluation Authors: Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan G\"unnemann
Large Language Models and Mathematical Reasoning Failures Authors: Johan Boye, Birger Moell
A recurrent vision transformer shows signatures of primate visual attention Authors: Jonathan Morgan, Badr Albanna, James P. Herman
ADO: Automatic Data Optimization for Inputs in LLM Prompts Authors: Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang
An Empirical Analysis of Uncertainty in Large Language Model Evaluations Authors: Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang
MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training Authors: Hui Huang, Jiaheng Liu, Yancheng He, Shilong Li, Bing Xu, Conghui Zhu, Muyun Yang, Tiejun Zhao

1. Intuitive physics understanding emerges from self-supervised pretraining on natural videos

ArXiv ID: 2502.11831

Authors: Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun

Abstract: We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.

Comment: Author match

2. In-Context Parametric Inference: Point or Distribution Estimators?

ArXiv ID: 2502.11617

Authors: Sarthak Mittal, Yoshua Bengio, Nikolay Malkin, Guillaume Lajoie

Abstract: Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.

Comment: Author match

3. Mixture of Tunable Experts - Behavior Modification of DeepSeek-R1 at Inference Time

ArXiv ID: 2502.11096

Authors: Robert Dahlke, Henrik Klagges, Dan Zecha, Benjamin Merkel, Sven Rohr, Fabian Klemm

Abstract: We present the Mixture-of-Tunable-Experts (MoTE), a method that extends the Mixture-of-Experts architecture of Large Language Models (LLMs). Without additional training, MoTE enables meaningful and focused behavior changes in LLMs on-the-fly during inference time. By analyzing the digital LLM brain of DeepSeek-R1 using a technique we dub 'functional Token Resonance Imaging' (fTRI) - inspired by fMRI and using prompts designed to elicit specific behavior (e.g., 'What happened {time}{place}?') - we empirically identify distinctive experts associated with behaviors like refusal responses. Using MoTE we are able to intervene and control such specific behavior. We switched off the top 10 most refusal-relevant experts (0.07% of R1's 14,848 routed experts), achieving a 52% refusal reduction on sensitive reference prompts without performance degradation on MT-Bench. Random expert deactivation resulted in smaller behavioral shifts with increased noise, whereas forced expert activation led to significantly higher refusal rates. Our approach shares similarities with sparse autoencoders (SAEs) in terms of explainability and steerability. Unlike SAEs, MoTE does not require large training efforts, as within MoEs with a vast number of experts, specialization already emerged naturally during pretraining. Our findings suggest that significant functional mechanisms in Mixture-of-Experts architectures can at least partially be localized in a small number of specific experts, rather than being distributed throughout the model's weights. Expert subgroups can be tuned to trigger significant behavior variations, providing insights into the inner workings of LLMs.

Comment: The paper focuses on Mixture-of-Experts (MoE) and provides insights into the behavior and control of specific experts in LLMs, aligning closely with the 'Model Architecture' and 'Representation Learning' criteria.

Relevance: 10 Novelty: 8

4. Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

ArXiv ID: 2502.11880

Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei

Abstract: The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper , offering a sophisticated solution for the efficient and practical deployment of edge LLMs.

Comment: Introduces groundbreaking system Bitnet.cpp enabling efficient inference for ternary LLMs, directly relevant to model compression and efficiency topics.

Relevance: 9 Novelty: 9

5. System Message Generation for User Preferences using Open-Source Models

ArXiv ID: 2502.11330

Authors: Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong

Abstract: System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, specify various output formats and communication styles. Despite such versatility, publicly available data are often lack system messages and subject to strict license constraints in the industry field. Manual labeling of publicly available data with system messages that align with user instructions demands significant resources. In view of such challenges, our work introduces SysGen, a pipeline for generating system messages with better aligned assistant responses from the supervised fine-tuning dataset without system messages. Training on SysGen data has demonstrated substantial improvements in the alignment of model responses with system messages and user instructions, as demonstrated across various open-source models on the Multifacet benchmark, while maintaining minimal impact on other unseen benchmarks such as Open LLM Leaderboard 2. Our qualitative analysis highlights the importance of diverse system messages to ensure better adaptability across different contexts.

Comment: The paper introduces Inverse Flow for generative models, which aligns with foundational research in representation learning and generative paradigms. The proposed methods (IFM and ICM) are novel and impactful.

Relevance: 9 Novelty: 9

6. A Power Transform

ArXiv ID: 2502.10647

Authors: Jonathan T. Barron

Abstract: Power transforms, such as the Box-Cox transform and Tukey's ladder of powers, are a fundamental tool in mathematics and statistics. These transforms are primarily used for normalizing and standardizing datasets, effectively by raising values to a power. In this work I present a novel power transform, and I show that it serves as a unifying framework for wide family of loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions.

Comment: The novel power transform framework connects across loss functions, activations, and kernels, offering a significant theoretical contribution to foundational methods like representation learning.

Relevance: 9 Novelty: 9

7. Does Editing Provide Evidence for Localization?

ArXiv ID: 2502.11447

Authors: Zihao Wang, Victor Veitch

Abstract: A basic aspiration for interpretability research in large language models is to "localize" semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To assess localization, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.

Comment: The paper critically examines interpretability in LLMs by analyzing the evidence provided by localized edits, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8

8. Controlling Neural Collapse Enhances Out-of-Distribution Detection and Transfer Learning

ArXiv ID: 2502.10691

Authors: Md Yousuf Harun, Jhair Gallardo, Christopher Kanan

Abstract: Out-of-distribution (OOD) detection and OOD generalization are widely studied in Deep Neural Networks (DNNs), yet their relationship remains poorly understood. We empirically show that the degree of Neural Collapse (NC) in a network layer is inversely related with these objectives: stronger NC improves OOD detection but degrades generalization, while weaker NC enhances generalization at the cost of detection. This trade-off suggests that a single feature space cannot simultaneously achieve both tasks. To address this, we develop a theoretical framework linking NC to OOD detection and generalization. We show that entropy regularization mitigates NC to improve generalization, while a fixed Simplex Equiangular Tight Frame (ETF) projector enforces NC for better detection. Based on these insights, we propose a method to control NC at different DNN layers. In experiments, our method excels at both tasks across OOD datasets and DNN architectures.

Comment: The paper explores Neural Collapse and its impact on OOD detection and generalization, providing theoretical insights into representation learning and training dynamics.

Relevance: 9 Novelty: 8

9. A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia

ArXiv ID: 2312.02073

Authors: Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre K{\i}c{\i}man, Hamid Palangi, Barun Patra, Robert West

Abstract: Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context. Yet the mechanisms underlying this contextual grounding remain unknown, especially in situations where contextual information contradicts factual knowledge stored in the parameters, which LLMs also excel at recalling. Favoring the contextual information is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify outdated or noisy stored knowledge. We present a novel method to study grounding abilities using Fakepedia, a novel dataset of counterfactual texts constructed to clash with a model's internal parametric knowledge. In this study, we introduce Fakepedia, a counterfactual dataset designed to evaluate grounding abilities when the internal parametric knowledge clashes with the contextual information. We benchmark various LLMs with Fakepedia and conduct a causal mediation analysis of LLM components when answering Fakepedia queries, based on our Masked Grouped Causal Tracing (MGCT) method. Through this analysis, we identify distinct computational patterns between grounded and ungrounded responses. We finally demonstrate that distinguishing grounded from ungrounded responses is achievable through computational analysis alone. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.

Comment: The paper investigates grounding mechanisms in LLMs using a novel dataset, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8

10. An Efficient Row-Based Sparse Fine-Tuning

ArXiv ID: 2502.11439

Authors: Cen-Jhih Li, Aditya Bhaskara

Abstract: Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SFT framework, based on ideas from neural network pruning. At a high level, we first identify "important" neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Using experiments on common language tasks, we demonstrate that our method significantly improves the memory efficiency of SFT without increasing training time complexity and implementation complexity, while achieving accuracy comparable to state-of-the-art methods such as LoRA and its variants.

Comment: The paper proposes a sparse fine-tuning framework based on pruning, which is highly relevant to model compression and efficiency research.

Relevance: 9 Novelty: 8

11. Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity

ArXiv ID: 2502.11147

Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving reasoning tasks often requires long decoding chains (of thoughts), which incur $O(N)$ time and memory consumption, where $N$ is the chain length. To mitigate $O(N)$ time and memory consumption, existing sparsity-based algorithms propose retaining only the most critical token's intermediate data (i.e., key-value cache) and discarding the rest. However, these existing algorithms struggle with the ``impossible trinity'' of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy with $O(L)$ time but $O(N)$ memory ($L$ is the cache budget, $L \ll N$). To address this issue, in this paper, we identify a new attention pattern during the decode stage of reasoning tasks, where milestone tokens (analogous to lemmas in mathematical proofs) emerge, are utilized, and then become unimportant afterward. Based on this pattern, we propose a new algorithm named RaaS that identifies and retains milestone tokens only until they are no longer needed, achieving high accuracy with $O(L)$ time and $O(L)$ memory complexity.

Comment: The paper proposes a reasoning-aware attention sparsity method for efficient long-decoding inference, which is highly relevant to foundational research in LLM efficiency and sparsity.

Relevance: 9 Novelty: 8

12. Large Language-Geometry Model: When LLM meets Equivariance

ArXiv ID: 2502.11149

Authors: Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao

Abstract: Accurately predicting 3D structures and dynamics of physical systems is crucial in scientific applications. Existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, but they often fall in leveraging extensive broader information. While direct application of Large Language Models (LLMs) can incorporate external knowledge, they lack the capability for spatial reasoning with guaranteed equivariance. In this paper, we propose EquiLLM, a novel framework for representing 3D physical systems that seamlessly integrates E(3)-equivariance with LLM capabilities. Specifically, EquiLLM comprises four key components: geometry-aware prompting, an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the LLM guided by the instructive prompt serves as a sophisticated invariant feature processor, while 3D directional information is exclusively handled by the equivariant encoder and adaptor modules. Experimental results demonstrate that EquiLLM delivers significant improvements over previous methods across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.

Comment: The paper proposes a novel framework integrating E(3)-equivariance with LLM capabilities for handling 3D physical systems. It introduces architectural innovations aligning with foundational AI for Science.

Relevance: 9 Novelty: 8

13. Sparse Autoencoder Features for Classifications and Transferability

ArXiv ID: 2502.11367

Authors: Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Abstract: Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAE for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.

Comment: Sparse autoencoders are explored for feature learning, which relates closely to 'Representation Learning' and 'Model Compression', particularly given the focus on sparsity and transferable features.

Relevance: 9 Novelty: 8

14. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

ArXiv ID: 2502.11089

Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng

Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

Comment: The paper discusses a hardware-aligned sparse attention mechanism, relevant to 'Model Compression' due to its sparsity and efficiency focus.

Relevance: 9 Novelty: 8

15. CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation

ArXiv ID: 2502.10940

Authors: Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Zi Yang, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang

Abstract: Large language models (LLMs) are revolutionizing many science and engineering fields. However, their huge model sizes impose extremely demanding needs of computational resources in the pre-training stage. Although low-rank factorizations can reduce model parameters, their direct application in LLM pre-training often lead to non-negligible performance loss. To address this fundamental challenge, we introduce CoLA and its memory-efficient implementation, CoLA-M. We leverage the low-rank structure observed widely in model activations, enforcing non-linear transformations between factorized weight matrices to reduce model size, boost model capacity and training efficiency. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $\bf 2\pmb{\times}$ and improves training throughput by $\bf 1.86\pmb{\times}$ while maintaining full-rank level performance. CoLA-M further squeezes memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency. The LLMs produced are also $\bf 2\pmb{\times}$ smaller, enabling faster inference with lower memory cost on resource-constrained platforms

Comment: Introduces a low-rank activation mechanism to pre-train LLMs more efficiently, aligning with the model compression and foundational enhancements for training efficiency.

Relevance: 9 Novelty: 8

16. LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

ArXiv ID: 2502.12120

Authors: Prasanna Mayilvahanan, Thadd\"aus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

Abstract: Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

Comment: Provides theoretical insights into loss-to-loss scaling laws for LLMs, deeply relevant for foundational research into training dynamics and scalability.

Relevance: 9 Novelty: 8

17. The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval

ArXiv ID: 2502.11276

Authors: Ting-Rui Chiang, Dani Yogatama

Abstract: The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLM). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. For long context modeling, the range of positions may vary a lot, and thus RoPE rotates some dimensions by a great range of angles. We hypothesize that the wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses on three LLMs also indicate that these dimensions do not help LLMs do long-context question answering.

Comment: The paper investigates the Rotary Position Embedding (RoPE) and its inefficiencies in long-distance retrieval, which aligns with foundational research on LLM behavior and interpretability.

Relevance: 9 Novelty: 8

18. The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training

ArXiv ID: 2502.10927

Authors: Matteo Saponati, Pascal Sager, Pau Vilimelis Aceituno, Thilo Stadelmann, Benjamin Grewe

Abstract: Self-attention is essential to Transformer architectures, yet how information is embedded in the self-attention matrices and how different objective functions impact this process remains unclear. We present a mathematical framework to analyze self-attention matrices by deriving the structures governing their weight updates. Using this framework, we demonstrate that bidirectional training induces symmetry in the weight matrices, while autoregressive training results in directionality and column dominance. Our theoretical findings are validated across multiple Transformer models - including ModernBERT, GPT, LLaMA3, and Mistral - and input modalities like text, vision, and audio. Finally, we apply these insights by showing that symmetric initialization improves the performance of encoder-only models on language tasks. This mathematical analysis offers a novel theoretical perspective on how information is embedded through self-attention, thereby improving the interpretability of Transformer models.

Comment: The paper provides a mathematical framework to analyze self-attention matrices, which aligns with foundational research on Transformer architectures and their training dynamics.

Relevance: 9 Novelty: 8

19. CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation

ArXiv ID: 2502.11101

Authors: Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na

Abstract: Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches\textemdash such as relative positional encodings (e.g., RoPE, ALiBi) and sliding window mechanisms\textemdash partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs. In this paper, we introduce \textbf{\textit{CacheFocus}}, a method that enhances length normalization and reduces inference latency without any further training. Our approach leverages query-independent, offline caching to efficiently reuse a Context KV Cache Store. We address the amplification of abnormal token distributions problem by re-positioning cached keys and introducing Layer-Adaptive Cache Pruning to discard low-relevance caches during pre-filling. Additionally, our Adaptive Positional Allocation Strategy dynamically reassigns cache positions to maximize the use of the available positional encoding range. Experiments on the Natural Questions and TriviaQA datasets demonstrate that CacheFocus outperforms alternative methods even when inputs exceed the $4$K limit of the \texttt{LLaMA-2} model, emphasizing its practical effectiveness for long-context LLMs. Moreover, even with large maximum input length of \texttt{Qwen2}, the performance of CacheFocus shows that it maintains consistent performance even as the number of documents increases, effectively managing long-text generation without degradation.

Comment: The paper introduces a novel cache management approach, addressing foundational challenges in model efficiency for long-context LLMs.

Relevance: 9 Novelty: 8

20. The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis

ArXiv ID: 2502.10871

Authors: Ge Lei, Samuel J. Cooper

Abstract: This study investigates how large language models (LLMs) represent and recall multi-associated attributes across transformer layers. We show that intermediate layers encode factual knowledge by superimposing related attributes in overlapping spaces, along with effective recall even when attributes are not explicitly prompted. In contrast, later layers refine linguistic patterns and progressively separate attribute representations, optimizing task-specific outputs while appropriately narrowing attribute recall. We identify diverse encoding patterns including, for the first time, the observation of 3D spiral structures when exploring information related to the periodic table of elements. Our findings reveal a dynamic transition in attribute representations across layers, contributing to mechanistic interpretability and providing insights for understanding how LLMs handle complex, interrelated knowledge.

Comment: Analyzes multi-layer and geometric encoding patterns in LLMs, offering insights into representation dynamics, which strongly aligns with foundational research criteria.

Relevance: 9 Novelty: 8

21. Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

ArXiv ID: 2502.11812

Authors: Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou

Abstract: Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, which are closer to the practical setting. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, which is in contrast to the previous work \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} that show circuits only add some additional components after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method, which assigns ranks to layers based on edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA algorithm achieves an average performance improvement of 2.46\% over standard LoRA with similar parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, providing new insights into the design of such tasks and deepening the understanding of circuit dynamics and fine-tuning mechanisms.

Comment: Provides a mechanistic interpretability analysis of fine-tuning in LLMs and proposes novel circuit-aware LoRA adaptations for performance gains.

Relevance: 9 Novelty: 8

22. Weighted quantization using MMD: From mean field to mean shift via gradient flows

ArXiv ID: 2502.10600

Authors: Ayoub Belhadji, Daniel Sharp, Youssef Marzouk

Abstract: Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a finite weighted mixture of Dirac measures that best approximates the target distribution. While much existing work relies on the Wasserstein distance to quantify approximation errors, maximum mean discrepancy (MMD) has received comparatively less attention, especially when allowing for variable particle weights. We study the quantization problem from the perspective of minimizing MMD via gradient flow in the Wasserstein-Fisher-Rao (WFR) geometry. This gradient flow yields an ODE system from which we further derive a fixed-point algorithm called mean shift interacting particles (MSIP). We show that MSIP extends the (non-interacting) mean shift algorithm, widely used for identifying modes in kernel density estimates. Moreover, we show that MSIP can be interpreted as preconditioned gradient descent, and that it acts as a relaxation of Lloyd's algorithm for clustering. Our numerical experiments demonstrate that MSIP and the WFR ODEs outperform other algorithms for quantization of multi-modal and high-dimensional targets.

Comment: The paper introduces a novel quantization method using MMD and gradient flows, which aligns with the model compression criterion. The proposed MSIP algorithm and its theoretical grounding add significant novelty.

Relevance: 9 Novelty: 8

23. Neural Interpretable Reasoning

ArXiv ID: 2502.11639

Authors: Pietro Barbiero, Giuseppe Marra, Gabriele Ciravegna, David Debot, Francesco De Santis, Michelangelo Diligenti, Mateo Espinosa Zarlenga, Francesco Giannini

Abstract: We formalize a novel modeling framework for achieving interpretability in deep learning, anchored in the principle of inference equivariance. While the direct verification of interpretability scales exponentially with the number of variables of the system, we show that this complexity can be mitigated by treating interpretability as a Markovian property and employing neural re-parametrization techniques. Building on these insights, we propose a new modeling paradigm -- neural generation and interpretable execution -- that enables scalable verification of equivariance. This paradigm provides a general approach for designing Neural Interpretable Reasoners that are not only expressive but also transparent.

Comment: The paper introduces a novel framework for interpretable reasoning in neural networks, which aligns with representation learning and interpretability. The Markovian property and neural re-parametrization add theoretical depth.

Relevance: 9 Novelty: 8

24. Atom of Thoughts for Markov LLM Test-Time Scaling

ArXiv ID: 2502.12018

Authors: Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo

Abstract: Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning progress is often achieved by solving a sequence of independent subquestions, each being self-contained and verifiable. These subquestions are essentially atomic questions, relying primarily on their current state rather than accumulated history, similar to the memoryless transitions in a Markov process. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition in the reasoning process consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a new atomic question state. This iterative decomposition-contraction process continues until reaching directly solvable atomic questions, naturally realizing Markov transitions between question states. Furthermore, these atomic questions can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code will be available at https://github.com/qixucen/atom.

Comment: The paper introduces Atom of Thoughts (AoT) for test-time scaling in LLMs, which aligns with theoretical insights into LLM behavior. The Markovian reasoning framework adds methodological novelty.

Relevance: 9 Novelty: 8

25. QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

ArXiv ID: 2502.10424

Authors: Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Abstract: Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups due to inefficient KV cache optimization strategies and result in low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>$90%) and reliably provides consistent end-to-end speedups upto $\sim2.5\times$, outperforming other self-speculative decoding methods that use sparse KV cache for long-context LLM inference. QuantSpec also reduces the memory requirements by $\sim 1.3\times$ compared to these alternatives.

Comment: This paper introduces a novel framework for efficient KV cache optimization in LLMs, which is relevant to 'Model Compression'.

Relevance: 9 Novelty: 8

26. Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size

ArXiv ID: 2502.11467

Authors: Naoki Takeshita, Masaaki Imaizumi

Abstract: Transformers are a type of neural network that have demonstrated remarkable performance across various domains, particularly in natural language processing tasks. Motivated by this success, research on the theoretical understanding of transformers has garnered significant attention. A notable example is the mathematical analysis of their approximation power, which validates the empirical expressive capability of transformers. In this study, we investigate the ability of transformers to approximate column-symmetric polynomials, an extension of symmetric polynomials that take matrices as input. Consequently, we establish an explicit relationship between the size of the transformer network and its approximation capability, leveraging the parameter efficiency of transformers and their compatibility with symmetry by focusing on the algebraic properties of symmetric polynomials.

Comment: The study explores approximation capabilities of transformers concerning column-symmetric polynomials, advancing theoretical understanding of model expressivity.

Relevance: 9 Novelty: 8

27. Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise

ArXiv ID: 2502.11413

Authors: Ilias Diakonikolas, Mingchen Ma, Lisheng Ren, Christos Tzamos

Abstract: We study the task of Multiclass Linear Classification (MLC) in the distribution-free PAC model with Random Classification Noise (RCN). Specifically, the learner is given a set of labeled examples $(x, y)$, where $x$ is drawn from an unknown distribution on $R^d$ and the labels are generated by a multiclass linear classifier corrupted with RCN. That is, the label $y$ is flipped from $i$ to $j$ with probability $H_{ij}$ according to a known noise matrix $H$ with non-negative separation $\sigma: = \min_{i \neq j} H_{ii}-H_{ij}$. The goal is to compute a hypothesis with small 0-1 error. For the special case of two labels, prior work has given polynomial-time algorithms achieving the optimal error. Surprisingly, little is known about the complexity of this task even for three labels. As our main contribution, we show that the complexity of MLC with RCN becomes drastically different in the presence of three or more labels. Specifically, we prove super-polynomial Statistical Query (SQ) lower bounds for this problem. In more detail, even for three labels and constant separation, we give a super-polynomial lower bound on the complexity of any SQ algorithm achieving optimal error. For a larger number of labels and smaller separation, we show a super-polynomial SQ lower bound even for the weaker goal of achieving any constant factor approximation to the optimal loss or even beating the trivial hypothesis.

Comment: The paper provides theoretical insights into the complexity of multiclass linear classification with random noise, which aligns with 'Emerging Trends' through its foundational focus.

Relevance: 9 Novelty: 8

28. Teleportation With Null Space Gradient Projection for Optimization Acceleration

ArXiv ID: 2502.11362

Authors: Zihao Wu, Juncheng Dong, Ahmed Aloui, Vahid Tarokh

Abstract: Optimization techniques have become increasingly critical due to the ever-growing model complexity and data scale. In particular, teleportation has emerged as a promising approach, which accelerates convergence of gradient descent-based methods by navigating within the loss invariant level set to identify parameters with advantageous geometric properties. Existing teleportation algorithms have primarily demonstrated their effectiveness in optimizing Multi-Layer Perceptrons (MLPs), but their extension to more advanced architectures, such as Convolutional Neural Networks (CNNs) and Transformers, remains challenging. Moreover, they often impose significant computational demands, limiting their applicability to complex architectures. To this end, we introduce an algorithm that projects the gradient of the teleportation objective function onto the input null space, effectively preserving the teleportation within the loss invariant level set and reducing computational cost. Our approach is readily generalizable from MLPs to CNNs, transformers, and potentially other advanced architectures. We validate the effectiveness of our algorithm across various benchmark datasets and optimizers, demonstrating its broad applicability.

Comment: The paper introduces a novel optimization technique for advanced architectures like Transformers, aligning with 'Model Architecture' and 'Emerging Trends'.

Relevance: 9 Novelty: 8

29. Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning

ArXiv ID: 2502.10428

Authors: Libo Wang

Abstract: To reduce the cost and consumption of computing resources caused by computational redundancy and delayed reward assignment in long CoT, this research proposes the dynamic chain-of-thought with adaptive reasoning time and steps. The researcher used simulation experiment to simulate the integration of D-CoT through Python 3.13 IDLE combined with a Python simulator based on GPTs. At the same time, the researcher used DeepSeek R1 as a control group to test and compare the performance of the D-CoT simulator in processing MIT OpenCourseWare's linear algebra exam questions. Experimental results show that D-CoT is better than DeepSeek R1 based on long CoT in three indicators: reasoning time, CoT length (reasoning steps) and token count, which achieves a significant reduction in computing resource consumption. In addition, this research has potential value in deep reasoning optimization and can be used as a reference for future dynamic deep reasoning frameworks.

Comment: The paper proposes a dynamic chain-of-thought reasoning framework, which aligns with foundational research in adaptive reasoning and efficiency in LLMs.

Relevance: 9 Novelty: 8

30. Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

ArXiv ID: 2502.11517

Authors: Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin

Abstract: Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against sequential decoding baseline.

Comment: The paper proposes a learning-based system for parallel decoding in LLMs, which aligns with foundational research in efficiency and decoding innovations.

Relevance: 9 Novelty: 8

31. Exact Upper and Lower Bounds for the Output Distribution of Neural Networks with Random Inputs

ArXiv ID: 2502.11672

Authors: Andrey Kofnov, Daniel Kapla, Ezio Bartocci, Efstathia Bura

Abstract: We derive exact upper and lower bounds for the cumulative distribution function (cdf) of the output of a neural network over its entire support subject to noisy (stochastic) inputs. The upper and lower bounds converge to the true cdf over its domain as the resolution increases. Our method applies to any feedforward NN using continuous monotonic piecewise differentiable activation functions (e.g., ReLU, tanh and softmax) and convolutional NNs, which were beyond the scope of competing approaches. The novelty and an instrumental tool of our approach is to bound general NNs with ReLU NNs. The ReLU NN based bounds are then used to derive upper and lower bounds of the cdf of the NN output. Experiments demonstrate that our method delivers guaranteed bounds of the predictive output distribution over its support, thus providing exact error guarantees, in contrast to competing approaches.

Comment: The paper derives exact bounds for the output distribution of neural networks with stochastic inputs, which is foundational in terms of theoretical contributions to neural network behavior.

Relevance: 9 Novelty: 8

32. AdaSplash: Adaptive Sparse Flash Attention

ArXiv ID: 2502.12082

Authors: Nuno Gon\c{c}alves, Marcos Treviso, Andr\'e F. T. Martins

Abstract: The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches -- and in some cases surpasses -- the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.

Comment: AdaSplash improves sparse attention mechanisms, directly impacting Transformer efficiency and aligning well with topics like sparsity and low-rank adaptations.

Relevance: 9 Novelty: 8

33. One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs

ArXiv ID: 2502.10454

Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu

Abstract: Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLMs research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, CounterMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that CounterMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives on the community of mathematical LLMs.

Comment: The paper explores foundational aspects of LLMs by introducing a novel benchmark (CounterMATH) and focusing on counterexample-driven reasoning, which aligns with the 'Large Language Models' criterion for theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8

34. Ansatz-free Hamiltonian learning with Heisenberg-limited scaling

ArXiv ID: 2502.11900

Authors: Hong-Ye Hu, Muzhou Ma, Weiyuan Gong, Qi Ye, Yu Tong, Steven T. Flammia, Susanne F. Yelin

Abstract: Learning the unknown interactions that govern a quantum system is crucial for quantum information processing, device benchmarking, and quantum sensing. The problem, known as Hamiltonian learning, is well understood under the assumption that interactions are local, but this assumption may not hold for arbitrary Hamiltonians. Previous methods all require high-order inverse polynomial dependency with precision, unable to surpass the standard quantum limit and reach the gold standard Heisenberg-limited scaling. Whether Heisenberg-limited Hamiltonian learning is possible without prior assumptions about the interaction structures, a challenge we term \emph{ansatz-free Hamiltonian learning}, remains an open question. In this work, we present a quantum algorithm to learn arbitrary sparse Hamiltonians without any structure constraints using only black-box queries of the system's real-time evolution and minimal digital controls to attain Heisenberg-limited scaling in estimation error. Our method is also resilient to state-preparation-and-measurement errors, enhancing its practical feasibility. Moreover, we establish a fundamental trade-off between total evolution time and quantum control on learning arbitrary interactions, revealing the intrinsic interplay between controllability and total evolution time complexity for any learning algorithm. These results pave the way for further exploration into Heisenberg-limited Hamiltonian learning in complex quantum systems under minimal assumptions, potentially enabling new benchmarking and verification protocols.

Comment: The study introduces Heisenberg-limited precision in Hamiltonian learning, which falls under 'Emerging Trends' for foundational quantum system insights.

Relevance: 8 Novelty: 9

35. The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It

ArXiv ID: 2502.11771

Authors: Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi

Abstract: The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models' internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on $\textit{consistency heads}$--attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models' internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why current LLMs struggle to detect even simple arithmetic errors.

Comment: The paper provides a mechanistic analysis of error detection in LLMs, focusing on arithmetic validation, which aligns with interpretability and foundational research in LLM behavior.

Relevance: 9 Novelty: 7

36. The geometry of BERT

ArXiv ID: 2502.12033

Authors: Matteo Bonino, Giorgia Ghione, Giansalvo Cirrincione

Abstract: Transformer neural networks, particularly Bidirectional Encoder Representations from Transformers (BERT), have shown remarkable performance across various tasks such as classification, text summarization, and question answering. However, their internal mechanisms remain mathematically obscure, highlighting the need for greater explainability and interpretability. In this direction, this paper investigates the internal mechanisms of BERT proposing a novel perspective on the attention mechanism of BERT from a theoretical perspective. The analysis encompasses both local and global network behavior. At the local level, the concept of directionality of subspace selection as well as a comprehensive study of the patterns emerging from the self-attention matrix are presented. Additionally, this work explores the semantic content of the information stream through data distribution analysis and global statistical measures including the novel concept of cone index. A case study on the classification of SARS-CoV-2 variants using RNA which resulted in a very high accuracy has been selected in order to observe these concepts in an application. The insights gained from this analysis contribute to a deeper understanding of BERT's classification process, offering potential avenues for future architectural improvements in Transformer models and further analysis in the training process.

Comment: The paper provides a theoretical analysis of BERT's attention mechanism and internal geometry, which aligns with foundational research in Transformer interpretability.

Relevance: 9 Novelty: 7

ArXiv ID: 2502.10723

Authors: Xiliang Yang, Shenyang Deng, Shicong Liu, Yuanchi Suo, Wing. W. Y NG, Jianjun Zhang

Abstract: Data augmentation is an important technique in training deep neural networks as it enhances their ability to generalize and remain robust. While data augmentation is commonly used to expand the sample size and act as a consistency regularization term, there is a lack of research on the relationship between them. To address this gap, this paper introduces a more comprehensive mathematical framework for data augmentation. Through this framework, we establish that the expected risk of the shifted population is the sum of the original population risk and a gap term, which can be interpreted as a consistency regularization term. The paper also provides a theoretical understanding of this gap, highlighting its negative effects on the early stages of training. We also propose a method to mitigate these effects. To validate our approach, we conducted experiments using same data augmentation techniques and computing resources under several scenarios, including standard training, out-of-distribution, and imbalanced classification. The results demonstrate that our methods surpass compared methods under all scenarios in terms of generalization ability and convergence stability. We provide our code implementation at the following link: https://github.com/ydlsfhll/ASPR.

Comment: The proposed mathematical framework for understanding consistency regularization in data augmentation contributes to theoretical insights and training dynamics, aligning with 'Representation Learning'.

Relevance: 9 Novelty: 7

38. From Layers to States: A State Space Model Perspective to Deep Neural Network Layer Dynamics

ArXiv ID: 2502.10463

Authors: Qinshuo Liu, Weiqin Zhao, Wei Huang, Yanwen Fang, Lequan Yu, Guodong Li

Abstract: The depth of neural networks is a critical factor for their capability, with deeper models often demonstrating superior performance. Motivated by this, significant efforts have been made to enhance layer aggregation - reusing information from previous layers to better extract features at the current layer, to improve the representational power of deep neural networks. However, previous works have primarily addressed this problem from a discrete-state perspective which is not suitable as the number of network layers grows. This paper novelly treats the outputs from layers as states of a continuous process and considers leveraging the state space model (SSM) to design the aggregation of layers in very deep neural networks. Moreover, inspired by its advancements in modeling long sequences, the Selective State Space Models (S6) is employed to design a new module called Selective State Space Model Layer Aggregation (S6LA). This module aims to combine traditional CNN or transformer architectures within a sequential framework, enhancing the representational capabilities of state-of-the-art vision networks. Extensive experiments show that S6LA delivers substantial improvements in both image classification and detection tasks, highlighting the potential of integrating SSMs with contemporary deep learning techniques.

Comment: This paper introduces a state-space model layer aggregation for deep networks. It aligns with 'Model Architecture', offering insights into layer dynamics and integration of SSM techniques.

Relevance: 9 Novelty: 7

39. LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model Merging

ArXiv ID: 2502.10749

Authors: Zehua Liu, Han Wu, Yuxuan Yao, Ruifeng She, Xiongwei Han, Tao Zhong, Mingxuan Yuan

Abstract: While most current approaches rely on further training techniques, such as fine-tuning or reinforcement learning, to enhance model capacities, model merging stands out for its ability of improving models without requiring any additional training. In this paper, we propose a unified framework for model merging based on low-rank estimation of task vectors without the need for access to the base model, named \textsc{LoRE-Merging}. Our approach is motivated by the observation that task vectors from fine-tuned models frequently exhibit a limited number of dominant singular values, making low-rank estimations less prone to interference. We implement the method by formulating the merging problem as an optimization problem. Extensive empirical experiments demonstrate the effectiveness of our framework in mitigating interference and preserving task-specific information, thereby advancing the state-of-the-art performance in model merging techniques.

Comment: Introduces low-rank estimation techniques for model merging, advancing low-rank methodologies closely tied to compression and scaling topics.

Relevance: 9 Novelty: 7

40. Low-Rank Thinning

ArXiv ID: 2502.12063

Authors: Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, Lester Mackey

Abstract: The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of uniform subsampling while substantially reducing the number of summary points. However, existing guarantees cover only a restricted range of distributions and kernel-based quality measures and suffer from pessimistic dimension dependence. To address these deficiencies, we introduce a new low-rank analysis of sub-Gaussian thinning that applies to any distribution and any kernel, guaranteeing high-quality compression whenever the kernel or data matrix is approximately low-rank. To demonstrate the broad applicability of the techniques, we design practical sub-Gaussian thinning approaches that improve upon the best known guarantees for approximating attention in transformers, accelerating stochastic gradient training through reordering, and distinguishing distributions in near-linear time.

Comment: The paper introduces a low-rank analysis for sub-Gaussian thinning, which has implications for model compression and efficiency, particularly in attention mechanisms.

Relevance: 8 Novelty: 8

41. On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning

ArXiv ID: 2502.10818

Authors: \'Alvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, Pierre Vandergheynst

Abstract: Graph Neural Networks (GNNs) are models that leverage the graph structure to transmit information between nodes, typically through the message-passing operation. While widely successful, this approach is well known to suffer from the over-smoothing and over-squashing phenomena, which result in representational collapse as the number of layers increases and insensitivity to the information contained at distant and poorly connected nodes, respectively. In this paper, we present a unified view of these problems through the lens of vanishing gradients, using ideas from linear control theory for our analysis. We propose an interpretation of GNNs as recurrent models and empirically demonstrate that a simple state-space formulation of a GNN effectively alleviates over-smoothing and over-squashing at no extra trainable parameter cost. Further, we show theoretically and empirically that (i) GNNs are by design prone to extreme gradient vanishing even after a few layers; (ii) Over-smoothing is directly related to the mechanism causing vanishing gradients; (iii) Over-squashing is most easily alleviated by a combination of graph rewiring and vanishing gradient mitigation. We believe our work will help bridge the gap between the recurrent and graph neural network literature and will unlock the design of new deep and performant GNNs.

Comment: The paper provides a unification of over-smoothing, over-squashing, and vanishing gradients in GNNs with theoretical insights and proposes improvements. Foundational relevance for graph neural networks.

Relevance: 8 Novelty: 8

42. How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training

ArXiv ID: 2502.11196

Authors: Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen

Abstract: Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.

Comment: The exploration of knowledge circuit evolution in LLMs aligns with 'Representation Learning', focusing on interpretability and continual pre-training insights.

Relevance: 8 Novelty: 8

43. On the Query Complexity of Verifier-Assisted Language Generation

ArXiv ID: 2502.12123

Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski

Abstract: Recently, a plethora of works have proposed inference-time algorithms (e.g. best-of-n), which incorporate verifiers to assist the generation process. Their quality-efficiency trade-offs have been empirically benchmarked on a variety of constrained generation tasks, but the algorithmic design landscape is still largely poorly understood. In this paper, we develop a mathematical framework for reasoning about constrained generation using a pre-trained language model generator oracle and a process verifier--which can decide whether a prefix can be extended to a string which satisfies the constraints of choice. We show that even in very simple settings, access to a verifier can render an intractable problem (information-theoretically or computationally) to a tractable one. In fact, we show even simple algorithms, like tokenwise rejection sampling, can enjoy significant benefits from access to a verifier. Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to "backtrack" (i.e., erase the final few generated tokens) has robust and substantive benefits over natural baselines (e.g. (blockwise) rejection sampling, nucleus sampling)--both in terms of computational efficiency, accuracy and diversity.

Comment: This paper explores verifier-assisted constrained generation, offering novel mathematical insights and advancing theoretical understanding of inference-time algorithms. Relevant to foundational LLM research.

Relevance: 8 Novelty: 8

44. APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs

ArXiv ID: 2502.12085

Authors: Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun

Abstract: While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB in https://github.com/thunlp/APB.

Comment: The paper introduces APB, a framework for accelerating long-context inference in LLMs, which is relevant to model efficiency and compression with significant speedup contributions.

Relevance: 8 Novelty: 8

45. Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens

ArXiv ID: 2502.11245

Authors: Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso

Abstract: Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring these modules produce interpretable concepts and behave reliably in out-of-distribution is crucial, yet the conditions for achieving this remain unclear. We study this problem by establishing a novel connection between Concept-based Models and reasoning shortcuts (RSs), a common issue where models achieve high accuracy by learning low-quality concepts, even when the inference layer is fixed and provided upfront. Specifically, we first extend RSs to the more complex setting of Concept-based Models and then derive theoretical conditions for identifying both the concepts and the inference layer. Our empirical results highlight the impact of reasoning shortcuts and show that existing methods, even when combined with multiple natural mitigation strategies, often fail to meet these conditions in practice.

Comment: The paper studies concept-based models and reasoning shortcuts, which aligns with representation learning and interpretability. The theoretical conditions for identifiability add significant novelty.

Relevance: 8 Novelty: 8

46. SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

ArXiv ID: 2502.11356

Authors: Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du

Abstract: The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders (SAE) to interpret how instruction following works in these models. We demonstrate how the features we identify can effectively steer model outputs to align with given instructions. Through analysis of SAE latent activations, we identify specific latents responsible for instruction following behavior. Our findings reveal that instruction following capabilities are encoded by a distinct set of instruction-relevant SAE latents. These latents both show semantic proximity to relevant instructions and demonstrate causal effects on model behavior. Our research highlights several crucial factors for achieving effective steering performance: precise feature identification, the role of final layer, and optimal instruction positioning. Additionally, we demonstrate that our methodology scales effectively across SAEs and LLMs of varying sizes.

Comment: This paper uses Sparse Autoencoders (SAEs) to interpret instruction-following in LLMs, connecting both 'Representation Learning' and interpretability of models.

Relevance: 8 Novelty: 8

47. Generalization of the Gibbs algorithm with high probability at low temperatures

ArXiv ID: 2502.11071

Authors: Andreas Maurer

Abstract: The paper gives a bound on the generalization error of the Gibbs algorithm, which recovers known data-independent bounds for the high temperature range and extends to the low-temperature range, where generalization depends critically on the data-dependent loss-landscape. It is shown, that with high probability the generalization error of a single hypothesis drawn from the Gibbs posterior decreases with the total prior volume of all hypotheses with similar or smaller empirical error. This gives theoretical support to the belief in the benefit of flat minima. The zero temperature limit is discussed and the bound is extended to a class of similar stochastic algorithms.

Comment: The paper addresses generalization bounds for the Gibbs algorithm with a focus on flat minima, relevant to emerging foundational insights in optimization.

Relevance: 8 Novelty: 8

48. Learning the Exact Time Integration Algorithm for Initial Value Problems by Randomized Neural Networks

ArXiv ID: 2502.10949

Authors: Suchuan Dong, Naxian Ni

Abstract: We present a method leveraging extreme learning machine (ELM) type randomized neural networks (NNs) for learning the exact time integration algorithm for initial value problems (IVPs). The exact time integration algorithm for non-autonomous systems can be represented by an algorithmic function in higher dimensions, which satisfies an associated system of partial differential equations with corresponding boundary conditions. Our method learns the algorithmic function by solving this associated system using ELM with a physics informed approach. The trained ELM network serves as the learned algorithm and can be used to solve the IVP with arbitrary initial data or step sizes from some domain. When the right hand side of the non-autonomous system exhibits a periodicity with respect to any of its arguments, while the solution itself to the problem is not periodic, we show that the algorithmic function is either periodic, or when it is not, satisfies a well-defined relation for different periods. This property can greatly simplify the algorithm learning in many problems. We consider explicit and implicit NN formulations, leading to explicit or implicit time integration algorithms, and discuss how to train the ELM network by the nonlinear least squares method. Extensive numerical experiments with benchmark problems, including non-stiff, stiff and chaotic systems, show that the learned NN algorithm produces highly accurate solutions in long-time simulations, with its time-marching errors decreasing nearly exponentially with increasing degrees of freedom in the neural network. We compare extensively the computational performance (accuracy vs.~cost) between the current NN algorithm and the leading traditional time integration algorithms. The learned NN algorithm is computationally competitive, markedly outperforming the traditional algorithms in many problems.

Comment: The paper introduces a method for learning time integration algorithms using randomized neural networks, which is foundational in terms of algorithmic innovation and efficiency.

Relevance: 8 Novelty: 8

49. Continuous Diffusion Model for Language Modeling

ArXiv ID: 2502.11564

Authors: Jaehyeong Jo, Sung Ju Hwang

Abstract: Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that directly work on discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on the analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Codes available at \href{https://github.com/harryjo97/RDLM}{https://github.com/harryjo97/RDLM}.

Comment: This paper introduces a continuous diffusion model for language modeling with connections to statistical manifolds, providing theoretical innovations in generative modeling for discrete data.

Relevance: 8 Novelty: 8

50. MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models

ArXiv ID: 2502.11513

Authors: Zhen Zhang, Yifan Yang, Kai Zhen, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang

Abstract: Large language models have demonstrated exceptional capabilities across diverse tasks, but their fine-tuning demands significant memory, posing challenges for resource-constrained environments. Zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating the need for backpropagation. However, ZO optimization suffers from high gradient variance, and prior research has largely focused on single-task learning, leaving its application to multi-task learning unexplored. Multi-task learning is crucial for leveraging shared knowledge across tasks to improve generalization, yet it introduces unique challenges under ZO settings, such as amplified gradient variance and collinearity. In this paper, we present MaZO, the first framework specifically designed for multi-task LLM fine-tuning under ZO optimization. MaZO tackles these challenges at the parameter level through two key innovations: a weight importance metric to identify critical parameters and a multi-task weight update mask to selectively update these parameters, reducing the dimensionality of the parameter space and mitigating task conflicts. Experiments demonstrate that MaZO achieves state-of-the-art performance, surpassing even multi-task learning methods designed for first-order optimization.

Comment: The paper introduces MaZO, a novel framework for multi-task fine-tuning of LLMs using zeroth-order optimization, which aligns with the 'Model Compression' criterion due to its focus on memory-efficient optimization and parameter-level innovations.

Relevance: 8 Novelty: 8

51. How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines

ArXiv ID: 2502.12051

Authors: Ayan Sengupta, Yash Goel, Tanmoy Chakraborty

Abstract: Neural scaling laws have revolutionized the design and optimization of large-scale AI models by revealing predictable relationships between model size, dataset volume, and computational resources. Early research established power-law relationships in model performance, leading to compute-optimal scaling strategies. However, recent studies highlighted their limitations across architectures, modalities, and deployment contexts. Sparse models, mixture-of-experts, retrieval-augmented learning, and multimodal models often deviate from traditional scaling patterns. Moreover, scaling behaviors vary across domains such as vision, reinforcement learning, and fine-tuning, underscoring the need for more nuanced approaches. In this survey, we synthesize insights from over 50 studies, examining the theoretical foundations, empirical findings, and practical implications of scaling laws. We also explore key challenges, including data efficiency, inference scaling, and architecture-specific constraints, advocating for adaptive scaling strategies tailored to real-world applications. We suggest that while scaling laws provide a useful guide, they do not always generalize across all architectures and training strategies.

Comment: This paper offers a broad survey of scaling laws, which encompass topics such as sparse models and mixture-of-experts. This aligns with the 'Model Architecture' and 'Representation Learning' criteria as it touches on foundational and theoretical aspects of scaling models.

Relevance: 9 Novelty: 6

52. Meta-Statistical Learning: Supervised Learning of Statistical Inference

ArXiv ID: 2502.12088

Authors: Maxime Peyrard, Kyunghyun Cho

Abstract: This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks, where the goal is to predict properties of the data-generating distribution rather than labels for individual datapoints. These tasks encompass statistical inference problems such as parameter estimation, hypothesis testing, or mutual information estimation. Framing these tasks within traditional machine learning pipelines is challenging, as supervision is typically tied to individual datapoint. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems. In this approach, entire datasets are treated as single inputs to neural networks, which predict distribution-level parameters. Transformer-based architectures, without positional encoding, provide a natural fit due to their permutation-invariance properties. By training on large-scale synthetic datasets, meta-statistical models can leverage the scalability and optimization infrastructure of Transformer-based LLMs. We demonstrate the framework's versatility with applications in hypothesis testing and mutual information estimation, showing strong performance, particularly for small datasets where traditional neural methods struggle.

Comment: The paper introduces a novel framework for statistical inference using Transformer-based architectures, which aligns with representation learning and architectural insights. The use of permutation-invariant Transformers is particularly relevant.