Personalized Daily Arxiv Papers 02/10/2025

	Prompt	Completion	Total
Token	108721	10720	119441
Cost	$2.72	$1.07	$3.79

Total scanned papers: 348

Total relevant papers: 34

Table of contents with paper titles:

In-context denoising with one-layer transformers: connections between attention and associative memory retrieval Authors: Matthew Smart, Alberto Bietti, Anirvan M. Sengupta
Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization Authors: Xinhao Yao, Ruifeng Ren, Yun Liao, Yong Liu
Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient Authors: Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing Authors: Kunfeng Lai, Zhenheng Tang, Xinglin Pan, Peijie Dong, Xiang Liu, Haolan Chen, Li Shen, Bo Li, Xiaowen Chu
Tighter sparse variational Gaussian processes Authors: Thang D. Bui, Matthew Ashman, Richard E. Turner
No Task Left Behind: Isotropic Model Merging with Common and Task-Specific Subspaces Authors: Daniel Marczak, Simone Magistri, Sebastian Cygert, Bartłomiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach Authors: Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein
Distinguishing Cause from Effect with Causal Velocity Models Authors: Johnny Xi, Hugh Dance, Peter Orbanz, Benjamin Bloem-Reddy
Extracting and Understanding the Superficial Knowledge in Alignment Authors: Runjin Chen, Gabriel Jacob Perin, Xuxi Chen, Xilun Chen, Yan Han, Nina S. T. Hirata, Junyuan Hong, Bhavya Kailkhura
Sparse Autoencoders for Hypothesis Generation Authors: Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson
In Praise of Stubbornness: The Case for Cognitive-Dissonance-Aware Knowledge Updates in LLMs Authors: Simone Clemente, Zied Ben Houidi, Alexis Huet, Dario Rossi, Giulio Franzese, Pietro Michiardi
Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models Authors: Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Implicit Bias of SignGD and Adam on Multiclass Separable Data Authors: Chen Fan, Mark Schmidt, Christos Thrampoulidis
KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference Authors: Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan
TruthFlow: Truthful LLM Generation via Representation Flow Correction Authors: Hanyu Wang, Bochuan Cao, Yuanpu Cao, Jinghui Chen
An Analysis for Reasoning Bias of Language Models with Small Initialization Authors: Junjie Yao, Zhongwang Zhang, Zhi-Qin John Xu
Position-aware Automatic Circuit Discovery Authors: Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
Rethinking Oversmoothing in Graph Neural Networks: A Rank-Based Perspective Authors: Piero Deidda, Kaicheng Zhang, Desmond Higham, Francesco Tudisco
Generating Symbolic World Models via Test-time Scaling of Large Language Models Authors: Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu
WaferLLM: A Wafer-Scale LLM Inference System Authors: Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai
Mechanistic Understandings of Representation Vulnerabilities and Engineering Robust Vision Transformers Authors: Chashi Mahiul Islam, Samuel Jacob Chacko, Mao Nishino, Xiuwen Liu
No Images, No Problem: Retaining Knowledge in Continual VQA with Questions-Only Memory Authors: Imad Eddine Marouf, Enzo Tartaglione, Stephane Lathuiliere, Joost van de Weijer
PerPO: Perceptual Preference Optimization via Discriminative Rewarding Authors: Zining Zhu, Liang Zhao, Kangheng Lin, Jinze Yang, En Yu, Chenglong Liu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang
Noise Sensitivity of Hierarchical Functions and Deep Learning Lower Bounds in General Product Measures Authors: Rupert Li, Elchanan Mossel
Speeding up Speculative Decoding via Approximate Verification Authors: Meiyu Zhong, Noel Teku, Ravi Tandon
Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator Authors: Zhuotong Chen, Fang Liu, Xuan Zhu, Yanjun Qi, Mohammad Ghavamzadeh
GNNs Getting ComFy: Community and Feature Similarity Guided Rewiring Authors: Celia Rubio-Madrigal, Adarsh Jamadandi, Rebekka Burkholz
Learning low-dimensional representations of ensemble forecast fields using autoencoder-based methods Authors: Jieyu Chen, Kevin Höhlein, Sebastian Lerch
Flopping for FLOPs: Leveraging equivariance for computational efficiency Authors: Georg Bökman, David Nordström, Fredrik Kahl
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization Authors: Yu-Neng Chuang, Leisheng Yu, Guanchu Wang, Lizhe Zhang, Zirui Liu, Xuanting Cai, Yang Sui, Vladimir Braverman, Xia Hu
Self-Regulation and Requesting Interventions Authors: So Yeon Min, Yue Wu, Jimin Sun, Max Kaufmann, Fahim Tajwar, Yonatan Bisk, Ruslan Salakhutdinov
Investigating the Robustness of Deductive Reasoning with Large Language Models Authors: Fabian Hoppe, Filip Ilievski, Jan-Christoph Kalo
HSI: A Holistic Style Injector for Arbitrary Style Transfer Authors: Shuhao Zhang, Hui Kang, Yang Liu, Fang Mei, Hongjuan Li

1. In-context denoising with one-layer transformers: connections between attention and associative memory retrieval

ArXiv ID: 2502.05164

Authors: Matthew Smart, Alberto Bietti, Anirvan M. Sengupta

Abstract: We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.

Comment: Develops a theoretical framework connecting attention to associative memory retrieval in transformers, which bears directly on foundational representation learning and transformer behavior.

Relevance: 10 Novelty: 9
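
The abstract's central claim, that an attention layer can act as a single gradient descent update on a dense associative memory (DAM) energy, can be illustrated with a small numerical sketch. The energy below is the standard log-sum-exp form associated with modern Hopfield networks, not necessarily the context-aware energy the paper derives; the memories, query, inverse temperature `beta`, unit step size, and all function names are assumptions chosen for illustration, and the paper's trained weights are omitted. Under those assumptions, one gradient step with step size 1 from the query coincides with softmax attention that uses the context tokens as both keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dam_energy(state, memories, beta):
    """Assumed energy: E(s) = 0.5*||s||^2 - (1/beta) * log sum_i exp(beta * <x_i, s>)."""
    scores = [beta * dot(mem, state) for mem in memories]
    peak = max(scores)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in scores))
    return 0.5 * dot(state, state) - log_sum_exp / beta


def attention_readout(query, memories, beta):
    """Softmax attention with keys = values = memories (no learned projections)."""
    weights = softmax([beta * dot(mem, query) for mem in memories])
    return [sum(w * mem[d] for w, mem in zip(weights, memories))
            for d in range(len(query))]


def dam_energy_grad(state, memories, beta):
    """Analytic gradient of dam_energy: s - sum_i softmax(beta * <x_i, s>)_i * x_i."""
    readout = attention_readout(state, memories, beta)
    return [s - r for s, r in zip(state, readout)]


def gradient_step(state, memories, beta, step_size=1.0):
    """One gradient descent update on the assumed DAM energy."""
    grad = dam_energy_grad(state, memories, beta)
    return [s - step_size * g for s, g in zip(state, grad)]


# Hypothetical toy data: three context tokens (memories) and one noisy query.
memories = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.8, 0.3]
beta = 2.0

stepped = gradient_step(query, memories, beta)
attended = attention_readout(query, memories, beta)
print("one gradient step:", stepped)
print("attention readout:", attended)
```

The two printed vectors agree up to floating-point rounding. With a step size other than 1 the update becomes an interpolation between the query and the attention readout, so the exact match depends on that choice.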

2. Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization

ArXiv ID: 2502.04667

Authors: Xinhao Yao, Ruifeng Ren, Yun Liao, Yong Liu

Abstract: Training large language models (LLMs) with high-quality Chain-of-Thought (CoT) annotations has become a widely adopted strategy due to its significant enhancement of reasoning capabilities. To fully comprehend this approach, two questions naturally arise: (Q1) What advantages does training with CoT offer compared to training without CoT? (Q2) If there are advantages, what are the underlying mechanisms of explicit CoT training? Analyzing the advantages and mechanisms of CoT training is challenging due to the many factors involved. To address this, we conduct a detailed analysis using clear and controllable data distributions and, for the first time, reveal that CoT training offers the following advantages: (1) Training with CoT markedly improves reasoning generalization, extending it from in-distribution (ID) to both ID and out-of-distribution (OOD) scenarios, while also speeding up convergence; (2) Even when training with CoT includes a certain range of erroneous reasoning steps, it still enables the model to learn reasoning patterns, leading to systematic generalization. We further explore the underlying mechanisms from a circuit perspective: (1) The data distribution (e.g., ratio $\lambda$ and pattern) plays a crucial role in influencing the model's systematic generalization; (2) CoT training (with two-hop facts) internalizes reasoning into a two-stage generalizing circuit, where the number of stages corresponds to the explicit reasoning steps during training. Our findings elucidate the mechanisms underlying explicit CoT training and offer critical insights into tuning strategies for LLMs to achieve robust generalization.

Comment: Investigates the mechanisms behind explicit Chain-of-Thought (CoT) training, which bears on LLM training dynamics and offers foundational insight into how reasoning generalizes.

Relevance: 10 Novelty: 8

3. Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient

ArXiv ID: 2502.05172

Authors: Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur

Abstract: Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. To derive and validate the theoretical predictions of our scaling laws, we conduct over 280 experiments with up to 2.7B active parameters and up to 5B total parameters. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.

Comment: Derives joint scaling laws for dense and MoE models under memory constraints, addressing both theory and computational efficiency; highly relevant to 'Mixture of Experts' and architectural principles.

Relevance: 10 Novelty: 8

4. QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

ArXiv ID: 2502.05003

Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh

Abstract: One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bits weights and activations. We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4-bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.

Comment: This paper introduces QuEST, a quantization-aware training method that trains stably with 1-bit weights and activations. This directly aligns with the criterion on model compression breakthroughs.