Personalized Daily ArXiv Papers 2025-06-06

[gpt-4o]	Prompt	Completion	Total
Token	38313	4874	43187
Cost	$0.1	$0.05	$0.14

Total arXiv papers: 662

Total scanned papers: 355

Total relevant papers: 37

Table of contents with paper titles:

Transformers Meet In-Context Learning: A Universal Approximation Theory Authors: Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen
NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models Authors: Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar
There Was Never a Bottleneck in Concept Bottleneck Models Authors: Antonio Almudévar, José Miguel Hernández-Lobato, Alfonso Ortega
Inference-Time Hyper-Scaling with KV Cache Compression Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti
Log-Linear Attention Authors: Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim
Sample Complexity and Representation Ability of Test-time Scaling Paradigms Authors: Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao
Kinetics: Rethinking Test-Time Scaling Laws Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning Authors: Achleshwar Luthra, Tianbao Yang, Tomer Galanti
KOALA++: Efficient Kalman-Based Optimization of Neural Networks with Gradient-Covariance Products Authors: Zixuan Xia, Aram Davtyan, Paolo Favaro
On the Convergence of Gradient Descent on Learning Transformers with Residual Connections Authors: Zhen Qin, Jinxin Zhou, Zhihui Zhu
Learning normalized image densities via dual score matching Authors: Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli
FPTQuant: Function-Preserving Transforms for LLM Quantization Authors: Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel
Sparse Autoencoders, Again? Authors: Yin Lu, Tong He, Xuening Zhu, David Wipf
Adaptive Preconditioners Trigger Loss Spikes in Adam Authors: Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu
Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order Authors: Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Pavel Plyusnin, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov
NIMO: a Nonlinear Interpretable MOdel Authors: Shijian Xu, Marcello Massimo Negri, Volker Roth
DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience Authors: Runxiang Wang, Boxiao Wang, Kai Li, Yifan Zhang, Jian Cheng
Aligning Latent Spaces with Flow Priors Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo
The Oversmoothing Fallacy: A Misguided Narrative in GNN Research Authors: MoonJeong Park, Sunghyun Choi, Jaeseung Heo, Eunhyeok Park, Dongwoo Kim
HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training Authors: Geon-Woo Kim, Junbo Li, Shashidhar Gandham, Omar Baldonado, Adithya Gangidi, Pavan Balaji, Zhangyang Wang, Aditya Akella
Exploring Diffusion Transformer Designs via Grafting Authors: Keshigeyan Chandrasegaran, Michael Poli, Daniel Y. Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, Li Fei-Fei
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets
HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation Authors: Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Ming-Yu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher Ré, David W. Romero
Relational reasoning and inductive bias in transformers trained on a transitive inference task Authors: Jesse Geerts, Stephanie Chan, Claudia Clopath, Kimberly Stachenfeld
Exploring bidirectional bounds for minimax-training of Energy-based models Authors: Cong Geng, Jia Wang, Li Chen, Zhiyong Gao, Jes Frellsen, Søren Hauberg
Tight analyses of first-order methods with error feedback Authors: Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut
Aligning Multimodal Representations through an Information Bottleneck Authors: Antonio Almudévar, José Miguel Hernández-Lobato, Sameer Khurana, Ricard Marxer, Alfonso Ortega
Associative Memory and Generative Diffusion in the Zero-noise Limit Authors: Joshua Hess, Quaid Morris
Robust Moment Identification for Nonlinear PDEs via a Neural ODE Approach Authors: Shaoxuan Chen, Su Yang, Panayotis G. Kevrekidis, Wei Zhu
Semi-Implicit Variational Inference via Kernelized Path Gradient Descent Authors: Tobias Pielok, Bernd Bischl, David Rügamer
Identifying and Understanding Cross-Class Features in Adversarial Training Authors: Zeming Wei, Yiwen Guo, Yisen Wang
Hierarchical Implicit Neural Emulators Authors: Ruoxi Jiang, Xiao Zhang, Karan Jakhar, Peter Y. Lu, Pedram Hassanzadeh, Michael Maire, Rebecca Willett
Power Law Guided Dynamic Sifting for Efficient Attention Authors: Nirav Koley, Prajwal Singhania, Abhinav Bhatele
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training Authors: Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento
You Only Train Once Authors: Christos Sakaridis
Half-Layered Neural Networks Authors: Ethem Alpaydin
Towards Reasonable Concept Bottleneck Models Authors: Nektarios Kalampalikis, Kavya Gupta, Georgi Vitanov, Isabel Valera

1. Transformers Meet In-Context Learning: A Universal Approximation Theory

ArXiv ID: 2506.05200

Authors: Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen

Abstract: Modern large language models are capable of in-context learning, the ability to perform new tasks at inference time using only a handful of input-output examples in the prompt, without any fine-tuning or parameter updates. We develop a universal approximation theory to better understand how transformers enable in-context learning. For any class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can perform reliable prediction given only a few in-context examples. In contrast to much of the recent literature that frames transformers as algorithm approximators -- i.e., constructing transformers to emulate the iterations of optimization algorithms as a means to approximate solutions of learning problems -- our work adopts a fundamentally different approach rooted in universal function approximation. This alternative approach offers approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being approximated, thereby extending far beyond convex problems and linear function classes. Our construction sheds light on how transformers can simultaneously learn general-purpose representations and adapt dynamically to in-context examples.

Comment: The paper develops a universal approximation theory for transformers in in-context learning, providing theoretical insights into LLM behavior.

Relevance: 9 Novelty: 9

2. NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models

ArXiv ID: 2506.04536

Authors: Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar

Abstract: Characterizing the diverse computational properties of human neurons via multimodal electrophysiological, transcriptomic, and morphological data provides the foundation for constructing and validating bio-realistic neuron models that can advance our understanding of fundamental mechanisms underlying brain function. However, current modeling approaches remain constrained by the limited availability and intrinsic variability of experimental neuronal data. To capture variability, ensembles of deterministic models are often used, but are difficult to scale as model generation requires repeating computationally expensive optimization for each neuron. While deep learning is becoming increasingly relevant in this space, it fails to capture the full biophysical complexity of neurons, their nonlinear voltage dynamics, and variability. To address these shortcomings, we introduce NOBLE, a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. Trained on data generated from biophysically realistic neuron models, NOBLE predicts distributions of neural dynamics accounting for the intrinsic experimental variability. Unlike conventional bio-realistic neuron models, interpolating within the embedding space offers models whose dynamics are consistent with experimentally observed responses. NOBLE is the first scaled-up deep learning framework validated on real experimental data, enabling efficient generation of synthetic neurons that exhibit trial-to-trial variability and achieve a $4200\times$ speedup over numerical solvers. To this end, NOBLE captures fundamental neural properties, opening the door to a better understanding of cellular composition and computations, neuromorphic architectures, large-scale brain circuits, and general neuroAI applications.

Comment: The paper introduces NOBLE, a neural operator framework for modeling biological neurons, which aligns with foundational research in AI for Science, focusing on new generative paradigms and architecture-level innovations.

Relevance: 9 Novelty: 8

3. There Was Never a Bottleneck in Concept Bottleneck Models

ArXiv ID: 2506.04877

Authors: Antonio Almudévar, José Miguel Hernández-Lobato, Alfonso Ortega

Abstract: Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs support concept-level interventions with theoretical guarantees, remain consistent with Bayesian principles, and offer greater flexibility in key design choices.

Comment: The paper proposes Minimal Concept Bottleneck Models, which add an Information Bottleneck regularizer so that each representation component retains only the information relevant to its concept; this is relevant to interpretable representation learning.

Relevance: 9 Novelty: 8

4. Inference-Time Hyper-Scaling with KV Cache Compression

ArXiv ID: 2506.05345

Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti

Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.

Comment: The paper introduces Dynamic Memory Sparsification for KV cache compression, aligning with model compression and efficiency improvements.

Relevance: 9 Novelty: 8

5. Log-Linear Attention

ArXiv ID: 2506.04761

Authors: Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim

Abstract: The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures -- Mamba-2 and Gated DeltaNet -- and find they perform well compared to their linear-time variants.

Comment: The paper introduces log-linear attention, a novel attention mechanism balancing efficiency and expressiveness, relevant to model architecture.

Relevance: 9 Novelty: 8
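
To make the phrase "a logarithmically growing set of hidden states" concrete, the sketch below splits a prefix of length t into segments whose lengths are powers of two, following the binary expansion of t (the decomposition used by Fenwick trees). Keeping one state per segment would then require at most about log2(t) + 1 states. This is an illustration under that assumption only: the function name is hypothetical, and the paper's actual growth function may differ.

```python
def power_of_two_segments(t: int) -> list[tuple[int, int]]:
    """Split the prefix [0, t) into half-open segments of power-of-two length.

    Segment lengths follow the binary expansion of t, largest first, so the
    oldest tokens share one large segment and the most recent tokens fall
    into the smallest segments.
    """
    segments = []
    start = 0
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:
            segments.append((start, start + size))
            start += size
    return segments


# The number of segments never exceeds the number of bits in t.
for t in (1, 6, 13, 1024):
    segs = power_of_two_segments(t)
    print(t, len(segs), segs)
```

In this toy partition, recent context is kept at fine resolution and distant context is summarized coarsely, which is one intuitive way a model could sit between a single fixed-size state and a full per-token cache.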

6. Sample Complexity and Representation Ability of Test-time Scaling Paradigms

ArXiv ID: 2506.05295

Authors: Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao

Abstract: Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.

Comment: The paper provides theoretical insights into test-time scaling paradigms for LLMs, which is relevant to understanding LLM behavior.

Relevance: 9 Novelty: 8
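
The dependence of self-consistency on the gap Δ can be checked numerically in a simplified setting. The sketch below assumes only two candidate answers, so the correct one has probability (1 + Δ)/2, and computes the exact probability that a majority vote over n independent samples is correct. The two-answer setup, the 0.9 target accuracy, and the function names are assumptions made for illustration; best-of-n is not modeled because it needs a verifier.

```python
from math import comb


def majority_vote_accuracy(delta: float, n: int) -> float:
    """Exact probability that a majority of n samples (n odd) is correct.

    Two-answer toy model: the correct answer has probability (1 + delta) / 2.
    """
    p = (1.0 + delta) / 2.0
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def samples_needed(delta: float, target: float = 0.9) -> int:
    """Smallest odd n for which majority voting reaches the target accuracy."""
    n = 1
    while majority_vote_accuracy(delta, n) < target:
        n += 2
    return n


# Halving the gap should increase the required sample count substantially.
for delta in (0.4, 0.2, 0.1):
    print(delta, samples_needed(delta))
```

If the quadratic dependence stated in the abstract carries over to this toy model, each halving of Δ should roughly quadruple the printed sample count.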

7. Kinetics: Rethinking Test-Time Scaling Laws

ArXiv ID: 2506.05333

Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen

Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over 60 points gains in low-cost regimes and over 5 points gains in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.

Comment: The paper proposes a new scaling paradigm centered on sparse attention, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8

8. Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning

ArXiv ID: 2506.04411

Authors: Achleshwar Luthra, Tianbao Yang, Tomer Galanti

Abstract: Despite its empirical success, the theoretical foundations of self-supervised contrastive learning (CL) are not yet fully established. In this work, we address this gap by showing that standard CL objectives implicitly approximate a supervised variant we call the negatives-only supervised contrastive loss (NSCL), which excludes same-class contrasts. We prove that the gap between the CL and NSCL losses vanishes as the number of semantic classes increases, under a bound that is both label-agnostic and architecture-independent. We characterize the geometric structure of the global minimizers of the NSCL loss: the learned representations exhibit augmentation collapse, within-class collapse, and class centers that form a simplex equiangular tight frame. We further introduce a new bound on the few-shot error of linear-probing. This bound depends on two measures of feature variability--within-class dispersion and variation along the line between class centers. We show that directional variation dominates the bound and that the within-class dispersion's effect diminishes as the number of labeled samples increases. These properties enable CL and NSCL-trained representations to support accurate few-shot label recovery using simple linear probes. Finally, we empirically validate our theoretical findings: the gap between CL and NSCL losses decays at a rate of $\mathcal{O}(\frac{1}{\#\text{classes}})$; the two losses are highly correlated; minimizing the CL loss implicitly brings the NSCL loss close to the value achieved by direct minimization; and the proposed few-shot error bound provides a tight estimate of probing performance in practice.

Comment: The paper provides theoretical insights into self-supervised contrastive learning, which is relevant to representation learning.

Relevance: 9 Novelty: 8
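
The "simplex equiangular tight frame" mentioned in the abstract has a simple explicit form: C unit vectors whose pairwise cosine similarity is exactly -1/(C - 1). The sketch below builds one standard construction (centered one-hot vectors, rescaled to unit norm) and checks that property. It illustrates the geometry only and is not the paper's training procedure; the function names are hypothetical.

```python
import math


def simplex_etf(num_classes: int) -> list[list[float]]:
    """Return num_classes unit vectors forming a simplex ETF.

    Each vector is a one-hot vector with the mean 1/C subtracted from every
    coordinate, divided by its norm sqrt((C - 1) / C).
    """
    c = num_classes
    norm = math.sqrt((c - 1) / c)
    return [
        [((1.0 if i == j else 0.0) - 1.0 / c) / norm for j in range(c)]
        for i in range(c)
    ]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


centers = simplex_etf(4)
# Every pair of distinct class centers has cosine -1/(C - 1), here -1/3.
print(round(cosine(centers[0], centers[1]), 6))
```

This is the most spread-out arrangement of C equal-norm vectors with equal pairwise angles, which is why it appears as the limiting geometry of class centers.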

9. KOALA++: Efficient Kalman-Based Optimization of Neural Networks with Gradient-Covariance Products

ArXiv ID: 2506.04432

Authors: Zixuan Xia, Aram Davtyan, Paolo Favaro

Abstract: We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second order gradient calculation, our method directly estimates the parameter covariance matrix by recursively updating compact gradient covariance products. This design improves upon the original KOALA framework that assumed diagonal covariance by implicitly capturing richer uncertainty structure without storing the full covariance matrix and avoiding large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.

Comment: The paper introduces KOALA++, a Kalman-based optimization algorithm that models structured gradient uncertainty, which is relevant to optimization and training dynamics in neural networks.

Relevance: 9 Novelty: 8

10. On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

ArXiv ID: 2506.05249

Authors: Zhen Qin, Jinxin Zhou, Zhihui Zhu

Abstract: Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components--such as self-attention mechanisms and feedforward networks--without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.

Comment: The paper analyzes the convergence of gradient descent on Transformers with residual connections, providing insights into model architecture and training dynamics.

Relevance: 9 Novelty: 8

11. Learning normalized image densities via dual score matching

ArXiv ID: 2506.05310

Authors: Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli

Abstract: Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning \emph{normalized} energy (log probability) models that is inspired from diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this \emph{dual} score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model \emph{strongly generalizes}: estimated log probabilities are nearly independent of the specific images in the training set. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary significantly with image content, in contrast with traditional assumptions such as concentration of measure or support on a low-dimensional manifold.

Comment: The paper presents a new framework for learning normalized energy models inspired by diffusion generative models, contributing to representation learning with a novel dual score matching objective.

Relevance: 9 Novelty: 8

12. FPTQuant: Function-Preserving Transforms for LLM Quantization

ArXiv ID: 2506.04985

Authors: Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel

Abstract: Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance due to large magnitude outliers. This paper describes FPTQuant, which introduces four novel, lightweight, and expressive function-preserving transforms (FPTs) to facilitate quantization of transformers: (1) a mergeable pre-RoPE transform for queries and keys, (2) a mergeable transform for values, (3) a mergeable scaling transform within the MLP block, and (4) a cheap, dynamic scaling transform. By leveraging the equivariances and independencies inherent to canonical transformer operation, we designed these FPTs to maintain the model's function while shaping the intermediate activation distributions to be more quantization friendly. FPTQuant requires no custom kernels and adds virtually no overhead during inference. The FPTs are trained both locally to reduce outliers, and end-to-end such that the outputs of the quantized and full-precision models match. FPTQuant enables static INT4 quantization with minimal overhead and shows SOTA speed-up of up to 3.9 times over FP. Empirically, FPTQuant has an excellent accuracy-speed trade-off -- it is performing on par or exceeding most prior work and only shows slightly lower accuracy compared to a method that is up to 29% slower.

Comment: The paper introduces FPTQuant, a novel approach for LLM quantization, contributing to model compression with function-preserving transforms.

Relevance: 9 Novelty: 8
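
The general idea of a mergeable, function-preserving transform can be shown on two consecutive linear maps with no nonlinearity between them: scaling the output channels of the first weight matrix and applying the inverse scaling to the input rows of the second leaves the final output unchanged while shrinking an outlier activation channel. The sketch below is a generic illustration of that principle, not one of the paper's four transforms; the matrices and the per-channel scales are hypothetical values chosen for the example.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def scale_columns(w, s):
    """Multiply column j of w by s[j] (merged into the first layer)."""
    return [[w[i][j] * s[j] for j in range(len(s))] for i in range(len(w))]


def scale_rows(w, s):
    """Multiply row i of w by s[i] (merged into the second layer)."""
    return [[w[i][j] * s[i] for j in range(len(w[0]))] for i in range(len(w))]


x = [[1.0, -2.0]]
w1 = [[0.5, 40.0], [-0.25, -30.0]]  # second output channel is an outlier
w2 = [[1.0, 2.0], [0.01, -0.02]]

scales = [1.0, 1.0 / 64]  # hypothetical per-channel scales
w1_t = scale_columns(w1, scales)
w2_t = scale_rows(w2, [1.0 / s for s in scales])

h, h_t = matmul(x, w1), matmul(x, w1_t)      # intermediate activations
y, y_t = matmul(h, w2), matmul(h_t, w2_t)    # final outputs

print("activation range before:", max(abs(v) for v in h[0]))
print("activation range after: ", max(abs(v) for v in h_t[0]))
print("outputs match:", all(abs(a - b) < 1e-9 for a, b in zip(y[0], y_t[0])))
```

Because both scalings fold into existing weights, a transform of this kind adds no operations at inference time; only the distribution of the intermediate activations changes, which is what matters for quantization.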

13. Sparse Autoencoders, Again?

ArXiv ID: 2506.04859

Authors: Yin Lu, Tong He, Xuening Zhu, David Wipf

Abstract: Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could reflect, among other things, correlation patterns in large language model activations, or complex natural image manifolds. And yet despite the wide-ranging applicability, there have been relatively few changes to SAEs beyond the original recipe from decades ago, namely, standard deep encoder/decoder layers trained with a classical/deterministic sparse regularizer applied within the latent space. One possible exception is the variational autoencoder (VAE), which adopts a stochastic encoder module capable of producing sparse representations when applied to manifold data. In this work we formalize underappreciated weaknesses with both canonical SAEs, as well as analogous VAEs applied to similar tasks, and propose a hybrid alternative model that circumvents these prior limitations. In terms of theoretical support, we prove that global minima of our proposed model recover certain forms of structured data spread across a union of manifolds. Meanwhile, empirical evaluations on synthetic and real-world datasets substantiate the efficacy of our approach in accurately estimating underlying manifold dimensions and producing sparser latent representations without compromising reconstruction error. In general, we are able to exceed the performance of equivalent-capacity SAEs and VAEs, as well as recent diffusion models where applicable, within domains such as images and language model activation patterns.

Comment: The paper revisits sparse autoencoders, proposing a hybrid model that addresses weaknesses in canonical SAEs and VAEs, contributing to representation learning.

Relevance: 9 Novelty: 8

14. Adaptive Preconditioners Trigger Loss Spikes in Adam

ArXiv ID: 2506.04805

Authors: Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu

Abstract: Loss spikes emerge commonly during training across neural networks of varying architectures and scales when using the Adam optimizer. In this work, we investigate the underlying mechanism responsible for Adam spikes. While previous explanations attribute these phenomena to the lower-loss-as-sharper characteristics of the loss landscape, our analysis reveals that Adam's adaptive preconditioners themselves can trigger spikes. Specifically, we identify a critical regime where squared gradients become substantially smaller than the second-order moment estimates, causing the latter to undergo a $\beta_2$-exponential decay and to respond sluggishly to current gradient information. This mechanism can push the maximum eigenvalue of the preconditioned Hessian beyond the classical stability threshold $2/\eta$ for a sustained period, inducing instability. This instability further leads to an alignment between the gradient and the maximum eigendirection, and a loss spike occurs precisely when the gradient-directional curvature exceeds $2/\eta$. We verify this mechanism through extensive experiments on fully connected networks, convolutional networks, and Transformer architectures.

Comment: The paper investigates the mechanism behind loss spikes in the Adam optimizer, providing insights into training dynamics in neural networks.

Relevance: 9 Novelty: 7
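
The mechanism in the abstract can be traced in a one-dimensional toy model. When the squared gradient is much smaller than the second-moment estimate v, the update v <- beta2 * v + (1 - beta2) * g^2 makes v decay roughly like beta2^t, so the preconditioned curvature h / (sqrt(v) + eps) keeps growing until it passes the stability threshold 2 / lr. The sketch below counts the steps until that happens. It ignores momentum and bias correction, holds the gradient and curvature fixed, and uses hypothetical parameter values, so it is a simplified illustration rather than the paper's analysis.

```python
import math


def steps_until_unstable(v0, g_sq, curvature, lr,
                         beta2=0.999, eps=1e-8, max_steps=100_000):
    """Count steps until curvature / (sqrt(v) + eps) exceeds 2 / lr.

    v follows Adam's second-moment update with a constant squared gradient
    g_sq. Returns None if the threshold is not crossed within max_steps.
    """
    v = v0
    for step in range(max_steps):
        preconditioned = curvature / (math.sqrt(v) + eps)
        if preconditioned > 2.0 / lr:
            return step
        v = beta2 * v + (1.0 - beta2) * g_sq
    return None


# Tiny gradients: v decays and the threshold is eventually crossed.
print(steps_until_unstable(v0=1.0, g_sq=1e-12, curvature=100.0, lr=1e-3))
# Gradients comparable to v: v stays put and the threshold is never crossed.
print(steps_until_unstable(v0=1.0, g_sq=1.0, curvature=100.0, lr=1e-3))
```

In this toy setting a smaller beta2 shortens the time to instability, since v forgets its earlier, larger values faster.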

15. Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

ArXiv ID: 2506.04430

Authors: Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Pavel Plyusnin, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov

Abstract: Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM

Comment: The paper proposes zero-order momentum optimizers (JAGUAR SignSGD and JAGUAR Muon) with convergence guarantees for memory-efficient LLM fine-tuning, which is relevant to the efficiency criterion.

Relevance: 8 Novelty: 8
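
Illustration (not from the paper): a generic zero-order sign step with momentum, to make concrete what "zero-order" and "$\mathcal{O}(1)$ function evaluations per iteration" mean in the abstract above. This is not the JAGUAR algorithm and makes no claim about its memory footprint. It assumes a two-point finite difference along one randomly chosen coordinate, a plain momentum buffer, and a sign update. All names and hyperparameters are hypothetical.

```python
import random


def zo_sign_step(f, x, m, lr=0.01, tau=1e-3, beta=0.9, rng=random):
    """One zero-order sign step using exactly two evaluations of f.

    x is a list of floats (left unmodified); m is the momentum buffer, updated in place.
    """
    i = rng.randrange(len(x))              # pick one coordinate to probe
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += tau
    x_minus[i] -= tau
    # Central finite difference: estimate of the i-th partial derivative.
    g_i = (f(x_plus) - f(x_minus)) / (2.0 * tau)
    m[i] = beta * m[i] + (1.0 - beta) * g_i
    # Sign update; coordinates whose momentum is still exactly zero do not move.
    return [xj - lr * ((mj > 0) - (mj < 0)) for xj, mj in zip(x, m)]


def minimize(f, x0, steps=2000, seed=0, **kwargs):
    """Run zo_sign_step repeatedly with a seeded random generator."""
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        x = zo_sign_step(f, x, m, rng=rng, **kwargs)
    return x


def shifted_quadratic(x):
    """Toy objective with its minimum at (1, 1, ..., 1)."""
    return sum((xj - 1.0) ** 2 for xj in x)


if __name__ == "__main__":
    x_final = minimize(shifted_quadratic, [0.0, 0.0, 0.0])
    print("final point:", x_final, "loss:", shifted_quadratic(x_final))
```

No gradients are computed or stored; each iteration queries the objective twice. With a constant step size the iterates hover near the minimum rather than converging exactly, which is expected for sign-based updates.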

16. NIMO: a Nonlinear Interpretable MOdel

ArXiv ID: 2506.05059

Authors: Shijian Xu, Marcello Massimo Negri, Volker Roth

Abstract: Neural networks (NNs) have achieved tremendous success over the past decade, yet they are still extremely difficult to interpret. In contrast, linear models are less expressive but offer inherent interpretability. Linear coefficients are interpretable as the marginal effect of a feature on the prediction, assuming all other features are kept fixed. To combine the benefits of both approaches, we introduce NIMO (Nonlinear Interpretable MOdel). The key idea is to define a model where the NN is designed to learn nonlinear corrections to the linear model predictions, while also maintaining the original interpretability of the linear coefficients. Relevantly, we develop an optimization algorithm based on profile likelihood that elegantly allows for optimizing over the NN parameters while updating the linear coefficients analytically. By relying on adaptive ridge regression we can easily incorporate sparsity constraints as well. We show empirically that we can recover the underlying linear coefficients while significantly improving the predictive accuracy. Compared to other hybrid interpretable approaches, our model is the only one that actually maintains the same interpretability of linear coefficients as in linear models. We also achieve higher performance on various regression and classification settings.

Comment: The paper introduces NIMO, a model combining neural networks with linear models for interpretability, aligning with model architecture innovations.

Relevance: 8 Novelty: 8

17. DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience

ArXiv ID: 2506.04282

Authors: Runxiang Wang, Boxiao Wang, Kai Li, Yifan Zhang, Jian Cheng

Abstract: Symbolic regression is a fundamental tool for discovering interpretable mathematical expressions from data, with broad applications across scientific and engineering domains. Recently, large language models (LLMs) have demonstrated strong performance in this task, leveraging embedded scientific priors and reasoning capabilities to surpass traditional methods. However, existing LLM-based approaches, such as LLM-SR, often over-rely on internal priors, lacking explicit data understanding and systematic reflection during equation generation. To address these limitations, we propose DrSR (Dual Reasoning Symbolic Regression), a framework that combines data-driven insight with reflective learning to enhance both robustness and discovery capability. Specifically, DrSR guides LLMs to analyze structural relationships (e.g., monotonicity, nonlinearity, and correlation) within the data to generate structured descriptions. Simultaneously, it monitors equation performance and establishes a feedback loop to refine subsequent generations. By integrating data understanding and generation reflection in a closed loop, DrSR enables more efficient exploration of the symbolic expression space. Experiments across interdisciplinary datasets in physics, chemistry, biology, and materials science demonstrate that DrSR substantially improves the valid equation rate and consistently outperforms both classical and recent LLM-based methods in terms of accuracy, generalization, and search efficiency. These results underscore its potential for scientific equation discovery.

Comment: The paper introduces DrSR, a framework for symbolic regression using LLMs, which is relevant to representation learning and LLMs.

Relevance: 8 Novelty: 8

18. Aligning Latent Spaces with Flow Priors

ArXiv ID: 2506.05240

Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo

Abstract: This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.

Comment: The paper introduces a novel framework for aligning latent spaces using flow-based generative models, which is relevant to representation learning.

Relevance: 8 Novelty: 8
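
Illustration (not from the paper): a scalar toy version of an alignment loss that reuses a flow matching objective with the flow model frozen and the latent as the variable, as the abstract above describes. The sketch assumes straight-line interpolation $x_t = (1-t)x_0 + t\,y$ with target velocity $y - x_0$, and uses a hypothetical closed-form velocity field for a target that is a point mass at `mu` in place of a pretrained network. Sampling $t$ only up to 0.9 avoids the singularity of this toy field at $t = 1$.

```python
import random


def frozen_velocity(x_t, t, mu=2.0):
    """Hypothetical stand-in for a pretrained flow model.

    Exact straight-path velocity field that transports noise to a point mass at mu.
    """
    return (mu - x_t) / (1.0 - t)


def alignment_loss(latent, velocity=frozen_velocity, n_samples=256, seed=0):
    """Monte Carlo flow-matching loss with the latent in the role of the data sample.

    The velocity field is fixed; only `latent` would be optimized in practice.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.gauss(0.0, 1.0)           # noise sample
        t = rng.uniform(0.0, 0.9)          # time, kept away from t = 1
        x_t = (1.0 - t) * x0 + t * latent  # point on the straight path
        target = latent - x0               # straight-path velocity
        total += (velocity(x_t, t) - target) ** 2
    return total / n_samples


if __name__ == "__main__":
    for y in (2.0, 2.5, 3.0):
        print("latent", y, "alignment loss", alignment_loss(y))
```

In this toy case the residual works out to $(\mu - y)/(1-t)$, so the loss vanishes when the latent sits on the target and grows as it moves away, without any likelihood evaluation or ODE solve.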

19. The Oversmoothing Fallacy: A Misguided Narrative in GNN Research

ArXiv ID: 2506.04653

Authors: MoonJeong Park, Sunghyun Choi, Jaeseung Heo, Eunhyeok Park, Dongwoo Kim

Abstract: Oversmoothing has been recognized as a main obstacle to building deep Graph Neural Networks (GNNs), limiting the performance. This position paper argues that the influence of oversmoothing has been overstated and advocates for a further exploration of deep GNN architectures. Given the three core operations of GNNs, aggregation, linear transformation, and non-linear activation, we show that prior studies have mistakenly confused oversmoothing with the vanishing gradient, caused by transformation and activation rather than aggregation. Our finding challenges prior beliefs about oversmoothing being unique to GNNs. Furthermore, we demonstrate that classical solutions such as skip connections and normalization enable the successful stacking of deep GNN layers without performance degradation. Our results clarify misconceptions about oversmoothing and shed new light on the potential of deep GNNs.

Comment: The paper challenges the oversmoothing narrative in GNN research, providing insights into deep GNN architectures, which aligns with the model architecture criterion.