Previous Day 2025-06-05
Monthly Overview 2025-06
Next Day 2025-06-09

Personalized Daily ArXiv Papers 2025-06-06

[gpt-4o] Prompt Completion Total
Token 38313 4874 43187
Cost $0.1 $0.05 $0.14

Total arXiv papers: 662

Total scanned papers: 355

Total relevant papers: 37

Table of contents with paper titles:

  1. Transformers Meet In-Context Learning: A Universal Approximation Theory Authors: Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen

  2. NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models Authors: Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar

  3. There Was Never a Bottleneck in Concept Bottleneck Models Authors: Antonio Almud\'evar, Jos\'e Miguel Hern\'andez-Lobato, Alfonso Ortega

  4. Inference-Time Hyper-Scaling with KV Cache Compression Authors: Adrian {\L}a\'ncucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti

  5. Log-Linear Attention Authors: Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim

  6. Sample Complexity and Representation Ability of Test-time Scaling Paradigms Authors: Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao

  7. Kinetics: Rethinking Test-Time Scaling Laws Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen

  8. Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning Authors: Achleshwar Luthra, Tianbao Yang, Tomer Galanti

  9. KOALA++: Efficient Kalman-Based Optimization of Neural Networks with Gradient-Covariance Products Authors: Zixuan Xia, Aram Davtyan, Paolo Favaro

  10. On the Convergence of Gradient Descent on Learning Transformers with Residual Connections Authors: Zhen Qin, Jinxin Zhou, Zhihui Zhu

  11. Learning normalized image densities via dual score matching Authors: Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli

  12. FPTQuant: Function-Preserving Transforms for LLM Quantization Authors: Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel

  13. Sparse Autoencoders, Again? Authors: Yin Lu, Tong He, Xuening Zhu, David Wipf

  14. Adaptive Preconditioners Trigger Loss Spikes in Adam Authors: Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu

  15. Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order Authors: Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Pavel Plyusnin, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov

  16. NIMO: a Nonlinear Interpretable MOdel Authors: Shijian Xu, Marcello Massimo Negri, Volker Roth

  17. DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience Authors: Runxiang Wang, Boxiao Wang, Kai Li, Yifan Zhang, Jian Cheng

  18. Aligning Latent Spaces with Flow Priors Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo

  19. The Oversmoothing Fallacy: A Misguided Narrative in GNN Research Authors: MoonJeong Park, Sunghyun Choi, Jaeseung Heo, Eunhyeok Park, Dongwoo Kim

  20. HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training Authors: Geon-Woo Kim, Junbo Li, Shashidhar Gandham, Omar Baldonado, Adithya Gangidi, Pavan Balaji, Zhangyang Wang, Aditya Akella

  21. Exploring Diffusion Transformer Designs via Grafting Authors: Keshigeyan Chandrasegaran, Michael Poli, Daniel Y. Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, Li Fei-Fei

  22. Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets

  23. HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation Authors: Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Ming-Yu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher R\'e, David W. Romero

  24. Relational reasoning and inductive bias in transformers trained on a transitive inference task Authors: Jesse Geerts, Stephanie Chan, Claudia Clopath, Kimberly Stachenfeld

  25. Exploring bidirectional bounds for minimax-training of Energy-based models Authors: Cong Geng, Jia Wang, Li Chen, Zhiyong Gao, Jes Frellsen, S{\o}ren Hauberg

  26. Tight analyses of first-order methods with error feedback Authors: Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut

  27. Aligning Multimodal Representations through an Information Bottleneck Authors: Antonio Almud\'evar, Jos\'e Miguel Hern\'andez-Lobato, Sameer Khurana, Ricard Marxer, Alfonso Ortega

  28. Associative Memory and Generative Diffusion in the Zero-noise Limit Authors: Joshua Hess, Quaid Morris

  29. Robust Moment Identification for Nonlinear PDEs via a Neural ODE Approach Authors: Shaoxuan Chen, Su Yang, Panayotis G. Kevrekidis, Wei Zhu

  30. Semi-Implicit Variational Inference via Kernelized Path Gradient Descent Authors: Tobias Pielok, Bernd Bischl, David R\"ugamer

  31. Identifying and Understanding Cross-Class Features in Adversarial Training Authors: Zeming Wei, Yiwen Guo, Yisen Wang

  32. Hierarchical Implicit Neural Emulators Authors: Ruoxi Jiang, Xiao Zhang, Karan Jakhar, Peter Y. Lu, Pedram Hassanzadeh, Michael Maire, Rebecca Willett

  33. Power Law Guided Dynamic Sifting for Efficient Attention Authors: Nirav Koley, Prajwal Singhania, Abhinav Bhatele

  34. MesaNet: Sequence Modeling by Locally Optimal Test-Time Training Authors: Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Ag\"uera y Arcas, Jo\~ao Sacramento

  35. You Only Train Once Authors: Christos Sakaridis

  36. Half-Layered Neural Networks Authors: Ethem Alpaydin

  37. Towards Reasonable Concept Bottleneck Models Authors: Nektarios Kalampalikis, Kavya Gupta, Georgi Vitanov, Isabel Valera


1. Transformers Meet In-Context Learning: A Universal Approximation Theory

ArXiv ID: 2506.05200

Authors: Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen

Abstract: Modern large language models are capable of in-context learning, the ability to perform new tasks at inference time using only a handful of input-output examples in the prompt, without any fine-tuning or parameter updates. We develop a universal approximation theory to better understand how transformers enable in-context learning. For any class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can perform reliable prediction given only a few in-context examples. In contrast to much of the recent literature that frames transformers as algorithm approximators -- i.e., constructing transformers to emulate the iterations of optimization algorithms as a means to approximate solutions of learning problems -- our work adopts a fundamentally different approach rooted in universal function approximation. This alternative approach offers approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being approximated, thereby extending far beyond convex problems and linear function classes. Our construction sheds light on how transformers can simultaneously learn general-purpose representations and adapt dynamically to in-context examples.

Comment: The paper develops a universal approximation theory for transformers in in-context learning, providing theoretical insights into LLM behavior.

Relevance: 9 Novelty: 9


2. NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models

ArXiv ID: 2506.04536

Authors: Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar

Abstract: Characterizing the diverse computational properties of human neurons via multimodal electrophysiological, transcriptomic, and morphological data provides the foundation for constructing and validating bio-realistic neuron models that can advance our understanding of fundamental mechanisms underlying brain function. However, current modeling approaches remain constrained by the limited availability and intrinsic variability of experimental neuronal data. To capture variability, ensembles of deterministic models are often used, but are difficult to scale as model generation requires repeating computationally expensive optimization for each neuron. While deep learning is becoming increasingly relevant in this space, it fails to capture the full biophysical complexity of neurons, their nonlinear voltage dynamics, and variability. To address these shortcomings, we introduce NOBLE, a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. Trained on data generated from biophysically realistic neuron models, NOBLE predicts distributions of neural dynamics accounting for the intrinsic experimental variability. Unlike conventional bio-realistic neuron models, interpolating within the embedding space offers models whose dynamics are consistent with experimentally observed responses. NOBLE is the first scaled-up deep learning framework validated on real experimental data, enabling efficient generation of synthetic neurons that exhibit trial-to-trial variability and achieve a $4200\times$ speedup over numerical solvers. To this end, NOBLE captures fundamental neural properties, opening the door to a better understanding of cellular composition and computations, neuromorphic architectures, large-scale brain circuits, and general neuroAI applications.

Comment: The paper introduces NOBLE, a neural operator framework for modeling biological neurons, which aligns with foundational research in AI for Science, focusing on new generative paradigms and architecture-level innovations.

Relevance: 9 Novelty: 8


3. There Was Never a Bottleneck in Concept Bottleneck Models

ArXiv ID: 2506.04877

Authors: Antonio Almud\'evar, Jos\'e Miguel Hern\'andez-Lobato, Alfonso Ortega

Abstract: Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs support concept-level interventions with theoretical guarantees, remain consistent with Bayesian principles, and offer greater flexibility in key design choices.

Comment: The paper proposes Minimal Concept Bottleneck Models, which aligns with representation learning by introducing a new method for interpretability and information bottleneck.

Relevance: 9 Novelty: 8


4. Inference-Time Hyper-Scaling with KV Cache Compression

ArXiv ID: 2506.05345

Authors: Adrian {\L}a\'ncucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti

Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.

Comment: The paper introduces Dynamic Memory Sparsification for KV cache compression, aligning with model compression and efficiency improvements.

Relevance: 9 Novelty: 8


5. Log-Linear Attention

ArXiv ID: 2506.04761

Authors: Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim

Abstract: The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures -- Mamba-2 and Gated DeltaNet -- and find they perform well compared to their linear-time variants.

Comment: The paper introduces log-linear attention, a novel attention mechanism balancing efficiency and expressiveness, relevant to model architecture.

Relevance: 9 Novelty: 8


6. Sample Complexity and Representation Ability of Test-time Scaling Paradigms

ArXiv ID: 2506.05295

Authors: Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao

Abstract: Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.

Comment: The paper provides theoretical insights into test-time scaling paradigms for LLMs, which is relevant to understanding LLM behavior.

Relevance: 9 Novelty: 8


7. Kinetics: Rethinking Test-Time Scaling Laws

ArXiv ID: 2506.05333

Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen

Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over 60 points gains in low-cost regimes and over 5 points gains in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.

Comment: The paper proposes a new scaling paradigm centered on sparse attention, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


8. Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning

ArXiv ID: 2506.04411

Authors: Achleshwar Luthra, Tianbao Yang, Tomer Galanti

Abstract: Despite its empirical success, the theoretical foundations of self-supervised contrastive learning (CL) are not yet fully established. In this work, we address this gap by showing that standard CL objectives implicitly approximate a supervised variant we call the negatives-only supervised contrastive loss (NSCL), which excludes same-class contrasts. We prove that the gap between the CL and NSCL losses vanishes as the number of semantic classes increases, under a bound that is both label-agnostic and architecture-independent. We characterize the geometric structure of the global minimizers of the NSCL loss: the learned representations exhibit augmentation collapse, within-class collapse, and class centers that form a simplex equiangular tight frame. We further introduce a new bound on the few-shot error of linear-probing. This bound depends on two measures of feature variability--within-class dispersion and variation along the line between class centers. We show that directional variation dominates the bound and that the within-class dispersion's effect diminishes as the number of labeled samples increases. These properties enable CL and NSCL-trained representations to support accurate few-shot label recovery using simple linear probes. Finally, we empirically validate our theoretical findings: the gap between CL and NSCL losses decays at a rate of $\mathcal{O}(\frac{1}{#\text{classes}})$; the two losses are highly correlated; minimizing the CL loss implicitly brings the NSCL loss close to the value achieved by direct minimization; and the proposed few-shot error bound provides a tight estimate of probing performance in practice.

Comment: The paper provides theoretical insights into self-supervised contrastive learning, which is relevant to representation learning.

Relevance: 9 Novelty: 8


9. KOALA++: Efficient Kalman-Based Optimization of Neural Networks with Gradient-Covariance Products

ArXiv ID: 2506.04432

Authors: Zixuan Xia, Aram Davtyan, Paolo Favaro

Abstract: We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second order gradient calculation, our method directly estimates the parameter covariance matrix by recursively updating compact gradient covariance products. This design improves upon the original KOALA framework that assumed diagonal covariance by implicitly capturing richer uncertainty structure without storing the full covariance matrix and avoiding large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.

Comment: The paper introduces KOALA++, a Kalman-based optimization algorithm that models structured gradient uncertainty, which is relevant to representation learning and training dynamics in neural networks.

Relevance: 9 Novelty: 8


10. On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

ArXiv ID: 2506.05249

Authors: Zhen Qin, Jinxin Zhou, Zhihui Zhu

Abstract: Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components--such as self-attention mechanisms and feedforward networks--without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.

Comment: The paper analyzes the convergence of gradient descent on Transformers with residual connections, providing insights into model architecture and training dynamics.

Relevance: 9 Novelty: 8


11. Learning normalized image densities via dual score matching

ArXiv ID: 2506.05310

Authors: Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli

Abstract: Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning \emph{normalized} energy (log probability) models that is inspired from diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this \emph{dual} score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model \emph{strongly generalizes}: estimated log probabilities are nearly independent of the specific images in the training set. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary significantly with image content, in contrast with traditional assumptions such as concentration of measure or support on a low-dimensional manifold.

Comment: The paper presents a new framework for learning normalized energy models inspired by diffusion generative models, contributing to representation learning with a novel dual score matching objective.

Relevance: 9 Novelty: 8


12. FPTQuant: Function-Preserving Transforms for LLM Quantization

ArXiv ID: 2506.04985

Authors: Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel

Abstract: Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance due to large magnitude outliers. This paper describes FPTQuant, which introduces four novel, lightweight, and expressive function-preserving transforms (FPTs) to facilitate quantization of transformers: (1) a mergeable pre-RoPE transform for queries and keys, (2) a mergeable transform for values, (3) a mergeable scaling transform within the MLP block, and (4) a cheap, dynamic scaling transform. By leveraging the equivariances and independencies inherent to canonical transformer operation, we designed these FPTs to maintain the model's function while shaping the intermediate activation distributions to be more quantization friendly. FPTQuant requires no custom kernels and adds virtually no overhead during inference. The FPTs are trained both locally to reduce outliers, and end-to-end such that the outputs of the quantized and full-precision models match. FPTQuant enables static INT4 quantization with minimal overhead and shows SOTA speed-up of up to 3.9 times over FP. Empirically, FPTQuant has an excellent accuracy-speed trade-off -- it is performing on par or exceeding most prior work and only shows slightly lower accuracy compared to a method that is up to 29% slower.

Comment: The paper introduces FPTQuant, a novel approach for LLM quantization, contributing to model compression with function-preserving transforms.

Relevance: 9 Novelty: 8


13. Sparse Autoencoders, Again?

ArXiv ID: 2506.04859

Authors: Yin Lu, Tong He, Xuening Zhu, David Wipf

Abstract: Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could reflect, among other things, correlation patterns in large language model activations, or complex natural image manifolds. And yet despite the wide-ranging applicability, there have been relatively few changes to SAEs beyond the original recipe from decades ago, namely, standard deep encoder/decoder layers trained with a classical/deterministic sparse regularizer applied within the latent space. One possible exception is the variational autoencoder (VAE), which adopts a stochastic encoder module capable of producing sparse representations when applied to manifold data. In this work we formalize underappreciated weaknesses with both canonical SAEs, as well as analogous VAEs applied to similar tasks, and propose a hybrid alternative model that circumvents these prior limitations. In terms of theoretical support, we prove that global minima of our proposed model recover certain forms of structured data spread across a union of manifolds. Meanwhile, empirical evaluations on synthetic and real-world datasets substantiate the efficacy of our approach in accurately estimating underlying manifold dimensions and producing sparser latent representations without compromising reconstruction error. In general, we are able to exceed the performance of equivalent-capacity SAEs and VAEs, as well as recent diffusion models where applicable, within domains such as images and language model activation patterns.

Comment: The paper revisits sparse autoencoders, proposing a hybrid model that addresses weaknesses in canonical SAEs and VAEs, contributing to representation learning.

Relevance: 9 Novelty: 8


14. Adaptive Preconditioners Trigger Loss Spikes in Adam

ArXiv ID: 2506.04805

Authors: Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu

Abstract: Loss spikes emerge commonly during training across neural networks of varying architectures and scales when using the Adam optimizer. In this work, we investigate the underlying mechanism responsible for Adam spikes. While previous explanations attribute these phenomena to the lower-loss-as-sharper characteristics of the loss landscape, our analysis reveals that Adam's adaptive preconditioners themselves can trigger spikes. Specifically, we identify a critical regime where squared gradients become substantially smaller than the second-order moment estimates, causing the latter to undergo a $\beta_2$-exponential decay and to respond sluggishly to current gradient information. This mechanism can push the maximum eigenvalue of the preconditioned Hessian beyond the classical stability threshold $2/\eta$ for a sustained period, inducing instability. This instability further leads to an alignment between the gradient and the maximum eigendirection, and a loss spike occurs precisely when the gradient-directional curvature exceeds $2/\eta$. We verify this mechanism through extensive experiments on fully connected networks, convolutional networks, and Transformer architectures.

Comment: The paper investigates the mechanism behind loss spikes in the Adam optimizer, providing insights into training dynamics in neural networks.

Relevance: 9 Novelty: 7


15. Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

ArXiv ID: 2506.04430

Authors: Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Pavel Plyusnin, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov

Abstract: Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM

Comment: The paper introduces zero-order optimization methods for fine-tuning LLMs, which is relevant to model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 8


16. NIMO: a Nonlinear Interpretable MOdel

ArXiv ID: 2506.05059

Authors: Shijian Xu, Marcello Massimo Negri, Volker Roth

Abstract: Neural networks (NNs) have achieved tremendous success over the past decade, yet they are still extremely difficult to interpret. In contrast, linear models are less expressive but offer inherent interpretability. Linear coefficients are interpretable as the marginal effect of a feature on the prediction, assuming all other features are kept fixed. To combine the benefits of both approaches, we introduce NIMO (Nonlinear Interpretable MOdel). The key idea is to define a model where the NN is designed to learn nonlinear corrections to the linear model predictions, while also maintaining the original interpretability of the linear coefficients. Relevantly, we develop an optimization algorithm based on profile likelihood that elegantly allows for optimizing over the NN parameters while updating the linear coefficients analytically. By relying on adaptive ridge regression we can easily incorporate sparsity constraints as well. We show empirically that we can recover the underlying linear coefficients while significantly improving the predictive accuracy. Compared to other hybrid interpretable approaches, our model is the only one that actually maintains the same interpretability of linear coefficients as in linear models. We also achieve higher performance on various regression and classification settings.

Comment: The paper introduces NIMO, a model combining neural networks with linear models for interpretability, aligning with model architecture innovations.

Relevance: 8 Novelty: 8


17. DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience

ArXiv ID: 2506.04282

Authors: Runxiang Wang, Boxiao Wang, Kai Li, Yifan Zhang, Jian Cheng

Abstract: Symbolic regression is a fundamental tool for discovering interpretable mathematical expressions from data, with broad applications across scientific and engineering domains. Recently, large language models (LLMs) have demonstrated strong performance in this task, leveraging embedded scientific priors and reasoning capabilities to surpass traditional methods. However, existing LLM-based approaches, such as LLM-SR, often over-rely on internal priors, lacking explicit data understanding and systematic reflection during equation generation. To address these limitations, we propose DrSR (Dual Reasoning Symbolic Regression), a framework that combines data-driven insight with reflective learning to enhance both robustness and discovery capability. Specifically, DrSR guides LLMs to analyze structural relationships (e.g., monotonicity, nonlinearity, and correlation) within the data to generate structured descriptions. Simultaneously, it monitors equation performance and establishes a feedback loop to refine subsequent generations. By integrating data understanding and generation reflection in a closed loop, DrSR enables more efficient exploration of the symbolic expression space. Experiments across interdisciplinary datasets in physics, chemistry, biology, and materials science demonstrate that DrSR substantially improves the valid equation rate and consistently outperforms both classical and recent LLM-based methods in terms of accuracy, generalization, and search efficiency. These results underscore its potential for scientific equation discovery.

Comment: The paper introduces DrSR, a framework for symbolic regression using LLMs, which is relevant to representation learning and LLMs.

Relevance: 8 Novelty: 8


18. Aligning Latent Spaces with Flow Priors

ArXiv ID: 2506.05240

Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo

Abstract: This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.

Comment: The paper introduces a novel framework for aligning latent spaces using flow-based generative models, which is relevant to representation learning.

Relevance: 8 Novelty: 8


19. The Oversmoothing Fallacy: A Misguided Narrative in GNN Research

ArXiv ID: 2506.04653

Authors: MoonJeong Park, Sunghyun Choi, Jaeseung Heo, Eunhyeok Park, Dongwoo Kim

Abstract: Oversmoothing has been recognized as a main obstacle to building deep Graph Neural Networks (GNNs), limiting the performance. This position paper argues that the influence of oversmoothing has been overstated and advocates for a further exploration of deep GNN architectures. Given the three core operations of GNNs, aggregation, linear transformation, and non-linear activation, we show that prior studies have mistakenly confused oversmoothing with the vanishing gradient, caused by transformation and activation rather than aggregation. Our finding challenges prior beliefs about oversmoothing being unique to GNNs. Furthermore, we demonstrate that classical solutions such as skip connections and normalization enable the successful stacking of deep GNN layers without performance degradation. Our results clarify misconceptions about oversmoothing and shed new light on the potential of deep GNNs.

Comment: The paper challenges the oversmoothing narrative in GNN research, providing insights into deep GNN architectures, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


20. HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training

ArXiv ID: 2506.04531

Authors: Geon-Woo Kim, Junbo Li, Shashidhar Gandham, Omar Baldonado, Adithya Gangidi, Pavan Balaji, Zhangyang Wang, Aditya Akella

Abstract: Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD-matching or exceeding accuracy on standard language modeling and downstream benchmarks-while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.

Comment: The paper proposes HALoS, a hierarchical asynchronous optimization framework for LLM training, which aligns with model architecture innovations and efficiency improvements.

Relevance: 8 Novelty: 7


21. Exploring Diffusion Transformer Designs via Grafting

ArXiv ID: 2506.05340

Authors: Keshigeyan Chandrasegaran, Michael Poli, Daniel Y. Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, Li Fei-Fei

Abstract: Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38-2.64 vs. 2.27 for DiT-XL/2) using <2% pretraining compute. We then graft a text-to-image model (PixArt-Sigma), achieving a 1.43x speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2x and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring. Code and grafted models: https://grafting.stanford.edu

Comment: The paper explores diffusion transformer designs via grafting, which aligns with model architecture innovations, particularly in the context of transformers.

Relevance: 8 Novelty: 7


22. Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

ArXiv ID: 2506.05229

Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets

Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.

Comment: The paper introduces Diagonal Batching to improve parallelism in Recurrent Memory Transformers, aligning with model architecture and efficiency improvements.

Relevance: 8 Novelty: 7


23. HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

ArXiv ID: 2506.04421

Authors: Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Ming-Yu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher R\'e, David W. Romero

Abstract: Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.

Comment: The paper introduces HMAR, a new image generation algorithm, which aligns with model architecture innovations, particularly in auto-regressive models.

Relevance: 8 Novelty: 7


24. Relational reasoning and inductive bias in transformers trained on a transitive inference task

ArXiv ID: 2506.04289

Authors: Jesse Geerts, Stephanie Chan, Claudia Clopath, Kimberly Stachenfeld

Abstract: Transformer-based models have demonstrated remarkable reasoning abilities, but the mechanisms underlying relational reasoning in different learning regimes remain poorly understood. In this work, we investigate how transformers perform a classic relational reasoning task from the Psychology literature, \textit{transitive inference}, which requires inference about indirectly related items by integrating information across observed adjacent item pairs (e.g., if A>B and B>C, then A>C). We compare transitive inference behavior across two distinct learning regimes: in-weights learning (IWL), where models store information in network parameters, and in-context learning (ICL), where models flexibly utilize information presented within the input sequence. Our findings reveal that IWL naturally induces a generalization bias towards transitive inference, despite being trained only on adjacent items, whereas ICL models trained solely on adjacent items do not generalize transitively. Mechanistic analysis shows that ICL models develop induction circuits that implement a simple match-and-copy strategy that performs well at relating adjacent pairs, but does not encoding hierarchical relationships among indirectly related items. Interestingly, when pre-trained on in-context linear regression tasks, transformers successfully exhibit in-context generalizable transitive inference. Moreover, like IWL, they display both \textit{symbolic distance} and \textit{terminal item effects} characteristic of human and animal performance, without forming induction circuits. These results suggest that pre-training on tasks with underlying structure promotes the development of representations that can scaffold in-context relational reasoning.

Comment: The paper investigates relational reasoning in transformers, which is relevant to understanding transformer behavior and representation learning.

Relevance: 8 Novelty: 7


25. Exploring bidirectional bounds for minimax-training of Energy-based models

ArXiv ID: 2506.04609

Authors: Cong Geng, Jia Wang, Li Chen, Zhiyong Gao, Jes Frellsen, S{\o}ren Hauberg

Abstract: Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate, the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.

Comment: The paper explores training dynamics in energy-based models, which aligns with representation learning. It introduces bidirectional bounds to stabilize training, offering theoretical insights.

Relevance: 8 Novelty: 7


26. Tight analyses of first-order methods with error feedback

ArXiv ID: 2506.05271

Authors: Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut

Abstract: Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes -- most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ -- were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method -- with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in a simplified yet representative setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.

Comment: The paper provides a tight analysis of error feedback methods in distributed learning, focusing on compression and efficiency, which is relevant to model compression.

Relevance: 8 Novelty: 7


27. Aligning Multimodal Representations through an Information Bottleneck

ArXiv ID: 2506.04870

Authors: Antonio Almud\'evar, Jos\'e Miguel Hern\'andez-Lobato, Sameer Khurana, Ricard Marxer, Alfonso Ortega

Abstract: Contrastive losses have been extensively used as a tool for multimodal representation learning. However, it has been empirically observed that their use is not effective to learn an aligned representation space. In this paper, we argue that this phenomenon is caused by the presence of modality-specific information in the representation space. Although some of the most widely used contrastive losses maximize the mutual information between representations of both modalities, they are not designed to remove the modality-specific information. We give a theoretical description of this problem through the lens of the Information Bottleneck Principle. We also empirically analyze how different hyperparameters affect the emergence of this phenomenon in a controlled experimental setup. Finally, we propose a regularization term in the loss function that is derived by means of a variational approximation and aims to increase the representational alignment. We analyze in a set of controlled experiments and real-world applications the advantages of including this regularization term.

Comment: The paper discusses aligning multimodal representations using an information bottleneck, which is relevant to representation learning.

Relevance: 8 Novelty: 7


28. Associative Memory and Generative Diffusion in the Zero-noise Limit

ArXiv ID: 2506.05178

Authors: Joshua Hess, Quaid Morris

Abstract: Connections between generative diffusion and continuous-state associative memory models are studied. Morse-Smale dynamical systems are emphasized as universal approximators of gradient-based associative memory models and diffusion models as white-noise perturbed systems thereof. Universal properties of associative memory that follow from this description are described and used to characterize a generic transition from generation to memory as noise levels diminish. Structural stability inherited by Morse-Smale flows is shown to imply a notion of stability for diffusions at vanishing noise levels. Applied to one- and two-parameter families of gradients, this indicates stability at all but isolated points of associative memory learning landscapes and the learning and generation landscapes of diffusion models with gradient drift in the zero-noise limit, at which small sets of generic bifurcations characterize qualitative transitions between stable systems. Examples illustrating the characterization of these landscapes by sequences of these bifurcations are given, along with structural stability criterion for classic and modern Hopfield networks (equivalently, the attention mechanism).

Comment: The paper explores connections between generative diffusion and associative memory models, which is relevant to emerging trends in foundational research.

Relevance: 8 Novelty: 7


29. Robust Moment Identification for Nonlinear PDEs via a Neural ODE Approach

ArXiv ID: 2506.05245

Authors: Shaoxuan Chen, Su Yang, Panayotis G. Kevrekidis, Wei Zhu

Abstract: We propose a data-driven framework for learning reduced-order moment dynamics from PDE-governed systems using Neural ODEs. In contrast to derivative-based methods like SINDy, which necessitate densely sampled data and are sensitive to noise, our approach based on Neural ODEs directly models moment trajectories, enabling robust learning from sparse and potentially irregular time series. Using as an application platform the nonlinear Schr\"{o}dinger equation, the framework accurately recovers governing moment dynamics when closure is available, even with limited and irregular observations. For systems without analytical closure, we introduce a data-driven coordinate transformation strategy based on Stiefel manifold optimization, enabling the discovery of low-dimensional representations in which the moment dynamics become closed, facilitating interpretable and reliable modeling. We also explore cases where a closure model is not known, such as a Fisher-KPP reaction-diffusion system. Here we demonstrate that Neural ODEs can still effectively approximate the unclosed moment dynamics and achieve superior extrapolation accuracy compared to physical-expert-derived ODE models. This advantage remains robust even under sparse and irregular sampling, highlighting the method's robustness in data-limited settings. Our results highlight the Neural ODE framework as a powerful and flexible tool for learning interpretable, low-dimensional moment dynamics in complex PDE-governed systems.

Comment: The paper proposes a framework using Neural ODEs for learning reduced-order moment dynamics, which is relevant to representation learning and emerging trends.

Relevance: 8 Novelty: 7


30. Semi-Implicit Variational Inference via Kernelized Path Gradient Descent

ArXiv ID: 2506.05088

Authors: Tobias Pielok, Bernd Bischl, David R\"ugamer

Abstract: Semi-implicit variational inference (SIVI) is a powerful framework for approximating complex posterior distributions, but training with the Kullback-Leibler (KL) divergence can be challenging due to high variance and bias in high-dimensional settings. While current state-of-the-art semi-implicit variational inference methods, particularly Kernel Semi-Implicit Variational Inference (KSIVI), have been shown to work in high dimensions, training remains moderately expensive. In this work, we propose a kernelized KL divergence estimator that stabilizes training through nonparametric smoothing. To further reduce the bias, we introduce an importance sampling correction. We provide a theoretical connection to the amortized version of the Stein variational gradient descent, which estimates the score gradient via Stein's identity, showing that both methods minimize the same objective, but our semi-implicit approach achieves lower gradient variance. In addition, our method's bias in function space is benign, leading to more stable and efficient optimization. Empirical results demonstrate that our method outperforms or matches state-of-the-art SIVI methods in both performance and training efficiency.

Comment: The paper proposes a kernelized KL divergence estimator for semi-implicit variational inference, which is relevant to representation learning and efficiency.

Relevance: 8 Novelty: 7


31. Identifying and Understanding Cross-Class Features in Adversarial Training

ArXiv ID: 2506.05032

Authors: Zeming Wei, Yiwen Guo, Yisen Wang

Abstract: Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at https://github.com/PKU-ML/Cross-Class-Features-AT.

Comment: The paper studies adversarial training through class-wise feature attribution, which is relevant to representation learning and training dynamics.

Relevance: 8 Novelty: 7


32. Hierarchical Implicit Neural Emulators

ArXiv ID: 2506.04528

Authors: Ruoxi Jiang, Xiao Zhang, Karan Jakhar, Peter Y. Lu, Pedram Hassanzadeh, Michael Maire, Rebecca Willett

Abstract: Neural PDE solvers offer a powerful tool for modeling complex dynamical systems, but often struggle with error accumulation over long time horizons and maintaining stability and physical consistency. We introduce a multiscale implicit neural emulator that enhances long-term prediction accuracy by conditioning on a hierarchy of lower-dimensional future state representations. Drawing inspiration from the stability properties of numerical implicit time-stepping methods, our approach leverages predictions several steps ahead in time at increasing compression rates for next-timestep refinements. By actively adjusting the temporal downsampling ratios, our design enables the model to capture dynamics across multiple granularities and enforce long-range temporal coherence. Experiments on turbulent fluid dynamics show that our method achieves high short-term accuracy and produces long-term stable forecasts, significantly outperforming autoregressive baselines while adding minimal computational overhead.

Comment: The paper introduces a multiscale implicit neural emulator for neural PDE solvers, which aligns with foundational research in representation learning by enhancing long-term prediction accuracy through hierarchical state representations.

Relevance: 8 Novelty: 7


33. Power Law Guided Dynamic Sifting for Efficient Attention

ArXiv ID: 2506.05300

Authors: Nirav Koley, Prajwal Singhania, Abhinav Bhatele

Abstract: Efficient inference on GPUs using large language models remains challenging due to memory bandwidth limitations, particularly during data transfers between High Bandwidth Memory (HBM) and SRAM in attention computations. Approximate attention methods address this issue by reducing computational and memory overhead but often rely on expensive top-$k$ operations, which perform poorly on GPUs. We propose SiftAttention, a novel approximate attention method that replaces the top-$k$ step with a computationally efficient element-wise filtering operation based on a threshold value. Our intuition for doing this is based on our empirical observation that the $\tau$-th quantile of attention scores follows a predictable power-law over sequential generation steps. Exploiting this insight, our approach dynamically estimates a threshold value per prompt at each generation step. Only attention scores above this threshold and their corresponding value vectors are loaded/used to compute the attention output, reducing data movement between HBM and SRAM. Our evaluation demonstrates that SiftAttention preserves model quality better than existing approximate attention methods while reducing memory bandwidth usage when loading value vectors.

Comment: The paper proposes SiftAttention, an efficient attention method for LLMs, which aligns with model architecture innovations by improving memory bandwidth usage.

Relevance: 8 Novelty: 7


34. MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

ArXiv ID: 2506.05233

Authors: Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Ag\"uera y Arcas, Jo\~ao Sacramento

Abstract: Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.

Comment: The paper introduces MesaNet, a sequence modeling approach with a novel layer derived from an in-context regression objective, contributing to model architecture innovations.

Relevance: 8 Novelty: 7


35. You Only Train Once

ArXiv ID: 2506.04349

Authors: Christos Sakaridis

Abstract: The title of this paper is perhaps an overclaim. Of course, the process of creating and optimizing a learned model inevitably involves multiple training runs which potentially feature different architectural designs, input and output encodings, and losses. However, our method, You Only Train Once (YOTO), indeed contributes to limiting training to one shot for the latter aspect of losses selection and weighting. We achieve this by automatically optimizing loss weight hyperparameters of learned models in one shot via standard gradient-based optimization, treating these hyperparameters as regular parameters of the networks and learning them. To this end, we leverage the differentiability of the composite loss formulation which is widely used for optimizing multiple empirical losses simultaneously and model it as a novel layer which is parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end-to-end optimization scheme by defining a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuring boundedness of the identified optima. We evidence the efficacy of YOTO in jointly optimizing loss hyperparameters and regular model parameters in one shot by comparing it to the commonly used brute-force grid search across state-of-the-art networks solving two key problems in computer vision, i.e. 3D estimation and semantic segmentation, and showing that it consistently outperforms the best grid-search model on unseen test data. Code will be made publicly available.

Comment: The paper introduces a novel method for optimizing loss weight hyperparameters in one shot, which relates to representation learning by improving training dynamics.

Relevance: 8 Novelty: 7


36. Half-Layered Neural Networks

ArXiv ID: 2506.04352

Authors: Ethem Alpaydin

Abstract: We propose a ``half'' layer of hidden units that has some of its weights randomly set and some of them trained. A half unit is composed of two stages: First, it takes a weighted sum of its inputs with fixed random weights, and second, the total activation is multiplied and then translated using two modifiable weights, before the result is passed through a nonlinearity. The number of modifiable weights of each hidden unit is thus two and does not depend on the fan-in. We show how such half units can be used in the first or any later layer in a deep network, possibly following convolutional layers. Our experiments on MNIST and FashionMNIST data sets indicate the promise of half layers, where we can achieve reasonable accuracy with a reduced number of parameters due to the regularizing effect of the randomized connections.

Comment: The paper proposes a new architectural concept with 'half' layers, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


37. Towards Reasonable Concept Bottleneck Models

ArXiv ID: 2506.05014

Authors: Nektarios Kalampalikis, Kavya Gupta, Georgi Vitanov, Isabel Valera

Abstract: In this paper, we propose $\textbf{C}$oncept $\textbf{REA}$soning $\textbf{M}$odels (CREAM), a novel family of Concept Bottleneck Models (CBMs) that: (i) explicitly encodes concept-concept (${\texttt{C-C}}$) and concept-task (${\texttt{C$\rightarrow$Y}}$) relationships to enforce a desired model reasoning; and (ii) use a regularized side-channel to achieve competitive task performance, while keeping high concept importance. Specifically, CREAM architecturally embeds (bi)directed concept-concept, and concept to task relationships specified by a human expert, while severing undesired information flows (e.g., to handle mutually exclusive concepts). Moreover, CREAM integrates a black-box side-channel that is regularized to encourage task predictions to be grounded in the relevant concepts, thereby utilizing the side-channel only when necessary to enhance performance. Our experiments show that: (i) CREAM mainly relies on concepts while achieving task performance on par with black-box models; and (ii) the embedded ${\texttt{C-C}}$ and ${\texttt{C$\rightarrow$Y}}$ relationships ease model interventions and mitigate concept leakage.

Comment: The paper introduces Concept Reasoning Models (CREAM) that focus on concept bottleneck models, which is relevant to representation learning by encoding concept relationships.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.