Personalized Daily ArXiv Papers 2026-02-24
| [gpt-5] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 79884 | 64231 | 144115 |
| Cost | $0.1 | $0.64 | $0.74 |
Total arXiv papers: 1009
Total scanned papers: 583
Total relevant papers: 42
Table of contents with paper titles:
- A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs
  Authors: Zijie Liu, Jie Peng, Jinhao Duan, Zirui Liu, Kaixiong Zhou, Mingfu Liang, Luke Simon, Xi Liu, Zhaozhuo Xu, Tianlong Chen
- Why ReLU? A Bit-Model Dichotomy for Deep Network Training
  Authors: Ilan Doron-Arad, Elchanan Mossel
- PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse
  Authors: Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan
- Toward Manifest Relationality in Transformers via Symmetry Reduction
  Authors: J. François, L. Ravera
- Incremental Learning of Sparse Attention Patterns in Transformers
  Authors: Oğuz Kaan Yüksel, Rodrigo Alvarez Lucendo, Nicolas Flammarion
- Path-conditioned training: a principled way to rescale ReLU neural networks
  Authors: Arthur Lebeurrier, Titouan Vayer, Rémi Gribonval
- Regularity of Second-Order Elliptic PDEs in Spectral Barron Spaces
  Authors: Ziang Chen, Liqiang Huang, Mengxuan Yang, Shengxuan Zhou
- Adaptation to Intrinsic Dependence in Diffusion Language Models
  Authors: Yunxiao Zhao, Changxiao Cai
- RPU -- A Reasoning Processing Unit
  Authors: Matthew Adiletta, Gu-Yeon Wei, David Brooks
- Celo2: Towards Learned Optimization Free Lunch
  Authors: Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky
- IDLM: Inverse-distilled Diffusion Language Models
  Authors: David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, Alexander Korotin
- Attention Deficits in Language Models: Causal Explanations for Procedural Hallucinations
  Authors: Ahmed Karim, Fatima Sheaib, Zein Khamis, Maggie Chlon, Jad Awada, Leon Chlon
- Manifold-Aligned Generative Transport
  Authors: Xinyu Tian, Xiaotong Shen
- Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations
  Authors: Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang
- Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series
  Authors: Guoqi Yu, Juncheng Wang, Chen Yang, Jing Qin, Angelica I. Aviles-Rivero, Shujun Wang
- Scaling Laws for Precision in High-Dimensional Linear Regression
  Authors: Dechen Zhang, Xuan Tang, Yingyu Liang, Difan Zou
- Bayesian Lottery Ticket Hypothesis
  Authors: Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus Götz, Charlotte Debus
- DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
  Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
- Training-Free Generative Modeling via Kernelized Stochastic Interpolants
  Authors: Florentin Coeurdoux, Etienne Lempereur, Nathanaël Cuvelle-Magar, Thomas Eboli, Stéphane Mallat, Anastasia Borovykh, Eric Vanden-Eijnden
- On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference
  Authors: Moritz A. Zanger, Yijun Wu, Pascal R. Van der Vaart, Wendelin Böhmer, Matthijs T. J. Spaan
- K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model
  Authors: Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, Ion Stoica
- Training-Free Cross-Architecture Merging for Graph Neural Networks
  Authors: Rishabh Bhattacharya, Vikaskumar Kalsariya, Naresh Manwani
- A Markovian View of Iterative-Feedback Loops in Image Generative Models: Neural Resonance and Model Collapse
  Authors: Vibhas Kumar Vats, David J. Crandall, Samuel Goree
- Online Realizable Regression and Applications for ReLU Networks
  Authors: Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel
- Implicit Bias and Convergence of Matrix Stochastic Mirror Descent
  Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi
- I Dropped a Neural Net
  Authors: Hyunwoo Park
- A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning
  Authors: Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, Clementine Domine
- Behavior Learning (BL): Learning Hierarchical Optimization Structures from Data
  Authors: Zhenyao Ma, Yue Liang, Dongxu Li
- A Computationally Efficient Multidimensional Vision Transformer
  Authors: Alaa El Ichi, Khalide Jbilou
- Grokking Finite-Dimensional Algebra
  Authors: Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau
- Understanding the Curse of Unrolling
  Authors: Sheheryar Mehmood, Florian Knoll, Peter Ochs
- Partial Soft-Matching Distance for Neural Representational Comparison with Partial Unit Correspondence
  Authors: Chaitanya Kapoor, Alex H. Williams, Meenakshi Khosla
- Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
  Authors: Amirhossein Farzam, Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro
- Transformers for dynamical systems learn transfer operators in-context
  Authors: Anthony Bao, Jeffrey Lai, William Gilpin
- Spilled Energy in Large Language Models
  Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi
- Information-Guided Noise Allocation for Efficient Diffusion Training
  Authors: Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh-Hsin Lai, Yuki Mitsufuji, Luca Ambrogioni
- Relational Feature Caching for Accelerating Diffusion Transformers
  Authors: Byunggwan Son, Jeimin Jeon, Jeongwoo Choi, Bumsub Ham
- Insertion Based Sequence Generation with Learnable Order Dynamics
  Authors: Dhruvesh Patel, Benjamin Rozonoyer, Gaurav Pandey, Tahira Naseem, Ramón Fernandez Astudillo, Andrew McCallum
- Laplacian Multi-scale Flow Matching for Generative Modeling
  Authors: Zelin Zhao, Petr Molodyk, Haotian Xue, Yongxin Chen
- Dirichlet Scale Mixture Priors for Bayesian Neural Networks
  Authors: August Arnstad, Leiv Rønneberg, Geir Storvik
- VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention
  Authors: Jingbo Zhou, Jun Xia, Siyuan Li, Yunfan Liu, Wenjun Wang, Yufei Huang, Changxi Chi, Mutian Hong, Zhuoli Ouyang, Shu Wang, Zhongqi Wang, Xingyu Wu, Chang Yu, Stan Z. Li
- Spectral bias in physics-informed and operator learning: Analysis and mitigation guidelines
  Authors: Siavash Khodakarami, Vivek Oommen, Nazanin Ahmadi Daryakenari, Maxim Beekenkamp, George Em Karniadakis
1. A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs
ArXiv ID: 2602.19938
Authors: Zijie Liu, Jie Peng, Jinhao Duan, Zirui Liu, Kaixiong Zhou, Mingfu Liang, Luke Simon, Xi Liu, Zhaozhuo Xu, Tianlong Chen
Abstract: Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently, delivering strong accuracy under fixed compute budgets. However, SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. Prior work has focused mainly on training-time solutions such as routing regularization or auxiliary losses, leaving inference-time behavior, which is critical for deployment, less explored. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set. These insights motivate inference-time mechanisms that rebalance workloads without retraining or router modification. We propose Replicate-and-Quantize (R&Q), a training-free and near-lossless framework for dynamic workload rebalancing. In each layer, heavy-hitter experts are replicated to increase parallel capacity, while less critical experts and replicas are quantized to remain within the original memory budget. We also introduce a Load-Imbalance Score (LIS) to measure routing skew by comparing heavy-hitter load to an equal allocation baseline. Experiments across representative SMoE models and benchmarks show up to 1.4x reduction in imbalance with accuracy maintained within +/-0.6%, enabling more predictable and efficient inference.
Comment: Model Compression and Efficiency — MoE inference-time load balancing via expert replication and quantization; training-free, systems-level improvement for Sparse MoE LLMs.
Relevance: 10 Novelty: 8
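To make the idea of a load-imbalance score concrete, the sketch below compares the busiest expert's token count with the count each expert would receive under a perfectly equal split. The function name `load_imbalance_score` and the max-over-mean formula are assumptions chosen for illustration; the abstract does not state the paper's exact definition of LIS.

```python
from collections import Counter

def load_imbalance_score(assignments, num_experts):
    """Ratio of the busiest expert's load to the equal-allocation load.

    Hypothetical formula (max load / mean load), not the paper's definition.
    A value of 1.0 means perfectly balanced routing.
    """
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    if not assignments:
        raise ValueError("assignments must not be empty")
    counts = Counter(assignments)
    # Load each expert would carry if tokens were split evenly.
    equal_share = len(assignments) / num_experts
    return max(counts.values()) / equal_share

# Eight tokens routed to four experts; expert 0 is a heavy hitter.
routing = [0, 0, 0, 0, 0, 1, 2, 3]
score = load_imbalance_score(routing, num_experts=4)
print(score)
```

Under this toy definition, replicating a heavy-hitter expert would split its count across the replicas and pull the score back toward 1.0.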
2. Why ReLU? A Bit-Model Dichotomy for Deep Network Training
ArXiv ID: 2602.19017
Authors: Ilan Doron-Arad, Elchanan Mossel
Abstract: Theoretical analyses of Empirical Risk Minimization (ERM) are standardly framed within the Real-RAM model of computation. In this setting, training even simple neural networks is known to be $\exists \mathbb{R}$-complete -- a complexity class believed to be harder than NP, that characterizes the difficulty of solving systems of polynomial inequalities over the real numbers. However, this algebraic framework diverges from the reality of digital computation with finite-precision hardware. In this work, we analyze the theoretical complexity of ERM under a realistic bit-level model ($\mathsf{ERM}_{\text{bit}}$), where network parameters and inputs are constrained to be rational numbers with polynomially bounded bit-lengths. Under this model, we reveal a sharp dichotomy in tractability governed by the network's activation function. We prove that for deep networks with any polynomial activations with rational coefficients and degree at least $2$, the bit-complexity of training is severe: deciding $\mathsf{ERM}_{\text{bit}}$ is $\#P$-Hard, hence believed to be strictly harder than NP-complete problems. Furthermore, we show that determining the sign of a single partial derivative of the empirical loss function is intractable (unlikely in BPP), and deciding a specific bit in the gradient is $\#P$-Hard. This provides a complexity-theoretic perspective for the phenomenon of exploding and vanishing gradients. In contrast, we show that for piecewise-linear activations such as ReLU, the precision requirements remain manageable: $\mathsf{ERM}_{\text{bit}}$ is contained within NP (specifically NP-complete), and standard backpropagation runs in polynomial time. Our results demonstrate that finite-precision constraints are not merely implementation details but fundamental determinants of learnability.
Comment: Theoretical foundations/architecture: bit-model complexity dichotomy showing ReLU yields tractable ERM vs. polynomial activations (#P-hard).
Relevance: 9 Novelty: 9
3. PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse
ArXiv ID: 2602.18904
Authors: Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan
Abstract: Vector-quantized autoencoders deliver high-fidelity latents but suffer inherent flaws: the quantizer is non-differentiable, requires straight-through hacks, and is prone to collapse. We address these issues at the root by replacing VQ with a simple, principled, and fully differentiable alternative: an online PCA bottleneck trained via Oja's rule. The resulting model, PCA-VAE, learns an orthogonal, variance-ordered latent basis without codebooks, commitment losses, or lookup noise. Despite its simplicity, PCA-VAE exceeds VQ-GAN and SimVQ in reconstruction quality on CelebAHQ while using 10-100x fewer latent bits. It also produces naturally interpretable dimensions (e.g., pose, lighting, gender cues) without adversarial regularization or disentanglement objectives. These results suggest that PCA is a viable replacement for VQ: mathematically grounded, stable, bit-efficient, and semantically structured, offering a new direction for generative models beyond vector quantization.
Comment: Model Architecture + Compression/Efficiency: replaces vector quantization with a differentiable PCA bottleneck (Oja’s rule), yielding stable, bit-efficient autoencoders.
Relevance: 9 Novelty: 8
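For readers unfamiliar with Oja's rule, which the abstract names as the training rule for the PCA bottleneck, here is a minimal single-component sketch of the classical update w <- w + lr * y * (x - y * w) with y = w . x. It is a textbook illustration on made-up 2-D data, not the authors' implementation; the function name, learning rate, and epoch count are illustrative assumptions.

```python
import random

def oja_first_component(samples, lr=0.005, epochs=20, seed=0):
    """Estimate the leading principal direction with Oja's rule.

    Update per sample: w <- w + lr * y * (x - y * w), where y = w . x.
    Assumes the samples are (approximately) zero-mean.
    """
    rng = random.Random(seed)
    dim = len(samples[0])
    # Random unit-norm starting direction.
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    for _ in range(epochs):
        for x in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# Zero-mean toy data: large variance along axis 0, small along axis 1.
data_rng = random.Random(1)
data = [(data_rng.gauss(0.0, 2.0), data_rng.gauss(0.0, 0.3)) for _ in range(500)]
w = oja_first_component(data)
print(w)
```

The learned vector should align (up to sign) with the high-variance axis and stay close to unit norm, which is the self-normalizing property that makes the rule usable without an explicit orthonormalization step in the single-component case.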
4. Toward Manifest Relationality in Transformers via Symmetry Reduction
ArXiv ID: 2602.18948
Authors: J. François, L. Ravera
Abstract: Transformer models contain substantial internal redundancy arising from coordinate-dependent representations and continuous symmetries, in model space and in head space, respectively. While recent approaches address this by explicitly breaking symmetry, we propose a complementary framework based on symmetry reduction. We reformulate representations, attention mechanisms, and optimization dynamics in terms of invariant relational quantities, eliminating redundant degrees of freedom by construction. This perspective yields architectures that operate directly on relational structures, providing a principled geometric framework for reducing parameter redundancy and analyzing optimization.
Comment: Model Architecture: symmetry-reduced Transformer operating on invariant relational quantities to remove redundant degrees of freedom and analyze optimization.
Relevance: 9 Novelty: 8
5. Incremental Learning of Sparse Attention Patterns in Transformers
ArXiv ID: 2602.19143
Authors: Oğuz Kaan Yüksel, Rodrigo Alvarez Lucendo, Nicolas Flammarion
Abstract: This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions with varying statistical significance. We demonstrate that transformers learn this task incrementally: each stage is defined by the acquisition of specific information through sparse attention patterns. Notably, we identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns. We model these dynamics using simplified differential equations that characterize the trajectory and prove stage-wise convergence results. Our analysis reveals that transformers ascend a complexity ladder by passing through simpler, misspecified hypothesis classes before reaching the full model class. We further show that early stopping acts as an implicit regularizer, biasing the model toward these simpler classes. These results provide a theoretical foundation for the emergence of staged learning and complex behaviors in transformers, offering insights into generalization for natural language processing and algorithmic reasoning.
Comment: Training Dynamics/Representation Learning: analyzes staged emergence of sparse attention patterns in transformers with differential equation modeling and convergence results.
Relevance: 9 Novelty: 8
6. Path-conditioned training: a principled way to rescale ReLU neural networks
ArXiv ID: 2602.19799
Authors: Arthur Lebeurrier, Titouan Vayer, Rémi Gribonval
Abstract: Despite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled sets of weights implement the same function, their training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion for rescaling neural network parameters, whose minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.
Comment: Model Architecture/Optimization Theory: path-conditioned rescaling of ReLU networks via path-lifting and kernel alignment; principled conditioning improving training.
Relevance: 9 Novelty: 8
7. Regularity of Second-Order Elliptic PDEs in Spectral Barron Spaces
ArXiv ID: 2602.19381
Authors: Ziang Chen, Liqiang Huang, Mengxuan Yang, Shengxuan Zhou
Abstract: We establish a regularity theorem for second-order elliptic PDEs on $\mathbb{R}^{d}$ in spectral Barron spaces. Under mild ellipticity and smallness assumptions, the solution gains two additional orders of Barron regularity. As a corollary, we identify a class of PDEs whose solutions can be approximated by two-layer neural networks with cosine activation functions, where the width of the neural network is independent of the spatial dimension.
Comment: Theory/Representation: proves Barron-space regularity gains for elliptic PDEs and dimension-independent two-layer cosine-network approximation.
Relevance: 9 Novelty: 8
8. Adaptation to Intrinsic Dependence in Diffusion Language Models
ArXiv ID: 2602.20126
Authors: Yunxiao Zhao, Changxiao Cai
Abstract: Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) approaches, enabling parallel token generation beyond a rigid left-to-right order. Despite growing empirical success, the theoretical understanding of how unmasking schedules -- which specify the order and size of unmasked tokens during sampling -- affect generation quality remains limited. In this work, we introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution, without requiring any prior knowledge or hyperparameter tuning. In contrast to prior deterministic procedures that fix unmasking sizes, our method randomizes the number of tokens revealed at each iteration. We show that, for two specific parameter choices, the sampling convergence guarantees -- measured by Kullback-Leibler (KL) divergence -- scale as $\widetilde O(\mathsf{TC}/K)$ and $\widetilde O(\mathsf{DTC}/K)$ respectively. Here, $K$ is the number of iterations, and $\mathsf{TC}$ and $\mathsf{DTC}$ are the total correlation and dual total correlation of the target distribution, capturing the intrinsic dependence structure underlying the data. Importantly, our guarantees hold in the practically relevant parallel-sampling regime $K<L$ where $L$ is the token sequence length. These results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions. Overall, our findings unveil the adaptivity of DLMs to intrinsic data structures and shed light on the benefit of randomized unmasking sizes in inference schedule design.
Comment: Model Architecture/Inference Efficiency: distribution-agnostic randomized unmasking schedules for diffusion language models with KL convergence scaling to total correlation.
Relevance: 9 Novelty: 8
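The two quantities in the convergence rates, total correlation (TC) and dual total correlation (DTC), have standard information-theoretic definitions: TC = sum_i H(X_i) - H(X) and DTC = H(X) - sum_i H(X_i | X_{-i}). The sketch below evaluates both on two made-up three-token distributions to show how they separate dependent from independent data; the helper names and toy distributions are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginalize a joint distribution onto the coordinate indices in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return dict(out)

def total_correlation(joint):
    """TC = sum_i H(X_i) - H(X)."""
    n = len(next(iter(joint)))
    return sum(entropy(marginal(joint, [i])) for i in range(n)) - entropy(joint)

def dual_total_correlation(joint):
    """DTC = H(X) - sum_i H(X_i | X_{-i}), using H(X_i | X_{-i}) = H(X) - H(X_{-i})."""
    n = len(next(iter(joint)))
    h_joint = entropy(joint)
    cond = 0.0
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        cond += h_joint - entropy(marginal(joint, rest))
    return h_joint - cond

# Three tokens that are always identical (strong dependence).
copied = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Three independent fair bits (no dependence).
independent = {(a, b, c): 0.125 for a in (0, 1) for b in (0, 1) for c in (0, 1)}

print(total_correlation(copied), dual_total_correlation(copied))
print(total_correlation(independent), dual_total_correlation(independent))
```

The fully copied sequence has a TC of 2 bits and a DTC of 1 bit, while the independent sequence has both equal to zero, so the two measures can differ on the same data and the better of the two rates depends on the distribution.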
9. RPU -- A Reasoning Processing Unit
ArXiv ID: 2602.18568
Authors: Matthew Adiletta, Gu-Yeon Wei, David Brooks
Abstract: Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for an optimized system architecture for scalable memory bandwidth. To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture featuring a bandwidth-first power and area provisioning design; and (3) a decoupled microarchitecture that separates memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3x lower latency and 18.6x higher throughput than an H100 system at ISO-TDP on Llama3-405B.
Comment: Matches High Performance Computing: chiplet-based, bandwidth-first architecture with decoupled pipelines to overcome memory-wall bottlenecks in LLM inference.
Relevance: 9 Novelty: 8
10. Celo2: Towards Learned Optimization Free Lunch
ArXiv ID: 2602.19142
Authors: Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky
Abstract: Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, the prior work VeLO scaled meta-training to 4,000 TPU months ($\sim$10$\times$ GPT-3 compute) to meta-train a general-purpose optimizer, but it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL 1.3B), which is six orders of magnitude larger than its meta-training distribution. Furthermore, it shows strong performance across diverse out-of-distribution tasks and is compatible with a modern optimization harness that includes orthogonalization, distinct update rules for input-output and hidden weights, and decoupled weight decay. In all, this work paves the way for practically applicable learnable optimization algorithms, unlocking exploration of richer meta-training and data curation recipes to further improve performance.
Comment: Matches Efficiency/Training Dynamics: simple normalized learned optimizer meta-trained with tiny compute, scaling out-of-distribution to billion-parameter pretraining.
Relevance: 9 Novelty: 8
11. IDLM: Inverse-distilled Diffusion Language Models
ArXiv ID: 2602.19066
Authors: David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, Alexander Korotin
Abstract: Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4x-64x, while preserving the teacher model's entropy and generative perplexity.
Comment: Matches Model Compression/Efficiency: inverse distillation reduces DLM sampling steps 4–64× with theoretical uniqueness and gradient-stable relaxations.
Relevance: 9 Novelty: 8
12. Attention Deficits in Language Models: Causal Explanations for Procedural Hallucinations
ArXiv ID: 2602.19239
Authors: Ahmed Karim, Fatima Sheaib, Zein Khamis, Maggie Chlon, Jad Awada, Leon Chlon
Abstract: Large language models can follow complex procedures yet fail at a seemingly trivial final step: reporting a value they themselves computed moments earlier. We study this phenomenon as procedural hallucination: failure to execute a verifiable, prompt-grounded specification even when the correct value is present in context. In long-context binding tasks with a known single-token candidate set, we find that many errors are readout-stage routing failures. Specifically, failures decompose into Stage 2A (gating) errors, where the model does not enter answer mode, and Stage 2B (binding) errors, where it enters answer mode but selects the wrong candidate (often due to recency bias). In the hard regime, Stage 2B accounts for most errors across model families in our tasks (Table 1). On Stage 2B error trials, a linear probe on the final-layer residual stream recovers the correct value far above chance (e.g., 74% vs. 2% on Qwen2.5-3B; Table 2), indicating that the answer is encoded but not used. We formalize "present but not used" via available vs. used mutual information and pseudo-prior interventions, yielding output-computable diagnostics and information-budget certificates. Finally, an oracle checkpointing intervention that restates the true binding near the query can nearly eliminate Stage 2B failures at long distance (e.g., Qwen2.5-3B $0/400 \rightarrow 399/400$ at $k = 1024$; Table 8).
Comment: Matches Representation Learning/Training Dynamics: causal analysis of LLM readout failures (gating vs. binding), with probes and mutual-information diagnostics.
Relevance: 9 Novelty: 8
13. Manifold-Aligned Generative Transport
ArXiv ID: 2602.19600
Authors: Xinyu Tian, Xiaotong Shen
Abstract: High-dimensional generative modeling is fundamentally a manifold-learning problem: real data concentrate near a low-dimensional structure embedded in the ambient space. Effective generators must therefore balance support fidelity -- placing probability mass near the data manifold -- with sampling efficiency. Diffusion models often capture near-manifold structure but require many iterative denoising steps and can leak off-support; normalizing flows sample in one pass but are limited by invertibility and dimension preservation. We propose MAGT (Manifold-Aligned Generative Transport), a flow-like generator that learns a one-shot, manifold-aligned transport from a low-dimensional base distribution to the data space. Training is performed at a fixed Gaussian smoothing level, where the score is well-defined and numerically stable. We approximate this fixed-level score using a finite set of latent anchor points with self-normalized importance sampling, yielding a tractable objective. MAGT samples in a single forward pass, concentrates probability near the learned support, and induces an intrinsic density with respect to the manifold volume measure, enabling principled likelihood evaluation for generated samples. We establish finite-sample Wasserstein bounds linking smoothing level and score-approximation accuracy to generative fidelity, and empirically improve fidelity and manifold concentration across synthetic and benchmark datasets while sampling substantially faster than diffusion models.
Comment: Model Architecture/Representation Learning: proposes a one-shot manifold-aligned generative transport with theoretical Wasserstein bounds.
Relevance: 9 Novelty: 8
14. Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations
ArXiv ID: 2602.19691
Authors: Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang
Abstract: Smooth activation functions are ubiquitous in modern deep learning, yet their theoretical advantages over non-smooth counterparts remain poorly understood. In this work, we characterize both approximation and statistical properties of neural networks with smooth activations over the Sobolev space $W^{s,\infty}([0,1]^d)$ for arbitrary smoothness $s>0$. We prove that constant-depth networks equipped with smooth activations automatically exploit arbitrarily high orders of target function smoothness, achieving the minimax-optimal approximation and estimation error rates (up to logarithmic factors). In sharp contrast, networks with non-smooth activations, such as ReLU, lack this adaptivity: their attainable approximation order is strictly limited by depth, and capturing higher-order smoothness requires proportional depth growth. These results identify activation smoothness as a fundamental mechanism, alternative to depth, for attaining statistical optimality. Technically, our results are established via a constructive approximation framework that produces explicit neural network approximators with carefully controlled parameter norms and model size. This complexity control ensures statistical learnability under empirical risk minimization (ERM) and removes the impractical sparsity constraints commonly required in prior analyses.
Comment: Representation/approximation theory: shows smooth activations enable depth-constant, minimax-optimal rates (smoothness adaptivity).
Relevance: 9 Novelty: 8
15. Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series
ArXiv ID: 2602.18473
Authors: Guoqi Yu, Juncheng Wang, Chen Yang, Jing Qin, Angelica I. Aviles-Rivero, Shujun Wang
Abstract: Accurate analysis of medical time series (MedTS) data, such as electroencephalography (EEG) and electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibit two critical patterns: temporal dependencies within individual channels and channel dependencies across multiple channels. While recent advances in deep learning have leveraged Transformer-based models to effectively capture temporal dependencies, they often struggle with modeling channel dependencies. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention mechanism is decentralized, making it less effective at capturing global synchronization and unified waveform patterns. To address this mismatch, we propose CoTAR (Core Token Aggregation-Redistribution), a centralized MLP-based module designed to replace decentralized attention. Instead of allowing all tokens to interact directly, as in standard attention, CoTAR introduces a global core token that serves as a proxy to facilitate inter-token interactions, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a 12.13% improvement on the APAVA dataset, while using only 33% of the memory and 20% of the inference time compared to the previous state of the art. Code and all training scripts are available at https://github.com/Levi-Ackman/TeCh.
Comment: Model Architecture and Efficiency: replaces attention with a centralized aggregation (CoTAR) achieving linear complexity and improved channel dependency modeling.
Relevance: 9 Novelty: 7
16. Scaling Laws for Precision in High-Dimensional Linear Regression
ArXiv ID: 2602.19241
Authors: Dechen Zhang, Xuan Tang, Yingyu Liang, Difan Zou
Abstract: Low-precision training is critical for optimizing the trade-off between model quality and training costs, necessitating the joint allocation of model size, dataset size, and numerical precision. While empirical scaling laws suggest that quantization impacts effective model and data capacities or acts as an additive error, the theoretical mechanisms governing these effects remain largely unexplored. In this work, we initiate a theoretical study of scaling laws for low-precision training within a high-dimensional sketched linear regression framework. By analyzing multiplicative (signal-dependent) and additive (signal-independent) quantization, we identify a critical dichotomy in their scaling behaviors. Our analysis reveals that while both schemes introduce an additive error and degrade the effective data size, they exhibit distinct effects on effective model size: multiplicative quantization maintains the full-precision model size, whereas additive quantization reduces the effective model size. Numerical experiments validate our theoretical findings. By rigorously characterizing the complex interplay among model scale, dataset size, and quantization error, our work provides a principled theoretical basis for optimizing training protocols under practical hardware constraints.
Comment: Model Compression and Efficiency: provides theoretical scaling laws for low-precision (quantized) training, linking precision to effective model/data size.
Relevance: 9 Novelty: 7
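The distinction between multiplicative (signal-dependent) and additive (signal-independent) quantization can be illustrated with a toy noise model. This is a sketch under stated assumptions: the uniform noise distributions, the function names, and the parameters `step` and `rel` are illustrative choices, not the error model analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_quantize(w, step, rng):
    """Signal-independent error: uniform in [-step/2, step/2] for every entry."""
    return w + rng.uniform(-step / 2, step / 2, size=w.shape)

def multiplicative_quantize(w, rel, rng):
    """Signal-dependent error: relative perturbation uniform in [-rel, rel]."""
    return w * (1.0 + rng.uniform(-rel, rel, size=w.shape))

w = np.array([0.0, 0.01, 1.0, 100.0])
add_err = np.abs(additive_quantize(w, 0.1, rng) - w)
mul_err = np.abs(multiplicative_quantize(w, 0.05, rng) - w)
```

In this toy model the additive error has the same scale for every entry, while the multiplicative error shrinks to zero with the entry itself.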
17. Bayesian Lottery Ticket Hypothesis
ArXiv ID: 2602.18825
Authors: Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus G\"otz, Charlotte Debus
Abstract: Bayesian neural networks (BNNs) are a useful tool for uncertainty quantification, but require substantially more computational resources than conventional neural networks. For non-Bayesian networks, the Lottery Ticket Hypothesis (LTH) posits the existence of sparse subnetworks that can be trained to match or even surpass the accuracy of the original dense network. Such sparse networks can lower the demand for computational resources during both training and inference. Establishing that the LTH holds in BNNs, and that corresponding sparse subnetworks exist, could motivate the development of sparse training algorithms and provide valuable insights into the underlying training process. Towards this end, we translate the LTH experiments to a Bayesian setting using common computer vision models. We investigate the defining characteristics of Bayesian lottery tickets, and extend our study towards a transplantation method connecting BNNs with deterministic Lottery Tickets. We generally find that the LTH holds in BNNs, and winning tickets of matching and surpassing accuracy are present independent of model size, with degradation at very high sparsities. However, the pruning strategy should rely primarily on magnitude and only secondarily on standard deviation. Furthermore, our results demonstrate that models rely on mask structure and weight initialization to varying degrees.
Comment: Matches Sparsity/Pruning: extends the Lottery Ticket Hypothesis to Bayesian NNs and analyzes effective pruning criteria for BNNs.
Relevance: 9 Novelty: 7
18. DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
ArXiv ID: 2602.18846
Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Abstract: Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline -- achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
Comment: Compression/Efficiency: dual-stage token reduction (vision-side compression + text-guided pruning) for VLM training/inference.
Relevance: 9 Novelty: 7
19. Training-Free Generative Modeling via Kernelized Stochastic Interpolants
ArXiv ID: 2602.20070
Authors: Florentin Coeurdoux, Etienne Lempereur, Nathana\"el Cuvelle-Magar, Thomas Eboli, St\'ephane Mallat, Anastasia Borovykh, Eric Vanden-Eijnden
Abstract: We develop a kernel method for generative modeling within the stochastic interpolant framework, replacing neural network training with linear systems. The drift of the generative SDE is $\hat b_t(x) = \nabla\phi(x)^\top\eta_t$, where $\eta_t\in\mathbb{R}^P$ solves a $P\times P$ system computable from data, with $P$ independent of the data dimension $d$. Since estimates are inexact, the diffusion coefficient $D_t$ affects sample quality; the optimal $D_t^*$ from Girsanov diverges at $t=0$, but this poses no difficulty and we develop an integrator that handles it seamlessly. The framework accommodates diverse feature maps -- scattering transforms, pretrained generative models, etc. -- enabling training-free generation and model combination. We demonstrate the approach on financial time series, turbulence, and image generation.
Comment: Model Architecture/Efficiency: training-free generative modeling via kernelized stochastic interpolants, replacing neural training with linear systems and specialized integrators.
Relevance: 8 Novelty: 8
20. On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference
ArXiv ID: 2602.19964
Authors: Moritz A. Zanger, Yijun Wu, Pascal R. Van der Vaart, Wendelin B\"ohmer, Matthijs T. J. Spaan
Abstract: Uncertainty quantification is central to safe and efficient deployments of deep learning models, yet many computationally practical methods lack rigorous theoretical motivation. Random network distillation (RND) is a lightweight technique that measures novelty via prediction errors against a fixed random target. While empirically effective, it has remained unclear what uncertainties RND measures and how its estimates relate to other approaches, e.g. Bayesian inference or deep ensembles. This paper establishes these missing theoretical connections by analyzing RND within the neural tangent kernel framework in the limit of infinite network width. Our analysis reveals two central findings in this limit: (1) The uncertainty signal from RND -- its squared self-predictive error -- is equivalent to the predictive variance of a deep ensemble. (2) By constructing a specific RND target function, we show that the RND error distribution can be made to mirror the centered posterior predictive distribution of Bayesian inference with wide neural networks. Based on this equivalence, we moreover devise a posterior sampling algorithm that generates i.i.d. samples from an exact Bayesian posterior predictive distribution using this modified \textit{Bayesian RND} model. Collectively, our findings provide a unified theoretical perspective that places RND within the principled frameworks of deep ensembles and Bayesian inference, and offer new avenues for efficient yet theoretically grounded uncertainty quantification methods.
Comment: Representation Learning/Uncertainty: establishes equivalence between RND, deep ensembles, and Bayesian inference in the NTK limit, providing a principled theoretical link.
Relevance: 8 Novelty: 8
21. K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model
ArXiv ID: 2602.19128
Authors: Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, Ion Stoica
Abstract: Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic-guided evolutionary loops. These methods often struggle with complex kernels requiring coordinated, multi-step structural transformations, as they lack explicit planning capabilities and frequently discard promising strategies due to inefficient or incorrect intermediate implementations. To address this, we propose Search via Co-Evolving World Model and build K-Search based on this method. By replacing static search heuristics with a co-evolving world model, our framework leverages LLMs' prior domain knowledge to guide the search, actively exploring the optimization space. This approach explicitly decouples high-level algorithmic planning from low-level program instantiation, enabling the system to navigate non-monotonic optimization paths while remaining resilient to temporary implementation defects. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels. Our results show that K-Search significantly outperforms state-of-the-art evolutionary search methods, achieving an average 2.10x improvement and up to a 14.3x gain on complex MoE kernels. On the GPUMode TriMul task, K-Search achieves state-of-the-art performance on H100, reaching 1030us and surpassing both prior evolution and human-designed solutions.
Comment: High Performance Computing: co-evolving world model guides LLM-based search for GPU kernel optimization, yielding large speedups (incl. MoE kernels).
Relevance: 8 Novelty: 8
22. Training-Free Cross-Architecture Merging for Graph Neural Networks
ArXiv ID: 2602.19332
Authors: Rishabh Bhattacharya, Vikaskumar Kalsariya, Naresh Manwani
Abstract: Model merging has emerged as a powerful paradigm for combining the capabilities of distinct expert models without the high computational cost of retraining, yet current methods are fundamentally constrained to homogeneous architectures. For GNNs, however, message passing is topology-dependent and sensitive to misalignment, making direct parameter-space merging unreliable. To bridge this gap, we introduce H-GRAMA (Heterogeneous Graph Routing and Message Alignment), a training-free framework that lifts merging from parameter space to operator space. We formalize Universal Message Passing Mixture (UMPM), a shared operator family that expresses heterogeneous GNN layers in a common functional language. H-GRAMA enables cross-architecture GNN merging (e.g., GCN to GAT) without retraining, retaining high specialist accuracy in most cases in compatible depth settings and achieving inference speedups of 1.2x to 1.9x over ensembles.
Comment: Model Architecture and Efficiency: training-free cross-architecture GNN merging via a shared operator family (UMPM) and message alignment, avoiding retraining.
Relevance: 8 Novelty: 8
23. A Markovian View of Iterative-Feedback Loops in Image Generative Models: Neural Resonance and Model Collapse
ArXiv ID: 2602.19033
Authors: Vibhas Kumar Vats, David J. Crandall, Samuel Goree
Abstract: AI training datasets will inevitably contain AI-generated examples, leading to "feedback" in which the output of one model impacts the training of another. It is known that such iterative feedback can lead to model collapse, yet the mechanisms underlying this degeneration remain poorly understood. Here we show that a broad class of feedback processes converges to a low-dimensional invariant structure in latent space, a phenomenon we call neural resonance. By modeling iterative feedback as a Markov chain, we show that two conditions are needed for this resonance to occur: ergodicity of the feedback process and directional contraction of the latent representation. By studying diffusion models on MNIST and ImageNet, as well as CycleGAN and an audio feedback experiment, we map how local and global manifold geometry evolve, and we introduce an eight-pattern taxonomy of collapse behaviors. Neural resonance provides a unified explanation for long-term degenerate behavior in generative models and provides practical diagnostics for identifying, characterizing, and eventually mitigating collapse.
Comment: Training Dynamics/Theory: Markov-chain view of iterative feedback in generative models, explaining collapse via neural resonance with diagnostic taxonomy.
Relevance: 8 Novelty: 8
24. Online Realizable Regression and Applications for ReLU Networks
ArXiv ID: 2602.19172
Authors: Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel
Abstract: Realizable online regression can behave very differently from online classification. Even without any margin or stochastic assumptions, realizability may enforce horizon-free (finite) cumulative loss under metric-like losses, even when the analogous classification problem has an infinite mistake bound. We study realizable online regression in the adversarial model under losses that satisfy an approximate triangle inequality (approximate pseudo-metrics). Recent work of Attias et al. shows that the minimax realizable cumulative loss is characterized by the scaled Littlestone/online dimension $\mathbb{D}_{\mathrm{onl}}$, but this quantity can be difficult to analyze. Our main contribution is a generic potential method that upper bounds $\mathbb{D}_{\mathrm{onl}}$ by a concrete Dudley-type entropy integral that depends only on covering numbers of the hypothesis class under the induced sup pseudo-metric. We define an \emph{entropy potential} $\Phi(\mathcal{H})=\int_{0}^{\mathrm{diam}(\mathcal{H})} \log N(\mathcal{H},\varepsilon)\,d\varepsilon$, where $N(\mathcal{H},\varepsilon)$ is the $\varepsilon$-covering number of $\mathcal{H}$, and show that for every $c$-approximate pseudo-metric loss, $\mathbb{D}_{\mathrm{onl}}(\mathcal{H})\le O(c)\,\Phi(\mathcal{H})$. In particular, polynomial metric entropy implies $\Phi(\mathcal{H})$ [...] $d$, otherwise infinite), and for bounded-norm $k$-ReLU networks separate regression (finite loss, even $\widetilde O(k^2)$, and $O(1)$ for one ReLU) from classification (impossible already for $k=2,d=1$).
Comment: Theory/Training Dynamics: bounds for realizable online regression under approximate metric losses with applications to bounded-norm ReLU networks.
Relevance: 8 Novelty: 8
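To see how the entropy potential behaves, consider a worked case that is an illustrative assumption, not a result quoted from the paper: a class with $\mathrm{diam}(\mathcal{H})=1$ whose covering numbers satisfy $\log N(\mathcal{H},\varepsilon)=d\log(1/\varepsilon)$, as is typical of a $d$-parameter family. Then

$$\Phi(\mathcal{H})=\int_{0}^{1} d\,\log(1/\varepsilon)\,d\varepsilon = d\,\bigl[\varepsilon-\varepsilon\log\varepsilon\bigr]_{0}^{1} = d,$$

so the stated bound $\mathbb{D}_{\mathrm{onl}}(\mathcal{H})\le O(c)\,\Phi(\mathcal{H})$ would give $\mathbb{D}_{\mathrm{onl}}(\mathcal{H})\le O(c\,d)$ under this assumption. The integrand blows up as $\varepsilon\to 0$, but only logarithmically, so the integral stays finite.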
25. Implicit Bias and Convergence of Matrix Stochastic Mirror Descent
ArXiv ID: 2602.18997
Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi
Abstract: We investigate Stochastic Mirror Descent (SMD) with matrix parameters and vector-valued predictions, a framework relevant to multi-class classification and matrix completion problems. Focusing on the overparameterized regime, where the total number of parameters exceeds the number of training samples, we prove that SMD with matrix mirror functions $\psi(\cdot)$ converges exponentially to a global interpolator. Furthermore, we generalize classical implicit bias results of vector SMD by demonstrating that the matrix SMD algorithm converges to the unique solution minimizing the Bregman divergence induced by $\psi(\cdot)$ from initialization subject to interpolating the data. These findings reveal how matrix mirror maps dictate inductive bias in high-dimensional, multi-output problems.
Comment: Training Dynamics/Implicit Bias: proves convergence and implicit bias for matrix-valued stochastic mirror descent, extending classic results to multi-output settings.
Relevance: 8 Novelty: 8
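The implicit-bias statement can be checked numerically in the simplest special case. The sketch below assumes the mirror function $\psi(W)=\tfrac12\|W\|_F^2$, for which the mirror map is the identity and SMD reduces to plain SGD; the function name, the normalized step size, and the problem sizes are illustrative choices, not the paper's setup. In this case the Bregman divergence is $\tfrac12\|W-W_0\|_F^2$, so starting from $W_0=0$ the limit should be the minimum-Frobenius-norm interpolator.

```python
import numpy as np

def matrix_smd(X, Y, W0, grad_psi, grad_psi_inv, steps, rng):
    """Stochastic mirror descent on 0.5 * ||x_i^T W - y_i||^2 with matrix W."""
    W = W0.copy()
    for _ in range(steps):
        i = rng.integers(X.shape[0])          # sample one training example
        x, y = X[i], Y[i]
        grad = np.outer(x, x @ W - y)         # gradient of the sampled loss
        lr = 1.0 / (x @ x)                    # normalized step (an assumption)
        # Mirror step: move in the dual space, then map back to the primal.
        W = grad_psi_inv(grad_psi(W) - lr * grad)
    return W

rng = np.random.default_rng(0)
n, d, k = 5, 20, 3                            # overparameterized: d*k > n*k
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, k))

identity = lambda M: M                        # mirror map of 0.5 * ||W||_F^2
W = matrix_smd(X, Y, np.zeros((d, k)), identity, identity, steps=5000, rng=rng)
W_min_norm = np.linalg.pinv(X) @ Y            # minimum-norm interpolator
```

Other mirror functions would be plugged in through `grad_psi` and `grad_psi_inv`, and would generally need a different step size.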
26. I Dropped a Neural Net
ArXiv ID: 2602.19845
Authors: Hyunwoo Park
Abstract: A recent Dwarkesh Patel podcast with John Collison and Elon Musk featured an interesting puzzle from Jane Street: they trained a neural net, shuffled all 96 layers, and asked to put them back in order. Given unlabelled layers of a Residual Network and its training dataset, we recover the exact ordering of the layers. The problem decomposes into pairing each block's input and output projections ($48!$ possibilities) and ordering the reassembled blocks ($48!$ possibilities), for a combined search space of $(48!)^2 \approx 10^{122}$, which is more than the atoms in the observable universe. We show that stability conditions during training like dynamic isometry leave the product $W_{\text{out}} W_{\text{in}}$ for correctly paired layers with a negative diagonal structure, allowing us to use diagonal dominance ratio as a signal for pairing. For ordering, we seed-initialize with a rough proxy such as delta-norm or $\|W_{\text{out}}\|_F$ then hill-climb to zero mean squared error.
Comment: Matches Representation/Training Dynamics: reconstructs exact layer order of a shuffled ResNet via dynamic-isometry-driven signals, offering structural insights.
Relevance: 8 Novelty: 8
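The search-space size and the pairing signal can be made concrete with a short sketch. The arithmetic is exact; the `diagonal_dominance_ratio` definition and the synthetic weights are assumptions for illustration, since the abstract does not give the paper's exact formula.

```python
import math
import numpy as np

# Pairing (48!) times ordering (48!) gives the combined search space.
search_space = math.factorial(48) ** 2
exponent = math.floor(math.log10(search_space))   # order of magnitude

def diagonal_dominance_ratio(w_out, w_in):
    """Hypothetical score: |diagonal| mass over |off-diagonal| mass of W_out W_in."""
    product = w_out @ w_in
    diag = np.abs(np.diag(product)).sum()
    off_diag = np.abs(product).sum() - diag
    return diag / (off_diag + 1e-12)

rng = np.random.default_rng(0)
hidden, width = 64, 16
w_in = rng.normal(size=(hidden, width))
w_out_matched = -w_in.T                            # toy pair with negative diagonal
w_out_other = rng.normal(size=(width, hidden))     # unrelated projection

matched = diagonal_dominance_ratio(w_out_matched, w_in)
mismatched = diagonal_dominance_ratio(w_out_other, w_in)
```

In this toy setup the matched pair scores far higher than the unrelated one, which is the kind of separation a pairing heuristic needs.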
27. A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning
ArXiv ID: 2602.20062
Authors: Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, Clementine Domine
Abstract: Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.
Comment: Representation learning/training dynamics: analytical theory linking pretraining initialization to feature reuse/refinement in fine-tuning.
Relevance: 8 Novelty: 8
28. Behavior Learning (BL): Learning Hierarchical Optimization Structures from Data
ArXiv ID: 2602.20152
Authors: Zhenyao Ma, Yue Liang, Dongxu Li
Abstract: Inspired by behavioral science, we propose Behavior Learning (BL), a novel general-purpose machine learning framework that learns interpretable and identifiable optimization structures from data, ranging from single optimization problems to hierarchical compositions. It unifies predictive performance, intrinsic interpretability, and identifiability, with broad applicability to scientific domains involving optimization. BL parameterizes a compositional utility function built from intrinsically interpretable modular blocks, which induces a data distribution for prediction and generation. Each block represents and can be written in symbolic form as a utility maximization problem (UMP), a foundational paradigm in behavioral science and a universal framework of optimization. BL supports architectures ranging from a single UMP to hierarchical compositions, the latter modeling hierarchical optimization structures. Its smooth and monotone variant (IBL) guarantees identifiability. Theoretically, we establish the universal approximation property of BL, and analyze the M-estimation properties of IBL. Empirically, BL demonstrates strong predictive performance, intrinsic interpretability and scalability to high-dimensional data. Code: https://github.com/MoonYLiang/Behavior-Learning ; install via pip install blnetwork.
Comment: Model Architecture: proposes interpretable, identifiable networks as hierarchical compositions of utility-maximization blocks with theory.
Relevance: 8 Novelty: 8
29. A Computationally Efficient Multidimensional Vision Transformer
ArXiv ID: 2602.19982
Authors: Alaa El Ichi, Khalide Jbilou
Abstract: Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs. In this paper, we introduce a novel tensor-based framework for Vision Transformers built upon the Tensor Cosine Product (Cproduct). By exploiting multilinear structures inherent in image data and the orthogonality of cosine transforms, the proposed approach enables efficient attention mechanisms and structured feature representations. We develop the theoretical foundations of the tensor cosine product, analyze its algebraic properties, and integrate it into a new Cproduct-based Vision Transformer architecture (TCP-ViT). Numerical experiments on standard classification and segmentation benchmarks demonstrate that the proposed method achieves a uniform 1/C parameter reduction (where C is the number of channels) while maintaining competitive accuracy.
Comment: Model Architecture and Efficiency: introduces a tensor cosine product (Cproduct) ViT with multilinear structure and 1/C parameter reduction enabling efficient attention.
Relevance: 8 Novelty: 7
30. Grokking Finite-Dimensional Algebra
ArXiv ID: 2602.19533
Authors: Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau
Abstract: This paper investigates the grokking phenomenon, which refers to the sudden transition from a long memorization to generalization observed during neural networks training, in the context of learning multiplication in finite-dimensional algebras (FDA). While prior work on grokking has focused mainly on group operations, we extend the analysis to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras. We show that learning group operations is a special case of learning FDA, and that learning multiplication in FDA amounts to learning a bilinear product specified by the algebra's structure tensor. For algebras over the reals, we connect the learning problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields, we show that grokking emerges naturally as models must learn discrete representations of algebraic elements. This leads us to experimentally investigate the following core questions: (i) how do algebraic properties such as commutativity, associativity, and unitality influence both the emergence and timing of grokking, (ii) how structural properties of the structure tensor of the FDA, such as sparsity and rank, influence generalization, and (iii) to what extent generalization correlates with the model learning latent embeddings aligned with the algebra's representation. Our work provides a unified framework for grokking across algebraic structures and new insights into how mathematical structure governs neural network generalization dynamics.
Comment: Representation Learning and Training Dynamics: studies grokking across algebraic structures, linking generalization to structure tensor rank/sparsity and implicit low-rank bias.
Relevance: 8 Novelty: 7
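The claim that multiplication in a finite-dimensional algebra is a bilinear product specified by a structure tensor can be shown with a small example. The sketch below uses the complex numbers as a two-dimensional algebra over the reals; the index convention `T[i, j, k]` (coefficient of basis element k in the product of basis elements i and j) and the function name are illustrative choices.

```python
import numpy as np

# Structure tensor of the complex numbers with basis (1, i).
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0    # 1 * 1 = 1
T[0, 1, 1] = 1.0    # 1 * i = i
T[1, 0, 1] = 1.0    # i * 1 = i
T[1, 1, 0] = -1.0   # i * i = -1

def algebra_product(a, b, structure):
    """Bilinear product: c_k = sum over i, j of a_i * b_j * structure[i, j, k]."""
    return np.einsum("i,j,ijk->k", a, b, structure)

# (1 + 2i) * (3 + 4i) = -5 + 10i
z = algebra_product(np.array([1.0, 2.0]), np.array([3.0, 4.0]), T)
```

Swapping in a different tensor changes the algebra (for example a non-commutative or non-associative one) without changing `algebra_product`.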
31. Understanding the Curse of Unrolling
ArXiv ID: 2602.19733
Authors: Sheheryar Mehmood, Florian Knoll, Peter Ochs
Abstract: Algorithm unrolling is ubiquitous in machine learning, particularly in hyperparameter optimization and meta-learning, where Jacobians of solution mappings are computed by differentiating through iterative algorithms. Although unrolling is known to yield asymptotically correct Jacobians under suitable conditions, recent work has shown that the derivative iterates may initially diverge from the true Jacobian, a phenomenon known as the curse of unrolling. In this work, we provide a non-asymptotic analysis that explains the origin of this behavior and identifies the algorithmic factors that govern it. We show that truncating early iterations of the derivative computation mitigates the curse while simultaneously reducing memory requirements. Finally, we demonstrate that warm-starting in bilevel optimization naturally induces an implicit form of truncation, providing a practical remedy. Our theoretical findings are supported by numerical experiments on representative examples.
Comment: Representation Learning (training dynamics): non-asymptotic analysis of algorithm unrolling explains divergence and proposes truncation to stabilize and reduce memory.
Relevance: 8 Novelty: 7
32. Partial Soft-Matching Distance for Neural Representational Comparison with Partial Unit Correspondence
ArXiv ID: 2602.19331
Authors: Chaitanya Kapoor, Alex H. Williams, Meenakshi Khosla
Abstract: Representational similarity metrics typically force all units to be matched, making them susceptible to noise and outliers common in neural representations. We extend the soft-matching distance to a partial optimal transport setting that allows some neurons to remain unmatched, yielding rotation-sensitive but robust correspondences. This partial soft-matching distance provides theoretical advantages -- relaxing strict mass conservation while maintaining interpretable transport costs -- and practical benefits through efficient neuron ranking in terms of cross-network alignment without costly iterative recomputation. In simulations, it preserves correct matches under outliers and reliably selects the correct model in noise-corrupted identification tasks. On fMRI data, it automatically excludes low-reliability voxels and produces voxel rankings by alignment quality that closely match computationally expensive brute-force approaches. It achieves higher alignment precision across homologous brain areas than standard soft-matching, which is forced to match all units regardless of quality. In deep networks, highly matched units exhibit similar maximally exciting images, while unmatched units show divergent patterns. This ability to partition by match quality enables focused analyses, e.g., testing whether networks have privileged axes even within their most aligned subpopulations. Overall, partial soft-matching provides a principled and practical method for representational comparison under partial correspondence.
Comment: Representation Learning: introduces a partial optimal transport-based soft-matching distance for neural representational comparison with theory and efficient ranking.
Relevance: 8 Novelty: 7
33. Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
ArXiv ID: 2602.19396
Authors: Amirhossein Farzam, Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro
Abstract: Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker tries to hide the malicious goal of their request by manipulating its framing to induce compliance. Because these attacks maintain malicious intent through a flexible presentation, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We instantiate the framework for goal and framing and construct GoalFrameBench, a corpus of prompts with controlled goal and framing variations, which we use to train Representation Disentanglement on Activations (ReDAct) module to extract disentangled representations in a frozen LLM. We then propose FrameShield, an anomaly detector operating on the framing representations, which improves model-agnostic detection across multiple LLM families with minimal computational overhead. Theoretical guarantees for ReDAct and extensive empirical validations show that its disentanglement effectively powers FrameShield. Finally, we use disentanglement as an interpretability probe, revealing distinct profiles for goal and framing signals and positioning semantic disentanglement as a building block for both LLM safety and mechanistic interpretability.
Comment: Representation Learning/Interpretability: self-supervised disentanglement of goal vs. framing factors in LLM activations with theoretical guarantees and efficient detection.
Relevance: 8 Novelty: 7
34. Transformers for dynamical systems learn transfer operators in-context
ArXiv ID: 2602.18679
Authors: Anthony Bao, Jeffrey Lai, William Gilpin
Abstract: Large-scale foundation models for scientific machine learning adapt to physical settings unseen during training, such as zero-shot transfer between turbulent scales. This phenomenon, in-context learning, challenges conventional understanding of learning and adaptation in physical systems. Here, we study in-context learning of dynamical systems in a minimal setting: we train a small two-layer, single-head transformer to forecast one dynamical system, and then evaluate its ability to forecast a different dynamical system without retraining. We discover an early tradeoff in training between in-distribution and out-of-distribution performance, which manifests as a secondary double descent phenomenon. We discover that attention-based models apply a transfer-operator forecasting strategy in-context. They (1) lift low-dimensional time series using delay embedding, to detect the system's higher-dimensional dynamical manifold, and (2) identify and forecast long-lived invariant sets that characterize the global flow on this manifold. Our results clarify the mechanism enabling large pretrained models to forecast unseen physical systems at test without retraining, and they illustrate the unique ability of attention-based models to leverage global attractor information in service of short-term forecasts.
Comment: Representation Learning/Architecture: elucidates in-context learning in transformers as transfer-operator forecasting with discovery of double-descent tradeoffs.
Relevance: 8 Novelty: 7
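Delay embedding, the lifting step the abstract attributes to the transformer, has a standard explicit form that a short sketch can show. The function name and the parameters `dim` and `lag` are illustrative; this is the classical construction, not code recovered from the trained model.

```python
import numpy as np

def delay_embed(series, dim, lag=1):
    """Stack lagged copies of a 1-D series into rows of length `dim`.

    Row t is (x[t], x[t + lag], ..., x[t + (dim - 1) * lag]).
    """
    series = np.asarray(series)
    n_rows = len(series) - (dim - 1) * lag
    if n_rows <= 0:
        raise ValueError("series is too short for this dim and lag")
    columns = [series[i * lag : i * lag + n_rows] for i in range(dim)]
    return np.stack(columns, axis=1)

embedded = delay_embed(np.arange(6), dim=3, lag=2)
# rows: [0, 2, 4] and [1, 3, 5]
```

Each row is a point in a higher-dimensional space built from one scalar observable, which is how a low-dimensional time series can expose the geometry of the underlying attractor.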
35. Spilled Energy in Large Language Models
ArXiv ID: 2602.18671
Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi
Abstract: We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.
Comment: Representation Learning/Architecture Analysis: reinterprets LLM softmax as EBM and proposes training-free energy metrics for hallucination detection from logits.
Relevance: 8 Novelty: 7
36. Information-Guided Noise Allocation for Efficient Diffusion Training
ArXiv ID: 2602.18647
Authors: Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh-Hsin Lai, Yuki Mitsufuji, Luca Ambrogioni
Abstract: Training diffusion models typically relies on manually tuned noise schedules, which can waste computation on weakly informative noise regions and limit transfer across datasets, resolutions, and representations. We revisit noise schedule allocation through an information-theoretic lens and propose the conditional entropy rate of the forward process as a theoretically grounded, data-dependent diagnostic for identifying suboptimal noise-level allocation in existing schedules. Based on this insight, we introduce InfoNoise, a principled data-adaptive training noise schedule that replaces heuristic schedule design with an information-guided noise sampling distribution derived from entropy-reduction rates estimated from denoising losses already computed during training. Across natural-image benchmarks, InfoNoise matches or surpasses tuned EDM-style schedules, in some cases with a substantial training speedup (about $1.4\times$ on CIFAR-10). On discrete datasets, where standard image-tuned schedules exhibit significant mismatch, it reaches superior quality in up to $3\times$ fewer training steps. Overall, InfoNoise makes noise scheduling data-adaptive, reducing the need for per-dataset schedule design as diffusion models expand across domains.
Comment: Model Efficiency: information-guided, data-adaptive noise scheduling for diffusion training that reallocates compute to informative noise regions.
Relevance: 8 Novelty: 7
37. Relational Feature Caching for Accelerating Diffusion Transformers
ArXiv ID: 2602.19506
Authors: Byunggwan Son, Jeimin Jeon, Jeongwoo Choi, Bumsub Ham
Abstract: Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps, and exploiting them for subsequent steps to reduce redundant computations. Recent forecasting-based caching approaches employ temporal extrapolation techniques to approximate the output features with cached ones. Although effective, relying exclusively on temporal extrapolation still suffers from significant prediction errors, leading to performance degradation. Through a detailed analysis, we find that 1) these errors stem from the irregular magnitude of changes in the output features, and 2) an input feature of a module is strongly correlated with the corresponding output. Based on this, we propose relational feature caching (RFC), a novel framework that leverages the input-output relationship to enhance the accuracy of the feature prediction. Specifically, we introduce relational feature estimation (RFE) to estimate the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions. We also present relational cache scheduling (RCS), which estimates the prediction errors using the input features and performs full computations only when the errors are expected to be substantial. Extensive experiments across various DiT models demonstrate that RFC consistently outperforms prior approaches significantly. Project page is available at https://cvlab.yonsei.ac.kr/projects/RFC
Comment: Matches Model Compression/Efficiency: relational feature caching and error-aware cache scheduling accelerate Diffusion Transformers by reducing redundant compute.
Relevance: 8 Novelty: 7
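The forecasting idea in this abstract can be sketched as follows. `linear_forecast` is a generic first-order temporal extrapolation from two cached outputs. `input_change_ratio` is a hypothetical way to rescale the predicted change using the module inputs, added here only to illustrate the input-output relationship the abstract describes; it is not the paper's RFE estimator.

```python
def linear_forecast(cached_prev, cached_last, scale=1.0):
    """Predict next-step output features by extrapolating the last cached change."""
    return [b + scale * (b - a) for a, b in zip(cached_prev, cached_last)]

def input_change_ratio(x_prev, x_last, x_now):
    """Hypothetical scale: size of the current input change relative to the previous one."""
    def l2(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5
    prev_step = l2(x_last, x_prev)
    return l2(x_now, x_last) / prev_step if prev_step > 0 else 1.0

# Cached module outputs at two earlier timesteps (toy values).
out_prev, out_last = [1.0, 2.0], [1.5, 2.5]
# Module inputs at the same two timesteps and at the current one.
x_prev, x_last, x_now = [0.0], [1.0], [1.5]

plain = linear_forecast(out_prev, out_last)
scaled = linear_forecast(out_prev, out_last, input_change_ratio(x_prev, x_last, x_now))
print(plain, scaled)
```

In this toy case the input moves half as far as in the previous step, so the scaled forecast moves the output half as far as plain extrapolation would.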
38. Insertion Based Sequence Generation with Learnable Order Dynamics
ArXiv ID: 2602.18695
Authors: Dhruvesh Patel, Benjamin Rozonoyer, Gaurav Pandey, Tahira Naseem, Ramón Fernandez Astudillo, Andrew McCallum
Abstract: In many domains, generating variable-length sequences through insertions provides greater flexibility than autoregressive models. However, the action space of insertion models is much larger than that of autoregressive models (ARMs), making learning challenging. To address this, we incorporate trainable order dynamics into the target rates for discrete flow matching, and show that with suitable choices of parameterizations, joint training of the target order dynamics and the generator is tractable without the need for numerical simulation. As the generative insertion model, we use a variable-length masked diffusion model, which generates by inserting and filling mask tokens. On graph traversal tasks for which a locally optimal insertion order is known, we explore the choices of parameterization empirically and demonstrate the trade-offs between flexibility, training stability and generation quality. On de novo small molecule generation, we find that the learned order dynamics lead to an increase in the number of valid molecules generated and improved quality, when compared to uniform order dynamics.
Comment: Matches Model Architecture: introduces learnable order dynamics for insertion-based masked diffusion via discrete flow matching.
Relevance: 8 Novelty: 7
39. Laplacian Multi-scale Flow Matching for Generative Modeling
ArXiv ID: 2602.19461
Authors: Zelin Zhao, Petr Molodyk, Haotian Xue, Yongxin Chen
Abstract: In this paper, we present Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling. Our approach decomposes images into Laplacian pyramid residuals and processes different scales in parallel through a mixture-of-transformers (MoT) architecture with causal attention mechanisms. Unlike previous cascaded approaches that require explicit renoising between scales, our model generates multi-scale representations in parallel, eliminating the need for bridging processes. The proposed multi-scale architecture not only improves generation quality but also accelerates the sampling process and promotes scaling flow matching methods. Through extensive experimentation on CelebA-HQ and ImageNet, we demonstrate that our method achieves superior sample quality with fewer GFLOPs and faster inference compared to single-scale and multi-scale flow matching baselines. The proposed model scales effectively to high-resolution generation (up to 1024$\times$1024) while maintaining lower computational overhead.
Comment: Matches Model Architecture/Efficiency: Laplacian multi-scale flow matching with parallel mixture-of-transformers and causal attention for faster, high-quality generation.
Relevance: 8 Novelty: 7
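For readers unfamiliar with Laplacian pyramid residuals, here is a minimal 1-D sketch of the decomposition. It assumes a box filter (pairwise averaging) and nearest-neighbour upsampling in place of the Gaussian filtering used in classical image pyramids, and a signal length divisible by 2 at every level. It shows only the decomposition, not the LapFlow model.

```python
def downsample(x):
    """Halve the resolution by averaging neighbouring pairs (box filter)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

def upsample(x):
    """Double the resolution by repeating each sample."""
    return [v for v in x for _ in range(2)]

def build_pyramid(x, levels):
    """Return per-scale residuals (fine to coarse) plus the coarsest signal."""
    residuals, current = [], list(x)
    for _ in range(levels):
        coarse = downsample(current)
        residuals.append([a - b for a, b in zip(current, upsample(coarse))])
        current = coarse
    return residuals, current

def reconstruct(residuals, coarsest):
    """Invert build_pyramid by adding residuals back from coarse to fine."""
    current = coarsest
    for res in reversed(residuals):
        current = [u + r for u, r in zip(upsample(current), res)]
    return current

signal = [1.0, 3.0, 2.0, 6.0, 4.0, 4.0, 0.0, 8.0]
residuals, coarsest = build_pyramid(signal, levels=2)
print(residuals, coarsest)
print(reconstruct(residuals, coarsest))
```

Each residual holds only the detail lost at its scale, so the coarsest signal plus all residuals recovers the input exactly.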
40. Dirichlet Scale Mixture Priors for Bayesian Neural Networks
ArXiv ID: 2602.19859
Authors: August Arnstad, Leiv Rønneberg, Geir Storvik
Abstract: Neural networks are the cornerstone of modern machine learning, yet can be difficult to interpret, give overconfident predictions and are vulnerable to adversarial attacks. Bayesian neural networks (BNNs) provide some alleviation of these limitations, but have problems of their own. The key step of specifying prior distributions in BNNs is no trivial task, yet is often skipped out of convenience. In this work, we propose a new class of prior distributions for BNNs, the Dirichlet scale mixture (DSM) prior, that addresses current limitations in Bayesian neural networks through structured, sparsity-inducing shrinkage. Theoretically, we derive general dependence structures and shrinkage results for DSM priors and show how they manifest under the geometry induced by neural networks. In experiments on simulated and real world data we find that the DSM priors encourage sparse networks through implicit feature selection, show robustness under adversarial attacks and deliver competitive predictive performance with substantially fewer effective parameters. In particular, their advantages appear most pronounced in correlated, moderately small data regimes, and are more amenable to weight pruning. Moreover, by adopting heavy-tailed shrinkage mechanisms, our approach aligns with recent findings that such priors can mitigate the cold posterior effect, offering a principled alternative to the commonly used Gaussian priors.
Comment: Matches Sparsity/Compression and Representation: Dirichlet scale mixture priors impose structured shrinkage in BNNs, enabling sparsity and pruning with robustness benefits.
Relevance: 8 Novelty: 7
41. VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention
ArXiv ID: 2602.19622
Authors: Jingbo Zhou, Jun Xia, Siyuan Li, Yunfan Liu, Wenjun Wang, Yufei Huang, Changxi Chi, Mutian Hong, Zhuoli Ouyang, Shu Wang, Zhongqi Wang, Xingyu Wu, Chang Yu, Stan Z. Li
Abstract: Graph Transformer has demonstrated impressive capabilities in the field of graph representation learning. However, existing approaches face two critical challenges: (1) most models suffer from exponentially increasing computational complexity, making it difficult to scale to large graphs; (2) attention mechanisms based on node-level operations limit the flexibility of the model and result in poor generalization performance in out-of-distribution (OOD) scenarios. To address these issues, we propose VecFormer (the Vector Quantized Graph Transformer), an efficient and highly generalizable model for node classification, particularly under OOD settings. VecFormer adopts a two-stage training paradigm. In the first stage, two codebooks are used to reconstruct the node features and the graph structure, aiming to learn the rich semantic Graph Codes. In the second stage, attention mechanisms are performed at the Graph Token level based on the transformed cross codebook, reducing computational complexity while enhancing the model's generalization capability. Extensive experiments on datasets of various sizes demonstrate that VecFormer outperforms the existing Graph Transformer in both performance and speed.
Comment: Model Architecture/Efficiency: introduces vector-quantized graph tokens and token-level attention to reduce Graph Transformer complexity and improve OOD generalization.
Relevance: 8 Novelty: 7
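The complexity argument rests on vector quantization: once every node is mapped to one of a small number of codes, attention can run over the codes rather than over all nodes. The sketch below shows only a generic nearest-code lookup with a hand-written toy codebook. The names and values are illustrative assumptions, and the paper's two-codebook training scheme is not reproduced.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))
            for vec in vectors]

# Four toy node feature vectors and a two-entry codebook (hypothetical values).
node_features = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.2]]
codebook = [[0.0, 0.0], [1.0, 1.0]]

codes = quantize(node_features, codebook)
print(codes)
```

Here four nodes collapse onto two codes, so a token-level attention step would operate on two tokens instead of four nodes.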
42. Spectral bias in physics-informed and operator learning: Analysis and mitigation guidelines
ArXiv ID: 2602.19265
Authors: Siavash Khodakarami, Vivek Oommen, Nazanin Ahmadi Daryakenari, Maxim Beekenkamp, George Em Karniadakis
Abstract: Neural networks and Kolmogorov-Arnold Networks (KANs) used to solve partial differential equations (PDEs), including physics-informed neural networks (PINNs), physics-informed KANs (PIKANs), and neural operators, are known to exhibit spectral bias, whereby low-frequency components of the solution are learned significantly faster than high-frequency modes. While spectral bias is often treated as an intrinsic representational limitation of neural architectures, its interaction with optimization dynamics and physics-based loss formulations remains poorly understood. In this work, we provide a systematic investigation of spectral bias in physics-informed and operator learning frameworks, with emphasis on the coupled roles of network architecture, activation functions, loss design, and optimization strategy. We quantify spectral bias through frequency-resolved error metrics, Barron-norm diagnostics, and higher-order statistical moments, enabling a unified analysis across elliptic, hyperbolic, and dispersive PDEs. Through diverse benchmark problems, including the Korteweg-de Vries, wave and steady-state diffusion-reaction equations, turbulent flow reconstruction, and earthquake dynamics, we demonstrate that spectral bias is not simply representational but fundamentally dynamical. In particular, second-order optimization methods substantially alter the spectral learning order, enabling earlier and more accurate recovery of high-frequency modes for all PDE types. For neural operators, we further show that spectral bias is dependent on the neural operator architecture and can also be effectively mitigated through spectral-aware loss formulations without increasing the inference cost.
Comment: Representation/training dynamics: analyzes spectral bias in PINNs/neural operators and proposes optimization and loss strategies to mitigate it.
Relevance: 8 Novelty: 7
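One plausible form of a frequency-resolved error metric is the magnitude of the discrete Fourier transform of the prediction error, reported per frequency index. The sketch below assumes a uniformly sampled 1-D signal and uses a toy "prediction" that has captured only the low-frequency mode. The paper's actual metrics are not specified in the abstract.

```python
import cmath
import math

def dft_magnitudes(x):
    """Normalized DFT magnitudes for frequency indices 0..n/2."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) / n
            for k in range(n // 2 + 1)]

def frequency_resolved_error(pred, target):
    """Per-frequency magnitude of the pointwise error pred - target."""
    return dft_magnitudes([p - q for p, q in zip(pred, target)])

n = 32
low = [math.sin(2 * math.pi * t / n) for t in range(n)]
high = [0.5 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
target = [a + b for a, b in zip(low, high)]
pred = low  # mimics spectral bias: only the low-frequency mode has been learned

errs = frequency_resolved_error(pred, target)
print([round(e, 3) for e in errs[:8]])
```

In this toy setup the error concentrates at frequency index 5 with magnitude close to 0.25 (half the amplitude of the missing mode), while index 1 is essentially error-free.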
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
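A response in this format could be checked with a small parser like the one below. The field names come from the instructions above, and the two sample records reuse IDs, comments, and scores from entries 41 and 42 of this digest. Encoding the scores as JSON integers is an assumption, since the prompt does not fix the value types.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}

def parse_jsonl(text):
    """Parse one JSON object per non-empty line and validate keys and score ranges."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        for key in ("RELEVANCE", "NOVELTY"):
            if not 1 <= int(record[key]) <= 10:
                raise ValueError(f"{key} out of range: {record[key]}")
        records.append(record)
    return records

sample = """
{"ARXIVID": "2602.19622", "COMMENT": "Model Architecture/Efficiency: introduces vector-quantized graph tokens and token-level attention to reduce Graph Transformer complexity and improve OOD generalization.", "RELEVANCE": 8, "NOVELTY": 7}
{"ARXIVID": "2602.19265", "COMMENT": "Representation/training dynamics: analyzes spectral bias in PINNs/neural operators and proposes optimization and loss strategies to mitigate it.", "RELEVANCE": 8, "NOVELTY": 7}
"""

records = parse_jsonl(sample)
print([(r["ARXIVID"], r["RELEVANCE"], r["NOVELTY"]) for r in records])
```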
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
  - Focus: Fully aligned with core topics with no deviation; score the highest if it contains relevant keywords.
  - Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
- Relevance 7-8 (Relevant)
  - Focus: Retains a solid link to the main research area, though it may touch on peripheral elements.
  - Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.
- Relevance 5-6 (Borderline)
  - Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  - Examples: Work referencing MoE centered on reinforcement learning.
- Relevance 3-4 (Irrelevant)
  - Focus: Largely outside our interests with no association to our topics.
  - Examples: Application-focused papers like using MoE to solve a problem in the real world.
- Relevance 1-2 (Ignore)
  - Focus: Purely unrelated to our topics. Completely a different domain.
  - Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
  - Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  - Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
- Novelty 7-8 (Improvements)
  - Definition: Substantial insights/enhancements, though not a full paradigm shift.
  - Examples: Modifications on existing methods yielding significantly better results.
- Novelty 5-6 (Borderline)
  - Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  - Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
- Novelty 3-4 (Tangential)
  - Definition: Minor or domain-specific improvements with limited broader impact.
  - Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
- Novelty 1-2 (Low)
  - Definition: Minimal originality, applying standard approaches without real innovation.
  - Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Model Architecture:
- Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures.
- Irrelevant: Merely using existing architectures for a certain task without insights into the structures themselves.
Model Compression and Efficiency:
- Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs.
- Irrelevant: Straightforward applications of existing compression methods to new tasks.
High Performance Computing:
- Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization.
- Irrelevant: Incremental engineering improvements without novel algorithmic contributions.
Representation Learning:
- Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
- Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.