Personalized Daily ArXiv Papers 2026-02-24

[gpt-5]	Prompt	Completion	Total
Token	79884	64231	144115
Cost	$0.1	$0.64	$0.74

Total arXiv papers: 1009

Total scanned papers: 583

Total relevant papers: 42

Table of contents with paper titles:

A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs Authors: Zijie Liu, Jie Peng, Jinhao Duan, Zirui Liu, Kaixiong Zhou, Mingfu Liang, Luke Simon, Xi Liu, Zhaozhuo Xu, Tianlong Chen
Why ReLU? A Bit-Model Dichotomy for Deep Network Training Authors: Ilan Doron-Arad, Elchanan Mossel
PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse Authors: Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan
Toward Manifest Relationality in Transformers via Symmetry Reduction Authors: J. François, L. Ravera
Incremental Learning of Sparse Attention Patterns in Transformers Authors: Oğuz Kaan Yüksel, Rodrigo Alvarez Lucendo, Nicolas Flammarion
Path-conditioned training: a principled way to rescale ReLU neural networks Authors: Arthur Lebeurrier, Titouan Vayer, Rémi Gribonval
Regularity of Second-Order Elliptic PDEs in Spectral Barron Spaces Authors: Ziang Chen, Liqiang Huang, Mengxuan Yang, Shengxuan Zhou
Adaptation to Intrinsic Dependence in Diffusion Language Models Authors: Yunxiao Zhao, Changxiao Cai
RPU -- A Reasoning Processing Unit Authors: Matthew Adiletta, Gu-Yeon Wei, David Brooks
Celo2: Towards Learned Optimization Free Lunch Authors: Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky
IDLM: Inverse-distilled Diffusion Language Models Authors: David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, Alexander Korotin
Attention Deficits in Language Models: Causal Explanations for Procedural Hallucinations Authors: Ahmed Karim, Fatima Sheaib, Zein Khamis, Maggie Chlon, Jad Awada, Leon Chlon
Manifold-Aligned Generative Transport Authors: Xinyu Tian, Xiaotong Shen
Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations Authors: Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang
Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series Authors: Guoqi Yu, Juncheng Wang, Chen Yang, Jing Qin, Angelica I. Aviles-Rivero, Shujun Wang
Scaling Laws for Precision in High-Dimensional Linear Regression Authors: Dechen Zhang, Xuan Tang, Yingyu Liang, Difan Zou
Bayesian Lottery Ticket Hypothesis Authors: Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus Götz, Charlotte Debus
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Training-Free Generative Modeling via Kernelized Stochastic Interpolants Authors: Florentin Coeurdoux, Etienne Lempereur, Nathanaël Cuvelle-Magar, Thomas Eboli, Stéphane Mallat, Anastasia Borovykh, Eric Vanden-Eijnden
On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference Authors: Moritz A. Zanger, Yijun Wu, Pascal R. Van der Vaart, Wendelin Böhmer, Matthijs T. J. Spaan
K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model Authors: Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, Ion Stoica
Training-Free Cross-Architecture Merging for Graph Neural Networks Authors: Rishabh Bhattacharya, Vikaskumar Kalsariya, Naresh Manwani
A Markovian View of Iterative-Feedback Loops in Image Generative Models: Neural Resonance and Model Collapse Authors: Vibhas Kumar Vats, David J. Crandall, Samuel Goree
Online Realizable Regression and Applications for ReLU Networks Authors: Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel
Implicit Bias and Convergence of Matrix Stochastic Mirror Descent Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi
I Dropped a Neural Net Authors: Hyunwoo Park
A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning Authors: Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, Clementine Domine
Behavior Learning (BL): Learning Hierarchical Optimization Structures from Data Authors: Zhenyao Ma, Yue Liang, Dongxu Li
A Computationally Efficient Multidimensional Vision Transformer Authors: Alaa El Ichi, Khalide Jbilou
Grokking Finite-Dimensional Algebra Authors: Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau
Understanding the Curse of Unrolling Authors: Sheheryar Mehmood, Florian Knoll, Peter Ochs
Partial Soft-Matching Distance for Neural Representational Comparison with Partial Unit Correspondence Authors: Chaitanya Kapoor, Alex H. Williams, Meenakshi Khosla
Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement Authors: Amirhossein Farzam, Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro
Transformers for dynamical systems learn transfer operators in-context Authors: Anthony Bao, Jeffrey Lai, William Gilpin
Spilled Energy in Large Language Models Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi
Information-Guided Noise Allocation for Efficient Diffusion Training Authors: Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh-Hsin Lai, Yuki Mitsufuji, Luca Ambrogioni
Relational Feature Caching for Accelerating Diffusion Transformers Authors: Byunggwan Son, Jeimin Jeon, Jeongwoo Choi, Bumsub Ham
Insertion Based Sequence Generation with Learnable Order Dynamics Authors: Dhruvesh Patel, Benjamin Rozonoyer, Gaurav Pandey, Tahira Naseem, Ramón Fernandez Astudillo, Andrew McCallum
Laplacian Multi-scale Flow Matching for Generative Modeling Authors: Zelin Zhao, Petr Molodyk, Haotian Xue, Yongxin Chen
Dirichlet Scale Mixture Priors for Bayesian Neural Networks Authors: August Arnstad, Leiv Rønneberg, Geir Storvik
VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention Authors: Jingbo Zhou, Jun Xia, Siyuan Li, Yunfan Liu, Wenjun Wang, Yufei Huang, Changxi Chi, Mutian Hong, Zhuoli Ouyang, Shu Wang, Zhongqi Wang, Xingyu Wu, Chang Yu, Stan Z. Li
Spectral bias in physics-informed and operator learning: Analysis and mitigation guidelines Authors: Siavash Khodakarami, Vivek Oommen, Nazanin Ahmadi Daryakenari, Maxim Beekenkamp, George Em Karniadakis

1. A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs

ArXiv ID: 2602.19938

Authors: Zijie Liu, Jie Peng, Jinhao Duan, Zirui Liu, Kaixiong Zhou, Mingfu Liang, Luke Simon, Xi Liu, Zhaozhuo Xu, Tianlong Chen

Abstract: Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently, delivering strong accuracy under fixed compute budgets. However, SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. Prior work has focused mainly on training-time solutions such as routing regularization or auxiliary losses, leaving inference-time behavior, which is critical for deployment, less explored. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set. These insights motivate inference-time mechanisms that rebalance workloads without retraining or router modification. We propose Replicate-and-Quantize (R&Q), a training-free and near-lossless framework for dynamic workload rebalancing. In each layer, heavy-hitter experts are replicated to increase parallel capacity, while less critical experts and replicas are quantized to remain within the original memory budget. We also introduce a Load-Imbalance Score (LIS) to measure routing skew by comparing heavy-hitter load to an equal allocation baseline. Experiments across representative SMoE models and benchmarks show up to 1.4x reduction in imbalance with accuracy maintained within +/-0.6%, enabling more predictable and efficient inference.

Comment: Model Compression and Efficiency: MoE inference-time load balancing via expert replication and quantization; training-free, systems-level improvement for Sparse MoE LLMs.

Relevance: 10 Novelty: 8
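
The abstract describes the Load-Imbalance Score only as a comparison of heavy-hitter load against an equal-allocation baseline. The sketch below is one plausible reading, not the paper's definition: it assumes the score is the busiest expert's token count divided by the per-expert share under a perfectly even split. The function names and the toy routing counts are hypothetical.

```python
def load_imbalance_score(tokens_per_expert):
    """Busiest expert's load divided by the load under an equal split (assumed form)."""
    total = sum(tokens_per_expert)
    if total == 0:
        return 1.0  # no traffic: treat as balanced
    equal_share = total / len(tokens_per_expert)
    return max(tokens_per_expert) / equal_share


def replicate_heavy_hitter(tokens_per_expert, replicas=2):
    """Split the busiest expert's tokens evenly across `replicas` copies."""
    loads = list(tokens_per_expert)
    busiest = max(range(len(loads)), key=loads.__getitem__)
    share = loads.pop(busiest) / replicas
    return loads + [share] * replicas


if __name__ == "__main__":
    routed = [70, 10, 10, 10]  # hypothetical token counts for four experts
    print(load_imbalance_score(routed))
    print(load_imbalance_score(replicate_heavy_hitter(routed)))
```

Under this assumed score, replicating the heavy hitter lowers the skew because its tokens are shared across two copies; the quantization step that keeps memory within budget is not modeled here.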

2. Why ReLU? A Bit-Model Dichotomy for Deep Network Training

ArXiv ID: 2602.19017

Authors: Ilan Doron-Arad, Elchanan Mossel

Abstract: Theoretical analyses of Empirical Risk Minimization (ERM) are standardly framed within the Real-RAM model of computation. In this setting, training even simple neural networks is known to be $\exists \mathbb{R}$-complete -- a complexity class believed to be harder than NP, that characterizes the difficulty of solving systems of polynomial inequalities over the real numbers. However, this algebraic framework diverges from the reality of digital computation with finite-precision hardware. In this work, we analyze the theoretical complexity of ERM under a realistic bit-level model ($\mathsf{ERM}_{\text{bit}}$), where network parameters and inputs are constrained to be rational numbers with polynomially bounded bit-lengths. Under this model, we reveal a sharp dichotomy in tractability governed by the network's activation function. We prove that for deep networks with any polynomial activations with rational coefficients and degree at least $2$, the bit-complexity of training is severe: deciding $\mathsf{ERM}_{\text{bit}}$ is $\#P$-hard, hence believed to be strictly harder than NP-complete problems. Furthermore, we show that determining the sign of a single partial derivative of the empirical loss function is intractable (unlikely in BPP), and deciding a specific bit in the gradient is $\#P$-hard. This provides a complexity-theoretic perspective for the phenomenon of exploding and vanishing gradients. In contrast, we show that for piecewise-linear activations such as ReLU, the precision requirements remain manageable: $\mathsf{ERM}_{\text{bit}}$ is contained within NP (specifically NP-complete), and standard backpropagation runs in polynomial time. Our results demonstrate that finite-precision constraints are not merely implementation details but fundamental determinants of learnability.

Comment: Theoretical foundations/architecture: bit-model complexity dichotomy showing ReLU yields tractable ERM vs. polynomial activations (#P-hard).

Relevance: 9 Novelty: 9

3. PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse

ArXiv ID: 2602.18904

Authors: Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan

Abstract: Vector-quantized autoencoders deliver high-fidelity latents but suffer inherent flaws: the quantizer is non-differentiable, requires straight-through hacks, and is prone to collapse. We address these issues at the root by replacing VQ with a simple, principled, and fully differentiable alternative: an online PCA bottleneck trained via Oja's rule. The resulting model, PCA-VAE, learns an orthogonal, variance-ordered latent basis without codebooks, commitment losses, or lookup noise. Despite its simplicity, PCA-VAE exceeds VQ-GAN and SimVQ in reconstruction quality on CelebAHQ while using 10-100x fewer latent bits. It also produces naturally interpretable dimensions (e.g., pose, lighting, gender cues) without adversarial regularization or disentanglement objectives. These results suggest that PCA is a viable replacement for VQ: mathematically grounded, stable, bit-efficient, and semantically structured, offering a new direction for generative models beyond vector quantization.

Comment: Model Architecture + Compression/Efficiency: replaces vector quantization with a differentiable PCA bottleneck (Oja’s rule), yielding stable, bit-efficient autoencoders.

Relevance: 9 Novelty: 8
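
For readers unfamiliar with Oja's rule, which the abstract names as the training signal for the PCA bottleneck, here is a minimal single-unit sketch on toy 2-D data. It is the textbook rule, not the PCA-VAE architecture; the data generator, learning rate, and function names are assumptions made for illustration.

```python
import random


def oja_first_component(samples, lr=0.01, seed=0):
    """Estimate the leading principal direction with Oja's single-unit rule."""
    rng = random.Random(seed)
    dim = len(samples[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))  # projection onto current direction
        # Hebbian term y*x plus a decay term -y^2*w that keeps ||w|| near 1.
        w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


def make_samples(n=5000, seed=1):
    """Zero-mean toy data (hypothetical) with most variance along the first axis."""
    rng = random.Random(seed)
    return [[2.0 * rng.gauss(0.0, 1.0), 0.5 * rng.gauss(0.0, 1.0)] for _ in range(n)]


if __name__ == "__main__":
    w = oja_first_component(make_samples())
    print(w)  # expected to point roughly along the first axis, up to sign
```

The update needs no codebook or straight-through estimator, which is the property the abstract contrasts with vector quantization.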

4. Toward Manifest Relationality in Transformers via Symmetry Reduction

ArXiv ID: 2602.18948

Authors: J. François, L. Ravera

Abstract: Transformer models contain substantial internal redundancy arising from coordinate-dependent representations and continuous symmetries, in model space and in head space, respectively. While recent approaches address this by explicitly breaking symmetry, we propose a complementary framework based on symmetry reduction. We reformulate representations, attention mechanisms, and optimization dynamics in terms of invariant relational quantities, eliminating redundant degrees of freedom by construction. This perspective yields architectures that operate directly on relational structures, providing a principled geometric framework for reducing parameter redundancy and analyzing optimization.

Comment: Model Architecture: symmetry-reduced Transformer operating on invariant relational quantities to remove redundant degrees of freedom and analyze optimization.

Relevance: 9 Novelty: 8

5. Incremental Learning of Sparse Attention Patterns in Transformers

ArXiv ID: 2602.19143

Authors: Oğuz Kaan Yüksel, Rodrigo Alvarez Lucendo, Nicolas Flammarion

Abstract: This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions with varying statistical significance. We demonstrate that transformers learn this task incrementally: each stage is defined by the acquisition of specific information through sparse attention patterns. Notably, we identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns. We model these dynamics using simplified differential equations that characterize the trajectory and prove stage-wise convergence results. Our analysis reveals that transformers ascend a complexity ladder by passing through simpler, misspecified hypothesis classes before reaching the full model class. We further show that early stopping acts as an implicit regularizer, biasing the model toward these simpler classes. These results provide a theoretical foundation for the emergence of staged learning and complex behaviors in transformers, offering insights into generalization for natural language processing and algorithmic reasoning.

Comment: Training Dynamics/Representation Learning: analyzes staged emergence of sparse attention patterns in transformers with differential equation modeling and convergence results.

Relevance: 9 Novelty: 8

6. Path-conditioned training: a principled way to rescale ReLU neural networks

ArXiv ID: 2602.19799

Authors: Arthur Lebeurrier, Titouan Vayer, Rémi Gribonval

Abstract: Despite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled sets of weights implement the same function, their training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion for rescaling neural network parameters, whose minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.

Comment: Model Architecture/Optimization Theory: path-conditioned rescaling of ReLU networks via path-lifting and kernel alignment; principled conditioning improving training.

Relevance: 9 Novelty: 8
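
The rescaling symmetry the abstract builds on can be checked directly: because ReLU is positively homogeneous, multiplying a hidden unit's incoming weight and bias by a positive factor and dividing its outgoing weight by the same factor leaves the network function unchanged. The sketch below demonstrates only this symmetry on a tiny scalar-input network; it does not implement the paper's path-lifting criterion, and all names and weights are hypothetical.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def two_layer_relu(x, w_in, b, w_out):
    """f(x) = sum_j w_out[j] * relu(w_in[j] * x + b[j]) for scalar x."""
    return sum(v * relu(u * x + c) for u, c, v in zip(w_in, b, w_out))


def rescale_unit(w_in, b, w_out, j, lam):
    """Multiply unit j's incoming weight and bias by lam > 0, divide outgoing by lam."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    w_in, b, w_out = list(w_in), list(b), list(w_out)
    w_in[j] *= lam
    b[j] *= lam
    w_out[j] /= lam
    return w_in, b, w_out


if __name__ == "__main__":
    params = ([1.5, -2.0], [0.5, 1.0], [2.0, -3.0])  # hypothetical weights
    rescaled = rescale_unit(*params, j=0, lam=4.0)
    for x in (-1.0, 0.25, 3.0):
        print(two_layer_relu(x, *params), two_layer_relu(x, *rescaled))
```

Both parameter sets print the same outputs, yet their gradients differ in scale, which is why the choice of rescaling can affect training speed.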

7. Regularity of Second-Order Elliptic PDEs in Spectral Barron Spaces

ArXiv ID: 2602.19381

Authors: Ziang Chen, Liqiang Huang, Mengxuan Yang, Shengxuan Zhou

Abstract: We establish a regularity theorem for second-order elliptic PDEs on $\mathbb{R}^{d}$ in spectral Barron spaces. Under mild ellipticity and smallness assumptions, the solution gains two additional orders of Barron regularity. As a corollary, we identify a class of PDEs whose solutions can be approximated by two-layer neural networks with cosine activation functions, where the width of the neural network is independent of the spatial dimension.

Comment: Theory/Representation: proves Barron-space regularity gains for elliptic PDEs and dimension-independent two-layer cosine-network approximation.

Relevance: 9 Novelty: 8

8. Adaptation to Intrinsic Dependence in Diffusion Language Models

ArXiv ID: 2602.20126

Authors: Yunxiao Zhao, Changxiao Cai

Abstract: Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) approaches, enabling parallel token generation beyond a rigid left-to-right order. Despite growing empirical success, the theoretical understanding of how unmasking schedules -- which specify the order and size of unmasked tokens during sampling -- affect generation quality remains limited. In this work, we introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution, without requiring any prior knowledge or hyperparameter tuning. In contrast to prior deterministic procedures that fix unmasking sizes, our method randomizes the number of tokens revealed at each iteration. We show that, for two specific parameter choices, the sampling convergence guarantees -- measured by Kullback-Leibler (KL) divergence -- scale as $\widetilde O(\mathsf{TC}/K)$ and $\widetilde O(\mathsf{DTC}/K)$ respectively. Here, $K$ is the number of iterations, and $\mathsf{TC}$ and $\mathsf{DTC}$ are the total correlation and dual total correlation of the target distribution, capturing the intrinsic dependence structure underlying the data. Importantly, our guarantees hold in the practically relevant parallel-sampling regime $K<L$ where $L$ is the token sequence length. These results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions. Overall, our findings unveil the adaptivity of DLMs to intrinsic data structures and shed light on the benefit of randomized unmasking sizes in inference schedule design.

Comment: Model Architecture/Inference Efficiency: distribution-agnostic randomized unmasking schedules for diffusion language models with KL convergence scaling to total correlation.

Relevance: 9 Novelty: 8
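
The two dependence measures in the convergence rates have standard definitions: total correlation is TC = sum_i H(X_i) - H(X), and dual total correlation is DTC = H(X) - sum_i H(X_i | X_-i). The sketch below computes both for a small discrete joint distribution, in bits (the abstract does not state the logarithm base, so the unit is an assumption). The dictionary representation and function names are hypothetical, and the paper's unmasking schedule itself is not implemented.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0.0)


def marginal(joint, keep):
    """Marginalize a dict {outcome_tuple: prob} onto the coordinates in `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def total_correlation(joint):
    """TC = sum_i H(X_i) - H(X)."""
    n = len(next(iter(joint)))
    h_joint = entropy(joint.values())
    h_marginals = sum(entropy(marginal(joint, [i]).values()) for i in range(n))
    return h_marginals - h_joint


def dual_total_correlation(joint):
    """DTC = H(X) - sum_i H(X_i | X_-i), using H(X_i | X_-i) = H(X) - H(X_-i)."""
    n = len(next(iter(joint)))
    h_joint = entropy(joint.values())
    cond = 0.0
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        cond += h_joint - entropy(marginal(joint, rest).values())
    return h_joint - cond


if __name__ == "__main__":
    three_copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}  # three identical fair bits
    print(total_correlation(three_copies), dual_total_correlation(three_copies))
```

For independent tokens both quantities are zero, which is the low-complexity regime where the stated rates predict the largest sampling speedup.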

9. RPU -- A Reasoning Processing Unit

ArXiv ID: 2602.18568

Authors: Matthew Adiletta, Gu-Yeon Wei, David Brooks

Abstract: Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for a system architecture optimized for scalable memory bandwidth. To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture featuring a bandwidth-first power and area provisioning design; and (3) a decoupled microarchitecture that separates memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3x lower latency and 18.6x higher throughput than an H100 system at ISO-TDP on Llama3-405B.

Comment: Matches High Performance Computing: chiplet-based, bandwidth-first architecture with decoupled pipelines to overcome memory-wall bottlenecks in LLM inference.

Relevance: 9 Novelty: 8
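
The memory-wall argument can be made concrete with a back-of-the-envelope bound: if every weight must be read from memory once per generated token, a decode step cannot finish faster than the weight bytes divided by memory bandwidth. The sketch below encodes that bound and the matching arithmetic intensity. It is a simplified roofline-style estimate that ignores KV-cache traffic, compute time, and communication; the function names and the numbers in the usage lines are hypothetical and are not taken from the paper.

```python
def min_decode_step_seconds(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on one decode step when every weight is read once from memory."""
    return param_count * bytes_per_param / bandwidth_bytes_per_s


def decode_arithmetic_intensity(batch_size, bytes_per_param):
    """FLOPs per byte of weights read: one multiply-add (2 FLOPs) per weight per sequence."""
    return 2.0 * batch_size / bytes_per_param


if __name__ == "__main__":
    # Hypothetical model and memory system, chosen only to keep the arithmetic simple.
    print(min_decode_step_seconds(1e9, 2, 1e12))
    print(decode_arithmetic_intensity(batch_size=1, bytes_per_param=2))
```

At small batch sizes the intensity is on the order of one FLOP per byte, so added compute throughput does not help and only more bandwidth shortens the step.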

10. Celo2: Towards Learned Optimization Free Lunch

ArXiv ID: 2602.19142

Authors: Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky

Abstract: Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months ($\sim$10$\times$ GPT-3 compute) to meta-train a general-purpose optimizer, but it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL 1.3B), which is six orders of magnitude larger than its meta-training distribution. Furthermore, it shows strong performance across diverse out-of-distribution tasks and is compatible with a modern optimization harness that includes orthogonalization, distinct update rules for input-output and hidden weights, and decoupled weight decay. In all, this work paves the way for practically applicable learnable optimization algorithms, unlocking exploration of richer meta-training and data curation recipes to further improve performance.

Comment: Matches Efficiency/Training Dynamics: simple normalized learned optimizer meta-trained with tiny compute, scaling out-of-distribution to billion-parameter pretraining.

Relevance: 9 Novelty: 8

11. IDLM: Inverse-distilled Diffusion Language Models

ArXiv ID: 2602.19066

Authors: David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, Alexander Korotin

Abstract: Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4x-64x, while preserving the teacher model's entropy and generative perplexity.

Comment: Matches Model Compression/Efficiency: inverse distillation reduces DLM sampling steps 4–64× with theoretical uniqueness and gradient-stable relaxations.

Relevance: 9 Novelty: 8

12. Attention Deficits in Language Models: Causal Explanations for Procedural Hallucinations

ArXiv ID: 2602.19239

Authors: Ahmed Karim, Fatima Sheaib, Zein Khamis, Maggie Chlon, Jad Awada, Leon Chlon

Abstract: Large language models can follow complex procedures yet fail at a seemingly trivial final step: reporting a value they themselves computed moments earlier. We study this phenomenon as procedural hallucination: failure to execute a verifiable, prompt-grounded specification even when the correct value is present in context. In long-context binding tasks with a known single-token candidate set, we find that many errors are readout-stage routing failures. Specifically, failures decompose into Stage 2A (gating) errors, where the model does not enter answer mode, and Stage 2B (binding) errors, where it enters answer mode but selects the wrong candidate (often due to recency bias). In the hard regime, Stage 2B accounts for most errors across model families in our tasks (Table 1). On Stage 2B error trials, a linear probe on the final-layer residual stream recovers the correct value far above chance (e.g., 74% vs. 2% on Qwen2.5-3B; Table 2), indicating that the answer is encoded but not used. We formalize "present but not used" via available vs. used mutual information and pseudo-prior interventions, yielding output-computable diagnostics and information-budget certificates. Finally, an oracle checkpointing intervention that restates the true binding near the query can nearly eliminate Stage 2B failures at long distance (e.g., Qwen2.5-3B $0/400 \rightarrow 399/400$ at $k = 1024$; Table 8).

Comment: Matches Representation Learning/Training Dynamics: causal analysis of LLM readout failures (gating vs. binding), with probes and mutual-information diagnostics.

Relevance: 9 Novelty: 8

13. Manifold-Aligned Generative Transport

ArXiv ID: 2602.19600

Authors: Xinyu Tian, Xiaotong Shen

Abstract: High-dimensional generative modeling is fundamentally a manifold-learning problem: real data concentrate near a low-dimensional structure embedded in the ambient space. Effective generators must therefore balance support fidelity -- placing probability mass near the data manifold -- with sampling efficiency. Diffusion models often capture near-manifold structure but require many iterative denoising steps and can leak off-support; normalizing flows sample in one pass but are limited by invertibility and dimension preservation. We propose MAGT (Manifold-Aligned Generative Transport), a flow-like generator that learns a one-shot, manifold-aligned transport from a low-dimensional base distribution to the data space. Training is performed at a fixed Gaussian smoothing level, where the score is well-defined and numerically stable. We approximate this fixed-level score using a finite set of latent anchor points with self-normalized importance sampling, yielding a tractable objective. MAGT samples in a single forward pass, concentrates probability near the learned support, and induces an intrinsic density with respect to the manifold volume measure, enabling principled likelihood evaluation for generated samples. We establish finite-sample Wasserstein bounds linking smoothing level and score-approximation accuracy to generative fidelity, and empirically improve fidelity and manifold concentration across synthetic and benchmark datasets while sampling substantially faster than diffusion models.

Comment: Model Architecture/Representation Learning: proposes a one-shot manifold-aligned generative transport with theoretical Wasserstein bounds.

Relevance: 9 Novelty: 8

14. Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations

ArXiv ID: 2602.19691

Authors: Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang

Abstract: Smooth activation functions are ubiquitous in modern deep learning, yet their theoretical advantages over non-smooth counterparts remain poorly understood. In this work, we characterize both approximation and statistical properties of neural networks with smooth activations over the Sobolev space $W^{s,\infty}([0,1]^d)$ for arbitrary smoothness $s>0$. We prove that constant-depth networks equipped with smooth activations automatically exploit arbitrarily high orders of target function smoothness, achieving the minimax-optimal approximation and estimation error rates (up to logarithmic factors). In sharp contrast, networks with non-smooth activations, such as ReLU, lack this adaptivity: their attainable approximation order is strictly limited by depth, and capturing higher-order smoothness requires proportional depth growth. These results identify activation smoothness as a fundamental mechanism, alternative to depth, for attaining statistical optimality. Technically, our results are established via a constructive approximation framework that produces explicit neural network approximators with carefully controlled parameter norms and model size. This complexity control ensures statistical learnability under empirical risk minimization (ERM) and removes the impractical sparsity constraints commonly required in prior analyses.

Comment: Representation/approximation theory: shows smooth activations enable depth-constant, minimax-optimal rates (smoothness adaptivity).

Relevance: 9 Novelty: 8

15. Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series

ArXiv ID: 2602.18473

Authors: Guoqi Yu, Juncheng Wang, Chen Yang, Jing Qin, Angelica I. Aviles-Rivero, Shujun Wang

Abstract: Accurate analysis of medical time series (MedTS) data, such as electroencephalography (EEG) and electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibit two critical patterns: temporal dependencies within individual channels and channel dependencies across multiple channels. While recent advances in deep learning have leveraged Transformer-based models to effectively capture temporal dependencies, they often struggle with modeling channel dependencies. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention mechanism is decentralized, making it less effective at capturing global synchronization and unified waveform patterns. To address this mismatch, we propose CoTAR (Core Token Aggregation-Redistribution), a centralized MLP-based module designed to replace decentralized attention. Instead of allowing all tokens to interact directly, as in standard attention, CoTAR introduces a global core token that serves as a proxy to facilitate inter-token interactions, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a 12.13% improvement on the APAVA dataset, while using only 33% of the memory and 20% of the inference time compared to the previous state of the art. Code and all training scripts are available at https://github.com/Levi-Ackman/TeCh.

Comment: Model Architecture and Efficiency: replaces attention with a centralized aggregation (CoTAR) achieving linear complexity and improved channel dependency modeling.

Relevance: 9 Novelty: 7
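
To illustrate why routing all interaction through a single core token is linear rather than quadratic in the number of tokens, here is a generic aggregate-then-redistribute sketch. It is not the authors' CoTAR module: it assumes mean pooling for aggregation and a fixed blending weight for redistribution in place of the learned MLPs the abstract describes, and the function name and `mix` parameter are hypothetical.

```python
def aggregate_redistribute(tokens, mix=0.5):
    """Pool all tokens into one core vector, then blend it back into each token.

    Cost is O(n * d) for n tokens of width d, versus O(n^2 * d) for pairwise attention.
    """
    n, d = len(tokens), len(tokens[0])
    # Aggregation: one pass over the tokens builds the shared core vector.
    core = [sum(tok[k] for tok in tokens) / n for k in range(d)]
    # Redistribution: each token sees the others only through the core vector.
    return [[(1.0 - mix) * tok[k] + mix * core[k] for k in range(d)] for tok in tokens]


if __name__ == "__main__":
    channels = [[0.0, 2.0], [2.0, 0.0]]  # two hypothetical channel tokens
    print(aggregate_redistribute(channels))
```

Each token is touched a constant number of times, so doubling the number of channels doubles the work instead of quadrupling it.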

16. Scaling Laws for Precision in High-Dimensional Linear Regression

ArXiv ID: 2602.19241

Authors: Dechen Zhang, Xuan Tang, Yingyu Liang, Difan Zou

Abstract: Low-precision training is critical for optimizing the trade-off between model quality and training costs, necessitating the joint allocation of model size, dataset size, and numerical precision. While empirical scaling laws suggest that quantization impacts effective model and data capacities or acts as an additive error, the theoretical mechanisms governing these effects remain largely unexplored. In this work, we initiate a theoretical study of scaling laws for low-precision training within a high-dimensional sketched linear regression framework. By analyzing multiplicative (signal-dependent) and additive (signal-independent) quantization, we identify a critical dichotomy in their scaling behaviors. Our analysis reveals that while both schemes introduce an additive error and degrade the effective data size, they exhibit distinct effects on effective model size: multiplicative quantization maintains the full-precision model size, whereas additive quantization reduces the effective model size. Numerical experiments validate our theoretical findings. By rigorously characterizing the complex interplay among model scale, dataset size, and quantization error, our work provides a principled theoretical basis for optimizing training protocols under practical hardware constraints.

Comment: Model Compression and Efficiency: provides theoretical scaling laws for low-precision (quantized) training, linking precision to effective model/data size.

Relevance: 9 Novelty: 7
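
To make the multiplicative versus additive distinction in the abstract above concrete, here is a small sketch of two deterministic rounding schemes. It is an illustration under assumptions, not the paper's noise model: the fixed-point grid stands in for additive (signal-independent) error, a truncated mantissa stands in for multiplicative (signal-dependent) error, and both function names are hypothetical.

```python
import math

def quantize_additive(x, step):
    """Fixed-point style: snap to a uniform grid. Absolute error <= step / 2 for any |x|."""
    return step * round(x / step)

def quantize_multiplicative(x, mantissa_bits):
    """Floating-point style: keep `mantissa_bits` mantissa bits. Relative error <= 2**-mantissa_bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)

xs = [0.013, 0.7, 3.3, 250.0, -41.5]
abs_err_additive = [abs(quantize_additive(x, 0.25) - x) for x in xs]
rel_err_multiplicative = [abs(quantize_multiplicative(x, 4) - x) / abs(x) for x in xs]
```

With a step of 0.25 the additive scheme maps 0.013 to zero, losing the value entirely, while the multiplicative scheme keeps its relative error bounded for small and large inputs alike.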

17. Bayesian Lottery Ticket Hypothesis

ArXiv ID: 2602.18825

Authors: Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus G\"otz, Charlotte Debus

Abstract: Bayesian neural networks (BNNs) are a useful tool for uncertainty quantification, but require substantially more computational resources than conventional neural networks. For non-Bayesian networks, the Lottery Ticket Hypothesis (LTH) posits the existence of sparse subnetworks that can train to the same accuracy as the original dense network, or even surpass it. Such sparse networks can lower the demand for computational resources at inference and during training. If the LTH holds in BNNs, the corresponding sparse subnetworks could motivate the development of sparse training algorithms and provide valuable insights into the underlying training process. Towards this end, we translate the LTH experiments to a Bayesian setting using common computer vision models. We investigate the defining characteristics of Bayesian lottery tickets, and extend our study towards a transplantation method connecting BNNs with deterministic Lottery Tickets. We generally find that the LTH holds in BNNs, and winning tickets of matching and surpassing accuracy are present independent of model size, with degradation at very high sparsities. However, the pruning strategy should rely primarily on magnitude, secondly on standard deviation. Furthermore, our results demonstrate that models rely on mask structure and weight initialization to varying degrees.

Comment: Matches Sparsity/Pruning: extends the Lottery Ticket Hypothesis to Bayesian NNs and analyzes effective pruning criteria for BNNs.

Relevance: 9 Novelty: 7

18. DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

ArXiv ID: 2602.18846

Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Abstract: Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline -- achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.

Comment: Compression/Efficiency: dual-stage token reduction (vision-side compression + text-guided pruning) for VLM training/inference.

Relevance: 9 Novelty: 7
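
The text-guided dropping stage described in the abstract above can be sketched roughly as follows. This is an assumption-laden toy, not the DUET-VLM code (see the linked repository for that): it scores each visual token by the average softmax attention it receives from the text tokens and keeps the top fraction. The function name and the scoring rule are hypothetical.

```python
import numpy as np

def text_guided_keep(visual, text, keep_ratio):
    """visual: (n_v, d), text: (n_t, d). Returns (kept_tokens, kept_indices)."""
    d = visual.shape[1]
    logits = text @ visual.T / np.sqrt(d)              # (n_t, n_v) text-to-visual scores
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)      # softmax over visual tokens
    saliency = attn.mean(axis=0)                       # (n_v,) average attention received
    k = max(1, int(round(keep_ratio * visual.shape[0])))
    keep = np.sort(np.argsort(saliency)[-k:])          # top-k, original token order preserved
    return visual[keep], keep

# Toy input: tokens 2, 5 and 7 are aligned with the single text query.
visual = np.zeros((9, 4))
visual[[2, 5, 7], 0] = [3.0, 4.0, 5.0]
text = np.array([[5.0, 0.0, 0.0, 0.0]])
kept, kept_idx = text_guided_keep(visual, text, keep_ratio=0.33)
```

A keep ratio of 0.33 corresponds to the roughly 67% token reduction quoted in the abstract; applying such a step at several layers with shrinking ratios would give a progressive schedule.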

19. Training-Free Generative Modeling via Kernelized Stochastic Interpolants

ArXiv ID: 2602.20070

Authors: Florentin Coeurdoux, Etienne Lempereur, Nathana\"el Cuvelle-Magar, Thomas Eboli, St\'ephane Mallat, Anastasia Borovykh, Eric Vanden-Eijnden

Abstract: We develop a kernel method for generative modeling within the stochastic interpolant framework, replacing neural network training with linear systems. The drift of the generative SDE is $\hat b_t(x) = \nabla\phi(x)^\top\eta_t$, where $\eta_t\in\R^P$ solves a $P\times P$ system computable from data, with $P$ independent of the data dimension $d$. Since estimates are inexact, the diffusion coefficient $D_t$ affects sample quality; the optimal $D_t^*$ from Girsanov diverges at $t=0$, but this poses no difficulty and we develop an integrator that handles it seamlessly. The framework accommodates diverse feature maps -- scattering transforms, pretrained generative models etc. -- enabling training-free generation and model combination. We demonstrate the approach on financial time series, turbulence, and image generation.

Comment: Model Architecture/Efficiency: training-free generative modeling via kernelized stochastic interpolants, replacing neural training with linear systems and specialized integrators.

Relevance: 8 Novelty: 8

20. On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference

ArXiv ID: 2602.19964

Authors: Moritz A. Zanger, Yijun Wu, Pascal R. Van der Vaart, Wendelin B\"ohmer, Matthijs T. J. Spaan

Abstract: Uncertainty quantification is central to safe and efficient deployments of deep learning models, yet many computationally practical methods lack rigorous theoretical motivation. Random network distillation (RND) is a lightweight technique that measures novelty via prediction errors against a fixed random target. While empirically effective, it has remained unclear what uncertainties RND measures and how its estimates relate to other approaches, e.g. Bayesian inference or deep ensembles. This paper establishes these missing theoretical connections by analyzing RND within the neural tangent kernel framework in the limit of infinite network width. Our analysis reveals two central findings in this limit: (1) The uncertainty signal from RND -- its squared self-predictive error -- is equivalent to the predictive variance of a deep ensemble. (2) By constructing a specific RND target function, we show that the RND error distribution can be made to mirror the centered posterior predictive distribution of Bayesian inference with wide neural networks. Based on this equivalence, we moreover devise a posterior sampling algorithm that generates i.i.d. samples from an exact Bayesian posterior predictive distribution using this modified \textit{Bayesian RND} model. Collectively, our findings provide a unified theoretical perspective that places RND within the principled frameworks of deep ensembles and Bayesian inference, and offer new avenues for efficient yet theoretically grounded uncertainty quantification methods.

Comment: Representation Learning/Uncertainty: establishes equivalence between RND, deep ensembles, and Bayesian inference in the NTK limit, providing a principled theoretical link.

Relevance: 8 Novelty: 8
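
For readers unfamiliar with random network distillation, the toy below shows the basic mechanism the abstract above refers to: a predictor is fitted to a fixed random target on training inputs, and its squared error serves as a novelty signal. It is a sketch under assumptions, far from the infinite-width setting the paper analyzes; in particular the predictor here is a plain least-squares linear map, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 32

# Fixed random target network: initialized once and never trained.
W_target = rng.normal(size=(d, h))

def target(x):
    return np.tanh(x @ W_target)

# Predictor (assumption: a linear map fitted by least squares on training inputs).
x_train = rng.normal(scale=0.1, size=(500, d))
A, *_ = np.linalg.lstsq(x_train, target(x_train), rcond=None)

def rnd_error(x):
    """Squared prediction error against the random target, one value per input row."""
    return ((x @ A - target(x)) ** 2).sum(axis=1)

x_seen = rng.normal(scale=0.1, size=(200, d))             # same distribution as training
x_novel = rng.normal(loc=4.0, scale=0.1, size=(200, d))   # far from the training data
err_seen = rnd_error(x_seen).mean()
err_novel = rnd_error(x_novel).mean()
```

The error stays small where the predictor has seen data and grows sharply on inputs far from it, which is the behavior that makes the signal usable as an uncertainty estimate.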

21. K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model

ArXiv ID: 2602.19128

Authors: Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, Ion Stoica

Abstract: Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic-guided evolutionary loops. These methods often struggle with complex kernels requiring coordinated, multi-step structural transformations, as they lack explicit planning capabilities and frequently discard promising strategies due to inefficient or incorrect intermediate implementations. To address this, we propose Search via Co-Evolving World Model and build K-Search based on this method. By replacing static search heuristics with a co-evolving world model, our framework leverages LLMs' prior domain knowledge to guide the search, actively exploring the optimization space. This approach explicitly decouples high-level algorithmic planning from low-level program instantiation, enabling the system to navigate non-monotonic optimization paths while remaining resilient to temporary implementation defects. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels. Our results show that K-Search significantly outperforms state-of-the-art evolutionary search methods, achieving an average 2.10x improvement and up to a 14.3x gain on complex MoE kernels. On the GPUMode TriMul task, K-Search achieves state-of-the-art performance on H100, reaching 1030us and surpassing both prior evolution and human-designed solutions.

Comment: High Performance Computing: co-evolving world model guides LLM-based search for GPU kernel optimization, yielding large speedups (incl. MoE kernels).

Relevance: 8 Novelty: 8

22. Training-Free Cross-Architecture Merging for Graph Neural Networks

ArXiv ID: 2602.19332

Authors: Rishabh Bhattacharya, Vikaskumar Kalsariya, Naresh Manwani

Abstract: Model merging has emerged as a powerful paradigm for combining the capabilities of distinct expert models without the high computational cost of retraining, yet current methods are fundamentally constrained to homogeneous architectures. For GNNs, however, message passing is topology-dependent and sensitive to misalignment, making direct parameter-space merging unreliable. To bridge this gap, we introduce H-GRAMA (Heterogeneous Graph Routing and Message Alignment), a training-free framework that lifts merging from parameter space to operator space. We formalize Universal Message Passing Mixture (UMPM), a shared operator family that expresses heterogeneous GNN layers in a common functional language. H-GRAMA enables cross-architecture GNN merging (e.g., GCN to GAT) without retraining, retaining high specialist accuracy in most cases in compatible depth settings and achieving inference speedups of 1.2x to 1.9x over ensembles.

Comment: Model Architecture and Efficiency: training-free cross-architecture GNN merging via a shared operator family (UMPM) and message alignment, avoiding retraining.

Relevance: 8 Novelty: 8

23. A Markovian View of Iterative-Feedback Loops in Image Generative Models: Neural Resonance and Model Collapse

ArXiv ID: 2602.19033

Authors: Vibhas Kumar Vats, David J. Crandall, Samuel Goree

Abstract: AI training datasets will inevitably contain AI-generated examples, leading to "feedback" in which the output of one model impacts the training of another. It is known that such iterative feedback can lead to model collapse, yet the mechanisms underlying this degeneration remain poorly understood. Here we show that a broad class of feedback processes converges to a low-dimensional invariant structure in latent space, a phenomenon we call neural resonance. By modeling iterative feedback as a Markov Chain, we show that two conditions are needed for this resonance to occur: ergodicity of the feedback process and directional contraction of the latent representation. By studying diffusion models on MNIST and ImageNet, as well as CycleGAN and an audio feedback experiment, we map how local and global manifold geometry evolve, and we introduce an eight-pattern taxonomy of collapse behaviors. Neural resonance provides a unified explanation for long-term degenerate behavior in generative models and provides practical diagnostics for identifying, characterizing, and eventually mitigating collapse.

Comment: Training Dynamics/Theory: Markov-chain view of iterative feedback in generative models, explaining collapse via neural resonance with diagnostic taxonomy.

Relevance: 8 Novelty: 8
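
The simplest illustration of an iterative feedback loop viewed as a Markov chain is a one-dimensional Gaussian that is repeatedly refitted to its own samples. This is a standard toy for model collapse and an assumption of this digest entry, not one of the paper's experiments: the chain's state is the pair (mu, sigma), and each generation depends only on the previous one.

```python
import random
import statistics

def feedback_chain(generations, n_samples, seed=0):
    """Refit a Gaussian to samples drawn from the previous generation's fit."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)   # maximum-likelihood (biased) estimate
        history.append(sigma)
    return history

sigmas = feedback_chain(generations=300, n_samples=10)
```

Each refit shrinks the spread on average, so the fitted distribution contracts toward a point over many generations. That is a one-dimensional analogue of the contraction condition mentioned in the abstract, with none of the latent-space geometry the paper studies.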

24. Online Realizable Regression and Applications for ReLU Networks

ArXiv ID: 2602.19172

Authors: Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel

Abstract: Realizable online regression can behave very differently from online classification. Even without any margin or stochastic assumptions, realizability may enforce horizon-free (finite) cumulative loss under metric-like losses, even when the analogous classification problem has an infinite mistake bound. We study realizable online regression in the adversarial model under losses that satisfy an approximate triangle inequality (approximate pseudo-metrics). Recent work of Attias et al. shows that the minimax realizable cumulative loss is characterized by the scaled Littlestone/online dimension $\mathbb{D}_{\mathrm{onl}}$, but this quantity can be difficult to analyze. Our main contribution is a generic potential method that upper bounds $\mathbb{D}_{\mathrm{onl}}$ by a concrete Dudley-type entropy integral that depends only on covering numbers of the hypothesis class under the induced sup pseudo-metric. We define an \emph{entropy potential} $\Phi(\mathcal{H})=\int_{0}^{\mathrm{diam}(\mathcal{H})} \log N(\mathcal{H},\varepsilon)\,d\varepsilon$, where $N(\mathcal{H},\varepsilon)$ is the $\varepsilon$-covering number of $\mathcal{H}$, and show that for every $c$-approximate pseudo-metric loss, $\mathbb{D}_{\mathrm{onl}}(\mathcal{H})\le O(c)\,\Phi(\mathcal{H})$. In particular, polynomial metric entropy implies $\Phi(\mathcal{H})$ [...] $d$, otherwise infinite), and for bounded-norm $k$-ReLU networks separate regression (finite loss, even $\widetilde O(k^2)$, and $O(1)$ for one ReLU) from classification (impossible already for $k=2,d=1$).

Comment: Theory/Training Dynamics: bounds for realizable online regression under approximate metric losses with applications to bounded-norm ReLU networks.

Relevance: 8 Novelty: 8
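
As a worked instance of the entropy potential defined in the abstract above, suppose (an assumption chosen for illustration, not a claim taken from the paper) that the class has parametric covering numbers $N(\mathcal{H},\varepsilon)\le (A/\varepsilon)^{p}$ for $0<\varepsilon\le D$, where $D=\mathrm{diam}(\mathcal{H})$, $A\ge D$ and $p\ge 1$ are constants. The integral is then finite and can be evaluated in closed form:

```latex
\Phi(\mathcal{H})
  = \int_{0}^{D} \log N(\mathcal{H},\varepsilon)\,d\varepsilon
  \le p \int_{0}^{D} \log\frac{A}{\varepsilon}\,d\varepsilon
  = p \Bigl[\varepsilon \log A - \varepsilon\log\varepsilon + \varepsilon\Bigr]_{0}^{D}
  = p\,D\left(\log\frac{A}{D} + 1\right) < \infty .
```

Combined with the bound $\mathbb{D}_{\mathrm{onl}}(\mathcal{H})\le O(c)\,\Phi(\mathcal{H})$ stated in the abstract, such a class would have online dimension of order at most $c\,p\,D\,(\log(A/D)+1)$ under this assumption.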

25. Implicit Bias and Convergence of Matrix Stochastic Mirror Descent

ArXiv ID: 2602.18997

Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi

Abstract: We investigate Stochastic Mirror Descent (SMD) with matrix parameters and vector-valued predictions, a framework relevant to multi-class classification and matrix completion problems. Focusing on the overparameterized regime, where the total number of parameters exceeds the number of training samples, we prove that SMD with matrix mirror functions $\psi(\cdot)$ converges exponentially to a global interpolator. Furthermore, we generalize classical implicit bias results of vector SMD by demonstrating that the matrix SMD algorithm converges to the unique solution minimizing the Bregman divergence induced by $\psi(\cdot)$ from initialization subject to interpolating the data. These findings reveal how matrix mirror maps dictate inductive bias in high-dimensional, multi-output problems.

Comment: Training Dynamics/Implicit Bias: proves convergence and implicit bias for matrix-valued stochastic mirror descent, extending classic results to multi-output settings.

Relevance: 8 Novelty: 8
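
The update rule behind stochastic mirror descent with a matrix parameter can be sketched as below. The code is a minimal illustration under assumptions, not the paper's setup: it uses the entrywise potential $\psi(W)=\frac{1}{q}\sum_{ij}|W_{ij}|^q$, a squared loss, and hypothetical function names. For $q=2$ the mirror map is the identity and the method reduces to SGD, whose limit from zero initialization is the minimum-Frobenius-norm interpolator; the demo checks only that special case.

```python
import numpy as np

def mirror_grad(W, q):
    """Gradient of psi(W) = (1/q) * sum |W_ij|**q."""
    return np.sign(W) * np.abs(W) ** (q - 1)

def mirror_grad_inv(Z, q):
    """Inverse of mirror_grad, mapping dual variables back to parameters."""
    return np.sign(Z) * np.abs(Z) ** (1.0 / (q - 1))

def matrix_smd(X, Y, q=2.0, lr=1.0, steps=3000, seed=0):
    """X: (n, d) inputs, Y: (n, k) vector-valued targets. Returns W of shape (k, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = Y.shape[1]
    W = np.zeros((k, d))
    Z = mirror_grad(W, q)                        # dual variable
    for _ in range(steps):
        i = rng.integers(n)                      # sample one training example
        residual = W @ X[i] - Y[i]               # (k,) prediction error
        Z = Z - lr * np.outer(residual, X[i])    # gradient step in the dual space
        W = mirror_grad_inv(Z, q)                # map back to the primal space
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 20))                     # overparameterized: 60 parameters, 5 samples
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm rows so lr=1.0 is a safe step
Y = rng.normal(size=(5, 3))
W = matrix_smd(X, Y)
```

Other values of `q` change which interpolator the iterates are biased toward, but they generally need a smaller, tuned step size, which this sketch does not attempt.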

26. I Dropped a Neural Net

ArXiv ID: 2602.19845

Authors: Hyunwoo Park

Abstract: A recent Dwarkesh Patel podcast with John Collison and Elon Musk featured an interesting puzzle from Jane Street: they trained a neural net, shuffled all 96 layers, and asked to put them back in order. Given unlabelled layers of a Residual Network and its training dataset, we recover the exact ordering of the layers. The problem decomposes into pairing each block's input and output projections ($48!$ possibilities) and ordering the reassembled blocks ($48!$ possibilities), for a combined search space of $(48!)^2 \approx 10^{122}$, which is more than the number of atoms in the observable universe. We show that stability conditions during training like dynamic isometry leave the product $W_{\text{out}} W_{\text{in}}$ for correctly paired layers with a negative diagonal structure, allowing us to use diagonal dominance ratio as a signal for pairing. For ordering, we seed-initialize with a rough proxy such as delta-norm or $\|W_{\text{out}}\|_F$ then hill-climb to zero mean squared error.

Comment: Matches Representation/Training Dynamics: reconstructs exact layer order of a shuffled ResNet via dynamic-isometry-driven signals, offering structural insights.

Relevance: 8 Novelty: 8
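
The pairing signal mentioned in the abstract above can be illustrated on synthetic data. The sketch below is built on assumptions: the abstract does not define its diagonal dominance ratio, so the score here (negative mean diagonal over mean absolute off-diagonal) is a hypothetical stand-in, and the "trained" blocks are simulated by setting each output projection close to the negative transpose of its input projection.

```python
import numpy as np

def neg_diag_score(w_out, w_in):
    """Hypothetical score: large when w_out @ w_in resembles a negative diagonal matrix."""
    prod = w_out @ w_in                          # (d, d)
    diag = np.diag(prod)
    off = prod - np.diag(diag)
    d = prod.shape[0]
    mean_abs_off = np.abs(off).sum() / (d * (d - 1))
    return -diag.mean() / (mean_abs_off + 1e-12)

def pair_blocks(w_ins, w_outs):
    """For each input projection, return the index of the best-scoring output projection."""
    scores = np.array([[neg_diag_score(o, i) for o in w_outs] for i in w_ins])
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
d, h, n_blocks = 8, 64, 6
w_ins = [rng.normal(size=(h, d)) for _ in range(n_blocks)]
# Simulated structure (assumption): each true W_out is roughly -W_in^T / h plus noise.
w_outs = [-w.T / h + 0.01 * rng.normal(size=(d, h)) for w in w_ins]
perm = rng.permutation(n_blocks)
shuffled_outs = [w_outs[j] for j in perm]        # hide the original pairing
recovered = pair_blocks(w_ins, shuffled_outs)
```

A row-wise argmax is enough here because the synthetic signal is strong; with real weights, a one-to-one assignment solver over the score matrix would be the safer choice.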

27. A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning

ArXiv ID: 2602.20062

Authors: Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, Clementine Domine

Abstract: Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.

Comment: Representation learning/training dynamics: analytical theory linking pretraining initialization to feature reuse/refinement in fine-tuning.

Relevance: 8 Novelty: 8

28. Behavior Learning (BL): Learning Hierarchical Optimization Structures from Data

ArXiv ID: 2602.20152

Authors: Zhenyao Ma, Yue Liang, Dongxu Li

Abstract: Inspired by behavioral science, we propose Behavior Learning (BL), a novel general-purpose machine learning framework that learns interpretable and identifiable optimization structures from data, ranging from single optimization problems to hierarchical compositions. It unifies predictive performance, intrinsic interpretability, and identifiability, with broad applicability to scientific domains involving optimization. BL parameterizes a compositional utility function built from intrinsically interpretable modular blocks, which induces a data distribution for prediction and generation. Each block represents and can be written in symbolic form as a utility maximization problem (UMP), a foundational paradigm in behavioral science and a universal framework of optimization. BL supports architectures ranging from a single UMP to hierarchical compositions, the latter modeling hierarchical optimization structures. Its smooth and monotone variant (IBL) guarantees identifiability. Theoretically, we establish the universal approximation property of BL, and analyze the M-estimation properties of IBL. Empirically, BL demonstrates strong predictive performance, intrinsic interpretability and scalability to high-dimensional data. Code: https://github.com/MoonYLiang/Behavior-Learning ; install via pip install blnetwork.

Comment: Model Architecture: proposes interpretable, identifiable networks as hierarchical compositions of utility-maximization blocks with theory.

Relevance: 8 Novelty: 8

29. A Computationally Efficient Multidimensional Vision Transformer

ArXiv ID: 2602.19982

Authors: Alaa El Ichi, Khalide Jbilou

Abstract: Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs. In this paper, we introduce a novel tensor-based framework for Vision Transformers built upon the Tensor Cosine Product (Cproduct). By exploiting multilinear structures inherent in image data and the orthogonality of cosine transforms, the proposed approach enables efficient attention mechanisms and structured feature representations. We develop the theoretical foundations of the tensor cosine product, analyze its algebraic properties, and integrate it into a new Cproduct-based Vision Transformer architecture (TCP-ViT). Numerical experiments on standard classification and segmentation benchmarks demonstrate that the proposed method achieves a uniform 1/C parameter reduction (where C is the number of channels) while maintaining competitive accuracy.

Comment: Model Architecture and Efficiency: introduces a tensor cosine product (Cproduct) ViT with multilinear structure and 1/C parameter reduction enabling efficient attention.