Personalized Daily ArXiv Papers 2026-05-11

Model	Metric	Usage			Papers
Model	Metric	Prompt	Completion	Total	Total arXiv	Scanned	Relevant
`gpt-5.4`	Tokens	401050	39246	440296	1047	672	63
`gpt-5.4`	Cost	$1.00	$0.59	$1.59	1047	672	63

Topic Coverage:

Topic	Papers
Architecture and Training Dynamics	28
Efficiency, Compression, and Large-Scale Training	10
Representation Learning Theory and Structure	16
Memory Structures and Agent Memory Systems	2
World Models, Exploration, and Open-Ended Reinforcement Learning	7

Table of contents by topic:

Architecture and Training Dynamics (28)

PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation Authors: Haozhou Zhang
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling Authors: Yuxuan Lou, Yang You
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models Authors: Benjamin L. Badger, Ethan Roland
A Rod Flow Model for Adam at the Edge of Stability Authors: Eric Regis, Sinho Chewi
Rethinking State Tracking in Recurrent Models Through Error Control Dynamics Authors: Jiwan Chung, Heechan Choi, Seon Joo Kim
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization Authors: Jin Xu, Camille Couturier, Victor Rühle, Saravan Rajmohan, James Hensman
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition Authors: Sayantan Choudhury, Xiaoran Cheng, Martin Takáč, Sen Na, Mladen Kolar
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control Authors: Ali Taghibakhshi, Ruisi Cai, Saurav Muralidharan, Sharath Turuvekere Sreenivas, Aditya Vavre, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Sheldon Liang, Marcin Chochowski, Zijia Chen, Akhiad Bercovich, Ran Zilberstein, Ran El-Yaniv, Yonatan Geifman, Daniel Korzekwa, Yoshi Suhara, Oluwatobi Olabiyi, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Self-Programmed Execution for Language-Model Agents Authors: Luke J. O'Connor
Fast Byte Latent Transformer Authors: Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer
When Diffusion Model Can Ignore Dimension: An Entropy-Based Theory Authors: Ahmad Aghapour, Erhan Bayraktar
Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics Authors: Caleb Jore, Jialin Liu
Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers Authors: Noboru Isobe, Daisuke Inoue, Masaaki Imaizumi
Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization Authors: Veeti Ahvonen, Damian Heiman, Antti Kuusisto, Miguel Moreno, Matias Selin
Randomness is sometimes necessary for coordination Authors: Rohan Patil, Jai Malegaonkar, Henrik I. Christensen
A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning Authors: Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training Authors: Pingbang Hu, Xueshen Liu, Z. Morley Mao, Jiaqi W. Ma
Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity Authors: Anastasis Kratsios, Gregory Cousins, Haitz Sáez de Ocáriz Borde, Bum Jun Kim, Simone Brugiapaglia
Why DDIM Hallucinates More than DDPM: A Theoretical Analysis of Reverse Dynamics Authors: Muhammad H. Ashiq, Samanyu Arora, Abhinav N. Harish, Ishaan Kharbanda, Hung Yun Tseng, Grigorios G. Chrysos
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation Authors: Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu
The Position Curse: LLMs Struggle to Locate the Last Few Items in a List Authors: Zhanqi Zhang, Hua-Dong Xiong, Robert C. Wilson, Mikio Aoi, Marcelo G. Mattar, Li Ji-An
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning Authors: Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti
Bilevel Graph Structure Learning, Revisited: Inner-Channel Origins of the Reported Gain Authors: Minkyoung Kim, Beakcheol Jang
Normalizing Trajectory Models Authors: Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind
Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction Authors: Himanshu Udupi, Xiaocong Yang, ChengXiang Zhai
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits Authors: Wenhua Nie, Junlin Liu, Jianan Wu, Zijie Meng, Yilong Fan, Zhang Zijian, Haoran Zheng, Jyh-Shing Roger Jang
Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents Authors: Ziming Li, Jiatan Huang, Xiaoguang Guo, Guilin Wang, Chuxu Zhang
Learned Lagrangian Models of PDEs via Euler-Lagrange Residual Minimization Authors: Lyra Zhornyak, Eric Forgoston, M. Ani Hsieh

Efficiency, Compression, and Large-Scale Training (10)

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation Authors: Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li, Virginia Smith, Kevin Kuo
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference Authors: Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei
CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations Authors: Robin Karlsson, Go Suzui
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference Authors: Feiyu Yao, Zhixiong Niu, Xiaqing Li, Yongqiang Xiong, Juan Fang, Qian Wang
Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding Authors: Wenhua Nie, Zijie Meng, Kun Zou, Zheng Lin, Ziwei Li, Haoran Zheng, Jyh-Shing Roger Jang, Hao Zhang
Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA Authors: Jinqian Chen, Chang Liu, Jihua Zhu
Direction-Preserving Number Representations Authors: Bardia Zadeh, George A. Constantinides
Less Random, More Private: What is the Optimal Subsampling Scheme for DP-SGD? Authors: Andy Dong, Ayfer Özgür
KL for a KL: On-Policy Distillation with Control Variate Baseline Authors: Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo
Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning Authors: Hao Chen, Zavareh Bozorgasl

Representation Learning Theory and Structure (16)

Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation Authors: Junjie Yu, Yue Wang, Zihan Deng, Yan Zhu, Wenxiao Ma, Quanying Liu
Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse Authors: Xinyu Zhao, Nikita Karagodin, Hamed Hassani, Sinan Hersek, Paul Pu Liang, Yury Polyanskiy
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders Authors: Tue M. Cao, Hoang X. Nhat, Raed Alharbi, My T. Thai
Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization Authors: Leonel Aguilar, Jan Nagler, Christoph Hoelscher, Nino Antulov-Fantulin
Characterizing and Correcting Effective Target Shift in Online Learning Authors: Ziyan Li, Naoki Hiratani
Structured Coupling for Flow Matching Authors: Xavier Sumba, Carles Balsells-Rodas, Yingzhen Li
Susceptibilities and Patterning: A Primer on Linear Response in Bayesian Learning Authors: Chris Elliott, Daniel Murfet
Distributional simplicity bias and effective convexity in Energy Based Models Authors: Aurélien Decelle, Alfonso de Jesús Navas Gómez, Beatriz Seoane
When Symbol Names Should Not Matter: A Logistic Theory of Fresh-Symbol Classification Authors: Wenjie Guan, Jelena Bradic
Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions Authors: Nicole Ma, Nick Rui
Tool Calling is Linearly Readable and Steerable in Language Models Authors: Zekun Wu (University College London), Ze Wang (University College London), Seonglae Cho (Holistic AI), Yufei Yang (Imperial College London), Adriano Koshiyama (University College London), Sahan Bulathwela (University College London), Maria Perez-Ortiz (University College London)
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning Authors: Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar
Learning Large-Scale Modular Addition with an Auxiliary Modulus Authors: Hanato Kikuchi, Ryosuke Masuya, Kazuhiko Kawamoto, Hiroshi Kera
When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment Authors: Long Zhang, Wei-neng Chen, Feng-feng Wei, Zi-bo Qin
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction Authors: Jonathn Chang, Arya Datla, Ziv Goldfeld
Interpreting Reinforcement Learning Agents with Susceptibilities Authors: Chris Elliott, Einar Urdshals, David Quarel, Daniel Murfet

Memory Structures and Agent Memory Systems (2)

CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment Authors: Siyuan Guo, Yali Du, Hechang Chen, Yi Chang, Jun Wang
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory Authors: Yang Zhao, Chengxiao Dai, Mengying Kou, Yue Xiu

World Models, Exploration, and Open-Ended Reinforcement Learning (7)

AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites Authors: Qinshi Zhang (University of California, San Diego), Weipeng Deng (University of Hong Kong), Zhihan Jiang (Columbia University), Jiaming Qu (Amazon), Qianren Li (City University of Hong Kong), Weitao Xu (City University of Hong Kong), Ray LC (City University of Hong Kong)
Predictive but Not Plannable: RC-aux for Latent World Models Authors: Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought Authors: Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang
Learning Visual Feature-Based World Models via Residual Latent Action Authors: Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias
Finite-Time Analysis of MCTS in Continuous POMDP Planning Authors: Da Kong, Vadim Indelman
On the Divergence of Differential Temporal Difference Learning without Local Clocks Authors: David Antrobius, Shangtong Zhang
Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow Authors: Juil Koo, Mingue Park, Jiwon Choi, Yunhong Min, Minhyuk Sung

Architecture and Training Dynamics (28)

1. PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

ArXiv ID: 2605.07067

Primary Topic: Architecture and Training Dynamics

Authors: Haozhou Zhang

Abstract: Muon's matrix-level update couples two distinct effects: spectral control via a polar map, and equivariance under orthogonal changes of multiplicity-space basis (Schur gauge-equivariance). We separate them with PolarAdamW, a controlled hybrid that preserves Muon's polar spectral-norm control but breaks the gauge-equivariance, since AdamW's coordinatewise preconditioner is basis-dependent. Algorithmically, PolarAdamW applies Muon's Newton-Schulz polar map to AdamW's preconditioned direction rather than to raw momentum, at per-iteration wall-time comparable to Muon. We prove that Muon's polar step is Schur gauge-equivariant on multiplicity matrices while AdamW's coordinatewise step is not. On DeiT-Tiny trained from scratch on four independently sampled 100-class subsets of ImageNet-1k, where multiplicity-basis freedom is trivial, PolarAdamW outperforms Muon by +1.93 pp in test accuracy on average and AdamW by +9.5 pp; under the 300-epoch DeiT-style recipe, it remains ahead of Muon by +1.37 pp and AdamW by +5.80 pp on average. On SO(3)-equivariant 3D point-cloud regression, where multiplicity-basis freedom is non-trivial, the ordering reverses: Muon outperforms PolarAdamW at every audited capacity, and the gap widens with capacity. Both matrix-polar optimisers continue to outperform AdamW. This double dissociation separates spectral control from Schur gauge-equivariance: the first composes well with AdamW preconditioning on standard transformers, while the second becomes consequential when multiplicity-basis freedom is structurally non-trivial.

Comment: Disentangles Muon's polar spectral control from Schur gauge-equivariance through a hybrid optimizer and shows when each property matters.

Topic Match: Best fit is architecture_training because the paper isolates optimizer mechanisms and their training-dynamics consequences in a principled way.

Relevance: 9 Novelty: 8
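
The PolarAdamW abstract above describes applying a Newton-Schulz polar map to AdamW's preconditioned direction instead of to raw momentum. The following is a minimal sketch of that idea, not the author's implementation. It assumes the classical cubic Newton-Schulz iteration (Muon implementations typically use a tuned higher-order polynomial), standard Adam bias correction, and decoupled weight decay. The function names `newton_schulz_polar` and `polar_adamw_step` and all hyperparameter defaults are hypothetical.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product for small demo matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_polar(g, steps=30):
    """Approximate the polar factor of g with the classical cubic iteration."""
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]  # singular values are now <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 X X^T X pushes every nonzero singular value toward 1.
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(x_row, c_row)]
             for x_row, c_row in zip(x, xxtx)]
    return x

def polar_adamw_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.0):
    """One hypothetical step: Adam-preconditioned direction, then polar map."""
    rows, cols = len(w), len(w[0])
    m = [[beta1 * m[i][j] + (1 - beta1) * grad[i][j] for j in range(cols)]
         for i in range(rows)]
    v = [[beta2 * v[i][j] + (1 - beta2) * grad[i][j] ** 2 for j in range(cols)]
         for i in range(rows)]
    # Bias-corrected, coordinatewise-preconditioned direction (standard Adam).
    d = [[(m[i][j] / (1 - beta1 ** t))
          / (math.sqrt(v[i][j] / (1 - beta2 ** t)) + eps)
          for j in range(cols)] for i in range(rows)]
    u = newton_schulz_polar(d)  # assumption: polar map applied to d, not to m
    # Decoupled weight decay, as in AdamW.
    w = [[w[i][j] - lr * (u[i][j] + weight_decay * w[i][j]) for j in range(cols)]
         for i in range(rows)]
    return w, m, v

w = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
grad = [[3.0, -1.0], [1.0, 2.0]]
w_new, m_new, v_new = polar_adamw_step(w, grad, zeros, zeros, t=1)
print(w_new)
```

In this toy step the Adam direction is approximately the sign pattern of the gradient, so the polar map returns a rotation and every entry of the update has magnitude close to 1/sqrt(2) before the learning rate is applied.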

2. OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

ArXiv ID: 2605.07815

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Yuxuan Lou, Yang You

Abstract: Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.

Comment: Extends Muon with layer-wise trust-ratio scaling based on the actual update direction, improving optimizer behavior and transfer across model scales.

Topic Match: Best fit is architecture_training because the core contribution is a new optimizer design with theory and empirical effects on large-model training dynamics.

Relevance: 9 Novelty: 8
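
The OrScale abstract above states one rule: the trust-ratio denominator should be the Frobenius norm of the direction that is actually applied. A rough sketch of a LAMB-style ratio built on that rule follows. It is an illustration under assumptions, not the paper's algorithm: the orthogonalized update is taken as given, "coupled weight decay" is read as adding the decay term to the direction before measuring its norm, and the one-time calibration is modeled as dividing by the ratio observed at initialization. The names `trust_ratio_step` and `calibration` are hypothetical.

```python
import math

def frobenius_norm(matrix):
    return math.sqrt(sum(v * v for row in matrix for v in row))

def trust_ratio_step(weights, ortho_update, lr=0.1, weight_decay=0.0,
                     calibration=1.0, eps=1e-12):
    """Apply one layer update scaled by ||W||_F / ||applied direction||_F."""
    # Assumption: the applied direction includes the coupled weight-decay term.
    direction = [[u + weight_decay * w for u, w in zip(u_row, w_row)]
                 for u_row, w_row in zip(ortho_update, weights)]
    ratio = frobenius_norm(weights) / (frobenius_norm(direction) + eps)
    ratio /= calibration  # assumption: per-layer constant fixed once at init
    new_weights = [[w - lr * ratio * d for w, d in zip(w_row, d_row)]
                   for w_row, d_row in zip(weights, direction)]
    return new_weights, ratio

layer = [[3.0, 0.0], [0.0, 4.0]]
update = [[1.0, 0.0], [0.0, 1.0]]  # stands in for an orthogonalized update

_, raw_ratio = trust_ratio_step(layer, update)
_, calibrated_ratio = trust_ratio_step(layer, update, calibration=raw_ratio)
print(raw_ratio, calibrated_ratio)
```

The second call shows the calibration idea in miniature: dividing by the initial raw ratio makes the layer's trust ratio start at one.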

3. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

ArXiv ID: 2605.06683

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Benjamin L. Badger, Ethan Roland

Abstract: Transformer-based large language models are in some respects limited by the quadratic time and space computational complexity of attention. We introduce the Toeplitz MLP Mixer (TMM), a transformer-like architecture that swaps attention for triangular-masked Toeplitz matrix multiplication over the sequence dimension resulting in $\mathcal{O} (dn \log n)$ time and $\mathcal O(dn)$ space complexity during training and $\mathcal O(dn)$ time and space at inference prefill. Despite the lack of sophisticated input modulation or state maintenance present in other sub-quadratic architectures, TMMs yield greater training efficiency in terms of loss achieved per compute and device memory. We demonstrate that TMMs are capable of retaining more input information resulting in improved copying ability, which we argue results from a lack of architectural biases. Consistent with higher input information retention, TMMs exhibit superior information retrieval and in-context learning benchmark accuracy compared to comparable architectures. We conclude with an analysis from the perspective of operator index theory and show that, counterintuitively, trained Toeplitz layers of causal non-invertible models are more likely to be invertible or nearly so than models that are actually invertible over their inputs.

Comment: Proposes Toeplitz MLP Mixers as sub-quadratic sequence models with strong information retention and in-context retrieval behavior.

Topic Match: Best fit is architecture_training because this is a new sequence-model architecture with specific mechanistic claims about complexity and information retention.

Relevance: 9 Novelty: 8
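
The Toeplitz MLP Mixer abstract above relies on the fact that multiplying by a lower-triangular Toeplitz matrix is a causal convolution, which an FFT can evaluate in O(n log n) time per channel instead of O(n^2). The sketch below checks that equivalence on a single channel. It is a stdlib-only illustration, not the authors' layer: the function names are hypothetical, and how the paper parameterizes or shares the Toeplitz weights across channels is not assumed here.

```python
import cmath

def toeplitz_causal_direct(kernel, x):
    """y[t] = sum_{s <= t} kernel[t - s] * x[s], the masked Toeplitz product."""
    return [sum(kernel[t - s] * x[s] for s in range(t + 1))
            for t in range(len(x))]

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def toeplitz_causal_fft(kernel, x):
    """Same product via zero-padded FFT convolution."""
    n = len(x)
    size = 1
    while size < 2 * n:  # pad so circular convolution equals linear convolution
        size *= 2
    fk = fft([complex(v) for v in kernel] + [0j] * (size - n))
    fx = fft([complex(v) for v in x] + [0j] * (size - n))
    y = fft([a * b for a, b in zip(fk, fx)], invert=True)
    return [(v / size).real for v in y[:n]]  # keep only the causal outputs

kernel = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0, 1.0]
y_direct = toeplitz_causal_direct(kernel, x)
y_fft = toeplitz_causal_fft(kernel, x)
print(y_direct, y_fft)
```

Both functions assume `kernel` and `x` have the same length; a production layer would use a library FFT rather than this recursive one.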

4. A Rod Flow Model for Adam at the Edge of Stability

ArXiv ID: 2605.06821

Primary Topic: Architecture and Training Dynamics

Authors: Eric Regis, Sinho Chewi

Abstract: Cohen et al. (arXiv:2207.14484) observed that adaptive gradient methods such as Adam operate at the edge of stability. While there has been significant work on continuous-time modeling of gradient descent at the edge of stability, extending these models to momentum methods remains underdeveloped. In the gradient descent setting, Regis et al. (arXiv:2602.01480) introduced rod flow, which models consecutive iterates as an extended one-dimensional object -- a "rod." Here we extend rod flow to Adam by working in the joint phase space of parameters and first moment $(w, m)$ and treating the second moment $\nu$ as a smooth auxiliary variable. We also develop rod flows for heavy ball momentum, Nesterov momentum, and scalar and per-component versions of RMSProp, Adam, and NAdam. For all eight optimizers, we empirically evaluate rod flow on representative machine learning architectures, where it tracks the discrete iterates through the edge-of-stability regime significantly more accurately than the corresponding stable flow.

Comment: Extends rod-flow continuous-time modeling to Adam and other momentum optimizers at the edge of stability.

Topic Match: It directly targets optimizer dynamics and training stability, a central fit to architecture and training dynamics.

Relevance: 9 Novelty: 8

5. Rethinking State Tracking in Recurrent Models Through Error Control Dynamics

ArXiv ID: 2605.07755

Primary Topic: Architecture and Training Dynamics

Also Matches: Memory Structures and Agent Memory Systems

Authors: Jiwan Chung, Heechan Choi, Seon Joo Kim

Abstract: The theory of state tracking in recurrent architectures has predominantly focused on expressive capacity: whether a fixed architecture can theoretically realize a set of symbolic transition rules. We argue that equally important is error control, the dynamics governing hidden-state drift along the directions that distinguish symbolic states. We prove that affine recurrent networks, a class of models encompassing State-Space Models and Linear Attention, cannot correct errors along state-separating subspaces once they preserve state representations. Consequently, practical affine trackers do not learn robust state tracking; rather, they learn finite horizon solutions governed by accumulated state-relevant error. We characterize the mechanics of this failure, showing that tracking remains readable only while the accumulating within-class spread remains small relative to the initial between-class separation. We demonstrate empirically on group state-tracking tasks that this breakdown is predictable: tracking collapses when the distinguishability ratio crosses the readability threshold of the trained decoder. Across trained models, the point of this crossing predicts the horizon at which downstream accuracy fails. These results establish that robust state tracking is determined not only by an architecture's theoretical expressivity but crucially by its error control.

Comment: Shows affine recurrent architectures cannot correct state-separating errors once they preserve symbolic states, reframing recurrent tracking via error control.

Topic Match: The heart of the paper is mechanistic analysis of recurrent/state-space sequence modeling and its stability limits.

Relevance: 9 Novelty: 8
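
The error-control argument in the abstract above can be illustrated with a toy parity tracker. This is a hedged sketch rather than the paper's construction: the scalar state, the uniform noise model, and the sign-snapping correction are all assumptions chosen for illustration. The affine update multiplies the hidden state by +1 or -1, so it preserves the symbolic state exactly but also carries every past perturbation forward, while a nonlinear snap removes the perturbation at each step.

```python
import random

def worst_tracking_error(bits, noise, snap):
    """Track the parity of `bits` with a +/-1 hidden state perturbed by `noise`."""
    state, hidden = 1.0, 1.0
    worst_error = 0.0
    for bit, eps in zip(bits, noise):
        a = -1.0 if bit else 1.0   # input-dependent linear transition, |a| = 1
        state = a * state          # noiseless symbolic state
        hidden = a * hidden + eps  # affine update carries the old error forward
        if snap:
            # Nonlinear correction: project back onto the two symbolic states.
            hidden = 1.0 if hidden >= 0 else -1.0
        worst_error = max(worst_error, abs(hidden - state))
    return worst_error

rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(500)]
noise = [rng.uniform(-0.05, 0.05) for _ in range(500)]

affine_error = worst_tracking_error(bits, noise, snap=False)
snapped_error = worst_tracking_error(bits, noise, snap=True)
print(affine_error, snapped_error)
```

Because each noise term is smaller than the gap between the two states, the snapped tracker recovers the exact state at every step, whereas the affine tracker's deviation is a signed running sum of all past noise.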

6. Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

ArXiv ID: 2605.07588

Primary Topic: Architecture and Training Dynamics

Authors: Jin Xu, Camille Couturier, Victor Rühle, Saravan Rajmohan, James Hensman

Abstract: Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in language-modeling experiments at the moderate hundred-million-parameter scale. Despite their constrained parameterizations, these layers train stably and can match corresponding Transformer baselines. Overall, our results suggest that CEM provides a useful lens for understanding Transformer layer parameterization, connecting Transformer architectures to energy-based models and motivating further exploration of energy-guided layer designs.

Comment: Reinterprets transformer layers as causal energy-minimization steps, motivating tied attention, shared gated MLPs, and recursive updates.

Topic Match: This is directly about transformer architectural parameterization and training-stable alternatives derived from a mechanistic energy-based perspective.

Relevance: 9 Novelty: 8

7. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

ArXiv ID: 2605.06884

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Sayantan Choudhury, Xiaoran Cheng, Martin Takáč, Sen Na, Mladen Kolar

Abstract: Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework, we establish an optimal iteration and sample complexity of $O \left(\varepsilon^{\frac{-(3\alpha-2)}{(\alpha-1)}} \right)$ for finding an $\varepsilon$-stationary point, where $\alpha\in(1,2]$ denotes the heavy-tail index. For the inexact-polar setting with $\sigma_1=0$, we also provide guarantees that do not require prior knowledge of $\alpha$. We analyze a randomized low-rank polar decomposition that is substantially more efficient than full-space methods while remaining compatible with our theory. Numerical experiments further demonstrate the effectiveness of the proposed inexact and randomized variants.

Comment: Provides convergence theory for Muon with Nesterov momentum under heavy-tailed noise and inexact or randomized polar decomposition.

Topic Match: The paper primarily advances optimizer theory for neural network training dynamics, with efficiency-relevant inexact decomposition as a secondary angle.

Relevance: 9 Novelty: 8
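
As a quick reading aid for the complexity bound quoted in the abstract above, the exponent can be split into a constant part and a part that depends on the heavy-tail index. This is plain algebra on the stated formula, with no additional assumptions about the paper's analysis:

```latex
\frac{3\alpha - 2}{\alpha - 1}
  = \frac{3(\alpha - 1) + 1}{\alpha - 1}
  = 3 + \frac{1}{\alpha - 1},
\qquad \alpha \in (1, 2].

% Two sample values of the heavy-tail index:
\alpha = 2:\quad O\!\left(\varepsilon^{-4}\right),
\qquad
\alpha = \tfrac{3}{2}:\quad O\!\left(\varepsilon^{-5}\right).
```

At $\alpha = 2$ the bound matches the $\varepsilon^{-4}$ rate commonly quoted for stochastic first-order methods under bounded variance, and the exponent grows without bound as $\alpha \to 1^{+}$.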

8. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

ArXiv ID: 2605.07182

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Ali Taghibakhshi, Ruisi Cai, Saurav Muralidharan, Sharath Turuvekere Sreenivas, Aditya Vavre, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Sheldon Liang, Marcin Chochowski, Zijia Chen, Akhiad Bercovich, Ran Zilberstein, Ran El-Yaniv, Yonatan Geifman, Daniel Korzekwa, Yoshi Suhara, Oluwatobi Olabiyi, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

Abstract: Training a family of large language models (LLMs), either from scratch or via iterative compression, is prohibitively expensive and inefficient, requiring separate training runs for each model in the family. In this paper, we introduce Star Elastic, a novel LLM post-training method that adds N nested submodels to a given parent reasoning model using the compute of one run (N-fold savings) via a single post-training job. Beyond reducing training costs, Star Elastic also addresses a fundamental limitation of efficient reasoning: the rigidity of static architectures, which forces the allocation of constant resources regardless of token difficulty. By unlocking elastic budget control, Star Elastic enables a novel inference scheme that uses different submodels for each reasoning phase (thinking and answering). Star Elastic supports (1) nesting along the SSM, embedding channel, MoE, and FFN axes, (2) learning nested submodels via an end-to-end trainable router, and (3) curriculum-based knowledge distillation. Building on the Nemotron Elastic framework, we apply Star Elastic to the NVIDIA Nemotron Nano models, with a particular focus on hybrid Mixture-of-Experts (MoE) architectures: from Nemotron Nano v3 (30B/3.6A), we generate 23B (2.8A) and 12B (2.0A) variants with 160B training tokens. All nested models match or outperform independently trained baselines of comparable size and achieve a 360x reduction versus pretraining from scratch and a 7x reduction over state-of-the-art compression. Crucially, elastic budget control advances the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency via dynamic per-phase model selection. We further extend Star Elastic to quantized regimes via Quantization-Aware Distillation (QAD), producing nested NVFP4 and FP8 elastic checkpoints that preserve zero-shot slicing while delivering smaller deployment footprints.

Comment: Introduces nested submodels and trainable routing inside a single parent reasoning model, enabling elastic per-phase compute selection across SSM, MoE, FFN, and embedding axes.

Topic Match: The main contribution is a dynamic modular architecture with learned routing and shared training across nested submodels, with efficiency as a strong secondary benefit.

Relevance: 9 Novelty: 8

9. Self-Programmed Execution for Language-Model Agents

ArXiv ID: 2605.06898

Primary Topic: Architecture and Training Dynamics

Also Matches: Memory Structures and Agent Memory Systems

Authors: Luke J. O'Connor

Abstract: At the heart of existing language model agents is a fixed orchestrator program responsible for the state transition between consecutive turns. This paper introduces self-programmed execution (SPE), an agent architecture in which the model completion is itself the orchestrator program, and the harness evaluates this program but does not impose its own orchestration policy. I formalize this idea using agentic machines: an SPE state is one from which a model completion can load any state of an embedded copy of the machine, meaning that it is subject to no fixed turn-to-turn orchestration policy. Realizing SPE in practice is nontrivial because the same data is both model context and executable program. I therefore introduce Spell, a Lisp-based language in which programs can edit and re-evaluate themselves, and effectful expressions like model invocations are structured such that re-evaluating an edited program does not replay its side effects. Experiments with existing models, not trained for SPE or Spell, show that frontier models can operate in this regime and accomplish challenging agentic tasks. These results demonstrate how an LM can act as an agent without any fixed orchestration policy, and they raise the question of what self-orchestration strategies might be learned by a model trained for self-programmed execution. Code is available at https://github.com/lukejoconnor/spell .

Comment: Proposes self-programmed execution where the model completion itself becomes the orchestrator, removing a fixed external turn-to-turn policy.

Topic Match: The paper primarily proposes a new agent architecture in which orchestration is internalized into model-generated executable state transitions.

Relevance: 8 Novelty: 9

10. Fast Byte Latent Transformer

ArXiv ID: 2605.08044

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer

Abstract: Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.

Comment: Introduces diffusion and speculative-style decoding variants for byte latent transformers to generate multiple bytes per step at much lower bandwidth cost.

Topic Match: Best fit is architecture_training because it modifies the core byte-level model and decoding mechanism rather than merely optimizing deployment around an unchanged model.

Relevance: 8 Novelty: 8
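
The draft-then-verify pattern that the Fast BLT abstract above borrows from speculative decoding can be sketched generically. The snippet shows greedy verification only and is not the authors' BLT-S or BLT-DV procedure, whose acceptance rules are not given in the abstract. The names `verify_draft` and `toy_target_next` are hypothetical, and the per-position calls to the target stand in for what would be a single batched forward pass in practice.

```python
def verify_draft(prefix, draft, target_next):
    """Keep the longest draft prefix that the target model agrees with."""
    accepted = []
    for byte in draft:
        expected = target_next(prefix + accepted)
        if byte != expected:
            # First disagreement: keep the target's byte and drop the rest.
            accepted.append(expected)
            break
        accepted.append(byte)
    return accepted

def toy_target_next(context):
    """Stand-in for the full model: predicts the previous byte plus one."""
    return (context[-1] + 1) % 256

prefix = [1, 2, 3]
partly_wrong = verify_draft(prefix, [4, 5, 9, 10], toy_target_next)
all_right = verify_draft(prefix, [4, 5], toy_target_next)
print(partly_wrong, all_right)
```

Each verification round yields at least one byte, which is why the scheme can reduce the number of full-model passes per generated sequence whenever the drafter is often correct.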

11. When Diffusion Model Can Ignore Dimension: An Entropy-Based Theory

ArXiv ID: 2605.07969

Primary Topic: Architecture and Training Dynamics

Authors: Ahmad Aghapour, Erhan Bayraktar

Abstract: Diffusion models perform remarkably well on high-dimensional data such as images, often using only a modest number of reverse-time steps. Despite this practical success, existing convergence theory does not fully explain why such samplers remain efficient in high dimensions. Many prior KL guarantees bound the discretization error in terms of the ambient dimension, while other improved results replace this dependence using intrinsic-dimensional or geometric structure assumptions. In this work, we develop an alternative information-theoretic perspective on diffusion sampler convergence. We prove that, for Gaussian mixture targets, the discretization error is controlled by the Shannon entropy of the latent mixture component rather than by the ambient dimension. Consequently, the leading step complexity scales linearly with latent entropy and depends only logarithmically on the second moment of the data. Our analysis also extends to discrete target distributions, where the relevant complexity is the entropy of the target rather than the dimension of the embedding space. These results suggest that diffusion sampling can remain efficient in high-dimensional spaces when the data distribution admits a compact latent representation, as is widely believed to be the case for natural images.

Comment: Shows diffusion discretization complexity can depend on latent entropy rather than ambient dimension for Gaussian mixtures and discrete targets.

Topic Match: Best fit is architecture_training because it gives foundational theory for why diffusion generative dynamics can scale favorably despite high ambient dimension.

Relevance: 8 Novelty: 8
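
Illustrative sketch: the quantity the abstract says controls the discretization error, the Shannon entropy of the latent mixture component, depends only on the mixture weights. The helper name `mixture_entropy` and the example weights are assumptions made for illustration, not taken from the paper.

```python
import math
from typing import Sequence


def mixture_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the latent component of a mixture."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


# Illustrative weights only: eight equally likely components.
uniform_8 = [1.0] * 8
print(mixture_entropy(uniform_8))  # log(8), about 2.079
```

The value is the same whether the components live in 10 or 10^6 ambient dimensions, which is the contrast with dimension-dependent bounds that the abstract draws.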

12. Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics

ArXiv ID: 2605.07277

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Caleb Jore, Jialin Liu

Abstract: Many scientific and combinatorial problems admit multiple correct solutions, not a single label. Standard supervised learning resolves this ambiguity by choosing one solution as the target, but this hidden selector can be arbitrary, discontinuous, and harder to learn than the underlying solution set. We study bifurcation models, a weight-tied dynamical view in which different initializations can converge to different stable equilibria, so the model represents an attractor landscape rather than one chosen branch. We prove that broad set-valued maps with locally Lipschitz branches can be represented by regular equilibrium dynamics and that the induced selectors are almost everywhere regular, while manual selectors can be arbitrarily irregular. Experiments on frustrated Ising models show that such dynamics can discover multiple valid equilibria without branch labels and outperform single-branch supervision. Allen--Cahn experiments further show that diversity is not automatic: it can be encouraged explicitly, but with an accuracy--diversity tradeoff.

Comment: Weight-tied equilibrium dynamics represent set-valued solution maps by storing multiple valid outputs as distinct attractors rather than forcing a single supervised branch.

Topic Match: The core contribution is a dynamical architectural mechanism with theoretical guarantees on multi-solution representation, making architecture/training dynamics the best fit.

Relevance: 8 Novelty: 8
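
Illustrative sketch: a one-dimensional toy shows how a single weight-tied update can hold several stable equilibria, with the initialization choosing the branch. The map `x <- tanh(gain * x)` and the function name are assumptions chosen for simplicity; the paper's models are not reproduced here.

```python
import math


def iterate_to_equilibrium(x0: float, gain: float = 2.0, steps: int = 200) -> float:
    """Apply the same (weight-tied) update x <- tanh(gain * x) repeatedly."""
    x = x0
    for _ in range(steps):
        x = math.tanh(gain * x)
    return x


# Two initializations on opposite sides of the unstable point x = 0
# settle on two different stable equilibria of the same map.
positive_branch = iterate_to_equilibrium(0.1)
negative_branch = iterate_to_equilibrium(-0.1)
print(positive_branch, negative_branch)
```

Both outputs are valid fixed points of the same dynamics, so no branch label is needed to represent either solution.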

13. Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers

ArXiv ID: 2605.07772

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Noboru Isobe, Daisuke Inoue, Masaaki Imaizumi

Abstract: Transformers perform inference by iteratively transforming token representations across layers. This layerwise computation has been studied empirically, and recent mean-field theories of Transformer dynamics explain how attention can drive token distributions toward clustering. However, existing mean-field analyses largely treat model parameters as prescribed, leaving open how training reshapes this clustering picture. We study this question in a noisy mean-field Transformer in which only a parameter-linear FFN is trained under $L^2$ regularization. We find and analyze a training-induced phase in the dynamics: after initially following attention-driven clustering, the token distribution can leave the clustered regime near the final layers. Our mathematical analysis is based on an entropy-regularized interaction energy that captures the clustering bias of attention. More broadly, our results point toward a training-aware mean-field theory of Transformer dynamics, in which training and inference dynamics are treated together.

Comment: Analyzes how training can cause token representations to escape attention-driven clustering in a mean-field Transformer model.

Topic Match: The paper directly studies transformer layer dynamics under training, making architecture/training dynamics the best fit.

Relevance: 8 Novelty: 8

14. Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization

ArXiv ID: 2605.07705

Primary Topic: Architecture and Training Dynamics

Authors: Veeti Ahvonen, Damian Heiman, Antti Kuusisto, Miguel Moreno, Matias Selin

Abstract: We give a novel logical characterization of encoder-decoder transformers, the foundational architecture for LLMs that also sees use in various settings that benefit from cross-attention. We study such transformers over text in the practical setting of floating-point numbers and soft-attention, characterizing them with a new temporal logic. This logic extends propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. We also give an additional characterization of such transformers via a type of distributed automata, and show that our results are not limited to the specific choices in the architecture and can account for changes in, e.g., masking. Finally, we discuss encoder-decoder transformers in the autoregressive setting.

Comment: Gives a logical characterization of encoder-decoder transformers with cross-attention, clarifying their formal expressive structure.

Topic Match: The work is foundational architecture theory focused on cross-attention and encoder-decoder transformer computation.

Relevance: 8 Novelty: 8

15. Randomness is sometimes necessary for coordination

ArXiv ID: 2605.06825

Primary Topic: Architecture and Training Dynamics

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Rohan Patil, Jai Malegaonkar, Henrik I. Christensen

Abstract: Full parameter sharing is standard in cooperative multi-agent reinforcement learning (MARL) for homogeneous agents. Under permutation-symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower-ranked peers from agent-to-agent attention while leaving task attention fully unmasked. This realizes a random-bit coordination protocol in a single broadcast round, and the set-based attention enables zero-shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves $1.0$ success while all deterministic baselines plateau near $0.5$. On control coordination tasks, a policy trained on $N=4$ generalizes zero-shot to $N \in [2,8]$. On SMACLite cross-scenario transfer, we achieve zero-shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout-based randomness results in a 0% win rate, confirming that protocol-space structure, not stochastic noise, is the operative ingredient. https://anonymous.4open.science/r/randomness-137A/

Comment: Shows symmetric MARL sometimes fundamentally requires randomness, then implements structured symmetry breaking through Diamond Attention.

Topic Match: The main contribution is an architectural mechanism for coordination—attention with structured random masking—to overcome a fundamental limitation of deterministic shared policies.

Relevance: 8 Novelty: 8
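
Illustrative sketch: the rank-ordering mask described in the abstract can be mocked up in a few lines. The function `rank_mask` is hypothetical, and treating "lower-ranked" as "drew a smaller random value" is an assumption; the paper's actual attention layer is not reproduced.

```python
import random
from typing import List


def rank_mask(num_agents: int, rng: random.Random) -> List[List[bool]]:
    """mask[i][j] is True when agent i may attend to peer j.

    Each agent draws one random scalar; agent i is masked from peers that
    drew a lower value (the direction of the ordering is an assumption).
    """
    draws = [rng.random() for _ in range(num_agents)]
    return [
        [i != j and draws[j] > draws[i] for j in range(num_agents)]
        for i in range(num_agents)
    ]


mask = rank_mask(4, random.Random(0))
visible_counts = sorted(sum(row) for row in mask)
print(visible_counts)  # [0, 1, 2, 3]: every agent sees a different number of peers
```

Because each agent ends up with a distinct view of its peers, a shared policy can produce different actions for otherwise identical agents.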

16. A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

ArXiv ID: 2605.06819

Primary Topic: Architecture and Training Dynamics

Authors: Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel

Abstract: Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for $M$ steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after $M$ autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire $M$-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon $M$, and to what extent observing intermediate tokens can reduce this dependence. Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [Hanneke et al., 2026], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon $M$: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on $M$ altogether. We also analyze autoregressive linear threshold classes, and prove optimal mistake bounds, as well as a new lower bound for the statistical setting. Along the way, our results resolve several questions left open by [Joshi et al., 2025].

Comment: Develops an online learning theory for autoregressive chain-of-thought and shows full trajectory feedback removes dependence on generation horizon.

Topic Match: This is a foundational theory paper about the autoregressive computation process itself and how intermediate generated states affect learnability.

Relevance: 8 Novelty: 8

17. Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

ArXiv ID: 2605.07063

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Pingbang Hu, Xueshen Liu, Z. Morley Mao, Jiaqi W. Ma

Abstract: Data selection methods address a critical challenge in LLM post-training: effectively leveraging scarce, high-fidelity target data alongside abundant but imperfectly aligned general training data. In this work, we move beyond the data-selection framing and introduce Dr. Post-Training (Data-Regularized Post-Training), a novel framework that reconceptualizes general training data as a data-induced regularizer that prevents overfitting to the scarce target objective, rather than serving as a pool for selection. Specifically, at each training step, our framework constructs a feasible set of model update directions using the general training data and projects the model update direction specified by the scarce target data onto that feasible set. Standard training and existing data selection methods arise as special cases with different choices of the data-induced regularizer, and these methods correspond to different points on a bias-variance spectrum with different regularization strength. Building on this view, we propose a family of methods offering a richer design space and more flexible bias-variance tradeoffs. For practical LLM-scale use, we introduce careful system optimizations that realize these methods with minimal overhead. Extensive experiments across SFT, RLHF, and RLVR show that our methods consistently outperform state-of-the-art data selection baselines, and system benchmarks confirm their efficiency.

Comment: Recasts post-training data mixing as projection onto update directions induced by general data, exposing a bias-variance regularization view.

Topic Match: The paper is primarily about training dynamics and optimization geometry in post-training, with systems aspects secondary.

Relevance: 8 Novelty: 8
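
Illustrative sketch: the "project the target update onto a feasible set built from general data" step can be shown for one simple choice of feasible set. The half-space {d : <d, g_general> >= 0} used here is an assumption made for illustration; the abstract does not say which feasible sets the paper uses, and `project_onto_halfspace` is a hypothetical name.

```python
from typing import List


def project_onto_halfspace(target_dir: List[float], general_dir: List[float]) -> List[float]:
    """Euclidean projection of target_dir onto {d : <d, general_dir> >= 0}."""
    dot = sum(t * g for t, g in zip(target_dir, general_dir))
    norm_sq = sum(g * g for g in general_dir)
    if dot >= 0 or norm_sq == 0:
        return list(target_dir)  # already feasible
    scale = dot / norm_sq
    # Remove only the component that conflicts with the general-data direction.
    return [t - scale * g for t, g in zip(target_dir, general_dir)]


# Hypothetical two-parameter example: the target update conflicts with the general data.
projected = project_onto_halfspace([1.0, -2.0], [0.0, 1.0])
print(projected)  # [1.0, 0.0]
```

A larger feasible set leaves the target update closer to unchanged, and a smaller one regularizes it more strongly, which is one way to read the bias-variance spectrum mentioned in the abstract.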

18. Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

ArXiv ID: 2605.07097

Primary Topic: Architecture and Training Dynamics

Authors: Anastasis Kratsios, Gregory Cousins, Haitz Sáez de Ocáriz Borde, Bum Jun Kim, Simone Brugiapaglia

Abstract: We show that, in a precise sense, a broad class of feedforward neural networks learn (have finite sample complexity) in the PAC model: every fixed finite feedforward architecture whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting, even with unbounded parameters. This covers standard fixed-size MLPs, CNNs, GNNs, and transformers with fixed sequence length, together with the operations and layers typically used in such architectures, including linear projections, residual connections, attention mechanisms, pooling layers, normalization layers, and admissible positional encodings. Hence, distribution-free learnability for modern non-recurrent architectures is not an exceptional property of particular activations or architecture-specific VC arguments, but a consequence of tame feedforward computation. Our results reposition finite-sample PAC learnability as a baseline rather than a differentiator: they shift the focus of architectural comparison toward inductive biases, symmetries and geometric priors, scalability, and optimization behaviour.

Comment: Proves broad finite-sample PAC learnability for fixed feedforward architectures definable in o-minimal structures, covering modern layers including attention and normalization.

Topic Match: This is foundational theory about the learnability properties of modern feedforward architectures, so architecture/training is the best fit.

Relevance: 8 Novelty: 8

19. Why DDIM Hallucinates More than DDPM: A Theoretical Analysis of Reverse Dynamics

ArXiv ID: 2605.06831

Primary Topic: Architecture and Training Dynamics

Authors: Muhammad H. Ashiq, Samanyu Arora, Abhinav N. Harish, Ishaan Kharbanda, Hung Yun Tseng, Grigorios G. Chrysos

Abstract: We theoretically study the hallucination phenomena in two canonical diffusion samplers: the stochastic Denoising Diffusion Probabilistic Model (DDPM) and the deterministic Denoising Diffusion Implicit Model (DDIM). We analyze the reverse ODE (DDIM) and SDE (DDPM) for a Gaussian mixture target, proving that after a critical time $\tau$, (a) DDIM can become stuck on the segment connecting the two nearest modes and (b) DDPM stochasticity helps it become unstuck from this region, thus avoiding hallucination. Our empirical validation verifies that DDPM has a significantly lower hallucination rate than DDIM when this region is entered. Building on our observations, we exhibit how using additional stochastic steps can help DDIM avoid hallucinations and offer new insights on how to design improved samplers.

Comment: Provides a theoretical explanation for why DDIM hallucinates more than DDPM by analyzing reverse ODE vs. SDE dynamics and showing stochasticity helps escape spurious between-mode regions.

Topic Match: This is fundamentally about the computational dynamics of a core generative-model sampling mechanism rather than an application or benchmark.

Relevance: 8 Novelty: 8

20. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

ArXiv ID: 2605.08029

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu

Abstract: Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.

Comment: Builds a unified multimodal generator by aligning autoregressive normalizing flows with LLM-style causal masking and KV-cache mechanics.

Topic Match: The key contribution is architectural: a unified causal-flow design that shares transformer generation structure and cache behavior across text and images.

Relevance: 8 Novelty: 8

21. The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

ArXiv ID: 2605.07127

Primary Topic: Architecture and Training Dynamics

Authors: Zhanqi Zhang, Hua-Dong Xiong, Robert C. Wilson, Mikio Aoi, Marcelo G. Mattar, Li Ji-An

Abstract: Modern large language models (LLMs) can find a needle in a haystack (locating a single relevant fact buried among hundreds of thousands of irrelevant tokens) with near-saturated accuracy, yet fail to retrieve the last few items in a short list. We call this failure the Position Curse. For instance, even in a two-line code snippet, Claude Opus 4.6 misidentifies the second-to-last line most of the time. To characterize this failure, we evaluated two complementary queries: given a position in a sequence (of letters or words), retrieve the corresponding item; and given an item, return its position. Each position is specified as a forward or backward offset from an anchor, either an endpoint of the list (its start or end) or another item in the list. Across both open-source and frontier closed-source models, backward retrieval substantially lags forward retrieval. To test whether this capability can be rescued by post-training, we constructed PosBench, a position-focused training dataset. LoRA fine-tuning improves both forward and backward retrieval and generalizes to a held-out code-understanding benchmark (PyIndex), yet absolute performance remains far from saturated. As LLM coding agents increasingly operate over large codebases where precise indexing becomes essential for code understanding and editing, position-based retrieval emerges as a key capability for future pretraining objectives and model design.

Comment: Identifies a systematic positional retrieval failure in LLMs and shows it persists even after targeted LoRA tuning.

Topic Match: This is best seen as a foundational architecture/training-dynamics issue: a specific mechanistic weakness in sequence position handling that current models fail to learn away.

Relevance: 8 Novelty: 8
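
Illustrative sketch: the two complementary queries described in the abstract (position to item, and item to position, each measured forward or backward from an endpoint) have simple ground-truth answers. The function names and the 1-based offset convention are assumptions; this is not the PosBench code.

```python
from typing import List, Sequence


def item_at_offset(items: Sequence[str], offset: int, from_end: bool) -> str:
    """Return the item `offset` steps from the start (1 = first) or the end (1 = last)."""
    index = len(items) - offset if from_end else offset - 1
    return items[index]


def offset_of_item(items: Sequence[str], item: str, from_end: bool) -> int:
    """Inverse query: the 1-based offset of `item` from the chosen endpoint."""
    index = list(items).index(item)
    return len(items) - index if from_end else index + 1


letters: List[str] = ["q", "w", "e", "r", "t"]
print(item_at_offset(letters, 2, from_end=False))   # w
print(item_at_offset(letters, 2, from_end=True))    # r
print(offset_of_item(letters, "r", from_end=True))  # 2
```

The backward variants are the ones the abstract reports as substantially harder for current models, even though they are symmetric to the forward ones.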

22. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

ArXiv ID: 2605.07817

Primary Topic: Architecture and Training Dynamics

Authors: Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti

Abstract: Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens, GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.

Comment: Introduces an internal attention-control mechanism where generated gaze tokens dynamically modulate the causal attention mask for active visual reasoning.

Topic Match: The core contribution is a new architectural mechanism for dynamic attention control inside a multimodal model, rather than a task-specific application.

Relevance: 8 Novelty: 8
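
Illustrative sketch: one plausible reading of a "continuous suppression bias" is a penalty subtracted from the attention logits of visual tokens outside the current focus. That reading, the function `attention_with_gaze`, and the numbers below are all assumptions; the paper's mechanism may differ.

```python
import math
from typing import List, Sequence


def attention_with_gaze(
    logits: Sequence[float], in_focus: Sequence[bool], bias: float
) -> List[float]:
    """Softmax over logits after subtracting `bias` from tokens outside the focus."""
    shifted = [l if keep else l - bias for l, keep in zip(logits, in_focus)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0, 0.0, 0.0, 0.0]
focus = [False, True, True, False]
print(attention_with_gaze(logits, focus, bias=0.0))  # [0.25, 0.25, 0.25, 0.25]
print(attention_with_gaze(logits, focus, bias=4.0))  # most weight on the two in-focus tokens
```

Setting the bias back to zero restores the uniform global view, mirroring the "bias lifts" step in the abstract without adding or removing any tokens.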

23. Bilevel Graph Structure Learning, Revisited: Inner-Channel Origins of the Reported Gain

ArXiv ID: 2605.07577

Primary Topic: Architecture and Training Dynamics

Authors: Minkyoung Kim, Beakcheol Jang

Abstract: Bilevel graph structure learning is widely understood to improve graph neural networks by jointly optimizing model parameters and a learned graph structure, with the resulting performance gain attributed to the rewired adjacency. We find that this attribution may be overstated: training-dynamics effects in the inner loop, rather than the rewiring itself, capture a substantial share of the gain. To establish this, we introduce frozen-$\phi$, a control that freezes the graph while retaining the inner-loop training schedule. This decomposes the bilevel gain into an inner channel of $T$-step training dynamics with implicit gradient regularization and a graph channel of the graph rewiring itself. On spatio-temporal flow forecasting the inner channel matches or exceeds the full bilevel pipeline, accounting for 78-101% of the gain; on node classification it accounts for 37-44% under a Bernoulli edge-level parameterization. We also verify that classical spectral diagnostics can dissociate from task gain. We propose frozen-$\phi$ as a standardized diagnostic for bilevel graph structure learning, with graph distillation as a method-agnostic complement. A three-precondition framework further predicts the sign of the bilevel gain on all six benchmarks.

Comment: Introduces a frozen-graph control showing that much of reported bilevel graph structure learning gain comes from inner-loop training dynamics rather than rewiring.

Topic Match: This is directly about disentangling architectural claims from optimization dynamics, matching the training-dynamics criterion closely.

Relevance: 8 Novelty: 8
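
Illustrative sketch: the share of the bilevel gain attributed to the inner channel can be computed from three scores. The formula below is an assumed reading of the decomposition in the abstract, the name `inner_channel_share` is hypothetical, and the numbers are made up rather than taken from the paper.

```python
def inner_channel_share(baseline: float, frozen_phi: float, bilevel: float) -> float:
    """Fraction of the bilevel gain that the frozen-graph control already recovers."""
    total_gain = bilevel - baseline
    if total_gain == 0:
        raise ValueError("bilevel and baseline scores are equal; the share is undefined")
    return (frozen_phi - baseline) / total_gain


# Made-up accuracies for illustration; these are not numbers from the paper.
share = inner_channel_share(baseline=0.70, frozen_phi=0.74, bilevel=0.80)
print(f"{share:.0%}")  # 40%
```

A share above 100% would mean the frozen-graph control beats the full bilevel pipeline, which matches the upper end of the 78-101% range quoted in the abstract.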

24. Normalizing Trajectory Models

ArXiv ID: 2605.08078

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind

Abstract: Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.

Comment: Replaces Gaussian reverse steps with conditional normalizing flows, retaining exact likelihood while enabling few-step generation and self-distillation.

Topic Match: This is primarily a new generative architecture and training formulation for few-step sampling with exact likelihood.

Relevance: 8 Novelty: 8

25. Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction

ArXiv ID: 2605.08022

Primary Topic: Architecture and Training Dynamics

Authors: Himanshu Udupi, Xiaocong Yang, ChengXiang Zhai

Abstract: Spiking Neural Networks (SNNs) have been proposed as biologically plausible and energy-efficient alternatives to conventional Artificial Neural Networks (ANNs). However, the training of SNNs usually relies on surrogate gradients due to the non-differentiability of the spike function, introducing approximation errors that accumulate across layers. To address this challenge, we extend the work on convexification of parallel feedforward threshold networks to parallel recurrent threshold networks, which subsume parallel SNNs as a structured special case. Building on this theoretical framework, we propose a parameter reconstruction algorithm for SNN training that demonstrates consistent and significant advantages across various tasks, both as a standalone method and in combination with surrogate-gradient training. The ablations further demonstrate the data scalability and robustness to model configurations of our training algorithm, pointing toward its potential in large-scale SNN training.

Comment: Introduces a parameter-reconstruction training method for recurrent threshold networks that avoids surrogate-gradient errors in spiking neural networks.

Topic Match: This directly targets training dynamics for a specialized neural architecture, with a new theoretical route to stable optimization beyond surrogate gradients.

Relevance: 8 Novelty: 8

26. The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

ArXiv ID: 2605.07686

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Wenhua Nie, Junlin Liu, Jianan Wu, Zijie Meng, Yilong Fan, Zhang Zijian, Haoran Zheng, Jyh-Shing Roger Jang

Abstract: Chain-of-thought reasoning is often treated as a monotone way to improve language-model accuracy by letting a model think longer. We identify a countervailing effect, the coupling tax: when reasoning traces and final answers share one output-token budget, long traces can crowd out the answer they are meant to support. Across GSM8K, MATH-500, and five BIG-Bench Hard tasks with Qwen3 models at three scales, non-thinking mode matches or outperforms thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, while harder tasks shift the crossover to larger budgets. We derive a truncation-waste decomposition, $\mathrm{Acc}_{\mathrm{think}}(b)=\alpha_c F_L(b)+\alpha_t(1-F_L(b))$, that predicts this crossover from chain-length and accuracy statistics and explains inverse scaling within the Qwen family. A DeepSeek-R1-Distill-Llama-8B replication shows the same pattern under a different thinking interface. As a mitigation, split-budget generation decouples reasoning and answer budgets; on full MATH-500, IRIS reaches 74.0% accuracy, a strengthened extraction variant reaches 78.8%, and a fixed non-oracle SC+IRIS gate reaches 83.6%. The results show that test-time reasoning should be evaluated as a budget-allocation problem, not only as a question of whether longer traces are available.

Comment: Identifies a token-budget coupling mechanism that makes visible chain-of-thought hurt accuracy and proposes split-budget decoding.

Topic Match: This is a clear training/inference dynamics paper about a concrete computational mechanism—shared token budgets—and how it changes reasoning behavior.

Relevance: 8 Novelty: 8
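
Illustrative sketch: the truncation-waste decomposition quoted in the abstract, $\mathrm{Acc}_{\mathrm{think}}(b)=\alpha_c F_L(b)+\alpha_t(1-F_L(b))$, can be evaluated directly. Reading $F_L(b)$ as the fraction of reasoning chains that fit within budget $b$, and $\alpha_c$, $\alpha_t$ as accuracies on completed and truncated chains, is an assumption; the chain lengths and accuracies below are made up.

```python
from typing import Sequence


def thinking_accuracy(
    budget: int,
    chain_lengths: Sequence[int],
    acc_complete: float,
    acc_truncated: float,
) -> float:
    """Acc_think(b) = alpha_c * F_L(b) + alpha_t * (1 - F_L(b)).

    F_L(b) is taken to be the empirical fraction of chains that fit in the budget.
    """
    fits = sum(1 for length in chain_lengths if length <= budget) / len(chain_lengths)
    return acc_complete * fits + acc_truncated * (1.0 - fits)


# Made-up statistics for illustration; not values from the paper.
lengths = [300, 600, 900, 1500]
low_budget = thinking_accuracy(512, lengths, acc_complete=0.9, acc_truncated=0.1)
high_budget = thinking_accuracy(2048, lengths, acc_complete=0.9, acc_truncated=0.1)
print(low_budget, high_budget)  # about 0.3 and 0.9
```

If a non-thinking mode scored, say, 0.6 on the same toy task, it would win at the 512-token budget and lose at 2048, which is the kind of crossover the abstract describes.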

27. Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents

ArXiv ID: 2605.06908

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Ziming Li, Jiatan Huang, Xiaoguang Guo, Guilin Wang, Chuxu Zhang

Abstract: Adaptive test-time compute for LLM agents aims to invoke extra computation only when it improves performance. Existing methods typically use confidence-, uncertainty-, or difficulty-based gates, assuming a fixed direction from the gating signal through compute need to the value of computation. This makes gating a utility-calibration problem: gating signals should align with whether extra computation improves the final outcome over the base policy. We show that this alignment is unstable: the same signal predicts rollout benefit in one setting and rollout harm in another, with reversals across environments and backbones even when the task is fixed. Wrong-direction gates can therefore worsen performance by precisely selecting harmful states. This reversal reflects a deeper distinction between compute need and compute suitability: a high uncertainty signal may indicate decision-difficult states where rollouts help compare alternatives, or intervention-unsuitable states where the current context does not support useful rollout-based improvement. Under this two-source model, fixed-direction gates are unreliable across heterogeneous settings. To address this, we propose DIAL (Direction-Informed Adaptive Learning), a sparse gate trained from signal-agnostic counterfactual exploration to learn the utility direction of state features per (environment, backbone). Across six environments and three backbones, DIAL yields a stronger overall success-cost trade-off than fixed-direction baselines.

Comment: Shows that adaptive test-time compute gates can reverse direction across settings, and learns utility direction from counterfactual exploration for better compute allocation.

Topic Match: The central idea is a dynamic-computation gating mechanism and an analysis of when extra inference compute helps, which fits architecture/training dynamics best.

Relevance: 8 Novelty: 8

28. Learned Lagrangian Models of PDEs via Euler-Lagrange Residual Minimization

ArXiv ID: 2605.07157

Primary Topic: Architecture and Training Dynamics

Authors: Lyra Zhornyak, Eric Forgoston, M. Ani Hsieh

Abstract: We present the first method to directly use a learned continuous Lagrangian to forecast the dynamics of systems governed by partial differential equations, exploiting the inherent conservative structure to achieve stable long-range predictions. We develop an optimization-based integrator that minimizes the squared Euler--Lagrange residual via a mesh-free near-symplectic construction on local space-time patches. Different from integrators for analytical models, integrators for learned models should decouple model error (phase error) from integration error (conservation error). By relying on optimization rather than time-stepping, we bypass the global coupling inherent to fixed discretizations, which slows time- and space-stepping and complicates learning. Our method scales linearly with domain size via Jacobi iteration, and places no structural requirements on the learned network, allowing it to be coupled with existing physics-guided machine learning (ML) methods. We validate our approach on a learned representation of a double pendulum, a one-dimensional wave equation, and a two-dimensional wave equation. Our method achieves error comparable to classical symplectic methods while generalizing to spatially varying dynamics and arbitrary boundary conditions without retraining.

Comment: Learns continuous Lagrangians for PDEs and introduces Euler-Lagrange residual minimization with a near-symplectic integrator for stable long-horizon dynamics.

Topic Match: Best fits architecture/training because the core contribution is a new mechanistic modeling-and-integration scheme for stable dynamical prediction, not an application benchmark.

Relevance: 8 Novelty: 8

Efficiency, Compression, and Large-Scale Training (10)

1. Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

ArXiv ID: 2605.07111

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li, Virginia Smith, Kevin Kuo

Abstract: Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose a Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within $1.5\%$ of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to $20\%$ on Fact and $9\%$ on Med and SQL.

Comment: Dynamically routes adaptation updates between full fine-tuning and LoRA using optimizer-level gradient access to both experts.

Topic Match: This is chiefly a parameter-efficient adaptation method that changes the efficiency-flexibility tradeoff in LLM fine-tuning.

Relevance: 9 Novelty: 8

2. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

ArXiv ID: 2605.07363

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei

Abstract: DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.

Comment: Replaces expensive multi-head sparse-attention indexers with routed expert head selection, cutting long-context indexing cost while preserving retrieval quality.

Topic Match: This is chiefly an inference-efficiency paper on sparse attention and long-context cost reduction, though it also has an architectural routing angle.

Relevance: 9 Novelty: 8

3. CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations

ArXiv ID: 2605.07325

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Memory Structures and Agent Memory Systems

Authors: Robin Karlsson, Go Suzui

Abstract: Deploying massive large language models (LLMs) as continuous cognitive engines for robotics is bottlenecked by the time-to-first-token (TTFT) latency required to process extensive state histories. Existing solutions like RAG or sliding windows compromise global context or incur prohibitive re-computation costs. We formalize the optimal task structure for minimizing latency and theoretically prove that prefix stability, incremental extensibility, and asynchronous state reconciliation are necessary conditions for real-time performance. Building on these proofs, we introduce the Cached State Representation (CSR) framework as the practical instantiation of these properties, ensuring optimal KV-cache reuse. To sustain these properties over infinite horizons, we further propose an Asynchronous State Reconciliation (ASR) algorithm that offloads state memory eviction to a parallel computational resource to eliminate latency spikes. On a physical robot wirelessly connected to an on-premise GPU server, CSR achieves a 26-fold latency reduction (14.67s to 0.56s) for 120K token contexts with a 235B parameter model compared to a standard baseline. On an embodied AI benchmark, we achieve SOTA recall (0.836 vs. 0.459) while maintaining RAG-level latency. ASR is validated to sustain bounded, spike-free TTFT over 10 eviction cycles in continuous real-world operation. Together, CSR and ASR enable massive LLMs to function as continuously operating, high-frequency (> 2 Hz) embodied policies.

Comment: Formalizes prefix stability and asynchronous reconciliation for infinite-horizon KV-cache reuse in real-time embodied policies.

Topic Match: The main contribution is a new cache/state design and algorithm that materially changes long-context inference latency, fitting efficiency and scaling best.

Relevance: 9 Novelty: 8

4. An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

ArXiv ID: 2605.07719

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Feiyu Yao, Zhixiong Niu, Xiaqing Li, Yongqiang Xiong, Juan Fang, Qian Wang

Abstract: Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well: the worst average degradation is only -0.26 relative to FULL, while delivering a 1.5$\times$ to 3.7$\times$ speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.

Comment: Co-designs sparse attention, KV budgeting, and CPU-GPU scheduling for long-context inference over CPU-resident KV caches, materially improving throughput with small quality loss.

Topic Match: The paper directly targets KV-cache efficiency and long-context systems design, with a substantive algorithm-systems contribution rather than routine optimization.

Relevance: 9 Novelty: 8

5. Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding

ArXiv ID: 2605.07698

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Wenhua Nie, Zijie Meng, Kun Zou, Zheng Lin, Ziwei Li, Haoran Zheng, Jyh-Shing Roger Jang, Hao Zhang

Abstract: Grammar-constrained generation is often combined with local vocabulary masking and speculative decoding, but the resulting sampling law is not the grammar-conditional distribution users usually intend. We show that any speculative decoder with local mask access, Leviathan rejection, and rollback soundness samples from the locally projected distribution $\mu^{\mathrm{proj}}$ rather than the grammar-conditional distribution $\mu^\star$. This extends the GAD impossibility result to speculative decoding; on Dyck grammars with Qwen3-8B, the total-variation gap can reach 0.996. We identify the future-validity function $\Phi_t(y)=\Pr_p[\mathrm{valid\ completion}\mid y]$ as the missing correction statistic. The target distribution is a Doob transform of the base model with $h=\Phi$, while local masking corresponds to setting $h$ to one. With exact $\Phi$, our oracle decoder FVO-Spec samples exactly from $\mu^\star$; with approximate $\Phi$, we bound the resulting total-variation error. Because exact future validity is hard for general context-free grammars, we evaluate estimator hierarchies on tractable Dyck and finite JSON languages. OneStep reduces Dyck TV by 14% with under 1% throughput overhead, exact dynamic programming reduces it by 97%, and finite-language correction closes JSON gaps to numerical precision. All fidelity claims are scoped to enumerable grammars and token tries.

Comment: Shows local-mask speculative decoding cannot recover grammar-conditional sampling and identifies future validity as the missing statistic for exact correction.

Topic Match: The strongest fit is efficient inference because it fundamentally analyzes and corrects speculative decoding under grammar constraints.

Relevance: 8 Novelty: 9
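
Illustration for entry 5 above: a minimal sketch of the gap between local masking and grammar-conditional sampling, and of how weighting by future validity closes it. The two-token vocabulary, the i.i.d. base model `P`, and the finite language `VALID` are hypothetical toy choices, not taken from the paper; speculative drafting and rejection are not modeled, only the resulting sampling laws.

```python
from itertools import product

VOCAB = ("a", "b")
LENGTH = 2
P = {"a": 0.9, "b": 0.1}        # hypothetical i.i.d. base model
VALID = {"ab", "ba", "bb"}      # hypothetical finite "grammar"

def base_prob(s):
    """Base-model probability of the token string s."""
    prob = 1.0
    for tok in s:
        prob *= P[tok]
    return prob

def phi(prefix):
    """Future validity: base-model probability that prefix completes to a valid string."""
    rest = LENGTH - len(prefix)
    return sum(
        base_prob("".join(c))
        for c in product(VOCAB, repeat=rest)
        if prefix + "".join(c) in VALID
    )

def local_mask(prefix):
    """Local masking only asks whether some valid completion exists (h = 1 on live prefixes)."""
    return 1.0 if phi(prefix) > 0 else 0.0

def sample_law(h):
    """Law of token-by-token sampling with next-token weights p(tok) * h(prefix + tok)."""
    law = {}
    for s in VALID:
        prob = 1.0
        for t in range(LENGTH):
            prefix = s[:t]
            weights = {tok: P[tok] * h(prefix + tok) for tok in VOCAB}
            prob *= weights[s[t]] / sum(weights.values())
        law[s] = prob
    return law

mu_proj = sample_law(local_mask)   # locally projected distribution
mu_star = sample_law(phi)          # Doob transform with h = phi

# Grammar-conditional target: base probability restricted to VALID and renormalized.
z = sum(base_prob(s) for s in VALID)
target = {s: base_prob(s) / z for s in VALID}

tv_gap = 0.5 * sum(abs(mu_proj[s] - target[s]) for s in VALID)
```

In this toy, local masking puts probability 0.9 on "ab" because it never accounts for the fact that the prefix "a" rarely completes validly, while the phi-weighted sampler reproduces the renormalized target; `tv_gap` is about 0.43.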

6. Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA

ArXiv ID: 2605.06733

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Jinqian Chen, Chang Liu, Jihua Zhu

Abstract: Federated LoRA enables parameter-efficient adaptation of large language models under decentralized data and limited client resources. However, directly averaging LoRA factors is representation-dependent: the same intrinsic update admits infinitely many gauge-equivalent factorizations, so factor-level aggregation can change under arbitrary coordinate choices while the underlying update remains unchanged. This reveals a semantic mismatch in existing federated LoRA aggregation rules. We propose GLoRA, a gauge-aware server representation for federated LoRA. Instead of aggregating raw factors, GLoRA estimates a consensus update subspace from client projectors and aggregates client updates in shared reference coordinates, thereby representing semantic update aggregation entirely in low-rank form. To support heterogeneous client capacities, GLoRA further provides a rank-compatible readout that instantiates adapters of different ranks from the same server state without dense update reconstruction. Experiments on GLUE and SuperNI show that GLoRA consistently outperforms federated LoRA baselines under data, resource, and task heterogeneity, including heterogeneous client ranks, sparse participation, larger backbones, and unseen-task evaluation. GLoRA also achieves a favorable efficiency-performance trade-off, suggesting that effective federated LoRA requires not merely averaging low-rank factors, but defining a semantically meaningful server-side representation for aggregation.

Comment: Introduces a gauge-aware low-rank server representation for federated LoRA, avoiding factor-averaging ambiguities across equivalent decompositions.

Topic Match: Best fit is Efficiency, Compression, and Large-Scale Training because the contribution is a more principled low-rank adaptation and aggregation mechanism under resource-constrained distributed fine-tuning.

Relevance: 8 Novelty: 8

7. Direction-Preserving Number Representations

ArXiv ID: 2605.07662

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Bardia Zadeh, George A. Constantinides

Abstract: Low-precision number formats are widely used in modern machine learning systems due to their efficiency. Accurate direction representation is key to the accuracy of vector operations. This work precisely explores the extent to which the direction of a vector can be represented by selecting its scalar elements from a common finite alphabet of a given size. This is standard practice in machine learning, where low-precision significands may be narrow-width floating-point or integer values. A geometric framework is introduced for analyzing the directional coverage of such product-structured codes. This work analytically quantifies the suboptimality gap between such product-structured codes and spherical codes for the vector as a whole, in both low and asymptotically high dimensions. Furthermore, within the product code class, it is proven that the standard formats of two's complement, fixed-point, and floating-point are suboptimal, again with quantified gap, pointing to the potential to develop new scalar number formats. Such scalar alphabets are numerically optimized across multiple block dimensions for directional coverage, including the dimension used in NVIDIA's NVFP4 format. Experimental results are presented comparing the performance of standard formats and the optimized alphabet. We find that for four bits, NVIDIA's choice of E2M1 closely approximates the optimized alphabet, providing a geometric explanation for its strong performance in low-precision machine learning workloads and an analytical understanding of the link between that superiority and block size. We provide open-source formal proofs in Lean for the theorems in this work, along with the experimental code and the optimized alphabets obtained.

Comment: Develops a geometric theory of low-precision scalar alphabets for preserving vector directions, directly informing quantization format design.

Topic Match: The core contribution is a foundational analysis of number representation for low-precision ML, which squarely fits compression and efficiency.

Relevance: 8 Novelty: 8

8. Less Random, More Private: What is the Optimal Subsampling Scheme for DP-SGD?

ArXiv ID: 2605.07072

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Andy Dong, Ayfer Özgür

Abstract: Poisson subsampling is the default sampling scheme in differentially private machine learning, largely because its unstructured randomness yields tractable privacy amplification analyses. Yet this same randomness introduces substantial participation variance: each sample appears in very different numbers of training iterations. In this work, we show that this variance is not merely a practical artifact to be tolerated, but a fundamental source of suboptimal privacy amplification. We prove that Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in exactly a fixed number of iterations, achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum ($\sigma \to 0$ and $\sigma \to \infty$). Our analysis reveals that the privacy-noise tradeoff is governed not by maximizing randomness, but by eliminating participation variance while preserving uniform marginal participation across iterations. To translate this asymptotic theory into finite-noise guarantees, we introduce a practical near-exact Monte Carlo accountant for BIS, which removes the analytical slack of existing RDP and composition-based PLD analyses. Evaluations across more than 60 practical DP-SGD configurations show that BIS consistently outperforms Poisson subsampling in the low-noise regimes most relevant for high-utility private training, reducing the required noise multiplier by up to $9.6\%$. These results overturn the common intuition that more sampling randomness necessarily yields stronger privacy amplification: in DP-SGD, structured participation can be both more practical and more private. Our implementation is available at https://github.com/dong-xin-ao-andy/bis-mc-accountant.

Comment: Shows structured balanced subsampling can outperform Poisson subsampling for DP-SGD privacy amplification and provides a practical accountant.

Topic Match: This is a large-scale training algorithm paper: it changes the sampling scheme and accounting underlying DP-SGD behavior.

Relevance: 8 Novelty: 8
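
Illustration for entry 8 above: a minimal sketch contrasting participation counts under Poisson subsampling with a balanced scheme in which every sample appears in exactly `k` of `T` iterations. Drawing each sample's `k` iterations uniformly without replacement is an assumption made here for illustration; the abstract does not specify how BIS assigns iterations, and no privacy accounting is attempted.

```python
import random
from statistics import pvariance

def poisson_counts(n, T, q, rng):
    """Participation count per sample when each (sample, iteration) pair is included w.p. q."""
    return [sum(rng.random() < q for _ in range(T)) for _ in range(n)]

def balanced_schedule(n, T, k, rng):
    """Assign each sample to exactly k distinct iterations (hypothetical assignment rule)."""
    batches = [[] for _ in range(T)]
    for i in range(n):
        for t in rng.sample(range(T), k):
            batches[t].append(i)
    return batches

rng = random.Random(0)
n, T, k = 200, 50, 5

batches = balanced_schedule(n, T, k, rng)
balanced_counts = [sum(i in batch for batch in batches) for i in range(n)]

# Matching marginal rate q = k / T, so both schemes have the same expected participation.
poisson = poisson_counts(n, T, k / T, rng)

balanced_var = pvariance(balanced_counts)   # zero by construction
poisson_var = pvariance(poisson)            # roughly T * q * (1 - q) in expectation
```

Both schemes give each sample the same expected number of appearances; only the balanced one removes the spread around it, which is the participation variance the abstract identifies as the source of suboptimal amplification.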

9. KL for a KL: On-Policy Distillation with Control Variate Baseline

ArXiv ID: 2605.07865

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo

Abstract: On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline, canonically a value function, from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.

Comment: Stabilizes on-policy distillation by deriving a closed-form per-token reverse-KL value baseline that reduces gradient variance without extra critic training.

Topic Match: The main contribution is a training-efficiency and stability improvement for large-model post-training via a principled low-overhead variance-reduction method.

Relevance: 8 Novelty: 8
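
Illustration for entry 9 above: a minimal single-position sketch of why a negative reverse-KL baseline leaves the policy gradient unbiased. It assumes, as is common for on-policy distillation but not stated in the abstract, that the per-token reward for a sampled token is log teacher minus log student; the three-token distributions are made up, and the expectations are computed exactly rather than by sampling.

```python
import math

# Hypothetical next-token distributions at one position (toy vocabulary of size 3).
student = [0.6, 0.3, 0.1]
teacher = [0.2, 0.5, 0.3]
V = len(student)

def reward(tok):
    """Assumed single-sample reward for a token drawn from the student."""
    return math.log(teacher[tok]) - math.log(student[tok])

def score(tok):
    """Gradient of log softmax(tok) with respect to the student logits."""
    return [(1.0 if j == tok else 0.0) - student[j] for j in range(V)]

# Closed-form baseline: negative reverse KL(student || teacher).
reverse_kl = sum(s * math.log(s / t) for s, t in zip(student, teacher))
baseline = -reverse_kl

# The expected reward under the student equals the baseline.
expected_reward = sum(student[y] * reward(y) for y in range(V))
expected_advantage = sum(student[y] * (reward(y) - baseline) for y in range(V))

def expected_gradient(weight):
    """Exact expectation of weight(y) * score(y) over y ~ student."""
    return [
        sum(student[y] * weight(y) * score(y)[j] for y in range(V))
        for j in range(V)
    ]

grad_raw = expected_gradient(reward)
grad_centered = expected_gradient(lambda y: reward(y) - baseline)

# Second moment of the scalar weight multiplying the score function.
second_moment_raw = sum(student[y] * reward(y) ** 2 for y in range(V))
second_moment_centered = sum(student[y] * (reward(y) - baseline) ** 2 for y in range(V))
```

The two expected gradients coincide because the score function has zero mean under the student, so subtracting any constant changes nothing in expectation. The centered weight always has a smaller second moment; that usually, though not provably in this toy, translates into lower gradient variance.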

10. Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning

ArXiv ID: 2605.07263

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Hao Chen, Zavareh Bozorgasl

Abstract: Over-the-air federated learning (OTA-FL) reduces uplink latency by exploiting waveform superposition, but conventional analog aggregation schemes typically require instantaneous channel state information (CSI), channel inversion, and coherent phase alignment, which can be difficult to maintain in practical wireless systems. This paper proposes resource-element energy difference (REED), a noncoherent aggregation primitive for continuous signed updates that avoids instantaneous CSI. REED maps the positive and negative parts of each real-valued update to transmit energies on two orthogonal resource elements with independent phase dithers, and the server estimates the signed aggregate from their energy difference. With only slow-timescale calibration of average channel powers, REED is unbiased for the desired signed sum and admits an exact closed-form variance under Rayleigh fading. We incorporate REED into full-participation FedAvg and prove a smooth nonconvex stationarity bound. Under an average per-client energy budget, the aggregation gain can be scheduled so that the REED-induced perturbation scales quadratically with the local stepsize, yielding the canonical $1/\sqrt{T}$ stationarity rate. Experiments on MNIST and Fashion-MNIST demonstrate that REED closely matches clean FedAvg and coherent CSIT aggregation in IID settings, while maintaining stable convergence with a moderate performance degradation under strong data heterogeneity.

Comment: Introduces REED, a noncoherent over-the-air aggregation primitive with unbiased signed-sum estimation and convergence guarantees for federated learning without instantaneous CSI.

Topic Match: The main contribution is a new communication-efficient distributed training primitive that materially changes aggregation assumptions and cost in large-scale/federated optimization.

Relevance: 8 Novelty: 8
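
Illustration for entry 10 above: a minimal Monte Carlo sketch of the energy-difference idea for one scalar coordinate. The assumptions are loud ones: unit average channel power (standing in for the slow-timescale calibration), independent Rayleigh fades on the two resource elements, equal noise power on both, and made-up client updates. It only checks unbiasedness empirically and does not reproduce the paper's variance formula or its FedAvg integration.

```python
import cmath
import math
import random

def complex_gaussian(rng, var=1.0):
    """Zero-mean circular complex Gaussian sample with total variance var."""
    s = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

def reed_round(updates, noise_var, rng):
    """One noncoherent aggregation round; returns the energy-difference estimate."""
    y_pos = complex_gaussian(rng, noise_var)   # receiver noise on the "positive" element
    y_neg = complex_gaussian(rng, noise_var)   # receiver noise on the "negative" element
    for u in updates:
        for sign in (+1, -1):
            energy = max(sign * u, 0.0)        # positive or negative part of the update
            dither = cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
            received = complex_gaussian(rng) * math.sqrt(energy) * dither
            if sign > 0:
                y_pos += received
            else:
                y_neg += received
    # Equal noise energy on both elements cancels in expectation.
    return abs(y_pos) ** 2 - abs(y_neg) ** 2

rng = random.Random(0)
updates = [0.5, -0.2, 0.3, -0.1]               # hypothetical per-client scalar updates
trials = 20000
estimate = sum(reed_round(updates, 0.1, rng) for _ in range(trials)) / trials
true_sum = sum(updates)
```

Because fades and dithers are independent across clients, the cross terms vanish in expectation and each element's received energy averages to the sum of transmitted energies plus noise, so `estimate` lands close to `true_sum` without any instantaneous CSI. A single round is noisy, which is why the averaging over many trials is needed here.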

Representation Learning Theory and Structure (16)

1. Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation

ArXiv ID: 2605.07302

Primary Topic: Representation Learning Theory and Structure

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Junjie Yu, Yue Wang, Zihan Deng, Yan Zhu, Wenxiao Ma, Quanying Liu

Abstract: Finetuning pretrained models occurs in a low-dimensional subspace of the full parameter space. Prior work has focused on characterizing this optimization subspace, but largely ignored the complementary question: why do certain directions remain unexplored during finetuning? Are these stable directions irrelevant to downstream tasks, or do they already encode task-relevant structure that requires no further adjustment? Answering this question is central to understanding how pretrained knowledge transfers. Through systematic spectral analysis across vision and language models, we show that the leading singular vectors of pretrained weight matrices remain highly stable under finetuning and are shared across unrelated downstream tasks, revealing that pretraining establishes a reusable spectral coordinate system. Models pretrained on larger datasets exhibit greater spectral stability under distribution shift or task change, directly linking pretraining scale to geometric transferability. Motivated by these findings, we propose a parameter-efficient method that freezes pretrained singular vectors and optimizes only leading spectral coefficients, achieving competitive performance on GLUE with 0.2% trainable parameters. Our results reveal that the stable directions encode transferable structure rather than irrelevant noise: successful pretraining discovers spectral bases that downstream tasks inherit and operate within.

Comment: Shows pretraining induces stable leading singular-vector bases reused across downstream tasks, motivating spectral-coefficient-only adaptation.

Topic Match: Best fit is Representation Learning Theory and Structure because the main result concerns the geometric structure transferred by pretraining and why finetuning stays in a restricted subspace.

Relevance: 9 Novelty: 8

2. Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

ArXiv ID: 2605.06870

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Xinyu Zhao, Nikita Karagodin, Hamed Hassani, Sinan Hersek, Paul Pu Liang, Yury Polyanskiy

Abstract: While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimensional collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE warm-up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes $K \in \{2^{10}, 2^{14}, 2^{16}\}$, AE warm-up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at $K \in \{2^{13}, 2^{14}\}$, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE warm-up to VQ-VAE training.

Comment: Explains VQ-VAE dimensional collapse theoretically and fixes it with an autoencoder warm-up phase that restores latent rank.

Topic Match: Best fit is Representation Learning Theory and Structure because the key result is about why discrete latent representations collapse and how to preserve richer feature dimensions.

Relevance: 9 Novelty: 8

3. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

ArXiv ID: 2605.07922

Primary Topic: Representation Learning Theory and Structure

Authors: Tue M. Cao, Hoang X. Nhat, Raed Alharbi, My T. Thai

Abstract: Learning hierarchical features in Sparse Autoencoders (SAEs) is essential for capturing the structured nature of real-world data and mitigating issues like feature absorption or splitting. Existing works attempt to identify hierarchical relationships within independent feature sets by relying on activation coverage, the assumption that a child feature should only activate when its parent feature activates. However, we demonstrate that this condition alone is insufficient; that is, it often produces false positives where parent and child concepts are semantically unrelated. To address this, we introduce a novel reconstruction condition that enforces a deeper functional link between hierarchical levels. By combining both activation and reconstruction constraints, we propose the Tree SAE, a model designed to learn hierarchical structures directly from within the feature set. Our results demonstrate that Tree SAEs significantly surpass existing SAEs at learning hierarchical pairs while remaining competitive with the state of the art on several key benchmarks. Finally, we demonstrate the practical utility of our Tree SAE in mapping the geometry of child feature subspaces and uncovering the complex hierarchical concept structures encoded within large language models.

Comment: Proposes sparse autoencoders that explicitly learn hierarchical feature relations using joint activation and reconstruction constraints.

Topic Match: This is directly about feature formation and structure inside learned representations, especially SAE feature hierarchies.

Relevance: 9 Novelty: 8

4. Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization

ArXiv ID: 2605.07483

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Leonel Aguilar, Jan Nagler, Christoph Hoelscher, Nino Antulov-Fantulin

Abstract: Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are $\varepsilon$-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class $(\varphi, \psi, \mathcal{M})$, dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ${\sim}520\times$ out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on $\mathbb{S}^1$. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler's-third-law exoplanet prediction, $n=2{,}362$; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.

Comment: Argues OOD extrapolation hinges on representation-level identifiability bias, not just in-distribution fit, across architectures and domains.

Topic Match: The paper is fundamentally about how feature representations encode the structural commitments needed for generalization.

Relevance: 8 Novelty: 9

5. Characterizing and Correcting Effective Target Shift in Online Learning

ArXiv ID: 2605.07886

Primary Topic: Representation Learning Theory and Structure

Authors: Ziyan Li, Naoki Hiratani

Abstract: Online learning from a stream of data is a defining feature of intelligence, yet modern machine learning systems often struggle in this setting, especially under distributional shift. To understand its basic properties, we study the relationship between online and offline learning in the context of kernel regression. We derive a closed-form expression for the function learned by online kernel regression, revealing that online kernel regression is equivalent to offline regression with shifted, inaccurate target outputs. Conversely, we show that by compensating for this effective shift in the teaching signal through target correction, online kernel-based learning can provably learn the same predictor as its offline counterpart. We derive both a closed-form expression for this target correction and an iterative form that can be applied sequentially. Applying this framework to image classification tasks on CIFAR-10 and CORe50, we show that online stochastic gradient descent with iteratively corrected targets outperforms learning with the true targets in continual learning settings. This work therefore provides a basic framework for analyzing and improving online learning in non-stationary environments.

Comment: Derives a closed-form target-shift view of online kernel regression and uses target correction to recover offline-equivalent predictors in continual learning.

Topic Match: Best fit is representation_structure because it provides a basic theoretical account of online learning dynamics through kernel-function representations rather than a task-specific application.

Relevance: 8 Novelty: 8

6. Structured Coupling for Flow Matching

ArXiv ID: 2605.07676

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Xavier Sumba, Carles Balsells-Rodas, Yingzhen Li

Abstract: Standard flow matching scales well but typically relies on an unstructured source distribution, limiting its ability to learn interpretable latent structure. Latent-variable models, by contrast, capture structure but often sacrifice generative quality. We bridge this gap by proposing Structured Coupling for Flow Matching (SCFM), a cooperative framework that augments flow matching with structured latent representation learning. By introducing structured latent variables and exogenous noise into the source, SCFM jointly learns a structured prior (via latent variable modeling) and a continuous transport map (via flow matching). It uses a shared time-dependent recognition network for both latent variable model variational inference and intermediate-time flow velocity estimation. This yields a structurally informed yet unconditional, simulation-free flow model, where the latent variable model can also assist flow sampling. Empirically, SCFM facilitates unsupervised latent representation learning for clustering, disentanglement and downstream tasks, while remaining competitive with flow matching in sample quality, showing that meaningful structure can be learned without sacrificing generative fidelity.

Comment: Combines flow matching with structured latent-variable learning so the source coupling itself carries interpretable latent structure.

Topic Match: The main value is learning structured latent representations inside a flow framework, squarely matching representation structure.

Relevance: 8 Novelty: 8

7. Susceptibilities and Patterning: A Primer on Linear Response in Bayesian Learning

ArXiv ID: 2605.07980

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Chris Elliott, Daniel Murfet

Abstract: These notes introduce the theory of susceptibilities as developed in [arXiv:2504.18274, arXiv:2601.12703] for interpreting neural networks. The susceptibility of an observable $\phi$ to a data perturbation is defined as a derivative of a posterior expectation, which by the fluctuation-dissipation theorem equals a posterior covariance. Different choices of $\phi$ yield different objects: per-sample losses give the influence matrix (the Bayesian influence function of [arXiv:2509.26544]), while component-localized observables give the structural susceptibility matrix that pairs model components with data patterns. The susceptibility matrix is (up to a factor of $n\beta$) the Jacobian of the map from data distributions to structural coordinates; its pseudo-inverse provides a linearized solution to the patterning problem of [arXiv:2601.13548]: finding data perturbations that produce a desired structural change. We motivate the theory from its statistical-mechanical foundations, then give a detailed exposition of susceptibilities, their empirical estimators, and their connection to the geometry of the loss landscape.

Comment: Introduces susceptibilities as posterior-covariance objects linking data perturbations to learned structure in Bayesian neural learning.

Topic Match: Its emphasis is on mechanistic understanding of learned structure and feature-pattern relationships in representations.

Relevance: 8 Novelty: 8
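
A short derivation sketch of the derivative-equals-covariance identity mentioned in the abstract. The notation is assumed for illustration and may differ from the notes in sign and normalization: $\pi(w)$ is a prior, $L(w)$ the empirical loss, $\Delta L(w)$ a loss perturbation induced by the data perturbation, and $h$ its scalar strength.

```latex
% Assumed tempered posterior, perturbed with strength h:
%   p_h(w) = \pi(w) \exp(-n\beta [L(w) + h \Delta L(w)]) / Z_h
\begin{align*}
\mathbb{E}_h[\phi]
  &= \frac{1}{Z_h} \int \phi(w)\, \pi(w)\,
     e^{-n\beta\,[L(w) + h\,\Delta L(w)]}\, dw, \\
\frac{\partial}{\partial h}\, \mathbb{E}_h[\phi]
  &= -n\beta\, \mathbb{E}_h[\phi\, \Delta L]
     + n\beta\, \mathbb{E}_h[\phi]\, \mathbb{E}_h[\Delta L] \\
  &= -n\beta\, \mathrm{Cov}_h(\phi, \Delta L).
\end{align*}
```

The second term comes from differentiating the normalizer, $\partial_h Z_h = -n\beta\, \mathbb{E}_h[\Delta L]\, Z_h$, which is where the factor of $n\beta$ enters.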

8. Distributional simplicity bias and effective convexity in Energy Based Models

ArXiv ID: 2605.07844

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Aurélien Decelle, Alfonso de Jesús Navas Gómez, Beatriz Seoane

Abstract: Energy-based learning is a powerful framework for generative modelling, but its training is inherently non-convex, leading potentially to sensitivity to initialisation, poor local optima, and unstable gradient dynamics. We present a dynamical analysis of energy-based learning through the lens of the effective model, which can be interpreted as either a generalised Ising model with higher-order interactions or the Fourier expansion of the energy. Under sufficient expressivity, we show that the gradient flow induced by learning strictly positive distributions over binary variables admits two types of fixed points: data-consistent points, which exactly reproduce the target distribution, and spurious points, which satisfy stationarity without matching the target distribution. Around data-consistent points, we show that perturbations are either stable or neutral, with neutral directions leaving the effective model invariant. Finally, we show that gradient dynamics induce a hierarchy in which lower-order interactions are learned before higher-order ones. This provides a mechanistic explanation for the distributional simplicity bias and clarifies why fixed points that are not data-consistent at low orders are not observed in practice.

Comment: Provides a dynamical account of why EBMs learn low-order interactions first, explaining simplicity bias and effective local convexity near data-consistent solutions.

Topic Match: This is primarily mechanistic theory about how learned statistical structure forms during training, not an application paper.

Relevance: 8 Novelty: 8

9. When Symbol Names Should Not Matter: A Logistic Theory of Fresh-Symbol Classification

ArXiv ID: 2605.07120

Primary Topic: Representation Learning Theory and Structure

Authors: Wenjie Guan, Jelena Bradic

Abstract: Template tasks have emerged as a clean testbed for asking whether transformers reason with abstract symbols rather than concrete token names. We study the fixed-label classification version of this problem, where train and test examples share latent templates but may use disjoint vocabularies. Unlike next-token prediction, the model need not emit unseen symbols; it must learn a decision rule invariant to symbol renaming. We analyze regularized kernel logistic classification in the transformer-kernel regime. Our main result decomposes the learned predictor into an ideal template-level classifier and a finite-sample perturbation caused by accidental token overlaps in the training data. We encode these overlaps by a colored collision graph and prove high-probability margin-transfer guarantees for fresh-symbol classification. This perspective extends template-based analyses to logistic classification and refines scalar diversity conditions: vocabulary size controls the average rate of collisions, but collision geometry controls whether the ideal classification margin is preserved. More broadly, the same perturbation framework applies to abstraction-augmented inputs, yielding a general margin-versus-collision criterion for identifying when prompting strategies improve fresh-symbol generalization. Synthetic template experiments illustrate the predicted roles of regularization, sample size, and transformer-kernel structure.

Comment: Analyzes fresh-symbol classification in the transformer-kernel regime via collision graphs, linking abstraction generalization to overlap geometry.

Topic Match: The paper is centrally about theoretical structure of learned representations and abstraction under symbol renaming, not an application or benchmark result.

Relevance: 8 Novelty: 8

10. Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

ArXiv ID: 2605.07984

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Nicole Ma, Nick Rui

Abstract: We study planning site formation in language models -- where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. We localize the Gemma-3-27B handoff to five attention heads through two-stage path patching that recovers ~90% of the rhyme-routing capacity at the newline.

Comment: Localizes where future-token planning representations form and causally matter during generation using probing and activation/path patching.

Topic Match: The main value is mechanistic understanding of internal planning representations and their causal role, which is most directly a representation-structure contribution.

Relevance: 8 Novelty: 8

11. Tool Calling is Linearly Readable and Steerable in Language Models

ArXiv ID: 2605.07990

Primary Topic: Representation Learning Theory and Structure

Authors: Zekun Wu (University College London), Ze Wang (University College London), Seonglae Cho (Holistic AI), Yufei Yang (Imperial College London), Adriano Koshiyama (University College London), Sahan Bulathwela (University College London), Maria Perez-Ortiz (University College London)

Abstract: When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $\tau$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.

Comment: Finds tool identity is linearly readable and steerable in internal activations, with causal localization to output-direction and specific heads.

Topic Match: The core is mechanistic analysis of internal representations for tool selection and causal intervention on those representations, fitting representation structure best.

Relevance: 8 Novelty: 8
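
Illustrative sketch (not from the paper): the mean-difference steering and top-1/top-2 gap described in the abstract, on synthetic vectors. The tool names `send_email` and `create_event`, the hidden size, the noise level, and the cosine readout `read_tool` are hypothetical stand-ins. In a real model the vector would be added to hidden states at a chosen layer and the effect read off from generation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 200  # hidden size and samples per tool (toy values)

# Synthetic stand-ins for hidden states recorded when each tool is chosen.
centers = {"send_email": rng.normal(size=d), "create_event": rng.normal(size=d)}
acts = {name: c + 0.3 * rng.normal(size=(n, d)) for name, c in centers.items()}

# Per-tool mean activation, and the mean-difference steering vector A -> B.
means = {name: a.mean(axis=0) for name, a in acts.items()}
steer = means["create_event"] - means["send_email"]

def read_tool(h):
    # Cosine readout against each tool mean; returns (top tool, top1-top2 gap).
    names = list(means)
    sims = np.array([
        h @ means[k] / (np.linalg.norm(h) * np.linalg.norm(means[k]))
        for k in names
    ])
    order = np.argsort(sims)[::-1]
    return names[order[0]], float(sims[order[0]] - sims[order[1]])

h = acts["send_email"][0]
before, gap = read_tool(h)
after, _ = read_tool(h + steer)
print(before, "->", after, f"(pre-steer gap {gap:.2f})")
```

The returned gap plays the role of the abstract's error flag: queries with a small gap between the two best-matching tools would be the ones to treat as likely mis-selections.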

12. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

ArXiv ID: 2605.06840

Primary Topic: Representation Learning Theory and Structure

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar

Abstract: Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.

Comment: Extracts explicit search trees from reasoning traces to show that LLM move selection is driven by shallow nodes despite apparent deep deliberation.

Topic Match: Its contribution is mechanistic understanding of internal reasoning structure and planning behavior, which fits representation structure better than RL.

Relevance: 8 Novelty: 8

13. Learning Large-Scale Modular Addition with an Auxiliary Modulus

ArXiv ID: 2605.07648

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Hanato Kikuchi, Ryosuke Masuya, Kazuhiko Kawamoto, Hiroshi Kera

Abstract: Learning parity functions, and more generally modular addition, is a challenging machine learning task due to its input sensitivity. A recent study substantially scaled modular addition learning in both the number of summands and the modulus. Its key idea is to increase zeros in training sequences, reducing the effective number of summands and thus controlling training difficulty; however, this induces covariate shift between training and test input distributions. This study theoretically and empirically analyzes this side effect and proposes a covariate-shift-free method for modular addition. Specifically, we introduce an auxiliary modulus $Kq$ during training, which reduces wrap-around frequency and problem difficulty while preserving the same input distribution across training and testing. Experiments show strong scalability and sample efficiency: even for large input length $N$, large modulus $q$, and small datasets -- where the sparse method fails to learn -- our method achieves equal or better match accuracy and relaxed $\tau$-accuracy. For example, at $N=64$ and $q=974269$, our method trained on 100K samples achieves $97.0\%$ $\tau$-accuracy at $\tau=0.05$, while the sparse method achieves only $9.5\%$ with the same data size and $93.9\%$ even when extended to 1M samples.

Comment: Analyzes why sparsifying modular-addition training causes covariate shift and proposes an auxiliary-modulus training scheme that preserves the train/test input distribution while easing optimization.

Topic Match: The core contribution is a mechanistic study of learning dynamics on a hard compositional task, plus a new training formulation that changes what structure the model can learn without distribution mismatch.

Relevance: 8 Novelty: 8
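
Illustrative sketch (not from the paper): one plausible reading of the auxiliary modulus $Kq$ is that training labels are computed modulo $Kq$ instead of $q$. The value `K = 8` and the helper `wraps` are assumptions. How the paper actually uses the auxiliary label during training is not stated in the abstract.

```python
import random

def wraps(total, modulus):
    # Number of times the running sum wraps around the modulus.
    return total // modulus

random.seed(0)
N, q, K = 64, 974269, 8          # N and q from the abstract; K = 8 is an assumed value
xs = [random.randrange(q) for _ in range(N)]  # inputs drawn as at test time
total = sum(xs)

target_q = total % q             # original label: sum mod q
target_Kq = total % (K * q)      # auxiliary label: sum mod Kq

# Because q divides Kq, reducing the auxiliary label mod q recovers the original.
assert target_Kq % q == target_q
print("wraps mod q: ", wraps(total, q))
print("wraps mod Kq:", wraps(total, K * q))
```

The inputs are sampled from the same range in both cases, so only the label's wrap-around count changes, which is the covariate-shift-free property the abstract emphasizes.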

14. When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

ArXiv ID: 2605.06723

Primary Topic: Representation Learning Theory and Structure

Authors: Long Zhang, Wei-neng Chen, Feng-feng Wei, Zi-bo Qin

Abstract: Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model's answer preference became stable. We study this question through a narrow computable object: finite-answer preference stabilization. For a model state and specified answer verbalizers, we project the model's own continuation probabilities onto a finite answer set; in binary tasks this yields an exact log-odds code, $\delta(\xi)=S_\theta(\mathrm{yes}\mid\xi)-S_\theta(\mathrm{no}\mid\xi)$. This target defines parser-based answer onset, retrospective stabilization time, and lead without relying on greedy rollouts or learned probes. In controlled delayed-verdict tasks with Qwen3-4B-Instruct, the contextual finite-answer projection stabilizes before the answer is parseable, with 17-31 token mean lead in the main templates and positive, shorter lead in a parser-clean replication. The signal tracks the model's eventual output rather than truth, is linearly recoverable from compact hidden summaries, is partly separable from cursor progress, and transfers as shared information without a single invariant coordinate. Diagnostics separate the measurement from online stopping, verbalizer-free belief, and causal answer control; exact steering shows local sensitivity of $\delta$ but not reliable generation control.

Comment: Defines a computable finite-answer preference stabilization signal to measure when an LM has effectively committed to an answer before verbalizing it.

Topic Match: The main value is mechanistic understanding of internal decision formation and answer commitment, i.e. structure in learned internal representations and trajectories.

Relevance: 8 Novelty: 8

15. PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

ArXiv ID: 2605.06979

Primary Topic: Representation Learning Theory and Structure

Authors: Jonathn Chang, Arya Datla, Ziv Goldfeld

Abstract: Causal abstraction offers a principled framework for mechanistic interpretability, aligning a high-level causal model with the low-level computation realized by a neural network through counterfactual intervention analysis. Existing methods such as distributed alignment search (DAS) learn expressive subspace interventions, but the relevant neural site is unknown a priori, so finding a handle requires a computationally burdensome search over candidate sites. We introduce PLOT (Progressive Localization via Optimal Transport), a transport-based framework that localizes causal variables from the output effect geometry of abstract and neural interventions. PLOT fits an optimal transport coupling between abstract variables and candidate neural sites, yielding a global soft correspondence that can be calibrated into intervention handles. In simple settings, a single coupling over individual neurons suffices. In larger models, PLOT is applied progressively, moving from coarse sites such as tokens, timesteps, or layers to finer supports such as coordinate groups or PCA spans, and optionally guiding DAS based on the localized signal. Across experiments of increasing complexity, transport-only PLOT handles are exceedingly fast and competitive on accuracy, while PLOT-guided DAS reaches DAS-level accuracy at a fraction of full DAS runtime, providing an efficient localization engine for causal abstraction research at scale.

Comment: Uses optimal transport to localize abstract causal variables to neural sites, greatly reducing the search cost of causal abstraction and intervention localization.

Topic Match: This is a mechanistic interpretability paper about mapping high-level causal variables onto neural representations, squarely fitting representation structure.

Relevance: 8 Novelty: 8

16. Interpreting Reinforcement Learning Agents with Susceptibilities

ArXiv ID: 2605.08007

Primary Topic: Representation Learning Theory and Structure

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Chris Elliott, Einar Urdshals, David Quarel, Daniel Murfet

Abstract: Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate the utility of susceptibilities in a simple gridworld model that nevertheless exhibits non-trivial stagewise development. We argue that susceptibilities reveal internal features of the development of the model in parameter space that one cannot detect purely by studying the development of the learned policy. We validate these results with activation-steering, and discuss the framework's extension to RLHF post-training.

Comment: Extends susceptibilities to deep RL to expose stagewise internal development not visible from policy behavior alone.

Topic Match: The contribution is chiefly interpretability of learned agent representations and training trajectories, rather than a new RL algorithm.

Relevance: 8 Novelty: 8

Memory Structures and Agent Memory Systems (2)

1. CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

ArXiv ID: 2605.06702

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Siyuan Guo, Yali Du, Hechang Chen, Yi Chang, Jun Wang

Abstract: Large language models (LLMs) have become a central foundation of modern artificial intelligence, yet their lifecycle remains constrained by a rigid separation between training and deployment, after which learning effectively ceases. This limitation contrasts with natural intelligence, which continually adapts through interaction with its environment. In this paper, we formalise deployment-time learning (DTL) as the third stage in the LLM lifecycle that enables LLM agents to improve from experience during deployment without modifying model parameters. We present CASCADE (CASe-based Continual Adaptation during DEployment), a general and principled framework that equips LLM agents with an explicit, evolving episodic memory. CASCADE formulates experience reuse as a contextual bandit problem, enabling principled exploration-exploitation trade-offs and establishing no-regret guarantees over long-term interactions. This design allows agents to accumulate, select, and refine task-relevant cases, transforming past experience into actionable knowledge. Across 16 diverse tasks spanning medical diagnosis, legal analysis, code generation, web search, tool use, and embodied interaction, CASCADE improves macro-averaged success rate by 20.9% over zero-shot prompting while consistently outperforming gradient-based and memory-based baselines. By reframing deployment as an adaptive learning process, this work establishes a foundation for continually improving AI systems.

Comment: Formalizes deployment-time learning as contextual-bandit-based episodic memory reuse, giving no-regret guarantees for continual agent adaptation without weight updates.

Topic Match: Best fit is memory_systems because the central contribution is an explicit evolving episodic memory mechanism for agents, including storage, selection, and reuse principles.

Relevance: 9 Novelty: 8

2. MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

ArXiv ID: 2605.07242

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Yang Zhao, Chengxiao Dai, Mengying Kou, Yue Xiu

Abstract: Agentic memory evolves across tasks into durable derived artifacts: summaries, cached outputs, embeddings, learned skills, and executable tool procedures. When a source artifact is deleted, corrected, or invalidated by tool or API migration, descendants derived from that source can remain visible and steer future actions with stale support. We formalize this failure mode as the cascade update problem, where repair targets the visible derived state of the memory store. We present MemoRepair, a barrier-first cascade-repair contract for agentic memory. A repair event induces a controlled transition from invalidated descendant state to validated successor state: affected descendants are withdrawn before repair, successors are constructed from retained support and staged repaired predecessors under the current interface, and republication is restricted to validated predecessor-closed successors. This contract induces a scalarized repair-selection problem for a fixed repair-cost tradeoff. We show that the induced publication problem reduces to maximum-weight predecessor closure and can be solved exactly by a single s-t min-cut. Experiments on ToolBench and MemoryArena show that, with complete influence provenance, MemoRepair reduces invalidated-memory exposure from 69.8-94.3% under systems without cascade repair to 0%. Compared with exhaustive Repair all, it recovers 91.1-94.3% of validated successors while reducing normalized repair-operator cost from 1.00 to 0.57-0.76.

Comment: Formalizes cascade repair in agent memory and gives an exact min-cut solution for predecessor-closed republishing after memory invalidation.

Topic Match: The core contribution is a new memory-update principle for agentic memory stores: how to withdraw, repair, and republish dependent memory artifacts under invalidation.

Relevance: 9 Novelty: 8
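
Illustrative sketch (not from the paper): the classic reduction from maximum-weight closure to an s-t min-cut, which the abstract names as the exact solver for predecessor-closed republication. The artifact names, the integer weights (read here as validated value minus scaled repair cost), and the function `max_weight_closure` are hypothetical. The max-flow routine is a plain Edmonds-Karp implementation.

```python
from collections import deque

def max_weight_closure(weights, requires):
    """weights: {node: int}; requires: {node: [predecessors it depends on]}.
    Returns (best_total, chosen_set) over predecessor-closed sets."""
    INF = float("inf")
    s, t = "_source", "_sink"
    cap = {}  # residual capacities: cap[u][v]

    def add_edge(u, v, c):
        cap.setdefault(u, {})[v] = cap.get(u, {}).get(v, 0) + c
        cap.setdefault(v, {}).setdefault(u, 0)

    for v, w in weights.items():
        if w > 0:
            add_edge(s, v, w)      # gain collected if v is chosen
        elif w < 0:
            add_edge(v, t, -w)     # cost paid if v is chosen
    for v, preds in requires.items():
        for u in preds:
            add_edge(v, u, INF)    # choosing v forces choosing u
    cap.setdefault(s, {})
    cap.setdefault(t, {})

    def bfs():
        # Nodes reachable from the source in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == t:
                        return parent
                    queue.append(v)
        return parent

    while True:  # Edmonds-Karp: augment along shortest residual paths
        parent = bfs()
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow

    chosen = {v for v in parent if v in weights}  # source side of the min cut
    return sum(weights[v] for v in chosen), chosen

# Hypothetical memory artifacts: positive = net value, negative = net repair cost.
weights = {"summary": 5, "embedding": 3, "skill": 4, "api_fix": -6, "source_doc": -1}
requires = {"summary": ["source_doc"], "embedding": ["summary"], "skill": ["api_fix"]}
best, chosen = max_weight_closure(weights, requires)
print(best, sorted(chosen))
```

Integer weights keep the residual-capacity comparisons exact. With real-valued costs a tolerance or a rescaling to integers would be a safer choice.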

World Models, Exploration, and Open-Ended Reinforcement Learning (7)

1. AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

ArXiv ID: 2605.06841

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Qinshi Zhang (University of California, San Diego), Weipeng Deng (University of Hong Kong), Zhihan Jiang (Columbia University), Jiaming Qu (Amazon), Qianren Li (City University of Hong Kong), Weitao Xu (City University of Hong Kong), Ray LC (City University of Hong Kong)

Abstract: In model-based learning, the agent learns behaviors by simulating trajectories based on world model predictions. Standard world models typically learn a stationary transition function that maps states and actions to next states. When an action and an outcome frequently co-occur in training data, the model tends to internalize this correlation as a general causal rule while ignoring action preconditions. In interactive environments, however, agent actions can reshape the future affordance space. At each timestep, an action may become executable only after its prerequisites are met, or non-executable when they are destroyed. We term such events structure-changing events (SC events). As a result, a conventional world model often fails to determine whether a given action is executable in the current state, especially in multi-step predictions. Each imagined step is conditioned on an incorrect affordance state, and therefore the prediction error compounds over the rollout horizon. In this paper, we propose AGWM (Affordance-Grounded World Model), which learns an abstract affordance structure represented as a DAG of prerequisite dependencies to explicitly track the dynamic executability of actions. Experiments on game-based simulated environments demonstrate the effectiveness of our method by achieving lower multi-step prediction error, better generalization to novel configurations, and improved interpretability.

Comment: Adds an explicit affordance DAG to world models so imagined rollouts track action prerequisites and changing executability.

Topic Match: Its core idea is a new world-model mechanism for planning in environments with structure-changing affordances, directly matching foundational model-based RL.

Relevance: 9 Novelty: 8

2. Predictive but Not Plannable: RC-aux for Latent World Models

ArXiv ID: 2605.07278

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Abstract: A latent world model may achieve accurate short-horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long-horizon goal-directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability-Correction auxiliary objective (RC-aux), a lightweight correction for this mismatch in reconstruction-free latent world models. RC-aux keeps the world-model backbone unchanged and adds planning-aligned supervision along two axes. Along the time axis, multi-horizon open-loop prediction trains the model beyond one-step consistency. Along the space axis, budget-conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability-aware planner to favor trajectories that are both goal-directed and attainable under the available budget. We instantiate RC-aux on LeWorldModel and evaluate it under both continuation-training and matched-from-scratch settings. Across goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC-aux.

Comment: Adds budget-conditioned reachability supervision so latent world-model spaces become aligned with what is actually plannable under finite horizons.

Topic Match: The paper directly targets the planning mismatch in latent world models, making world models the primary topic.

Relevance: 9 Novelty: 8

3. Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

ArXiv ID: 2605.07123

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang

Abstract: In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding of how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with a linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide a finite-sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding of the empirical emergence of those parameters.

Comment: Shows theoretically that Chain-of-Thought in linear Transformers implements repeated TD-style updates for in-context RL, with convergence guarantees.

Topic Match: The paper is fundamentally about in-context reinforcement learning behavior and its emergence, not just generic transformer theory.

Relevance: 9 Novelty: 8
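
For readers less familiar with the temporal difference (TD) update that the abstract says each CoT step reproduces, the following is a minimal sketch of tabular TD(0) policy evaluation. It is not the paper's linear-Transformer construction. The function name, the toy two-state chain, and the step size are illustrative assumptions.

```python
def td0_policy_evaluation(transitions, num_states, gamma=0.9, alpha=0.5, num_sweeps=500):
    """Estimate state values with repeated TD(0) sweeps over fixed transitions.

    transitions: list of (state, reward, next_state) tuples collected under
    the policy being evaluated.
    """
    values = [0.0] * num_states
    for _ in range(num_sweeps):
        for state, reward, next_state in transitions:
            # TD error: bootstrapped one-step target minus the current estimate.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
    return values


# Hypothetical toy chain: state 0 moves to state 1 with reward 0,
# and state 1 loops on itself with reward 1.
toy_transitions = [(0, 0.0, 1), (1, 1.0, 1)]
estimated_values = td0_policy_evaluation(toy_transitions, num_states=2)
```

In this toy chain the exact values are 10 for state 1 (that is, 1 / (1 - 0.9)) and 9 for state 0, and the estimates approach them geometrically as the number of sweeps grows. This loosely mirrors the abstract's claim that the evaluation error decreases geometrically with CoT length.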

4. Learning Visual Feature-Based World Models via Residual Latent Action

ArXiv ID: 2605.07079

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias

Abstract: World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as Residual Latent Action (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose RLA World Model (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm

Comment: Learns residual latent actions from visual feature residuals and uses them to build a fast feature-space world model for planning and offline RL.

Topic Match: Its core contribution is a new action-conditioned world-model formulation for agent learning and planning, which squarely fits world models.

Relevance: 9 Novelty: 8

5. Finite-Time Analysis of MCTS in Continuous POMDP Planning

ArXiv ID: 2605.07703

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Da Kong, Vadim Indelman

Abstract: This paper presents a finite-time analysis for Monte Carlo Tree Search (MCTS) in Partially Observable Markov Decision Processes (POMDPs), with probabilistic concentration bounds in both discrete and continuous observation spaces. While MCTS-style solvers such as POMCP achieve empirical success in many applications, rigorous finite-time guarantees remain an open problem due to the nonstationarity and the interdependencies induced by heuristic action selection (e.g., UCB). In the discrete setting, we address these challenges by extending the polynomial exploration bonus to UCB in POMDP setting, yielding polynomial concentration bounds for the empirical value estimation at the root node. For continuous observation spaces, we introduce an abstract partitioning framework and propose a finite-time bound on partitioning loss. Under mild conditions, we prove high-probability bound on value estimates in POMDPs with continuous observation space. Specifically, we propose Voro-POMCPOW, a variant of POMCPOW with finite-time guarantees that adaptively partitions the continuous observation space using Voronoi cells. This approach maintains a finite branching factor while preserving the original observation generator. Empirical validation demonstrates that the proposed Voro-POMCPOW shows competitive performance while providing theoretical guarantees. Although our analysis focuses on continuous POMDPs, the techniques developed herein are also applicable to continuous MDPs, closing another gap on the MDP side.

Comment: Provides finite-time concentration guarantees for MCTS in continuous-observation POMDPs via adaptive Voronoi partitioning.

Topic Match: Best fit is world_models_open_ended_rl because it is foundational theory for planning under partial observability, directly relevant to model-based RL and exploration settings.

Relevance: 8 Novelty: 8
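
To make the Voronoi-cell idea in the abstract concrete, here is a minimal sketch of how a finite set of centers partitions a continuous observation space: each observation falls in the cell of its nearest center, so a tree node needs at most one child per center. This illustrates Voronoi assignment in general, not the Voro-POMCPOW algorithm. The function names, centers, and observations are hypothetical.

```python
import math
from collections import defaultdict


def nearest_center_index(observation, centers):
    """Index of the center closest to `observation` in Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(observation, centers[i]))


def group_by_cell(observations, centers):
    """Group continuous observations by the Voronoi cell they fall into."""
    cells = defaultdict(list)
    for observation in observations:
        cells[nearest_center_index(observation, centers)].append(observation)
    return dict(cells)


# Hypothetical 2-D centers and sampled observations.
centers = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
observations = [(0.9, 0.8), (3.0, 0.2), (0.1, -0.2), (1.2, 1.1)]
cells = group_by_cell(observations, centers)
```

However many distinct observations are sampled, the number of groups never exceeds the number of centers, which is the sense in which such a partition keeps the branching factor finite.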

6. On the Divergence of Differential Temporal Difference Learning without Local Clocks

ArXiv ID: 2605.06874

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: David Antrobius, Shangtong Zhang

Abstract: Learning rate is a critical component of reinforcement learning (RL). This work uses global and local clocks to distinguish two types of learning rates. The former is of the standard form $\alpha_t$ that depends only on the time step $t$ (i.e., a global clock). The latter is of the form $\alpha_{\nu(S_t, t)}$, where $\nu(s, t)$ counts the number of visits to state $s$ until time $t$ (i.e., a local clock). In discounted RL, an RL algorithm that is convergent with a local clock is always also convergent with a global clock, and vice versa. We are not aware of any counterexample. The key contribution of this work is to show that this nice correspondence breaks down in average-reward RL. Specifically, we construct a counterexample showing that although differential temporal difference learning is convergent with a local clock, it can diverge with a global clock. This counterexample closes the open problem in Wan et al. [2021], Blaser et al. [2026].

Comment: Shows differential TD can diverge with global-clock learning rates in average-reward RL even when local-clock updates converge.

Topic Match: The contribution is a sharp foundational result on RL learning dynamics and step-size design in average-reward temporal-difference learning.

Relevance: 8 Novelty: 8
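
The distinction between the two clocks in the abstract can be sketched as follows. A global clock indexes the learning rate by the time step $t$, while a local clock indexes it by the visit count of the current state. The schedule $\alpha_n = 1/n$, the function names, and the choice to count the current visit are assumptions made for illustration only. The sketch does not reproduce the paper's counterexample.

```python
from collections import defaultdict


def step_size(count):
    """Hypothetical schedule alpha_n = 1 / n."""
    return 1.0 / count


def clock_step_sizes(state_sequence):
    """Return (global-clock rates, local-clock rates) along a state trajectory."""
    visits = defaultdict(int)
    global_rates, local_rates = [], []
    for t, state in enumerate(state_sequence, start=1):
        visits[state] += 1  # assumption: the current visit is included in the count
        global_rates.append(step_size(t))               # depends only on t
        local_rates.append(step_size(visits[state]))    # depends on visits to this state
    return global_rates, local_rates


global_rates, local_rates = clock_step_sizes(["a", "a", "b", "a", "b"])
```

With a global clock, a rarely visited state such as "b" receives an already-decayed learning rate on its first visit. With a local clock its rate starts from the beginning of the schedule.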

7. Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

ArXiv ID: 2605.07727

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Juil Koo, Mingue Park, Jiwon Choi, Yunhong Min, Minhyuk Sung

Abstract: We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probability space. By construction, this gradient is decomposed into an ascent toward higher action-value regions and a score matching with the anchor policy as a trust region. We further derive a simple, tractable surrogate of the otherwise intractable update loss, akin to behavior cloning on top-K critic-selected actions. We find empirically that this mechanism uniquely benefits the drifting backbone owing to its non-ODE parameterization. With one-step inference, DFP achieves state-of-the-art performance on several manipulation tasks across Robomimic and OGBench, outperforming ODE-based policies.

Comment: Frames policy updates as Wasserstein gradient flow toward a soft target policy, yielding a one-step generative policy with a tractable update surrogate.

Topic Match: This is a foundational RL algorithm paper on policy parameterization and update dynamics, best matched to the RL bucket rather than generic architecture.

Relevance: 8 Novelty: 8

Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Relevant Topics

Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.

Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.

Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.

Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.

Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.

Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.

World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.

Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.

7-8: substantially related, but partly peripheral or focused on a narrower aspect.

5-6: touches the target topics, but the main contribution is elsewhere.

3-4: largely outside the target topics, often application-focused or domain-specific.

1-2: unrelated.

Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.

Novelty Scoring

9-10: new paradigm, theory, or major methodological breakthrough.

7-8: substantial methodological advance or strong new insight.

5-6: meaningful but incremental extension or refinement.

3-4: minor, narrow, or mostly engineering or domain-specific improvement.

1-2: little originality; mainly standard application of existing methods.

Topic Registry

Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.

Papers

[PAPER LIST HERE]

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}

Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
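
A consumer of this output format could check each JSONL line against the rules above before using it. The sketch below is one possible validator and is not part of the digest pipeline. The function and constant names are hypothetical. The example record reuses the comment, scores, and topic labels shown for arXiv 2605.07123 in this digest, and its empty hotspot fields are an assumption.

```python
import json

ALLOWED_TOPIC_IDS = {
    "architecture_training",
    "efficiency_scaling",
    "representation_structure",
    "memory_systems",
    "world_models_open_ended_rl",
}
ALLOWED_HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT",
    "HOTSPOT_PAPER_TAGS", "HOTSPOT_PAPER_COMMENT",
}


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    for key in ("RELEVANCE", "NOVELTY"):
        score = record.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
            problems.append(f"{key} must be an integer from 1 to 10")
    if record.get("PRIMARY_TOPIC_ID") not in ALLOWED_TOPIC_IDS:
        problems.append("PRIMARY_TOPIC_ID is not in the topic registry")
    matched = record.get("MATCHED_TOPIC_IDS", [])
    if not isinstance(matched, list) or not all(
        isinstance(topic, str) and topic in ALLOWED_TOPIC_IDS for topic in matched
    ):
        problems.append("MATCHED_TOPIC_IDS contains an unknown topic ID")
    tags = record.get("HOTSPOT_PAPER_TAGS", [])
    if not isinstance(tags, list) or not all(
        isinstance(tag, str) and tag in ALLOWED_HOTSPOT_TAGS for tag in tags
    ):
        problems.append("HOTSPOT_PAPER_TAGS contains an unknown tag")
    elif not tags and record.get("HOTSPOT_PAPER_COMMENT", "") != "":
        problems.append("HOTSPOT_PAPER_COMMENT must be empty when no tags are set")
    return problems


# Illustrative record; the hotspot fields are assumed to be empty.
example_line = json.dumps({
    "ARXIVID": "2605.07123",
    "COMMENT": "Shows theoretically that Chain-of-Thought in linear Transformers "
               "implements repeated TD-style updates for in-context RL, with "
               "convergence guarantees.",
    "RELEVANCE": 9,
    "NOVELTY": 8,
    "PRIMARY_TOPIC_ID": "world_models_open_ended_rl",
    "MATCHED_TOPIC_IDS": ["world_models_open_ended_rl", "architecture_training"],
    "TOPIC_MATCH_COMMENT": "The paper is fundamentally about in-context "
                           "reinforcement learning behavior and its emergence, "
                           "not just generic transformer theory.",
    "HOTSPOT_PAPER_TAGS": [],
    "HOTSPOT_PAPER_COMMENT": "",
})
example_problems = validate_record(example_line)
```

Note that the template object in the instructions uses 0 for RELEVANCE and NOVELTY as placeholders, so a validator like this one would reject the template itself while accepting any filled-in record that follows the rules.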