Personalized Daily ArXiv Papers 2026-05-12

Model	Metric	Prompt (usage)	Completion (usage)	Total (usage)	Total arXiv papers	Papers scanned	Papers relevant
`gpt-5.4`	Tokens	848095	66557	914652	2121	1450	183
`gpt-5.4`	Cost	$2.12	$1.00	$3.12	2121	1450	183

Topic Coverage:

Topic	Papers
Architecture and Training Dynamics	66
Efficiency, Compression, and Large-Scale Training	31
Representation Learning Theory and Structure	48
Memory Structures and Agent Memory Systems	12
World Models, Exploration, and Open-Ended Reinforcement Learning	26

Table of contents by topic:

Architecture and Training Dynamics (66)

Hierarchical Mixture-of-Experts with Two-Stage Optimization Authors: Gleb Molodtsov, Alexander Miasnikov, Aleksandr Beznosikov
ELF: Embedded Language Flows Authors: Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He
Priming: Hybrid State Space Models From Pre-trained Transformers Authors: Aditya Chattopadhyay, Elvis Nunez, Prannay Kaul, Benjamin Bowman, Evan Becker, Luca Zancato, David Thomas, Wei Xia, Stefano Soatto
Attention Drift: What Autoregressive Speculative Decoding Models Learn Authors: Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers Authors: Gabriel Smithline, Chris Mascioli
A Single-Layer Model Can Do Language Modeling Authors: Zanmin Wang
TIDES: Implicit Time-Awareness in Selective State Space Models Authors: Taylan Soydan, Miguel A. Bessa, Dirk Mohr, Rui Barreira
FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences Authors: Mengqi Li, Wensheng Lin, Jinshuai Yang, Lixin Li
Complex-Valued Phase-Coherent Transformer Authors: Leona Hioki
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices Authors: Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu
Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training Authors: Yuanyi Wang, Yifan Yang, Su Lu, Yanggan Gu, Pengkai Wang, Wenjun Wang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Jialun Cao, Shing-Chi Cheung, Hongxia Yang
Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge Authors: Dylan Forde
A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks Authors: Daning Cheng, Zeyu Liu, Jun Sun, Fen Xia, Boyang Zhang, Dongping Liu, Yunquan Zhang
SDG-MoE: Signed Debate Graph Mixture-of-Experts Authors: Stepan Kulibaba, Kirill Labzin, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gansnikov, Aleksei Shpilman
Key-Value Means Authors: Daniel Goldstein, Eugene Cheah
Continuity Laws for Sequential Models Authors: Annan Yu, Dongwei Lyu, N. Benjamin Erichson
Muown: Row-Norm Control for Muon Optimization Authors: Kai Lion, Florian Hübler, Bingcong Li, Antonio Orvieto, Niao He
Kaczmarz Linear Attention Authors: Jiaxuan Zou, Ruifeng Ren, Yong Liu
Scaling Limits of Long-Context Transformers Authors: Giuseppe Bruno, Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet
Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity Authors: Zhongjie Shi, Wenjing Liao
Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression Authors: Mingsong Yan, Dongyang Li, Charles Kulick, Sui Tang
Mixture of Layers with Hybrid Attention Authors: Ivan Ternovtsii, Yurii Bilak
Sparse Layers are Critical to Scaling Looped Language Models Authors: Ryan Lee, Jacob Biloki, Edward J. Hu, Jonathan May
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition Authors: Haoren Xu, Guanhua Fang
Kinetic theory for Transformers and the lost-in-the-middle phenomenon Authors: Mitia Duerinckx, Borjan Geshkovski, Stefano Rossi
Predicting Plasticity in Deep Continual Learning: A Theoretical Perspective Authors: Jiuqi Wang, Jayanth Srinivasa, Claire Chen, Shuze Daniel Liu, Ali Payani, Shangtong Zhang
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases Authors: Daniel Wolfson, Tal Wagner
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models Authors: Lin Zheng, Vasilisa Bashlovkina, Timothy Dozat, Dan Garrette, Laura Rimell, Joshua Maynez
Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models Authors: Jeonseong Kim
Teaching LLMs to See Graphs: Unifying Text and Structural Reasoning Authors: Dario Vajda
Lattice Deduction Transformers Authors: Liam Davis, Leopold Haller, Alberto Alfarano, Mark Santolucito
Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy Pooling Authors: Ali Syed, Aditya Nambiar, Jonathan W. Siegel
bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition Authors: Michal Byra, Pawel Olszowiec, Grzegorz Stefanski, Grzegorz Gruszczynski, Alberto Presta
Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime Authors: Albert Alcalde, Leon Bungert, Konstantin Riedl, Tim Roith
Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising Authors: Youssef Saied, François Fleuret
The Power of Second Order Methods for Sequence Preconditioning Authors: Annie Marsden, Elad Hazan
NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training Authors: Fang Wu, Haokai Zhao, Da Xing, Hanqun Cao, Tinson Xu, Yanchao Li, Xiangru Tang, Zehong Wang, Aaron Tu, Kuan Pang, Hanchen Wang, Hongbin Lin, Zeqi Zhou, Yinxi Li, Peng Xia, Li Erran Li, Molei Tao, Jure Leskovec, Aditya Joshi, Yejin Choi
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off Authors: Yu Chen, Yuanhao Liu, Qi Cao
On Variance Reduction in Learning Mean Flows Authors: Juanwu Lu, Ziran Wang
Infinite Mask Diffusion for Few-Step Distillation Authors: Jaehoon Yoo, Wonjung Kim, Chanhyuk Lee, Seunghoon Hong
Phases of Muon: When Muon Eclipses SignSGD Authors: Elliot Paquette, Noah Marshall, Lucas Benigni, Guangyuan Wang, Atish Agarwala, Courtney Paquette
Controlling Transient Amplification Improves Long-horizon Rollouts Authors: Adeel Pervez, Francesco Locatello
Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit Authors: Konstantin Riedl, Konstantinos Spiliopoulos, Justin Sirignano
Optimizer-Induced Mode Connectivity: From AdamW to Muon Authors: Fangzhao Zhang, Sungyoon Kim, Erica Zhang, Yiqi Jiang, Mert Pilanci
Fitting Multilinear Polynomials for Logic Gate Networks Authors: Youngsung Kim
Hyperparameter Transfer for Dense Associative Memories Authors: Roi Holtzman, Dmitry Krotov, Boris Hanin
Structured Recurrent Mixers for Massively Parallelized Sequence Generation Authors: Benjamin L. Badger
Dimension-Free Saddle-Point Escape in Muon Authors: Yanlin Long, Yufei Gu, Zeke Xie
Parameterized Complexity of Stationarity Testing for Piecewise-Affine Functions and Shallow CNN Losses Authors: Yuhan Ye
Minimal Filling Architectures of Polynomial Neural Networks: Counterexamples, Frontier Search, and Defects Authors: Kevin Dao, Jose Israel Rodriguez
CATO: Charted Attention for Neural PDE Operators Authors: Chun-Wun Cheng, Sifan Wang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero
RAwR: Role-Aware Rewiring via Approximate Equitable Partition Authors: Riccardo Porcedda, Giuseppe Squillace, Bastian Epping, Andrea Vandin, Michael Schaub, Mirco Tribastone, Francesca Chiaromonte
When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains Authors: Brandon Yee, Pairie Koh, Jack Rodriguez, Mihir Tekal
Exactness Matters for Physical Rule Enforcement Authors: Bum Jun Kim
Exact Fixed-Point Constraints in Neural-ODEs with Provable Universality Authors: Feliciano Giuseppe Pacifico, Duccio Fanelli, Lorenzo Buffoni, Lorenzo Chicchi, Diego Febbe, Raffaele Marino
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why Authors: Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal, Duc N. M Hoang, Fartash Faghri, Yizhe Zhang, Minsik Cho, Mehrdad Farajtabar
Elucidating Representation Degradation Problem in Diffusion Model Training Authors: Zhipeng Yao, Dazhou Li, Zitong Zhang, Durude Mahee, Fan Zhu, Wenbin Zhang, Xinwei He, Yeying Jin, Rui Yu
A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models Authors: Djamel Bouchaffra
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits Authors: Tianhao Cheng, Zeyu Huang, Zihan Qiu, Yu Cheng, Edoardo Ponti, Yinghui Xu, Ivan Titov, Zenglin Xu
Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory Authors: Yao Shu, Jian Mu, Zhongxiang Dai
Recovering Physical Dynamics from Discrete Observations via Intrinsic Differential Consistency Authors: Yuxiang Luo, Andrew Perrault
RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings Authors: Byeongchan Kim, Arijit Sehanobish, Avinava Dubey, Min-hwan Oh, Krzysztof Choromanski
Improving Generalization by Permutation Routing Across Model Copies Authors: Shuhei Kashiwamura, Timothee Leleu
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs Authors: Xin Li, Hao Jiang, Annan Wang, Yichi Zhang, Chau Yuen
HyperTransport: Amortized Conditioning of T2I Generative Models Authors: Valentino Maiorca, Eleonora Gualdoni, Xavier Suau, Marco Cuturi, Luca Zappella, Pau Rodríguez
Don't Fix the Basis -- Learn It: Spectral Representation with Adaptive Basis Learning for PDEs Authors: Xuxiang Zhao, Angelica I. Aviles-Rivero

Efficiency, Compression, and Large-Scale Training (31)

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction Authors: Ngoc Bui, Hieu Trung Nguyen, Arman Cohan, Rex Ying
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization Authors: Venugopalan Iyengar
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding Authors: Zihao An, Taichi Liu, Ziqiong Liu, Dong Li, Ruofeng Liu, Emad Barsoum
AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation Authors: Ziyun Liu, Fengmiao Bian, Jian-Feng Cai
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning Authors: Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, Xin Jin
Test-Time Speculation Authors: Avinash Kumar, Sujay Sanghavi, Poulami Das
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms Authors: Omatharv Bharat Vaidya, Connor T. Jerzak, Nhat Ho, Chandrajit Bajaj
Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning Authors: Aojie Yuan, Tianqi Shen, Dajun Zhang
Nectar: Neural Estimation of Cached-Token Attention via Regression Authors: João Monteiro, Michal Klein, Pierre Ablin, Marco Cuturi
TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling Authors: Hongyaoxing Gu, Xinzhe Chen, Lijuan Hu, Fangfang Liu
RubiConv -- Efficient Boundary-Respecting Convolutions Authors: Linda Friso, Annie Marsden, Xinyi Chen, Arushi Gupta, Peter Bartlett, Mark Braverman, Elad Hazan
LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss Authors: Euntae Choi, Sumin Song, Sungjoo Yoo
ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs Authors: Chayne Thrash, Ali Abbasi, Soheil Kolouri
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale Authors: Liang Luo, Yinbin Ma, Quanyu Zhu, Vasiliy Kuznetsov, Yuxin Chen, Jian Jiao, Jiecao Yu, Buyun Zhang, Tongyi Tang, Xiaohan Wei, Yanli Zhao, Zeliang Chen, Yuchen Hao, Venkatesh Ranganathan, Sandeep Parab, Yantao Yao, Maxim Naumov, Chunzhi Yang, Shen Li, Ellie Wen, Wenlin Chen, Santanu Kolay, Chunqiang Tang
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration Authors: Jiahe Chen, Ziye Ma
GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference Authors: Zengzipeng Tang, Yuxuan Sun, Wei Chen, Jianwen Ding, Bo Ai
AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery Authors: Barbara Su, Fangshuo Liao, Anastasios Kyrillidis
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving Authors: Zhiqing Zhong, Zhijing Ye, Jian Zhang, Weijian Zheng, Bolun Sun, Xiaodong Yu
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World Authors: Christopher M. Bryant, Hao Liu
Locking Pretrained Weights via Deep Low-Rank Residual Distillation Authors: Keitaro Sakamoto, Pierre Ablin, Federico Danieli, Marco Cuturi
Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features Authors: Guangqi Li, Yongxin Li
Compute Where it Counts: Self Optimizing Language Models Authors: Yash Akhauri, Mohamed S. Abdelfattah
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds Authors: Yibang Li, Bihari Lal Pandey, Ravi Sah, Andi Han, Cyrus Mostajeran, Pratik Jawanpuria, Bamdev Mishra
Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization Authors: Yao Shu, Zilin Zhu
Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model Training Authors: Ting Sun, Junjie Zhang, Xiao Yan, Songxin Zhang, Zhuoyang Song, Jingyi Xi, Zunyao Mao, Bingyi Jing, Jiaxing Zhang, Zejian Xie
Adversary-Robust Learning from Fully Asynchronous Directional Derivative Estimates Authors: Anik Kumar Paul, Nibedita Roy, Nagesh Talagani, Swetha Ganesh, Gugan Thoppe, Alexandre Reiffers-Masson
Core-Halo Decomposition: Decentralizing Large-Scale Fixed-Point Problems Authors: Haixiang, Yang Xu, Jiefu Zhang, Xudong Wu, Zihan Zhou, Jun He, Jiayu Chen
Function-Space ADMM for Decentralized Federated Learning: A Control Theoretic Perspective Authors: Akihito Taya, Yuuki Nishiyama, Kaoru Sezaki
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning Authors: Yuhang Xu, Kaibin Tian, Yang Tian, Zhice Yang, Yifeng Yu, Yan Li, Shengzhong Liu, Fan Wu, Guihai Chen
TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators Authors: Chang Meng, Hanyu Wang, Yuyang Ye, Mingfei Yu, Wayne Burleson, Giovanni De Micheli
Unveiling High-Probability Generalization in Decentralized SGD Authors: Jiahuan Wang, Ping Luo, Ziqing Wen, Dongsheng Li, Tao Sun

Representation Learning Theory and Structure (48)

Learnability and Competition in High-Dimensional Multi-Component ICA Authors: Eser Ilke Genc, Samet Demir, Zafer Dogan
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery Authors: Andreas D. Demou, Panagiotis Koromilas, James Oldfield, Yannis Panagakis, Mihalis A. Nicolaou
Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning Authors: Nikhil J. Dhinagar, Vidhi Chhatbar, Chirag Jagad, Pavithra Senthilkumar, Sophia I. Thomopoulos, Mahir H. Khan, Sook-Lei Liew, the ENIGMA-Stroke Recovery Working Group, Paul M. Thompson
Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure Authors: Nilesh Sarkar, Dawar Jyoti Deka
The two clocks and the innovation window: When and how generative models learn rules Authors: Binxu Wang, Emma Lucia Byrnes Finn, Bingbin Liu
Bilinear autoencoders find interpretable manifolds Authors: Thomas Dooms, Ward Gauderis, Geraint Wiggins, Jose Oramas
The Geometric Structure of Models Learning Sparse Data Authors: Thomas Walker, T. Mitchell Roddenberry, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
Beyond Language: Format-Agnostic Reasoning Subspaces in Large Language Models Authors: Aojie Yuan, Zhiyuan Su
The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning Authors: James Hazelden, Laura Driscoll, Eli Shlizerman, Eric Shea-Brown
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks Authors: Minh-Toan Nguyen, Jean Barbier
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions Authors: Andrew Lee, Fernanda Viégas, Martin Wattenberg
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds Authors: Yiding Song, Hanming Ye
Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs Authors: Sohan Venkatesh
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations Authors: Rania Elbadry, Ahmed Heakl, Fan Zhang, Dani Bouch, Yuxia Wang, Preslav Nakov, Zhuohan Xie
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently Authors: Elisabetta Cornacchia, Dan Mikulincer, Elchanan Mossel
SMIXAE: Towards Unsupervised Manifold Discovery in Language Models Authors: Collin Francel
The Polynomial Counting Capabilities of Message Passing Neural Networks Authors: Marco Sälzer, Pascal Bergsträßer, Anthony W. Lin
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence? Authors: Zhaoyang Zhang, Run Shao, Dongyue Wu, Jiajie Teng, Chao Tao, Jingdong Chen, Haifeng Li
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models Authors: Zehao Li, Yasuhiro Yoshikai, Shumpei Nemoto, Hiroyuki Kusuhara, Tadahaya Mizuno
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations Authors: Su-Hyeon Kim, Yo-Sub Han
HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds Authors: Honghan Wu, Tianyan Wang, Jiacong Mi, Zhoyang Jiang, Yunsoo Kim
The Propagation Field: A Geometric Substrate Theory of Deep Learning Authors: Xingrui Gu
Neural Information Causality Authors: Jeongho Bang, Marcin Pawłowski
Neural Weight Norm = Kolmogorov Complexity Authors: Tiberiu Musat
Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal Authors: Aojie Yuan, Zhiyuan Julian Su, Haiyue Zhang, Yi Nian, Yue Zhao
Towards Effective Theory of LLMs: A Representation Learning Approach Authors: Muhammed Ustaomeroglu, Guannan Qu
Belief or Circuitry? Causal Evidence for In-Context Graph Learning Authors: Katharine Kowalyshyn, Timothy Duggan, Daniel Little, Michael C Hughes
How LLMs Are Persuaded: A Few Attention Heads, Rerouted Authors: Xiangkun Sun, Lingkai Kong, Aoqi Zhang, Liang Zeng, Tonghan Wang
Reasoning emerges from constrained inference manifolds in large language models Authors: Yanbiao Ma, Fei Luo, Linfeng Zhang, Chuangxin Zhao, Mingxuan Wang, Yinan Wu, Zhe Qian, Yang Lu, Long Chen, Zhao Cao, Xiaoshuai Hao, Ji-Rong Wen, Jungong Han
What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching Authors: Alec Helbling, Sebastian Gutierrez Hernandez, Benjamin Hoover, Duen Horng Chau, Parikshit Ram
Deterministic Decomposition of Stochastic Generative Dynamics Authors: Xingyu Song, Yuan Mei, Naoya Takeishi
Diagnosing Spectral Ceilings in Equivariant Neural Force Fields Authors: Hyunmog Kim
Generalization Error Bounds for Picard-Type Operator Learning in Nonlinear Parabolic PDEs Authors: Koichi Taniguchi, Sho Sonoda
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models Authors: Hamid Kazemi, Atoosa Chegini, Maria Safi
Physical probes expose and alleviate chemical-environment collapse in molecular representations Authors: Jiebin Fang, Zidi Yan, Churu Mao, Yongjun Jiang, Xinyi Tang, Lei Miao, Dan Lu, Yun Huang, Wanjing Ding, Zhongjun Ma
A Deep Risk Estimator for Known Operator Learning Authors: Andreas Maier, Md Hasan, Paulina Conrad, Paula Andrea Perez-Toro
Mistake-Bounded Language Generation Authors: Jon Kleinberg, Charlotte Peale, Omer Reingold
In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification Authors: Ming Liu
Measuring and Decomposing Mode Separation via the Canonical Diffusion Authors: Shaul Tolkovsky, Ori Meidler, Or Zuk
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs Authors: Krishak Aneja, Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri
Optimality of Sub-network Laplace Approximations: New Results and Methods Authors: Swarnali Raha, Kshitij Khare, Rohit K Patra
Embeddings for Preferences, Not Semantics Authors: Carter Blair, Ariel D. Procaccia, Milind Tambe
Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation Authors: Lucas Morisset, Alain Durmus, Adrien Hardy
Non-Parametric Rehearsal Learning via Conditional Mean Embeddings Authors: Wen-Bo Du, Tian-Zuo Wang, Han-Jia Ye, Zhi-Hua Zhou
The Pokémon Theorem and other Fairness Impossibility Results Authors: Daniel Matsui Smola, Alex Smola
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration Authors: Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li, Jian Lou, Zunlei Feng, Mingli Song, Ruoxi Jia, Tianwei Zhang
Prospective Compression in Human Abstraction Learning Authors: Leonardo Hernandez Cano, Ivan Zareski, Luisa El Amouri, Pinzhe Zhao, Max Mascini, Emanuele Sansone, Yewen Pu, Bonan Zhao, Marta Kryven
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck Authors: Zihan Huang, Junda Wu, Tong Yu, Qianqi Yan, Rohan Surana, Uttaran Bhattacharya, Lina Yao, Xin Eric Wang, Julian McAuley

Memory Structures and Agent Memory Systems (12)

HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing Authors: Yuan Fang, Yi Xie, Xuming Ran
VORT: Adaptive Power-Law Memory for NLP Transformers Authors: Nabil Mlaiki
Continuous Latent Contexts Enable Efficient Online Learning in Transformers Authors: Emile Anand, Abdullah Ateyeh, Xinyuan Cao, Max Dabagia
Factual recall in linear associative memories: sharp asymptotics and mechanistic insights Authors: Alessio Giorlandino, Sebastian Goldt, Antoine Maillard
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm Authors: Haoyu Wang, Yifan Shang, Zhongxiang Sun, Weijie Yu, Xiao Zhang, Jun Xu
Workspace Optimization: How to Train Your Agent Authors: Elad Sarafian, Gal Kaplun, Ron Banner, Daniel Soudry, Boris Ginsburg
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium Authors: Yuqiao Meng, Sakshi Sunil Narvekar, Luoxi Tang, Rupali Rajendra Vaje, Yingxue Zhang, Muchao Ye, Zhaohan Xi
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents Authors: Min Yang, Jinghua Piao, Xu Xia, Xiaochong Lan, Jiaju Chen, Yongshun Gong, Yong Li
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection Authors: Jared Glover
HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations Authors: Lennon J. Shikhman
Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning Authors: Debashis Guha
CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents Authors: Ziyang Yu, Qiyue Li, Liang Zhao

World Models, Exploration, and Open-Ended Reinforcement Learning (26)

Latent Geometry Beyond Search: Amortizing Planning in World Models Authors: Hoang Nguyen, Xiaohao Xu, Xiaonan Huang
LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations Authors: Qixin Xiao, Maani Ghaffari
From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay Authors: Yanan Xiao, Yixiang Tang, Zechen Feng, Lu Jiang, Minghao Yin, Pengyang Wang
Path-Coupled Bellman Flows for Distributional Reinforcement Learning Authors: Boyang Xu, Qing Zou, Siqin Yang, Hao Yan
Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with k-step Policy Gradients Authors: Alex DeWeese, Guannan Qu
ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models Authors: Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo, Yandan Yang, Bin Liu, Zhejia Cai, Feng Xiong, Mu Xu, Jiachen Luo, De Ma, Zhiheng Ma, Gang Pan
Do multimodal models imagine electric sheep? Authors: Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun
The Reciprocity Gradient Authors: Yue Lin, Pascal Poupart, Shuhui Zhu, Dan Qiao, Wenhao Li, Yuan Liu, Hongyuan Zha, Baoxiang Wang
Quantile-Coupled Flow Matching for Distributional Reinforcement Learning Authors: Michael Groom, Victor-Alexandru Darvariu, Lars Kunze, James Wilson, Nick Hawes
Generative Actor-Critic with Soft Bridge Policies Authors: Ke He, Le He, Shunpu Tang, Yafei Wang, Lisheng Fan
Zero-shot Imitation Learning by Latent Topology Mapping Authors: Maxwell J. Jacobson, Yexiang Xue
One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning Authors: Bowen He, Juncheng Dong, Lin Lin, Xiang Cheng
PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling Authors: Yaniv Oren, Viliam Vadocz, Joery A. de Vries, Wendelin Böhmer, Matthijs T. J. Spaan, Hendrik Baier
Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework Authors: Phalguni Nanda, Zaiwei Chen
Policy Gradient Methods for Non-Markovian Reinforcement Learning Authors: Avik Kar, Siddharth Chandak, Rahul Singh, Soumitra Sinhahajari, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos
Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift Authors: Surbhi Goel, Jonathan Pei, James Wang
On Characterizing Learnability for Adversarial Noisy Bandits Authors: Steve Hanneke, Kun Wang
Central Limit Theorem for Two-Time-Scale Approximate Distributionally Robust RL Authors: Shengbo Wang, Zexi Zhang
Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions Authors: Soumita Hait, Ping Li, Haipeng Luo, Mengxiao Zhang
Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies Authors: Muyun Lu, Haoyang Hong, Huazheng Wang, Ying Lin
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning Authors: Haoqiang Kang, Xiaokang Ye, Yuhan Liu, Siddhant Hitesh Mantri, Lingjun Mao, James Fleming, Drishti Regmi, Lianhui Qin
Continual Harness: Online Adaptation for Self-Improving Foundation Agents Authors: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning Authors: Chen Li, Zhantao Yang, Fangyi Chen, Han Zhang, Anudeepsekhar Bolimera, Marios Savvides
The Value of Mechanistic Priors in Sequential Decision Making Authors: Itai Shufaro, Gal Benor, Shie Mannor
Shields to Guarantee Probabilistic Safety in MDPs Authors: Linus Heck, Filip Macák, Roman Andriushchenko, Milan Češka, Sebastian Junges
Switching-Geometry Analysis of Deflated Q-Value Iteration Authors: Donghwan Lee

Architecture and Training Dynamics (66)

1. Hierarchical Mixture-of-Experts with Two-Stage Optimization

ArXiv ID: 2605.08292

Primary Topic: Architecture and Training Dynamics

Authors: Gleb Molodtsov, Alexander Miasnikov, Aleksandr Beznosikov

Abstract: Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.

Comment: Proposes hierarchical MoE routing with separate inter-group balancing and intra-group specialization to stabilize expert specialization and avoid collapse.

Topic Match: This is directly about a core architectural mechanism—MoE routing—and its training dynamics under balancing versus specialization.

Relevance: 10 Novelty: 8
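The abstract above describes two levels of routing control but does not give the objectives themselves. As a rough illustration only, the sketch below routes a token to a group and then to an expert inside that group, and computes two diagnostic quantities: a group-level imbalance score and a within-group usage entropy. The function names (`route`, `inter_group_imbalance`, `intra_group_entropy`), the top-1 routing rule, and both formulas are assumptions made for this example, not the Hi-MoE losses.

```python
import math

def route(group_logits, expert_logits):
    """Hypothetical two-level top-1 routing: pick a group, then an expert in it."""
    g = max(range(len(group_logits)), key=lambda i: group_logits[i])
    e = max(range(len(expert_logits[g])), key=lambda j: expert_logits[g][j])
    return g, e

def inter_group_imbalance(assignments, num_groups):
    """Squared deviation of each group's traffic share from uniform (0 = balanced)."""
    counts = [0] * num_groups
    for g, _ in assignments:
        counts[g] += 1
    n = len(assignments)
    return sum((c / n - 1.0 / num_groups) ** 2 for c in counts)

def intra_group_entropy(assignments, group, experts_per_group):
    """Entropy of expert usage inside one group (near 0 = within-group collapse)."""
    counts = [0] * experts_per_group
    for g, e in assignments:
        if g == group:
            counts[e] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Toy (group, expert) assignments for four tokens, 2 groups x 2 experts.
balanced = [(0, 0), (0, 1), (1, 0), (1, 1)]
collapsed = [(0, 0), (0, 0), (0, 0), (0, 0)]

print(inter_group_imbalance(balanced, 2), inter_group_imbalance(collapsed, 2))
print(intra_group_entropy(balanced, 0, 2), intra_group_entropy(collapsed, 0, 2))
```

In a training setup, quantities like these would typically be computed from soft router probabilities so they stay differentiable; the hard counts here are only meant to make the two levels easy to see.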

2. ELF: Embedded Language Flows

ArXiv ID: 2605.10938

Primary Topic: Architecture and Training Dynamics

Authors: Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He

Abstract: Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.

Comment: Introduces a continuous embedding-space flow-matching language model that stays continuous until the final token projection.

Topic Match: The main contribution is a new language-model architecture/formulation, adapting continuous flow matching to token generation in a mechanistically distinct way.

Relevance: 9 Novelty: 9
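To make the idea in the abstract above concrete, here is a minimal sketch of flow matching in an embedding space with a single discrete decoding step at the end. It uses the standard linear interpolation path and its constant target velocity. The toy two-token vocabulary, the oracle velocity function standing in for a trained network, and the nearest-embedding decoder are all assumptions for illustration; the abstract says ELF maps to tokens with a shared-weight network, which this sketch does not reproduce.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field on the linear path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def decode(x, embeddings):
    """Assumed decoder: return the token whose embedding is closest to x."""
    return min(
        embeddings,
        key=lambda tok: sum((xi - ei) ** 2 for xi, ei in zip(x, embeddings[tok])),
    )

# Hypothetical 2-d embedding table and a fixed "noise" start point.
embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
x0 = [0.3, -0.2]
v = target_velocity(x0, embeddings["dog"])

# The sampler stays in continuous space for every step; only the last line is discrete.
x_final = euler_sample(x0, lambda x, t: v, steps=4)
token = decode(x_final, embeddings)
print(token)
```

Because the whole trajectory lives in continuous space, guidance techniques that modify the velocity field can be applied at each Euler step without touching the decoder.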

3. Priming: Hybrid State Space Models From Pre-trained Transformers

ArXiv ID: 2605.08301

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training, Memory Structures and Agent Memory Systems

Authors: Aditya Chattopadhyay, Elvis Nunez, Prannay Kaul, Benjamin Bowman, Evan Becker, Luca Zancato, David Thomas, Wei Xia, Stefano Soatto

Abstract: Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA>GDN>Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under Apache 2.0 License.

Comment: Shows how to convert pretrained Transformers into hybrid attention-SSM models with minimal extra training, enabling controlled large-scale comparison of SSM layer types.

Topic Match: The core contribution is architectural: a general method for exploring hybrid Transformer-SSM design and comparing recurrent memory mechanisms at scale.

Relevance: 9 Novelty: 8

4. Attention Drift: What Autoregressive Speculative Decoding Models Learn

ArXiv ID: 2605.09992

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia

Abstract: Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously unreported phenomenon we call attention drift: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently generated tokens. We observe this across both EAGLE3 drafters and MTP heads, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter's hidden state magnitude grows monotonically with chain depth, which exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than as a standalone autoregressive predictor. To limit this growth, we propose two architectural changes: post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to $2\times$ under template perturbation, $1.18\times$ on long-context tasks, and $1.10\times$ on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize over longer drafting sequences.

Comment: Identifies attention drift in speculative decoding drafters and ties it to unnormalized residual growth, then fixes it with normalization-based architectural changes.

Topic Match: The main contribution is mechanistic analysis of drafter dynamics plus a simple architectural stabilization that improves speculative decoding behavior.

Relevance: 9 Novelty: 8
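
To make the residual-growth argument in this entry concrete, here is a minimal toy sketch in plain Python. It is not the authors' drafter: `toy_block` is a hypothetical stand-in for one chain step, and the only point is to contrast a pre-norm update (the residual path is never normalized) with a post-norm update (the state is renormalized after each residual add).

```python
import math

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]

def toy_block(x):
    # Hypothetical stand-in for one drafter step (assumption, not the EAGLE3 layer).
    return [math.tanh(v) + 0.5 for v in x]

def run_chain(h, steps, post_norm):
    """Return the RMS of the hidden state after each chain step."""
    magnitudes = []
    for _ in range(steps):
        if post_norm:
            # Post-norm: normalize after the residual add, so the state stays at RMS ~ 1.
            h = rms_norm([a + b for a, b in zip(h, toy_block(h))])
        else:
            # Pre-norm: only the branch input is normalized; the residual path is not.
            h = [a + b for a, b in zip(h, toy_block(rms_norm(h)))]
        magnitudes.append(rms(h))
    return magnitudes

h0 = [0.1, -0.2, 0.3, 0.4]
pre = run_chain(h0, steps=8, post_norm=False)
post = run_chain(h0, steps=8, post_norm=True)
print("pre-norm RMS per step: ", [round(m, 2) for m in pre])
print("post-norm RMS per step:", [round(m, 2) for m in post])
```

In this toy setting the pre-norm magnitudes increase with every step while the post-norm magnitudes stay near 1, which mirrors the growth pattern the abstract describes, under the assumptions stated above.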

5. Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

ArXiv ID: 2605.09403

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Gabriel Smithline, Chris Mascioli

Abstract: Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis and into distributed subspaces, making neuron-level interpretability less informative while preserving structured computation. We validate these conclusions with random-routing, narrow-FFN, and top-2 MoE controls, plus parameter-matching, activation-function, and width-scaling analyses. Together, these results show that local FFN design choices can have nonlocal consequences for Transformer computation.

Comment: Shows that sparse FFN designs like MoE can relocate computation into attention, and that random routing nearly matches learned routing in driving this redistribution.

Topic Match: This is directly about architectural mechanism: how FFN sparsity and routing alter where computation happens inside Transformers.

Relevance: 9 Novelty: 8

6. A Single-Layer Model Can Do Language Modeling

ArXiv ID: 2605.10643

Primary Topic: Architecture and Training Dynamics

Also Matches: Memory Structures and Agent Memory Systems

Authors: Zanmin Wang

Abstract: Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads that split spontaneously into fast and slow retention pools.

Comment: Explores how far a single recurrent block with one shared state vector can go in language modeling, and analyzes the emergent memory geometry of that state.

Topic Match: The central contribution is an extreme recurrent architectural alternative to deep stacking, with direct analysis of its internal memory behavior.

Relevance: 9 Novelty: 8

7. TIDES: Implicit Time-Awareness in Selective State Space Models

ArXiv ID: 2605.09742

Primary Topic: Architecture and Training Dynamics

Authors: Taylan Soydan, Miguel A. Bessa, Dirk Mohr, Rui Barreira

Abstract: Selective state space models (SSMs), such as Mamba, achieve strong per-token expressivity by making the time discretization step $\Tilde{\Delta}$ a learned function of the input. However, in doing so, $\Tilde{\Delta}$ ceases to represent a physical sampling interval, limiting its irregular time series modeling capability. Continuous-time SSMs, such as S5, preserve the physical meaning of $\Tilde{\Delta}$ and handle irregular timestamps natively ($\Tilde{\Delta}\equiv\Delta$), but their dynamics remain linear time-invariant (LTI), limiting per-token expressivity. We propose TIDES, a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, $\Tilde{\Delta}$ retains its physical meaning, tied to the state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective. We show this on a novel Fading Flash experimental benchmark, a compact controlled diagnostic for sequence models that jointly tests input-dependence and extrapolation to out-of-distribution $\Delta$ values, and isolates the distinct failure modes of current state-of-the-art architectures that TIDES avoids by construction. On large-scale benchmarks, TIDES sets the new state-of-the-art average rank on UEA time-series classification and the Physiome-ODE regression benchmark. Code available at: https://github.com/TaylanSoydan/TIDES.

Comment: Reconciles selective SSM expressivity with physically meaningful irregular-time discretization by moving input dependence from step size to the state matrix.

Topic Match: This is directly about core sequence-model architecture, fixing a conceptual limitation in selective SSMs for irregularly sampled data.

Relevance: 9 Novelty: 8

8. FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences

ArXiv ID: 2605.08833

Primary Topic: Architecture and Training Dynamics

Also Matches: Memory Structures and Agent Memory Systems

Authors: Mengqi Li, Wensheng Lin, Jinshuai Yang, Lixin Li

Abstract: Effective sequence modeling fundamentally requires balancing the retention of unbounded history with the high-resolution detection of abrupt short-term variations common in real-world phenomena. However, existing state space models (SSMs) relying on high-order polynomial projection operators (HiPPO) face a critical trade-off where uniform measures dilute recent information to maintain timescale invariance, while exponential measures sacrifice global context to capture local dynamics. This paper proposes a Fractional Recurrent Architecture for Computational Temporal Analysis of Long sequences (FRACTAL), a novel architecture integrating fractional measure theory into recursive memory updates to address this limitation. By deriving projection operators with analytically characterized spectral properties and a tunable singularity index, the proposed method amplifies sensitivity to recent signal perturbations while preserving the spectral structure that encodes scale-invariant memory dynamics. This theoretical innovation is instantiated within a simplified diagonalized state space framework by modulating input projection initialization to enable simultaneous capture of multi-scale temporal features. FRACTAL achieves an average score of 87.11% on the Long Range Arena benchmark, including 61.85% on the ListOps task, outperforming the S5 model.

Comment: Introduces a fractional recurrent state-space architecture that analytically tunes recent-signal sensitivity while preserving long-range scale-invariant memory.

Topic Match: The paper proposes a new sequence-model architecture based on fractional-measure memory dynamics, making architecture the clearest primary fit.

Relevance: 9 Novelty: 8

9. Complex-Valued Phase-Coherent Transformer

ArXiv ID: 2605.10123

Primary Topic: Architecture and Training Dynamics

Authors: Leona Hioki

Abstract: Complex-valued Transformers have largely inherited softmax attention from real-valued architectures. However, row-normalised token competition is not necessarily aligned with phase-preserving computation. In this paper, we introduce the Phase-Coherent Transformer (PCT), which applies a real-valued, element-independent, smooth gate to L2-normalised complex query-key similarities. PCT replaces token competition with token-non-competing attention and is designed to preserve phase information across layers. Across mid-scale benchmarks spanning long-range memory, hierarchical long-range reasoning, positional retrieval, phase-based memory and superposition, and image classification, PCT shows strong generalisation across task categories. Under parameter-fair comparison, PCT consistently outperforms both the standard softmax Transformer and its direct complex-valued counterpart. Moreover, even on tasks traditionally considered difficult for complex-valued neural networks, such as NIAH and LRA-Text, PCT remains competitive with Multiscreen, the strongest real-valued NN baseline in our comparison. Experiments introducing gates that deliberately violate the PCT conditions show that the design is not incidental: smooth gates that preserve negatively aligned phase components remain strong, whereas gates that delete such components collapse on long-range retrieval, and gates whose outputs become excessively large suffer clear performance degradation. PCT also shows no depth-related accuracy collapse across the tested depth range. These results support introducing multi-layer phase-coherent structure into attention as a promising design principle for achieving generalisation in complex-valued Transformers.

Comment: Replaces softmax competition with phase-coherent gated complex attention to preserve phase information across transformer layers.

Topic Match: This is a direct architectural mechanism paper on attention design and stability in complex-valued transformers.

Relevance: 9 Novelty: 8

10. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

ArXiv ID: 2605.10933

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu

Abstract: While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00$\times$ speedup on real hardware compared with dense inference. Codes and checkpoints will be released.

Comment: Proposes a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU to match dense models under equal parameter budgets.

Topic Match: The strongest contribution is architectural: a new MoE design and routing/activation choices that change sparse model behavior, with efficiency as a consequence.

Relevance: 9 Novelty: 8
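
The ReLU-based routing mentioned in this entry can be illustrated with a small sketch. Everything below is an assumption made for illustration: the function names, the toy scalar experts, and the exact way the per-expert scale enters the gate are hypothetical, and the shared expert and NormSiLU are omitted. The sketch only shows the general idea that a ReLU gate yields exact zeros, so the set of active experts varies per token instead of being fixed by a top-k rule.

```python
def relu(z):
    return max(0.0, z)

def relu_route(x, router_rows, expert_scale):
    """Gate for expert e: scale_e * relu(<router_e, x>); a zero gate means 'skip'."""
    gates = []
    for row, scale in zip(router_rows, expert_scale):
        logit = sum(w * v for w, v in zip(row, x))
        gates.append(scale * relu(logit))
    return gates

def moe_forward(x, router_rows, expert_scale, experts):
    """Sum gate-weighted outputs of the experts whose gate is non-zero."""
    gates = relu_route(x, router_rows, expert_scale)
    out = [0.0] * len(x)
    active = []
    for idx, (gate, expert) in enumerate(zip(gates, experts)):
        if gate == 0.0:
            continue  # exact zero from ReLU: this expert is never evaluated
        active.append(idx)
        out = [o + gate * y for o, y in zip(out, expert(x))]
    return out, active

# Hypothetical toy experts: each one just rescales its input.
experts = [lambda v, c=c: [c * t for t in v] for c in (1.0, 2.0, 3.0, 4.0)]
router_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
expert_scale = [1.0, 1.0, 1.0, 1.0]  # stands in for the learnable expert-wise scaling

out, active = moe_forward([1.0, -2.0], router_rows, expert_scale, experts)
print("active experts:", active, "output:", out)
```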

11. Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

ArXiv ID: 2605.09608

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Yuanyi Wang, Yifan Yang, Su Lu, Yanggan Gu, Pengkai Wang, Wenjun Wang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Jialun Cao, Shing-Chi Cheung, Hongxia Yang

Abstract: Continual post-training aims to extend large language models (LLMs) with new knowledge, skills, and behaviors, yet it remains unclear when sequential updates enable capability transfer and when they cause catastrophic forgetting. Existing methods mitigate forgetting through sequential fine-tuning, replay, regularization, or model merging, but offer limited criteria for determining when incorporating new updates is beneficial or harmful. In this work, we study LLM continual post-training through three questions: What drives forgetting? When do sequentially acquired capabilities transfer or interfere? How can compatibility be used to control update integration? We address these questions through task geometry: we represent each post-training task by its parameter update and study the covariance geometry induced by the update. Our central finding is that forgetting can be considered a state-relative update-integration failure: it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high. Motivated by this finding, we propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free update-integration method that constructs a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Across Qwen3 0.6B--14B on domain-continual and capability-continual settings, GCWM consistently outperforms data-free baselines, improving retention and final performance without replay data. These results identify geometry conflict as both an explanatory signal for forgetting and a practical control signal for LLM continual post-training.

Comment: Explains continual post-training forgetting via state-relative covariance geometry of task updates and uses that geometry to gate data-free update merging.

Topic Match: The core contribution is a training-dynamics account of interference and a geometry-aware mechanism for integrating sequential updates.

Relevance: 9 Novelty: 8

ArXiv ID: 2605.08123

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Dylan Forde

Abstract: We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped $T$-step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the production $R=2$ case, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the $R=2$ score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost $O((T+R)LW)$, $O(Ld)$ input storage, and $O(L)$ additional HBM usage for fixed head dimension $d$ and band width $W$. We also formalize the current `dustbin_block` path as the same balanced surrogate on an augmented support, so the schedule lifts to the gap-aware transport path used in our TPU runs. We provide a local surrogate-bias bound, an a posteriori bias certificate, and a projective contraction certificate for strictly positive active blocks. On synthetic masked problems, the optimized kernel matches exact autodiff of the same centered surrogate to within $10^{-5}$--$10^{-10}$. On TPU v6e-8, a four-configuration Pfam screen completes end-to-end, and a promoted balanced $R=2$ run sustains roughly $8.5$ examples per second through a three-hour budget, reaching step $1437$. Held-out Pfam test shards improve reconstruction from $3.17$ to $0.99$ and sparse CE from $5.86$ to $5.69$ relative to step $0$. These results support exact fixed-depth backward theory, a theorem-matching gap-aware bridge, and trainability evidence for the production path.

Comment: Block-wise exact backward theory for fixed-depth Sinkhorn OT attention with theorem-matched low-memory TPU schedule.

Topic Match: The core contribution is a new attention mechanism training/backward formulation with exact gradient structure and trainability analysis, making architecture/training dynamics the best fit.

Relevance: 9 Novelty: 8

13. A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

ArXiv ID: 2605.08297

Primary Topic: Architecture and Training Dynamics

Authors: Daning Cheng, Zeyu Liu, Jun Sun, Fen Xia, Boyang Zhang, Dongping Liu, Yunquan Zhang

Abstract: The scaling behavior, in which test performance often improves as model size and data increase, is a central empirical phenomenon in modern deep learning, yet its theoretical basis remains incomplete. In this paper, we study depth expansion in normalized residual networks: starting from a trained model in an old hypothesis class, we insert a new residual block at an intermediate layer and ask when such an expansion can yield a provable improvement in test risk. We develop a unified framework that decomposes this question into representational gain, optimization gain, and generalization transfer. First, under a first-order descent condition near zero initialization, we prove that the expanded hypothesis class contains an auxiliary jumpboard model with strictly smaller population risk than the original model. Second, under norm control tailored to post-normalized residual architectures, we establish a norm-based Rademacher complexity bound for the expanded model class. These ingredients lead to two complementary test-risk guarantees: one route passes through population risk and is tighter when a positive population margin is available, while the other works directly at the train/test level, avoids Hoeffding transfer, and is more robust in degenerate regimes. Together, these results provide a theorem-driven mechanism under which residual depth expansion can improve test performance in normalized residual networks. More broadly, they suggest that scaling is inherently joint: depth creates new improving directions, width enhances the finite-sample observability of weak signals, and data determines whether the statistical cost of expansion can be controlled.

Comment: Provides a theorem-driven mechanism for why depth expansion in normalized residual networks can improve test risk.

Topic Match: This is directly about residual architecture scaling and training/generalization dynamics, with formal guarantees tied to normalized residual design.

Relevance: 9 Novelty: 8

14. SDG-MoE: Signed Debate Graph Mixture-of-Experts

ArXiv ID: 2605.08322

Primary Topic: Architecture and Training Dynamics

Authors: Stepan Kulibaba, Kirill Labzin, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gansnikov, Aleksei Shpilman

Abstract: Sparse MoE models achieve a good balance between capacity and compute by routing each token to a small subset of experts. However, in most MoE architectures, once a token is routed, the selected experts process it independently and their outputs are combined via a weighted sum. This leaves open whether enabling communication among them could improve performance. While prior work has raised this question, direct interaction among the active routed experts remains underexplored. In this paper, we propose SDG-MoE (Signed Debate Graph Mixture-of-Experts), a novel architecture that adds a lightweight, iterative deliberation step before final aggregation. SDG-MoE introduces three components: (i) two learned interaction matrices over the active experts, a support graph $A^+$ and a critique graph $A^-$, capturing reinforcing and corrective influences; (ii) a signed message-passing step that updates expert representations before aggregation; and (iii) a disagreement-gated Friedkin-Johnsen-style anchoring that controls deliberation strength while preventing expert drift. Together, these enable a structured deliberation process where interaction strength scales with disagreement and specialization is preserved. We also provide a theoretical analysis establishing stability conditions on expert states and showing that deliberation adds only low-order overhead over the active set. In controlled three-seed pretraining experiments, SDG-MoE improves validation perplexity over both an unsigned graph communication baseline and vanilla MoE, outperforming the strongest baseline by 19.8%, and gives the best external perplexity on WikiText-103, C4, and Paloma among the compared systems.

Comment: Introduces expert-to-expert signed deliberation inside MoE routing with stability analysis.

Topic Match: The paper proposes a new MoE computational mechanism—structured communication among active experts—rather than an application of MoE.

Relevance: 9 Novelty: 8

15. Key-Value Means

ArXiv ID: 2605.09877

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training, Memory Structures and Agent Memory Systems

Authors: Daniel Goldstein, Eugene Cheah

Abstract: We present Key-Value Means ("KVM"), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields a strong $O(N)$ chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV-cache memory, and allowing a continuous range of choices of prefill time complexity between $O(N)$ and $O(N^2)$. It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/recursal/KVM-paper and trained models at https://huggingface.co/collections/recursal/key-value-means under the Apache 2.0 license.

Comment: Proposes a unified block-recurrent attention with fixed or growing state, bridging transformer caches and linear-time recurrence.

Topic Match: This is fundamentally a new sequence-model architecture that trades off recurrent state, cache growth, and attention-style context usage.

Relevance: 9 Novelty: 8

16. Continuity Laws for Sequential Models

ArXiv ID: 2605.08539

Primary Topic: Architecture and Training Dynamics

Authors: Annan Yu, Dongwei Lyu, N. Benjamin Erichson

Abstract: Inductive biases influence the behavior and performance of sequential models. In this work, we study an underexplored inductive bias in sequential modeling: continuity in time. We ask a simple question: do models motivated by continuous-time formulations, such as state-space models, actually behave continuously in time, and does this translate into better performance on tasks with continuous temporal structure? To answer this, we formalize model continuity as convergence under temporal refinement, where a model is continuous if its predictions approach an underlying continuous trajectory as the temporal discretization is refined. We show that S4 exhibits stable continuous behavior, whereas S6 (the core of Mamba) can be more sensitive to input amplitude and selective dynamics, despite being derived from a continuous dynamical system. To study whether this distinction matters for learning, we also need a corresponding notion of task continuity. We therefore introduce a metric to quantify the continuity of datasets directly from their temporal structure. Across benchmarks, we find a clear empirical alignment between task continuity, model continuity, and model performance. Beyond an inductive bias, continuity also has practical consequences: we show that it enables a simple temporal subsampling strategy that improves both efficiency and performance.

Comment: Formalizes continuity under temporal refinement and shows differing continuity behavior between S4 and Mamba-style S6 models.

Topic Match: This directly targets sequential architecture behavior and inductive bias, especially state-space modeling mechanisms and their consequences.

Relevance: 9 Novelty: 8
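
The notion of continuity as convergence under temporal refinement can be illustrated on the simplest possible case. The sketch below is an assumption-laden toy, not the paper's metric or models: it takes a scalar linear time-invariant system $x' = ax + u(t)$ with $a = -1$ and $u(t) = \sin t$, discretizes it with a zero-order hold, and checks that the state at a fixed time $T$ approaches the closed-form continuous solution as the step count grows.

```python
import math

A = -1.0  # decay rate of the toy system (assumption for this sketch)

def zoh_final_state(n_steps, T=1.0):
    """State at time T of x' = A*x + sin(t), x(0) = 0, under zero-order hold."""
    dt = T / n_steps
    decay = math.exp(A * dt)           # exact state transition over one step
    input_gain = (decay - 1.0) / A     # integral of exp(A*s) over [0, dt]
    x = 0.0
    for k in range(n_steps):
        u = math.sin(k * dt)           # input held constant on [k*dt, (k+1)*dt)
        x = decay * x + input_gain * u
    return x

def exact_final_state(T=1.0):
    """Closed-form solution of x' = -x + sin(t), x(0) = 0, evaluated at T."""
    return 0.5 * (math.sin(T) - math.cos(T)) + 0.5 * math.exp(-T)

target = exact_final_state()
errors = [abs(zoh_final_state(n) - target) for n in (10, 100, 1000)]
for n, err in zip((10, 100, 1000), errors):
    print(f"steps={n:5d}  |error|={err:.6f}")
```

The error shrinks roughly in proportion to the step size, which is the behavior one would expect from a model that is continuous in the refinement sense; whether a given learned model behaves this way is the empirical question the paper studies.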

17. Muown: Row-Norm Control for Muon Optimization

ArXiv ID: 2605.10797

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Kai Lion, Florian Hübler, Bingcong Li, Antonio Orvieto, Niao He

Abstract: Muon has emerged as a strong competitor to AdamW for language model pre-training, yet its behavior at scale is sensitive to weight decay. Recent work has observed that, for Muon without decoupled weight decay, the spectral norm of weight matrices drifts upward over training. Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor, we identify the former as the empirical driver of this drift under Muon, while the latter remains well-behaved along the trajectory. Motivated by this diagnosis, we introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector as an explicit optimizer variable, updating it under the $\ell_\infty$ geometry induced by the decomposition, while applying Muon unchanged to the remaining direction component. We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic regimes under a dual norm aligned with the underlying geometries and with a stochastic noise coefficient that empirically remains below that of Muon throughout training. Across GPT-style pre-training on FineWeb-Edu with model sizes from 124M up to 2.7B parameters, Muown improves perplexity over Muon, SOAP, AdamW, and Lion. It also widens the plateau of near-optimal learning rates across model scales, reduces sensitivity to weight decay, and avoids the spectral norm drift at negligible step-time overhead when appropriately sharded.

Comment: Introduces a row-norm-controlled variant of Muon to stabilize spectral behavior and improve large-scale pretraining.

Topic Match: Primary fit is architecture/training dynamics because the paper centers on optimizer-induced training behavior, norm drift diagnosis, and a mechanistically motivated update rule.

Relevance: 9 Novelty: 8
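
One way to read the row-magnitude versus row-coherence split is sketched below. This is an assumed formulation for illustration, not necessarily the paper's exact decomposition: write $W = \mathrm{diag}(r)\,U$ where $r_i$ is the norm of row $i$ and $U$ has unit-norm rows, so that submultiplicativity gives $\|W\|_2 \le \max_i r_i \cdot \|U\|_2$. The helper names are hypothetical, and the spectral norm is estimated by power iteration to keep the example dependency-free.

```python
import math

def spectral_norm(W, iters=200):
    """Largest singular value of W, estimated by power iteration on W^T W."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(iters):
        Wv = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        g = [sum(W[i][j] * Wv[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in g))
        v = [x / norm for x in g]
    Wv = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in Wv))

def split_rows(W):
    """Return (row magnitudes r, unit-row direction matrix U); rows must be non-zero."""
    magnitudes = [math.sqrt(sum(x * x for x in row)) for row in W]
    directions = [[x / m for x in row] for row, m in zip(W, magnitudes)]
    return magnitudes, directions

W = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
r, U = split_rows(W)
spec_W = spectral_norm(W)
spec_U = spectral_norm(U)
print("row magnitudes:", r)
print("||W||_2 =", round(spec_W, 4), " bound =", round(max(r) * spec_U, 4))
```

Under this reading, growth in the spectral norm can come from either factor, and the abstract attributes the drift observed under Muon to the magnitude factor.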

18. Kaczmarz Linear Attention

ArXiv ID: 2605.08587

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Jiaxuan Zou, Ruifeng Ren, Yong Liu

Abstract: Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ for residual updates. We propose Kaczmarz Linear Attention (KLA), a one-scalar modification of GDN that preserves the state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA achieves the lowest validation perplexity among evaluated linear-time baselines, 8.09 versus 8.50 for GDN, and remains stable up to 65K tokens. On controlled tasks, KLA reaches 100% on single-needle-in-a-haystack retrieval, improves 8x multi-query associative recall by 7.03 points over GDN, and delivers 2.1x higher decode throughput at 32K context. These results suggest that the key-norm-normalized Kaczmarz coefficient is a first-order design axis for delta-rule sequence models: it improves accuracy, extrapolation, and decoding efficiency without changing the recurrent state or hardware kernel.

Comment: Derives a key-norm-normalized residual update from the online-regression objective for delta-rule linear attention.

Topic Match: Primary fit is architecture/training because the contribution is a principled modification to a recurrent linear-attention state update, directly about sequence-model mechanism design.

Relevance: 9 Novelty: 8
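
The step size $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ has a simple interpretation that a small sketch can show. The code below assumes the standard gated delta-rule form $S \leftarrow \alpha S + \beta\,(v - \alpha S k)\,k^\top$; the function names and the toy values are hypothetical and this is not the authors' kernel. With $\eta = 1$ the update is a Kaczmarz projection: after the write, the state maps the key (almost) exactly to the value, regardless of the key's norm.

```python
def mat_vec(S, k):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]

def delta_update(S, k, v, beta, alpha=1.0):
    """Gated delta rule: S <- alpha*S + beta * (v - alpha*S k) k^T."""
    decayed = [[alpha * s for s in row] for row in S]
    pred = mat_vec(decayed, k)  # what the decayed state currently retrieves for k
    return [[s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
            for row, vi, pi in zip(decayed, v, pred)]

def kaczmarz_beta(k, eta=1.0, eps=1e-6):
    """Key-norm-normalized step size from the abstract."""
    return eta / (sum(x * x for x in k) + eps)

S0 = [[0.0, 0.0], [0.0, 0.0]]
k = [2.0, 0.0]   # a key whose norm is not 1
v = [1.0, 1.0]

fixed = mat_vec(delta_update(S0, k, v, beta=1.0), k)
normalized = mat_vec(delta_update(S0, k, v, beta=kaczmarz_beta(k)), k)
print("retrieval with beta = 1:        ", fixed)
print("retrieval with Kaczmarz beta_t: ", normalized)
```

With a fixed coefficient the write overshoots when the key norm exceeds 1, while the normalized coefficient retrieves approximately `v`, which is one way to see why the normalization matters for update magnitude.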

19. Scaling Limits of Long-Context Transformers

ArXiv ID: 2605.08505

Primary Topic: Architecture and Training Dynamics

Authors: Giuseppe Bruno, Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

Abstract: We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $\beta_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $\beta_n^\ast \asymp n^{2/(d-1)}$ for uniform keys on $\mathbb{S}^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $\beta_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.

Comment: Analyzes the long-context scaling limit of softmax attention, identifying subcritical, critical, and collapse regimes.

Topic Match: Primary fit is architecture/training dynamics since it gives theory for a core Transformer mechanism—how attention behaves as context grows under temperature scaling.

Relevance: 9 Novelty: 8
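
Illustration (not from the paper): a small simulation of the regimes described in the abstract above, tracking the largest softmax weight for a fixed query and uniform keys on the sphere as the inverse temperature is scaled relative to $n^{2/(d-1)}$. The helper `top_attention_weight` and the chosen multipliers are assumptions for this sketch, and the critical scale is only used up to constants.

```python
import numpy as np

def top_attention_weight(keys, query, beta):
    """Largest softmax weight for scores beta * <query, key>."""
    scores = beta * (keys @ query)
    scores -= scores.max()               # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights.max()

rng = np.random.default_rng(0)
n, d = 2000, 3
keys = rng.normal(size=(n, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)   # uniform on the sphere S^{d-1}
query = np.zeros(d)
query[0] = 1.0

beta_crit = n ** (2.0 / (d - 1))         # critical scale from the abstract, up to constants
for factor in (0.0, 0.01, 1.0, 100.0):
    print(factor, top_attention_weight(keys, query, factor * beta_crit))
```

At `beta = 0` every key receives weight `1/n` (uniform averaging). As the multiplier grows, the top weight increases toward 1, which corresponds to the collapse onto the closest key.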

20. Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

ArXiv ID: 2605.08811

Primary Topic: Architecture and Training Dynamics

Authors: Zhongjie Shi, Wenjing Liao

Abstract: This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-Hölder continuous functions with $\alpha \in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.

Comment: Provides a constructive learning theory for Transformers via softmax-based partition-of-unity local-to-global approximation.

Topic Match: Best fit is architecture/training because it develops approximation and generalization theory tied specifically to the Transformer attention mechanism.

Relevance: 9 Novelty: 8
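
Illustration (not from the paper): a one-dimensional sketch of the "softmax partition of unity" idea in the abstract above. Softmax weights over affine scores localize around a grid of centers, and piecewise-constant local approximations are aggregated into a global one. The grid, the sharpness `tau`, the test function, and the use of constant local approximations are all assumptions for this example, not the paper's construction.

```python
import numpy as np

def softmax_partition(x, centers, tau):
    """Weights w_j(x) = softmax_j(-tau * (x - c_j)^2), written with affine scores.

    The x^2 term is shared by all j and cancels inside the softmax, so the
    scores 2*tau*c_j*x - tau*c_j^2 are affine in x.
    """
    scores = 2.0 * tau * np.outer(x, centers) - tau * centers**2   # (n_x, n_centers)
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def target(x):
    return np.sin(2.0 * np.pi * x)

centers = np.linspace(0.0, 1.0, 51)
x = np.linspace(0.0, 1.0, 1001)
w = softmax_partition(x, centers, tau=5000.0)

approx = w @ target(centers)             # aggregate local constants f(c_j)
max_error = np.max(np.abs(approx - target(x)))
print(max_error)
```

The weights are nonnegative and sum to one at every input, which is what makes them a partition of unity. Refining the grid and raising `tau` together tightens the uniform error for smooth targets.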

21. Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

ArXiv ID: 2605.08475

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Mingsong Yan, Dongyang Li, Charles Kulick, Sui Tang

Abstract: Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/\epsilon))$ blocks and MLP width $O(\sqrt{N/\epsilon})$ that achieves $\epsilon$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer's layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.

Comment: Shows a standard softmax transformer can mechanistically implement preconditioned Richardson iteration for nonlinear in-context Gaussian kernel regression with prediction guarantees.

Topic Match: The core contribution is a mechanistic architectural analysis of what transformer blocks compute during in-context learning, making architecture/training dynamics the best fit.

Relevance: 9 Novelty: 8
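
Illustration (not from the paper): a minimal NumPy sketch of preconditioned Richardson iteration for Gaussian-kernel ridge regression, the classical solver named in the abstract above. Using the inverse row sums of the regularized kernel matrix as the preconditioner is an assumption chosen to echo the "row-normalized" operator mentioned there; the bandwidth, ridge parameter `lam`, and step count are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth**2))

def preconditioned_richardson(A, y, steps=500, omega=1.0):
    """Iterate alpha <- alpha + omega * D^{-1} (y - A alpha), D = diag(row sums of A)."""
    d_inv = 1.0 / A.sum(axis=1)
    alpha = np.zeros_like(y)
    for _ in range(steps):
        alpha = alpha + omega * d_inv * (y - A @ alpha)
    return alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(8, 2))          # in-context inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=8)   # in-context labels
lam = 0.5
A = gaussian_kernel(X, X) + lam * np.eye(8)      # KRR linear system A alpha = y

alpha_iter = preconditioned_richardson(A, y)
alpha_direct = np.linalg.solve(A, y)
x_query = np.array([[0.2, -0.3]])
prediction = gaussian_kernel(x_query, X) @ alpha_iter
print(np.max(np.abs(alpha_iter - alpha_direct)), prediction)
```

Each iteration needs only a kernel-weighted sum across context points plus elementwise arithmetic, which is the split between cross-token and intra-token computation that the abstract attributes to attention and MLP layers respectively.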

22. Mixture of Layers with Hybrid Attention

ArXiv ID: 2605.09516

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Ivan Ternovtsii, Yurii Bilak

Abstract: Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.

Comment: Proposes Mixture of Layers, routing across thin transformer blocks instead of only within-layer experts, plus hybrid attention to preserve coverage under sparse block routing.

Topic Match: This is a core architectural contribution on sparse modular computation and routing, not merely an efficiency tweak.

Relevance: 9 Novelty: 8
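
Illustration (not from the paper): a toy NumPy sketch of top-k routing across parallel thin blocks with learned down/up projections, as described in the Mixture of Layers abstract above. Each "block" here is a stand-in (down-project, `tanh`, up-project) rather than a transformer block, hybrid attention is omitted, and the names `mixture_of_layers`, `W_router`, `W_down`, and `W_up` are assumptions for this example.

```python
import numpy as np

def mixture_of_layers(x, W_router, W_down, W_up, top_k=2):
    """Route each token to top_k of K thin blocks and sum their gated outputs.

    x: (T, d_model); W_router: (d_model, K);
    W_down: (K, d_model, d_thin); W_up: (K, d_thin, d_model).
    """
    logits = x @ W_router                                   # (T, K)
    chosen = np.argsort(-logits, axis=1)[:, :top_k]         # indices of the top_k blocks
    picked = np.take_along_axis(logits, chosen, axis=1)
    gates = np.exp(picked - picked.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)               # softmax over chosen blocks

    out = x.copy()                                          # residual stream
    for t in range(x.shape[0]):
        for slot in range(top_k):
            b = chosen[t, slot]
            h = np.tanh(x[t] @ W_down[b])                   # thin computation in d_thin
            out[t] += gates[t, slot] * (h @ W_up[b])
    return out, chosen

rng = np.random.default_rng(0)
T, d_model, d_thin, K = 5, 16, 4, 6
x = rng.normal(size=(T, d_model))
W_router = rng.normal(size=(d_model, K))
W_down = rng.normal(size=(K, d_model, d_thin)) / np.sqrt(d_model)
W_up = rng.normal(size=(K, d_thin, d_model)) / np.sqrt(d_thin)

out, chosen = mixture_of_layers(x, W_router, W_down, W_up, top_k=2)
print(out.shape, chosen)
```

Because each block only processes the tokens routed to it, a block that ran attention internally would see a sparse subset of the sequence. That is the coverage problem the abstract addresses with a shared softmax block.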

23. Sparse Layers are Critical to Scaling Looped Language Models

ArXiv ID: 2605.09165

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Ryan Lee, Jacob Biloki, Edward J. Hu, Jonathan May

Abstract: Looped language models repeat a set of transformer layers through depth, reducing memory costs and providing natural early-exit points at loop boundaries. However, looped models do not scale as favorably as standard transformers with unique layers. We compare standard and Mixture-of-Experts (MoE) transformers, with and without looping, and find two main results. First, we find Looped-MoE models scale better than the standard baseline while dense looped models do not. We trace this to routing divergence between loops: in Looped-MoE models, different experts are activated on each pass through the same shared layers, recovering expressivity without additional parameters. Our second finding is that looped models have better compute-quality trade-offs with early exits than standard models. Because each loop ends with the same layers that produce the final output, loop boundaries are superior exit points, as confirmed by earlier output convergence at these points. In sum, we provide a clear direction for scaling looped models: a Looped-MoE model with early exits can not only beat standard transformers at scale, but also enable significant memory and inference savings with minimal degradation in quality.

Comment: Shows sparse expert routing is what makes looped language models scale, with routing divergence across loops restoring expressivity and enabling better early exits.

Topic Match: The core finding is architectural: how looping and sparse layers interact to recover expressivity and favorable scaling.

Relevance: 9 Novelty: 8

24. Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

ArXiv ID: 2605.10466

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Haoren Xu, Guanhua Fang

Abstract: Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_V\Sigma\Theta_K^{\top}\Theta_Q x_t$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.

Comment: Derives a covariance-readout view of softmax attention that unifies in-context learning updates with repetition and mode-collapse behavior in long contexts.

Topic Match: The paper's main value is a mechanistic theory of what self-attention computes and how that yields emergent transformer behaviors.

Relevance: 9 Novelty: 8

25. Kinetic theory for Transformers and the lost-in-the-middle phenomenon

ArXiv ID: 2605.09213

Primary Topic: Architecture and Training Dynamics

Authors: Mitia Duerinckx, Borjan Geshkovski, Stefano Rossi

Abstract: We study causal self-attention dynamics -- a toy model for decoder Transformers -- which we interpret as a non-exchangeable interacting particle system. Adapting cumulant expansions to the triangular causal dependency structure of the model, and appealing to non-hierarchical methods to estimate correlations using Glauber calculus, we prove a quantitative mean-field limit result and a next-order characterization of correlations. For iid uniformly distributed tokens, the limiting correlation equation can be solved in closed form and we obtain a rigorous explanation of the empirically observed lost-in-the-middle phenomenon: the token retrieval profile, as a function of the source position in the prompt, is $\mathsf{U}$-shaped, with primacy, recency, and a unique interior minimum under an explicit smallness condition.

Comment: Provides a rigorous mean-field and correlation analysis of causal self-attention, including a closed-form explanation of lost-in-the-middle.

Topic Match: The core contribution is theoretical analysis of Transformer attention dynamics and sequence-position effects, squarely in architecture and training dynamics.

Relevance: 9 Novelty: 8

26. Predicting Plasticity in Deep Continual Learning: A Theoretical Perspective

ArXiv ID: 2605.09044

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Jiuqi Wang, Jayanth Srinivasa, Claire Chen, Shuze Daniel Liu, Ali Payani, Shangtong Zhang

Abstract: Deep continual learning requires models to adapt to new tasks without retraining from scratch. However, neural networks can lose their ability to adapt to new tasks after training on previous ones, a phenomenon known as loss of plasticity. There have been several explanations and diagnostics proposed for plasticity loss. Motivated by the philosophy that "all models are wrong, but some are useful", we ask: can existing diagnostics predict a neural network's plasticity? In this work, we take a practical view to interpret plasticity as trainability, i.e., a neural network's future optimization gain on a target task. We first take a theoretical approach, showing, by constructing a few counterexamples, that some widely adopted diagnostics of plasticity, including representation rank and neural tangent kernel rank, can fail to predict the loss of trainability in both regression and classification settings. We instead propose a novel metric, called optimization readiness, which combines gradient strength and gradient reliability. We prove that optimization readiness lower bounds one-step optimization gain under standard smoothness assumptions, providing a theoretical guarantee for its predictive power. Empirically, we show that across commonly used deep continual learning settings, such as Slowly-Changing Regression and Permuted MNIST, optimization readiness more reliably ranks checkpoints by trainability than prior diagnostics, even with substantially fewer samples.

Comment: Shows why common plasticity diagnostics fail and proposes optimization readiness as a theoretically grounded predictor of future trainability.

Topic Match: This is primarily about training dynamics and continual adaptation behavior in neural networks, not downstream application performance.

Relevance: 9 Novelty: 8

27. Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

ArXiv ID: 2605.09472

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Daniel Wolfson, Tal Wagner

Abstract: Positional encoding in transformers is commonly implemented through positional embeddings, attention masks, or bias terms, but formal connections between these mechanisms remain limited. We study attention with positional bias through the lens of locality-sensitive hashing (LSH), focusing on Attention with Linear Biases (ALiBi). We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a "positional LSH" scheme. The empirical mean of masks sampled from this scheme yields spectral norm and max-norm approximation guarantees with bounded block sizes with high probability. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in near-linear time in the context length, reducing long-context ALiBi to a collection of randomized short-context regular (positionally unbiased) attention operations. Conceptually, this connects positional bias, masks, and positional embeddings in a single formal framework and suggests an approach to efficient ALiBi-biased attention. Experiments on large language models validate our theoretical findings.

Comment: Gives a formal connection between ALiBi positional bias and randomized block masks, yielding approximation guarantees and a path to efficient biased attention.

Topic Match: The main contribution is a new structural understanding of positional bias inside attention, with efficiency implications as a secondary benefit.

Relevance: 9 Novelty: 8
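
Illustration (not from the paper): one simple way to realize random contiguous block masks whose expectation matches the exponentiated ALiBi bias $e^{-m|i-j|}$. Each gap between adjacent positions is cut independently, and two positions share a block only if no cut separates them. This independent-cut scheme, the symmetric (non-causal) form, and the parameter values are assumptions for this sketch and may differ from the paper's positional LSH construction.

```python
import numpy as np

def sample_block_mask(n, slope, rng):
    """Cut each gap between adjacent positions independently with prob 1 - exp(-slope).

    Positions i and j share a block iff none of the |i - j| gaps between them is cut,
    which happens with probability exp(-slope * |i - j|).
    """
    cuts = rng.random(n - 1) < 1.0 - np.exp(-slope)
    block_id = np.concatenate(([0], np.cumsum(cuts)))
    return (block_id[:, None] == block_id[None, :]).astype(float)

n, slope, num_masks = 16, 0.25, 4000
rng = np.random.default_rng(0)
mean_mask = sum(sample_block_mask(n, slope, rng) for _ in range(num_masks)) / num_masks

idx = np.arange(n)
alibi_kernel = np.exp(-slope * np.abs(idx[:, None] - idx[None, :]))   # exp of the ALiBi bias
max_gap = np.max(np.abs(mean_mask - alibi_kernel))
print(max_gap)
```

Every sampled mask is block-diagonal, so attention under a single mask reduces to independent unbiased attention within short blocks. Averaging over masks recovers the positional decay.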

28. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

ArXiv ID: 2605.09630

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Lin Zheng, Vasilisa Bashlovkina, Timothy Dozat, Dan Garrette, Laura Rimell, Joshua Maynez

Abstract: Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at $16$ bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a $16\times$ smaller KV cache over patches and $3$-$4\times$ less inference compute.

Comment: Decouples byte-level model compute from patch size using entropy-triggered scratchpads that refresh patch context and reduce patch-lag degradation.

Topic Match: Its key idea is an architectural mechanism for causal patch representations in tokenizer-free models, with efficiency as a built-in consequence.

Relevance: 9 Novelty: 8

29. Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models

ArXiv ID: 2605.09303

Primary Topic: Architecture and Training Dynamics

Authors: Jeonseong Kim

Abstract: Diffusion language models (DLMs) offer a structural alternative to autoregressive generation: denoising can update tokens in arbitrary orders or in parallel rather than along a fixed left-to-right chain. In practice, fast DLM decoding remains strongly order-sensitive and often drifts toward autoregressive-like trajectories. We trace this tension to compatibility. At each reverse-time step, a DLM provides local denoising conditionals over the unresolved tokens. Arbitrary-order denoising becomes well defined when these local conditionals compose into order-invariant pseudo-joints. We formalize this view by defining order-induced pseudo-joints and a local denoising circulation: the log-ratio between the two pseudo-joints obtained by swapping a pair of unresolved positions. This circulation is zero under compatible conditionals, and global order gaps decompose into sums of local circulations along adjacent swaps. We further separate incompatibility-driven path dependence from conditional-dependence error in parallel updates and from order-specific estimation error. The resulting framework provides inference-only diagnostics for testing when DLM decoding is genuinely order-free.

Comment: Defines local denoising circulation to diagnose order sensitivity and compatibility failures in diffusion language model decoding.

Topic Match: The contribution is a mechanistic analysis of decoding dynamics and conditional compatibility in diffusion language models.

Relevance: 9 Novelty: 8
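
Illustration (not from the paper): a two-token toy version of the local denoising circulation defined in the abstract above, computed as the log-ratio between the pseudo-joints obtained by resolving positions in the two possible orders. The tables, the function name `local_circulation`, and the reduction to two binary tokens with no further context are assumptions for this example.

```python
import numpy as np

def local_circulation(p_first_i, p_j_given_i, p_first_j, p_i_given_j, xi, xj):
    """Log-ratio of the two pseudo-joints for the orders (i then j) and (j then i)."""
    joint_i_then_j = p_first_i[xi] * p_j_given_i[xi, xj]
    joint_j_then_i = p_first_j[xj] * p_i_given_j[xj, xi]
    return np.log(joint_i_then_j) - np.log(joint_j_then_i)

# Compatible case: all conditionals come from one true joint over two binary tokens.
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])                     # joint[xi, xj]
p_i = joint.sum(axis=1)
p_j = joint.sum(axis=0)
p_j_given_i = joint / p_i[:, None]                 # rows indexed by xi
p_i_given_j = (joint / p_j[None, :]).T             # rows indexed by xj

compatible = local_circulation(p_i, p_j_given_i, p_j, p_i_given_j, xi=1, xj=0)

# Incompatible case: perturb one conditional so no single joint explains both orders.
p_i_given_j_bad = np.array([[0.5, 0.5],
                            [0.25, 0.75]])
incompatible = local_circulation(p_i, p_j_given_i, p_j, p_i_given_j_bad, xi=1, xj=0)
print(compatible, incompatible)
```

The circulation vanishes (up to rounding) when both orders factor the same joint and is nonzero otherwise, which is the property that makes it usable as an inference-only compatibility check.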

30. Teaching LLMs to See Graphs: Unifying Text and Structural Reasoning

ArXiv ID: 2605.10247

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Dario Vajda

Abstract: Using Large Language Models (LLMs) to process graph-structured data is an active research area, yet current state-of-the-art approaches typically rely on multi-step pipelines with Graph Neural Network (GNN) encoders that compress rich textual attributes into solitary tokens, creating a significant semantic bottleneck. In this paper, we introduce the Graph Transformer Language Model (GTLM), a novel architecture that enables pretrained LLMs to natively process graph topologies while entirely eliminating this compressive bottleneck. GTLM is exceptionally parameter-efficient: by injecting graph-aware attention biases directly into the LLM's attention modules, it introduces only 0.015% additional parameters relative to the base model. We theoretically prove that our bidirectional attention prefix preserves node permutation equivariance while maintaining exact backward compatibility with the pretrained base model. Extensive evaluations demonstrate that a 1B-parameter GTLM matches or exceeds the performance of 7B-parameter state-of-the-art models on standard Text-Attributed Graph benchmarks, while significantly surpassing baselines on GraphQA. Finally, we demonstrate that GTLM attention heads implicitly learn to simulate message passing, explaining its superior performance on algorithmic tasks. This paradigm shift enables true algorithmic reasoning within LLMs and provides a scalable foundation for next-generation GraphRAG and relational deep learning.

Comment: Adds graph-aware attention biases to pretrained LLMs to natively process graph topology while preserving permutation equivariance and backward compatibility.

Topic Match: The main contribution is a new attention mechanism and architectural integration scheme for structured inputs, with theory about equivariance and backward compatibility.

Relevance: 9 Novelty: 8

31. Lattice Deduction Transformers

ArXiv ID: 2605.08605

Primary Topic: Architecture and Training Dynamics

Authors: Liam Davis, Leopold Haller, Alberto Alfarano, Mark Santolucito

Abstract: We introduce the Lattice Deduction Transformer (LDT), a recurrent transformer that approximates logically sound deduction by projecting its latent state through a lattice between forward passes. We train on-policy in a process that mirrors deduction in a search-based constraint solver and supervise training via a domain-agnostic, abstract-interpretation-based approximation of the set of solution candidates. An $800$K-parameter LDT achieves $100\%$ accuracy on Sudoku-Extreme and Snowflake Sudoku, at a fraction of the training cost of prior small recurrent reasoners, while remaining empirically sound: the model returns a correct answer or abstains. A $1.8$M-parameter variant reaches $99.9\%$ accuracy on Maze-Hard. Frontier LLMs score $0\%$ on all three benchmarks.

Comment: Builds a recurrent transformer with lattice projection to enforce empirically sound deductive state updates.

Topic Match: The paper introduces a distinctive recurrent architectural mechanism for constrained logical state evolution, with clear mechanistic novelty.

Relevance: 8 Novelty: 9

32. Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy Pooling

ArXiv ID: 2605.08377

Primary Topic: Architecture and Training Dynamics

Authors: Ali Syed, Aditya Nambiar, Jonathan W. Siegel

Abstract: In many practical applications it is important to build symmetries into neural network architectures. Consider the important case of permutation symmetry on point clouds consisting of $n$ points in $d$ dimensions. In this case the network learns a function on a set of $n$ points in $\mathbb{R}^d$, and a natural paradigm for constructing invariant networks is Janossy pooling, which generalizes the popular Deep Sets architecture. We study the universality of this approach, in particular the important question of how large the embedding dimension must be to guarantee universality of this architecture. Specifically, using a novel technique, we prove new lower bounds on the required size of this embedding dimension. For Deep Sets, this gives the correct minimal dimension up to a constant factor for all $d > 1$. For $k$-ary Janossy pooling, we prove the first non-trivial lower bound on the required embedding dimension when $k > 1$.

Comment: Proves lower bounds on the embedding dimension required for universality in Deep Sets and Janossy pooling, including the first non-trivial bound for k-ary Janossy pooling with k>1.

Topic Match: The contribution is foundational architectural theory about expressivity limits of permutation-invariant architectures.

Relevance: 8 Novelty: 8

33. bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

ArXiv ID: 2605.10661

Primary Topic: Architecture and Training Dynamics

Authors: Michal Byra, Pawel Olszowiec, Grzegorz Stefanski, Grzegorz Gruszczynski, Alberto Presta

Abstract: Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer-specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer-specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.

Comment: Shows a single recurrently reused ViT block can recover much of standard depth, with analyses indicating step-dependent computations emerge from hidden-state evolution.

Topic Match: The paper is fundamentally about recurrence as an architectural substitute for depth, with mechanistic analysis of emergent step specialization.

Relevance: 8 Novelty: 8

34. Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

ArXiv ID: 2605.10931

Primary Topic: Architecture and Training Dynamics

Authors: Albert Alcalde, Leon Bungert, Konstantin Riedl, Tim Roith

Abstract: Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like $\sqrt{{\log(\beta+1)}/{\beta}}\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $\beta^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\log\beta$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $\beta$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.

Comment: Provides mean-field low-temperature theory showing token distributions in deep self-attention rapidly concentrate onto projection-induced limiting structure.

Topic Match: This is a foundational theoretical analysis of transformer attention dynamics at inference time.

Relevance: 8 Novelty: 8

35. Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising

ArXiv ID: 2605.08193

Primary Topic: Architecture and Training Dynamics

Authors: Youssef Saied, François Fleuret

Abstract: Normalization Equivariance (NE), equivariance to global contrast and brightness transforms, improves robustness to distribution shift in image-to-image prediction. Existing methods enforce this prior by constraining internal layers to NE-compatible families, limiting compatibility with standard components such as attention and LayerNorm, and adding runtime cost. We characterize the full NE function class: a function is NE if and only if it admits a normalize-process-denormalize factorization. This turns exact NE enforcement, for the ideal wrapper, from an internal architectural constraint into an input-output parameterization problem, allowing a parameter-free wrapper (WNE) to enforce NE around any backbone, including transformers. In a single-noise mismatch diagnostic for blind denoising, the wrapper improves CNN and transformer robustness with no measurable GPU overhead; architectural NE baselines incur up to a 1.6x slowdown.

Comment: Characterizes the full normalization-equivariant function class and gives a parameter-free wrapper for arbitrary backbones.

Topic Match: Its contribution is a general architectural/training principle about normalization-equivariant parameterization, not the denoising application.

Relevance: 8 Novelty: 8

36. The Power of Second Order Methods for Sequence Preconditioning

ArXiv ID: 2605.08390

Primary Topic: Architecture and Training Dynamics

Authors: Annie Marsden, Elad Hazan

Abstract: Sequence prediction methods for dynamical systems with long memory, i.e. marginally stable systems, typically achieve regret that grows polynomially with the hidden dimension of the underlying generative model. Universal Sequence Preconditioning (USP) is a method that compresses any sequence which comes from a linear dynamical system into a "preconditioned" sequence which requires exponentially shorter memory for accurate prediction. However, the preconditioned sequence yields exponentially larger diameters and gradients, hindering USP from unlocking optimal regret bounds. Inspired by the minimum description length principle, we show that the Vovk-Azoury-Warmuth (VAW) algorithm is naturally matched to the USP regime. Indeed, it takes advantage of the memory compression while remaining robust to the exponential explosion of the diameter. We prove that combining USP with VAW achieves astoundingly strong results: for any marginally-stable linear dynamical system, this algorithm achieves polylogarithmic regret $O \left( \log^3 T \right)$ even in the presence of asymmetric hidden transition matrices. Finally, we extend the applicability of USP beyond bounded-spectrum systems by providing new complex-analytic bounds on Chebyshev polynomials, allowing for systems with constant complex arguments.

Comment: Combines universal sequence preconditioning with Vovk-Azoury-Warmuth second-order updates to get polylog regret for marginally stable linear dynamical systems.

Topic Match: Best fit is architecture/training dynamics because it studies core sequence modeling and optimization dynamics for long-memory systems rather than an application domain.

Relevance: 8 Novelty: 8

37. NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

ArXiv ID: 2605.08144

Primary Topic: Architecture and Training Dynamics

Authors: Fang Wu, Haokai Zhao, Da Xing, Hanqun Cao, Tinson Xu, Yanchao Li, Xiangru Tang, Zehong Wang, Aaron Tu, Kuan Pang, Hanchen Wang, Hongbin Lin, Zeqi Zhou, Yinxi Li, Peng Xia, Li Erran Li, Molei Tao, Jure Leskovec, Aditya Joshi, Yejin Choi

Abstract: Diffusion models have achieved remarkable success across a wide range of generative tasks, yet their training paradigm largely treats injected noise as uniformly informative. In this work, we challenge this assumption and introduce NoiseRater, a meta-learning framework for instance-level noise valuation in diffusion model training. We propose a parametric noise rater that assigns importance scores to individual noise realizations conditioned on data and timestep, enabling adaptive reweighting of the training objective. The rater is trained via bilevel optimization to improve downstream validation performance after inner-loop diffusion updates. To enable efficient deployment, we further design a decoupled two-stage pipeline that transitions from soft weighting during meta-training to hard noise selection during standard training. Extensive experiments on FFHQ and ImageNet demonstrate that not all noise samples contribute equally, and that prioritizing informative noise improves both training efficiency and generation quality. Our results establish noise valuation as a complementary and previously underexplored axis for improving diffusion model training. Our code is available at: https://anonymous.4open.science/r/NoiseRater-DEB116.

Comment: Meta-learns instance-level noise importance for diffusion training, treating noise samples as selectively informative rather than uniform.

Topic Match: Primary fit is architecture/training dynamics because it changes the training signal of diffusion models by analyzing and reweighting injected noise at the instance level.

Relevance: 8 Novelty: 8

38. Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

ArXiv ID: 2605.08878

Primary Topic: Architecture and Training Dynamics

Authors: Yu Chen, Yuanhao Liu, Qi Cao

Abstract: Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why do aligned LLMs remain jailbreakable, and what structural vulnerabilities in the model make this possible? We study this question through a continuous input-transformation view. Our theoretical finding is that aligned models can still exhibit Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model's behavior from refusal to answering while preserving the model's harmful-semantics interpretation. From this perspective, a jailbreak is not only a successful discrete prompt construction, but can also be understood as a refusal-to-answer behavior transition induced by continuously perturbing a harmful input along RED. We then prove that RED can be exactly decomposed into contributions from operator-level sources across the model's operator structure, and identify normalization, residual-wiring, and terminal sources as analytically constrained operator-level sources. To eliminate RED, the shared expressive modules -- self-attention and MLP -- must eliminate the contributions from these analytically constrained sources while preserving the mechanisms that support benign responses. These competing requirements give rise to a conditional safety-utility trade-off. Experiments across multiple models and attack methods empirically analyze RED from two complementary perspectives and show that added token dimensions can expose RED, while successful jailbreaks exhibit refusal-to-answer shifts largely aligned with terminal-source contributions.

Comment: Identifies refusal-escape directions and decomposes jailbreakability into operator-level sources tied to normalization and residual structure.

Topic Match: Best fit is architecture/training because the paper is fundamentally about mechanistic sources of behavior in model operators and network structure.

Relevance: 8 Novelty: 8

39. On Variance Reduction in Learning Mean Flows

ArXiv ID: 2605.09235

Primary Topic: Architecture and Training Dynamics

Authors: Juanwu Lu, Ziran Wang

Abstract: One-step generative modeling has emerged as a leading approach to amortize the inference cost of diffusion and flow-matching models. Among distillation-free methods, MeanFlow training is notoriously unstable, with non-decreasing loss and unbounded gradient variance. In this work, we establish a theory that attributes this pathology to a misuse of the conditional velocity field: it plays two distinct statistical roles in the loss, both as an unbiased regression target and as a Monte Carlo control variate inside a Jacobian-vector product, with the original loss assigning the wrong coefficient to the latter. We derive the optimal coefficient in closed form, and show that a family of fixes in concurrent works corresponds to different practical realizations of the same optimum. A controlled sweep of this coefficient on two-dimensional benchmarks and on a latent Diffusion Transformer recovers the predicted bias-variance ordering. The optimal coefficient yields up to a 54% improvement in sample quality on two-dimensional benchmarks and a monotone FID trend at every matched-step DiT checkpoint. Crucially, the same DiT measurement also reveals a quantitative FID-MSE landscape mismatch: although gradient variance is minimized at an interior coefficient value, the coefficient that minimizes FID prefers the direct use of conditional velocity.

Comment: Explains MeanFlow instability via a control-variate coefficient mismatch and derives the variance-optimal coefficient in closed form.

Topic Match: The paper is fundamentally about training dynamics and variance control in generative-model optimization.

Relevance: 8 Novelty: 8
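
Illustration: the abstract's argument rests on the idea that a control variate has a variance-optimal coefficient. The sketch below shows only the generic textbook version of that fact, c* = Cov(f, g) / Var(g), on a toy Monte Carlo problem. It is not the MeanFlow loss or the paper's closed-form coefficient, and the integrand `exp(u)` and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)

f = np.exp(u)   # quantity whose mean we want to estimate
g = u - 0.5     # control variate with known mean zero

# Variance-optimal coefficient for the estimator f - c * g.
cov = np.cov(f, g)
c_star = cov[0, 1] / cov[1, 1]

var_raw = np.var(f)                  # no control variate (c = 0)
var_unit = np.var(f - 1.0 * g)       # a plausible but suboptimal coefficient
var_opt = np.var(f - c_star * g)     # optimal coefficient

print(f"c* = {c_star:.3f}")
print(f"variance with c=0: {var_raw:.4f}, c=1: {var_unit:.4f}, c=c*: {var_opt:.4f}")
```

Every choice of c leaves the estimator unbiased because g has mean zero, so the coefficient only changes the variance. That is the sense in which a "wrong coefficient" can destabilize training without biasing the target.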

40. Infinite Mask Diffusion for Few-Step Distillation

ArXiv ID: 2605.10518

Primary Topic: Architecture and Training Dynamics

Authors: Jaehoon Yoo, Wonjung Kim, Chanhyuk Lee, Seunghoon Hong

Abstract: Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at https://Ugness.github.io/official_imdm.

Comment: Introduces an infinite-state stochastic mask to reduce the factorization-error floor of masked diffusion language models for few-step generation.

Topic Match: The paper proposes a new generative architecture/mechanism rather than a downstream application.

Relevance: 8 Novelty: 8

41. Phases of Muon: When Muon Eclipses SignSGD

ArXiv ID: 2605.09552

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Elliot Paquette, Noah Marshall, Lucas Benigni, Guangyuan Wang, Atish Agarwala, Courtney Paquette

Abstract: Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent $\alpha$ and target exponent $\beta$ shows there are three phases in the $(\alpha,\beta)$ plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.

Comment: Provides phase-based theoretical analysis of Muon/SignSVD versus SignSGD, explaining when spectral preconditioning helps under anisotropic data and finite batch noise.

Topic Match: This is primarily a foundational optimization and training-dynamics analysis of a notable spectral optimizer.

Relevance: 8 Novelty: 8
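
Illustration: the two update directions contrasted in the abstract can be written in a few lines. The sketch below assumes the usual definitions, SignSVD as the orthogonal polar factor of the gradient matrix and SignSGD as the entrywise sign. It does not reproduce the paper's least-squares setup or Muon's iterative approximation, and the function names are hypothetical.

```python
import numpy as np

def sign_svd(G):
    """Matrix sign of a full-rank gradient: set every singular value to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def sign_sgd(G):
    """Entrywise sign of the gradient."""
    return np.sign(G)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))       # hypothetical gradient matrix
D = np.diag([2.0, -3.0])          # diagonal case, where the two updates coincide

print("singular values of sign_svd(M):", np.linalg.svd(sign_svd(M), compute_uv=False))
print("updates agree on a diagonal gradient:", np.allclose(sign_svd(D), sign_sgd(D)))
print("updates agree on a generic gradient:", np.allclose(sign_svd(M), sign_sgd(M)))
```

SignSVD normalizes in the singular-vector basis while SignSGD normalizes in the coordinate basis, so the two agree when those bases line up and differ otherwise.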

42. Controlling Transient Amplification Improves Long-horizon Rollouts

ArXiv ID: 2605.08856

Primary Topic: Architecture and Training Dynamics

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Adeel Pervez, Francesco Locatello

Abstract: Autoregressive neural simulators now match classical solvers on short-horizon prediction of physical systems, yet their accuracy degrades rapidly when rolled out over long horizons. In this work, we identify transient amplification of perturbations around rollout trajectories as a structural mechanism driving rollout error. Using a linearization analysis we show that when the Jacobians along an autoregressive trajectory are non-normal and non-commuting, the model amplifies errors transiently, resulting in model rollout drift even when the overall system is asymptotically stable. Building on the analysis, we propose commutativity regularization: a combination of two penalties designed to reduce the normality defect of individual Jacobians and the commutator norm of Jacobians across steps. The penalties are estimated with Jacobian-vector products and have no inference-time cost. We show a propagator bound that quantifies rollout error under approximate commutativity and normality. We evaluate UNet and FNO variants with commutativity regularization on 1D and 2D spatio-temporal data in synthetic and real settings, showing successful long-horizon rollouts over thousands of steps. Further, we show that the method improves FourCastNet climate forecasts on ERA5 without using any new data. The gain is most pronounced out-of-distribution: trained on trajectories of a few hundred steps, regularized models remain in-distribution for thousands of rollout steps on initial conditions where baselines diverge.

Comment: Identifies transient amplification from non-normal, non-commuting Jacobians as a cause of rollout drift and regularizes for commutativity to stabilize long-horizon prediction.

Topic Match: The main contribution is a training-stability mechanism based on Jacobian dynamics, not an application-specific forecasting result.

Relevance: 8 Novelty: 8
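
Illustration: transient amplification is easy to see on a 2x2 matrix whose eigenvalues are all inside the unit circle. The sketch below uses explicit matrices and one common choice of normality defect, the Frobenius norm of J J^T - J^T J. The paper estimates its penalties with Jacobian-vector products and may define them differently, so the function names and formulas here are assumptions.

```python
import numpy as np

def normality_defect(J):
    """Frobenius norm of J J^T - J^T J; zero exactly when J is normal."""
    return np.linalg.norm(J @ J.T - J.T @ J)

def commutator_norm(J1, J2):
    """Frobenius norm of the commutator J1 J2 - J2 J1."""
    return np.linalg.norm(J1 @ J2 - J2 @ J1)

# Asymptotically stable (both eigenvalues are 0.9) but strongly non-normal.
J = np.array([[0.9, 10.0],
              [0.0, 0.9]])
D = np.diag([0.9, 0.5])   # a normal matrix, for comparison

spectral_radius = max(abs(np.linalg.eigvals(J)))
growth = [np.linalg.norm(np.linalg.matrix_power(J, k), 2) for k in range(1, 201)]

print(f"spectral radius: {spectral_radius:.2f}")
print(f"peak of ||J^k||: {max(growth):.1f} at k = {int(np.argmax(growth)) + 1}")
print(f"||J^200||: {growth[-1]:.2e}")
print(f"normality defect of J: {normality_defect(J):.1f}, of D: {normality_defect(D):.1f}")
print(f"commutator norm of J and D: {commutator_norm(J, D):.2f}")
```

The perturbation norm first grows by more than an order of magnitude and only later decays, which is the behavior the abstract identifies as a driver of rollout drift in otherwise stable systems.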

43. Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit

ArXiv ID: 2605.08352

Primary Topic: Architecture and Training Dynamics

Authors: Konstantin Riedl, Konstantinos Spiliopoulos, Justin Sirignano

Abstract: A convergence analysis is developed for the regularized Newton method for training neural networks (NNs) in the overparameterized limit. As the number of hidden units tends to infinity, the NN training dynamics converge in probability to the solution of a deterministic limit equation involving a "Newton neural tangent kernel" (NNTK). Explicit rates characterizing this convergence are provided and, in the infinite-width limit, we prove that the NN converges exponentially fast to the target data (i.e., a global minimizer with zero loss). We show that this convergence is uniform across the frequency spectrum, addressing the spectral bias inherent in gradient descent. The eigenvalues of the NTK for gradient descent accumulate at zero, leading to slow convergence for target data with high-frequency components. In contrast, the NNTK has uniformly lower bounded eigenvalues if the regularization parameter is selected appropriately, allowing Newton's method to converge more quickly for data with high-frequency components. Mathematical challenges that need to be addressed in our analysis include the implicit parameter update of the Newton method with a potentially indefinite Hessian matrix and the fact that the dimension of this linear system of equations tends to infinity as the NN width grows. This complicates deriving the training dynamics in the overparameterized limit as well as proving the convergence of the finite-width dynamics thereto. The analysis identifies a scaling formula for selecting the regularization parameter, which we show can vanish at a suitable rate as the number of hidden units becomes larger. We prove that, for sufficiently large numbers of hidden units, the regularized Hessian remains positive definite during training and the Newton updates for individual NN parameters converge to zero, showing that the model behaves as a linearization around the initialization.

Comment: Analyzes regularized Newton training for overparameterized neural networks via a Newton neural tangent kernel with uniform spectral convergence properties.

Topic Match: This is a foundational theory paper on optimization dynamics and convergence in neural network training.

Relevance: 8 Novelty: 8

44. Optimizer-Induced Mode Connectivity: From AdamW to Muon

ArXiv ID: 2605.09991

Primary Topic: Architecture and Training Dynamics

Authors: Fangzhao Zhang, Sungyoon Kim, Erica Zhang, Yiqi Jiang, Mert Pilanci

Abstract: Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model's spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond classical mode connectivity literature.

Comment: Shows mode connectivity depends on optimizer-induced implicit regularization, with same-optimizer solution sets connected and cross-optimizer regions potentially disjoint.

Topic Match: The paper is about optimizer-dependent training geometry and implicit regularization, which squarely fits training dynamics.

Relevance: 8 Novelty: 8

45. Fitting Multilinear Polynomials for Logic Gate Networks

ArXiv ID: 2605.08657

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Youngsung Kim

Abstract: We study learnable logic gate networks that stack layers of 2-input Boolean gates to build combinational circuits. Every 2-input gate has a unique multilinear polynomial with 4 coefficients, so the 16 Boolean gates form a codebook of prototypes in a 4-dimensional space, reducing training to a vector-quantization problem. The baseline method, Soft-Mix, learns a 16-dimensional softmax over gate identities, but the codebook has rank 4: 11 of 15 simplex directions carry nullspace gradient, and at uniform initialization the backward signal vanishes exactly. We prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE, and show that the covariance Jacobian of soft-VQ selection bypasses it by coupling the starved coefficient to the always-active constant channel. Working in the 4-dimensional polynomial space reduces each neuron from 16 to 4 parameters. On seven datasets, at least one 4-parameter method matches or exceeds Soft-Mix on every dataset; the CovJac advantage over STE grows monotonically with interaction demand across all seven datasets. At depth, Soft-Mix collapses ($-37.3$pp on CIFAR-10 at 12 layers) while CovJac holds ($-0.5$pp on CIFAR-10, stable on MNIST).

Comment: Recasts learnable logic-gate networks in multilinear polynomial space, exposing gradient starvation in soft gate mixing and proposing a covariance-Jacobian fix.

Topic Match: It introduces a new mechanistic parameterization and training analysis for a specialized neural architecture.

Relevance: 8 Novelty: 8
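
Illustration: the claim that the 16 two-input gates form a rank-4 codebook can be checked directly. The sketch below expands each gate as f(a, b) = c0 + c1*a + c2*b + c3*a*b and stacks the coefficient vectors. The coefficient ordering and the function name are assumptions for illustration, and none of the paper's training methods (Soft-Mix, STE, CovJac) are implemented here.

```python
import itertools
import numpy as np

def gate_coefficients(truth_table):
    """Coefficients (c0, c1, c2, c3) of f(a, b) = c0 + c1*a + c2*b + c3*a*b.

    truth_table lists f(0,0), f(0,1), f(1,0), f(1,1) in that order.
    """
    f00, f01, f10, f11 = truth_table
    return (f00, f10 - f00, f01 - f00, f11 - f10 - f01 + f00)

# One row per Boolean gate: all 16 truth tables on two inputs.
codebook = np.array(
    [gate_coefficients(tt) for tt in itertools.product((0, 1), repeat=4)]
)

rank = np.linalg.matrix_rank(codebook)
affine_rank = np.linalg.matrix_rank(codebook - codebook.mean(axis=0))

print("codebook shape:", codebook.shape)
print("rank:", rank, "affine rank:", affine_rank)
print("AND:", gate_coefficients((0, 0, 0, 1)))
print("XOR:", gate_coefficients((0, 1, 1, 0)))
```

A softmax over 16 gates has 15 free directions on the simplex, and the mixture's coefficient vector lives in only 4 dimensions, so 15 - 4 = 11 directions cannot change the mixed gate. This matches the count quoted in the abstract.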

46. Hyperparameter Transfer for Dense Associative Memories

ArXiv ID: 2605.10164

Primary Topic: Architecture and Training Dynamics

Authors: Roi Holtzman, Dmitry Krotov, Boris Hanin

Abstract: Dense Associative Memory (DenseAM) is a promising family of AI architectures that is represented by a neural network performing temporal dynamics on an energy landscape. While hyperparameter transfer methods are well-studied for feed-forward networks, these methods have not been developed for settings in which weights are shared across layers and within the layer, which is common in DenseAMs. Additionally, DenseAMs utilize rapidly peaking activation functions that are rarely used in feed-forward architectures. The confluence of these aspects makes DenseAM a challenging framework for using existing methods for hyperparameter transfer. Our work initiates the development of hyperparameter transfer methods for this class of models. We derive explicit prescriptions for how the hyperparameters tuned on small models can be transferred to models trained at scale. We demonstrate excellent agreement between these theoretical findings and empirical results.

Comment: Derives explicit hyperparameter transfer prescriptions for Dense Associative Memories, a recurrent energy-based architecture with shared weights and unusual activations.

Topic Match: This is squarely about training dynamics and scaling rules for a specialized architecture rather than an application.

Relevance: 8 Novelty: 8

47. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

ArXiv ID: 2605.08696

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Benjamin L. Badger

Abstract: Over the last two decades, language modeling has experienced a shift from predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers run on vLLM, increases characteristic of PyTorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.

Comment: Presents a sequence model with train-time parallel and inference-time recurrent forms, targeting the training/inference trade-off in long-sequence generation.

Topic Match: The central contribution is a new recurrent/parallel sequence architecture rather than a systems-only optimization.

Relevance: 8 Novelty: 8

48. Dimension-Free Saddle-Point Escape in Muon

ArXiv ID: 2605.09331

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Yanlin Long, Yufei Gu, Zeke Xie

Abstract: Modern Large Language Model (LLM) training is fundamentally bottlenecked by pathologically flat saddle points in extreme high-dimensional landscapes. Motivated by this challenge, we analyze the saddle-point escape dynamics of the emerging Muon optimizer, demonstrating its resilience against the $\mathcal{O}(D)$ dimensional curse that severely traps element-wise adaptive optimizers like AdamW. By extending generalized matrix perturbation theory, we develop a theoretical framework to capture Muon's non-equilibrium optimization trajectories. This theoretical machinery mathematically proves that Muon elegantly bypasses the dimensional curse via a non-linear spectral shaping mechanism. By leveraging resolvent functional calculus and macroscopic Cauchy contour integration, we avoid isotropic noise assumptions and Tracy-Widom edge singularities. We establish that structural incoherence securely shields the trajectory from orthogonal drift, enabling a dimension-free saddle-point escape, and triggering a deterministic $\mathcal{O}(1)$ discrete ballistic ejection under sufficient spectral gap. Consequently, we provide an algebraically dimension-free escape bound for Muon, formalizing the underlying mechanics of its non-convex optimization dynamics.

Comment: Provides theory that Muon escapes saddle points without the usual dimensional curse via spectral shaping, illuminating optimizer behavior in very high dimensions.

Topic Match: Although optimizer-related, the paper's emphasis is on nonconvex training dynamics and why a specific optimizer behaves differently in large models.

Relevance: 8 Novelty: 8

49. Parameterized Complexity of Stationarity Testing for Piecewise-Affine Functions and Shallow CNN Losses

ArXiv ID: 2605.10219

Primary Topic: Architecture and Training Dynamics

Authors: Yuhan Ye

Abstract: We study the parameterized complexity of testing approximate first-order stationarity at a prescribed point for continuous piecewise-affine (PA) functions, a basic task in nonsmooth optimization. PA functions form a canonical model for nonsmooth stationarity testing and capture the local polyhedral geometry that appears in ReLU-type training losses. Recent work by Tian and So (SODA 2025) shows that testing approximate stationarity notions for PA functions is computationally intractable in the worst case, and identifies fixed-dimensional tractability as an open direction. We address this direction from the viewpoint of parameterized complexity, with the ambient dimension $d$ as the parameter. In this paper, we give XP algorithms in fixed dimension for the tractable sides, and prove W[1]-hardness for the complementary sides. Moreover, lower bounds under the Exponential Time Hypothesis rule out algorithms running in time $\rho(d)\,\mathrm{size}^{o(d)}$ for any computable function $\rho$, where $\mathrm{size}$ denotes the total binary encoding length of the stationarity-testing instance. As a further consequence, our results yield the corresponding parameterized complexity picture for testing local minimality of continuous PA functions. We further extend our hardness results to a family of shallow ReLU CNN training losses, with stationarity tested in the trainable weight space. Thus, the same parameterized-complexity picture also appears for simple CNN training losses.

Comment: Parameterized complexity results for approximate stationarity testing on piecewise-affine functions and shallow ReLU CNN losses directly analyze optimization hardness in neural training landscapes.

Topic Match: Best fit is architecture and training dynamics because the paper studies fundamental computational hardness of stationarity in piecewise-linear neural loss geometry rather than an application domain.

Relevance: 8 Novelty: 8

50. Minimal Filling Architectures of Polynomial Neural Networks: Counterexamples, Frontier Search, and Defects

ArXiv ID: 2605.09609

Primary Topic: Architecture and Training Dynamics

Authors: Kevin Dao, Jose Israel Rodriguez

Abstract: We provide a counterexample to the minimal unimodal conjecture for polynomial neural networks (PNNs) with power activation functions. Fixing the input and output widths, the conjecture states that any minimal filling architecture has unimodal widths for the hidden layers. We found a counterexample via a frontier search and certified it using recursive dimension bounds and symbolic computation. Notably, several subarchitectures of this example exhibit large defect, in contrast with the predominantly small-defect behavior observed in prior examples.

Comment: Finds a counterexample to the minimal unimodal conjecture for polynomial neural networks, revealing nontrivial structure in minimal expressive architectures.

Topic Match: The core contribution is architectural theory: characterizing minimal width patterns and defects in polynomial neural network architectures.

Relevance: 8 Novelty: 8

51. CATO: Charted Attention for Neural PDE Operators

ArXiv ID: 2605.09016

Primary Topic: Architecture and Training Dynamics

Authors: Chun-Wun Cheng, Sifan Wang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero

Abstract: Neural operators have emerged as powerful data-driven solvers for PDEs, offering substantial acceleration over classical numerical methods. However, existing transformer-based operators still face critical challenges when modeling PDEs on complex geometries: directly processing over massive mesh points is computationally expensive, while operating in raw discretization coordinates may obscure the intrinsic geometry where physical interactions are more naturally expressed. To address these limitations, we introduce the Charted Axial Transformer Operator (CATO), a geometry-adaptive and derivative-aware neural operator for PDEs on general geometries. Instead of applying attention directly in the physical coordinate system, CATO learns a continuous latent chart that maps mesh coordinates into a learned chart space, where chart-conditioned axial attention efficiently captures long-range dependencies with reduced computational cost. In addition, CATO introduces a derivative-aware physics loss for steady-state PDEs that jointly supervises solution values, mesh-consistent gradients, and an auxiliary flux-like field, improving physical fidelity and reducing oversmoothing. We further provide a theoretical approximation result showing that, under a favorable chart, charted axial attention can represent low-rank axial solution operators with controlled error, and that small chart perturbations induce bounded approximation degradation. CATO achieves the best performance across all evaluated datasets, yielding an average improvement of approximately 26.76% over the strongest competing baselines while reducing the number of parameters by 81.98%. These results highlight the effectiveness of learning geometry-adaptive charts and derivative-aware physical supervision for accurate and efficient PDE operator learning.

Comment: Introduces chart-conditioned axial attention as a geometry-adaptive mechanism for neural PDE operators with a derivative-aware physics loss.

Topic Match: Primary fit is architecture and training dynamics because the central contribution is a new attention mechanism and loss design for operator learning on complex geometries.

Relevance: 8 Novelty: 8

52. RAwR: Role-Aware Rewiring via Approximate Equitable Partition

ArXiv ID: 2605.09457

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Riccardo Porcedda, Giuseppe Squillace, Bastian Epping, Andrea Vandin, Michael Schaub, Mirco Tribastone, Francesca Chiaromonte

Abstract: While Graph Neural Networks (GNNs) have demonstrated significant efficacy in node classification tasks, where predictions rely on local neighborhood information, the performance of GNNs often drops when prediction tasks depend on long-range interactions. These limitations are attributed to phenomena such as oversquashing, where structural bottlenecks restrict signal propagation across the network topology. To address this challenge, we introduce RAwR, a computationally efficient rewiring framework that augments the input graph with a quotient graph derived from equitable partitions. This approach facilitates accelerated communication between nodes that share identical structural roles, as identified by the Weisfeiler-Leman graph coloring, and thereby reduces the total effective resistance of the system. Furthermore, by employing an approximate definition of the equitable partition, RAwR enables a controllable reduction of the quotient graph, which, in its most condensed state, recovers the conventional Master Node rewiring technique. Empirical evaluations across a diverse suite of benchmarks -- including homophilic, heterophilic, and synthetic long-range datasets -- demonstrate that RAwR achieves state-of-the-art results. Our contribution is further supported by an analytical investigation using a teacher-student model of linear GNNs, which elucidates the theoretical foundations of role-based rewiring. This analysis leads to the formulation of Spectral Role Lift (SRL), a metric designed to identify the optimal approximate equitable partition for maximizing predictive performance.

Comment: Uses approximate equitable partitions to rewire graphs by structural role, with theory linking rewiring to oversquashing relief and predictive gains.

Topic Match: Primary fit is architecture and training dynamics since the work introduces a new graph architectural rewiring principle tied to message-passing behavior.

Relevance: 8 Novelty: 8
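
Illustration: the structural roles mentioned in the abstract come from Weisfeiler-Leman color refinement, which converges to the coarsest equitable partition of a graph. The sketch below shows only that standard refinement step on a small example. It does not build the quotient graph, the approximate partition, or the rewiring that RAwR adds, and the function name and adjacency format are assumptions.

```python
def wl_partition(adj):
    """Coarsest equitable partition via 1-WL color refinement.

    adj maps each node to an iterable of its neighbors (undirected graph).
    Returns a sorted list of sorted node lists, one per structural role.
    """
    colors = {v: 0 for v in adj}
    while True:
        # A node's signature is its own color plus the multiset of neighbor colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adj}
        # Refinement can only split classes, so an unchanged count means convergence.
        if len(set(new_colors.values())) == len(set(colors.values())):
            break
        colors = new_colors
    classes = {}
    for v, c in colors.items():
        classes.setdefault(c, []).append(v)
    return sorted(sorted(nodes) for nodes in classes.values())

# Path graph 0 - 1 - 2 - 3 - 4: endpoints, inner nodes, and the center play different roles.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
roles = wl_partition(path)
print(roles)
```

Nodes in the same class have the same number of neighbors in every class, which is the property that makes a quotient graph over the classes well defined.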

53. When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains

ArXiv ID: 2605.08318

Primary Topic: Architecture and Training Dynamics

Authors: Brandon Yee, Pairie Koh, Jack Rodriguez, Mihir Tekal

Abstract: We study the problem of architecture selection for deep learning models trained to solve partial differential equations (PDEs), asking when transformer-based architectures with learned attention outperform Fourier-domain neural operators. We introduce the Multi-Scale Attention Transformer (MSAT), a deep learning architecture that encodes spatiotemporal solution histories as token sequences and trains end-to-end via a composite supervised objective with optional physics-informed regularization terms. We conduct a comprehensive empirical evaluation against nine baselines -- including physics-informed neural networks (PINNs), neural operators (FNO, DeepONet, GNOT), and state-space models (Mamba-NO) -- across five benchmark problems from the PINNacle suite, using identical train/test splits and reference data for all methods. MSAT achieves state-of-the-art generalization on complex geometry problems ($L^2_\mathrm{rel} = 0.0101$ on Heat2D-CG, a $3.7\times$ improvement over FNO) at $34\,\mathrm{s}$ total inference vs. $120{,}812\,\mathrm{s}$ for Mamba-NO. Ablation studies over the physics regularization component reveal a precise inductive bias tradeoff: physics priors reduce test error on diffusion-dominated problems but degrade generalization on chaotic and recirculating-flow regimes, directly characterizing the prior misspecification boundary. Approximation error bounds as a function of domain boundary complexity $\kappa$ provide a theoretical basis for these empirical findings and a principled rule for architecture selection.

Comment: Identifies when learned attention beats Fourier operators for PDE solving, with architecture-selection theory tied to domain-boundary complexity.

Topic Match: The paper is fundamentally about architectural mechanisms and inductive-bias tradeoffs between transformers and neural operators.

Relevance: 8 Novelty: 8

54. Exactness Matters for Physical Rule Enforcement

ArXiv ID: 2605.08285

Primary Topic: Architecture and Training Dynamics

Authors: Bum Jun Kim

Abstract: Autoregressive scientific forecasters often enforce physical or structural constraints by repairing each predicted state before feeding it back into the model. However, it remains unclear when stronger physical rule enforcement becomes reliable and when it becomes a source of distribution shift. We study this question through operator exactness, meaning whether the repair map is the identity on the target manifold and is aligned with the target geometry. We compare raw forecasting, post hoc repair, and in-loop repair across periodic incompressible Navier--Stokes, non-periodic CFDBench flows, and a hierarchical-forecasting support task. In the exact periodic regime, Fourier projection substantially improves rollout accuracy. On the NS-128 benchmark, a strong Raw-FNO has a final-step rollout MSE at horizon 100 of $(9.390 \pm 6.290)\times 10^{-5}$, and post hoc and in-loop projection reduce it to $(1.130 \pm 0.165)\times 10^{-6}$ and $(5.370 \pm 0.113)\times 10^{-7}$. However, once an exact projection is unavailable and only approximate boundary-preserving cleanup is available, the ordering changes. Across cavity, tube, dam, and cylinder flow, stronger Poisson-based cleanup can reduce divergence while worsening rollout error; target-distortion MSE predicts this harm far better than a linear-system residual. Controlled mismatch, screened cleanup, adaptive gating, and external-backbone checks show that the best approximate-regime operating point can be raw or near-identity. Hierarchical forecasting gives the same broader pattern. Exact forecast reconciliation is a stable baseline, whereas blended top-down repair, a validation-tuned interpolation toward historical-proportion top-down reconciliation, is dataset-dependent. Thus, constraint enforcement should be benchmarked by operator--data alignment before enforcement strength.

Comment: Identifies operator exactness as the key criterion governing when in-loop physical-rule enforcement helps versus causes harmful distribution shift.

Topic Match: Primary fit is training dynamics because the paper isolates a mechanistic principle for stable autoregressive forecasting under constraint-repair interventions.

Relevance: 8 Novelty: 8
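
Illustration for item 54: a minimal sketch of what "exactness" of a repair map can mean, assuming a single periodic Fourier mode with wavevector `k` and vector amplitude `a` (divergence-free means `k.a = 0`). The function names and the `misaligned_repair` stand-in are hypothetical and are not taken from the paper; the sketch only shows that an exact projection leaves an already-valid state untouched, while a misaligned one distorts it.

```python
def project_mode(k, a):
    """Remove from amplitude a the component parallel to wavevector k.

    A periodic Fourier mode with wavevector k and vector amplitude a has
    divergence proportional to k.a, so the result satisfies k.a = 0.
    """
    k2 = sum(ki * ki for ki in k)
    if k2 == 0:  # the mean mode carries no divergence
        return tuple(float(ai) for ai in a)
    k_dot_a = sum(ki * ai for ki, ai in zip(k, a))
    return tuple(ai - ki * k_dot_a / k2 for ki, ai in zip(k, a))


def distortion(repair, k, a):
    """Squared change that a repair map applies to amplitude a."""
    return sum((ri - ai) ** 2 for ri, ai in zip(repair(k, a), a))


def misaligned_repair(k, a):
    """Hypothetical inexact repair: projects with a slightly wrong wavevector."""
    wrong_k = (k[0], k[1] - 1)
    return project_mode(wrong_k, a)


k = (1, 2)
clean = (2.0, -1.0)  # already divergence-free: k.a = 0
noisy = (3.0, 1.0)   # clean plus a component along k

exact_on_clean = distortion(project_mode, k, clean)        # identity on the manifold
inexact_on_clean = distortion(misaligned_repair, k, clean)  # distorts a valid state
repaired = project_mode(k, noisy)                           # recovers the clean state
```

The distortion that a repair applies to states already on the target manifold is the kind of quantity the abstract calls target distortion; here it is zero for the exact projection and nonzero for the misaligned one.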

55. Exact Fixed-Point Constraints in Neural-ODEs with Provable Universality

ArXiv ID: 2605.10613

Primary Topic: Architecture and Training Dynamics

Authors: Feliciano Giuseppe Pacifico, Duccio Fanelli, Lorenzo Buffoni, Lorenzo Chicchi, Diego Febbe, Raffaele Marino

Abstract: We introduce a technique that enables Neural-ODEs to approximate arbitrary velocity fields with a priori planted fixed-points. Specifically, a recipe is given to explicitly accommodate for a finite collection of points in the reference multi-dimensional space of the Neural-ODE where the velocity field is exactly equal to zero. In this way, the gradient-based training is rigorously constrained inside the prescribed hypothesis class while leaving the expressive power of the Neural-ODE unaltered. We rigorously prove the universality of the Neural-ODE under any local constraints in the velocity field and give a computationally convenient way of imposing the fixed points. Our method is then tested on two paradigmatic physical models.

Comment: Shows Neural-ODEs can enforce exact planted fixed points without losing universality, a core architectural constraint mechanism.

Topic Match: Primarily about architectural expressivity and constrained dynamics in continuous-depth models, not downstream use.

Relevance: 8 Novelty: 8

56. Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

ArXiv ID: 2605.10889

Primary Topic: Architecture and Training Dynamics

Authors: Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal, Duc N. M Hoang, Fartash Faghri, Yizhe Zhang, Minsik Cho, Mehrdad Farajtabar

Abstract: On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.

Comment: Provides a training-free per-token gradient-alignment diagnostic for when on-policy distillation helps or hurts.

Topic Match: The contribution is fundamentally about token-level training signal quality and distillation dynamics, not application performance.

Relevance: 8 Novelty: 8
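
Illustration for item 56: the gradient alignment score is described as a cosine similarity between an ideal gradient and a distillation gradient. The sketch below computes that cosine on flattened gradient vectors; the vectors and teacher labels are made-up placeholders, and estimating the ideal gradient (the paper's targeted-rollout step) is not attempted here.

```python
import math


def gradient_alignment(ideal_grad, distill_grad):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(g * d for g, d in zip(ideal_grad, distill_grad))
    norm = math.sqrt(sum(g * g for g in ideal_grad)) * math.sqrt(
        sum(d * d for d in distill_grad)
    )
    return dot / norm if norm > 0 else 0.0  # treat a zero gradient as unaligned


ideal = [0.5, -1.0, 0.0, 2.0]  # placeholder for an estimated ideal gradient
scores = {
    "teacher_a": gradient_alignment(ideal, [1.0, -2.0, 0.0, 4.0]),   # same direction
    "teacher_b": gradient_alignment(ideal, [2.0, 1.0, 3.0, 0.0]),    # orthogonal
    "teacher_c": gradient_alignment(ideal, [-0.5, 1.0, 0.0, -2.0]),  # opposite direction
}
```

A score near 1 means the distillation signal pushes the student the same way as the ideal update, near 0 means it is uninformative, and a negative score means it pushes against it.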

57. Elucidating Representation Degradation Problem in Diffusion Model Training

ArXiv ID: 2605.10790

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Zhipeng Yao, Dazhou Li, Zitong Zhang, Durude Mahee, Fan Zhu, Wenbin Zhang, Xinwei He, Yeying Jin, Rui Yu

Abstract: Diffusion models have achieved remarkable success, yet their training remains inefficient due to a severe optimization bottleneck, which we term Representation Degradation. As noise levels increase, the outputs of the trained model exhibit progressive structural distortion, which can destabilize training and impair generation quality. Our analysis suggests that this instability is driven by mismatched target recoverability, which is associated with Neural Tangent Kernel (NTK) spectral weakening and effective low-rank behavior. To address this, we propose Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically reallocates optimization effort according to effective recoverability. By stabilizing representation learning without external supervision, ERD accelerates convergence and achieves strong empirical performance across diffusion backbones.

Comment: Analyzes representation degradation in diffusion training via recoverability and NTK spectra, then reallocates optimization effort accordingly.

Topic Match: The paper is primarily about training instability and optimization dynamics in generative architectures.

Relevance: 8 Novelty: 8

58. A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models

ArXiv ID: 2605.09515

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training, Representation Learning Theory and Structure

Authors: Djamel Bouchaffra

Abstract: Large language models rely on multihead attention, but interactions among heads remain poorly understood. We apply the Game Theoretic Free Energy Principle (GTFEP), a framework that casts multiagent systems as distributed variational inference, to analyze attention heads as bounded rational agents. According to GTFEP, each head minimizes its variational free energy, and collective behavior follows a Gibbs distribution over coalition structures whose energy is decomposed into Harsanyi dividends. Using a tractable approximation (uniform prior, deterministic dynamics), coalition free energy reduces to joint Shannon entropy of discretized head outputs (argmax key index). Pairwise dividends become mutual information (nonnegative), while triple dividends correspond to interaction information and can be negative. On BERT, GPT2, and Llama with GSM8K, triple dividends are consistently negative, revealing higher order redundancy. The Nash FEP correspondence guarantees that stationary points of collective free energy are epsilon Nash equilibria; thus, heads with negligible contribution can be pruned with minimal performance loss. Pruning heads with low marginal contribution reduces computational cost with minimal performance loss: for example, pruning 20% of heads in GPT2 reduces FLOPs by 18%, increases throughput by 22%, and raises perplexity only modestly (from 28.4 to 33.4 on GSM8K). Our work shows GTFEP provides a principled foundation for analyzing and optimizing transformer architectures.

Comment: Uses game-theoretic free energy to quantify higher-order synergy and redundancy among attention heads, with pruning implications.

Topic Match: The central contribution is mechanistic analysis of attention-head interactions as an architectural computation problem.

Relevance: 8 Novelty: 8
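
Illustration for item 58: a small sketch of Harsanyi dividends computed from joint Shannon entropies of discretized head outputs. The sign convention `v(S) = -H(S)` is an assumption made here so that the pairwise dividend equals mutual information, as the abstract states; the paper's exact definitions may differ. The sample data are synthetic and stand in for per-head argmax key indices.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(samples, idx):
    """Shannon entropy in bits of the columns `idx` of a list of tuples."""
    counts = Counter(tuple(s[i] for i in idx) for s in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def harsanyi_dividend(samples, coalition):
    """Moebius inversion of v(S) = -H(S) over the non-empty subsets of `coalition`."""
    total = 0.0
    for size in range(1, len(coalition) + 1):
        sign = (-1) ** (len(coalition) - size)
        for subset in combinations(coalition, size):
            total += sign * -joint_entropy(samples, subset)
    return total


redundant = [(0, 0, 0), (1, 1, 1)]                  # three heads that always agree
xor = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # third head = XOR of the first two

pair_red = harsanyi_dividend(redundant, (0, 1))       # mutual information of two copies
triple_red = harsanyi_dividend(redundant, (0, 1, 2))  # negative: redundancy
triple_xor = harsanyi_dividend(xor, (0, 1, 2))        # positive: synergy
```

Under this convention a negative triple dividend signals redundancy among the three heads, which matches the direction of the finding reported in the abstract.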

59. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

ArXiv ID: 2605.08666

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Tianhao Cheng, Zeyu Huang, Zihan Qiu, Yu Cheng, Edoardo Ponti, Yinghui Xu, Ivan Titov, Zenglin Xu

Abstract: A commonly accepted explanation of critic-free RL for LLMs, based on sequence-level rewards, is that it reinforces successful rollouts with a positive advantage while penalizing failed ones. In contrast, we study critic-free RL from a token-level perspective, revealing the token-flipping phenomenon: positive and negative rollouts exhibit remarkably similar proportions of tokens whose probabilities are boosted or suppressed during RL training. To explain this phenomenon, we further show that a token's change in probability is not fully determined by its own advantage; coupled gradient interactions with other tokens also play a non-negligible role. Specifically, these token coupling effects occur primarily between identical tokens that are both predicted with low confidence. Building upon this analysis, we propose the cancellation hypothesis: as a result of coupling, opposing signals cancel out for tokens shared by positive and negative rollouts, while tokens more specific to successful rollouts receive stronger reinforcement, thereby inducing hidden token-level credit assignment from rollout-level rewards. We support this hypothesis with complementary empirical evidence. (1) Compared with training on only positive rollouts, critic-free RL shifts updates from template and formatting tokens toward reasoning tokens; (2) Tokens boosted by critic-free RL consistently demonstrate higher value than suppressed tokens, regardless of whether they originate from positive or negative rollouts. Guided by this view, we implement two batching interventions to encourage or preserve cancellation in critic-free RL training: query-preserved mini-batching and reward-balanced batching. Despite their simplicity, these interventions improve RLVR training across multiple model scales, supporting cancellation as both an explanatory principle and a practical design criterion for critic-free RL training.

Comment: Provides a token-level mechanistic account of critic-free RL through cancellation effects between shared low-confidence tokens across positive and negative rollouts.

Topic Match: Best fit is training dynamics: it explains how rollout-level rewards induce hidden token-level credit assignment and proposes batching rules derived from that mechanism.

Relevance: 8 Novelty: 8

60. Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory

ArXiv ID: 2605.10658

Primary Topic: Architecture and Training Dynamics

Authors: Yao Shu, Jian Mu, Zhongxiang Dai

Abstract: Continual learning requires new-task adaptation without damaging previously acquired capabilities. Recent forward-pass and zeroth-order (ZO) results show that low-query adaptation may retain better than first-order (FO) descent, but the usual view of ZO as noisy FO estimation does not explain why. We give a local randomized gradient-shaping analysis: finite differences expose a raw shape that is mean-aligned with FO, while the norm-matched comparator fixes the expected squared adaptation norm. Under this controlled comparison, forgetting depends on how the adaptation shape exposes retention curvature. For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO--ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations. As an algorithmic transfer, RISE applies the calibrated ZO shape to exact FO gradients inside parameter blocks. Its target is a stability--plasticity tradeoff: randomized shaping may reduce the retention exposure paid by FO, exact gradients remove finite-smoothing bias from finite-difference ZO, and blockwise sampling supplies many local shaping directions after one gradient computation. The blockwise analysis separates mean-step damage from centered random exposure, showing how block-diagonal curvature, cross-block coupling, and local shaping diagnostics specify where this exact-gradient transfer is most likely to be visible.

Comment: Explains lower forgetting in zeroth-order adaptation through randomized gradient shaping and transfers the effect to first-order updates with RISE.

Topic Match: This is mainly a training-dynamics paper: it gives a mechanistic theory for adaptation-forgetting tradeoffs and proposes a derived optimization method.

Relevance: 8 Novelty: 8

61. Recovering Physical Dynamics from Discrete Observations via Intrinsic Differential Consistency

ArXiv ID: 2605.08454

Primary Topic: Architecture and Training Dynamics

Authors: Yuxiang Luo, Andrew Perrault

Abstract: Recovering continuous-time dynamics from discrete observations is difficult because local supervision (e.g., pointwise regression targets, derivative approximations, or equation residuals) loses fidelity as the observation interval grows. We replace local supervision with a global structural constraint: any flow representing autonomous dynamics must satisfy the semi-group property under time translation. We train a time-conditioned secant velocity field whose deviation from this property, which we call Symmetry Rupture, serves two purposes. As a training regularizer, it confines the hypothesis space to flows that compose consistently across temporal scales. As an inference oracle, it lets the solver select the largest step size that preserves internal consistency, replacing the local truncation error that conventional adaptive solvers depend on. On the diffusion-reaction benchmark under time-informed inference, our method reduces rollout RMSE by 87% while using 5x fewer function evaluations than a Neural ODE baseline. In the more demanding direct auto-regressive setting, where the model must predict distant future frames without intermediate temporal cues, our adaptive solver allocates compute based on local geometric complexity -- maintaining the lowest rollout RMSE on two of three PDE benchmarks while baselines either diverge or require up to an order of magnitude more function evaluations to remain stable.

Comment: Uses semigroup consistency of autonomous flows as a global structural prior for learning continuous dynamics from sparse temporal observations.

Topic Match: This is a mechanistic training/objective paper: the key idea is a new structural regularizer and adaptive inference principle for learned dynamical models.

Relevance: 8 Novelty: 8
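
Illustration for item 61: a minimal sketch of the semi-group property, flow(x, s + t) = flow(flow(x, s), t), on the scalar system dx/dt = -x. The defect function and the step-selection rule below are simplified stand-ins chosen for this example, not the paper's Symmetry Rupture definition or solver; `euler_flow` plays the role of an imperfect learned flow.

```python
import math


def exact_flow(x, t):
    """Exact flow map of dx/dt = -x."""
    return x * math.exp(-t)


def euler_flow(x, t):
    """One explicit-Euler step, used as a deliberately crude flow model."""
    return x * (1.0 - t)


def semigroup_defect(flow, x, s, t):
    """Gap between advancing by s + t at once and advancing by s, then by t."""
    return abs(flow(x, s + t) - flow(flow(x, s), t))


def largest_consistent_step(flow, x, candidates, tol):
    """Largest step h whose two half-steps agree with the full step within tol."""
    ok = [h for h in candidates if semigroup_defect(flow, x, h / 2, h / 2) <= tol]
    return max(ok) if ok else min(candidates)


exact_gap = semigroup_defect(exact_flow, 1.0, 0.3, 0.4)  # zero up to rounding
euler_gap = semigroup_defect(euler_flow, 2.0, 0.3, 0.4)  # equals x * s * t
step = largest_consistent_step(euler_flow, 1.0, [0.05, 0.1, 0.2, 0.4], tol=0.005)
```

The same defect can serve as a penalty during training and as a consistency check at inference, which is the dual role the abstract describes.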

62. RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

ArXiv ID: 2605.10706

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Byeongchan Kim, Arijit Sehanobish, Avinava Dubey, Min-hwan Oh, Krzysztof Choromanski

Abstract: We present a new class of efficient attention mechanisms applying universal 3D Relative Positional Encoding (RPE) methods given by arbitrary integrable modulation functions $f$. They lead to the new class of 3D-Transformer models, called RelFlexformers, flexibly integrating those RPEs, and characterized by the $O(L \log L)$ time complexity of the attention computation for the $L$-length input sequences. RelFlexformers build on the theory of the Non-Uniform Fourier Transform (NU-FFT), naturally generalizing several existing efficient RPE-attention methods from structured settings with tokens homogeneously embedded in unweighted grids into general non-structured heterogeneous scenarios, where tokens' positions are arbitrarily distributed in the corresponding 3D spaces. As such, RelFlexformers can be applied in particular to model point clouds. Our extensive empirical evaluation on a large portfolio of 3D datasets confirms quality improvements provided by the NU-FFT-driven attention modulation techniques in the RelFlexformers.

Comment: Introduces an O(L log L) attention mechanism for arbitrary integrable 3D relative positional encodings using NU-FFT, extending efficient attention beyond grid-structured settings.

Topic Match: The core contribution is a new attention/RPE mechanism and architectural formulation, with efficiency gains as a secondary benefit.

Relevance: 8 Novelty: 8

63. Improving Generalization by Permutation Routing Across Model Copies

ArXiv ID: 2605.09256

Primary Topic: Architecture and Training Dynamics

Authors: Shuhei Kashiwamura, Timothee Leleu

Abstract: We introduce a use of the $M$-cover (or $M$-layer) transform for machine learning. The method replicates a model $M$ times, but instead of coupling the copies through parameter averaging or an explicit attractive force, as in replicated SGD or Elastic SGD, it rewires the contexts in which local learning messages are computed. Each local loss is evaluated on a routed model whose parameters are drawn from different copies according to permutations sampled from a structured mixing kernel $Q$. Training then uses the original local update rule, while the resulting learning messages are redistributed across the copies through these routed computational paths. Thus $Q$ defines a topology for message transport and controls the long-loop structure of the lifted factor graph. We formulate this construction for perceptrons, committee machines, and multilayer perceptrons, showing that the same principle applies from discrete models to differentiable neural networks. The resulting framework provides a mechanism for improving generalization through structured message sharing rather than replica collapse or parameter-space coupling.

Comment: Uses permutation routing across replicated model copies to improve generalization through structured message sharing rather than parameter averaging.

Topic Match: This is a new training/architectural mechanism for how learning signals are routed across replicated models, directly matching training dynamics.

Relevance: 8 Novelty: 8

64. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

ArXiv ID: 2605.08737

Primary Topic: Architecture and Training Dynamics

Authors: Xin Li, Hao Jiang, Annan Wang, Yichi Zhang, Chau Yuen

Abstract: On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in domain, but past a threshold lambda the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests--a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction--fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.

Comment: Identifies a closed-form clip-safety threshold governing when on-policy distillation of near-deterministic structured outputs collapses format adherence.

Topic Match: This is directly about post-training dynamics and failure boundaries in a common LLM optimization regime, with a mechanistic threshold analysis rather than a benchmark tweak.

Relevance: 8 Novelty: 8

65. HyperTransport: Amortized Conditioning of T2I Generative Models

ArXiv ID: 2605.08254

Primary Topic: Architecture and Training Dynamics

Authors: Valentino Maiorca, Eleonora Gualdoni, Xavier Suau, Marco Cuturi, Luca Zappella, Pau Rodríguez

Abstract: As foundation models grow in capability, the ability to efficiently and reliably control their behavior becomes critical. Fine-tuning these models can be costly, and while prompting can be practical for controllability, it remains fragile due to models' high sensitivity to exact prompt wording and structure. This brittleness has driven interest in activation steering techniques that offer more stable and predictable control over model behavior. However, existing activation steering methods require per-concept optimization, which makes them ill-suited to deployment scenarios where the concept set is large, evolving, or only specified at request time: each new concept incurs at least minutes of optimization on the target model. We propose HyperTransport, a hypernetwork framework that amortizes this cost by mapping embeddings from a pretrained encoder (CLIP in our instantiation) directly to intervention parameters, trained end-to-end using an optimal transport loss. Once trained, HyperTransport produces each new intervention in a single hypernetwork forward pass, 3600-7000x faster than per-concept fitting. On concepts unseen during training, it matches the strongest per-concept baselines at inducing the target concept. By decoupling concept representation from intervention prediction, HyperTransport combines three capabilities that no existing approach offers as a set: amortized steering for open-ended concept sets, continuous interpretable strength control, and cross-modal conditioning where reference images can directly steer text-based generation. We validate HyperTransport on DMD2 and Nitro-1-PixArt across 167 held-out test concepts via CLIP-based metrics, a VLM-as-a-judge evaluation, and a user study. In pairwise comparisons, both human and VLM judges prefer HyperTransport over prompting ~2x as often.

Comment: Amortizes activation-steering by learning a hypernetwork that maps concept embeddings directly to intervention parameters, avoiding per-concept optimization.

Topic Match: Best fit is architectural mechanism design: the core contribution is a new control architecture for activation steering via a hypernetwork, not an application benchmark.

Relevance: 8 Novelty: 8

66. Don't Fix the Basis -- Learn It: Spectral Representation with Adaptive Basis Learning for PDEs

ArXiv ID: 2605.10451

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Xuxiang Zhao, Angelica I. Aviles-Rivero

Abstract: Spectral neural operators achieve strong performance for PDE learning, but rely on fixed global bases that limit their ability to represent spatially heterogeneous and multiscale dynamics. We propose Adaptive Basis Learning (ABLE), a framework that learns data-dependent spectral representations instead of relying on predefined bases. ABLE constructs a spatially adaptive Parseval frame via a learned ancillary density, enabling the operator to act in a lifted spectral space while preserving invertibility and maintaining $O(N\log N)$ complexity through FFT-based implementation. This shifts the source of expressivity from spectral coefficients to the representation itself, allowing the model to capture localized structures and non-translation-invariant interactions more efficiently. ABLE integrates seamlessly into existing neural operator architectures as a drop-in replacement for spectral layers. Across a range of benchmarks ABLE improves accuracy over strong baselines, with the largest gains in regimes characterized by sharp gradients and multiscale behavior. Moreover, augmenting existing models (e.g., U-FNO, HPM) with ABLE further enhances their performance, demonstrating its role as a general and complementary spectral refinement. Our results highlight that the data-driven choice of representation, rather than operator complexity alone, is a key bottleneck in neural operator design. By learning the basis itself, ABLE provides a principled and efficient framework for improving spectral methods in PDE learning.

Comment: Replaces fixed spectral bases in neural operators with learned adaptive bases, shifting expressivity from coefficients to representation choice.

Topic Match: Primary fit is architecture/training because the core innovation is a new spectral-layer design with adaptive basis learning as the key mechanism.

Relevance: 8 Novelty: 8

Efficiency, Compression, and Large-Scale Training (31)

1. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

ArXiv ID: 2605.09649

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Memory Structures and Agent Memory Systems

Authors: Ngoc Bui, Hieu Trung Nguyen, Arman Cohan, Rex Ying

Abstract: The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.

Comment: Introduces a globally calibrated, learnable KV-cache eviction policy that can improve long-context performance rather than only approximate full-cache inference.

Topic Match: The main contribution is a new KV-cache design for memory-efficient inference with algorithmic impact on long-context behavior, making efficiency/scaling the best fit.

Relevance: 10 Novelty: 8
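
Illustration for efficiency item 1: a minimal sketch of a single global eviction policy in which cache entries from all layers and heads compete for one budget, plus a toy softmax calculation of attention dilution. The score values, key layout `(layer, head, token)`, and token names are invented for this example; the paper's learned retention gates and calibration projection are not modeled.

```python
import heapq
from math import exp


def global_evict(scores, budget):
    """Keep the `budget` highest-scoring KV entries across all layers and heads.

    `scores` maps (layer, head, token_index) to a utility score.
    """
    kept = heapq.nlargest(budget, scores.items(), key=lambda item: item[1])
    return {key for key, _ in kept}


def attention_mass(logits, target):
    """Softmax weight that `target` receives among the cached tokens in `logits`."""
    z = sum(exp(v) for v in logits.values())
    return exp(logits[target]) / z


scores = {
    (0, 0, 0): 0.9, (0, 0, 1): 0.1, (0, 1, 0): 0.2,
    (1, 0, 0): 0.8, (1, 0, 1): 0.7, (1, 1, 0): 0.05,
}
kept = global_evict(scores, budget=3)  # layer 1, head 0 keeps two entries; others may keep none

logits = {"evidence": 2.0, "filler_1": 1.0, "filler_2": 1.0, "filler_3": 1.0}
full_mass = attention_mass(logits, "evidence")
pruned = {k: v for k, v in logits.items() if k in ("evidence", "filler_1")}
pruned_mass = attention_mass(pruned, "evidence")  # larger once two fillers are evicted
```

With a per-head budget each head would keep a fixed number of entries; the global ranking lets a head with many useful tokens take capacity from one with none, and removing low-utility tokens raises the softmax weight left for the useful ones.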

2. BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

ArXiv ID: 2605.10655

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Venugopalan Iyengar

Abstract: Trellis-coded quantization sets the current 2-bit post-training frontier for LLMs (QTIP), but pushing below the PTQ ceiling requires quantization-aware training, and QAT on a trellis is obstructed by the non-differentiable Viterbi argmax. We introduce BCJR-QAT, a relaxation that replaces the argmax with the BCJR forward-backward sum-product algorithm at temperature $T$, producing a soft codeword equal to the Boltzmann expectation over trellis paths, exactly differentiable, recovering the hard QTIP code as $T \to 0$, and mathematically identical to the transfer-matrix computation for a 1D Ising-like spin chain. We contribute (i) a fused Triton kernel making BCJR tractable on a single consumer GPU ($6.57\times$ speedup, fp32 parity); (ii) a quantitative drift-budget theory of when BCJR-QAT can escape the QTIP-PTQ Voronoi basin, verified across four experiments; and (iii) a positive empirical result on Llama-3.2-1B at 2 bpw under end-to-end forward-KL distillation: with the right schedule (skip the high-$T$ phase to avoid an overshoot we diagnose), single-layer BCJR-QAT beats QTIP-PTQ by $-0.084$ PPL on WikiText-2, and multi-layer compounding is super-additive.

Comment: Introduces a differentiable BCJR relaxation for trellis-coded quantization-aware training below the PTQ frontier, plus theory for escaping the PTQ basin.

Topic Match: This is a strong efficiency/compression paper: a new differentiable route to ultra-low-bit quantization-aware training with theory and systems support.

Relevance: 9 Novelty: 8
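
Illustration for efficiency item 2: a toy sketch of replacing a hard minimum-cost trellis path with a Boltzmann expectation computed by forward-backward sum-product. The 4-state trellis, code values, squared-error cost, and weights are all assumptions made for this example and are not the QTIP trellis or the paper's kernel; a practical implementation would also work in log space to avoid underflow.

```python
from itertools import product
from math import exp


def soft_codeword(weights, code, allowed, T):
    """Boltzmann-expected code value per position, via forward-backward."""
    n, S = len(weights), len(code)
    # phi[t][s]: Boltzmann factor for emitting code[s] at position t
    phi = [[exp(-((w - c) ** 2) / T) for c in code] for w in weights]
    fwd = [phi[0][:]]
    for t in range(1, n):
        fwd.append([phi[t][s] * sum(fwd[t - 1][r] for r in range(S) if allowed[r][s])
                    for s in range(S)])
    bwd = [[1.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = [sum(phi[t + 1][s] * bwd[t + 1][s] for s in range(S) if allowed[r][s])
                  for r in range(S)]
    soft = []
    for t in range(n):
        marginal = [fwd[t][s] * bwd[t][s] for s in range(S)]
        soft.append(sum(m * c for m, c in zip(marginal, code)) / sum(marginal))
    return soft


def valid_paths(n, allowed):
    """All state sequences of length n that respect the trellis transitions."""
    for path in product(range(len(allowed)), repeat=n):
        if all(allowed[r][s] for r, s in zip(path, path[1:])):
            yield path


def path_cost(path, weights, code):
    return sum((w - code[s]) ** 2 for w, s in zip(weights, path))


def hard_codeword(weights, code, allowed):
    """Minimum-cost path by exhaustive search (stands in for Viterbi at this size)."""
    best = min(valid_paths(len(weights), allowed),
               key=lambda p: path_cost(p, weights, code))
    return [code[s] for s in best]


def brute_force_soft(weights, code, allowed, T):
    """Same expectation as soft_codeword, summed over every path explicitly."""
    paths = list(valid_paths(len(weights), allowed))
    probs = [exp(-path_cost(p, weights, code) / T) for p in paths]
    z = sum(probs)
    return [sum(pr * code[p[t]] for pr, p in zip(probs, paths)) / z
            for t in range(len(weights))]


code = [0.0, 0.3, 0.7, 1.0]
# Toy 4-state trellis: state r may move to (2r) % 4 or (2r + 1) % 4.
allowed = [[s in ((2 * r) % 4, (2 * r + 1) % 4) for s in range(4)] for r in range(4)]
weights = [0.1, 0.8, 0.35, 0.9]

warm = soft_codeword(weights, code, allowed, T=0.1)    # smooth blend of nearby paths
cold = soft_codeword(weights, code, allowed, T=0.001)  # approaches the hard code
hard = hard_codeword(weights, code, allowed)
```

Because the soft codeword is built only from sums, products, and exponentials of the weights, it is differentiable in them, while the low-temperature limit reproduces the hard path that the argmax would pick.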

3. PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

ArXiv ID: 2605.08632

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Zihao An, Taichi Liu, Ziqiong Liu, Dong Li, Ruofeng Liu, Emad Barsoum

Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate that PARD-2 achieves up to 6.94$\times$ lossless acceleration, surpassing EAGLE-3 by 1.9$\times$ and PARD by 1.3$\times$ on Llama3.1-8B. Our code is available at https://github.com/AMD-AGI/PARD.

Comment: Trains speculative draft models against acceptance-length objectives rather than token accuracy, aligning training directly with decoding-time speedup.

Topic Match: This is a core inference-efficiency paper with a new objective tailored to speculative decoding behavior.

Relevance: 9 Novelty: 8

4. AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

ArXiv ID: 2605.08734

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Ziyun Liu, Fengmiao Bian, Jian-Feng Cai

Abstract: Low-Rank Adaptation (LoRA) reparameterizes a weight update as a product of two low-rank factors, but the Jacobian $J_{G}$ of the generator mapping the factors to the weight matrix is rank-deficient, so the factor-space preconditioner $J_{G}^* F_t J_{G}$ induced by any $W$-space preconditioner $F_t$ is singular, and consequently the standard chain rule cannot be uniquely inverted to map a preconditioned $W$-space direction back to a factor-space update. We cast existing LoRA optimizers in a unified framework parameterized by two choices: (i) which invertible surrogate for $J_{G}^* F_t J_{G}$ to use, and (ii) which $F_t$ on $W$ to use. Existing methods occupy four families along these axes: factor-space adaptive updates, block-diagonal surrogates for $J_{G}^* J_{G}$, Frobenius-residual pseudoinverse methods, and Riemannian manifold constraint. Within this design space, a gradient-statistics-aware $F_t$ paired with a closed-form factor-space solve at $O((m+n)r)$ memory remains underexplored. We propose AdaPreLoRA, which fills this gap by adopting the Adafactor diagonal Kronecker preconditioner $H_t$ on $W$ and selecting from the resulting factor-space solution family the element minimizing an $H_t$-weighted imbalance between the two factor contributions; by construction, the resulting factor update is the closest LoRA approximation to the preconditioned $W$-space direction under the $H_t$-weighted norm. Across GPT-2 (E2E), Mistral-7B and Qwen2-7B (GLUE, ARC, GSM8K), and diffusion-model personalization, AdaPreLoRA is competitive with or improves over a representative set of LoRA optimizers while keeping peak GPU memory at the LoRA optimizer level.

Comment: Derives an Adafactor-preconditioned LoRA update by solving the singular factor-space inverse problem with an H_t-weighted optimality criterion.

Topic Match: The primary contribution is a principled low-rank adaptation optimizer that improves parameter-efficient training under realistic memory constraints.

Relevance: 9 Novelty: 8

5. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

ArXiv ID: 2605.08639

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, Xin Jin

Abstract: Load imbalance is a long-standing challenge in Mixture-of-Experts (MoE) training and is exacerbated in reinforcement learning (RL) for LLMs, where hot experts can shift frequently across micro-batches. Existing MoE training systems rely on historical loads to predict future expert demand, making them less effective under sharp fluctuations. We propose ReLibra, an MoE RL training system that exploits a unique opportunity in RL's rollout-training workflow, routing replay, to enable fine-grained load balancing at micro-batch granularity. Because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. Leveraging this information, ReLibra places two MoE load-balancing mechanisms at inter- and intra-batch timescales, matching their communication patterns to hierarchical network bandwidths. At the inter-batch timescale, ReLibra performs expert reordering to redistribute experts for batch-level cross-node balancing; at the intra-batch timescale, it dynamically performs expert replication within a node to absorb micro-batch-level load fluctuations. Experiments on diverse MoE LLMs and RL workloads show that ReLibra improves training throughput by up to 1.6$\times$ over Megatron-LM and by up to 1.2$\times$ over EPLB, even when EPLB is given oracle loads. Moreover, ReLibra remains within 6%-10% of the throughput of an idealized balanced baseline.

Comment: Uses routing replay from RL rollouts to do micro-batch and batch-level MoE load balancing, a concrete systems idea for unstable expert demand.

Topic Match: The main advance is an MoE training-system method that materially improves throughput under RL-specific routing volatility.

Relevance: 9 Novelty: 8

6. Test-Time Speculation

ArXiv ID: 2605.09329

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Avinash Kumar, Sujay Sanghavi, Poulami Das

Abstract: Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the acceptance length, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, like DFlash, EAGLE-3 and PARD, degrades with generation length, reaching values close to 1 (i.e. no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose Test-Time Speculation (TTS), an online distillation approach that continuously adapts the speculator at test time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as the teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to 72%, and by 41% on average, with the benefits scaling with increased generation lengths.
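The acceptance-length metric in this abstract can be made concrete with a small sketch. It assumes greedy verification, where a draft token is accepted only if it equals the token the target would have chosen at that position; the function names and the toy token IDs are hypothetical and do not come from the paper. Conventions differ on whether the token the target emits at the first mismatch is also counted; this sketch counts matched draft tokens only.

```python
from typing import Sequence


def acceptance_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target's choices.

    Verification stops at the first mismatch, so later matches do not count.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


def mean_acceptance_length(rounds) -> float:
    """Average acceptance length over (draft, target) speculation rounds."""
    lengths = [acceptance_length(d, t) for d, t in rounds]
    return sum(lengths) / len(lengths)


# Toy speculation rounds (hypothetical token IDs).
rounds = [
    ([5, 9, 2, 7], [5, 9, 4, 7]),  # mismatch at position 2
    ([3, 3, 1], [3, 3, 1]),        # every draft token accepted
    ([8, 1], [6, 1]),              # first draft token rejected
]
print(mean_acceptance_length(rounds))
```

Tracking this average as a function of output position is one way to observe the degradation with generation length that the abstract describes.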

Comment: Adapts speculative decoding online at test time using the target model's own verification signal, improving long-generation acceptance length.

Topic Match: The key advance is a new inference-efficiency algorithm that changes long-output speculative decoding behavior without extra supervision.

Relevance: 9 Novelty: 8

7. Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

ArXiv ID: 2605.08423

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics, Memory Structures and Agent Memory Systems

Authors: Omatharv Bharat Vaidya, Connor T. Jerzak, Nhat Ho, Chandrajit Bajaj

Abstract: We present a data-adaptive method for parameter-efficient fine-tuning of large neural networks. Standard low-rank adaptation methods improve efficiency by restricting each layer update to a fixed low-rank form, but this static parameterization can be too rigid when the appropriate correction depends on the input and on the evolving depth-wise computation of the network. Our approach replaces a purely layer-local adapter with a shared queryable memory of low-rank update atoms. For each block of layers, the model forms a query from the current low-rank state and a running summary of previous blocks, uses this query to retrieve a content-dependent combination of shared update components via attention, and applies the resulting routed operator within the low-rank bottleneck. In this way, the method retains the efficiency and scalability of low-rank adaptation while allowing the effective update to vary across inputs and to share reusable structure across layers. The resulting architecture provides a principled middle ground between static LoRA-style updates and fully generated parameter updates: it remains compact and parameter-efficient while supporting dynamic, context-sensitive adaptation. Further, we incorporate instruction-regularization by augmenting routing logits with a language-induced prior over update atoms, thereby biasing the selection of low-rank transformations toward semantically relevant directions without generating unconstrained parameter updates. Experiments on noisy non-linear regression tasks and LLM fine-tuning suggest that this queryable update-memory formulation can improve final test performance and training stability compared to standard low-rank adaptation, while using a comparable number of trainable parameters.

Comment: Replaces static LoRA with a shared queryable memory of low-rank update atoms routed by content and instruction-conditioned priors.

Topic Match: The core idea is a parameter-efficient adaptation mechanism that dynamically composes low-rank updates, squarely within efficient fine-tuning.

Relevance: 9 Novelty: 8

8. Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning

ArXiv ID: 2605.09490

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Memory Structures and Agent Memory Systems

Authors: Aojie Yuan, Tianqi Shen, Dajun Zhang

Abstract: Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current SOTA eviction method -- on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB HBM savings at production batch sizes.
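The distinction between offloading and eviction can be illustrated with a toy single-query attention in NumPy. This is only a sketch under stated assumptions, not the authors' system: the 50/50 tier split, the array sizes, and all names are hypothetical, and a plain array copy stands in for real GPU-CPU transfers.

```python
import numpy as np


def attention(q, K, V):
    """Single-query softmax attention over keys K and values V."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ V


rng = np.random.default_rng(0)
n, d = 64, 16
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

full = attention(q, K, V)

# Hypothetical tiering: the first half stays "in HBM", the rest is offloaded.
in_hbm = np.arange(n) < n // 2
cpu_K, cpu_V = K[~in_hbm].copy(), V[~in_hbm].copy()  # stand-in for CPU memory

# Offloading: prefetch the rows back at full precision before attention.
restored = attention(
    q,
    np.concatenate([K[in_hbm], cpu_K]),
    np.concatenate([V[in_hbm], cpu_V]),
)

# Eviction: the same rows are dropped and never seen again.
evicted = attention(q, K[in_hbm], V[in_hbm])

offload_error = np.abs(full - restored).max()
evict_error = np.abs(full - evicted).max()
print(offload_error, evict_error)
```

Because the prefetched rows contribute the same softmax terms as before, the offloaded result matches full attention, while the evicted result generally does not.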

Comment: Semantics-aware KV-cache hierarchy shows offloading without eviction preserves reasoning while cutting HBM use.

Topic Match: The main idea is a materially new memory/cache design for efficient inference, with strong algorithmic and systems implications for large-model serving.

Relevance: 9 Novelty: 8

9. Nectar: Neural Estimation of Cached-Token Attention via Regression

ArXiv ID: 2605.09778

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Memory Structures and Agent Memory Systems

Authors: João Monteiro, Michal Klein, Pierre Ablin, Marco Cuturi

Abstract: Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|\theta|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.

Comment: Approximates long-context cached-token attention with compact learned regressors whose cost is independent of cache length.

Topic Match: Best fit is efficiency/scaling because the contribution is a new KV-cache/attention replacement mechanism for long-context inference efficiency.

Relevance: 9 Novelty: 8

10. TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

ArXiv ID: 2605.09281

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Hongyaoxing Gu, Xinzhe Chen, Lijuan Hu, Fangfang Liu

Abstract: Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet the massive parameter count of their experts poses significant challenges for deployment. While low-rank quantization offers a promising route to compress MoE models, existing methods still incur non-negligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts additional memory usage by up to 10$\times$ and reduces inference latency to $\sim$5% while preserving state-of-the-art accuracy.

Comment: Introduces 2D-tiled low-rank post-training quantization for MoE experts with fused inference to cut memory overhead and latency.

Topic Match: Primary fit is efficiency/compression since it targets MoE deployment cost with a specific new quantization and execution scheme.

Relevance: 9 Novelty: 8

11. RubiConv -- Efficient Boundary-Respecting Convolutions

ArXiv ID: 2605.08451

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Linda Friso, Annie Marsden, Xinyi Chen, Arushi Gupta, Peter Bartlett, Mark Braverman, Elad Hazan

Abstract: Convolutional architectures have emerged as powerful alternatives to Transformers for sequence modeling. The primary advantage is that they offer improved theoretical sequence length complexity by leveraging the Fast Fourier Transform (FFT). However, this theoretical improvement does not always meaningfully land in practice. One critical obstacle is that applying standard FFTs is not amenable to the large-scale training pipeline wherein data is packed from different sources into a single sequence for hardware efficiency. Indeed, standard FFT algorithms are not easily amenable to document packing. Existing workarounds suffer from severe inefficiencies, crippling the practical performance of convolutional architectures. We close this gap with RubiConv, a novel algorithm for performing hardware-efficient, boundary-respecting convolutions on packed sequences. Extensive experiments show that RubiConv achieves significant speedups over both attention and standard FFT-based baselines. This work makes the theoretical efficiency of long convolutional models a practical reality for large-scale, real-world data packing.

Comment: Boundary-respecting FFT convolutions for packed sequences make long-convolution sequence models practical in large-scale training.

Topic Match: Best fit is efficiency/scaling because the core contribution is a new training/inference algorithm that changes the practical cost profile of convolutional sequence models under real packing constraints.

Relevance: 9 Novelty: 8

12. LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

ArXiv ID: 2605.08755

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Euntae Choi, Sumin Song, Sungjoo Yoo

Abstract: Large reasoning models (LRMs) reach competition-level math and coding accuracy via long autoregressive decoding, making per-token decoding cost a primary deployment concern. Weight quantization is the standard tool for acceleration, but representative recipes -- including state-of-the-art end-to-end (E2E) QAT -- lose accuracy on long-decoding reasoning benchmarks despite preserving perplexity and short-decode accuracy. Through a systematic gradient-direction analysis, we identify two factors driving this gap: (i) KV-cache fidelity preservation under the QAT loss, which E2E supervision attenuates via the softmax Fisher metric; and (ii) Hessian-subspace alignment between calibration data and the deployment distribution. We propose LookAhead Quantization (LAQuant), a layer-wise weight-only QAT method that addresses both factors without online-transform overhead by combining reasoning-domain calibration with a one-layer lookahead loss whose implicit cross-layer co-adaptation preserves the next-layer residual stream. For Qwen3-4B under W3G128 quantization, LAQuant improves AIME25 Pass@1 over ParoQuant by 15.11pp (1.93pp over ParoQuant++ at matched calibration) while achieving a 3.42x decoding speedup over FP16 on RTX A6000, compared with ParoQuant's 3.01x.

Comment: Introduces a layer-wise lookahead QAT loss aimed at preserving KV-cache fidelity and long-decoding reasoning under low-bit quantization.

Topic Match: Best fit is efficiency/scaling because the contribution is a new quantization method for lowering decoding cost while preserving large-reasoning-model behavior.

Relevance: 9 Novelty: 8

13. ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

ArXiv ID: 2605.10793

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Chayne Thrash, Ali Abbasi, Soheil Kolouri

Abstract: Large language models (LLMs) are costly to deploy due to their large memory footprint and high inference cost. Weight-activation quantization can reduce these costs, but low-bit activation quantization remains difficult because activation outliers induce large quantization error. Recent rotation-based methods address this by applying orthogonal transformations that redistribute activation magnitude across dimensions, but existing approaches either require expensive end-to-end rotation training or rely on stored activation corpora, introducing significant compute or storage overhead. We propose a lightweight post-training rotation calibration method for LLM activation quantization. Our method learns orthogonal rotations that align normalized activations with the corners of an inscribed hypercube, encouraging activation energy to be distributed more evenly across dimensions. This objective admits an efficient closed-form update via the orthogonal Procrustes problem, avoiding gradient-based optimization over the orthogonal group. We further introduce an online calibration procedure that updates rotations as calibration samples are processed, eliminating the need to store activations on disk and allowing rotations to adapt to quantized activation distributions during calibration. Experiments on Llama-2 and Llama-3 models from 3B to 70B parameters show that our method achieves competitive or improved performance across perplexity benchmarks and common sense reasoning tasks while avoiding both costly end-to-end training and large offline activation storage.
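The closed-form update mentioned in this abstract can be sketched with the standard orthogonal Procrustes solution. The code below is an illustrative assumption, not the authors' calibration procedure: it alternates between picking the nearest hypercube corner for each unit-norm activation row and solving for the rotation that best maps the activations onto those corners. The synthetic data, the outlier channel, and all function names are hypothetical.

```python
import numpy as np


def corner_error(X, R):
    """Squared distance from rotated rows to their nearest hypercube corner.

    Corners are the sign patterns scaled by 1/sqrt(d), so they lie on the
    unit sphere together with the normalized activations.
    """
    Z = X @ R
    T = np.where(Z >= 0, 1.0, -1.0) / np.sqrt(Z.shape[1])
    return float(np.linalg.norm(Z - T) ** 2), T


def procrustes_step(X, T):
    """Orthogonal R minimizing ||X R - T||_F, via the SVD of X^T T."""
    U, _, Vt = np.linalg.svd(X.T @ T)
    return U @ Vt


rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(256, d))
X[:, 0] *= 10.0                                # synthetic outlier channel
X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize each activation

R = np.eye(d)
errors = []
for _ in range(10):
    err, T = corner_error(X, R)  # fix targets given the current rotation
    errors.append(err)
    R = procrustes_step(X, T)    # closed-form rotation given the targets
errors.append(corner_error(X, R)[0])

print(errors[0], errors[-1])
```

Each step can only lower or preserve the corner error, since the rotation is optimal for the fixed targets and the targets are the nearest corners for the fixed rotation. No gradient descent over the orthogonal group is needed.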

Comment: Learns orthogonal rotations for low-bit activation quantization via a lightweight Procrustes-based calibration procedure without end-to-end retraining or activation dumps.

Topic Match: This is a clear match to activation quantization and efficient LLM inference under constrained precision.

Relevance: 9 Novelty: 8

14. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

ArXiv ID: 2605.10886

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Liang Luo, Yinbin Ma, Quanyu Zhu, Vasiliy Kuznetsov, Yuxin Chen, Jian Jiao, Jiecao Yu, Buyun Zhang, Tongyi Tang, Xiaohan Wei, Yanli Zhao, Zeliang Chen, Yuchen Hao, Venkatesh Ranganathan, Sandeep Parab, Yantao Yao, Maxim Naumov, Chunzhi Yang, Shen Li, Ellie Wen, Wenlin Chen, Santanu Kolay, Chunqiang Tang

Abstract: Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

Comment: Makes FP8 practical for recommendation models through a system-model co-design stack combining statistical safety probing, model adaptations, and runtime kernel dispatch.

Topic Match: This is squarely an efficiency paper: low-precision training/inference with workload-aware algorithm-system co-design for large-scale models.

Relevance: 9 Novelty: 8

15. Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

ArXiv ID: 2605.09034

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Jiahe Chen, Ziye Ma

Abstract: Zeroth-order (ZO) optimization has become increasingly popular and important in fine-tuning large language models (LLMs), especially on edge devices, due to its ability to adjust the model to local data without the need for memory-intensive back-propagation. Recent works try to reduce ZO variance through low-dimensional subspace search, but subspace restriction alone leaves key optimization geometry under-exploited, motivating additional acceleration. In this work, we focus on the hidden layer training problem in which spectral optimizers like Muon outperform AdamW due to their ability to exploit weak spectral directions by orthogonalization. However, we have discovered that, unlike in the first-order setting, full orthogonalization works poorly in the ZO setting since the gradient estimates are highly noisy and unreliable. To address this issue, we propose an approach we call partial orthogonalization: we replace the iconic Newton-Schulz procedure in Muon with the faster, more concentrated power-iteration method so that it only amplifies dominant spectral directions. Furthermore, to improve the efficiency and generalization of the algorithm, we adopt a streaming variant of power iteration that requires low variance in gradients, which we achieve by constraining our search to a subspace obtained through the projection of momentum, echoing recent advances. Experiments on LLM fine-tuning show that our method achieves 1.5x to 4x the convergence speed of ZO-Muon, the current SOTA algorithm, across SuperGLUE datasets on the OPT-13B model. Across different models, we also reach competitive final accuracies in less time in most cases compared with strong ZO baselines such as MeZO, LOZO and ZO-Muon. Code is available at https://github.com/MOFA-LAB/ZO-MOPI.git.

Comment: Introduces partial orthogonalization for zeroth-order spectral optimization, improving memory-light LLM tuning dynamics.

Topic Match: The main contribution is a new optimization method for efficient large-model training without backprop, directly in the efficiency/scaling bucket.

Relevance: 9 Novelty: 8

16. GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference

ArXiv ID: 2605.10124

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Zengzipeng Tang, Yuxuan Sun, Wei Chen, Jianwen Ding, Bo Ai

Abstract: The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. As a promising architecture, Speculative Decoding (SD) is increasingly adopted, where a lightweight draft model rapidly generates candidate tokens to be verified by a powerful target model. However, a fundamental challenge lies in achieving per-token resource scheduling to effectively adapt the SD paradigm to resource-constrained edge environments. This paper proposes a Generative Entropy- and Lyapunov-based Adaptive Token Offloading framework, named GELATO, to maximize decoding throughput under energy constraints in a device-edge collaborative SD system. Specifically, an outer drift-plus-penalty loop makes online decisions to establish a reference drafting budget, managing the long-term energy-throughput trade-off. Further, a nested entropy-driven generation mechanism executes early exiting to adapt to per-token dynamic generative uncertainty. Theoretical analysis establishes a rigorous performance bound on long-term throughput for GELATO. Extensive evaluations demonstrate that GELATO achieves a globally optimal trade-off, outperforming state-of-the-art distributed SD architectures by 64.98% in token throughput and reducing energy consumption by 47.47% under resource-constrained environments, while preserving LLM decoding quality.

Comment: Designs adaptive token offloading for device-edge speculative decoding using entropy-based control and Lyapunov optimization.

Topic Match: This squarely targets memory/compute-efficient inference with a new scheduling algorithm for distributed speculative decoding.

Relevance: 9 Novelty: 8

17. AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery

ArXiv ID: 2605.10741

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Barbara Su, Fangshuo Liao, Anastasios Kyrillidis

Abstract: Fine-tuning large language models with LoRA requires choosing a rank r before training starts. Existing approaches either extract rank-1 components sequentially, freezing each component's error permanently into every subsequent residual, or optimize the full low-rank factorization jointly with guarantees that describe only the joint update, not individual rank-1 directions. We present AdaPaD (Adaptive Parallel Deflation), which trains all rank-1 components simultaneously: each worker refines its component against a deflation target built from the latest estimates of all predecessors, and as those estimates improve, the targets improve too. We call this property self-correction: deflation errors converge to zero over rounds rather than persisting as fixed residuals. On top of this backbone, AdaPaD adds advance learning (private pre-training before activation) and per-module dynamic rank discovery (importance-based growth until a shared budget is exhausted), making the rank distribution an output rather than an input. We prove that every component's error decays exponentially after a warm-up period, with a generalization bound that splits into a vanishing algorithmic term and an irreducible statistical floor. Empirically, AdaPaD is competitive with adaptive-rank LoRA baselines on GLUE with DeBERTaV3-base at matched parameter budgets, and competitive with fixed-rank LoRA on Qwen3-0.6B SQuAD/SQuAD v2 while deploying an adapter that is on average 30.7% smaller.

Comment: Adaptive Parallel Deflation gives a new self-correcting way to train rank-1 LoRA components jointly while discovering per-module rank during training.

Topic Match: The core contribution is a PEFT efficiency method: dynamic rank discovery and a new parallel deflation algorithm that changes how low-rank adapters are trained.

Relevance: 9 Novelty: 8

18. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

ArXiv ID: 2605.09735

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Zhiqing Zhong, Zhijing Ye, Jian Zhang, Weijian Zheng, Bolun Sun, Xiaodong Yu

Abstract: Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them. Optional bounded far-history summaries can be enabled under the same interface, but the core design does not depend on them. On a 2-GPU NVIDIA A100 node, KV-RM improves mixed-length decoding throughput and tail latency relative to a static-graph baseline, reduces reserved KV memory across workload families, and removes severe burst-time latency spikes under production-trace replay. These results suggest that KV-cache movement, rather than kernel shape, can be an effective boundary for recovering runtime flexibility in static-graph LLM serving.

Comment: Regularizes KV-cache movement beneath a static-graph decoder using a pager and committed descriptors to recover serving flexibility without dynamic kernels.

Topic Match: This is directly about KV-cache design and memory-efficient LLM serving, with a concrete runtime mechanism rather than routine deployment tuning.

Relevance: 9 Novelty: 8

19. Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

ArXiv ID: 2605.09189

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Christopher M. Bryant, Hao Liu

Abstract: The scaling laws guiding modern model training were calibrated for a single regime: data-rich, single-epoch pretraining. The dominant such scaling law form, Chinchilla's $L = E + A/N^\alpha + B/D^\beta$, has three structural limitations outside that regime: it diverges as unique data shrinks instead of saturating at the uninformed baseline; it cannot represent overfitting when capacity exceeds the data; and it conflates total examples seen with unique examples available. We propose a closed-form extension, $L(N, D, T) = E + (L_0 - E)\,h/(1+h)$ with $h = a/N^\alpha + b/T^\beta + c\,N^\gamma/D^\delta$, that decomposes loss into undercapacity, undertraining, and overfitting terms. It saturates between the irreducible loss $E$ and an uninformed baseline $L_0$ fixed by the loss type, and reduces to Chinchilla in the data-rich, single-epoch limit. We validate it on four multi-epoch experiments spanning four architecture families (MLPs, ResNets, Fourier neural operators, and transformers) across vision, scientific ML, and language domains, and refit it to five published LLM scaling-law grids. Extrapolating to higher compute and larger unique data than seen at fit time, our form achieves state-of-the-art RMSE on every published LLM grid we evaluate and on most cells of our constructed experiments. Once calibrated, the form admits a cost-aware allocation that recovers Chinchilla's optimum when data is free and shifts toward smaller corpora and more epochs as data grows expensive.

Comment: Proposes a closed-form scaling law that separates undercapacity, undertraining, and overfitting, extending beyond Chinchilla to multi-epoch, data-constrained regimes.

Topic Match: The paper is best categorized as efficiency/scaling because it offers a more realistic compute-to-performance law for training allocation under limited data.

Relevance: 8 Novelty: 8
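
The closed form quoted in the abstract is easy to evaluate directly. The sketch below transcribes $L(N, D, T) = E + (L_0 - E)\,h/(1+h)$ as written; the function name `saturating_loss` and every coefficient value are placeholders chosen only to exercise the formula, not fitted values from the paper. For small $h$, $h/(1+h) \approx h$, so the loss is approximately $E + (L_0 - E)\,h$, which has the Chinchilla shape when $T = D$ and the overfitting term is negligible; this is presumably the sense in which the form reduces to Chinchilla.

```python
def saturating_loss(N, D, T, *, E, L0, a, alpha, b, beta, c, gamma, delta):
    """L(N, D, T) = E + (L0 - E) * h / (1 + h), with h as given in the abstract.

    N: parameters, D: unique examples, T: total examples seen.
    """
    h = a / N**alpha + b / T**beta + c * N**gamma / D**delta
    return E + (L0 - E) * h / (1.0 + h)


# Placeholder coefficients (assumptions), not values reported by the authors.
coef = dict(E=1.7, L0=10.0, a=50.0, alpha=0.3, b=100.0, beta=0.3,
            c=1.0, gamma=0.5, delta=1.0)

data_rich = saturating_loss(1e9, 1e11, 1e11, **coef)    # single epoch, T == D
data_starved = saturating_loss(1e9, 1e3, 1e11, **coef)  # same T, tiny unique set
print(round(data_rich, 3), round(data_starved, 3))
```

With these placeholder values the data-starved loss rises toward `L0` but never exceeds it, which is the saturation behavior the abstract contrasts with Chinchilla's divergence as unique data shrinks.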

20. Locking Pretrained Weights via Deep Low-Rank Residual Distillation

ArXiv ID: 2605.10777

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Keitaro Sakamoto, Pierre Ablin, Federico Danieli, Marco Cuturi

Abstract: The quality of open-weight language models has dramatically improved in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. They also allow for more open research and testing, to the extent that users can use them as checkpoints, fine-tune them according to their needs, and potentially redistribute them. In some cases, however, concerns about modifying these weights for unauthorized uses may outweigh the benefits of giving users such freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can observe all weights and architectures by definition, they can reverse simple structural defenses, and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method where the purveyor of the model purposely replaces each pretrained MLP in their model with a deep low-rank residual network (DLR-Net) of comparable parameter count, forcing activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock results in architectural mismatches that complicate the optimization landscape of standard fine-tuning, and a backward pass that incurs disproportionately more overhead than the forward pass. Our defense succeeds in withstanding adaptive attackers with full knowledge of the defense strategy while preserving the original model's capabilities. Experiments on LLMs validate these claims.

Comment: Locks pretrained models by replacing MLPs with deep low-rank residual modules that are cheap in forward pass but expensive to fine-tune via backprop.

Topic Match: Despite the security framing, the technical core is an architectural/training-cost asymmetry mechanism exploiting backward-memory overhead.

Relevance: 8 Novelty: 8

21. Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features

ArXiv ID: 2605.09345

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Representation Learning Theory and Structure

Authors: Guangqi Li, Yongxin Li

Abstract: We identify a Selection Plateau phenomenon in one-shot neural network pruning: all rank-monotone weight scorers converge to identical accuracy at fixed sparsity, independent of functional form. We propose the Sparsity-Information-Complexity Spectrum (SICS) hypothesis: a sparsity-dependent minimum feature complexity kappa(S) governs plateau escape, with kappa=0 sufficient at low sparsity (S0.75). On ViT-Small/CIFAR-10, testing nine feature classes across four sparsities, smooth non-monotone features provide +6.6% escape at S=0.7, while only raw features with high-frequency wiggle escape at S=0.8 (+2.6%). A fake non-monotone scorer underperforms the gradient baseline, indicating the requirement is magnitude-independent non-monotonicity. A handcrafted Gaussian bump achieves only +0.006 escape vs. chaos-derived +0.046, indicating rank-alignment is necessary but insufficient. SICS provides a unifying explanation for the performance clustering of diverse pruning methods and suggests that future selection algorithms should adapt feature complexity to target sparsity.

Comment: Identifies a selection plateau in one-shot pruning and proposes a sparsity-dependent hierarchy of feature complexity needed to escape it.

Topic Match: Its primary value is mechanistic understanding of pruning behavior and what feature complexity matters at different sparsity levels.

Relevance: 8 Novelty: 8
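
One intuition for the plateau can be checked in a few lines, under the assumption (not confirmed by the abstract) that "rank-monotone" means a score that is strictly increasing in $|w|$. Any such scorer induces the same ranking of weights, so the one-shot keep mask at a fixed sparsity is identical regardless of functional form. The helper `keep_mask` and the scorer names are hypothetical, and the weights are random stand-ins rather than a trained ViT.

```python
import math
import random


def keep_mask(weights, scorer, sparsity):
    """Keep the top (1 - sparsity) fraction of weights according to `scorer`."""
    n_keep = len(weights) - int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: scorer(weights[i]), reverse=True)
    kept = set(order[:n_keep])
    return [i in kept for i in range(len(weights))]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

scorers = {
    "magnitude": abs,                             # monotone in |w|
    "squared": lambda w: w * w,                   # monotone in |w|
    "log1p": lambda w: math.log1p(abs(w)),        # monotone in |w|
    "bump": lambda w: -abs(abs(w) - 1.0),         # non-monotone: prefers |w| near 1
}
masks = {name: keep_mask(weights, f, sparsity=0.7) for name, f in scorers.items()}
print(masks["magnitude"] == masks["squared"] == masks["log1p"],
      masks["bump"] == masks["magnitude"])
```

Only the non-monotone scorer selects a different set of weights, which is consistent with the abstract's claim that escaping the plateau requires non-monotone features; whether a different mask is a better one is exactly what the paper's experiments address.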

22. Compute Where it Counts: Self Optimizing Language Models

ArXiv ID: 2605.10875

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Yash Akhauri, Mohamed S. Abdelfattah

Abstract: Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency Pareto front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.

Comment: Learns per-token compute allocation inside a frozen LLM by choosing sparsity, pruning, and quantization actions from hidden states.

Topic Match: This is primarily an efficiency paper because the main idea is dynamic inference-time budget allocation for cost-quality tradeoffs.

Relevance: 8 Novelty: 8

23. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

ArXiv ID: 2605.09238

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Yibang Li, Bihari Lal Pandey, Ravi Sah, Andi Han, Cyrus Mostajeran, Pratik Jawanpuria, Bamdev Mishra

Abstract: Muon and related norm-constrained matrix optimizers have become central to large-scale learning problems. They are formulated as a linear maximization oracle (LMO) over an ambient matrix-norm ball in unconstrained Euclidean space. However, these do not generalize cleanly to manifold-valued parameters such as low-rank factorizations, orthogonality constraints, or symmetric positive definite (SPD) matrices. Naively restricting the Muon LMO to the tangent space (i) breaks quotient symmetries and (ii) couples the tangent-space constraint with an ambient norm bound, thereby obstructing closed-form solutions on various manifolds of interest. We resolve both issues with a single observation: every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space, and the resulting intrinsic norm constrained LMO is symmetry preserving. Building on this, we introduce intrinsic Muon (iMuon), a unified framework that yields closed-form updates on the fixed-rank, SPD, Stiefel, and Grassmann manifolds for any unitarily invariant norm, including the spectral, Frobenius, and nuclear norms. We establish convergence guarantees for both deterministic and stochastic iMuon with rate constants that depend only on the manifold dimension. Notably, on the fixed-rank manifold this constant depends only on the rank, making the rate independent of factor conditioning and removing the runtime factor-rescaling required by prior work. Experiments on LoRA finetuning of LLMs, image classification, and subspace learning illustrate the efficacy of the proposed approach.

Comment: Extends Muon-style norm-constrained optimization to manifold-valued parameters via intrinsic tangent-space norms with closed-form updates.

Topic Match: This is primarily an optimizer and large-scale training-method paper, even though it also has strong mathematical structure.

Relevance: 8 Novelty: 8
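
As background for the Euclidean starting point the abstract refers to: for the spectral-norm ball, the linear maximization oracle applied to a matrix $G$ with thin SVD $G = U \Sigma V^\top$ is attained at the orthogonal polar factor $U V^\top$ (up to sign convention and radius). The sketch below approximates that factor with the classic cubic Newton-Schulz iteration in pure Python. It is a stand-in only: practical Muon implementations typically use a tuned polynomial variant, and nothing here implements iMuon's manifold updates. The helper names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def polar_factor(G, steps=30):
    """Approximate U V^T for a full-column-rank G via cubic Newton-Schulz."""
    fro = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / fro for x in row] for row in G]  # singular values now lie in (0, 1]
    for _ in range(steps):
        XtX = matmul(transpose(X), X)
        XXtX = matmul(X, XtX)
        # X <- 1.5 X - 0.5 X X^T X pushes every singular value toward 1.
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X


G = [[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]]  # stand-in "gradient"
Q = polar_factor(G)
gram = matmul(transpose(Q), Q)            # should be close to the 2x2 identity
inner = sum(g * q for rg, rq in zip(G, Q) for g, q in zip(rg, rq))
print([[round(v, 6) for v in row] for row in gram], round(inner, 6))
```

The inner product $\langle G, Q \rangle$ equals the sum of the singular values of $G$, which is the maximum a unit-spectral-norm direction can achieve; that is the sense in which the orthogonalized update solves the LMO.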

24. Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization

ArXiv ID: 2605.10673

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Yao Shu, Zilin Zhu

Abstract: Low-bit forward evaluation is an attractive route to memory-efficient zeroth-order (ZO) adaptation: the optimizer needs only scalar losses, and the model can be queried near deployment precision. The obstacle is that a quantized ZO query is not a continuous finite difference followed by harmless storage rounding. The query chooses endpoints, the low-precision engine rounds them, and the loss difference is measured along the rounded chord. For nonuniform companding quantizers, this makes the codebook insufficient to predict ZO behavior: a fixed weight-space radius can collapse in dense cells, over-span sparse cells, or assign a rounded chord to an unrounded update direction. We identify the missing object as query geometry and model scalar nonuniform quantization as $Q = \phi^{-1} \circ U \circ \phi$. CAQ-ZO (Compander-Aligned Queries for Zeroth-Order Optimization) forms one-grid-step Rademacher stencils $z \pm \Delta r$ in $z = \phi(x)$, maps endpoints back through $\phi^{-1}$, and updates in $z$. Our theory proves the grid-span mismatch, decomposes endpoint-rounding estimator residuals, and gives stationarity bounds in which generic off-grid queries retain a $\Delta^2/\mu^2$ residual channel while CAQ-ZO makes the query-time residual exactly zero. Synthetic experiments isolate this channel, and matched NF4 Qwen/Llama fine-tuning shows that CAQ-ZO improves the trained NF4 baseline under the same quantizer and evaluation budget.

Comment: Shows quantized zeroth-order optimization depends on query geometry and proposes compander-aligned queries that remove query-time residual error.

Topic Match: The main contribution is a new low-bit optimization method that changes adaptation behavior under quantized evaluation budgets.

Relevance: 8 Novelty: 8
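
The quantizer model $Q = \phi^{-1} \circ U \circ \phi$ and the aligned stencil $z \pm \Delta r$ can be sketched with a scalar compander. The code below uses a mu-law curve as a hypothetical stand-in for $\phi$ (NF4's actual codebook is different), and the constants `MU` and `DELTA` are assumptions. It shows two things from the abstract: endpoints built one grid step apart in $z$ are already codewords, so rounding leaves them unchanged, whereas a fixed weight-space radius can collapse both endpoints onto a single codeword.

```python
import math
import random

MU = 255.0     # mu-law strength (hypothetical stand-in for the real compander)
DELTA = 0.125  # uniform grid step in the companded domain z


def phi(x):
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def phi_inv(z):
    return math.copysign(math.expm1(abs(z) * math.log1p(MU)) / MU, z)


def quantize(x):
    """Q = phi^{-1} o U o phi, with U rounding to the nearest multiple of DELTA."""
    return phi_inv(DELTA * round(phi(x) / DELTA))


random.seed(0)
z = [0.875, -0.25, 0.5]                      # current point, on the z-grid
r = [random.choice((-1.0, 1.0)) for _ in z]  # Rademacher direction

# Compander-aligned stencil: step exactly one grid cell in z, then map back.
x_plus = [phi_inv(zi + DELTA * ri) for zi, ri in zip(z, r)]
x_minus = [phi_inv(zi - DELTA * ri) for zi, ri in zip(z, r)]
aligned_residual = max(abs(quantize(x) - x) for x in x_plus + x_minus)

# Fixed weight-space radius: both endpoints can round to the same codeword.
x0, radius = phi_inv(0.875), 0.05
collapsed = quantize(x0 + radius) == quantize(x0 - radius)
print(aligned_residual, collapsed)
```

When the endpoints collapse, the measured loss difference is zero no matter what the true gradient is, which illustrates why a weight-space radius alone does not determine zeroth-order behavior under a nonuniform quantizer. The sketch omits the loss evaluation and the update rule in $z$.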

25. Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model Training

ArXiv ID: 2605.09994

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Ting Sun, Junjie Zhang, Xiao Yan, Songxin Zhang, Zhuoyang Song, Jingyi Xi, Zunyao Mao, Bingyi Jing, Jiaxing Zhang, Zejian Xie

Abstract: Modern Large Foundation Model (LFM) training has transformed the data pipeline from a static ingestion layer into a dynamic component that must co-evolve with the training process. Existing systems are ill-equipped: colocated dataloaders offer no failure isolation, while message queue-based disaggregated dataloaders operate on a record/offset abstraction that cannot express the batch-level semantics required by distributed training. We present Lakestream, a brokerless, object-store-native training data plane with three key properties. First, it introduces the Transactional Global Batch (TGB), which builds on lakehouse-style ACID storage semantics and extends them with training-specific consistency, including atomic all-rank batch visibility, a globally ordered step sequence, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery. Second, it realizes recovery and retention directly in the storage layer, by inlining producer state in the manifest and tying reclamation to distributed checkpoint state. Third, its Decentralized Adaptive Commit (DAC) algorithm sustains stable ingestion throughput as the manifest grows, without any inter-producer communication. Evaluations on large-scale multimodal pre-training and SFT workloads using 64 GPUs show that Lakestream outperforms colocated dataloader throughput while providing full failure isolation, outperforms Apache Kafka in ingestion throughput, and achieves lower consumer read latency than Kafka.

Comment: Designs a brokerless training data plane with transactional global batches and exactly-once recovery semantics tailored to distributed foundation-model training.

Topic Match: This is not routine infrastructure; it proposes new batch-consistency and recovery semantics that materially affect large-scale training behavior.

Relevance: 8 Novelty: 8

26. Adversary-Robust Learning from Fully Asynchronous Directional Derivative Estimates

ArXiv ID: 2605.09337

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Anik Kumar Paul, Nibedita Roy, Nagesh Talagani, Swetha Ganesh, Gugan Thoppe, Alexandre Reiffers-Masson

Abstract: We propose FAR-SIGN (Fully Asynchronous Robust optimization via SIGNed directional projections) for adversary-resilient learning in parameter-server--worker systems. FAR-SIGN achieves robustness through sign-based updates along carefully designed directions and mitigates the resulting bias via a two-timescale mechanism. It admits both first-order and zeroth-order implementations and enables fully asynchronous execution without requiring a private reference dataset at the server. We establish almost-sure convergence of FAR-SIGN to the set of stationary points for smooth, nonconvex objectives. Moreover, we prove the near-optimal rate of $O(n^{-1/4+\epsilon})$ in the first-order setting and the standard $O(n^{-1/6+\epsilon})$ in the zeroth-order setting, where $n$ is the iteration count and $\epsilon>0$ can be chosen arbitrarily small. Experiments on MNIST show that FAR-SIGN outperforms robust aggregation-based methods in both accuracy and wall-clock time.

Comment: Proposes an adversary-robust fully asynchronous distributed optimization method with convergence guarantees and no private server reference data.

Topic Match: Its main contribution is a new distributed training algorithm with robustness and async execution guarantees, a strong scaling-systems fit.

Relevance: 8 Novelty: 8

27. Core-Halo Decomposition: Decentralizing Large-Scale Fixed-Point Problems

ArXiv ID: 2605.08681

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Haixiang, Yang Xu, Jiefu Zhang, Xudong Wu, Zihan Zhou, Jun He, Jiayu Chen

Abstract: We study solving large-scale fixed-point equations $x^\star=\bar F(x^\star)$ with decomposition. Standard strict decomposition assigns each agent a disjoint block and evaluates updates using only owned coordinates. For most operators, however, a block update may depend on variables outside the block. Truncating these dependencies by strict decomposition changes the mean operator and creates structural bias that cannot be removed by more samples, smaller stepsizes, or additional consensus. We therefore propose Core-Halo decomposition, which separates write ownership from read-only evaluation context: each agent updates its own core and reads from an overlapping halo. By aligning the Core-Halo decomposition with the block-dependence structure of $\bar F$, the original fixed-point problem can be implemented faithfully in a decentralized multi-agent system. We further characterize the fundamental obstruction faced by strict decomposition through a Bellman closure condition and a blockwise bias lower bound, showing that local-only updates can alter the original fixed-point operator. Finally, we conduct extensive experiments across a range of application settings, and demonstrate that Core-Halo achieves near-centralized performance while retaining the parallelism benefits of decentralization.

Comment: Separates write ownership from read context via overlapping halos, avoiding structural bias in decentralized fixed-point updates.

Topic Match: This is a foundational distributed computation paper about decomposition structure and faithful large-scale fixed-point solving.

Relevance: 8 Novelty: 8
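
The structural bias described in the abstract shows up even in a tiny deterministic example. The sketch below solves $x_i = 0.25\,(x_{i-1} + x_{i+1}) + b_i$ on six coordinates split between two agents. It models strict decomposition as zero-filling any coordinate an agent does not own, which is an assumption about how the truncation is done, and it uses a halo of width 1 to match the tridiagonal dependence. The function name `iterate` and all constants are illustrative.

```python
def iterate(b, blocks, halo, steps=200):
    """Synchronous block updates of x_i = 0.25 * (x_{i-1} + x_{i+1}) + b_i."""
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        new = x[:]
        for lo, hi in blocks:  # the agent owns (writes) indices lo..hi-1
            visible = range(max(0, lo - halo), min(n, hi + halo))
            for i in range(lo, hi):
                # Coordinates outside the agent's view are truncated to zero.
                left = x[i - 1] if (i - 1) in visible else 0.0
                right = x[i + 1] if (i + 1) in visible else 0.0
                new[i] = 0.25 * (left + right) + b[i]
        x = new
    return x


b = [1.0] * 6
centralized = iterate(b, [(0, 6)], halo=0)
strict = iterate(b, [(0, 3), (3, 6)], halo=0)     # owned coordinates only
core_halo = iterate(b, [(0, 3), (3, 6)], halo=1)  # read one neighbor past the core
print([round(v, 4) for v in centralized])
print([round(v, 4) for v in strict])
print([round(v, 4) for v in core_halo])
```

The strict variant converges, but to the fixed point of a different operator, so running it longer does not help; the core-halo variant reproduces the centralized iterates because each agent reads exactly the coordinates its update depends on.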

28. Function-Space ADMM for Decentralized Federated Learning: A Control Theoretic Perspective

ArXiv ID: 2605.09356

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Akihito Taya, Yuuki Nishiyama, Kaoru Sezaki

Abstract: Decentralized federated learning (FL) is a promising approach for training machine learning models on sensor networks, Internet of Things (IoT) devices, and other edge systems where no central server exists. While federated learning offers advantages such as preserving data privacy, it often suffers from non-independent and identically distributed (non-IID) data distributions across devices, which cause significant performance degradation. This issue is particularly severe when directly optimizing model parameters, because neural network training is inherently non-convex and standard convergence guarantees for convex optimization do not apply. Unlike existing decentralized FL methods that primarily operate in parameter space, we propose federated function-space alternating direction method of multipliers (FedF-ADMM). FedF-ADMM exploits the convexity of loss functionals within function space to derive alternating direction method of multipliers (ADMM)-based update directions, which are subsequently projected onto the parameter space via knowledge distillation. We further introduce a stabilization coefficient to enhance robustness under severe non-IID settings and analyze its behavior from a control-theoretic perspective by interpreting it as a proportional-integral (PI) term. Experiments under challenging non-IID scenarios, including settings where each device has data from only a single label, demonstrate that FedF-ADMM achieves faster and more stable convergence than existing decentralized FL methods, while attaining higher accuracy and better consensus among devices.

Comment: Moves decentralized federated learning into function space with ADMM updates projected back by distillation, giving a new training algorithm under severe non-IID data.

Topic Match: Best fit is efficiency and large-scale training because it proposes a new distributed optimization method that materially changes collaborative training behavior under heterogeneity.

Relevance: 8 Novelty: 8

29. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

ArXiv ID: 2605.08862

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Yuhang Xu, Kaibin Tian, Yang Tian, Zhice Yang, Yifeng Yu, Yan Li, Shengzhong Liu, Fan Wu, Guihai Chen

Abstract: Reinforcement Learning (RL) has become a cornerstone for improving the performance of Large Language Models (LLMs). However, its rollout phase constitutes a significant efficiency bottleneck, mainly arising from the long-tail bubbles across data parallel ranks, particularly in long-context scenarios where faster GPUs remain idle while waiting for stragglers. Existing solutions, such as partial rollout or asynchronous RL, mitigate these bubbles by compromising the algorithm's strict synchronous nature. Instead, we propose BubbleSpec, a novel framework that accelerates RL rollouts while strictly keeping the mathematical exactness. Instead of attempting to eliminate bubbles, BubbleSpec exploits them. We exploit the idle time windows of faster ranks to pre-generate rollout results for subsequent steps, serving as drafts for speculative decoding. Unlike prior speculative methods that rely on historical epoch similarity and warm-ups, BubbleSpec is agnostic to dataset size and provides immediate acceleration from the onset of training. Extensive evaluations demonstrate that BubbleSpec reduces decoding steps by 50% and increases rollout throughput by up to 1.8x. Critically, BubbleSpec is seamlessly compatible with various RL frameworks and strategies as it sustains the strict synchronous property of RL algorithms.

Comment: Exploits long-tail rollout bubbles with speculative drafts while preserving exact synchronous RL semantics.

Topic Match: The contribution is a systems/algorithmic efficiency method that materially improves RL training throughput without changing the underlying algorithm.

Relevance: 8 Novelty: 8

30. TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators

ArXiv ID: 2605.08231

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Chang Meng, Hanyu Wang, Yuyang Ye, Mingfei Yu, Wayne Burleson, Giovanni De Micheli

Abstract: Reducing power consumption in AI accelerators is increasingly important. Approximate computing can reduce power consumption while keeping the accuracy loss small. Since multipliers are power-hungry components in AI models, this paper focuses on synthesizing low-power approximate multipliers (AxMs). Unlike prior works that design AxMs separately from AI model training, we present TRAM, which jointly optimizes the AxM structure and AI model parameters to lower power with small accuracy loss. Experiments show that compared to state-of-the-art AxMs, TRAM achieves up to 25.05% AxM power reduction on CNNs with CIFAR-10, and reduces power by up to 27.09% on vision transformers with ImageNet.

Comment: Jointly trains approximate multiplier hardware structures with model parameters, making accelerator approximation part of the learning problem rather than a post hoc replacement.

Topic Match: The main contribution is an efficiency/compression co-design method that materially changes compute-power tradeoffs during model deployment/training.

Relevance: 8 Novelty: 8

31. Unveiling High-Probability Generalization in Decentralized SGD

ArXiv ID: 2605.10205

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Jiahuan Wang, Ping Luo, Ziqing Wen, Dongsheng Li, Tao Sun

Abstract: Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to $\mathcal{O}\left(\frac{1}{\delta \sqrt{mn}}\right)$, where $\delta$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/\delta)\right)$. This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/\delta)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning (a weaker notion than uniform stability) and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.

Comment: Derives high-probability generalization bounds for decentralized SGD using pointwise uniform stability, narrowing the gap to single-worker SGD theory.

Topic Match: Although theoretical, the contribution is about distributed training behavior and guarantees for a core large-scale optimization algorithm.

Relevance: 8 Novelty: 8
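
The gap between the two rates in the abstract is easiest to see numerically. The sketch below evaluates $1/(\delta\sqrt{mn})$ against $\log(1/\delta)/\sqrt{mn}$ with all hidden constants set to 1, which is an assumption made purely for illustration; the function names and the values of `m` and `n` are hypothetical.

```python
import math


def prior_rate(m, n, delta):
    """1 / (delta * sqrt(m n)): the dependence on delta is polynomial."""
    return 1.0 / (delta * math.sqrt(m * n))


def target_rate(m, n, delta):
    """log(1 / delta) / sqrt(m n): the dependence on delta is logarithmic."""
    return math.log(1.0 / delta) / math.sqrt(m * n)


m, n = 16, 10_000  # workers and per-worker samples (illustrative values)
for delta in (0.1, 0.01, 0.001):
    print(delta, prior_rate(m, n, delta), target_rate(m, n, delta))
```

As $\delta$ shrinks, the first expression grows like $1/\delta$ while the second grows only like $\log(1/\delta)$, so at high confidence levels the polynomial dependence quickly makes the bound uninformative.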

Representation Learning Theory and Structure (48)

1. Learnability and Competition in High-Dimensional Multi-Component ICA

ArXiv ID: 2605.08552

Primary Topic: Representation Learning Theory and Structure

Authors: Eser Ilke Genc, Samet Demir, Zafer Dogan

Abstract: Independent Component Analysis (ICA) is a foundational tool for unsupervised representation learning, yet its high-dimensional theory remains largely limited to single-component recovery. We develop an asymptotically exact mean-field theory for multi-component online ICA, capturing the coupling induced by simultaneous learning and orthogonalization. In the high-dimensional limit, the joint empirical distribution of learned estimates and ground-truth components converges to a deterministic process, yielding a closed ODE system for the overlap matrix between learned directions and true components. This characterization reveals a genuinely multi-component, initialization-driven phase structure: a decoupled regime, where estimates align with distinct components and evolve nearly independently, and a competition regime, where overlapping initializations induce orthogonality-driven conflicts, slow reorientation, and delayed convergence. Our steady-state analysis gives explicit learnability boundaries and competition conditions linking step size, data moments, and initialization. These conditions show that larger higher-order moments and competition shrink the stable learning-rate window, increase convergence times, and predict a staircase phenomenon in which the number of recoverable components changes discretely with the learning rate. Experiments on synthetic data and hyperspectral remote sensing data validate the predicted trajectories and phase behavior.

Comment: Asymptotically exact mean-field theory for multi-component online ICA, exposing competition-driven learnability phases and explicit recovery boundaries.

Topic Match: This is directly about mechanistic theory of learned representations in high-dimensional ICA, with explicit dynamics and identifiability-style phase analysis.

Relevance: 10 Novelty: 8

2. fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

ArXiv ID: 2605.09438

Primary Topic: Representation Learning Theory and Structure

Authors: Andreas D. Demou, Panagiotis Koromilas, James Oldfield, Yannis Panagakis, Mihalis A. Nicolaou

Abstract: Many features in pretrained Transformers span multiple layers: they emerge through stages of inference, persist in the residual stream, or are built jointly by parallel MLPs. Crosscoders (namely, sparse dictionaries trained jointly across layers) aim to recover these cross-layer features in a single shared latent space. We show that standard crosscoders largely fail at this purpose. Although their decoder weight norms spread evenly across layers, a functional coherence metric we introduce reveals that each latent's activation is effectively driven by only one or two layers on average. While functionally coherent latents act as human-interpretable concept detectors (e.g., US states and cities), the layer-localized latents that crosscoders predominantly learn collapse onto surface-level patterns such as digit detectors. We trace this failure to two structural limitations: unconstrained cross-layer parameterization and unregularized cross-layer dependence. We address both by introducing fmxcoders, which (i) replace the encoder and decoder with low-rank tensor factorizations that draw every latent's per-layer weights from a shared cross-layer basis, and (ii) apply stochastic layer masking, a denoising regularizer along the layer axis that penalizes latents whose contribution collapses when a single layer is masked. Across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B, fmxcoders lift mean probing F1 by 10-30 points, surpassing per-layer SAE baselines that standard crosscoders fail to reach, reduce reconstruction MSE by 25-50%, and roughly double mean functional coherence. An LLM-as-a-judge evaluation further shows that fmxcoders recover 3-13$\times$ more semantically coherent latents than standard crosscoders across all four base LLMs.

Comment: Shows why standard cross-layer crosscoders fail and introduces factorized masked crosscoders to recover genuinely cross-layer features.

Topic Match: The paper is directly about discovering and characterizing cross-layer learned features, which is a strong match to representation structure.

Relevance: 10 Novelty: 8

ArXiv ID: 2605.08764

Primary Topic: Representation Learning Theory and Structure

Authors: Nikhil J. Dhinagar, Vidhi Chhatbar, Chirag Jagad, Pavithra Senthilkumar, Sophia I. Thomopoulos, Mahir H. Khan, Sook-Lei Liew, the ENIGMA-Stroke Recovery Working Group, Paul M. Thompson

Abstract: Deep vision models degrade sharply in low-data regimes, particularly in medical imaging where labeled samples are scarce. We show this arises not merely from overfitting but from a geometric failure: finite-sample noise corrupts the embedding covariance, collapsing the eigengap and limiting the number of recoverable signal-bearing modes. We develop a spectral theory of finite-sample representation learning that quantifies the recoverable dimension K(N), the number of eigenmodes that can be stably estimated from N samples. Using perturbation theory and concentration bounds, we show that only modes with eigenvalues above the noise floor $\|\hat{\Sigma} - \Sigma\|_{\mathrm{op}} \sim \sqrt{D/N}$ are reliable, yielding a truncated Mahalanobis energy that governs classification performance. Under a power-law spectral model, this energy can be approximated by a truncated Riemann zeta function, linking eigenvalue decay to data efficiency and AUC. Within this framework, multimodal learning acts as spectral stabilization: vision-language models impose low-rank constraints that suppress noise-dominated directions and preserve the eigengap, increasing K(N) under data scarcity. Across MNIST and multi-disease neuroimaging, we show that multimodal training maintains more stable modes and improves class separation, even when unimodal models achieve comparable few-shot accuracy. These results identify spectral collapse as a fundamental bottleneck in low-data learning. We use truncated Mahalanobis energy and K(N) to diagnose encoder quality, and introduce zeta-based spectral filtering as a principled approach to improve data efficiency.

Comment: Develops a spectral theory for low-data representation learning, tying recoverable dimension and eigengap collapse to few-shot performance and multimodal stabilization.

Topic Match: The core contribution is theoretical structure of learned embeddings under finite-sample noise, not the application domain.

Relevance: 9 Novelty: 8
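
As a rough illustration of the recoverable dimension K(N) described in the abstract above, the sketch below counts the eigenvalues of a power-law spectrum that sit above a noise floor of order sqrt(D/N), then evaluates a truncated zeta sum over those modes. The constant `c`, the spectrum `k^(-alpha)`, and the choice of zeta exponent `s = alpha` are assumptions made for illustration; the abstract does not state them, and this is not the authors' estimator.

```python
import math


def recoverable_dimension(eigenvalues, dim, n_samples, c=1.0):
    """Count eigenvalues above the assumed noise floor c * sqrt(dim / n_samples)."""
    floor = c * math.sqrt(dim / n_samples)
    return sum(1 for lam in eigenvalues if lam > floor)


def truncated_zeta(s, k_max):
    """Partial sum of the zeta series: sum over k = 1..k_max of k^(-s)."""
    return sum(k ** (-s) for k in range(1, k_max + 1))


# Hypothetical power-law spectrum lambda_k = k^(-alpha) in D = 100 dimensions.
alpha, dim = 2.0, 100
spectrum = [k ** (-alpha) for k in range(1, dim + 1)]

# More samples lower the noise floor, so more modes become recoverable.
k_small = recoverable_dimension(spectrum, dim, n_samples=10_000)
k_large = recoverable_dimension(spectrum, dim, n_samples=250_000)

# Truncated zeta sums over the recoverable modes (exponent choice is an assumption).
energy_small = truncated_zeta(alpha, k_small)
energy_large = truncated_zeta(alpha, k_large)
```

In this toy setting, raising N from 10,000 to 250,000 lowers the floor from about 0.1 to 0.02, so both the count of recoverable modes and the truncated sum increase.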

4. Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

ArXiv ID: 2605.08740

Primary Topic: Representation Learning Theory and Structure

Authors: Nilesh Sarkar, Dawar Jyoti Deka

Abstract: Sparse autoencoders (SAEs) decompose transformer residual streams into interpretable feature dictionaries, yet the relationship between SAE width and causal influence on model output has not been systematically characterised. We introduce causal dimensionality kappa(L, M, T), defined as the effective rank of the expected Jacobian outer product at layer L, and show it can be estimated via the SAE width sweep paired with attribution patching. Across seven SAE widths from 16,384 to 1,048,576 features on Gemma-2-2B layer 12, representational capacity grows 15.6x while causal capacity grows only 4.35x: a robust separation we term the representational-causal wedge. A saturating fit yields kappa-hat approximately 1,990 with kappa-hat / d_model = 0.86 and participation-ratio lower bound kappa_PR approximately 280. Crucially, kappa is invariant to model scaling: Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions). Across eight network depths kappa is constant while the absolute attribution threshold drops 20x from layer 1 to layer 23. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, and a four-cell encoder/decoder ablation) pin down what kappa measures and what it does not. Our findings establish kappa as a measurable, model-intrinsic property of transformer layers: sub-linearly recoverable by SAE width, invariant to model scaling, and structured across network depth.

Comment: Introduces causal dimensionality as a measurable property of transformer layers, separating recoverable representational width from output-relevant causal capacity.

Topic Match: This is a direct probe of representation structure and causal feature organization in transformers using SAE width sweeps and attribution patching.

Relevance: 9 Novelty: 8
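
The abstract above refers to an effective rank and a participation-ratio lower bound. A minimal sketch of both quantities on an eigenvalue spectrum follows. The entropy-based effective rank used here is one common definition and is an assumption; the abstract does not say which definition the paper uses. With this definition the participation ratio never exceeds the effective rank, because it is the exponential of the order-2 Rényi entropy, which is at most the Shannon entropy.

```python
import math


def participation_ratio(eigs):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigs)
    return total * total / sum(e * e for e in eigs)


def entropy_effective_rank(eigs):
    """exp of the Shannon entropy of the normalized spectrum (assumed definition)."""
    total = sum(eigs)
    probs = [e / total for e in eigs if e > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Toy spectra, not taken from the paper.
flat = [1.0] * 8            # eight equally important directions
skewed = [4.0, 1.0, 1.0]    # one dominant direction

pr_flat = participation_ratio(flat)
er_flat = entropy_effective_rank(flat)
pr_skewed = participation_ratio(skewed)
er_skewed = entropy_effective_rank(skewed)
```

For a flat spectrum both measures equal the number of directions; for a skewed one the participation ratio is the more conservative of the two.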

5. The two clocks and the innovation window: When and how generative models learn rules

ArXiv ID: 2605.10019

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Binxu Wang, Emma Lucia Byrnes Finn, Bingbin Liu

Abstract: Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $\tau_{\mathrm{rule}}$, the step at which generations first become rule-valid, and $\tau_{\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $\tau_{\mathrm{rule}}$ and $\tau_{\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $\tau_{\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $\tau_{\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the innovation window as the interval $[\tau_{\mathrm{rule}}, \tau_{\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $\tau_{\mathrm{rule}} \geq \tau_{\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $\tau_{\mathrm{rule}}$, while training samples' basins begin to dominate around $\tau_{\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.

Comment: Characterizes when generative models learn rules before memorizing data via the two-clock framework and innovation window.

Topic Match: The paper is centrally about training dynamics and emergent rule structure in learned representations, with a predictive account of generalization vs memorization.

Relevance: 9 Novelty: 8
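
To make the two clocks in the abstract above concrete, the sketch below reads tau_rule and tau_mem off logged training curves and returns the innovation window when it is non-empty. The thresholds (0.9 for rule validity, 0.1 for memorization) and the curve values are made up for illustration; the abstract does not say how the paper operationalizes either clock.

```python
def first_step_reaching(curve, threshold):
    """Return the first training step whose value is >= threshold, else None."""
    for step, value in curve:
        if value >= threshold:
            return step
    return None


def innovation_window(rule_valid_curve, memorized_curve,
                      rule_threshold=0.9, mem_threshold=0.1):
    """Return (tau_rule, tau_mem) if tau_rule < tau_mem, else None.

    Both clocks must be observed within the logged run; thresholds are assumptions.
    """
    tau_rule = first_step_reaching(rule_valid_curve, rule_threshold)
    tau_mem = first_step_reaching(memorized_curve, mem_threshold)
    if tau_rule is None or tau_mem is None or tau_rule >= tau_mem:
        return None
    return (tau_rule, tau_mem)


# Hypothetical logs: (step, fraction of samples) pairs.
memorized = [(100, 0.00), (200, 0.00), (400, 0.02), (800, 0.05), (1600, 0.30), (3200, 0.80)]
easy_rule = [(100, 0.50), (200, 0.70), (400, 0.95), (800, 0.99), (1600, 0.99), (3200, 0.99)]
hard_rule = [(100, 0.50), (200, 0.50), (400, 0.55), (800, 0.60), (1600, 0.80), (3200, 0.95)]

window_easy = innovation_window(easy_rule, memorized)
window_hard = innovation_window(hard_rule, memorized)
```

With these toy curves the easy rule has a window between steps 400 and 1600, while the hard rule becomes valid only after memorization has started, so its window vanishes.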

6. Bilinear autoencoders find interpretable manifolds

ArXiv ID: 2605.08891

Primary Topic: Representation Learning Theory and Structure

Authors: Thomas Dooms, Ward Gauderis, Geraint Wiggins, Jose Oramas

Abstract: Sparse autoencoders have become a standard tool for uncovering interpretable latent representations in neural networks. Yet salient concepts often span manifolds that current linear methods cannot capture without post hoc analysis. This paper uses quadratic latents to close this gap: we implement these with bilinear autoencoders, which decompose activations into low-rank quadratic forms, compose linearly in weight space, and admit input-independent geometric analysis. This qualitative difference in what concepts quadratic latents can detect challenges the standard linear representation hypothesis. Our experiments and visualisations show that multi-dimensional geometries are highly prevalent and that composite latents capture them well, systematically improving reconstruction error in language models. Furthermore, we show that autoencoders with varying geometric priors recover the same input subspace despite their dictionary entries being distinct. Practically, these models serve as an unsupervised tool for manifold discovery, which we demonstrate through an interactive online visualizer for Qwen 3.5. This is a step toward nonlinear but mathematically tractable latent representations whose composition is expressive and interpretable by design.

Comment: Shows bilinear autoencoders recover interpretable manifold-valued features beyond linear sparse autoencoders.

Topic Match: The paper directly targets the geometry and identifiability of learned features, extending representation analysis beyond linear concept dictionaries.

Relevance: 9 Novelty: 8
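
A small sketch may help show why quadratic latents, as described in the abstract above, can respond to a multi-dimensional feature that a single linear direction cannot. Here a latent is a rank-one quadratic form (l . x)(r . x), and a composite latent is a sum of such forms. The specific vectors, the circular toy data, and the reading of "composite latent" as a plain sum are assumptions for illustration, not the authors' architecture.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bilinear_latent(x, left, right):
    """Rank-one quadratic form (left . x) * (right . x)."""
    return dot(left, x) * dot(right, x)


def composite_latent(x, pairs):
    """Sum of rank-one quadratic forms; a low-rank quadratic form overall."""
    return sum(bilinear_latent(x, left, right) for left, right in pairs)


# Hypothetical activations lying on a unit circle in the first two coordinates.
e1, e2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
pairs = [(e1, e1), (e2, e2)]  # computes x_1^2 + x_2^2
angles = [0.0, 0.7, 2.1, 4.0]
points = [[math.cos(a), math.sin(a), 0.5] for a in angles]

quadratic_scores = [composite_latent(p, pairs) for p in points]
linear_scores = [dot(e1, p) for p in points]
```

The quadratic score is the same at every angle on the circle, whereas the linear score along `e1` swings between positive and negative values, so no single threshold on it would detect the whole manifold.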

7. The Geometric Structure of Models Learning Sparse Data

ArXiv ID: 2605.08464

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Thomas Walker, T. Mitchell Roddenberry, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

Abstract: The manifold hypothesis (MH) is often used to explain how machine learning can overcome the curse of dimensionality. However, the MH is only applicable in regimes where the training data provides a sufficiently dense sample of the underlying low-dimensional data manifold, or where such a low-dimensional manifold is conceivably present. We describe the regimes where the MH is not applicable as sparse. In this paper, we demonstrate that models succeed in the sparse regime by exploiting a highly structured local geometry, a property we formalize as normal alignment. We prove that normal-aligned classifiers -- whose input-output Jacobians are rank-one and align perfectly with the training data -- minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint. For continuous piecewise-affine deep networks, normal alignment manifests geometrically as centroid alignment within the network's induced power diagram partition and results from the feature-learning regime. Motivated by these theoretical insights, we introduce GrokAlign, a regularization strategy that actively induces normal alignment. We demonstrate that GrokAlign significantly accelerates the training dynamics of deep networks relevant to the grokking phenomenon. Furthermore, we apply the principle of normal alignment to Recursive Feature Machines (RFMs) to introduce Recursive Feature Alignment Machines (RFAMs). We show that RFAMs exhibit greater adversarial robustness compared to RFMs when trained on tabular data.

Comment: Introduces normal alignment as a geometric principle explaining learning in sparse-data regimes and connects it to grokking and robustness.

Topic Match: The paper is centered on theory of learned geometry and feature formation when manifold assumptions fail.

Relevance: 9 Novelty: 8

8. Beyond Language: Format-Agnostic Reasoning Subspaces in Large Language Models

ArXiv ID: 2605.09496

Primary Topic: Representation Learning Theory and Structure

Authors: Aojie Yuan, Zhiyuan Su

Abstract: Large language models represent the same reasoning in vastly different surface forms -- English prose, Python code, mathematical notation -- yet whether they share a common internal substrate across these symbolic systems remains unknown. We introduce the TriForm Benchmark (18 concepts x 6 forms x 3 instances = 324 stimuli) and study five LLMs (1.6B-8B) across three architecture families. Using permutation-corrected RSA, cross-form probing, and activation patching, we find converging evidence for a Format-Agnostic Reasoning Subspace (FARS) in middle layers. We make FARS concrete: concept-centroid PCA extracts a 10-dimensional subspace that amplifies concept structure 3x while suppressing form information to near zero. Replacing only these 10 dimensions during cross-form patching preserves 90-96% of model output -- far exceeding both full activation replacement (44-56%) and variance-maximizing PCA (60-74%) -- while ablating them causes targeted disruption. FARS generalizes to held-out concepts and converges across architectures (CCA > 0.79 for all model pairs), providing within-modality evidence for the Platonic Representation Hypothesis. We further discover a declarative-procedural asymmetry: representations are far more compatible between prose and mathematics than between either and code, suggesting that the critical axis of divergence is not linguistic vs. formal but declarative vs. procedural.

Comment: Identifies a low-dimensional format-agnostic reasoning subspace shared across prose, math, and code forms.

Topic Match: The paper is primarily about internal representation geometry and cross-format invariances in reasoning representations.

Relevance: 9 Novelty: 8

9. The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

ArXiv ID: 2605.08746

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: James Hazelden, Laura Driscoll, Eli Shlizerman, Eric Shea-Brown

Abstract: In training a neural network with gradient descent (GD), each iteration induces a linear operator that governs first-order updates to a model's internal state variables. We define this operator as the Global Empirical Neural Tangent Kernel (NTK). In finite-width networks, the NTK is typically intractable to form, leading prior work to focus on restrictive settings such as tracking outputs only or taking infinite-width limits. Here, we study the structure of the NTK for a range of models. Formulating the model state as the solution to a single global implicit constraint, we derive the NTK as a product of two operators: K, accounting for immediate parameter-to-state interactions, and P, describing internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and transformers, we prove a universal Kronecker-core theorem showing that K admits an exact, computable form given by the Gram matrix of weight-site variables. This core structure reveals that the NTK is structurally bottlenecked, constraining its effective rank and giving rise to a self-referential bias whereby GD preferentially learns within dominant modes of joint hidden and input activity. For recurrent models, we examine the spectrum of the NTK and show when it is biased and low-rank in space or time under the proposed decomposition. We further demonstrate that model dynamics at initialization bias the NTK, restricting learning and preventing task components from being learned effectively. Finally, we show that the NTK associated with a self-attention transformer is likewise structurally constrained to be low-rank. Overall, we show that the NTK possesses tractable structure that explains GD bias toward task solutions and the emergence of low-rank representations. To enable use of the NTK as a practical metric, we build kpflow, a library relying on randomized matrix-free numerical linear algebra.

Comment: Derives tractable finite-width global empirical NTK structure showing low-rank bottlenecks and self-referential bias in GD learning for RNNs and transformers.

Topic Match: Best fit is representation structure because the paper explains how kernel structure constrains learned modes and low-rank representations, though it is also deeply about optimization dynamics.

Relevance: 9 Novelty: 8

10. Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

ArXiv ID: 2605.10395

Primary Topic: Representation Learning Theory and Structure

Authors: Minh-Toan Nguyen, Jean Barbier

Abstract: We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of effective width $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $n^{1/(2\beta)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $\beta>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$. We further show empirically that a student trained with Adam near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.

Comment: Characterizes sharp feature-learning phase transitions and Bayes-optimal scaling laws in extensive-width teacher-student networks.

Topic Match: Primary fit is representation structure because the main result is theoretical understanding of when hierarchical features become learnable and how that governs scaling behavior.

Relevance: 9 Novelty: 8

11. Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

ArXiv ID: 2605.09967

Primary Topic: Representation Learning Theory and Structure

Authors: Andrew Lee, Fernanda Viégas, Martin Wattenberg

Abstract: While researchers are finding concepts represented as linear directions in language models, a bag of linear directions fails to capture relational structure. To better understand this dichotomy, we study a model with known linear representations, but trained in a highly structured domain -- the board game Othello. While the model's internal board-state representation is linearly decodable, we find additional structure in the form of tensor product representations (TPRs). We train TPR probes to recover shared structure amongst the linear probes, yielding a factorization into square-embeddings, color-embeddings, and a binding matrix that composes them to construct the model's board-state representation. We find geometric signatures within the weights of our TPR probe that align with the structure of the board, but perhaps more importantly, that the linear probes can be recovered directly from the parameters of our TPR probe. Our findings suggest that directional representations may be projections of more structured underlying representations.

Comment: Uses tensor-product-representation probes to show linear directions can arise as projections of more structured internal representations.

Topic Match: Best fit is representation structure because the core contribution is uncovering compositional structure beneath linearly decodable features.

Relevance: 9 Novelty: 8

12. Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

ArXiv ID: 2605.09724

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Yiding Song, Hanming Ye

Abstract: Existing accounts of grokking explain the phenomenon in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed $T_{\text{mem}}(P)$ and a generalisation speed $T_{\text{gen}}(P)$, both of which are functions of model parameter count $P$. Adapting the information capacity framework of Morris et al. (2025), we estimate $T_{\text{mem}}(P)$ on random-label data of equivalent complexity and $T_{\text{gen}}(P)$ on the modular task itself, and show that grokking emerges close to the parameter scale where these timescales intersect. The framework also suggests an empirical model for predicting memorisation speed given model capacity and dataset complexity, recovering the previously reported empirical observation that larger models memorise faster. Overall, we motivate the formalisation of different learning timescales as important abstractions to study when explaining how model capacity shapes grokking on algorithmic tasks.

Comment: Gives an information-theoretic account of grokking through competing memorization and generalization timescales that vary with model capacity.

Topic Match: The contribution is a foundational theory of how learned representations and generalization behavior emerge as capacity changes.

Relevance: 9 Novelty: 8

13. Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

ArXiv ID: 2605.09239

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Sohan Venkatesh

Abstract: Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly-encoded count with a fixed wrong answer at roughly 88--93% network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated digit-tokens. It is suppressed by comma-separated delimiters in larger models but persists in smaller ones. The finding holds across Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing, not of representation, and the two require different interventions.

Comment: Demonstrates that repeated-token counting failures come from late MLP overwrite of correctly encoded counts, revealing a representation-output dissociation.

Topic Match: Its main value is mechanistic understanding of what information is represented versus what is expressed at output, so representation structure is the best fit.

Relevance: 9 Novelty: 8

14. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

ArXiv ID: 2605.09195

Primary Topic: Representation Learning Theory and Structure

Also Matches: Memory Structures and Agent Memory Systems

Authors: Rania Elbadry, Ahmed Heakl, Fan Zhang, Dani Bouch, Yuxia Wang, Preslav Nakov, Zhuohan Xie

Abstract: Large language models confidently produce outdated answers, and no existing method can detect them. We show this is not an engineering failure but a structural one: temporal drift, whether a stored fact has changed since training, is encoded as a direction in the residual stream geometrically orthogonal to both correctness and uncertainty. Any method operating on correctness or uncertainty signals is therefore blind to drift by construction. We verify this across six instruction-tuned models. A linear probe trained directly on drift labels achieves AUROC $0.83$--$0.95$; methods based on token entropy, semantic entropy, CCS, and SAPLMA all remain near chance ($0.49$--$0.57$). Five tests confirm the geometric orthogonality: weight cosines ($|\cos| \leq 0.14$), score correlations ($|r| \leq 0.20$), bidirectional null-space projection ($|\Delta| \leq 0.008$), iterative null-space projection with $k{=}10$, and difference-of-means dissociation. Mechanistically, the MLP retrieval circuit produces identical dynamics for stale recall and confabulation ($r > 0.81$, six models), explaining why output confidence cannot separate them. A cross-cutoff experiment holds inputs constant and varies only the model: the probe fires on the model whose training predates the fact's transition and stays silent otherwise ($P(A{>}B) = 0.975$--$0.998$, twelve model pairs), confirming it reads model-internal knowledge state rather than input properties. Our code and datasets will be publicly released.

Comment: Identifies temporal knowledge drift as a representation direction orthogonal to correctness and uncertainty in residual space.

Topic Match: The paper's central claim is geometric structure in model representations, with forgetting/drift analyzed as a distinct internal axis.

Relevance: 9 Novelty: 8
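
Two of the orthogonality checks named in the abstract above, weight cosines and null-space projection, can be sketched on toy vectors. The probe weights and the activation below are invented for illustration; the sketch only shows the mechanics of the checks, not the paper's measurements or its exact protocol.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def project_out(x, w):
    """Remove the component of x along w (projection onto the null space of w)."""
    scale = dot(x, w) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]


# Hypothetical probe directions in a 3-d residual stream.
w_correct = [1.0, 0.0, 0.0]   # assumed correctness probe
w_drift = [0.1, 1.0, 0.0]     # assumed drift probe, nearly orthogonal to it
x = [2.0, 3.0, -1.0]          # one toy activation

cos = cosine(w_correct, w_drift)

# Erase the correctness direction, then re-read the drift probe score.
score_before = dot(x, w_drift)
x_removed = project_out(x, w_correct)
score_after = dot(x_removed, w_drift)
delta = score_after - score_before
```

When the two directions are close to orthogonal, erasing one changes the other probe's score only slightly, which is the intuition behind using null-space projection as a dissociation test.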

15. The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently

ArXiv ID: 2605.10237

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Elisabetta Cornacchia, Dan Mikulincer, Elchanan Mossel

Abstract: We study how temporal correlations in the data can make certain sparse learning problems efficiently learnable by gradient-based methods. Our focus is on Boolean k-juntas, a canonical sparse learning problem known to pose barriers for gradient-based methods under independent uniform samples. We show that this picture changes when the samples are generated by a lazy random walk on the hypercube. In this setting, the temporal dependencies can be exploited by a two-layer ReLU network trained using stylized-SGD with a temporal-difference loss, which compares target and predicted increments across consecutive samples. For every fixed k, the resulting sample complexity is essentially linear in the ambient dimension d. By contrast, we show that for large-batch gradient methods using standard convex pointwise losses, temporal correlations do not provide the same advantage.

Comment: Proves that temporal correlations from random walks make SGD with temporal-difference loss efficiently learn sparse juntas where iid samples do not.

Topic Match: The paper is a foundational learning-theory result about how data structure changes what gradient methods can learn.

Relevance: 9 Novelty: 8
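
The intuition behind the abstract above, that consecutive samples from a random walk expose which coordinates matter, can be shown with a toy experiment. The sketch below is not the paper's stylized-SGD procedure: it simply compares consecutive samples of a lazy walk and records which flipped coordinate coincided with a label change. The laziness of 1/2, the parity target, and all function names are assumptions for illustration.

```python
import random


def lazy_walk_samples(d, steps, rng):
    """Lazy random walk on {0,1}^d: stay with prob. 1/2, else flip one uniform coordinate."""
    x = [rng.randint(0, 1) for _ in range(d)]
    samples = [tuple(x)]
    for _ in range(steps):
        if rng.random() < 0.5:
            x[rng.randrange(d)] ^= 1
        samples.append(tuple(x))
    return samples


def coordinates_with_label_change(samples, label):
    """Collect coordinates whose flip between consecutive samples changed the label."""
    found = set()
    for prev, cur in zip(samples, samples[1:]):
        if label(prev) != label(cur):
            found.update(i for i, (a, b) in enumerate(zip(prev, cur)) if a != b)
    return found


# Hypothetical 2-junta: parity of coordinates 1 and 4 in dimension d = 10.
relevant = (1, 4)


def parity_label(x):
    return sum(x[i] for i in relevant) % 2


samples = lazy_walk_samples(d=10, steps=2000, rng=random.Random(0))
recovered = coordinates_with_label_change(samples, parity_label)
```

Flipping an irrelevant coordinate never changes the label, so only relevant coordinates are ever recorded. With independent uniform samples there are no such single-coordinate increments to compare.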

16. SMIXAE: Towards Unsupervised Manifold Discovery in Language Models

ArXiv ID: 2605.09224

Primary Topic: Representation Learning Theory and Structure

Authors: Collin Francel

Abstract: Sparse autoencoders (SAEs) have been used widely to decompose and interpret neural network activations, especially those of transformer language models. One key issue with SAEs is their inability to directly model multidimensional features. Instead, SAEs may tile such features by a set of independent directions that must be grouped together after the SAE training phase, impeding discoverability and interpretation of learned feature representations. We begin to address this issue by introducing the Sparse MIXture of Autoencoders (SMIXAE) architecture. Empirically, we provide evidence that SMIXAE models have success both in directly learning previously identified manifold structures, as well as finding novel structures, within the open source Gemma 2 2B and 9B models. Finally, we discuss several limitations and point towards areas for future work.

Comment: Introduces a sparse mixture of autoencoders to directly capture multidimensional/manifold features in LM activations, addressing a core limitation of standard SAEs.

Topic Match: The paper is centrally about the structure and interpretability of learned representations, specifically manifold discovery in language-model activations.

Relevance: 9 Novelty: 8

17. The Polynomial Counting Capabilities of Message Passing Neural Networks

ArXiv ID: 2605.10393

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Marco Sälzer, Pascal Bergsträßer, Anthony W. Lin

Abstract: The counting power of Message Passing Neural Networks (MPNN) has been the subject of many recent papers, showing that they can express logic that involves counting up to a threshold or more generally satisfy a linear arithmetic constraint. In this paper, we study the counting capabilities of MPNN beyond linear arithmetic, primarily utilising local and global mean aggregations. In particular, our goal is to tease out conditions required to express extensions of graded modal logic with polynomial counting constraints. We show that global polynomial counting constraints in node-labelled graphs can be checked using mean MPNN under mild assumptions. Checking local constraints is also possible, if we consider formulas with no nested modalities and additionally either (i) permit sum/max aggregations, or (ii) only restrict to regular graphs. We also show how formulas with nested modalities can be captured by mean MPNN over graphs with tree-like structures and similar assumptions.

Comment: Characterizes when mean-aggregation MPNNs can express polynomial counting constraints beyond linear arithmetic.

Topic Match: This is a theory paper on representational expressivity and counting structure in message passing networks.

Relevance: 9 Novelty: 8

18. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

ArXiv ID: 2605.09352

Primary Topic: Representation Learning Theory and Structure

Authors: Zhaoyang Zhang, Run Shao, Dongyue Wu, Jiajie Teng, Chao Tao, Jingdong Chen, Haifeng Li

Abstract: Understanding why independently trained neural networks from different modalities converge toward shared representations, and where this convergence leads, remains an open question in representation learning. All existing evidence relies on symmetric similarity measures, which can detect convergence but are structurally blind to its direction. We introduce directional convergence analysis using cycle-kNN, an asymmetric alignment measure, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse, and this pattern holds across all model families and scales--yet is entirely invisible to symmetric measures. Mechanistic analysis traces the directionality to feature density asymmetry, whereby language representations occupy the most compact regions of representational space. The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.

Comment: Introduces directional convergence analysis showing multimodal representations asymmetrically move toward language-like neighborhood structure.

Topic Match: This is directly about representation geometry, convergence direction, and a mechanistic hypothesis for multimodal structure formation.

Relevance: 9 Novelty: 8

19. From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

ArXiv ID: 2605.09949

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Zehao Li, Yasuhiro Yoshikai, Shumpei Nemoto, Hiroyuki Kusuhara, Tadahaya Mizuno

Abstract: Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning stagnation is not explained by model capacity alone and instead reflects the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry support an encoder-centered mechanism in which chiral-token representations undergo transient destabilization and reconstruction, seen as a V-shaped drop and recovery in vector norm and directional stability, together with a clear reorganization of chiral molecular representations in the latent space. Encoder-decoder cross-evaluation further supports the encoder-centered nature of the transition, and targeted attention-head ablation identifies a small set of chiral-sensitive heads whose removal selectively reduces chiral-token accuracy even in the fully trained model. These findings show that SMILES translation can serve as a useful experimental system for mechanistic analysis of semantic emergence in CLMs, with implications for interpretable chemical representation learning.

Comment: Tracks the abrupt emergence of chirality in SMILES translation models via checkpoint-level analysis of attention, residual streams, and latent geometry.

Topic Match: This is a strong representation-structure match because it studies when and how a semantic feature forms inside a model and identifies mechanistic signals of that emergence.

Relevance: 9 Novelty: 8

20. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

ArXiv ID: 2605.09875

Primary Topic: Representation Learning Theory and Structure

Authors: Su-Hyeon Kim, Yo-Sub Han

Abstract: Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
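The anchor-projection idea can be pictured with a minimal sketch. It assumes, purely for illustration, that a direction's ACS coordinates are its inner products with a model's anchor activations and that reconstruction in the target model is a least-squares solve; the abstract does not specify either detail. The function names (`to_acs`, `from_acs`) and the toy "models" (random linear views of shared latent anchor features) are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def to_acs(direction, anchors):
    """Assumed projection: ACS coordinates are dot products with anchor activations."""
    return unit(anchors @ direction)

def from_acs(canonical, target_anchors):
    """Assumed reconstruction: minimum-norm least-squares solve in the target's native space."""
    direction, *_ = np.linalg.lstsq(target_anchors, canonical, rcond=1e-10)
    return unit(direction)

rng = np.random.default_rng(0)
n_anchors, d_latent = 64, 8

# Toy setup (hypothetical): each model sees the same latent anchor features
# through its own random linear map, with its own hidden dimension.
latent = rng.normal(size=(n_anchors, d_latent))
maps = {name: rng.normal(size=(d_latent, dim))
        for name, dim in [("src_a", 16), ("src_b", 20), ("target", 24)]}
anchors = {name: latent @ m for name, m in maps.items()}

# One shared latent behavioral direction, expressed in each model's native space.
latent_dir = rng.normal(size=d_latent)
native_dirs = {name: np.linalg.pinv(m) @ latent_dir for name, m in maps.items()}

# Average the source directions in ACS, then reconstruct in the held-out target.
canonical = unit(to_acs(native_dirs["src_a"], anchors["src_a"])
                 + to_acs(native_dirs["src_b"], anchors["src_b"]))
recovered = from_acs(canonical, anchors["target"])
cosine = float(recovered @ unit(native_dirs["target"]))
print(f"cosine(recovered, true target direction) = {cosine:.4f}")
```

In this idealized linear toy the target direction is recovered without ever extracting it from the target model, which mirrors the transfer claim; real models are not exact linear views of shared features, so the sketch says nothing about how tight the alignment is in practice.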

Comment: Finds transferable behavioral directions across model families by projecting hidden states into a shared anchor coordinate space, enabling cross-family steering and detection.

Topic Match: This is directly about the structure and comparability of learned representations across models, with mechanistic interpretability implications.

Relevance: 9 Novelty: 8

21. HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds

ArXiv ID: 2605.10536

Primary Topic: Representation Learning Theory and Structure

Authors: Honghan Wu, Tianyan Wang, Jiacong Mi, Zhoyang Jiang, Yunsoo Kim

Abstract: Rare semantic innovations in high-dimensional, mission-critical domains are often obscured by dense background contexts, a challenge we define as feature density conflict. We introduce the Hybrid Hierarchical SAE (HH-SAE) to resolve this by factorizing manifolds into a nested hierarchy of Contextual ($L_0$), Atomic ($f_1$), and Compository ($f_2$) tiers. Evaluating across disparate manifolds, HH-SAE demonstrates superior resolution by "fracturing" administrative clinical labels into physiological modes and achieving a peak cross-domain zero-shot AUC of 0.9156 in fraud detection. Path ablation confirms the architecture's structural necessity, revealing a 13.46% utility collapse when contextual subtraction is removed. Finally, knowledge-steered synthesis achieves a +9.9% AUPRC lift over state-of-the-art generators, proving that HH-SAE effectively prioritizes high-order mechanistic innovation over environmental proxies to enable high-precision discovery in high-stakes environments.

Comment: Introduces a hierarchical SAE that separates contextual, atomic, and compositional features to resolve dense feature interference.

Topic Match: This is a strong match because the contribution is a new sparse representation-learning structure for disentangling and steering hierarchical features.

Relevance: 9 Novelty: 8

22. The Propagation Field: A Geometric Substrate Theory of Deep Learning

ArXiv ID: 2605.08529

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Xingrui Gu

Abstract: Modern deep learning treats neural networks primarily as endpoint functions from inputs to outputs. Inspired by the shift from force to geometry in physics, we ask whether a network should instead be understood through the geometry of its internal propagation. We define a neural propagation field as the collection of hidden-state trajectories and local Jacobian operators across depth. Endpoint losses constrain only the boundary behavior of this field, leaving its interior geometry underdetermined. We show that endpoint-equivalent models can differ by orders of magnitude in trajectory and Jacobian structure, and introduce observable field metrics such as path sensitivity, solver consistency, and trajectory/Jacobian retention. In controlled teacher-flow and PDE systems, endpoint fitting fails to recover the underlying propagation law. In real multi-path tasks, field-aware objectives improve unseen-path generalization, OOD robustness, and calibration when aligned with the observation structure, but can collapse when over-constrained. In continual learning, field-preservation regularization complements replay and distillation: on Split CIFAR-100, DER++ with field preservation improves average accuracy, backward transfer, and field-retention metrics. These results identify propagation-field quality as a measurable and trainable property of neural networks beyond endpoint performance.

Comment: Reframes neural networks in terms of hidden-state trajectory and Jacobian geometry, with measurable field metrics and field-preservation training objectives.

Topic Match: The paper is mainly about understanding and shaping internal representational geometry rather than proposing a standard new architecture.

Relevance: 8 Novelty: 9

23. Neural Information Causality

ArXiv ID: 2605.09316

Primary Topic: Representation Learning Theory and Structure

Authors: Jeongho Bang, Marcin Pawłowski

Abstract: Query-separated computation forces a representation to play an operational role: data are encoded before a query is known, and a later decoder can answer only through the intermediate interface. In this regime the representation functions as a message rather than merely as a feature map. We formalize this observation by embedding information causality (IC) into representation learning, obtaining a framework called neural information causality (Neural-IC). The revised formulation separates two logically distinct statements. First, every query-separated architecture induces a random-access communication experiment and obeys the embedding inequality $I_{\mathrm{N\text{-}RAC}}\le I(\vec a:H,B)$. Second, any independently certified physical capacity bound on the interface, such as a hard $m$-bit alphabet, a finite-precision register, or a power-constrained noisy channel, implies $I_{\mathrm{N\text{-}RAC}}\le C_H$. This separation avoids treating capacity as a post hoc definition and makes Neural-IC an operational diagnostic for query leakage, precision leakage, and episode-specific memory. We also provide an exact one-bit classical RAC benchmark, showing explicitly that the relevant quantum enhancement is not total information beyond the bottleneck, but fair query-conditioned access. For CHSH-type correlation layers, nested Neural-RAC protocols multiply correlation biases across depth; requiring stability of a one-bit bottleneck for arbitrary depth selects the Tsirelson threshold. We extend the analysis to asymmetric seed biases, to multi-capacity finite-depth phase diagrams, and to correlated data via a conditional information score. Controlled simulations, including straight-through binary bottlenecks and deliberately leaky ablations, verify that apparent violations are accounted for by broken query separation or undercounted capacity.

Comment: Recasts bottlenecked representation learning as a query-separated communication problem with capacity-based diagnostics.

Topic Match: This is fundamentally about what internal representations can encode under bottleneck constraints, with a new theoretical framing.

Relevance: 8 Novelty: 9

24. Neural Weight Norm = Kolmogorov Complexity

ArXiv ID: 2605.10878

Primary Topic: Representation Learning Theory and Structure

Authors: Tiberiu Musat

Abstract: Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings: a network whose parameters encode a permutation produces a string whose Kolmogorov complexity is the non-zero parameter count times its logarithm. The fixed-precision assumption is essential: with infinite precision, neural networks can encode non-computable functions and the weight norm loses its relevance.

Comment: Argues that fixed-precision neural weight norm tracks Kolmogorov complexity up to logarithmic factors, giving a theory for why weight decay works.

Topic Match: This is a foundational theory paper about what neural parameter norms represent, connecting regularization to description complexity.

Relevance: 8 Novelty: 9

25. Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

ArXiv ID: 2605.09502

Primary Topic: Representation Learning Theory and Structure

Authors: Aojie Yuan, Zhiyuan Julian Su, Haiyue Zhang, Yi Nian, Yue Zhao

Abstract: Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC -- from the very first reasoning step (0.79) -- while verbalized confidence for wrong traces is 4.55/5, nearly identical to correct ones (4.87/5). A text-surface classifier achieves only 0.59 on the same data, confirming a 0.20-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), 1.5B-72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions -- activation steering, probe-guided best-of-N, self-correction, and activation patching -- all fail; patching destroys output coherence entirely. The signal is diagnostic, not causal: a readout of computation quality, not a lever to redirect it. This delineates a boundary for mechanistic interpretability: error representations during reasoning are fundamentally different from the factual knowledge representations that prior work has successfully edited.
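The probing setup described here can be illustrated with a minimal sketch: a logistic-regression probe fitted on hidden states and scored by AUROC on held-out examples. The hidden states below are synthetic Gaussians with an injected "correctness" direction, which is an assumption made only so the example runs; it does not reproduce the paper's data, models, or numbers.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC from the rank-sum statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_linear_probe(X, y, lr=0.5, steps=300):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

# Synthetic stand-in for hidden states: label 1 = correct trace, 0 = wrong trace.
rng = np.random.default_rng(0)
n, d = 2000, 32
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + 3.0 * np.outer(y - 0.5, direction)

# Fit on the first 1500 examples, evaluate on the remaining 500.
w, b = fit_linear_probe(X[:1500], y[:1500])
probe_auroc = auroc(X[1500:] @ w + b, y[1500:])
print(f"held-out probe AUROC: {probe_auroc:.3f}")
```

A high probe AUROC only shows that the signal is linearly readable; as the abstract stresses, that is separate from whether intervening along the probe direction changes the model's output.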

Comment: Shows hidden states linearly encode chain-of-thought error awareness far better than text does, while interventions fail to causally use that signal.

Topic Match: This is best viewed as representation-structure work because it isolates a diagnostic-but-noncausal internal signal about reasoning quality.

Relevance: 8 Novelty: 8

26. Towards Effective Theory of LLMs: A Representation Learning Approach

ArXiv ID: 2605.09294

Primary Topic: Representation Learning Theory and Structure

Authors: Muhammed Ustaomeroglu, Guannan Qu

Abstract: We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden-state trajectories using a BYOL/JEPA-style self-supervised objective, coarse-graining activations into macrovariables that preserve higher-level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal "mental-state" trajectories of reasoning, capture high-level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high-level, dynamically meaningful variables that support interpretation, prediction, and intervention.

Comment: Learns coarse-grained macrovariables from hidden-state trajectories to capture interpretable, temporally consistent computation in LLMs.

Topic Match: The paper is fundamentally about uncovering and formalizing learned representational structure in hidden-state dynamics.

Relevance: 8 Novelty: 8

27. Belief or Circuitry? Causal Evidence for In-Context Graph Learning

ArXiv ID: 2605.08405

Primary Topic: Representation Learning Theory and Structure

Authors: Katharine Kowalyshyn, Timothy Duggan, Daniel Little, Michael C Hughes

Abstract: How do LLMs learn in-context? Is it by pattern-matching recent tokens, or by inferring latent structure? We probe this question using a toy graph random-walk across two competing graph structures. This task's answer is, in principle, decidable: either the model tracks global topology, or it copies local transitions. We present two lines of evidence that neither account alone is sufficient. First, reconstructing the internal representation structure via PCA reveals that at intermediate mixture ratios, both graph topologies are encoded in orthogonal principal subspaces simultaneously. This pattern is difficult to reconcile with purely local transition copying. Second, residual-stream activation patching and graph-difference steering causally intervene on this graph-family signal: late-layer patching almost fully transfers the clean graph preference, while linear steering moves predictions in the intended direction and fails under norm-matched and label-shuffled controls. Taken together, our findings are most consistent with a dual-mechanism account in which genuine structure inference and induction circuits operate in parallel.

Comment: Provides causal evidence that in-context graph learning combines latent structure inference with induction-style local circuitry rather than either alone.

Topic Match: The main contribution is mechanistic evidence about how representations encode and use latent structure during in-context learning.

Relevance: 8 Novelty: 8

28. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

ArXiv ID: 2605.09314

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Xiangkun Sun, Lingkai Kong, Aoqi Zhang, Liang Zeng, Tonghan Wang

Abstract: Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model's answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model's choice, and removing it blocks persuasion. We then trace the feature back to a band of shallower attention heads that build it from persuasive keywords in the input. Every step is validated by intervention. This mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.
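Removing or modifying a rank-one feature can be sketched as projecting activations onto, or off, a single direction. The snippet below is a generic illustration under that assumption; the direction `v`, the activations, and the function names are hypothetical placeholders, not the paper's actual feature or intervention code.

```python
import numpy as np

def remove_feature(h, v):
    """Project out direction v from each row of h (rank-one ablation)."""
    v = v / np.linalg.norm(v)
    return h - np.outer(h @ v, v)

def steer_feature(h, v, alpha):
    """Shift each row of h by alpha along direction v (rank-one steering)."""
    v = v / np.linalg.norm(v)
    return h + alpha * v

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))   # placeholder activations: 5 tokens, 16 dimensions
v = rng.normal(size=16)        # placeholder feature direction

ablated = remove_feature(h, v)
steered = steer_feature(h, v, alpha=4.0)

v_unit = v / np.linalg.norm(v)
print("max |component along v| after removal:", np.abs(ablated @ v_unit).max())
print("shift along v after steering:", (steered @ v_unit - h @ v_unit)[0])
```

Both operations leave every direction orthogonal to `v` unchanged, which is what makes a rank-one intervention narrow enough to test a single causal route.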

Comment: Identifies a compact causal circuit for persuasion in LLMs, including a rank-one evidence-routing feature and the attention heads that construct and use it.

Topic Match: The central result is mechanistic understanding of internal representations and circuits that govern model behavior.

Relevance: 8 Novelty: 8

29. Reasoning emerges from constrained inference manifolds in large language models

ArXiv ID: 2605.08142

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Yanbiao Ma, Fei Luo, Linfeng Zhang, Chuangxin Zhao, Mingxuan Wang, Yinan Wu, Zhe Qian, Yang Lu, Long Chen, Zhao Cao, Xiaoshuai Hao, Ji-Rong Wen, Jungong Han

Abstract: Reasoning in large language models is predominantly evaluated through labeled benchmarks, conflating task performance with the quality of internal inference. Here we study reasoning as an intrinsic dynamical process by examining the evolution of internal representations during inference. We find that inference-time dynamics consistently self-organize into low-dimensional manifolds embedded within high-dimensional representation spaces. We also find that such geometric compression, although pervasive, is not sufficient for stable or reliable reasoning. Instead, effective reasoning dynamics emerge within a constrained structural regime characterized by three conditions: adequate representational expressivity, spontaneous manifold compression, and preservation of non-degenerate information volume within the compressed subspace. Models outside this regime exhibit characteristic pathological inference dynamics. Based on these insights, we introduce a unified, label-free diagnostic computed solely from internal dynamics. These findings suggest that reasoning in LLMs is fundamentally governed by geometric and informational constraints, offering a complementary framework to benchmark-centric assessment.

Comment: Argues reasoning quality is governed by constrained low-dimensional inference manifolds and proposes a label-free diagnostic from internal dynamics.

Topic Match: The paper studies the geometric structure of internal representations during inference, making representation structure the best fit.

Relevance: 8 Novelty: 8

30. What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching

ArXiv ID: 2605.08344

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Alec Helbling, Sebastian Gutierrez Hernandez, Benjamin Hoover, Duen Horng Chau, Parikshit Ram

Abstract: Recent work has shown that flow matching models can be trained without explicit time conditioning, challenging the standard view that the interpolation time is needed to disambiguate velocity targets. But why should a time-blind model work at all? Decomposing the time-blind flow matching loss, we identify two sources of irreducible error: a coupling variance, which arises from ambiguous velocity targets induced by how noise and data points are paired, and the time-blindness gap, which is the additional error caused by ignoring time. This gap shows that time-blind training is strictly harder than conventional training, reinforcing the puzzle that time-blind models work so well in practice. We resolve this tension by showing that the geometry of high-dimensional data makes time identifiable directly from noisy observations. When data concentrates near a $k$-dimensional subspace, time can be recovered from the statistical structure of noisy interpolants in directions orthogonal to the data; under a spiked-covariance model, this yields a closed-form estimator that recovers $t$ from a single observation $z$ at rate $O(1/\sqrt{d-k})$ for ambient dimension $d$. As a consequence, we prove that the time-blindness gap is asymptotically negligible relative to the coupling variance. We empirically demonstrate our identifiability result on real-world data and show that changing the coupling has a much larger effect on loss and sample quality than removing time conditioning across CIFAR-10, CelebA-HQ, and FFHQ. These results explain why time-blind flow matching works and show that the main practical lever is the choice of coupling, not explicit time conditioning.
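The identifiability argument can be illustrated with a simplified sketch. It assumes the linear interpolant $z = (1-t)\,\epsilon + t\,x$ with standard Gaussian noise $\epsilon$, data lying exactly on a known $k$-dimensional subspace, and an orthonormal basis for that subspace; these are simplifications chosen for illustration, and the paper's estimator under its spiked-covariance model may differ. In the $d-k$ orthogonal directions only the scaled noise survives, so its norm reveals $1-t$.

```python
import numpy as np

def estimate_time(z, basis):
    """Estimate t from one noisy interpolant z, given an orthonormal (d, k) basis of the data subspace."""
    d, k = basis.shape
    residual = z - basis @ (basis.T @ z)                 # part of z orthogonal to the data subspace
    noise_scale = np.linalg.norm(residual) / np.sqrt(d - k)   # approximately (1 - t)
    return 1.0 - noise_scale

rng = np.random.default_rng(0)
d, k, t = 4096, 16, 0.3

basis, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal columns spanning the data subspace
x = basis @ rng.normal(size=k)                     # data point on the subspace
eps = rng.normal(size=d)                           # standard Gaussian noise
z = (1 - t) * eps + t * x                          # assumed interpolant convention

t_hat = estimate_time(z, basis)
print(f"true t = {t}, estimated t = {t_hat:.4f}")
```

The estimate fluctuates on the order of $1/\sqrt{d-k}$ because it averages $d-k$ independent squared noise coordinates, which matches the rate quoted in the abstract.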

Comment: Explains why time-blind flow matching can work by showing high-dimensional data geometry makes interpolation time identifiable from noisy observations.

Topic Match: The strongest match is the theoretical analysis of what information the learned data geometry encodes and how that affects training objectives.

Relevance: 8 Novelty: 8

31. Deterministic Decomposition of Stochastic Generative Dynamics

ArXiv ID: 2605.08794

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Xingyu Song, Yuan Mei, Naoya Takeishi

Abstract: Modern generative models can be understood as probability transport from a simple base distribution to a target data distribution. Deterministic transport models offer tractable velocity-field parameterizations, whereas stochastic generative models capture richer density evolution through drift and diffusion. Yet when stochastic dynamics are described through deterministic velocity fields, the effects of drift and diffusion are often compressed into a single effective field, obscuring the distinct roles of deterministic evolution and stochastic fluctuation. In this work, we show that the deterministic field $b_t$ of a stochastic generative process admits a natural transport--osmotic decomposition that separates deterministic transport from stochastic, diffusion-induced effects: $b_t = u_t + d_t$, where $u_t$ governs marginal probability transport and $d_t$ captures an osmotic effect induced by diffusion and determined by the marginal score. Based on this decomposition, we propose Bridge Matching, a flow-based framework for learning decomposed generative dynamics through both marginal and conditional formulations. In generative modeling experiments, we recombine the learned components as $b_t = u_t + \lambda_d d_t$, showing that the proposed decomposition enables interpretable and controllable sampling by adjusting the osmotic contribution in probability transport.
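For intuition, a decomposition of this shape follows from the Fokker-Planck equation. The derivation below assumes an SDE $dX_t = b_t(X_t)\,dt + \sigma_t\,dW_t$ with scalar diffusion coefficient $\sigma_t$ and marginal density $p_t$; this notation and the $\sigma_t^2/2$ scaling are assumptions for illustration, and the paper's conventions may differ. It uses the identity $\Delta p_t = \nabla\cdot(p_t \nabla \log p_t)$.

```latex
\begin{aligned}
\partial_t p_t
  &= -\nabla\cdot\big(b_t\,p_t\big) + \tfrac{\sigma_t^2}{2}\,\Delta p_t \\
  &= -\nabla\cdot\Big(\big(b_t - \tfrac{\sigma_t^2}{2}\,\nabla\log p_t\big)\,p_t\Big)
\end{aligned}
\qquad\Longrightarrow\qquad
u_t = b_t - \tfrac{\sigma_t^2}{2}\,\nabla\log p_t,
\quad
d_t = \tfrac{\sigma_t^2}{2}\,\nabla\log p_t .
```

Under these assumptions $u_t$ is the velocity that transports the marginals through the continuity equation $\partial_t p_t = -\nabla\cdot(u_t\,p_t)$, while $d_t$ is proportional to the marginal score and vanishes when there is no diffusion.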

Comment: Decomposes stochastic generative dynamics into transport and osmotic components, enabling interpretable control of sampling behavior.

Topic Match: The paper is mainly about a structural and mechanistic understanding of learned generative dynamics rather than application performance.

Relevance: 8 Novelty: 8

32. Diagnosing Spectral Ceilings in Equivariant Neural Force Fields

ArXiv ID: 2605.08286

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Hyunmog Kim

Abstract: We introduce a spectral-injection diagnostic for measuring which angular frequencies a trained equivariant force-field backbone preserves: inject a controlled angular-frequency perturbation into a molecular force field, attach a lightweight Spectral Prediction Network (SPN) to the frozen backbone, and read off which frequencies are recoverable. On aspirin, a quadratic SPN attached to an L = 2 NequIP backbone recovers the boundary signal at l = 4 but collapses at l = 5: an 11.7x cliff at the predicted drL boundary, with p dropping from 0.913 to 0.078. The same boundary-vs-above contrast persists across n = 4 independently trained backbones (raw-gain delta contrast, hierarchical cluster bootstrap) and is corroborated by a denominator-free injected-residual metric (R2_inj(4) = 0.374 versus R2_inj(5) = 0.006). A finite-degree span theorem calibrates the diagnostic: for a single marked direction, degree-d polynomials of degree-L spherical-harmonic features span exactly $H_{\le dL}$ with multiplicity-one saturation at the boundary (scoped to single-direction degree-bounded probes, not a function-class upper bound on multi-atom MPNNs). A synthetic C5 calibration plus capacity, activation, and cross-architecture controls rule out parameter count alone as the explanation.
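The location of the ceiling is consistent with the standard selection rule for products of spherical harmonics evaluated at the same direction. In the sketch below, $\mathcal{H}_\ell$ denotes the span of the degree-$\ell$ spherical harmonics; this notation is introduced here for illustration and is not taken from the paper.

```latex
Y_{\ell_1}^{m_1}(\hat r)\,Y_{\ell_2}^{m_2}(\hat r)
  \;\in\; \bigoplus_{\ell=|\ell_1-\ell_2|}^{\ell_1+\ell_2} \mathcal{H}_\ell
\qquad\Longrightarrow\qquad
\prod_{i=1}^{d} Y_{\ell_i}^{m_i}(\hat r)
  \;\in\; \bigoplus_{\ell \le dL} \mathcal{H}_\ell
\quad \text{whenever every } \ell_i \le L .
```

With $L = 2$ features and a quadratic readout ($d = 2$), the highest reachable degree is $dL = 4$, which agrees with the reported recovery at l = 4 and collapse at l = 5.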

Comment: Introduces a spectral-injection diagnostic that measures which angular frequencies an equivariant backbone preserves and identifies sharp spectral ceilings.

Topic Match: Best fit is representation structure because it probes what information classes are retained in learned equivariant representations and where they fail.

Relevance: 8 Novelty: 8

33. Generalization Error Bounds for Picard-Type Operator Learning in Nonlinear Parabolic PDEs

ArXiv ID: 2605.10277

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Koichi Taniguchi, Sho Sonoda

Abstract: Operator learning for partial differential equations (PDEs) aims to learn solution operators on infinite-dimensional function spaces from finite-resolution data. In this setting, it is important for the learned model to be discretization-invariant, or resolution-robust, and to reflect PDE-specific structure. It is therefore natural to ask how such structure should be encoded in the model architecture, hypothesis class, or learning procedure. In this paper, we study operator learning for solution operators of nonlinear parabolic PDEs based on Duhamel--Picard iteration. We formulate Picard iteration as an abstract state-transition model and present a theoretical framework for Picard-type operator learning. We derive implementation-agnostic generalization error bounds that separate the implementation error from the estimation error associated with the abstract state-transition model induced by Picard iteration. A key consequence is that increasing the Picard depth reduces the Picard truncation error without causing an unbounded growth of the entropy-based estimation error. We also extend the analysis to long-time prediction by rolling out the same learned local model over successive time blocks. Finally, we illustrate the theory for nonlinear heat equations on the torus using a Picard-type Fourier neural operator as a concrete implementation.

Comment: Gives implementation-agnostic generalization bounds for Picard-type operator learning, separating truncation, implementation, and estimation errors.

Topic Match: Primary fit is representation structure because it analyzes how iterative operator representations generalize in infinite-dimensional PDE settings.

Relevance: 8 Novelty: 8

34. A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

ArXiv ID: 2605.08513

Primary Topic: Representation Learning Theory and Structure

Authors: Hamid Kazemi, Atoosa Chegini, Maria Safi

Abstract: Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful knowledge itself. By targeting a single neuron in each system, we demonstrate both directions of failure -- bypassing safety on explicit harmful requests via suppression, and inducing harmful content from innocent prompts via amplification -- across seven models spanning two families and 1.7B to 70B parameters, without any training or prompt engineering. Our findings suggest that safety alignment is not robustly distributed across model weights but is mediated by individual neurons that are each causally sufficient to gate refusal behavior -- suppressing any one of the identified refusal neurons bypasses safety alignment across diverse harmful requests.

Comment: Shows single identified neurons can causally gate refusal behavior, revealing an unexpectedly localized mechanism for safety behavior in LLMs.

Topic Match: Primary fit is representation structure because the paper isolates neuron-level features that mechanistically control a learned behavioral representation.

Relevance: 8 Novelty: 8

35. Physical probes expose and alleviate chemical-environment collapse in molecular representations

ArXiv ID: 2605.10429

Primary Topic: Representation Learning Theory and Structure

Authors: Jiebin Fang, Zidi Yan, Churu Mao, Yongjun Jiang, Xinyi Tang, Lei Miao, Dan Lu, Yun Huang, Wanjing Ding, Zhongjun Ma

Abstract: Nuclear magnetic resonance (NMR) spectroscopy provides an experimental readout of local chemical environments, but its use in molecular representation learning has been constrained by heterogeneous data and incomplete atom-level assignments. Here we construct complementary high-fidelity experimental and computational 13C NMR resources, which reveal a recurrent form of representational collapse: atoms that are equivalent in molecular topology can remain experimentally distinct in their real chemical environments, whereas explicit 3D descriptions are further limited by static conformations in dynamic regimes. To alleviate this bottleneck, we develop CLAIM (Contrastive Learning for Atom-to-molecule Inference of Molecular NMR), a framework that aligns efficient topological molecular inputs with atom-resolved NMR observables. Through hierarchical chemical priors and cross-level contrastive learning, CLAIM restores lost chemical resolution and markedly improves atom-level molecule-spectrum retrieval. CLAIM remains robust in flexible and tautomeric systems for 13C NMR prediction, improves stereoisomer discrimination without explicit 3D modelling, and transfers to broader molecular property tasks including ADMET prediction and fluorescence estimation. These results establish physically grounded spectral alignment as an effective strategy for alleviating chemical-environment collapse and for guiding experimentally grounded molecular representation learning.

Comment: Uses atom-resolved NMR supervision to expose and alleviate chemical-environment collapse in molecular representations.

Topic Match: The paper directly studies representational collapse and proposes a physically grounded learning scheme to recover lost structure in learned features.

Relevance: 8 Novelty: 8

36. A Deep Risk Estimator for Known Operator Learning

ArXiv ID: 2605.08517

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Andreas Maier, Md Hasan, Paulina Conrad, Paula Andrea Perez-Toro

Abstract: We describe an approach for estimating the statistical risk of deep networks that contain a mix of learned and known operators. Building on the maximal training error bounds previously established for known operator learning, we derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample. The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum, while every learned layer adds an approximation term inspired by Barron's classic work and an estimation term that decreases with the number of training samples. We are able to show that the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced. As an application, we use computed tomography as an example and compare an operator-aware filtered backprojection network with a fully connected substitute that collapses the entire reconstruction pipeline into a single learned dense matrix. The predicted parameter ratio coincides with the structural sparsity that the analytic decomposition into a circulant filter and a sparse backprojection exposes. We confirm the predicted scaling on CPU at small image scale and on GPU at medium image scale, all on the same scaling law. Beyond CT reconstruction, the estimator applies to physics-informed neural networks that hardcode a known physical operation in their architecture, and we expect the result to be of interest for a broad community working on operator-aware deep learning. Calibrating the per-layer constants on each sweep yields a bound that tracks the empirical test MSE within a factor of two at every training-set size, so the estimator can be inverted to predict how many training samples are required to reach a target error.

Comment: Derives a layerwise risk estimator showing how replacing learned layers with known operators reduces sample complexity and expected error.

Topic Match: Best fit is representation/theory because the core result is a principled generalization-risk decomposition for mixed learned/known-operator networks.

Relevance: 8 Novelty: 8

37. Mistake-Bounded Language Generation

ArXiv ID: 2605.10809

Primary Topic: Representation Learning Theory and Structure

Authors: Jon Kleinberg, Charlotte Peale, Omer Reingold

Abstract: We investigate the learning task of language generation in the limit, but shift focus from the traditional time-of-last-mistake metric of a generator's success to a new notion of "mistake-bounded generation." While existing results for language generation in the limit focus on guaranteeing eventual consistency, they are blind to the cumulative error incurred during the learning process. We address this by shifting the goal to minimizing the total number of invalid elements output by a generation algorithm. We establish a formal reduction to the Learning from Correct Demonstrations framework of Joshi et al. (2025), enabling a general recipe for deriving mistake bounds via weighted update rules. For finite classes, we provide an algorithm that simultaneously achieves an optimal last-mistake time of $\mathsf{Cdim}(L)$ and a mistake bound of $\lfloor \log_2 |L| \rfloor$, whereas for the non-uniform setting of countably infinite streams of languages, we prove a fundamental trade-off: achieving logarithmic mistakes $O(\log i)$ necessarily precludes convergence guarantees established in prior work. Finally, we show that our framework can be extended to accommodate noisy adversaries and guarantee mistake bounds that scale with the adversary's suboptimality.

Comment: Introduces a new mistake-bounded objective for language generation with formal reductions and tight bounds, directly targeting learning dynamics in generative models.

Topic Match: Best fit is foundational learning theory for generative behavior, with formal analysis of cumulative errors rather than an application or benchmark.

Relevance: 8 Novelty: 8

38. In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

ArXiv ID: 2605.08295

Primary Topic: Representation Learning Theory and Structure

Authors: Ming Liu

Abstract: While random demonstration labels barely hurt in-context learning (Min et al., 2022), we show that homogeneous labels--even semantically valid ones--collapse accuracy to <=12% across six models (Pythia, Llama, Qwen; 0.8B--8B) and four tasks. The trigger is label-slot content: the model treats tokens occupying the label position as an exhaustive answer vocabulary, with homogeneity as the maximally collapsed case. A novel set-level fixation finding confirms this: when demonstrations carry varied nonsense tokens from {foo,bar,vex,nit,orb}, the model places 42--67% of probability on the demonstrated set while P(dog) remains below 0.2%. This is inconsistent with latent-concept Bayesian accounts (Xie et al., 2022) and reveals that ICL output is constrained vocabulary retrieval--the model binds its output to the demonstrated token inventory regardless of semantic plausibility. The effect generalizes to 4-way classification (0% accuracy across three models, 1B--8B) and multi-token verbalizers ("very positive"), where we decompose fixation into format-level (template adoption) and content-level (polarity override) components that are experimentally dissociable. Mechanistically, per-item paired activation patching on Pythia-1B recovers 98.4% of the gap (95% CI [84%, 112%]), localizing fixation to a layer-7-centered circuit (rank 2/560, 99.8th percentile; 4-fold CV mean 103%). Cross-architecture logit lens on Llama-3.2-1B replicates the encode-then-override trajectory with causal confirmation (top-5 layers: 89% recovery).

Comment: Shows in-context learning can collapse into demonstrated-label vocabulary retrieval and localizes the effect with activation patching to a specific circuit.

Topic Match: This is fundamentally about mechanistic structure in learned representations and inference circuits, not downstream classification performance.

Relevance: 8 Novelty: 8

39. Measuring and Decomposing Mode Separation via the Canonical Diffusion

ArXiv ID: 2605.08777

Primary Topic: Representation Learning Theory and Structure

Authors: Shaul Tolkovsky, Ori Meidler, Or Zuk

Abstract: Mode separation, namely how sharply a distribution fragments into barrier-separated clusters, is a fundamental geometric property of densities, difficult to quantify in high dimensions. It is structurally distinct from dispersion, yet existing tools fall short: differential entropy rises with spread regardless of fragmentation, PCA orders directions by variance regardless of barriers, and mutual information requires a mixture decomposition one usually does not have. We measure mode separation through a single stochastic process intrinsic to the density: a unique reversible diffusion with $f$ as its stationary distribution and constant scalar diffusion coefficient. We extract two readouts from its autocovariance matrix: SSA (Sum of Squared Autocorrelations), a scalar barrier-sensitive measure; and DA (Dominant Autocorrelation directions), linear projections ordered by metastability rather than variance. Under an isotropic-Gaussian null, we derive a closed-form spectrum for the empirical autocovariance that generalizes Marchenko--Pastur, with an analytic upper edge that selects the lag at which DA is read off. Both readouts use only samples and a score function, scaling to high dimensions through pretrained score-based generative models via Tweedie's identity. We apply our framework to three settings: (i) synthetic Gaussian mixtures, where SSA tracks mutual information; (ii) SDXL text-to-image generations, where SSA and DA capture structure that entropy and PCA miss; and (iii) molecular dynamics of alanine dipeptide, where DA recovers the known slow backbone dihedrals from static samples alone.

Comment: Measures mode separation using a canonical reversible diffusion, yielding barrier-sensitive scalar and directional statistics recoverable from samples plus scores.

Topic Match: The paper is centrally about quantifying high-dimensional representation/distribution structure beyond variance or entropy, with a new theoretical lens.

Relevance: 8 Novelty: 8

40. Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

ArXiv ID: 2605.10633

Primary Topic: Representation Learning Theory and Structure

Authors: Krishak Aneja, Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri

Abstract: Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above $40$%, while amplifying them suppresses the failure mode to less than $3$%. Leveraging the structural stability of the personality space, we show that vectors extracted $\textit{a priori}$ from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model's internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.

Comment: Uses stable semantic personality directions as causal controls over emergent misalignment, revealing conserved latent geometry.

Topic Match: Its strongest match is representation structure because it studies stable latent directions, their geometry, and causal interventions on internal semantics.

Relevance: 8 Novelty: 8

41. Optimality of Sub-network Laplace Approximations: New Results and Methods

ArXiv ID: 2605.09075

Primary Topic: Representation Learning Theory and Structure

Authors: Swarnali Raha, Kshitij Khare, Rohit K Patra

Abstract: Although the Laplace approximation offers a simple route to uncertainty quantification in deep neural networks, its reliance on inverting large Hessian matrices has motivated a range of computationally feasible low-dimensional or sparse approximations. A prominent class of such methods, sub-network Laplace approximations, constructs surrogates by restricting attention to a small subset of parameters. Existing approaches in this family typically rely on diagonal, layer-wise, or other architectural heuristics for subset selection, which ignore cross-parameter interactions and lack formal optimality guarantees. In this paper, we provide a rigorous theoretical analysis of the sub-network Laplace paradigm. We prove that all sub-network Laplace methods systematically underestimate the predictive variance of the full Laplace posterior, and that this bias decreases monotonically as the retained sub-matrix expands. Leveraging this insight, we propose two principled, analytically grounded sub-network Hessian approximations: \textit{Gradient-Laplace} selects parameters with the largest average squared gradients of the model output with respect to the parameters over a reference dataset, while \textit{Greedy-Laplace} iteratively refines this selection by accounting for off-diagonal interactions in the precision matrix. We establish theoretical guarantees characterizing their optimality properties and show that Gradient-Laplace provably outperforms existing heuristic approaches. Extensive numerical studies across diverse settings indicate that these methods perform strongly relative to existing benchmarks.

Comment: Proves variance underestimation in sub-network Laplace approximations and gives principled parameter-selection methods with optimality guarantees.

Topic Match: The paper is fundamentally about principled approximation of model uncertainty structure and parameter interactions, not a downstream application.

Relevance: 8 Novelty: 8
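
Illustration (not from the paper): the Gradient-Laplace selection rule described in the abstract above, ranking parameters by their average squared output gradient over a reference dataset, can be sketched in a few lines. The function name `gradient_laplace_select`, the list-of-lists input format, and the toy gradients are all assumptions made for this sketch; the Greedy-Laplace refinement and the Laplace posterior itself are not reproduced.

```python
def gradient_laplace_select(jacobians, k):
    """Return the k parameter indices with the largest mean squared gradient.

    jacobians: one gradient vector per reference input, i.e. the derivative of
    the model output with respect to every parameter (hypothetical format).
    """
    n_params = len(jacobians[0])
    scores = [0.0] * n_params
    for grad in jacobians:
        for i, g in enumerate(grad):
            scores[i] += g * g
    # Average the squared gradients over the reference dataset.
    scores = [s / len(jacobians) for s in scores]
    ranked = sorted(range(n_params), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


# Toy reference set: 3 inputs, 4 parameters (made-up numbers).
reference_grads = [
    [0.1, 2.0, 0.0, -1.0],
    [0.2, -1.5, 0.1, 0.5],
    [0.0, 1.0, -0.1, 1.5],
]
selected, scores = gradient_laplace_select(reference_grads, k=2)
print(selected)  # indices of the retained sub-network parameters
```

In this toy case parameters 1 and 3 carry almost all of the squared-gradient mass, so they form the retained subset; a sub-network Laplace method would then build its Hessian approximation only over those indices.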

42. Embeddings for Preferences, Not Semantics

ArXiv ID: 2605.08360

Primary Topic: Representation Learning Theory and Structure

Authors: Carter Blair, Ariel D. Procaccia, Milind Tambe

Abstract: Modern AI is opening the door to collective decision-making in which participants express their views as free-form text rather than voting on a fixed set of candidates. A natural idea is to embed these opinions in a vector space so that the substantial literature on facility location problems and fair clustering can be brought to bear. But standard text embeddings measure semantic similarity, whereas distances in facility location problems and fair clustering require what we call \textit{preferential similarity}: a participant's agreement with a piece of text should be inversely related to their distance from it. Off-the-shelf embeddings inherit a coarse preference signal through a correlation between semantic and preferential similarity, but fail to capture preferences when the correlation breaks. We formalize this as an invariance problem: text embedding models encode both a preference-relevant signal (stance and values) and semantic nuisance (style and wording), and the two are observationally correlated, so a geometry that relies on nuisance can appear preference-correct even when it is not. We show that synthetic training data designed to break this correlation provably shifts the optimal scorer away from nuisance-dominated cosine and significantly improves preference prediction across 11 online deliberation datasets.

Comment: Argues standard text embeddings conflate semantic and preferential similarity, and constructs synthetic data to learn preference-relevant geometry instead.

Topic Match: The paper is fundamentally about representation geometry—what embeddings encode and how to separate task-relevant structure from nuisance semantics.

Relevance: 8 Novelty: 8

43. Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation

ArXiv ID: 2605.10290

Primary Topic: Representation Learning Theory and Structure

Authors: Lucas Morisset, Alain Durmus, Adrien Hardy

Abstract: This paper aims at analyzing the regularization effect that data augmentation induces on supervised regression methods in the proportional regime, where the number of covariates grows proportionally to the number of samples. We provide a tight characterization of the test error, measured in mean squared error, in terms only of the population quantities of the true data, as well as first and second order statistics of the augmentation scheme. Our results are valid under misspecified feature maps, and for any network architecture where only the last readout layer is trained, and the rest of the network is either frozen or randomly initialized. We specify our results in the case of Gaussian data, and show that our asymptotic characterization is tight in this setting.

Comment: Characterizes test error of random-feature regression with arbitrary data augmentation in the proportional regime.

Topic Match: This squarely fits Representation Learning Theory and Structure because it theoretically analyzes how augmentation shapes learned regression representations and generalization.

Relevance: 8 Novelty: 8

44. Non-Parametric Rehearsal Learning via Conditional Mean Embeddings

ArXiv ID: 2605.08999

Primary Topic: Representation Learning Theory and Structure

Authors: Wen-Bo Du, Tian-Zuo Wang, Han-Jia Ye, Zhi-Hua Zhou

Abstract: In machine learning, a critical class of decision-related problems concerns preventing predicted undesirable outcomes, referred to as the \textit{avoiding undesired future} (AUF) problem. To address this, the \textit{rehearsal learning} framework has been proposed to model influence relations for effective decisions. However, existing rehearsal methods rely on restrictive parametric assumptions such as linear systems or additive noise, limiting their practical applicability. In this paper, we propose the first non-parametric rehearsal learning approach for AUF without assuming specific functional forms of data generation processes. Specifically, we use kernel machinery to reformulate the AUF objective into a unified representation that disentangles desirability modeling from action-induced distributional changes. To handle the discontinuity of the desirability indicator, we present a smooth Probit surrogate and provide an approximation error bound. Meanwhile, we capture the action-induced changes via conditional mean embeddings, and develop a kernel ridge regression based nested estimator for the AUF objective with consistency guarantees. Such a formulation naturally accommodates nonlinear systems and non-additive noise, and empirical results on synthetic and real-data-derived semi-synthetic benchmarks demonstrate the effectiveness and flexibility of our approach.

Comment: Presents the first non-parametric rehearsal learning method using conditional mean embeddings, with consistency guarantees.

Topic Match: The core contribution is a kernel-based formulation of how action-conditioned representations and desirability structure can be learned and analyzed.

Relevance: 8 Novelty: 8

45. The Pokémon Theorem and other Fairness Impossibility Results

ArXiv ID: 2605.09221

Primary Topic: Representation Learning Theory and Structure

Authors: Daniel Matsui Smola, Alex Smola

Abstract: Fairness impossibility results often look like distinct scalar incompatibility statements. We show that several share one RKHS geometry: fairness criteria are linear constraints on conditional mean embeddings, and unequal base rates make the law of total expectation overdetermine those constraints. This view yields four results. The Kleinberg--Mullainathan--Raghavan dichotomy needs only group-conditional unbiasedness, not full calibration. The \emph{Pokémon theorem} shows that a distinct group pair satisfying any finite collection of linear mean-fairness criteria leaves a residual violation witnessed by the MMD, decaying at the Kolmogorov $m$-width rate under spectral regularity. The same tools prove an impossibility for fair feature learning: parity and class-conditional separation in representation space force class collapse under unequal base rates. The approximate relaxations yield signal and error frontiers, allowing a trade-off between real-world estimators and fairness goals. Experiments on standard fairness benchmarks are consistent with our bounds.

Comment: Unifies several fairness impossibility results through RKHS geometry and conditional mean embeddings, extending them to representation learning.

Topic Match: Primary fit is representation structure because the paper gives a mechanistic theoretical account of constraints on learned representations via embeddings.

Relevance: 8 Novelty: 8

46. Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

ArXiv ID: 2605.08277

Primary Topic: Representation Learning Theory and Structure

Authors: Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li, Jian Lou, Zunlei Feng, Mingli Song, Ruoxi Jia, Tianwei Zhang

Abstract: Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The resulting method improves the model's robustness to MSJ without modifying its parameters or requiring white-box access at deployment. Code is available at https://github.com/Thecommonirin/SafeEnd.

Comment: Provides a mechanistic account of many-shot jailbreaks as inference-time SGD-like representation drift and uses that insight for a parameter-free counter-update defense.

Topic Match: The core contribution is a representation-level analysis of how demonstrations shift model states, with a theory of activation drift rather than a benchmark-only safety result.

Relevance: 8 Novelty: 8

47. Prospective Compression in Human Abstraction Learning

ArXiv ID: 2605.09985

Primary Topic: Representation Learning Theory and Structure

Authors: Leonardo Hernandez Cano, Ivan Zareski, Luisa El Amouri, Pinzhe Zhao, Max Mascini, Emanuele Sansone, Yewen Pu, Bonan Zhao, Marta Kryven

Abstract: A core challenge in program synthesis is online library learning: the incremental acquisition of reusable abstractions under uncertainty about future task demands. Existing algorithms treat library learning as retrospective compression over a static task distribution, where the learned library is determined by the corpus of past tasks. However, real-world learning domains are often non-stationary, with tasks arising from a generative process that evolves over time. We propose and test the hypothesis that in non-stationary domains human library learning selects abstractions prospectively: targeting compression of future tasks. We study this question using the Pattern Builder Task, a visual program synthesis paradigm in which participants construct increasingly complex geometric patterns from a small set of primitives, transformations, and custom helpers that carry forward across trials. Using this task, we conduct two experiments with complementary latent curricula, designed to dissociate between behaviors consistent with prospective compression, and alternative library learning accounts. Using six computational models spanning online library learning strategies, we show that human abstraction behavior reflects sensitivity to latent, non-stationary structure in the task-generating process. This behavior is consistent with prospective compression, and cannot be captured by existing retrospective compression-based algorithms, or inductive biases modeled by LLM-based program synthesis.

Comment: Shows human abstraction learning in non-stationary program-synthesis domains is better explained by prospective compression of future tasks than retrospective library compression.

Topic Match: Primary fit is representation structure because it studies how reusable abstractions form and organize under changing task distributions, giving mechanistic insight into learned structure.

Relevance: 8 Novelty: 8

48. Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

ArXiv ID: 2605.08526

Primary Topic: Representation Learning Theory and Structure

Also Matches: Memory Structures and Agent Memory Systems

Authors: Zihan Huang, Junda Wu, Tong Yu, Qianqi Yan, Rohan Surana, Uttaran Bhattacharya, Lina Yao, Xin Eric Wang, Julian McAuley

Abstract: While LLM-based agents excel at planning and executing long action sequences, their execution often remains inconsistent across trials, limiting reliability. Consolidating agent consistency requires distilling trial-error trajectories into reusable skills that preserve task-relevant invariants while discarding trajectory-specific noise. However, in multimodal settings, the key challenge is not only that useful invariants are distributed across vision and language information, but that different modalities support different kinds of reusable skill content: while some skills are verbalizable and interpretable, others reside in perceptual evidence beyond text. Text-only skills may lose perceptual cues, whereas storing text and perception naively introduces redundancy and noise. Existing inference-time methods, such as self-consistency, improve reliability through costly multi-sample decoding, while internalization strategies lack a way to separate verbalizable skill content from residual perceptual information. To address this, we introduce Conditional Multimodal Information Bottleneck (CMIB), a method for multimodal skill construction. CMIB begins with a joint bottleneck over multimodal skills and derives an exact sequential decomposition: (1) a text-stage bottleneck distilling interpretable skill cards, and (2) a conditional multimodal bottleneck compressing only residual information in perception that remains predictive beyond text. Unlike naive two-stream formulations, CMIB explicitly conditions the multimodal latent on the text skill, thus structurally reducing cross-modal redundancy and enabling independent control over textual and perceptual compression. We instantiate CMIB with a variational objective that makes its conditional decomposition tractable to optimize, yielding reusable multimodal skills that improve execution stability without incurring multi-sample inference overhead.

Comment: Introduces a conditional multimodal information bottleneck that separates verbalizable skill content from residual perceptual information for reusable agent skills.

Topic Match: Best fit is representation structure because the central advance is a principled bottleneck decomposition for how multimodal skills are represented and compressed.

Relevance: 8 Novelty: 8

Memory Structures and Agent Memory Systems (12)

1. HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing

ArXiv ID: 2605.08143

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Yuan Fang, Yi Xie, Xuming Ran

Abstract: Large language models encode vast factual knowledge that inevitably becomes outdated or incorrect after deployment, yet retraining is prohibitively costly, motivating model editing in lifelong settings that updates targeted behavior without harming the rest of the model. One line of work installs new facts by directly modifying base weights through locate-then-edit procedures, but accumulated edits progressively disrupt originally preserved knowledge, even with constraint-based projections. A complementary line leaves base weights intact and routes edits through external memory, but it faces routing challenges and its performance degrades at scale. We propose HoReN, a codebook-based parameter-preserving editor with enhanced routing built on three ideas. First, HoReN wraps a single MLP layer with a discrete key-value codebook, where each entry is interpreted simultaneously as a knowledge-memory key and a modern Hopfield stored pattern. Second, both keys and queries are projected onto the unit hypersphere so retrieval is governed by angular similarity, removing magnitude-driven mismatches between an edit prompt and its rephrasings. Third, the query is refined through damped Hopfield attractor dynamics, so paraphrases relax into the correct stored pattern's basin of attraction while unrelated queries remain undisturbed. HoReN achieves strong editing performance with consistent gains across diverse benchmarks spanning standard ZsRE, structured WikiBigEdit, and unstructured UnKE evaluations. Moreover, HoReN scales to 50K sequential edits on ZsRE with stable overall performance above 0.9, while prior editors collapse or degrade severely before reaching 10K. Our code is available at https://github.com/ha11ucin8/HoReN.

Comment: Introduces normalized Hopfield retrieval with attractor refinement as an external memory mechanism for scalable sequential model editing.

Topic Match: Primary fit is memory systems because the core contribution is a learned retrieval memory architecture with codebook storage, routing, and attractor-based recall.

Relevance: 9 Novelty: 8
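
Illustration (not the authors' code): the second and third ideas in the HoReN abstract above, unit-sphere projection followed by damped Hopfield refinement, can be sketched with the standard modern-Hopfield update (softmax-weighted average of stored patterns). The damping scheme, the values of `beta`, `damping`, and `steps`, and the function name `hopfield_refine` are assumptions chosen for this sketch; the codebook, the MLP wrapping, and the handling of unrelated queries are not modeled.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_refine(query, keys, beta=8.0, damping=0.5, steps=5):
    """Damped modern-Hopfield updates with keys and query kept on the unit sphere."""
    keys = [normalize(k) for k in keys]
    q = normalize(query)
    for _ in range(steps):
        # Angular similarity only: both sides have unit norm.
        attn = softmax([beta * sum(a * b for a, b in zip(k, q)) for k in keys])
        target = [sum(p * k[i] for p, k in zip(attn, keys)) for i in range(len(q))]
        # Damped step toward the retrieved pattern, then re-project.
        q = normalize([(1 - damping) * qi + damping * ti for qi, ti in zip(q, target)])
    sims = [sum(a * b for a, b in zip(k, q)) for k in keys]
    return q, sims


keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
paraphrase = [5.0, 1.5, 0.5]  # large magnitude, roughly aligned with key 0
refined, sims = hopfield_refine(paraphrase, keys)
best = max(range(len(keys)), key=lambda i: sims[i])
print(best, round(sims[best], 4))
```

Because the query is normalized before scoring, its large magnitude plays no role; the refinement steps then pull it toward the stored pattern it is most aligned with.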

2. VORT: Adaptive Power-Law Memory for NLP Transformers

ArXiv ID: 2605.08966

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics

Authors: Nabil Mlaiki

Abstract: Standard Transformers impose near-exponential decay on the influence of distant tokens, conflicting with the power-law structure of long-range dependencies in natural language. We introduce the \emph{Variable-Order Retention Transformer} (VORT), a memory architecture in which each ingested token is assigned a learnable fractional order $\alpha_i\in[\delta,1]$ that governs a Grünwald--Letnikov power-law retention kernel. Because the fractional weighted sum is non-Markovian, we approximate it through a sum-of-exponentials (SOE) decomposition computed by Gauss--Laguerre quadrature on a Laplace-type integral representation of the kernel weights. Each exponential component admits a one-step Markovian recurrence at $O(Sd_v)$ per step, where $S=O(\log(T/\varepsilon))$ terms suffice for $\varepsilon$-uniform accuracy on horizon $[1,T]$. Retrieval is keyed and associative via a linear-attention accumulator with an exact $O(KSd_\phi d_v)$-per-step recurrence. Four results are established: (i) an SOE approximation theorem with geometric convergence rate from the analyticity of the integrand after a log-change of variables; (ii) a quantisation bound valid on $[\delta,1]$ with correct analysis near $\alpha=0$; (iii) a direct $L^2$ energy argument (Proposition) showing that for $\alpha>1/2$ any mixture with fixed minimum decay rate $\Lambda>0$ incurs $L^2([1,T])$ error at least $N_\alpha(T)-C(\Lambda)\to\infty$, with the $\Lambda$-dependence made explicit; and (iv) linear convergence of a gradient plasticity rule under the Polyak--Łojasiewicz condition. Two synthetic experiments confirm the architectural advantage: a Zipf-distributed retrieval benchmark and an entity label-copy task with uniform lag distribution, the latter ruling out prior-matching as an explanation for the power-law kernel's advantage.

Comment: Proposes a transformer memory mechanism with learnable fractional-order power-law retention and efficient sum-of-exponentials recurrences.

Topic Match: Primary fit is memory systems because the paper's central idea is a new long-range memory kernel and retention mechanism, even though it is also an architectural contribution.

Relevance: 9 Novelty: 8
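
Illustration (not the paper's construction): the sum-of-exponentials idea in the VORT abstract above can be seen on the plain power law $t^{-\alpha}$, using the identity $t^{-\alpha} = \frac{1}{\Gamma(\alpha)}\int_{-\infty}^{\infty} e^{\alpha u - t e^{u}}\,du$ obtained from the Laplace-type integral after the log-change of variables $s = e^{u}$. This sketch swaps in the trapezoidal rule in $u$ for the Gauss--Laguerre quadrature named in the abstract, uses a single fixed $\alpha$ rather than per-token orders, and approximates $t^{-\alpha}$ rather than the Grünwald--Letnikov weights. The grid limits, step size, and function names are assumptions.

```python
import math


def soe_power_law(alpha, u_min=-40.0, u_max=5.0, h=0.5):
    """Rates s_j and weights w_j with t**(-alpha) ~= sum_j w_j * exp(-s_j * t).

    Trapezoidal rule in u on exp(alpha*u - t*exp(u)) / Gamma(alpha);
    the grid is an assumption tuned for alpha = 0.5 and t in [1, 1000].
    """
    n = int(round((u_max - u_min) / h)) + 1
    us = [u_min + j * h for j in range(n)]
    rates = [math.exp(u) for u in us]
    weights = [h * math.exp(alpha * u) / math.gamma(alpha) for u in us]
    return rates, weights


def kernel_soe(t, rates, weights):
    """Evaluate the sum-of-exponentials surrogate of t**(-alpha)."""
    return sum(w * math.exp(-s * t) for s, w in zip(rates, weights))


def retained_sum_recurrent(xs, rates, weights):
    """y_n = sum_{l=1..n} l**(-alpha) * x_{n-l}, one Markovian state per exponential."""
    decays = [math.exp(-s) for s in rates]
    states = [0.0] * len(rates)
    out = []
    for x in xs:
        # Read out the retained history before ingesting the new input.
        out.append(sum(w * m for w, m in zip(weights, states)))
        # One-step recurrence: m_j <- exp(-s_j) * (m_j + x).
        states = [d * (m + x) for d, m in zip(decays, states)]
    return out


alpha = 0.5
rates, weights = soe_power_law(alpha)
worst = max(abs(kernel_soe(t, rates, weights) * t ** alpha - 1.0)
            for t in (1, 2, 10, 100, 1000))
print(len(rates), worst)  # number of exponentials and worst relative error
```

The point of the decomposition is that each exponential term needs only its previous state, so the non-Markovian power-law sum over the whole history becomes a fixed number of scalar recurrences per step.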

3. Continuous Latent Contexts Enable Efficient Online Learning in Transformers

ArXiv ID: 2605.09867

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics, World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Emile Anand, Abdullah Ateyeh, Xinyuan Cao, Max Dabagia

Abstract: Large language models (LLMs) exhibit a strong capacity for in-context learning: Given labeled examples, they can generate good predictions without parameter updates. However, many interactive settings go beyond static prediction to online decision-making, in which effective behavior demands adaptation over long multi-turn horizons in response to feedback, and efficient algorithms in these domains must use compact representations of what they have learned. Recently, continuous transformer architectures with latent chain of thought have shown promise for offline iterative tasks such as directed graph-reachability. Motivated by this, we study whether continuous latent context tokens equip transformers to more effectively realize online learning. We give explicit constructions of constant-depth transformers that implement two foundational online decision-making procedures -- the weighted majority algorithm and $Q$-learning -- by storing their algorithmic state as linear combinations of feature embeddings, using a small number of latent context tokens. We further train a small GPT-2-style transformer with latent contexts using a multi-curriculum objective that does not directly supervise the latent states. On long synthetic online prediction sequences, this model outperforms larger and more complex LLMs, including Qwen-3-14B and DeepSeek-V3. Our results suggest that continuous latent contexts provide a simple and effective persistent state for transformers to implement online learning algorithms.

Comment: Shows latent context tokens can explicitly implement persistent online-learning algorithms like weighted majority and Q-learning inside transformers.

Topic Match: Primary fit is memory systems because the central idea is compact persistent latent state as a memory substrate for updating and carrying algorithmic information across long interaction horizons.

Relevance: 9 Novelty: 8
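
Illustration (not the paper's transformer construction): the weighted majority algorithm named in the abstract above keeps one weight per expert as its entire algorithmic state, which is the kind of compact persistent state the latent context tokens are said to store. The sketch below is the textbook deterministic version with penalty factor `beta`; the function name, the expert setup, and the outcome sequence are made up for this example.

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Deterministic weighted majority over binary experts.

    expert_preds[t][i] is expert i's 0/1 prediction at round t and
    outcomes[t] is the true 0/1 label. Returns (mistakes, final weights).
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0
        mistakes += int(guess != y)
        # Multiplicative penalty for every expert that was wrong this round.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return mistakes, weights


outcomes = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
# Expert 0 is always right, 1 always says 1, 2 always says 0, 3 is always wrong.
expert_preds = [[y, 1, 0, 1 - y] for y in outcomes]
mistakes, weights = weighted_majority(expert_preds, outcomes)
print(mistakes, weights)
```

With `beta = 0.5` the classical guarantee is that the number of mistakes is at most about 2.41 times (best expert's mistakes + log2 of the number of experts), so with a perfect expert among four the learner here can err at most four times.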

4. Factual recall in linear associative memories: sharp asymptotics and mechanistic insights

ArXiv ID: 2605.10795

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Representation Learning Theory and Structure

Authors: Alessio Giorlandino, Sebastian Goldt, Antoine Maillard

Abstract: Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input--output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding $d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces $p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c \log p_c / d^2 = 1 / 2$ associations, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a naïve Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.

Comment: Derives sharp storage-capacity asymptotics for linear associative memories and explains the mechanism by which optimal learning surpasses Hebbian storage.

Topic Match: The paper is centrally about associative memory capacity and retrieval mechanisms, making memory systems the best fit.

Relevance: 9 Novelty: 8
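
The capacity condition quoted in the abstract above, $p_c \log p_c / d^2 = 1/2$, is implicit in $p_c$. The sketch below solves it numerically for a few embedding dimensions. It assumes the natural logarithm (the abstract does not state the base) and applies an asymptotic formula at finite $d$, so the numbers are only indicative; the function name `critical_load` is hypothetical.

```python
import math


def critical_load(d: int, tol: float = 1e-9) -> float:
    """Solve p * ln(p) = d**2 / 2 for p > 1 by bisection.

    Assumption: natural log; the formula is asymptotic in the source.
    """
    target = d * d / 2.0
    lo, hi = 1.0, 2.0
    # Grow the upper bracket until p * ln(p) exceeds the target.
    while hi * math.log(hi) < target:
        hi *= 2.0
    # p * ln(p) is increasing for p > 1, so bisection is safe.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mid * math.log(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for d in (32, 100, 1000):
    p_c = critical_load(d)
    # Print d, the estimated capacity, and the capacity per weight (p_c / d^2).
    print(d, round(p_c), round(p_c / d**2, 4))
```

Under these assumptions the capacity per weight, $p_c / d^2 = 1 / (2 \log p_c)$, shrinks slowly as $d$ grows.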

5. Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

ArXiv ID: 2605.10640

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics

Authors: Haoyu Wang, Yifan Shang, Zhongxiang Sun, Weijie Yu, Xiao Zhang, Jun Xu

Abstract: Continual Pre-Training (CPT) is essential for enabling Language Models (LMs) to integrate new knowledge without erasing old knowledge. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer, offering a unified explanation for the behavior of representative CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called Selecting Tokens via attentiOn Contribution (STOC), which identifies influential factual snippets to guide replay data generation. Extensive experiments on both synthetic and real-world datasets validate our findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.

Comment: Presents a theory of continual factual knowledge acquisition showing why replay changes forgetting dynamics, then derives an attention-guided generative replay method.

Topic Match: The strongest match is continual retention and forgetting of factual knowledge, i.e. memory update and preservation mechanisms in language models.

Relevance: 9 Novelty: 8

6. Workspace Optimization: How to Train Your Agent

ArXiv ID: 2605.09650

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Elad Sarafian, Gal Kaplun, Ron Banner, Daniel Soudry, Boris Ginsburg

Abstract: Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's workspace, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.

Comment: Treats the agent workspace as the trainable object, with explicit analogs of parameters, losses, and gradients for learning through external artifacts.

Topic Match: Best fit is memory systems because the core idea is to optimize a structured external workspace that stores and updates artifacts used across interaction steps.

Relevance: 8 Novelty: 9

7. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

ArXiv ID: 2605.09278

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Yuqiao Meng, Sakshi Sunil Narvekar, Luoxi Tang, Rupali Rajendra Vaje, Yingxue Zhang, Muchao Ye, Zhaohan Xi

Abstract: Multi-agent debate (MAD) systems increasingly rely on shared memory to support long-horizon reasoning, but this convenience opens a critical vulnerability: a single corrupted entry can contaminate the downstream memory-augmented reasoning, and debate alone fails to filter such errors. Existing safeguards filter entries via heuristics or LLM-based validation, yet they rely on AI judgments that share the same failure modes and overlook the cross-agent dynamics of MAD. We address this gap by formulating memory updating in MAD as a zero-trust memory game, in which no agent is assumed honest and the game's equilibrium serves as an indicator of optimal memory trust. Guided by this equilibrium, we propose EquiMem, an inference-time calibration mechanism that quantifies each update algorithmically against the shared memory state, using agents' existing retrieval queries and traversal paths as evidence rather than soliciting any LLM judgment. EquiMem instantiates calibration for both embedding- and graph-based memory, and across diverse benchmarks, MAD frameworks, and memory architectures, it consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead.

Comment: Uses game-theoretic equilibrium to calibrate trust in shared memory updates for multi-agent debate without relying on another LLM judge.

Topic Match: The core contribution is a new memory update and trust mechanism for shared multi-agent memory, rather than a generic agent benchmark.

Relevance: 8 Novelty: 8

8. SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

ArXiv ID: 2605.08693

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Min Yang, Jinghua Piao, Xu Xia, Xiaochong Lan, Jiaju Chen, Yongshun Gong, Yong Li

Abstract: Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand-designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory-informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is designed to be evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill-editing decisions. Third, we introduce DualAdv-GRPO, which separately estimates advantages for task-solving actions and skill-editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state-of-the-art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill-bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self-improving agents capable of developing, adapting, and applying their own skill repertoires.

Comment: Trains agents to create, refine, and select reusable skills from trajectory evidence, with counterfactual utility supervising skill memory updates.

Topic Match: The paper is best viewed as agent memory/skill-system research because the core mechanism is how skills are stored, revised, and reused over time.

Relevance: 8 Novelty: 8
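
The abstract above says DualAdv-GRPO estimates advantages separately for task-solving actions and skill-editing decisions. A minimal sketch of that separation follows. It assumes the usual group-relative normalisation (reward minus group mean, divided by group standard deviation) for each stream; the abstract does not give the formula, and the reward values and the name `group_advantages` are made up for illustration.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: centre and scale rewards within one group.

    Assumption: this is the standard GRPO-style normalisation, not
    necessarily the exact estimator used in the paper.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rollouts for one prompt: each rollout yields a task reward
# and a counterfactual utility for the skill edit it proposed.
task_rewards = [1.0, 0.0, 0.0, 1.0]
edit_utilities = [0.2, -0.1, 0.0, 0.5]

# The two streams are normalised independently, so a large task reward
# cannot mask a harmful skill edit (or the reverse).
task_adv = group_advantages(task_rewards)
edit_adv = group_advantages(edit_utilities)
print(task_adv)
print(edit_adv)
```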

9. The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

ArXiv ID: 2605.08611

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Representation Learning Theory and Structure

Authors: Jared Glover

Abstract: Current language model memory systems store what happened but not how it felt. This distinction -- between semantic memory (knowing about a past event) and episodic memory (re-experiencing it) -- was identified by Tulving as the difference between noetic and autonoetic consciousness. Damasio demonstrated that humans with intact knowledge but absent emotional markers exhibit impaired decision-making. We bridge this gap for language models. Using Gemma 3 1B-IT with pretrained Gemma Scope 2 sparse autoencoders, we identify 310 emotion-exclusive features at layer 22 with psychologically valid geometry. We construct distinctive-feature emotion vectors during experience and partially re-inject them during recall, triggered by context similarity at layer 7. We test four conditions paralleling Damasio's framework: A (no memory), B (semantic labels), C (emotion echo), and BC (semantic + echo). For emotional orientation, the echo alone steepens the threat-safety gradient: the regression slope of threat rating on contextual similarity is 0.80 for C vs 0.56 for A ($p$=0.011, permutation test). For decisions, the echo amplifies knowledge into action: BC=80% good choices vs B=52% ($z$=+2.60, $p$<0.01), while the echo alone has no effect (C=22%, n.s.). The echo changes how the model feels independently, but changes what it does only when combined with knowledge -- replicating Damasio's core finding. The echo amplifies knowledge. It does not replace it.

Comment: Implements a recall-time emotion-vector re-injection mechanism that augments semantic memory with an episodic-like affective trace.

Topic Match: Primary fit is memory systems because the paper proposes a concrete storage-and-recall mechanism, with representation analysis serving that memory design.

Relevance: 8 Novelty: 8

10. HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations

ArXiv ID: 2605.09523

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics

Authors: Lennon J. Shikhman

Abstract: Neural operators provide fast surrogate models for time-dependent partial differential equations, but their standard autoregressive use usually assumes that the instantaneous field $u(t,\cdot)$ is a complete state. This assumption fails for delay equations, distributed-memory systems, and other non-Markovian dynamics: two trajectories may agree at time $t$ and nevertheless have different futures because their histories differ. We introduce the History-Space Fourier Neural Operator (HS-FNO), a neural operator for delay and memory-driven PDEs formulated on the lifted state $u_t(\theta,x)=u(t+\theta,x)$, $\theta\in[-\tau,0]$. The key computational step is to decompose one history-state update into a learned predictor for the newly exposed future slice and an exact shift-append transport for the portion of the history window already known from the previous state. This avoids learning deterministic history coordinates, reduces the learned output dimension, and enforces the natural discrete history update. We test HS-FNO on five benchmark families covering delayed reaction-diffusion, spatial epidemiology, nonlocal neural-field dynamics, delayed waves, and distributed-memory closures. Across ten random seeds, HS-FNO attains the lowest aggregate one-step, history-space, and rollout errors among the principal baselines. The largest gain occurs in autoregressive prediction, where aggregate rollout error decreases from $0.241$, $0.188$, and $0.185$ for current-state, lag-stack, and unconstrained history-to-history operators, respectively, to $0.094$. The same model uses fewer parameters than unconstrained history prediction. These results indicate that enforcing the discrete shift structure of history-state evolution is an effective inductive bias for non-Markovian PDE surrogate modeling.

Comment: Models non-Markovian PDEs in lifted history space and enforces exact shift-append dynamics as an architectural inductive bias.

Topic Match: Primary fit is memory systems because the paper explicitly represents and updates history-state memory rather than relying on a Markov state.

Relevance: 8 Novelty: 8
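
The shift-append decomposition described in the abstract above can be sketched on a discretised history window: only the newly exposed slice is predicted, and the rest of the window is copied forward exactly. In this sketch the learned predictor is replaced by a hand-written toy rule for the delay equation $du/dt = -a\,u(t-\tau)$; the names `shift_append_step` and `toy_predictor` are hypothetical, and nothing here reproduces the paper's Fourier neural operator.

```python
from typing import Callable, List

Slice = List[float]    # one spatial snapshot u(t, x) on a grid
History = List[Slice]  # discretised window, oldest slice first


def shift_append_step(history: History,
                      predict_new: Callable[[History], List[Slice]]) -> History:
    """Advance the history window by the number of predicted slices.

    The retained part of the window is transported exactly (a pure shift);
    only the new slices come from the predictor.
    """
    new_slices = predict_new(history)
    k = len(new_slices)
    if not 0 < k <= len(history):
        raise ValueError("predictor must return between 1 and len(history) slices")
    return history[k:] + new_slices


def toy_predictor(history: History, dt: float = 0.1, rate: float = 0.5) -> List[Slice]:
    """Stand-in for a learned model: explicit Euler step of du/dt = -rate * u(t - tau)."""
    newest, oldest = history[-1], history[0]
    return [[u - dt * rate * u_delayed for u, u_delayed in zip(newest, oldest)]]


window: History = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
for _ in range(5):
    window = shift_append_step(window, toy_predictor)
print(window[-1])
```

Because the shift is exact, any rollout error in this sketch can only enter through the predicted slice, which is the inductive bias the abstract describes.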

11. Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

ArXiv ID: 2605.09968

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics

Authors: Debashis Guha

Abstract: Every adaptive learning system must alternate between two operations: consolidating what it already knows and expanding into new evidence. We propose Consolidation-Expansion Operator Mechanics (OpMech), a framework that makes this structure precise. The central object is the order-gap $\Ogap(\theta; e)$, the degree to which a consolidation operator $Q$ and an expansion operator $P_e$ fail to commute at a given knowledge state. Because the order-gap is computable from the system's own trajectory, it serves as a real-time control signal: large values indicate that the system is still sensitive to the ordering of consolidation and expansion; once the order-gap falls and stays small, further processing is unlikely to change the outcome. Three results give the signal precise meaning: the order-gap decays along convergent trajectories; a persistently large order-gap implies the system is far from its settled state; and an order-gap-based stopping rule terminates with provable guarantees in both noiseless and bounded-noise settings. The framework applies across five domains: bandits, reinforcement learning, stochastic optimization, continual learning, and recursive language models. We give conditions under which the order-gap reliably tracks convergence in three representative cases. We develop the recursive language model application in detail, showing how OpMech replaces heuristic stopping rules and fixed recursion budgets with principled, evidence-driven alternatives.

Comment: Defines an order-gap between consolidation and expansion operators as a computable control signal for stopping and convergence in adaptive learners.

Topic Match: The framework is centered on consolidation versus expansion as a memory-like adaptive process, including explicit discussion of continual learning and recursive language models.

Relevance: 8 Novelty: 8
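
One way to read the order-gap from the abstract above is as the size of the commutator between the two operators, $|Q(P_e(\theta)) - P_e(Q(\theta))|$, evaluated along the trajectory. The sketch below uses that reading together with a simple patience-based stopping rule. Both the formula and the toy scalar operators are assumptions chosen so that the two operators share a fixed point at $\theta = 0$; the abstract does not spell out the paper's definitions or its guarantees.

```python
from typing import Callable, List, Tuple

Operator = Callable[[float], float]


def order_gap(theta: float, consolidate: Operator, expand: Operator) -> float:
    """Assumed form of the order-gap: how far the two orderings disagree."""
    return abs(consolidate(expand(theta)) - expand(consolidate(theta)))


def run_until_settled(theta: float, consolidate: Operator, expand: Operator,
                      tol: float = 1e-4, patience: int = 3,
                      max_steps: int = 1000) -> Tuple[float, List[float]]:
    """Iterate theta <- Q(P_e(theta)); stop once the gap stays below tol."""
    gaps: List[float] = []
    calm = 0
    for _ in range(max_steps):
        gap = order_gap(theta, consolidate, expand)
        gaps.append(gap)
        calm = calm + 1 if gap < tol else 0
        if calm >= patience:
            break
        theta = consolidate(expand(theta))
    return theta, gaps


def make_expand(evidence: float, step: float) -> Operator:
    """Toy expansion P_e: move a fraction `step` of the way toward the evidence."""
    return lambda theta: theta + step * (evidence - theta)


def toy_consolidate(theta: float) -> float:
    """Toy consolidation Q: saturating shrinkage with fixed point 0."""
    return theta / (1.0 + theta * theta)


final_theta, gaps = run_until_settled(2.0, toy_consolidate, make_expand(0.0, 0.3))
print(len(gaps), final_theta, gaps[0], gaps[-1])
```

In this toy the gap shrinks as the state approaches the shared fixed point, which is the behaviour the abstract attributes to convergent trajectories; it is not a check of the paper's theorems.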

12. CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents

ArXiv ID: 2605.08399

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Ziyang Yu, Qiyue Li, Liang Zhao

Abstract: Tool-augmented language models can extend small language models with external executable skills, but scaling the tool library creates a coupled challenge: the library must evolve with the planner as new reusable subroutines emerge, while retrieval from the growing library must remain within a fixed context budget. Existing tool-use and skill-library methods typically treat tools as flat or text-indexed memories, causing prompt cost to grow with library size and obscuring the typed, compositional structure of executable code. We propose CoCoDA, a framework that co-evolves the planner and tool library through a single code-native structure: a compositional code DAG. Nodes are primitive or composite tools, edges encode invocation dependencies, and each node stores a typed signature, description, pre/post-condition specification, and worked examples. At inference time, Typed DAG Retrieval prunes candidates by symbolic signature unification, ranks survivors by descriptions, filters them by behavioral specifications, and disambiguates with examples, keeping expensive context materialization on progressively smaller candidate sets. At training time, successful trajectories are folded into validated composite tools, while the planner is updated with a DAG-induced reward that credits composites by their primitive expansion size. We provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool-use and library-learning baselines.

Comment: Introduces a compositional code DAG that serves as both evolving tool library and structured retrieval memory, with typed retrieval and learned consolidation into composite tools.

Topic Match: Although framed around tool use, the core contribution is a structured memory mechanism for storing, composing, retrieving, and consolidating reusable skills.

Relevance: 8 Novelty: 8
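
The staged retrieval described in the abstract above (prune by signature, rank by description, filter by specification, disambiguate with examples) can be sketched as a funnel in which each stage sees fewer candidates than the last. Every stage here is a simplified stand-in: exact signature equality instead of unification, word overlap instead of a learned ranker, property tags instead of pre/post-conditions, and a single probe execution instead of worked examples. All names and the tiny tool library are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Signature = Tuple[Tuple[str, ...], str]  # (argument types, return type)


@dataclass(frozen=True)
class Tool:
    name: str
    signature: Signature
    description: str
    spec: FrozenSet[str]       # stand-in for pre/post-conditions
    fn: Callable[..., object]  # executable body of the tool


def retrieve(tools: List[Tool], signature: Signature, query: str,
             required_spec: FrozenSet[str],
             probe: Tuple[tuple, object], top_k: int = 3) -> List[Tool]:
    # Stage 1: cheap symbolic pruning by type signature.
    cands = [t for t in tools if t.signature == signature]
    # Stage 2: rank survivors by word overlap with the query, keep top_k.
    words = set(query.lower().split())
    cands.sort(key=lambda t: len(words & set(t.description.lower().split())),
               reverse=True)
    cands = cands[:top_k]
    # Stage 3: keep only tools whose declared properties cover the request.
    cands = [t for t in cands if required_spec <= t.spec]
    # Stage 4: disambiguate by running the remaining tools on one probe example.
    args, expected = probe
    return [t for t in cands if t.fn(*args) == expected]


INT2 = (("int", "int"), "int")
library = [
    Tool("add", INT2, "add two integers", frozenset({"commutative"}), lambda a, b: a + b),
    Tool("mul", INT2, "multiply two integers", frozenset({"commutative"}), lambda a, b: a * b),
    Tool("sub", INT2, "subtract two integers", frozenset(), lambda a, b: a - b),
    Tool("concat", (("str", "str"), "str"), "join two strings", frozenset(), lambda a, b: a + b),
]

hits = retrieve(library, INT2, "multiply two integers together",
                frozenset({"commutative"}), probe=((3, 4), 12))
print([t.name for t in hits])
```

Only the final stage executes code, so the most expensive check runs on the smallest candidate set, mirroring the ordering the abstract describes.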

World Models, Exploration, and Open-Ended Reinforcement Learning (26)

1. Latent Geometry Beyond Search: Amortizing Planning in World Models

ArXiv ID: 2605.08732

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Hoang Nguyen, Xiaohao Xu, Xiaonan Huang

Abstract: Modern vision-based world models can represent observations as compact yet expressive latent manifolds, but fast goal-oriented planning in these spaces remains challenging. This raises a central question: when does a learned representation simplify control, rather than merely enabling prediction? We study this question in a pretrained LeWorldModel, whose latent geometry is regularized for smoothness and uniformity. Our key insight is that, under such geometry, planning can be amortized into a latent inverse-dynamics mapping instead of requiring online search. We therefore replace iterative planning with a lightweight Goal-Conditioned Inverse Dynamics Model (GC-IDM) that maps the current latent state, goal latent state, and remaining horizon directly to the next action. Empirically, across four benchmark environments spanning navigation, contact-rich manipulation, and continuous control, our controller matches or exceeds CEM in seven of eight environment-protocol settings while reducing per-decision cost by 100-130x. A broader sweep over test-time planners (CEM, MPPI, iCEM, and gradient-based methods) shows that this result is not specific to a particular optimizer. These findings suggest that much of the structure recovered by test-time planning is already locally encoded in the latent representation. More broadly, our results indicate that sufficiently structured latent spaces can shift part of the planning burden from online optimization to learned inference.

Comment: Shows that sufficiently structured world-model latents can amortize planning into a goal-conditioned inverse-dynamics map, replacing online search.

Topic Match: The paper is fundamentally about world-model-based control and when latent geometry supports planning, making world models/RL the clearest fit.

Relevance: 10 Novelty: 8
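
The controller interface in the abstract above, mapping the current latent state, goal latent state, and remaining horizon directly to the next action, can be illustrated on a toy latent space where the answer is known in closed form. The sketch assumes additive latent dynamics $z' = z + a$ with bounded actions, so the inverse-dynamics map is written by hand rather than learned; `gc_idm` and `rollout` are hypothetical names and do not reproduce the paper's model.

```python
from typing import List, Sequence


def gc_idm(z: Sequence[float], z_goal: Sequence[float], horizon: int,
           a_max: float = 1.0) -> List[float]:
    """Toy goal-conditioned inverse dynamics for z' = z + a.

    Spreads the remaining displacement evenly over the remaining steps and
    clips each action component to [-a_max, a_max].
    """
    steps = max(horizon, 1)
    return [max(-a_max, min(a_max, (g - s) / steps)) for s, g in zip(z, z_goal)]


def rollout(z0: Sequence[float], z_goal: Sequence[float], horizon: int) -> List[float]:
    """Closed-loop control: one controller call per decision, no search."""
    z = list(z0)
    for remaining in range(horizon, 0, -1):
        action = gc_idm(z, z_goal, remaining)
        z = [s + a for s, a in zip(z, action)]
    return z


print(rollout([0.0, 0.0], [2.0, -1.0], horizon=4))
```

Each decision costs a single function evaluation, whereas a sampling-based planner such as CEM would evaluate many candidate action sequences per step; that contrast is the source of the per-decision savings the abstract reports.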

2. LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

ArXiv ID: 2605.08279

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Qixin Xiao, Maani Ghaffari

Abstract: Learning predictive world models from visual observations is a core problem in embodied AI, with applications to model-based reinforcement learning and robotic planning. Existing latent world models typically generate future states with unconstrained neural transition functions, while modern video generation systems often prioritize perceptual plausibility or introduce physical structure through auxiliary losses, external guidance, or separate dynamics modules. As a result, long-horizon rollouts can remain weakly grounded in the physical principles that govern real dynamics, leading to compounding error, energy drift, and physically inconsistent futures. We propose Least Action World Models (LaWM), a latent world-modeling framework that operationalizes the Principle of Least Action in learned visual latent space: future rollouts are governed by a learned Lagrangian action functional rather than produced only by an unconstrained transition predictor. Our main technical realization is a latent variational integrator: LaWM encodes observations into learned generalized coordinates, learns a latent discrete Lagrangian over consecutive latent states, constructs a discrete action functional, and advances prediction by solving the corresponding discrete integration condition. Thus, physical structure is not merely used to score, regularize, or constrain a completed trajectory; it defines the latent transition rule itself. Because the transition is induced by a discrete variational principle, LaWM provides a structure-preserving bias for long-horizon visual prediction. Across physics-clean synthetic dynamics and embodied robot interaction benchmarks, LaWM improves physical invariance, background consistency, motion smoothness, and appearance and geometric prediction metrics over video-generation and world-model baselines.

Comment: Defines world-model transitions through a learned latent variational integrator based on least action, making physical structure determine rollout dynamics rather than just regularize them.

Topic Match: This is directly about foundational world-model design, with a new action-principle-based transition mechanism for long-horizon prediction.

Relevance: 9 Novelty: 8
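
The idea in the abstract above, that the transition rule is obtained by solving a discrete variational condition rather than by a free-form predictor, can be seen in the simplest textbook case. For the hand-written discrete Lagrangian $L_d(q_0, q_1) = h\,[\tfrac{1}{2}((q_1 - q_0)/h)^2 - V(q_0)]$ with unit mass, the discrete Euler-Lagrange condition $D_2 L_d(q_{k-1}, q_k) + D_1 L_d(q_k, q_{k+1}) = 0$ gives $q_{k+1} = 2 q_k - q_{k-1} - h^2 V'(q_k)$. The sketch below rolls this out for a harmonic oscillator. In LaWM the Lagrangian is learned in a visual latent space; this example uses a fixed analytic one and is only meant to show the structure-preserving behaviour, with hypothetical function names.

```python
from typing import Callable, List


def del_step(q_prev: float, q_curr: float, h: float,
             grad_v: Callable[[float], float]) -> float:
    """Solve the discrete Euler-Lagrange condition for q_next (unit mass).

    Closed form for L_d(q0, q1) = h * (0.5 * ((q1 - q0) / h)**2 - V(q0)).
    """
    return 2.0 * q_curr - q_prev - h * h * grad_v(q_curr)


def rollout(q0: float, q1: float, h: float,
            grad_v: Callable[[float], float], steps: int) -> List[float]:
    """Generate a trajectory from two initial positions."""
    traj = [q0, q1]
    for _ in range(steps):
        traj.append(del_step(traj[-2], traj[-1], h, grad_v))
    return traj


h = 0.1
# Harmonic oscillator V(q) = q**2 / 2, started at rest from q = 1.
trajectory = rollout(1.0, 1.0 - 0.5 * h * h, h, grad_v=lambda q: q, steps=5000)
print(max(abs(q) for q in trajectory))
```

The amplitude stays bounded over thousands of steps instead of drifting, which is the kind of long-horizon behaviour a variational transition rule is meant to provide.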

3. From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay

ArXiv ID: 2605.09419

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Memory Structures and Agent Memory Systems

Authors: Yanan Xiao, Yixiang Tang, Zechen Feng, Lu Jiang, Minghao Yin, Pengyang Wang

Abstract: While experience replay is essential for data efficiency in reinforcement learning (RL), standard methods treat the replay buffer as a passive memory system, prioritizing samples based on numerical prediction errors rather than their semantic significance. This approach stands in contrast to human learning, which accelerates mastery by actively abstracting fragmented experiences into behavioral rules. To bridge this gap, we propose Neuro-Symbolic Experience Replay (NSER), a framework that transforms experience replay from a passive sample reuse mechanism into an active engine for knowledge construction. Specifically, NSER addresses the incompatibility between linguistic reasoning and numerical optimization through a novel neuro-symbolic grounding pipeline. It leverages Large Language Models (LLMs) in a zero-shot manner to induce candidate behavioral rules from accumulated trajectories, grounds these insights into differentiable first-order logic representations, and utilizes the resulting symbolic structures to dynamically reweight the replay distribution. By allowing abstract knowledge to directly shape policy optimization, NSER achieves consistently superior sample efficiency and convergence speed across reactive, rule-based, and procedural benchmarks.

Comment: Turns replay into an active knowledge-construction process by inducing symbolic behavioral rules from trajectories and using them to reweight sampling.

Topic Match: The paper directly rethinks experience replay as a learning mechanism for transferable behavioral structure in RL.

Relevance: 9 Novelty: 8

4. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

ArXiv ID: 2605.08253

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Boyang Xu, Qing Zou, Siqin Yang, Hao Yan

Abstract: Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from boundary mismatch at the flow source or from high-variance bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using source-consistent Bellman-coupled paths: the current path starts from the required base prior at $t{=}0$, reaches the Bellman target at $t{=}1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $\lambda$-parameterized control-variate target: $\lambda{=}0$ recovers an unbiased sample Bellman target, while $\lambda{>}0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.

Comment: Learns return distributions with flow matching using Bellman-coupled paths and shared-noise control variates to reduce variance.

Topic Match: The work contributes a new foundational RL learning principle for distributional value modeling rather than LLM post-training.

Relevance: 9 Novelty: 8

5. Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

ArXiv ID: 2605.10909

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Alex DeWeese, Guannan Qu

Abstract: This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improves the policy based on the one-step $Q$-function. In this work, we propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(\frac{1}{T})$ iterations, despite only assuming smoothness and differentiability of the value function. This will provide near optimal solutions to previously elusive applications like state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $||d_\mu^{\pi^*} / d_\mu^{\pi}||_\infty$ and $||d_\mu^{\pi^*} / \mu||_\infty$, enabling the $k$-step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.

Comment: Introduces k-step policy gradients to escape myopic local optima in restricted policy classes, with exponential improvement in k.

Topic Match: This is foundational RL optimization theory about policy improvement and local-optima escape, a strong match to the RL category.

Relevance: 9 Novelty: 8

6. ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models

ArXiv ID: 2605.10819

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo, Yandan Yang, Bin Liu, Zhejia Cai, Feng Xiong, Mu Xu, Jiachen Luo, De Ma, Zhiheng Ma, Gang Pan

Abstract: Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.

Comment: Learns algebraically consistent latent action transitions from videos and transfers that structured transition geometry into VLA policy learning.

Topic Match: The paper is fundamentally about learning reusable latent world-transition structure for action and policy generation, making world models the clearest fit.

Relevance: 9 Novelty: 8
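
The composition and reversal consistency terms mentioned in the abstract above can be written as two residuals on latent transitions for a frame triplet $(a, b, c)$: composition asks that $z_{ab} + z_{bc} \approx z_{ac}$, and reversal asks that $z_{ba} \approx -z_{ab}$. The sketch below measures both with a mean squared error, which is an assumption since the abstract does not give the loss form. The two toy "encoders" operate on raw vectors and are hypothetical; they only contrast an additive transition space with a non-additive one.

```python
from typing import Callable, List, Sequence

Vec = List[float]
Encoder = Callable[[Sequence[float], Sequence[float]], Vec]


def composition_error(z_ab: Vec, z_bc: Vec, z_ac: Vec) -> float:
    """Mean squared residual of z_ab + z_bc - z_ac."""
    return sum((x + y - w) ** 2 for x, y, w in zip(z_ab, z_bc, z_ac)) / len(z_ac)


def reversal_error(z_ab: Vec, z_ba: Vec) -> float:
    """Mean squared residual of z_ab + z_ba (zero when z_ba = -z_ab)."""
    return sum((x + y) ** 2 for x, y in zip(z_ab, z_ba)) / len(z_ab)


def additive_encoder(f0: Sequence[float], f1: Sequence[float]) -> Vec:
    """Toy transition code: plain difference, additive by construction."""
    return [v - u for u, v in zip(f0, f1)]


def squared_encoder(f0: Sequence[float], f1: Sequence[float]) -> Vec:
    """Toy transition code that discards direction, so it is not additive."""
    return [(v - u) ** 2 for u, v in zip(f0, f1)]


a, b, c = [0.0, 1.0], [1.0, 3.0], [4.0, 2.0]
for name, enc in (("additive", additive_encoder), ("squared", squared_encoder)):
    comp = composition_error(enc(a, b), enc(b, c), enc(a, c))
    rev = reversal_error(enc(a, b), enc(b, a))
    print(name, comp, rev)
```

A reconstruction objective alone would not distinguish these two encoders on a single pair of frames; the triplet-based residuals do, which is the structural supervision the abstract refers to.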

7. Do multimodal models imagine electric sheep?

ArXiv ID: 2605.09693

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun

Abstract: Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.

Comment: Shows action-trained VLMs internally encode intermediate visual states, effectively forming a latent visual world model without explicit supervision.

Topic Match: The strongest fit is world models because the paper directly studies emergent internal simulation of state transitions during sequential problem solving.

Relevance: 9 Novelty: 8

8. The Reciprocity Gradient

ArXiv ID: 2605.08323

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Yue Lin, Pascal Poupart, Shuhui Zhu, Dan Qiao, Wenhao Li, Yuan Liu, Hongyuan Zha, Baoxiang Wang

Abstract: Communication is fundamental to sustaining reciprocity and cooperation in strategic interactions. We identify and formulate the influence attribution problem as the central optimization difficulty inherent in such dynamics for a learning agent: any action or signal the agent emits reshapes the reputations of many third parties along combinatorially branching paths before feeding back into its own future rewards, forcing the agent to account for all of these indirect channels at once when choosing every action. To address this, we introduce the reciprocity gradient, which explicitly backpropagates reward gradients through private estimators of opponents' policies trained from public observations. The gradient flows through the reputation chain itself analytically, rather than being estimated from sampled returns. It jointly optimizes actions and evaluative signals without intrinsic rewards or reward shaping. Empirically, the method recovers near-optimal context-sensitive policies, while sample-based baselines collapse into constant-output policies.

Comment: Introduces the reciprocity gradient to backpropagate through learned opponent-policy estimators over long influence chains in strategic interaction.

Topic Match: This is foundational RL for strategic multi-agent learning, centered on a new gradient estimator for long-horizon reciprocal effects.

Relevance: 8 Novelty: 9

9. Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

ArXiv ID: 2605.08515

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Michael Groom, Victor-Alexandru Darvariu, Lars Kunze, James Wilson, Nick Hawes

Abstract: Unlike standard expected-return Reinforcement Learning (RL), Distributional RL (DRL) models the full return distribution, making it better-suited for uncertainty-aware and risk-sensitive decision-making. Conditional Flow Matching (CFM) critics have recently attracted attention for modelling continuous, multi-modal return distributions. Despite this interest, there remains a substantial metric mismatch: DRL theory relies on the distributional Bellman operator being contractive in the $p$-Wasserstein distance, yet existing CFM critics are trained with arbitrary source-target couplings, so their flow-matching losses are not Wasserstein-aligned surrogates for matching Bellman target return distributions. In this work, we address this mismatch by proposing FlowIQN, a CFM critic that sorts source and Bellman target samples within each mini-batch to approximate the monotone optimal transport coupling, replacing arbitrary pairings with quantile-aligned flow paths. We prove that the loss of our quantile-coupled CFM critic yields a Wasserstein-aligned approximate projection compatible with the foundations of DRL. To our knowledge, FlowIQN is the first flow-matching distributional critic with an explicit Wasserstein-aligned projection guarantee. We further extend FlowIQN with shortcut models for efficient inference. Empirical results show that FlowIQN improves Wasserstein return-distribution accuracy over other CFM critics. It also yields competitive performance on offline RL benchmarks across multiple policy extraction methods, providing a theoretically grounded CFM critic that is readily compatible with DRL pipelines. Code: https://github.com/ori-goals/flowIQN.

Comment: Makes flow-matching critics Wasserstein-aligned in distributional RL by quantile-coupling source and Bellman target samples.

Topic Match: This is a foundational RL-methods paper: it improves distributional critic learning with theory tied to Bellman contraction structure.

Relevance: 8 Novelty: 8
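
The sorting step described in the abstract above relies on a standard fact: for one-dimensional samples, pairing by rank gives the monotone optimal transport coupling for convex costs. The sketch below checks this on synthetic data. It is not the FlowIQN code; the function names and the Gaussian stand-ins for source noise and Bellman target returns are assumptions.

```python
import numpy as np

def quantile_couple(source, target):
    """Pair 1-D samples by rank: i-th smallest source with i-th smallest target."""
    return np.sort(source), np.sort(target)

def pairing_cost(x, y, p=2):
    """Average |x_i - y_i|^p over the given pairing."""
    return float(np.mean(np.abs(x - y) ** p))

rng = np.random.default_rng(0)
source = rng.normal(size=256)                      # stand-in for base noise samples
target = rng.normal(loc=3.0, scale=2.0, size=256)  # stand-in for Bellman target returns

# Arbitrary coupling: samples are paired in the order they were drawn.
arbitrary_cost = pairing_cost(source, target)

# Quantile coupling: sort both sides within the mini-batch, then pair by rank.
sorted_source, sorted_target = quantile_couple(source, target)
sorted_cost = pairing_cost(sorted_source, sorted_target)

print(f"arbitrary pairing cost: {arbitrary_cost:.3f}")
print(f"rank-sorted pairing cost: {sorted_cost:.3f}")
```

With `p=2`, the rank-sorted cost is the squared empirical 2-Wasserstein distance between the two batches, so it can never exceed the cost of any other pairing of the same samples.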

10. Generative Actor-Critic with Soft Bridge Policies

ArXiv ID: 2605.08733

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Ke He, Le He, Shunpu Tang, Yafei Wang, Lisheng Fan

Abstract: Expressive generative policies such as diffusion and flow models are appealing for MaxEnt online reinforcement learning because of their ability to model multimodal and highly non-Gaussian action distributions. However, training effective soft generative policies faces two obstacles that often arise together. First, marginal action densities are often unavailable, so existing methods typically rely on entropy bounds, heuristic proxies or approximations. Second, iterative shared-parameter samplers raise inference cost and require backpropagation through time over repeated network evaluations, increasing memory cost and destabilizing policy optimization. These obstacles motivate us to seek a generative policy that exposes a tractable MaxEnt objective while requiring only a single sampled actor forward pass for action generation. To this end, we propose soft generative actor-critic (SoftGAC), whose actor defines a stochastic bridge from a fixed base latent to a terminal action latent in pre-tanh space. This structured bridge allows us to lift the MaxEnt objective as an analytically tractable path-wise relative-entropy objective against a high-entropy reference process. In practical finite-step implementation, this relative entropy reduces exactly to sampled transition control energy and thus provides principled soft regularization. Moreover, we keep the single-pass actor lightweight by using small step-specific bridge transitions, each evaluated only once per sampled action, while maintaining a parameter budget comparable to strong actor baselines. Extensive experiments on challenging continuous-control benchmarks show that SoftGAC attains higher or competitive returns than strong generative policy baselines, including diffusion and flow-matching policies, while staying in the low-latency regime of one-pass actors and showing considerable improvements in the compute-return tradeoff.

Comment: Introduces a one-pass generative actor with tractable path-wise entropy regularization via stochastic bridge policies for MaxEnt RL.

Topic Match: The contribution is a new policy class and objective for online RL, not LLM post-training or benchmark tuning.

Relevance: 8 Novelty: 8

11. Zero-shot Imitation Learning by Latent Topology Mapping

ArXiv ID: 2605.08450

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Memory Structures and Agent Memory Systems

Authors: Maxwell J. Jacobson, Yexiang Xue

Abstract: Imitation learning is effective for training agents when expert demonstrations are available, but collecting demonstrations for every complex task in an environment is costly. We study the long-horizon, goal-conditioned setting where a fixed demonstration dataset contains useful behavior, but not complete examples for every task the agent must solve. Existing imitation learning methods can learn strong policies from demonstrations, but when solving long-horizon tasks, small errors accumulate over long primitive-action trajectories and make zero-shot adaptation to new tasks unreliable. We introduce Zero-shot Agents from Latent Topologies (ZALT), an imitation-learning method that solves unseen start-goal tasks beyond those demonstrated during training. ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions -- combined, these enable ZALT to perform zero-shot adaptation. In a complex 3D maze environment, ZALT achieves 55% zero-shot success on unseen tasks, compared to 6% for the strongest baseline.

Comment: Builds a latent hub-state topology and plans over hub-to-hub transitions, making demonstrated behaviors composable for zero-shot imitation.

Topic Match: Its core is a transferable interaction model over latent transitions for unseen tasks, fitting foundational agent learning and planning.

Relevance: 8 Novelty: 8
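
Planning over a hub topology, as the abstract above describes it, can be pictured as graph search over hub-to-hub edges harvested from demonstrations. The sketch below uses breadth-first search on a hand-written graph. It is an illustration only: the hub names, the two toy demonstrations, and the choice of BFS are assumptions, and ZALT's hub discovery, learned policies, and dynamics model are not represented.

```python
from collections import deque

def plan_over_hubs(edges, start, goal):
    """Breadth-first search over a directed hub graph.

    Returns the list of hubs from start to goal, or None if unreachable.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        hub = queue.popleft()
        if hub == goal:
            path = []
            while hub is not None:
                path.append(hub)
                hub = parents[hub]
            return path[::-1]
        for nxt in edges.get(hub, ()):
            if nxt not in parents:
                parents[nxt] = hub
                queue.append(nxt)
    return None

# Two hypothetical demonstrations: A -> B -> C and D -> B -> E.
# Hub B is where the trajectories converge and diverge.
hub_edges = {"A": ["B"], "B": ["C", "E"], "D": ["B"]}

# The task A -> E appears in neither demonstration, but the topology composes it.
route = plan_over_hubs(hub_edges, "A", "E")
print(route)
```

Each edge in the returned route would be executed by a hub-to-hub policy, so a long primitive-action task becomes a short sequence of abstract transitions.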

12. One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

ArXiv ID: 2605.09727

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Bowen He, Juncheng Dong, Lin Lin, Xiang Cheng

Abstract: A central challenge in reinforcement learning (RL) is to learn models that generalize beyond the tasks on which they are trained, a goal traditionally pursued through multi-task and meta RL. Recently, transformer architectures have emerged as a promising approach, enabling adaptation to new tasks via in-context learning without explicit parameter updates. From a functional perspective, a transformer can be viewed as a functional operator that maps a context to a task-specific function. It is thus fundamental to understand and design this operator to support stronger generalization in RL. In this work, we address this resulting question of generalization from a kernel-based perspective by establishing a connection between non-linear transformers and kernel-based temporal difference learning. By interpreting the transformer as performing regression in a Reproducing Kernel Hilbert Space (RKHS), we show that value functions from different domains can be represented using a shared set of weights, provided they lie within the same RKHS. Experiments on multiple MetaWorld domains support this interpretation, demonstrating convergence of the temporal-difference objective.

Comment: Connects non-linear transformers for in-context RL to RKHS temporal-difference learning, explaining cross-domain generalization via shared weights in function space.

Topic Match: The paper is fundamentally about generalization in in-context RL and provides a theoretical account for it.

Relevance: 8 Novelty: 8

13. PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

ArXiv ID: 2605.08982

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Yaniv Oren, Viliam Vadocz, Joery A. de Vries, Wendelin Böhmer, Matthijs T. J. Spaan, Hendrik Baier

Abstract: Monte Carlo Tree Search (MCTS) is a widely used approach for policy improvement through search, with increasing popularity in real-world applications. Due to the sequential and deterministic nature of its search, scaling MCTS runtime with parallel compute remains a major challenge. We introduce Particle MCTS (PMCTS), to our knowledge the first principled parallel MCTS algorithm that is suited to neural network evaluations and can preserve formal policy improvement guarantees. Empirically, PMCTS scales well with parallel compute and significantly outperforms popular heuristic-based baselines across domains.

Comment: Gives a principled parallel MCTS variant that preserves policy-improvement guarantees while scaling neural search with parallel compute.

Topic Match: Best fit is world-models/open-ended RL because this is a foundational search and planning contribution for RL-style decision making, not LLM post-training.

Relevance: 8 Novelty: 8

14. Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

ArXiv ID: 2605.10671

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Phalguni Nanda, Zaiwei Chen

Abstract: In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$ for computing an $\epsilon$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.

Comment: Recasts natural policy gradient exactly as doubly smoothed policy iteration and proves global geometric convergence via Bellman operators.

Topic Match: Primary fit is foundational RL theory: it gives a new operator view and convergence result for a core RL algorithm rather than LLM alignment-style RL.

Relevance: 8 Novelty: 8
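
The link between natural policy gradient and a greedy step on accumulated Q-functions can be seen in the tabular softmax case, where the NPG update takes the well-known multiplicative form pi_{k+1}(a|s) proportional to pi_k(a|s) exp(eta Q_k(s,a)). Unrolling it from a uniform start gives a softmax over the sum of past Q-tables. The sketch below checks only this algebraic identity. It is not the paper's DSPI framework: the Q-tables are random placeholders rather than evaluated policies, and the 1/(1-gamma) factor is assumed to be folded into `eta`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_states, n_actions, n_iters, eta = 4, 3, 5, 0.7

# Placeholder Q-tables, one per iteration (not computed from any MDP).
q_tables = rng.normal(size=(n_iters, n_states, n_actions))

# (a) Iterative form: multiplicative-weights update starting from uniform.
pi_iterative = np.full((n_states, n_actions), 1.0 / n_actions)
for q in q_tables:
    pi_iterative = pi_iterative * np.exp(eta * q)
    pi_iterative /= pi_iterative.sum(axis=1, keepdims=True)

# (b) Unrolled form: softmax (a regularized greedy step) on the summed Q-tables.
pi_unrolled = softmax(eta * q_tables.sum(axis=0), axis=1)

print(np.max(np.abs(pi_iterative - pi_unrolled)))
```

The summed Q-tables equal `n_iters` times their uniform average, which is one simple instance of the "weighted average of past Q-functions" wording in the abstract.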

15. Policy Gradient Methods for Non-Markovian Reinforcement Learning

ArXiv ID: 2605.10816

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Memory Structures and Agent Memory Systems

Authors: Avik Kar, Siddharth Chandak, Rahul Singh, Soumitra Sinhahajari, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos

Abstract: We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

Comment: Extends policy gradient theory to non-Markovian RL by jointly optimizing recurrent agent-state dynamics and policy under a reward-centric formulation.

Topic Match: The paper is fundamentally about RL with learned internal state for non-Markovian environments, which fits the foundational RL/exploration category best.

Relevance: 8 Novelty: 8

16. Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift

ArXiv ID: 2605.09183

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Surbhi Goel, Jonathan Pei, James Wang

Abstract: Behavior cloning provides strong imitation learning guarantees when training and test environments share the same dynamics. However, in many deployment settings the test environment's transitions differ from training, and classical offline IL offers no recourse: the learner must commit to an action at every state, even when its demonstrations are uninformative and could lead to arbitrary degradation of performance. This motivates the study of selective imitation, where the learner may choose to stop when it cannot act reliably. We introduce a model for selective imitation under arbitrary dynamics shift: given labeled expert demonstrations from a training environment and unlabeled state trajectories from the same expert in a test environment, the learner outputs a selective policy that is complete (rarely stops in training) and sound (incurs low regret before stopping in test). Our algorithm, SeqRejectron, constructs a stopping rule using a small set of validator policies whose size is independent of the horizon or policy class. For deterministic policies, this yields horizon-free $\tilde{O}(\log|\Pi|/\epsilon^2)$ sample complexity, assuming sparse costs. For stochastic policies, we obtain analogous horizon-free guarantees using a cumulative Hellinger stopping time. We extend the framework to misspecified experts and different expert policies across train and test and obtain results that gracefully degrade with the amount of misspecification.

Comment: Formulates selective imitation under arbitrary dynamics shift and gives horizon-free guarantees for when an imitator should stop acting.

Topic Match: The main contribution is a new imitation/RL-theoretic framework for robust behavior under environment shift, not LLM post-training.

Relevance: 8 Novelty: 8

17. On Characterizing Learnability for Adversarial Noisy Bandits

ArXiv ID: 2605.09200

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Steve Hanneke, Kun Wang

Abstract: We study adversarial noisy bandits given a known function class $\mathcal{F}$. In each round, the adversary selects a function $f \in \mathcal{F}$, the learner chooses an arm, and then observes a noisy reward determined by the chosen arm and the function $f$. The goal is to minimize the cumulative regret $R(T)$, defined as the difference between the learner's performance and that of the best fixed arm in hindsight over $T$ rounds. We say that a function class $\mathcal{F}$ is learnable if there exists an algorithm achieving sublinear regret. Our main results concern characterizing learnability. The main quantity appearing in our characterization is a convexified variant of the generalized maximin volume introduced by Hanneke and Wang (2025). For oblivious adversaries, we characterize learnability in terms of this convexified generalized maximin volume. For adaptive adversaries, we show that the same quantity characterizes learnability when the arm space is countable. Our analysis builds on a connection between convexified generalized maximin volume and the existence of simple hitting sets. We further conjecture that the same quantity also characterizes learnability when the arm space is uncountable, via its relation to a new complexity measure, which we call the distribution covering number. This notion can be viewed as a strengthened form of the hitting set that still admits efficient learning via the multiplicative weights algorithm. We also pose a number of relevant open questions regarding this problem.

Comment: Characterizes adversarial noisy bandit learnability via a convexified generalized maximin volume and related covering notions.

Topic Match: This is foundational online learning theory directly about when exploration under adversarial noise is possible.

Relevance: 8 Novelty: 8

18. Central Limit Theorem for Two-Time-Scale Approximate Distributionally Robust RL

ArXiv ID: 2605.08417

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Shengbo Wang, Zexi Zhang

Abstract: Designing model-free algorithms for distributionally robust reinforcement learning (DRRL) poses fundamental challenges. The robust Bellman operator is nonlinear in the transition kernel, which makes one-sample Bellman updates biased, while the adversarial optimization underlying robustness makes robust evaluation computationally demanding. To address these difficulties, we consider the natural small-ambiguity regime under Kullback-Leibler ambiguity sets and propose an approximate DRRL framework based on a first-order expansion of the relevant robust functional. This yields an approximate robust Bellman equation that removes the adversarial optimization while remaining first-order accurate in the ambiguity radius. To learn the fixed point of this approximate equation, we propose Mean-Variance Stochastic Approximation (MVSA), a model-free algorithm that uses only one-sample updates. This is achieved via a lifted stochastic approximation dynamics and a two-time-scale design. We then prove convergence and a central limit theorem for MVSA: its main iterate satisfies a central limit theorem at the canonical $n^{-1/2}$ scale, with explicitly characterized asymptotic covariances. Finally, we validate our theoretical findings with a numerical experiment.

Comment: Provides a two-time-scale one-sample algorithm for approximate distributionally robust RL together with a central limit theorem.

Topic Match: The paper is a foundational model-free RL theory contribution on robust Bellman learning and stochastic approximation behavior.

Relevance: 8 Novelty: 8

19. Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

ArXiv ID: 2605.09363

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Soumita Hait, Ping Li, Haipeng Luo, Mengxiao Zhang

Abstract: Last-iterate convergence of learning dynamics in games has attracted significant recent attention. In two-player zero-sum games with bandit feedback, where only the loss of the selected action pair is observed, Fiegel et al. (2025) show a separation between average-iterate and last-iterate convergence in duality gap: while the optimal $t^{-1/2}$ rate after $t$ rounds is achievable for the former via standard no-regret algorithms, the latter cannot converge faster than $t^{-1/3}$ in expectation or $t^{-1/4}$ with high probability. However, in many practical settings, such as preference learning, the players observe not only their loss but also the opponent's action. This raises a natural question: can such additional information enable faster last-iterate convergence? We answer this question affirmatively, showing that $t^{-1/2}$ last-iterate convergence is achievable with high probability in this setting, via an efficient algorithm that updates its strategy infrequently by solving an estimated log-barrier-regularized game. We identify fundamental obstacles preventing standard analysis for multi-armed bandits, the single-player case, from generalizing to games, and develop a novel analysis to overcome them. Experiments confirm that our algorithm indeed converges faster than naive baselines and prior methods that do not exploit opponent-action feedback. Finally, we note that our results also improve those for dueling bandits, a special case with skew-symmetric game matrices.

Comment: Achieves near-optimal last-iterate convergence in zero-sum games by exploiting opponent-action feedback under bandit losses.

Topic Match: This is a foundational learning-in-games result on convergence under partial feedback, squarely within RL/online learning theory.

Relevance: 8 Novelty: 8
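
The duality gap used as the convergence measure in the abstract above can be computed directly for a small matrix game. The sketch below assumes the convention that the row player minimizes and the column player maximizes the expected loss x^T A y, and uses rock-paper-scissors (a skew-symmetric matrix) as a toy example. It illustrates the metric only, not the paper's algorithm.

```python
import numpy as np

def duality_gap(loss_matrix, x, y):
    """Duality gap of the strategy pair (x, y) in a zero-sum matrix game.

    Assumed convention: row player minimizes x^T A y, column player maximizes it.
    The gap is zero exactly at a Nash equilibrium and positive otherwise.
    """
    column_best_response_value = np.max(x @ loss_matrix)  # worst case for the row player
    row_best_response_value = np.min(loss_matrix @ y)     # worst case for the column player
    return float(column_best_response_value - row_best_response_value)

# Rock-paper-scissors losses for the row player (rows and columns: rock, paper, scissors).
rps = np.array([
    [0.0, 1.0, -1.0],
    [-1.0, 0.0, 1.0],
    [1.0, -1.0, 0.0],
])

uniform = np.full(3, 1.0 / 3.0)
rock = np.array([1.0, 0.0, 0.0])

gap_at_equilibrium = duality_gap(rps, uniform, uniform)
gap_at_pure = duality_gap(rps, rock, rock)
print(gap_at_equilibrium, gap_at_pure)
```

Last-iterate convergence asks that this quantity shrink for the strategies actually played at round t, rather than for their running averages.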

20. Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies

ArXiv ID: 2605.08558

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Muyun Lu, Haoyang Hong, Huazheng Wang, Ying Lin

Abstract: As an extension of the classical multi-armed bandit problem, multi-fidelity multi-armed bandits (MF-MAB) enable individual arms to be evaluated using diverse feedback sources that vary in both cost and accuracy. Prior stochastic models typically assume fixed low-to-high fidelity discrepancies, whereas modern proxy sources, such as learning-based simulators and Large Language Models (LLMs), can be improved using additional calibration. We investigate adaptive MF-MAB with improving proxy sources, and focus on the canonical two-fidelity case in which the low-fidelity source becomes more informative with repeated use. To capture this dynamic, we introduce a selected-average mismatch bound that converts dynamic low-fidelity observations into improvement-aware confidence bounds for the high-fidelity target. We propose the Threshold-Based Adaptive Continuation Companion (TACC), an optimistic algorithm that uses a bounded continuation rule to decide when low-fidelity sampling remains cost-effective and when to escalate. We prove an instance-dependent regret bound showing that, for detected intermediate arms, adaptive continuation replaces logarithmic high-fidelity confirmation with bounded low-fidelity continuation. Experiments on synthetic bandits and an LLM-as-a-judge policy-evaluation task examine when continuation improves cost-weighted regret.

Comment: Models low-fidelity feedback that improves with reuse and gives an adaptive continuation algorithm with regret guarantees.

Topic Match: This is a foundational bandit paper on exploration with adaptive proxy fidelities, not an application of bandits to a task domain.

Relevance: 8 Novelty: 8

21. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

ArXiv ID: 2605.09423

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Haoqiang Kang, Xiaokang Ye, Yuhan Liu, Siddhant Hitesh Mantri, Lingjun Mao, James Fleming, Drishti Regmi, Lianhui Qin

Abstract: LLM/VLM-based digital agents have advanced rapidly thanks to scalable sandboxes for coding, web navigation, and computer use, which provide rich interactive training grounds. In contrast, embodied agents still lack abundant, diverse, and automatically generated 3D environments for interactive learning. Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM-based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning interfaces. We introduce SimWorld Studio, an open-source platform built on Unreal Engine 5 for generating evolving embodied learning environments. At its core is SimCoder, a tool/skill-augmented coding agent that writes and executes engine-level code to construct physically grounded 3D worlds from language/image instructions. SimCoder self-evolves by using verifier feedback (e.g., compilation errors, physics checks, VLM critiques) to revise environments and autonomously add reusable tools and skills to its library. Generated worlds are exported as Gym-style environments for embodied agent learning. SimWorld Studio further enables co-evolution between environment generation and embodied learning: agent performance feedback guides SimCoder to generate adaptive curricula near the learner's capability frontier, so that environments become increasingly challenging as the embodied agent improves. Three case studies on embodied navigation show that self-evolution improves generation reliability, generated environments substantially improve embodied agent performance that generalizes to unseen benchmarks, and co-evolution yields an 18-point success-rate gain over fixed-environment learning and a 40-point gain over an untrained agent.

Comment: Generates embodied learning environments through co-evolution between a world-building coding agent and the learner's capability frontier.

Topic Match: The key idea is automatic environment generation and adaptive curriculum for embodied agents, directly matching open-ended RL and exploration themes.

Relevance: 8 Novelty: 8

22. Continual Harness: Online Adaptation for Self-Improving Foundation Agents

ArXiv ID: 2605.09998

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Memory Structures and Agent Memory Systems

Authors: Seth Karten, Joel Zhang, Tersoo Upaa Jr, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli

Abstract: Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents' long-horizon partial-observability decision-making. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside human-in-the-loop refinement. Continual Harness removes the human fully from this loop: a reset-free self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.

Comment: Introduces a reset-free online adaptation harness where embodied agents refine prompts, skills, subagents, and memory within a single ongoing run.

Topic Match: The core is continual self-improvement in an embodied interactive setting, closer to continual RL/open-ended learning than to standard agent tooling.

Relevance: 8 Novelty: 8

23. When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

ArXiv ID: 2605.09860

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Chen Li, Zhantao Yang, Fangyi Chen, Han Zhang, Anudeepsekhar Bolimera, Marios Savvides

Abstract: Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as commitment depth: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision-language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision-language model achieves 0% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.

Comment: Treats commitment depth as a state-conditioned policy variable, learning when to replan versus execute open-loop in long-horizon reasoning.

Topic Match: This is fundamentally temporal abstraction discovery for sequential decision-making, a classic open-ended RL concern even though implemented with VLM policies.

Relevance: 8 Novelty: 8
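
The trade-off described in the abstract above can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the cost function `segment_cost`, the replanning cost of 0.2, and the two per-step error rates are hypothetical choices made for illustration, not the paper's surrogate or its experimental setup. It compares the best single fixed depth against a depth chosen separately for each state type.

```python
def segment_cost(depth, step_error, replan_cost=0.2):
    """Hypothetical toy cost: replanning cost amortized over the segment,
    plus the probability that at least one open-loop action fails."""
    return replan_cost / depth + (1.0 - (1.0 - step_error) ** depth)


def best_depth(step_error, depths=range(1, 9)):
    """Depth that minimizes the toy cost for one state type."""
    return min(depths, key=lambda d: segment_cost(d, step_error))


def compare_policies(step_errors=(0.01, 0.2), depths=range(1, 9)):
    """Return (best fixed-depth cost, state-conditioned cost), averaged over state types."""
    n = len(step_errors)
    # One depth shared by every state type.
    fixed = min(sum(segment_cost(d, e) for e in step_errors) / n for d in depths)
    # A separate depth for each state type.
    adaptive = sum(segment_cost(best_depth(e, depths), e) for e in step_errors) / n
    return fixed, adaptive


fixed, adaptive = compare_policies()
print(f"best fixed depth cost:  {fixed:.4f}")
print(f"state-conditioned cost: {adaptive:.4f}")
```

Because the minimum of an average can never fall below the average of per-state minima, the state-conditioned cost is at most the fixed-depth cost in this toy, and strictly lower when the two state types prefer different depths.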

24. The Value of Mechanistic Priors in Sequential Decision Making

ArXiv ID: 2605.10018

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Itai Shufaro, Gal Benor, Shie Mannor

Abstract: Hybrid mechanistic models, physical priors with learned residuals, promise to reduce the data required for good decisions, but have no computable criterion to test this. We characterize the value of mechanistic priors in sequential decision-making within both asymptotic and burn-in regimes. To formalize this, we introduce the mechanistic information of a model -- the mutual information between the model's recommended policy $\hat{\pi}$ and the true optimal policy $\pi^*$ -- quantified via an occupancy-weighted bias $B_\mu$. In the asymptotic regime (large $N$), matched bounds reveal that Bayesian regret scales with the residual entropy $H_{\mathrm{mech}}$, delivering a theoretical sample complexity reduction of $H(\mu)/H_{\mathrm{mech}}$ compared to an uninformed baseline. Furthermore, we provide a model certificate to determine empirical sample efficiency. Complementarily, in the clinically relevant burn-in regime (small $N$), we establish a lower bound on the penalty incurred by confidently wrong priors. We demonstrate both the asymptotic and burn-in bounds across 5-fluorouracil (5-FU) dosing simulations motivated by published FOLFOX pharmacokinetic data, where a hybrid prior yields large sample-efficiency gains in the burn-in regime. Finally, we contrast these grounded models with LLM priors, demonstrating that LLMs can suffer severe losses in mechanistic information, thereby motivating the exclusive use of physically-grounded priors for safety-critical applications.

Comment: Defines mechanistic information between a prior model's recommended policy and the true optimal policy to quantify sample-efficiency gains in decision making.

Topic Match: The paper is about foundational theory for sequential decision making with structured priors, not LLM post-training or application-specific RL.

Relevance: 8 Novelty: 8

25. Shields to Guarantee Probabilistic Safety in MDPs

ArXiv ID: 2605.10888

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Linus Heck, Filip Macák, Roman Andriushchenko, Milan Češka, Sebastian Junges

Abstract: Shielding is a prominent model-based technique to ensure safety of autonomous agents. Classical shielding aims to ensure that nothing bad ever happens and comes with strong guarantees about safety and maximal permissiveness. However, shielding systems for probabilistic safety, where something bad is allowed to happen with an acceptable probability, has proven to be more intricate. This paper presents a formal framework that conservatively extends classical shields to probabilistic safety. In this framework, we (i) demonstrate the impossibility of preserving the strong guarantees on safety and permissiveness, (ii) provide natural shields with weaker guarantees, and (iii) introduce offline and online shield constructions ensuring strong safety guarantees. The empirical evaluation highlights the practical advantages of the new shields, as well as their computational feasibility.

Comment: Develops a formal framework for probabilistic safety shielding in MDPs, with impossibility results and new offline/online shield constructions.

Topic Match: This is foundational RL/control theory about safe decision-making in MDPs, not LLM post-training.

Relevance: 8 Novelty: 8

26. Switching-Geometry Analysis of Deflated Q-Value Iteration

ArXiv ID: 2605.10811

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Donghwan Lee

Abstract: This paper develops a joint spectral radius (JSR) framework for analyzing rank-one deflated Q-value iteration (Q-VI) in discounted Markov decision process control. Focusing on an all-ones residual correction, we interpret the resulting algorithm through the geometry of switching systems and, to the best of our knowledge, give the first JSR-based convergence analysis of deflated Q-VI for policy optimization problems. Our analysis reveals that the standard Q-VI switching system model has JSR exactly the discount factor $\gamma\in (0,1)$, since all admissible subsystems share the all-ones vector as an invariant direction. By passing to the quotient space that removes this direction, we obtain a projected switching system model whose JSR governs the relevant error dynamics and may be strictly smaller than $\gamma$. Therefore, the deflated Q-VI admits a potentially sharper convergence-rate characterization than the ambient-space $\gamma$-bound. Finally, we prove that the correction is equivalent to a scalar recentering of standard Q-VI. Hence, the projected trajectory, and therefore the greedy-policy sequence, is unchanged relative to standard Q-VI initialized from the same point. The benefit of deflation is not a change in the induced decision-making problem, but a more precise JSR-based description of the convergence geometry after the redundant all-ones component is removed.

Comment: Gives a joint-spectral-radius analysis of deflated Q-value iteration, showing convergence geometry after removing the invariant all-ones direction.

Topic Match: This is a foundational RL-theory paper about value-iteration dynamics and convergence structure, squarely within the world-models/RL bucket.

Relevance: 8 Novelty: 8
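
The claim that a scalar recentering leaves the greedy-policy sequence unchanged can be checked on a small example. The sketch below is an assumption-laden illustration, not the paper's deflated algorithm: it runs standard Q-value iteration on a randomly generated MDP and, in parallel, a variant that subtracts the mean of the Q-table after every update. The helper names (`make_mdp`, `bellman`, `recenter`, `compare`) and the MDP itself are hypothetical. Exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
import random


def make_mdp(n_states, n_actions, seed=0):
    """Hypothetical random MDP with exact rational rewards and transition rows."""
    rng = random.Random(seed)
    rewards = [[Fraction(rng.randint(0, 10)) for _ in range(n_actions)]
               for _ in range(n_states)]
    trans = []
    for _ in range(n_states):
        row = []
        for _ in range(n_actions):
            weights = [rng.randint(1, 5) for _ in range(n_states)]
            total = sum(weights)
            row.append([Fraction(w, total) for w in weights])  # sums to exactly 1
        trans.append(row)
    return rewards, trans


def bellman(q, rewards, trans, gamma):
    """One step of standard Q-value iteration."""
    v = [max(row) for row in q]
    return [[rewards[s][a] + gamma * sum(p * v[t] for t, p in enumerate(trans[s][a]))
             for a in range(len(q[s]))] for s in range(len(q))]


def recenter(q):
    """Subtract the mean entry, removing a multiple of the all-ones direction."""
    flat = [x for row in q for x in row]
    mean = sum(flat) / len(flat)
    return [[x - mean for x in row] for row in q]


def greedy(q):
    return [max(range(len(row)), key=row.__getitem__) for row in q]


def compare(n_iters=15, gamma=Fraction(9, 10)):
    """True if both runs share greedy policies and differ by one scalar at every step."""
    rewards, trans = make_mdp(4, 3)
    q_std = [[Fraction(0)] * 3 for _ in range(4)]
    q_rec = [[Fraction(0)] * 3 for _ in range(4)]
    same = True
    for _ in range(n_iters):
        q_std = bellman(q_std, rewards, trans, gamma)
        q_rec = recenter(bellman(q_rec, rewards, trans, gamma))
        offsets = {a - b for ra, rb in zip(q_std, q_rec) for a, b in zip(ra, rb)}
        same = same and greedy(q_std) == greedy(q_rec) and len(offsets) == 1
    return same


print(compare())
```

The check works because the Bellman update maps a shift of Q by a constant c to a shift by γc, so the two trajectories differ only along the all-ones direction and their argmax actions coincide.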

Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Relevant Topics

Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.

Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.

Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.

Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.

Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.

Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.

World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.

Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.

7-8: substantially related, but partly peripheral or focused on a narrower aspect.

5-6: touches the target topics, but the main contribution is elsewhere.

3-4: largely outside the target topics, often application-focused or domain-specific.

1-2: unrelated.

Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.

Novelty Scoring

9-10: new paradigm, theory, or major methodological breakthrough.

7-8: substantial methodological advance or strong new insight.

5-6: meaningful but incremental extension or refinement.

3-4: minor, narrow, or mostly engineering or domain-specific improvement.

1-2: little originality; mainly standard application of existing methods.

Topic Registry

Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.

Papers

[PAPER LIST HERE]

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}

Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
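
The output rules above lend themselves to a mechanical check on each response line. The following is a minimal sketch of such a check, assuming the field names, topic IDs, and tag set stated in the prompt; the function name `validate_record` and its error messages are hypothetical and not part of the digest pipeline. The sample record reuses the ID, comment, and scores of one paper listed in this digest.

```python
import json

ALLOWED_TOPICS = {
    "architecture_training", "efficiency_scaling", "representation_structure",
    "memory_systems", "world_models_open_ended_rl",
}
ALLOWED_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT", "HOTSPOT_PAPER_TAGS",
    "HOTSPOT_PAPER_COMMENT",
}


def validate_record(line):
    """Return a list of rule violations for one JSONL line (empty if none found)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(rec, dict):
        return ["line is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in ("RELEVANCE", "NOVELTY"):
        value = rec.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
            problems.append(f"{key} must be an integer from 1 to 10")
    primary = rec.get("PRIMARY_TOPIC_ID")
    if not isinstance(primary, str) or primary not in ALLOWED_TOPICS:
        problems.append("PRIMARY_TOPIC_ID is not in the topic registry")
    matched = rec.get("MATCHED_TOPIC_IDS", [])
    if not isinstance(matched, list) or not all(
            isinstance(t, str) and t in ALLOWED_TOPICS for t in matched):
        problems.append("MATCHED_TOPIC_IDS contains an unknown topic ID")
    elif len(matched) > 1 and primary not in matched:
        problems.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")
    tags = rec.get("HOTSPOT_PAPER_TAGS", [])
    if not isinstance(tags, list) or not all(
            isinstance(t, str) and t in ALLOWED_TAGS for t in tags):
        problems.append("HOTSPOT_PAPER_TAGS contains an unknown tag")
    elif not tags and rec.get("HOTSPOT_PAPER_COMMENT") != "":
        problems.append("HOTSPOT_PAPER_COMMENT must be empty when there are no tags")
    return problems


sample = json.dumps({
    "ARXIVID": "2605.10811",
    "COMMENT": "Gives a joint-spectral-radius analysis of deflated Q-value iteration.",
    "RELEVANCE": 8, "NOVELTY": 8,
    "PRIMARY_TOPIC_ID": "world_models_open_ended_rl",
    "MATCHED_TOPIC_IDS": ["world_models_open_ended_rl"],
    "TOPIC_MATCH_COMMENT": "Foundational RL theory about value-iteration dynamics.",
    "HOTSPOT_PAPER_TAGS": [], "HOTSPOT_PAPER_COMMENT": "",
})
print(validate_record(sample))
```

Note that the template object in the prompt uses 0 as a placeholder score, so a check like this would flag the template itself; only filled-in records are expected to pass.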

Architecture and Training Dynamics (66)

1. Hierarchical Mixture-of-Experts with Two-Stage Optimization

2. ELF: Embedded Language Flows

3. Priming: Hybrid State Space Models From Pre-trained Transformers

4. Attention Drift: What Autoregressive Speculative Decoding Models Learn

5. Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

6. A Single-Layer Model Can Do Language Modeling

7. TIDES: Implicit Time-Awareness in Selective State Space Models

8. FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences

9. Complex-Valued Phase-Coherent Transformer

10. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

11. Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

12. Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge

13. A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

14. SDG-MoE: Signed Debate Graph Mixture-of-Experts

15. Key-Value Means

16. Continuity Laws for Sequential Models

17. Muown: Row-Norm Control for Muon Optimization

18. Kaczmarz Linear Attention

19. Scaling Limits of Long-Context Transformers

20. Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

21. Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

22. Mixture of Layers with Hybrid Attention

23. Sparse Layers are Critical to Scaling Looped Language Models

24. Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

25. Kinetic theory for Transformers and the lost-in-the-middle phenomenon

26. Predicting Plasticity in Deep Continual Learning: A Theoretical Perspective

27. Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

28. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

29. Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models

30. Teaching LLMs to See Graphs: Unifying Text and Structural Reasoning

31. Lattice Deduction Transformers

32. Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy Pooling

33. bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

34. Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

35. Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising

36. The Power of Second Order Methods for Sequence Preconditioning

37. NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

38. Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

39. On Variance Reduction in Learning Mean Flows

40. Infinite Mask Diffusion for Few-Step Distillation

41. Phases of Muon: When Muon Eclipses SignSGD

42. Controlling Transient Amplification Improves Long-horizon Rollouts

43. Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit

44. Optimizer-Induced Mode Connectivity: From AdamW to Muon

45. Fitting Multilinear Polynomials for Logic Gate Networks

46. Hyperparameter Transfer for Dense Associative Memories

47. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

48. Dimension-Free Saddle-Point Escape in Muon

49. Parameterized Complexity of Stationarity Testing for Piecewise-Affine Functions and Shallow CNN Losses

50. Minimal Filling Architectures of Polynomial Neural Networks: Counterexamples, Frontier Search, and Defects

51. CATO: Charted Attention for Neural PDE Operators

52. RAwR: Role-Aware Rewiring via Approximate Equitable Partition

53. When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains

54. Exactness Matters for Physical Rule Enforcement

55. Exact Fixed-Point Constraints in Neural-ODEs with Provable Universality

56. Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

57. Elucidating Representation Degradation Problem in Diffusion Model Training

58. A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models

59. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

60. Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory

61. Recovering Physical Dynamics from Discrete Observations via Intrinsic Differential Consistency

62. RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

63. Improving Generalization by Permutation Routing Across Model Copies

64. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

65. HyperTransport: Amortized Conditioning of T2I Generative Models

66. Don't Fix the Basis -- Learn It: Spectral Representation with Adaptive Basis Learning for PDEs

Efficiency, Compression, and Large-Scale Training (31)

1. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

2. BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

3. PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

4. AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

5. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

6. Test-Time Speculation