Personalized Daily ArXiv Papers 2026-05-08

Usage and paper counts:

Model: gpt-5.4
Tokens: 418992 prompt, 33915 completion, 452907 total
Cost: $1.05 prompt, $0.51 completion, $1.56 total
Papers: 1086 total on arXiv, 702 scanned, 75 relevant

Topic Coverage:

Topic: Papers
Architecture and Training Dynamics: 23
Efficiency, Compression, and Large-Scale Training: 16
Representation Learning Theory and Structure: 13
Memory Structures and Agent Memory Systems: 9
World Models, Exploration, and Open-Ended Reinforcement Learning: 14

Table of contents by topic:

Architecture and Training Dynamics (23)

  1. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention Authors: Yulong Huang, Xiang Liu, Hongxiang Huang, Xiaopeng Lin, Zunchang Liu, Xiaowen Chu, Zeke Xie, Bojun Cheng

  2. Continuous Latent Diffusion Language Model Authors: Hongcan Guo, Qinyu Zhao, Yian Zhao, Shen Nie, Rui Zhu, Qiushan Guo, Feng Wang, Tao Yang, Hengshuang Zhao, Guoqiang Wei, Yan Zeng

  3. Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers Authors: Pengqi Lu

  4. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent Authors: Chenyang Zhang, Yuan Cao

  5. TIDE: Every Layer Knows the Token Beneath the Context Authors: Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho

  6. Cubit: Token Mixer with Kernel Ridge Regression Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Liangchen Tan, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

  7. Structural Correspondence and Universal Approximation in Diagonal plus Low-Rank Neural Networks Authors: Ying Chen, Aoxi Li, Jihun Kim, Javad Lavaei

  8. Estimating Implicit Regularization in Deep Learning Authors: Joseph H. Rudoler, Kevin Tan, Giles Hooker, Konrad P. Kording

  9. Layer Collapse in Diffusion Language Models Authors: Alexander Conzelmann, Albert Catalan-Tatjer, Shiwei Liu

  10. Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes Authors: Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou

  11. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less Authors: Yuxing Liu, Jianyu Wang, Tong Zhang

  12. Von Neumann Networks Authors: Shekhar S. Chandra

  13. Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles Authors: Daniel Grimmer

  14. A Testable Certificate for Constant Collapse in Teacher-Guided VAEs Authors: Zegu Zhang, Jianhua Peng, Jian Zhang

  15. Large Vision-Language Models Get Lost in Attention Authors: Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang

  16. Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization Authors: Abhijit Das, Sayantan Dutta

  17. On the Blessing of Pre-training in Weak-to-Strong Generalization Authors: Wei Yao, Wang Zhaoyang, Gengze Xu, Chen Qian, Dongrui Liu, Ziqiao Wang, Yong Liu, Yunbei Xu

  18. Crafting Reversible SFT Behaviors in Large Language Models Authors: Yuping Lin, Pengfei He, Yue Xing, Yingqian Cui, Jiayuan Ding, Subhabrata Mukherjee, Hui Liu, Zhen Xiang

  19. Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS Authors: Laurent Guigues

  20. On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR Authors: Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua

  21. Full-Spectrum Graph Neural Network: Expressive and Scalable Authors: Xiaohan Wang, Deyu Bo, Longlong Li, Kelin Xia

  22. Recursive Agent Optimization Authors: Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig

  23. A Regime Theory of Controller Class Selection for LLM Action Decisions Authors: Zhaoyang Jiang, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Xuanqi Peng, Honghan Wu

Efficiency, Compression, and Large-Scale Training (16)

  1. Normalized Architectures are Natively 4-Bit Authors: Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry, Boris Ginsburg

  2. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility Authors: Jungsuk Oh, Hyeseo Jeon, Hyunjune Ji, Kyongmin Kong, Jay-Yoon Lee

  3. Federation of Experts: Communication Efficient Distributed Inference for Large Language Models Authors: Muhammad Shahir Abdurrahman, Chun Deng, Azalia Mirhoseini, Philip Levis

  4. Rethinking Adapter Placement: A Dominant Adaptation Module Perspective Authors: Suoxin Zhang, Run He, Di Fang, Xiang Tan, Kaixuan Chen, Huiping Zhuang

  5. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving Authors: Mikhail Shirokikh, Sergey Nikolenko

  6. Nearly Optimal Attention Coresets Authors: Edo Liberty, Alexandr Andoni, Eldar Kleiner

  7. Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference Authors: Saksham Rathi, Preeti, Mythili Vutukuru

  8. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds Authors: Hongyi Tao, Dingzhi Yu, Lijun Zhang

  9. Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven Authors: Ran Ben-Basat, William Kuszmaul, Michael Mitzenmacher, Amit Portnoy, Shay Vargaftik

  10. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving Authors: Bole Ma, Jan Eitzinger, Harald Köstler

  11. Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization Authors: Ruotong Sun, Ermin Wei

  12. VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading Authors: Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li

  13. PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization Authors: Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas

  14. Accelerating LMO-Based Optimization via Implicit Gradient Transport Authors: Won-Jun Jang, Si-Hyeon Lee

  15. SymDrift: One-Shot Generative Modeling under Symmetries Authors: Samir Darouich, Vinh Tong, Lluís Pastor-Pérez, Tanja Bien, Loay Mualem, Mathias Niepert

  16. P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference Authors: Xin Peng, Ang Gao

Representation Learning Theory and Structure (13)

  1. The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks Authors: Taehun Cha, Daniel Beaglehole, Adityanarayanan Radhakrishnan, Donghun Lee

  2. Structural Instability of Feature Composition Authors: Yunpeng Zhou

  3. Topological Signatures of Grokking Authors: Yifan Tang, Qiquan Wang, Inés García-Redondo, Anthea Monod

  4. When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias Authors: Ye Su, Jian Li, Yong Liu

  5. End-to-End Identifiable and Consistent Recurrent Switching Dynamical Systems Authors: Carles Balsells-Rodas, Zhengrui Xiang, Xavier Sumba, Yingzhen Li

  6. MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series Authors: Shicheng Fan, Nour Elhendawy, Jianle Sun, Ke Fang, Kun Zhang, Yihang Wang, Lu Cheng

  7. Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow Authors: Bowen Zheng, Yihong Luo, Tianyang Hu

  8. Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors Authors: Richard Bergna, Stefan Depeweg, José Miguel Hernández-Lobato

  9. Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation Authors: Chengyu Cui, Gongjun Xu

  10. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization Authors: Adhiraj Banerjee, Vipul Arora

  11. Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective Authors: Meira Iske, Carola-Bibiane Schönlieb

  12. When Graph Language Models Go Beyond Memorization Authors: Masatsugu Yamada, Mahito Sugiyama

  13. TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering Authors: Yuan Sui, Yulin Chen, Yibo Li, Xue Jiang, Yufei He, Yihong Dong, Xiaoxin He, Tianyu Gao, Bryan Hooi

Memory Structures and Agent Memory Systems (9)

  1. Belief Memory: Agent Memory Under Partial Observability Authors: Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan, Xiuying Chen

  2. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination Authors: Qiyao Liang, Risto Miikkulainen, Ila Fiete

  3. SkillOS: Learning Skill Curation for Self-Evolving Agents Authors: Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra, Rui Meng, Chun-Liang Li, Yizhu Jiao, Kaiwen Zha, Maohao Shen, Vishy Tirumalashetty, George Lee, Jiawei Han, Tomas Pfister, Chen-Yu Lee

  4. Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning Authors: David Leeftink, Max Hinne, Marcel van Gerven

  5. Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs Authors: Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous

  6. Retrieval from Within: An Intrinsic Capability of Attention-Based Models Authors: Elad Hoffer, Yochai Blau, Ron Banner, Daniel Soudry, Boris Ginsburg

  7. More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding Authors: Ming Liu

  8. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG Authors: Yijia Zheng, Marcel Worring

  9. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work Authors: Josh Rosen, Seth Rosen

World Models, Exploration, and Open-Ended Reinforcement Learning (14)

  1. HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning Authors: Haoyun Tang, Haodong Cui, Keyao Xu, Kun Wang, Zhandong Mei

  2. Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement Authors: Roussel Desmond Nzoyem, Mauro Comi

  3. Prediction and Empowerment: A Theory of Agency through Bridge Interfaces Authors: Richard Csaky

  4. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key Authors: Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov

  5. Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing Authors: Leon Halgryn (University of Twente), Sophie Langer (Ruhr-Universität Bochum), Janusz M. Meylahn (University of Twente), E. Moritz Hahn (University of Twente)

  6. A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration Authors: Manuel Haussmann, Mustafa Mert Çelikok, Melih Kandemir

  7. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities Authors: Armaan A. Abraham, Lucy Xiaoyang Shi, Chelsea Finn

  8. Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL Authors: Dillon Sandhu, Ronald Parr

  9. Operator-Guided Invariance Learning for Continuous Reinforcement Learning Authors: Zuyuan Zhang, Fei Xu Yu, Tian Lan

  10. Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching Authors: Xiang Li, Nan Jiang

  11. Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies Authors: Magnus Victor Boock, Abdullah Akgül, Mustafa Mert Çelikok, Melih Kandemir

  12. Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics Authors: Philip Jordan, Maryam Kamgarpour

  13. Bandit Learning in General Open Multi-agent Systems Authors: Mengfan Xu

  14. Differential Privacy in the Extensive-Form Bandit Problem Authors: Stephen Pasteris, Rahul Savani, Theodore Turocy


Architecture and Training Dynamics (23)

1. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

ArXiv ID: 2605.05838

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Yulong Huang, Xiang Liu, Hongxiang Huang, Xiaopeng Lin, Zunchang Liu, Xiaowen Chu, Zeke Xie, Bojun Cheng

Abstract: Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence in optimization. While momentum-based optimizers provide a natural remedy, they pose challenges in simultaneously achieving training efficiency and effectiveness. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical systems perspective, we analyze the momentum-based recurrence as a second-order system that introduces complex conjugate eigenvalues. This analysis guides the design of stable gating constraints. The resulting model, Momentum DeltaNet (MDN), leverages Triton kernels to achieve comparable training throughput with competitive linear models such as Mamba2 and KDA. Extensive experiments on the 400M and 1.3B parameter models demonstrate consistent performance improvements over strong baselines, including Transformers, Mamba2 and GDN, across diverse downstream evaluation benchmarks. Code: https://github.com/HuuYuLong/MomentumDeltaNet .

Comment: Derives a chunkwise parallel algorithm for momentum-based delta linear attention and analyzes its stability as a second-order dynamical system.

Topic Match: This squarely targets core sequence-model mechanism design and training stability for linear-attention-style recurrent updates.

Relevance: 10 Novelty: 8
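
The abstract's remark that a momentum recurrence behaves as a second-order system with complex conjugate eigenvalues can be illustrated on a toy case. The sketch below is not MDN's recurrence or its gating: it assumes the textbook heavy-ball update on a scalar quadratic, x[t+1] = x[t] - a*x[t] + mu*(x[t] - x[t-1]), where `step_curvature` (a, step size times curvature) and `momentum` (mu) are illustrative names. Its characteristic polynomial is z^2 - (1 + mu - a) z + mu, so when the roots are complex their modulus is sqrt(mu), which is why bounding the momentum coefficient keeps the oscillation decaying.

```python
import cmath

def heavy_ball_eigenvalues(step_curvature, momentum):
    """Roots of z^2 - (1 + momentum - step_curvature) * z + momentum = 0."""
    trace = 1.0 + momentum - step_curvature
    disc = trace * trace - 4.0 * momentum
    root = cmath.sqrt(disc)  # complex sqrt handles a negative discriminant
    return (trace + root) / 2.0, (trace - root) / 2.0

def simulate(step_curvature, momentum, steps):
    """Iterate the recurrence from x[-1] = x[0] = 1 and return the trajectory."""
    prev, cur = 1.0, 1.0
    trajectory = [cur]
    for _ in range(steps):
        prev, cur = cur, (1.0 + momentum - step_curvature) * cur - momentum * prev
        trajectory.append(cur)
    return trajectory

# Hypothetical toy values, chosen so that the discriminant is negative.
z1, z2 = heavy_ball_eigenvalues(step_curvature=1.0, momentum=0.81)
trajectory = simulate(step_curvature=1.0, momentum=0.81, steps=200)

print("eigenvalues:", z1, z2)
print("modulus:", abs(z1))            # equals sqrt(momentum) for complex roots
print("first steps:", trajectory[:5])  # oscillates around zero while decaying
```

With these values the eigenvalues form a conjugate pair inside the unit circle, so the trajectory oscillates and shrinks. A momentum coefficient at or above 1 would place a complex pair on or outside the unit circle.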


2. Continuous Latent Diffusion Language Model

ArXiv ID: 2605.06548

Primary Topic: Architecture and Training Dynamics

Authors: Hongcan Guo, Qinyu Zhao, Yian Zhao, Shen Nie, Rui Zhu, Qiushan Guo, Feng Wang, Tao Yang, Hengshuang Zhao, Guoqiang Wei, Yan Zeng

Abstract: Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.

Comment: Introduces a hierarchical continuous latent diffusion language model that separates global semantic prior modeling from local text realization.

Topic Match: Best fit is Architecture and Training Dynamics because the paper proposes a new core generative architecture and modeling decomposition for text.

Relevance: 9 Novelty: 9


3. Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

ArXiv ID: 2605.06169

Primary Topic: Architecture and Training Dynamics

Authors: Pengqi Lu

Abstract: Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.

Comment: Identifies a deep diffusion-transformer collapse mechanism and proposes split residual design to stabilize training up to 1000 layers.

Topic Match: This is directly about architectural/training-stability mechanisms at extreme depth, a core target topic.

Relevance: 9 Novelty: 8


4. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

ArXiv ID: 2605.06609

Primary Topic: Architecture and Training Dynamics

Authors: Chenyang Zhang, Yuan Cao

Abstract: Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. Then, we show that our constructed transformer can be obtained through (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. Training convergence guarantees of the self-attention layer and out-of-distribution generalization guarantees of the looped model are provided. Our results advance the theoretical understanding of ICL mechanism by showcasing how softmax transformers can effectively act as in-context learners.

Comment: Shows transformers can exactly implement in-context logistic regression as normalized gradient descent across layers, with training and OOD guarantees.

Topic Match: This is squarely about mechanistic understanding of transformer computation and training dynamics for in-context learning, which is the core of the architecture/training topic.

Relevance: 9 Novelty: 8
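
For readers unfamiliar with the algorithm named in the abstract, here is a minimal sketch of normalized gradient descent on a logistic loss over a small context of labelled points. It does not reproduce the paper's transformer construction: it only assumes the standard update w <- w - eta * g / ||g||, with each loop iteration standing in for what the abstract describes as one layer. The data and the names `normalized_gd`, `step_size`, and `num_steps` are illustrative.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def logistic_loss(w, xs, ys):
    """Mean of log(1 + exp(-y * <w, x>)) over the context, labels y in {-1, +1}."""
    return sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in zip(xs, ys)) / len(xs)

def logistic_grad(w, xs, ys):
    """Gradient of the mean logistic loss with respect to w."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        coeff = -y / (1.0 + math.exp(y * dot(w, x)))
        for j, xj in enumerate(x):
            grad[j] += coeff * xj / len(xs)
    return grad

def normalized_gd(xs, ys, step_size, num_steps):
    """Run normalized GD from w = 0; every step has length exactly step_size."""
    w = [0.0] * len(xs[0])
    losses = [logistic_loss(w, xs, ys)]
    for _ in range(num_steps):
        g = logistic_grad(w, xs, ys)
        norm = math.sqrt(dot(g, g))
        if norm == 0.0:  # nothing to normalize
            break
        w = [wj - step_size * gj / norm for wj, gj in zip(w, g)]
        losses.append(logistic_loss(w, xs, ys))
    return w, losses

# Toy linearly separable context (hypothetical data).
xs = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
ys = [1, 1, -1, -1]
w, losses = normalized_gd(xs, ys, step_size=0.5, num_steps=5)
print("weights:", w)
print("losses:", losses)
```

Because the step length does not shrink as the gradient vanishes, the margin on separable data keeps growing at a fixed rate per step, which is one reason the normalized variant is a natural target for a fixed-depth construction.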


5. TIDE: Every Layer Knows the Token Beneath the Context

ArXiv ID: 2605.06216

Primary Topic: Architecture and Training Dynamics

Also Matches: Memory Structures and Agent Memory Systems

Authors: Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho

Abstract: We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type vocabulary distribution causes rare-token embeddings to be chronically under-trained because they receive only a fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where limited-parameter models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection, and show improved performance across multiple language modeling and downstream tasks.

Comment: Introduces a new transformer mechanism that reinjects token identity at every layer through routed embedding memory to address rare-token undertraining and contextual collapse.

Topic Match: This is fundamentally an architectural proposal for token processing and training behavior in transformers, even though it uses a memory-like module.

Relevance: 9 Novelty: 8


6. Cubit: Token Mixer with Kernel Ridge Regression

ArXiv ID: 2605.06501

Primary Topic: Architecture and Training Dynamics

Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Liangchen Tan, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

Abstract: Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.

Comment: Replaces attention’s Nadaraya-Watson token mixer with a kernel-ridge-regression formulation and adds a stabilization mechanism.

Topic Match: This is a direct proposal for a new token-mixing mechanism in transformers, with theory and stability considerations at the architecture level.

Relevance: 9 Novelty: 8
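
The contrast the abstract draws between Nadaraya-Watson regression and Kernel Ridge Regression can be seen on a three-token toy example. This is a sketch of the two textbook estimators only, not Cubit's layer: it assumes scalar keys and values, a Gaussian kernel, no causal masking, and no Limited-Range Rescale. The helper names (`kernel`, `solve`, `nadaraya_watson`, `kernel_ridge`) and the `ridge` parameter are illustrative.

```python
import math

def kernel(a, b):
    """Gaussian similarity between two scalar features (assumed kernel)."""
    return math.exp(-0.5 * (a - b) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def nadaraya_watson(query, keys, values):
    """Attention-style mixing: similarity-weighted average of the values."""
    w = [kernel(query, k) for k in keys]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

def kernel_ridge(query, keys, values, ridge):
    """KRR mixing: k(query, keys)^T (K + ridge * I)^{-1} values."""
    n = len(keys)
    gram = [[kernel(keys[i], keys[j]) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve(gram, values)
    return sum(kernel(query, k) * a for k, a in zip(keys, alpha))

keys = [0.0, 1.0, 2.0]
values = [0.0, 1.0, 4.0]
nw = nadaraya_watson(1.0, keys, values)
krr = kernel_ridge(1.0, keys, values, ridge=1e-9)
print("Nadaraya-Watson at key 1.0:", nw)
print("Kernel ridge at key 1.0:", krr)
```

When the query coincides with a stored key and the ridge term is small, the KRR estimate recovers that key's value almost exactly, while the Nadaraya-Watson average is pulled toward the neighbouring values. The price is the linear solve against the kernel matrix.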


7. Structural Correspondence and Universal Approximation in Diagonal plus Low-Rank Neural Networks

ArXiv ID: 2605.05659

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Ying Chen, Aoxi Li, Jihun Kim, Javad Lavaei

Abstract: The massive computational costs of scaling modern deep learning architectures have driven the widespread use of parameter-efficient low-rank structures, such as LoRA and low-rank factorization. However, theoretical guarantees for their expressive power are less explored, often relying on restrictive priors like a pretrained base matrix, ReLU activations or non-verifiable singularity conditions. We first investigate the limits of neural networks constrained strictly to low-rank manifolds without pretrained dense priors. We demonstrate a theoretical paradox: while purely rank-1 layers can exactly interpolate arbitrary scalar datasets, they collapse for function approximations. To overcome this bottleneck without surrendering parameter efficiency, we introduce a unified Structural Correspondence framework. We prove that augmenting low-rank layers with only a minimal sparse diagonal component, say a Diagonal plus Low-Rank (DLoR) structure, is sufficient to reach Universal Approximation. We show that any full-rank transformation can be exactly reconstructed using these DLoR components by trading off network width (additive decomposition) or depth (multiplicative decomposition). By tracking asymptotic Taylor remainders, we prove that DLoR neural networks fully restore the Universal Approximation Theorem for general activation functions. Finally, we establish that multiplicative depth provides superior parameter-to-expressivity scaling compared to additive width. Our results show that dense matrices and specific activation functions are not topological prerequisites for universal expressivity.

Comment: Shows that adding a minimal diagonal component to low-rank layers restores universal approximation, clarifying expressivity of DLoR/LoRA-style structures.

Topic Match: Primary fit is Architecture and Training Dynamics because the main result is a foundational expressivity theorem about a core architectural parameterization.

Relevance: 9 Novelty: 8
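
One plausible reading of the Diagonal plus Low-Rank structure is a weight matrix of the form W = diag(d) + U V^T. The sketch below assumes that parameterization; the paper's exact layer definition, the sparsity pattern of its diagonal, and its activation placement are not given in this digest. It shows that the layer can be applied without ever forming the dense matrix. The names `dlor_matvec` and `dense_equivalent` are illustrative.

```python
def dlor_matvec(diag, U, V, x):
    """Compute (diag(d) + U V^T) x with U, V given as n rows of r entries each."""
    n, r = len(x), len(U[0])
    # Project x onto the r low-rank directions first: V^T x.
    proj = [sum(V[i][k] * x[i] for i in range(n)) for k in range(r)]
    # Combine the elementwise diagonal term with the low-rank correction.
    return [diag[i] * x[i] + sum(U[i][k] * proj[k] for k in range(r))
            for i in range(n)]

def dense_equivalent(diag, U, V):
    """Materialize the same operator as a dense n x n matrix, for checking only."""
    n, r = len(U), len(U[0])
    return [[(diag[i] if i == j else 0.0) + sum(U[i][k] * V[j][k] for k in range(r))
             for j in range(n)] for i in range(n)]

# Hypothetical toy layer with n = 3 and rank r = 1.
diag = [1.0, 2.0, 3.0]
U = [[1.0], [0.0], [2.0]]
V = [[1.0], [1.0], [0.0]]
x = [1.0, 2.0, 3.0]

y_fast = dlor_matvec(diag, U, V, x)
W = dense_equivalent(diag, U, V)
y_dense = [sum(W[i][j] * x[j] for j in range(3)) for i in range(3)]
print(y_fast, y_dense)
```

Under this parameterization a layer stores n + 2nr numbers instead of n^2, and the matrix-vector product costs on the order of nr operations rather than n^2.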


8. Estimating Implicit Regularization in Deep Learning

ArXiv ID: 2605.05436

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Joseph H. Rudoler, Kevin Tan, Giles Hooker, Konrad P. Kording

Abstract: Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.

Comment: Presents gradient-matching methods to empirically estimate implicit regularization induced by training procedures like early stopping and dropout.

Topic Match: Best matched to Architecture and Training Dynamics because it directly studies training dynamics and the implicit biases induced by optimization and regularization choices.

Relevance: 9 Novelty: 8


9. Layer Collapse in Diffusion Language Models

ArXiv ID: 2605.06366

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training, Representation Learning Theory and Structure

Authors: Alexander Conzelmann, Albert Catalan-Tatjer, Shiwei Liu

Abstract: Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers -- the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant outlier becomes an indispensable information carrier while remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: LLaDA under 3-bit GPTQ quantization drops only -1.8% on GSM8K, whereas Llama-3.1-8B drops -64.7%. Optimal sparsity allocation also reverses between families: at 50% average sparsity, allocating more to early layers in LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama -8.4%. Our findings reveal that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: github.com/Conzel/super-outlier-dlm.

Comment: Mechanistic analysis of layer collapse in diffusion language models with direct implications for quantization and sparsity allocation.

Topic Match: The main contribution is a novel analysis of activation and redundancy dynamics induced by the DLM objective, a strong architecture/training-dynamics fit.

Relevance: 9 Novelty: 8


10. Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

ArXiv ID: 2605.06152

Primary Topic: Architecture and Training Dynamics

Authors: Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou

Abstract: Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.

Comment: Shows periodic loss spikes arise from finite-precision gradient absorption that breaks the zero-sum softmax gradient structure during late training.

Topic Match: This is directly about training dynamics and stability mechanisms, giving a mechanistic explanation for an observed optimization pathology in neural network training.

Relevance: 9 Novelty: 8
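
The absorption effect described in the abstract above can be reproduced in a few lines. This is a minimal sketch, not the paper's code: it uses Python's built-in IEEE-754 double-precision floats, so the logit gap needed to trigger absorption is larger than it would be in lower-precision training formats. The function name `softmax_ce_grad` and the logit values are illustrative assumptions.

```python
import math

def softmax_ce_grad(logits, label):
    """Gradient of cross-entropy w.r.t. the logits: softmax(logits) - one_hot(label)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]

# Moderate confidence: the correct-class probability is still distinguishable
# from 1.0, so the per-class gradients cancel (zero-sum up to rounding).
g_ok = softmax_ce_grad([20.0, 0.0, 0.0], label=0)

# Extreme confidence: exp(-40) is absorbed when added to 1.0, the correct-class
# probability becomes exactly 1.0, and its gradient rounds exactly to zero,
# while the incorrect-class gradients stay nonzero.
g_abs = softmax_ce_grad([40.0, 0.0, 0.0], label=0)

print("gap 20:", g_ok, "sum =", sum(g_ok))
print("gap 40:", g_abs, "sum =", sum(g_abs))
```

In the second case the gradients no longer sum to zero, which is the broken zero-sum constraint the abstract identifies as the source of the drift.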


11. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

ArXiv ID: 2605.06654

Primary Topic: Architecture and Training Dynamics

Authors: Yuxing Liu, Jianyu Wang, Tong Zhang

Abstract: Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when they are employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT.

Comment: Shows optimizer-model consistency: using the pretraining optimizer during full finetuning reduces forgetting and changes update geometry.

Topic Match: The paper directly analyzes training dynamics and optimizer-induced model geometry across pretraining and finetuning.

Relevance: 9 Novelty: 8


12. Von Neumann Networks

ArXiv ID: 2605.05780

Primary Topic: Architecture and Training Dynamics

Authors: Shekhar S. Chandra

Abstract: In the mid-twentieth century, mathematician and polymath John von Neumann created a computational system on an array of cells as a simple model of the human brain, where each cell had one of a finite set of roles or states that he predicted would be modelled by a diffusion process. In this work, we show that such a system, when developed in a modern deep learning setting, enables the construction of an artificial neuron having specialized roles that can be learnt. We refer to this neuron as the Von Neumann neuron, and the neural network built from such neurons results in a self-engineered design whose architecture is only dependent on the structure and locations of its inputs and outputs on this cellular array. The mathematical framework for these Von Neumann Networks (VNNs) is also constructed and shows that they are based on the extension of neural operators and the learning of Green's functions with convolutions on a cellular topology having a diffusion signature. We also prove that these VNNs are part of a more general computational system called Cellular Machines that are computationally universal. Initial experiments show that VNN based multi-layered perceptrons outperform their equivalent deep learning variant on basic tasks, while being more parameter efficient and capable of learning new types of tasks. This includes the ability to solve for and construct an extension of the Von Neumann (hardware) architecture common to all modern computers to cells and suggests new opportunities that could be explored.

Comment: Introduces Von Neumann neurons and cellular-array networks that learn specialized roles through diffusion-like computation, with a universality argument and operator-based formulation.

Topic Match: This is a new architectural proposal centered on a nonstandard computational mechanism and neuron design, squarely matching foundational architecture research.

Relevance: 8 Novelty: 9


13. Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles

ArXiv ID: 2605.05284

Primary Topic: Architecture and Training Dynamics

Authors: Daniel Grimmer

Abstract: Evolutionary computation has long promised to deliver both high-performance optimization tools as well as rigorous scientific simulations of Darwinian evolution. However, modern algorithms frequently abandon evolutionary fidelity for physics-inspired heuristics or superficial biological metaphors. This paper derives a suite of advanced gradient-based optimization algorithms directly from evolutionary first principles. We introduce Darwinian Lineage Simulations (DLS) to prove that, in an asexual context, Fisher's and Wright's historically opposed views of evolution are actually formally equivalent. This unification requires carefully partitioning Fisher's deterministically-evolving total population into Wright's randomly-drifting sub-populations. We prove that proper bookkeeping requires introducing a specific kind of structured noise (the DLS noise relation). Crucially, however, any bookkeeping choices which satisfy this relation will result in a faithful simulation of evolution. Using this vast representational freedom, we prove that a broad family of battle-tested optimization algorithms are already perfectly compatible with evolutionary dynamics. These include: Stochastic Gradient Descent, Natural Gradient Descent, and the Damped Newton's method among many others. By simply adding DLS noise (i.e., evolutionarily faithful genetic drift), these algorithms become scientifically valid in silico simulations of Darwinian evolution. Finally, we demonstrate that even the state-of-the-art Adam optimizer can be brought into evolutionary compliance through a minor mathematical surgery.

Comment: Derives gradient optimizers from Darwinian lineage simulations, offering a novel first-principles view of optimizer dynamics and noise structure.

Topic Match: Its strongest match is optimizer and training-dynamics theory, with a conceptual reframing of standard optimization algorithms from evolutionary first principles.

Relevance: 8 Novelty: 9


14. A Testable Certificate for Constant Collapse in Teacher-Guided VAEs

ArXiv ID: 2605.05813

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Zegu Zhang, Jianhua Peng, Jian Zhang

Abstract: Posterior collapse in variational autoencoders is often diagnosed by its symptoms: a small KL term, a strong decoder, or weak use of the latent code. These signals are useful, but they do not define a collapse boundary. We study a concrete failure mode, input-independent constant collapse, and show that this case admits an exact threshold. For any fixed nonconstant teacher distribution $T(\cdot\mid x)$, the best constant student is the dataset-average teacher distribution, and its alignment cost is the teacher mutual information $I_T(X;T)$. Therefore, if a strictly latent-only raw witness achieves alignment loss below this value, with a safety margin, the witness cannot be constant in the input. This identity turns a qualitative failure mode into a measurable one. In CIFAR-100 experiments with per-seed teacher search, full training stays on the certified side of the boundary, removing alignment drives the raw witness into the constant-student regime, and restarting from a collapsed checkpoint with alignment enabled restores the certificate. Tiny-ImageNet-200 fixed-target runs show the same prevention-collapse-rescue pattern across three independently searched teachers. Standard VAE-style baselines, including methods that preserve reconstruction quality or post-hoc predictability, remain negative under the raw certificate. The guarantee is intentionally narrow: it certifies that the matched nonconstant teacher-relative variation passes through the latent pathway, rather than claiming that all forms of posterior collapse have been ruled out.

Comment: Provides an exact information-theoretic threshold certifying when teacher-guided VAEs avoid input-independent constant posterior collapse.

Topic Match: The main contribution is a principled training-dynamics certificate for a concrete collapse failure mode in latent-variable models.

Relevance: 8 Novelty: 8
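
The threshold identity in the abstract above can be checked numerically on a toy example. The sketch below assumes the alignment cost is the dataset-average KL divergence from the teacher to the student (the abstract does not state the loss form), and the teacher table, function names, and margin are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# Hypothetical teacher distributions T(.|x) for three equally likely inputs x.
teacher = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
n = len(teacher)

def constant_cost(q):
    """Average alignment cost of an input-independent (constant) student q."""
    return sum(kl(t, q) for t in teacher) / n

# Dataset-average teacher distribution: the best constant student.
avg_teacher = [sum(t[k] for t in teacher) / n for k in range(3)]

# Teacher mutual information I_T(X;T) = H(average teacher) - average H(T(.|x)).
mutual_info = entropy(avg_teacher) - sum(entropy(t) for t in teacher) / n

best_cost = constant_cost(avg_teacher)
other_cost = constant_cost([1 / 3, 1 / 3, 1 / 3])

def is_certified(alignment_loss, margin=0.01):
    """A witness with loss below I_T(X;T) minus a margin cannot be constant in x."""
    return alignment_loss < mutual_info - margin

print(best_cost, mutual_info, other_cost)
```

Under this assumed loss, every constant student pays at least `mutual_info`, so any witness that beats that value by the margin must depend on the input.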


15. Large Vision-Language Models Get Lost in Attention

ArXiv ID: 2605.05668

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang

Abstract: Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively "get lost in attention" rather than efficiently leveraging visual context.

Comment: Provides a mechanistic account of attention vs. FFN roles using information-geometric analysis and shows strong redundancy in learned attention weights.

Topic Match: Its central contribution is understanding internal Transformer module function and architectural misallocation, squarely in architecture/mechanism analysis.

Relevance: 8 Novelty: 8


16. Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

ArXiv ID: 2605.06599

Primary Topic: Architecture and Training Dynamics

Authors: Abhijit Das, Sayantan Dutta

Abstract: Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective (cross-entropy loss with $L^2$ regularization) by proving it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss $\mathcal{F}$ is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies the differential growth condition $-\Delta\mathcal{F} + \tfrac{1}{s}|\nabla\mathcal{F}|^{2} \to \infty$ as $|\theta| \to \infty$ for all $s>0$. From this structure, we derive explicit log-Sobolev and Poincaré constants $C_{\mathrm{LS}} \leq \lambda^{-1} + d/\lambda^{2}$, linking the regularization strength $\lambda$ and model dimension $d$ to finite-time convergence guarantees for noisy stochastic gradient descent and PAC-Bayesian generalization bounds that tighten with increasing $\lambda$. To validate our theory, we introduce a scalable Villani diagnostic $\Psi_s(\theta) = -\Delta \mathcal{F} + s^{-1}|\nabla \mathcal{F}|^2$ and estimate it efficiently using Hutchinson trace probes in models with over 100M parameters. Experiments on GPT-Neo-125M across Penn Treebank and WikiText-103 confirm the predicted quadratic growth of $\Psi_s$, spectral inflation of the Hessian, and exponential convergence behavior consistent with our log-Sobolev analysis. These results demonstrate that weight decay not only improves generalization empirically but also establishes the mathematical conditions required for fast Langevin mixing and theoretically grounded curvature-aware optimization in deep learning.

Comment: Gives functional-analytic theory for how weight decay shapes transformer loss landscapes and convergence/generalization behavior.

Topic Match: Its main contribution is a theoretical account of a core training-stability mechanism in transformers, not an application or benchmark result.

Relevance: 8 Novelty: 8
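
The diagnostic $\Psi_s(\theta) = -\Delta \mathcal{F} + s^{-1}|\nabla \mathcal{F}|^2$ from the abstract above can be sketched on a toy problem. The Laplacian is the trace of the Hessian, and a Hutchinson estimator approximates that trace by averaging $v^\top H v$ over random sign vectors $v$. The code below is an illustration under stated assumptions, not the paper's implementation: the loss is a four-dimensional softmax cross-entropy with an $L^2$ penalty, and `LAM`, `grad`, `hvp`, and `villani_diagnostic` are hypothetical names.

```python
import math
import random

LAM = 0.1  # hypothetical weight-decay strength (lambda)

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad(theta):
    """Gradient of F(theta) = CE(softmax(theta), class 0) + LAM/2 * |theta|^2."""
    p = softmax(theta)
    return [p[i] - (1.0 if i == 0 else 0.0) + LAM * theta[i] for i in range(len(theta))]

def hvp(theta, v):
    """Hessian-vector product: (diag(p) - p p^T) v + LAM * v."""
    p = softmax(theta)
    pv = sum(pi * vi for pi, vi in zip(p, v))
    return [pi * (vi - pv) + LAM * vi for pi, vi in zip(p, v)]

def villani_diagnostic(theta, s, n_probes=256, seed=0):
    """Estimate Psi_s = -trace(Hessian) + |grad|^2 / s with Rademacher probes."""
    rng = random.Random(seed)
    d = len(theta)
    trace_est = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        trace_est += sum(vi * hi for vi, hi in zip(v, hvp(theta, v)))
    trace_est /= n_probes
    g = grad(theta)
    return -trace_est + sum(gi * gi for gi in g) / s

direction = [1.0, -0.5, 0.25, 0.0]
values = []
for scale in (1.0, 10.0, 100.0):
    theta = [scale * t for t in direction]
    values.append(villani_diagnostic(theta, s=1.0))
    print(scale, values[-1])
```

On this toy loss the estimate increases as the parameter norm grows, because the weight-decay term makes the squared gradient norm grow quadratically while the Hessian trace stays bounded.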


17. On the Blessing of Pre-training in Weak-to-Strong Generalization

ArXiv ID: 2605.05710

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Wei Yao, Wang Zhaoyang, Gengze Xu, Chen Qian, Dongrui Liu, Ziqiao Wang, Yong Liu, Yunbei Xu

Abstract: The paradigm of Weak-to-Strong Generalization (W2SG) suggests that a pre-trained strong model can surpass its weak supervisor, yet the decisive role of pre-training remains theoretically and empirically under-explored. In this work, we identify pre-training as the essential prerequisite for the emergence of W2SG. Theoretically, we formalize the W2SG problem within a high-dimensional single-index model framework using spiked Gaussian data, modeling pre-training as a spectral initialization step. Building upon prior impossibility results regarding the failure of learning under random initialization, we prove that W2SG is achievable when pre-training provides a geometric warm start that places the model within an "effective region" characterized by a perturbed strong-convexity geometry. Within this region, we derive a rigorous generalization bound that naturally captures the optimization dynamics: an initial performance improvement followed by a saturation bottleneck dictated by the weak supervisor's bias. Empirically, we first validate all our assumptions and theoretical insights through controlled synthetic simulations. Finally, through a massive-scale evaluation of hundreds of intermediate pre-training checkpoints from large language models, we demonstrate that W2SG is not an innate capability but emerges via a phase transition tightly coupled with the progression of pre-training.

Comment: Theoretically identifies pre-training as the condition enabling weak-to-strong generalization, via optimization geometry and phase-transition analysis.

Topic Match: The main value is foundational theory about training dynamics and initialization geometry for LLM generalization, rather than alignment post-training itself.

Relevance: 8 Novelty: 8


18. Crafting Reversible SFT Behaviors in Large Language Models

ArXiv ID: 2605.06632

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Yuping Lin, Pengfei He, Yue Xing, Yingqian Cui, Jiayuan Ding, Subhabrata Mukherjee, Hui Liu, Zhen Xiang

Abstract: Supervised fine-tuning (SFT) induces new behaviors in large language models, yet imposes no structural constraint on how these behaviors are distributed within the model. Existing behavior interpretation methods, such as circuit attribution approaches, identify sparse subnetworks correlated with SFT-induced behaviors post-hoc. However, such correlations do not imply causal necessity, limiting the ability to selectively control SFT-induced behaviors at inference time. We pursue an alternative by asking: can an SFT-induced behavior be deliberately compressed into a sparse, mechanistically necessary subnetwork, termed a carrier, while remaining controllable at inference time without weight modification? We propose (a) Loss-Constrained Dual Descent (LCDD), which constructs such carriers by jointly optimizing routing masks and model weights under an explicit utility budget, and (b) SFT-Eraser, a soft prompt optimized via activation matching on extracted carrier channels, to reverse the SFT-induced behavior. Across safety, fixed-response, and style behaviors on multiple model families, LCDD yields sparse carriers that preserve target behaviors while enabling strong reversion when triggered by SFT-Eraser. Ablations further establish that the sparse structure is the key precondition for reversal: the same trigger optimization fails on standard SFT models, confirming that structure rather than trigger design is the operative factor. These results provide direct evidence that the learned carriers are causally necessary for the behaviors, pointing to a new direction for systematically localizing and selectively suppressing SFT-induced behaviors in deployed models.

Comment: Learns sparse mechanistically necessary subnetworks that carry SFT-induced behaviors and introduces a trigger that selectively reverses those behaviors via the extracted carrier.

Topic Match: This is primarily about internal architectural organization of learned behaviors under fine-tuning and how to enforce sparse causal carriers, a training-dynamics and mechanistic-control contribution.

Relevance: 8 Novelty: 8


19. Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS

ArXiv ID: 2605.05330

Primary Topic: Architecture and Training Dynamics

Authors: Laurent Guigues

Abstract: We introduce Graph Normalization (GN), a principled dynamical system on graphs that serves as a differentiable approximation engine for the NP-hard Maximum Weight Independent Set (MWIS) problem. MWIS encompasses many combinatorial challenges, including optimal assignment, scheduling, set packing, and MAP inference in discrete Markov Random Fields. Unlike Belief Propagation, we prove GN always converges to a binary indicator of a Maximum Independent Set. GN realizes a fast quasi-Newton descent through an exact Majorization-Minimization step, systematically improving the MWIS relaxed primal objective. We establish an equivalence between GN and the Replicator Dynamics of a nonlinear evolutionary game, where vertices compete for inclusion in an independent set. While a non-potential game, the GN game follows Fisher's Fundamental Theorem of Natural Selection, where the average fitness equals the MWIS primal objective and strictly increases. This connection leads to a weighted extension of the Motzkin-Straus theorem, showing MISes are in bijection with the local minima of a quadratic form over a tilted simplex. For the Assignment Problem, GN acts as a variant of the Sinkhorn algorithm that naturally converges to a hard assignment while generalizing to arbitrary constraint graphs. We demonstrate GN's performance as a fast binarization engine for the state-of-the-art Bregman-Sinkhorn relaxed MWIS solver. On real-world benchmarks with up to 1M edges, GN identifies solutions within 1% of the best known results in seconds on a CPU. GN opens new avenues for deep learning architectures requiring differentiable, "hard" decisions under constraints, with applications in structured sparse attention, dynamic network pruning, and Mixture-of-Experts. Beyond core AI, the GN framework enables end-to-end learning of constrained optimization in computer vision, computational biology, and resource allocation.

Comment: Proposes Graph Normalization, a convergent binarizing dynamical system for differentiable MWIS with links to replicator dynamics and hard decision-making under constraints.

Topic Match: The paper introduces a new computational mechanism for hard structured decisions that could serve as a foundational building block for sparse attention, pruning, and MoE-like routing settings.

Relevance: 8 Novelty: 8


20. On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

ArXiv ID: 2605.06523

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua

Abstract: Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during the training process. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 component in RLVR does not retain model knowledge other than mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The distribution of singular values of almost all linear layers in an RLVR-trained model behaves like a heavy-tailed distribution. (3) The left singular vectors associated with rank-1 components demonstrate a stronger alignment tendency during training, which echoes the discovery that RLVR is optimizing sampling efficiency in essence. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.

Comment: Analyzes low-rank parameter dynamics in RLVR and argues reasoning gains concentrate in rank-1 components with implicit reward overfitting.

Topic Match: Best fit is Architecture and Training Dynamics because the core contribution is mechanistic analysis of training dynamics rather than an application or benchmark gain.

Relevance: 8 Novelty: 8


21. Full-Spectrum Graph Neural Network: Expressive and Scalable

ArXiv ID: 2605.05759

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Xiaohan Wang, Deyu Bo, Longlong Li, Kelin Xia

Abstract: It is well established that spectral graph neural networks (GNNs) can universally approximate node signals; however, their expressive power remains bounded by the 1-dimensional Weisfeiler-Lehman test, which is mirrored in their lack of universality for higher-order signals. To go beyond this bound, we propose the Full-Spectrum GNN (FSpecGNN), a second-order generalization of classical spectral GNNs. FSpecGNN advances spectral filtering in two perspectives: (1) it lifts the signal from the node domain to the node-pair domain; and (2) it extends the univariate spectral filter over eigenvalues to a bivariate filter over eigenvalue pairs. We show that classical spectral GNNs arise as a diagonal special case of FSpecGNN, and prove that FSpecGNN can be at most as expressive as Local 2-GNN while universally approximating node-pair signals, the latter being particularly beneficial for heterophilic graph learning. Moreover, FSpecGNN admits scalable implementations that avoid explicit node-pair-level computations; combined with a low-rank approximation that reduces full-spectrum convolution to a combination of polynomial spectral filters, it enables learning on large graphs. Empirically, FSpecGNN validates the predicted expressivity and delivers strong performance on heterophilic benchmarks.

Comment: Extends spectral GNNs to node-pair signals with bivariate spectral filters, increasing expressive power beyond classical spectral formulations.

Topic Match: Its main contribution is a new architectural mechanism and expressivity result for graph neural computation.

Relevance: 8 Novelty: 8


22. Recursive Agent Optimization

ArXiv ID: 2605.06639

Primary Topic: Architecture and Training Dynamics

Authors: Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig

Abstract: We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, can scale to tasks that go beyond the model's context window, generalize to tasks much harder than the ones the agent was trained on, and can enjoy reduced wall-clock time compared to single-agent systems.

Comment: Trains recursively delegating agents that learn when and how to spawn sub-agents, yielding inference-time scaling beyond context limits.

Topic Match: The key idea is a new computational architecture for recursive delegation and its training, not standard agent post-training alone.

Relevance: 8 Novelty: 8
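
The divide-and-conquer inference pattern described in the abstract above can be illustrated with a toy. This sketch does not implement RAO or any reinforcement learning: a fixed hand-written rule ("delegate when the task does not fit the context limit") stands in for the learned delegation policy, and `solve`, `context_limit`, and the summation task are hypothetical.

```python
def solve(task, context_limit, depth=0, stats=None):
    """Toy recursive agent: sum a list without reading more than context_limit items at once."""
    if stats is None:
        stats = {"calls": 0, "max_depth": 0}
    stats["calls"] += 1
    stats["max_depth"] = max(stats["max_depth"], depth)

    # Base case: the task fits in one context window, so answer directly.
    if len(task) <= context_limit:
        return sum(task), stats

    # Otherwise split the task and delegate each half to a new instance.
    mid = len(task) // 2
    left, _ = solve(task[:mid], context_limit, depth + 1, stats)
    right, _ = solve(task[mid:], context_limit, depth + 1, stats)

    # The parent only sees the two sub-results, not the full input.
    return left + right, stats

answer, stats = solve(list(range(100)), context_limit=16)
print(answer, stats)
```

Each instance handles at most 16 items, yet the overall task covers 100, which mirrors the claim that recursive agents can handle tasks larger than a single context window.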


23. A Regime Theory of Controller Class Selection for LLM Action Decisions

ArXiv ID: 2605.06339

Primary Topic: Architecture and Training Dynamics

Authors: Zhaoyang Jiang, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Xuanqi Peng, Honghan Wu

Abstract: Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly beneficial in finite samples: under identical strict cross-validation, different benchmarks prefer different controller classes. This reflects a finite-sample limitation of instance-level uncertainty signals, which can be exhausted at a distribution-dependent scale. We organize controllers into a nested lattice of four classes: fixed actions, partition routers, instance-level controllers, and prior-gated controllers, ordered by complexity. We prove a regime theory that turns three data-estimable bottlenecks into a class choice: how much improvement is possible beyond the best fixed action, whether there are enough samples for instance-level controllers to make reliable decisions, and how much improvement a coarse partition router can recover when instance-level signal is unreliable. The resulting Bernstein-tight threshold has a matching information-theoretic lower bound, and strict nested cross-validation provably selects a near-best class. Across SMS-Spam, HallusionBench, A-OKVQA, and FOLIO, the predicted class matches the empirical winner; the prior-gated controller wins on TextVQA when OCR tokens supply a label-free prediction-time prior. Code is available at https://github.com/Anonymous-Awesome-Submissions/Regime-Theory.

Comment: Provides a regime theory for when to use fixed, partitioned, instance-level, or prior-gated controllers for answer/retrieve/defer/abstain decisions.

Topic Match: This is fundamentally about controller architecture selection and finite-sample decision regimes, a strong match to core computational mechanism analysis.

Relevance: 8 Novelty: 8


Efficiency, Compression, and Large-Scale Training (16)

1. Normalized Architectures are Natively 4-Bit

ArXiv ID: 2605.06067

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry, Boris Ginsburg

Abstract: Training large language models at 4-bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low-precision arithmetic. This removes the need for interventions, such as applying random Hadamard transforms and performing per-tensor scaling calculations, to preserve model quality, and it enables stable end-to-end NVFP4 training. We validate this approach on both a 1.2B dense model and hybrid (Mamba-Transformer) MoE models of up to 3B/30B parameters. We trace this robustness to the dot product: while quantization noise remains largely uncorrelated in both standard and normalized architectures, the signal behaves differently. In nGPT, the hypersphere constraint enhances weak positive correlations among the element-wise products, leading to a constructive accumulation of the signal across the hidden dimension while the noise continues to average out. This yields a higher effective signal-to-noise ratio and a flatter loss landscape, with the effect strengthening as the hidden dimension grows, suggesting increasing advantages at scale. A reference implementation is available at https://github.com/anonymous452026/ngpt-nvfp4

Comment: Shows normalized architectures are intrinsically robust to end-to-end 4-bit training and explains the SNR mechanism behind it.

Topic Match: Primary fit is efficiency and scaling because the main result is a materially different route to stable 4-bit training, with architectural analysis in service of low-precision efficiency.

Relevance: 10 Novelty: 8


2. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

ArXiv ID: 2605.06105

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Jungsuk Oh, Hyeseo Jeon, Hyunjune Ji, Kyongmin Kong, Jay-Yoon Lee

Abstract: Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33%, TPOT by 22%, and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.

Comment: Proposes layer-asymmetric KV visibility to cut long-context prefill/decode cost by dropping upper-layer prompt KV states.

Topic Match: The core idea is a new KV-cache/inference policy that materially changes long-context cost and behavior, squarely matching efficiency and scaling.

Relevance: 10 Novelty: 8
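
The reported 25.0% reduction in active KV memory is consistent with simple counting. The sketch below is an idealized accounting, not the paper's measurement: it counts one KV entry per (token, layer) pair, assumes a 32-layer model (the layer count of Llama-3.1-8B, which the abstract does not state) with prompt tokens kept in the lowest 24 layers, and assumes a single BoS anchor visible in the upper layers. The function and parameter names are hypothetical.

```python
def kv_entries(prompt_len, decode_len, n_layers, prefill_layers, n_anchor=1):
    """Count (token, layer) KV entries for a full-depth cache and a SPEED-style cache."""
    # Full-depth baseline: every token is cached at every layer.
    full = n_layers * (prompt_len + decode_len)

    # SPEED-style policy: prompt tokens live only in the lower layers,
    # upper layers keep just the anchor tokens, decode tokens stay full-depth.
    lower = prefill_layers * prompt_len
    upper = (n_layers - prefill_layers) * n_anchor
    speed = lower + upper + n_layers * decode_len
    return full, speed

full, speed = kv_entries(
    prompt_len=128 * 1024,  # 128K-token prompt
    decode_len=0,           # measured at the start of Decode
    n_layers=32,
    prefill_layers=24,      # 75% of layers
)
saving = 1 - speed / full
print(full, speed, round(saving, 4))
```

As decoding proceeds, full-depth Decode tokens accumulate in both caches, so the relative saving shrinks slightly with `decode_len` under this accounting.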


3. Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

ArXiv ID: 2605.06206

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Muhammad Shahir Abdurrahman, Chun Deng, Azalia Mirhoseini, Philip Levis

Abstract: Mixture of experts has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads and expert parallelism is applied between those experts. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication as all experts within a group are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. An implementation of FoE finds that on LongBench, FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing the end-to-end forward-pass latency by up to 5.2x, TTFT by 3.62x, and TBT by 1.95x. It does so while achieving comparable generation quality to a mixture of experts model of the same size and training configuration.

Comment: Restructures MoE inference into KV-head-specific clusters to confine or eliminate all-to-all communication during distributed inference.

Topic Match: The core contribution is a new distributed MoE architecture that materially changes communication cost and inference behavior at scale.

Relevance: 9 Novelty: 8


4. Rethinking Adapter Placement: A Dominant Adaptation Module Perspective

ArXiv ID: 2605.06183

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Suoxin Zhang, Run He, Di Fang, Xiang Tan, Kaixuan Chen, Huiping Zhuang

Abstract: Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA's trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.

Comment: Identifies a dominant adaptation module for LoRA via gradient-energy probing and shows one strategically placed adapter can beat broad placement.

Topic Match: This is a parameter-efficiency result about where adaptation capacity should live, with a clear new mechanistic placement criterion.

Relevance: 9 Novelty: 8


5. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

ArXiv ID: 2605.05219

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Memory Structures and Agent Memory Systems

Authors: Mikhail Shirokikh, Sergey Nikolenko

Abstract: Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a single stored state rather than requiring the entire token history. This asymmetry opens a new design point between no reuse and dense caching: store exact recurrent states at a sparse set of checkpoint positions and, on a cache hit, resume from the deepest stored checkpoint and recompute the remaining suffix exactly. We formalize sparse prefix caching as checkpoint placement under a distribution over overlap depths, yielding an exact O(NM) dynamic program. For use cases where requests share a non-trivial prefix (e.g. asking different questions about a single long document), we show that our method consistently improves the Pareto frontier traced by standard heuristics on real-world data. Across QuALITY and System Prompts, distribution-aware placement dominates every fixed-budget baseline on the measured layer-group Pareto frontier and matches or outperforms the strongest heuristic (block caching) while typically using substantially fewer checkpoints, with the largest gains at low checkpoint budgets where the overlap distribution is most non-uniform. The method is most relevant when many requests share a substantial but not identical prefix within a retained cache entry. It preserves exact outputs, does not change the recurrent computation itself or require new recurrent update kernels, applies to recurrent/SSM layers whose hidden state can be extracted and restored exactly, and for hybrid models can be combined with existing KV-cache compression techniques.

Comment: Formulates sparse prefix caching for recurrent and hybrid LLMs as checkpoint placement under overlap distributions, with an exact dynamic program.

Topic Match: Primary fit is efficiency and scaling because the paper introduces a new exact caching policy for recurrent-state reuse that improves serving tradeoffs.

Relevance: 9 Novelty: 8
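
To make the checkpoint-placement idea concrete, here is a minimal sketch under an assumed cost model: a request with overlap depth `d` resumes from the deepest stored checkpoint at or below `d` and pays one unit per recomputed token. This is a plain O(N²M) dynamic program written for illustration, not the paper's O(NM) algorithm, and `place_checkpoints` is a hypothetical name.

```python
from typing import List, Tuple


def place_checkpoints(p: List[float], budget: int) -> Tuple[float, List[int]]:
    """p[d] = probability that the shared prefix has depth d (d = 0..N).

    Returns (expected recomputed tokens, sorted checkpoint positions).
    Position 0 (start from scratch) is always available and costs no budget.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    n = len(p) - 1
    # Prefix sums: P[k] = sum_{d<k} p[d], Q[k] = sum_{d<k} d * p[d].
    P = [0.0] * (n + 2)
    Q = [0.0] * (n + 2)
    for d in range(n + 1):
        P[d + 1] = P[d] + p[d]
        Q[d + 1] = Q[d] + d * p[d]

    def seg(a: int, b: int) -> float:
        """Expected recompute for depths a..b-1 when resuming from checkpoint a."""
        return (Q[b] - Q[a]) - a * (P[b] - P[a])

    inf = float("inf")
    # dp[j][i]: best cost over depths < i using j checkpoints, the last one at i.
    dp = [[inf] * (n + 1) for _ in range(budget + 1)]
    parent = [[-1] * (n + 1) for _ in range(budget + 1)]
    dp[0][0] = 0.0
    for j in range(1, budget + 1):
        for i in range(1, n + 1):
            for a in range(i):
                if dp[j - 1][a] == inf:
                    continue
                cost = dp[j - 1][a] + seg(a, i)
                if cost < dp[j][i]:
                    dp[j][i] = cost
                    parent[j][i] = a

    best_cost, best_j, best_i = inf, 0, 0
    for j in range(budget + 1):
        for i in range(n + 1):
            if dp[j][i] == inf:
                continue
            total = dp[j][i] + seg(i, n + 1)   # depths >= i resume from i
            if total < best_cost:
                best_cost, best_j, best_i = total, j, i

    positions = []
    j, i = best_j, best_i
    while j > 0:                               # walk back through chosen checkpoints
        positions.append(i)
        i = parent[j][i]
        j -= 1
    return best_cost, positions[::-1]
```

With a uniform overlap distribution the optimum spreads checkpoints evenly; with a skewed distribution it concentrates them where overlap mass sits, which is the regime the abstract reports the largest gains for.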


6. Nearly Optimal Attention Coresets

ArXiv ID: 2605.05602

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Edo Liberty, Alexandr Andoni, Eldar Kleiner

Abstract: We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subset $(K',V')$ of size at most $O(\sqrt{d}\, e^{\rho+o(\rho)}/\varepsilon)$ such that $$\left| \operatorname{Attn}(q,K,V) - \operatorname{Attn}(q,K',V') \right| \le \varepsilon$$ simultaneously for all queries whose norm is bounded by $\rho$. This outperforms the best known results for this problem. We also offer an improved lower bound showing that $\varepsilon$-coresets must have size $\Omega(\sqrt{d}\, e^{\rho}/\varepsilon)$.

Comment: Proves nearly optimal coreset size bounds for approximating attention uniformly over bounded-norm queries.

Topic Match: The paper directly targets memory/space-efficient approximation of attention with strong theoretical guarantees, making efficiency the clearest fit.

Relevance: 9 Novelty: 8


7. Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

ArXiv ID: 2605.06046

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Saksham Rathi, Preeti, Mythili Vutukuru

Abstract: Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together, and maximizing batch size subject to GPU memory constraints. The key observation of our work is that with prefix-sharing workloads, smaller, prefix-homogeneous batches -- where all requests share a common prefix -- can achieve higher decode throughput than larger, heterogeneous batches, due to better spatial and temporal locality during KV cache accesses. However, prefix-aware schedulers in state-of-the-art inference engines maximize prefix reuse within a batch only to reduce KV cache memory footprint, but do not stop batch formation at smaller homogeneous batches that could have performed better. Further, we show that shared prefix detection in existing schedulers relies on radix-tree traversals, incurring substantial CPU overhead that is often comparable to GPU execution time. This paper presents Feather, a prefix-aware scheduler that uses reinforcement learning (RL) to learn the optimal tradeoff between batch size and prefix homogeneity. We also introduce Chunked Hash Tree (CHT), a lightweight data structure that enables fast prefix detection and efficient request selection for the RL scheduler, avoiding expensive tree traversals. We integrate Feather into vLLM and SGLang, and our evaluation shows that Feather achieves 2--10$\times$ higher end-to-end throughput as compared to existing schedulers, while doing no worse than the status quo when the workload does not have enough prefix sharing. Feather achieves these gains by reducing the total number of KV cache accesses, surpassing the performance of prefix-aware attention kernels that have the same goal.

Comment: Shows that prefix-homogeneous batching can beat larger heterogeneous batches and introduces a scheduler/data structure to exploit KV-cache locality.

Topic Match: This is a clear large-model inference systems contribution about scheduling and KV-cache efficiency, with a nontrivial algorithmic systems idea.

Relevance: 9 Novelty: 8


8. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

ArXiv ID: 2605.06615

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Hongyi Tao, Dingzhi Yu, Lijun Zhang

Abstract: Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers leveraging $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, which can better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Specifically, we compare the \emph{upper bound of SignSGD} with the \emph{lower bound of SGD}, illustrating that SignSGD effectively reduces the complexity by a factor of $d$ under \emph{sparse noise}, where $d$ is the problem dimension. Furthermore, we elevate this framework to the matrix domain, providing an equivalent optimal lower bound for the Muon optimizer, proving that extending the sign operator to matrices preserves this optimal scaling with dimensionality. Finally, we bridge our theoretical bounds to practice, demonstrating that the theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M parameter GPT-2 model.

Comment: Theoretical lower/upper bounds precisely characterize when SignSGD and Muon beat SGD under sparse-noise, coordinate-wise geometry.

Topic Match: Best fit is Efficiency, Compression, and Large-Scale Training because the paper studies optimizer behavior that can materially change large-scale training efficiency, with rigorous conditions for sign-based methods' gains.

Relevance: 9 Novelty: 8
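
For readers unfamiliar with the update rule being analyzed, a minimal sketch of one SGD step next to one SignSGD step. It shows only the standard coordinate-wise sign update, not the paper's analysis; the function names are illustrative.

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def sgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """Vanilla SGD: step size scales with each gradient coordinate."""
    return [p - lr * g for p, g in zip(params, grads)]


def signsgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """SignSGD: every nonzero coordinate moves by exactly lr."""
    return [p - lr * sign(g) for p, g in zip(params, grads)]


# One coordinate carries a large (possibly noisy) gradient, one a tiny one.
params = [1.0, -2.0, 0.5]
grads = [100.0, -0.001, 0.0]
after_sgd = sgd_step(params, grads, lr=0.1)       # first coordinate jumps by 10
after_sign = signsgd_step(params, grads, lr=0.1)  # every nonzero coordinate moves by 0.1
```

The sign step discards gradient magnitude entirely, which is why the paper measures stationarity in the $\ell_1$ norm and smoothness in the $\ell_\infty$ sense rather than in $\ell_2$.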


9. Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven

ArXiv ID: 2605.06014

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Ran Ben-Basat, William Kuszmaul, Michael Mitzenmacher, Amit Portnoy, Shay Vargaftik

Abstract: Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector databases. In practice, URRs are often replaced by randomized Hadamard transforms (RHTs), which preserve orthogonality while admitting fast implementations. The remaining issue is the performance for worst-case inputs. With a URR, each coordinate is individually distributed as a shifted beta distribution, which converges to a Gaussian distribution in high dimensions. Generally, one RHT is not suitable in the worst case, as individual coordinates can be far from these distributions. We show that after composing two RHTs on any $d$-sized input vector, the marginal distribution of every fixed coordinate of the normalized rotated vector is within $O(d^{-1/2})$ of a standard Gaussian both in Kolmogorov distance and in $1$-Wasserstein distance. We then plug these bounds into the analyses of modern compression schemes, namely DRIVE and QUIC-FL, and show that two RHTs achieve performance that asymptotically matches URRs. However, we show that two RHTs may not be sufficient for Vector Quantization (VQ), which often requires weak correlation across fixed-size blocks of coordinates (as opposed to only marginal distribution convergence for single coordinates). We prove that a composition of three RHTs leads to decaying coordinate covariance. This ensures that any fixed, bounded, multi-dimensional VQ codebook optimized for URRs has the same expected error when using three RHTs, up to an additive term that vanishes with the dimension. Finally, because practical inputs are rarely adversarial, we propose a linear-time ${O}(d)$ check on the input's moments to dynamically adapt the number of RHTs used at runtime to improve performance.

Comment: Proves why composing randomized Hadamard transforms matches uniform random rotations for modern quantization and compression schemes.

Topic Match: The paper directly targets quantization/compression theory and justifies a widely used practical heuristic with new worst-case guarantees.

Relevance: 9 Novelty: 8
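
A small sketch of a randomized Hadamard transform (random sign flips followed by a normalized fast Walsh-Hadamard transform) and the two-fold composition the abstract discusses. The helper names are illustrative and the input length is assumed to be a power of two; this is not the authors' code.

```python
import math
import random
from typing import List


def fwht(x: List[float]) -> List[float]:
    """Unnormalized fast Walsh-Hadamard transform, O(d log d)."""
    x = list(x)
    d = len(x)
    if d == 0 or d & (d - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x


def random_signs(d: int, rng: random.Random) -> List[float]:
    """Diagonal of a random +/-1 sign matrix."""
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]


def rht(x: List[float], signs: List[float]) -> List[float]:
    """One RHT: y = (1 / sqrt(d)) * H * diag(signs) * x, an orthogonal map."""
    scale = 1.0 / math.sqrt(len(x))
    return [scale * v for v in fwht([s * v for s, v in zip(signs, x)])]


def double_rht(x: List[float], rng: random.Random) -> List[float]:
    """Compose two RHTs with independent sign vectors."""
    d = len(x)
    return rht(rht(x, random_signs(d, rng)), random_signs(d, rng))


# A one-hot input is a bad case for a single RHT: every output coordinate
# has magnitude exactly 1/sqrt(d), which is far from Gaussian-like.
rng = random.Random(0)
one_hot = [1.0] + [0.0] * 7
single = rht(one_hot, random_signs(8, rng))
twice = double_rht(one_hot, rng)
```

Both maps preserve the Euclidean norm, so any difference between one and two RHTs lies in how the mass is spread across coordinates, which is what the paper's marginal-distribution bounds quantify.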


10. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

ArXiv ID: 2605.05696

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Bole Ma, Jan Eitzinger, Harald Köstler

Abstract: Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_K$-dimensional key, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention, deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3, factors each KV row into a position-free $c_{KV}$ and a 64-dim $k_r$ correctable in closed form; this structure motivates content-addressed caching as a natural fit rather than a GQA workaround. We present Irminsul, which extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a $\delta$-rotation rule for $k_r$. We evaluate three native MLA-MoE deployments - DeepSeek-V2-Lite (16B/2.4B), Kimi Moonlight-16B-A3B, and JoyAI-Flash (48B/3B) - with output-consistency on all three and recovery measured on the two endpoints; Irminsul recovers up to ~83% of prompt tokens above exact-prefix on agentic traffic while delivering 63% prefill energy savings per cache hit. We argue that content-addressed caching belongs in the serving stack as a first-class primitive, not a retrofit over prefix matching.

Comment: Exploits MLA factorization to enable position-independent content-addressed KV caching with closed-form correction for shifted tokens in agentic workloads.

Topic Match: Best fit is Efficiency, Compression, and Large-Scale Training because the key idea is a new cache design that materially improves LLM serving efficiency by leveraging MLA structure.

Relevance: 9 Novelty: 8
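
Why a positional shift can be corrected in closed form: RoPE applies a planar rotation per coordinate pair, and rotations compose additively, so a key cached at position $p$ can be moved to position $p+\delta$ by rotating it by $\delta$ alone. The sketch below illustrates that general property only. The interleaved pairing, the base of 10000, and the name `rope_rotate` are assumptions, and the paper's exact $\delta$-rotation rule for $k_r$ may differ.

```python
import math
from typing import List


def rope_rotate(vec: List[float], offset: float, base: float = 10000.0) -> List[float]:
    """Rotate each pair (vec[i], vec[i+1]) by the angle offset * base**(-i/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = offset * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


# A key cached at position 100 is reused at position 137.
key = [0.3, -1.2, 0.7, 2.0]
cached = rope_rotate(key, 100)
corrected = rope_rotate(cached, 37)   # apply only the shift delta = 37
direct = rope_rotate(key, 137)        # recompute from scratch for comparison
```

In this sketch `corrected` matches `direct` up to floating-point error, so the cached entry never needs the original unrotated key.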


11. Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

ArXiv ID: 2605.06316

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Ruotong Sun, Ermin Wei

Abstract: Optimizers that exploit the matrix structure of gradients are central to modern LLM pre-training, with two distinct frontiers: explicit Kronecker-factored preconditioning -- most recently KL-Shampoo, which estimates the preconditioner via KL divergence minimization -- and orthogonalization of the gradient momentum, exemplified by Muon and analyzed as steepest descent under the spectral norm. The two routes are typically developed in isolation. We make a structural observation about KL-Shampoo's Kronecker preconditioners: their eigenvalue spectra exhibit a \emph{spike-and-flat} shape -- a few dominant eigenvalues followed by an approximately uniform tail -- across layers and training stages, holding exactly under a rank-$\rho$ signal-plus-noise gradient model. We exploit this structure by restricting one of KL-Shampoo's Kronecker factors to a parametric family aligned with the spike-and-flat shape: full spectral structure on a tracked $r$-dimensional subspace, single shared eigenvalue across the remaining $n-r$ directions. On these directions, we apply orthogonalization. An identity shows that this orthogonalization recovers the algebraic form of full KL-Shampoo's preconditioner. On four pre-training scales (GPT-2 124M / 350M, LLaMA 134M / 450M), Pro-KLShampoo consistently outperforms KL-Shampoo at every subspace rank we test in validation loss, peak per-GPU memory, and wallclock time to reach each loss level.

Comment: New optimizer that unifies KL-Shampoo-style preconditioning with orthogonalization using a spike-and-flat spectral structure.

Topic Match: The core contribution is a materially new large-scale training optimizer that improves memory, speed, and loss in pretraining.

Relevance: 9 Novelty: 8


12. VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

ArXiv ID: 2605.05899

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li

Abstract: Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and become much less effective for visual-heavy inputs, where large numbers of visual tokens induce broader and less predictable expert accesses. We present VisMMoE, a VL-MoE offloading system built on a single systems insight: pruning redundant visual tokens can improve offloading not only by reducing computation, but also by reshaping expert demand. We refer to this effect as \textit{visual-expert affinity}: token pruning makes expert accesses more concentrated within layers and more stable across layers, producing a smaller and more predictable expert working set. Guided by this insight, VisMMoE combines affinity-aware token compression, lookahead expert prediction, and cache/pipeline orchestration to improve expert locality and prefetch effectiveness under tight memory budgets. We implement VisMMoE on multiple frameworks and evaluate it on representative VL-MoE models and benchmarks. VisMMoE improves end-to-end inference performance by up to 2.68x and 1.61x, respectively, over strong baselines for today's VL-MoE deployments while maintaining competitive accuracy.

Comment: Improves VL-MoE offloading by exploiting how visual token pruning reshapes expert locality and cache behavior.

Topic Match: The main idea is a systems/efficiency contribution for MoE deployment through expert-access-aware memory and routing optimization.

Relevance: 9 Novelty: 8


13. PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

ArXiv ID: 2605.06505

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas

Abstract: We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.

Comment: Introduces PAC-private zeroth-order fine-tuning where sign-quantized subset updates can achieve zero mutual-information leakage with usable utility.

Topic Match: The main contribution is a new optimization/fine-tuning mechanism with unusual privacy-efficiency tradeoffs, best grouped with training methods at scale.

Relevance: 8 Novelty: 9
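
A minimal sketch of the release rule the abstract attributes to PACZero-ZPL: release the common sign when every candidate subset agrees, otherwise release a uniform coin flip. The function name, the scalar-gradient interface, and treating an exact zero as disagreement are assumptions made for illustration, not the authors' mechanism in full.

```python
import random
from typing import Sequence, Tuple


def release_sign(subset_grads: Sequence[float], rng: random.Random) -> Tuple[int, bool]:
    """Return (released sign in {-1, +1}, whether the subsets were unanimous).

    subset_grads holds one projected zeroth-order gradient estimate per
    candidate subset (hypothetical interface).
    """
    signs = {(g > 0) - (g < 0) for g in subset_grads}
    if len(signs) == 1 and 0 not in signs:
        # Every subset agrees, so the release does not depend on the secret.
        return signs.pop(), True
    # Disagreement: release a data-independent uniform coin flip.
    return rng.choice((-1, 1)), False


rng = random.Random(0)
agree = release_sign([0.2, 1.5, 0.01], rng)    # all positive
split = release_sign([0.2, -0.1, 0.4], rng)    # mixed signs
```

Either branch produces an output that carries no information about which subset is the secret, which is the intuition behind the $I=0$ claim in the abstract.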


14. Accelerating LMO-Based Optimization via Implicit Gradient Transport

ArXiv ID: 2605.05577

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Won-Jun Jang, Si-Hyeon Lee

Abstract: Recent optimizers such as Lion and Muon have demonstrated strong empirical performance by normalizing gradient momentum via linear minimization oracles (LMOs). While variance reduction has been explored to accelerate LMO-based methods, it typically incurs substantial computational overhead due to additional gradient evaluations. At the same time, the theoretical understanding of LMO-based methods remains fragmented across unconstrained and constrained formulations. Motivated by these limitations, we propose \emph{LMO-IGT}, a new class of stochastic LMO-based methods leveraging implicit gradient transport (IGT). We further introduce a unified framework for stochastic LMO-based optimization together with a new stationarity measure, the \emph{regularized support function} (RSF), which bridges gradient-norm and Frank--Wolfe-gap notions within a common framework. By evaluating stochastic gradients at transported points, LMO-IGT accelerates convergence while retaining the single-gradient-per-iteration structure of standard stochastic LMO. Our analysis establishes that stochastic LMO achieves an iteration complexity of $\mathcal{O}(\varepsilon^{-4})$, variance-reduced LMO achieves $\mathcal{O}(\varepsilon^{-3})$ at the cost of additional gradient evaluations, and LMO-IGT achieves $\mathcal{O}(\varepsilon^{-3.5})$ using only a single stochastic gradient per iteration. Empirically, LMO-IGT consistently improves over stochastic LMO counterparts with negligible overhead. Among its instantiations, Muon-IGT achieves the strongest overall performance across evaluated settings, demonstrating that IGT provides an effective and practical acceleration mechanism for modern LMO-based optimization.

Comment: Introduces implicit gradient transport to accelerate LMO-based optimizers and unifies their theory with a shared stationarity measure.

Topic Match: The paper is about optimizer design and convergence/computation tradeoffs, which fits efficient large-scale training methods best.

Relevance: 8 Novelty: 8


15. SymDrift: One-Shot Generative Modeling under Symmetries

ArXiv ID: 2605.06140

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Samir Darouich, Vinh Tong, Lluís Pastor-Pérez, Tanja Bien, Loay Mualem, Mathias Niepert

Abstract: Generative modeling of physical systems, such as molecules, requires learning distributions that are invariant under global symmetries, such as rotations in three-dimensional space. Equivariant diffusion and flow matching models can incorporate such invariances effectively, even when trained on a non-invariant empirical distribution, but they typically rely on costly multi-step sampling. Recently, drifting models have emerged as an efficient alternative, enabling single-step generation and achieving state-of-the-art performance in generative modeling tasks. However, we show that drifting models face a symmetry-specific challenge, since an equivariant generator does not generally produce the same drifting field as the one obtained from the symmetrized target distribution. Addressing this issue would require expensive symmetrization of the empirical distribution. To avoid this cost, we propose SymDrift, a framework that makes the drifting field itself symmetry-aware. We introduce two complementary strategies: (i) a symmetrized drift in coordinate space based on optimal alignment, and (ii) a $G$-invariant embedding that removes symmetry ambiguity by construction. Empirically, SymDrift outperforms existing one-shot methods on standard benchmarks for conformer and transition state generation, while remaining competitive with significantly more expensive multi-step approaches. By enabling one-shot inference, SymDrift reduces computational overhead by up to 40$\times$ compared to existing baselines, making it promising for high-throughput applications such as virtual drug screening and large-scale reaction network exploration.

Comment: Builds symmetry-aware one-shot generative modeling by modifying the drift field itself, enabling 40x cheaper sampling under invariances.

Topic Match: Although architectural ideas matter, the standout contribution is a principled reduction in generative sampling cost via one-shot symmetry-aware design.

Relevance: 8 Novelty: 8


16. P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference

ArXiv ID: 2605.06124

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Xin Peng, Ang Gao

Abstract: Classifier-Free Guidance (CFG) is essential for high-fidelity conditional generation in flow matching, yet it imposes significant computational overhead by requiring dual forward passes at each sampling step. In this work, we address this bottleneck by introducing \textbf{P-Guide}, a framework that achieves high-quality guidance through a single inference pass by modulating only the initial latent state. We further show that, under a first-order approximation, P-Guide is equivalent to CFG in the sense that it steers generation from the prior space, without requiring explicit velocity field extrapolation during sampling. We consider both homoscedastic and \textbf{heteroscedastic} priors, and find that jointly modeling the mean and variance enables adaptive loss attenuation and improved robustness to data uncertainty. Extensive experiments demonstrate that P-Guide reduces inference latency by approximately 50\% while maintaining fidelity and prompt alignment competitive with standard dual-pass CFG baselines.

Comment: Replaces dual-pass classifier-free guidance with prior-space steering from the initial latent, giving a single-pass inference mechanism with a first-order equivalence argument.

Topic Match: The main payoff is materially cheaper inference through a new guidance mechanism, so efficiency is the clearest primary fit.

Relevance: 8 Novelty: 8


Representation Learning Theory and Structure (13)

1. The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks

ArXiv ID: 2605.06258

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Taehun Cha, Daniel Beaglehole, Adityanarayanan Radhakrishnan, Donghun Lee

Abstract: Understanding how deep neural networks learn representations remains a central challenge in machine learning theory. In this work, we propose a feature-centric framework for analyzing neural network training by relating weight updates to feature evolution. We introduce a simple identity, the Feature Learning Equation, which identifies the weight Gram matrix as the key object capturing feature dynamics. This enables us to interpret gradient descent as implicitly inducing a hypothetical evolution of features, whose covariance structure - termed the Virtual Covariance - characterizes how representations evolve during training. Building on this perspective, we introduce Target Linearity, a measure quantifying the linear alignment between features and targets. By analyzing the training and layer-wise dynamics, we show that deep networks learn to sequentially transform representations toward target-linear structure. This linearization perspective provides a unified interpretation of several empirical phenomena, including Neural Collapse and linear interpolation in generative models.

Comment: Introduces the Feature Learning Equation, using the weight Gram matrix to explain sequential feature linearization during deep-network training.

Topic Match: The paper directly studies how internal features evolve during optimization and links that to phenomena like neural collapse.

Relevance: 9 Novelty: 8


2. Structural Instability of Feature Composition

ArXiv ID: 2605.05223

Primary Topic: Representation Learning Theory and Structure

Authors: Yunpeng Zhou

Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful paradigm for disentangling feature superposition in transformer-based architectures, enabling precise control via activation steering. However, the theoretical foundations of compositional steering -- the simultaneous activation of distinct semantic latents -- remain under-explored. The prevailing Linear Representation Hypothesis often abstracts away non-linear interference effects that arise in overcomplete dictionaries. We present a geometric framework for analyzing the instability of feature unions. Modeling the activation space as a high-dimensional sparse cone manifold, we derive an asymptotic compositional-collapse threshold under a spherical dictionary model, characterized by the Gaussian mean width (statistical dimension) of the signal cone. We further show that, in the high-bias regime, ReLU rectification converts microscopic correlation-induced variance fluctuations into a systematic drift that accumulates under composition, yielding interference growth consistent with a ratchet effect. We validate the predicted scaling trends on structured semantic features extracted from CLEVR, where hierarchical correlations accelerate the transition relative to random baselines. Together, our results highlight geometric constraints on the scalability of union-based steering and motivate composition mechanisms that explicitly manage interference beyond naive linear superposition.

Comment: Provides a geometric theory for instability and collapse when composing sparse autoencoder features, directly addressing representation structure and steering limits.

Topic Match: The paper is primarily about the structure and compositional behavior of learned sparse features, with theoretical analysis of interference in representation space.

Relevance: 9 Novelty: 8


3. Topological Signatures of Grokking

ArXiv ID: 2605.06352

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Yifan Tang, Qiquan Wang, Inés García-Redondo, Anthea Monod

Abstract: We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics -- specifically, Fourier analysis and local intrinsic dimension -- persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.

Comment: Topological analysis of grokking reveals persistent-homology signatures tied to emergence of generalization.

Topic Match: The core contribution is mechanistic understanding of representation formation during training via a new topological lens.

Relevance: 9 Novelty: 8


4. When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias

ArXiv ID: 2605.06314

Primary Topic: Representation Learning Theory and Structure

Authors: Ye Su, Jian Li, Yong Liu

Abstract: Benign overfitting is well-characterized in $\ell_2$ geometries, but its behavior under the $\ell_1$ implicit bias of greedy ensembles remains challenging. The analytical barrier stems from the non-linear coupling of coordinate selection thresholds, which invalidates standard spectral resolvent tools. To isolate this algorithmic bias, we characterize the high-dimensional risk of continuous-time $\ell_2$-Boosting over $p$ features and $n$ samples. By coupling the Convex Gaussian Minimax Theorem with delicate asymptotic expansions of double-sided truncated Gaussian moments, we analytically resolve the non-smooth $\ell_1$ interpolant. Under an isotropic pure-noise model, we prove that benign overfitting fails at the linear rate: greedy selection localizes noise into sparse active sets, and the excess variance decays at a logarithmic rate $\Theta(\sigma^2/\log(p/n))$ for noise variance $\sigma^2$. We remark that while this localization mechanism should persist in the presence of signals, the exact signal-noise decomposition remains an open problem. For spiked-isotropic designs with $k^*$ head eigenvalues and $r_2 = p - k^*$ tail dimensions, the risk converges to zero when $r_{2} \gg n$, but only at a logarithmic rate $\Theta(\sigma^2/\log(r_2/n))$, which is slower than the linear decay observed in $\ell_2$ geometries. To avoid this slow convergence, we analyze the non-smooth subdifferential dynamics of the boosting flow. This yields a tuning-free early stopping rule that, under a bounded $\ell_1$-path condition, recovers the Lasso basic inequality and attains the minimax-optimal empirical prediction rate for $\ell_1$-bounded signals.

Comment: Characterizes high-dimensional risk and benign overfitting failure for continuous-time l2-Boosting under its l1 implicit bias.

Topic Match: The paper is about training dynamics and implicit bias in overparameterized learning from a theoretical perspective, fitting mechanistic understanding of learned structure.

Relevance: 8 Novelty: 8


5. End-to-End Identifiable and Consistent Recurrent Switching Dynamical Systems

ArXiv ID: 2605.06315

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Carles Balsells-Rodas, Zhengrui Xiang, Xavier Sumba, Yingzhen Li

Abstract: Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity or limited emission models, and typically rely on variational autoencoder (VAE) estimators, which introduce approximation gaps that limit the recovery of the latent structure. In this work, we address both the theoretical and practical limitations of this setting. First, we establish identifiability of a broad class of recurrent nonlinear switching dynamical systems under flexible assumptions, significantly extending prior results. Second, we introduce $\Omega$SDS, a flow-based estimator that enables exact likelihood optimization using expectation-maximisation. Through empirical validation on both synthetic and real-world data, our results demonstrate that $\Omega$SDS achieves improved disentanglement compared to VAE-based estimators and more accurate forecasting of underlying dynamics.

Comment: Establishes broader identifiability results for recurrent nonlinear switching dynamical systems and introduces an exact-likelihood flow-based estimator.

Topic Match: The paper is primarily about identifiable latent structure in sequential generative models, a strong fit for representation-learning theory.

Relevance: 8 Novelty: 8


6. MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series

ArXiv ID: 2605.05524

Primary Topic: Representation Learning Theory and Structure

Authors: Shicheng Fan, Nour Elhendawy, Jianle Sun, Ke Fang, Kun Zhang, Yihang Wang, Lu Cheng

Abstract: Causal representation learning (CRL) seeks to recover latent variables with identifiability guarantees, typically up to permutation and component-wise reparameterization under appropriate assumptions. However, identifiability does not imply interpretability: latent semantics are typically assigned post hoc by alignment with known ground-truth factors. This limitation is particularly acute in scientific time series, where underlying mechanisms are unknown and discovering interpretable structure is a primary goal. In contrast, scientific observations (such as residue-pair distances, climate indices, or process sensors) are inherently semantic, as they correspond to named physical quantities. This raises a key question: can the interpretability of observations be transferred to the identifiable latent space? We propose MOSAIC (Module discovery via Sparse Additive Identifiable Causal learning), a sparse temporal VAE that integrates temporal CRL identifiability with support recovery over observed variables. MOSAIC identifies latent variables via regime-conditioned temporal variation, and recovers for each latent a sparse set of associated observations through an additive decoder, yielding module-level interpretability. We show that ANOVA main-effect supports are identifiable under general smooth mixing functions, and provide finite-sample recovery guarantees for a tractable sparse-additive variant. Empirically, MOSAIC recovers domain-consistent variable groups across RNA molecular dynamics, solar wind, ENSO climate, the Tennessee Eastman process, and a synthetic tokamak benchmark, enabling interpretable discovery of latent mechanisms in scientific time series.

Comment: Combines identifiable temporal causal representation learning with sparse support recovery to discover interpretable latent modules in scientific time series.

Topic Match: This is directly about identifiable latent structure and interpretability of learned representations, matching representation-learning theory/structure.

Relevance: 8 Novelty: 8


7. Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow

ArXiv ID: 2605.06148

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Bowen Zheng, Yihong Luo, Tianyang Hu

Abstract: Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three consistency conditions: conditional-likelihood consistency, prior consistency, and posterior consistency. TVC shows that two-stage training preserves the reconstruction side but leaves prior consistency outside the tokenizer objective: the overall token distribution is fixed before the AR prior participates in training. Motivated by this view, we add a distribution-level prior-matching signal during tokenizer training, while keeping the reconstruction objective unchanged. We optimize this signal with a Wasserstein-gradient-flow update. For hard categorical tokens, the update reduces to a token-level contrast between an auxiliary AR model that tracks the tokenizer's current token distribution and the target AR prior. It requires only forward passes through the two AR models and does not backpropagate through either of them. The resulting tokenizer, wAR-Tok, reduces AR loss and improves generation FID on CIFAR-10 and ImageNet at comparable reconstruction quality.

Comment: Uses tripartite variational consistency to train discrete tokenizers with a prior-matching Wasserstein-flow signal so AR priors can model tokens more easily.

Topic Match: The main idea is about how learned discrete representations should be structured to align with downstream autoregressive priors, making representation formation the best fit.

Relevance: 8 Novelty: 8


8. Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors

ArXiv ID: 2605.06413

Primary Topic: Representation Learning Theory and Structure

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Richard Bergna, Stefan Depeweg, José Miguel Hernández-Lobato

Abstract: Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesian optimization, acquisition should prioritize epistemic uncertainty about the latent signal rather than irreducible aleatoric observation noise. We show that this epistemic--aleatoric split is not identifiable in general from the posterior predictive distribution alone, even when that distribution is known exactly. We then exploit a distinctive advantage of PFNs: because the synthetic data-generating process is under our control, each task can contain an explicit latent signal and noise function, and the generator can provide query-level labels for both the noiseless target and the observation-noise variance. We use these labels to train a decoupled PFN with separate latent-signal and aleatoric heads. The observation-level predictive is induced by convolving the latent signal distribution with the learned noise model. Empirically, epistemic-only acquisition mitigates the failure mode of total-variance exploration in noisy and heteroscedastic settings. In matched comparisons, decoupled models usually improve over tuned observation-level baselines, with the clearest gains in HPO; in broader sweeps, a decoupled model obtains the best average rank in both HPO and synthetic BO.

Comment: Uses structured synthetic priors in PFNs to make epistemic and aleatoric uncertainty separately identifiable and directly learnable.

Topic Match: The key contribution is identifiability and structure of learned predictive representations, even though the motivation includes downstream sequential decision-making.

Relevance: 8 Novelty: 8
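
Illustration (not from the paper): the abstract says the observation-level predictive is obtained by convolving the latent-signal distribution with the learned noise model. A minimal sketch of why that matters for acquisition, assuming both distributions are Gaussian (so the convolution simply adds variances). The function name, the candidate points `x1`/`x2`, and all numbers are hypothetical.

```python
def observation_predictive(signal_mean, epistemic_var, aleatoric_var):
    """Convolve N(signal_mean, epistemic_var) with zero-mean noise N(0, aleatoric_var).

    For Gaussians the convolution keeps the mean and adds the variances.
    """
    return signal_mean, epistemic_var + aleatoric_var


# Hypothetical query points: (signal mean, epistemic variance, aleatoric variance).
candidates = {
    "x1": (0.0, 0.1, 2.0),  # very noisy, but the latent signal is well known
    "x2": (0.0, 0.5, 0.1),  # quiet, but the latent signal is still uncertain
}

# Total-variance acquisition is drawn to the noisy point ...
by_total = max(candidates, key=lambda x: observation_predictive(*candidates[x])[1])
# ... while epistemic-only acquisition picks the point where learning is possible.
by_epistemic = max(candidates, key=lambda x: candidates[x][1])
```

In this toy setting the two rules disagree, which is the failure mode of total-variance exploration that the abstract describes for noisy and heteroscedastic problems.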


9. Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation

ArXiv ID: 2605.05446

Primary Topic: Representation Learning Theory and Structure

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Chengyu Cui, Gongjun Xu

Abstract: Nonconvex methods have emerged as a dominant approach for low-rank matrix estimation, a problem that arises widely in machine learning and AI for learning and representing high-dimensional data. Existing analyses for these methods often require additional regularization to mitigate nonconvexity, even though such regularization is often unnecessary in practice. Moreover, most analyses rely on problem-specific arguments that are difficult to generalize to more complex settings. In this paper, we develop a theoretical framework for studying nonconvex procedures across a broad class of low-rank matrix estimation problems. Rather than focusing on a specific model, we reveal a fundamental mechanism that explains why nonconvex procedures can behave well in low-rank estimation. Our key device is a benign regularizer that does not alter the original update rule, but yields an equivalent locally strongly convex formulation of the algorithm. This perspective uncovers a disguised convexity inherent in the nonconvex procedure and provides a new route to theoretical guarantees for nonconvex low-rank matrix estimation.

Comment: Provides a general theory explaining why nonconvex low-rank estimation behaves like a locally strongly convex problem via a benign regularizer that leaves updates unchanged.

Topic Match: The core contribution is mechanistic theory for low-rank representation learning, explaining optimization behavior across a broad class of matrix estimation problems rather than introducing an application-specific solver.

Relevance: 8 Novelty: 8


10. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

ArXiv ID: 2605.06582

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Adhiraj Banerjee, Vipul Arora

Abstract: Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On TIMIT retrieval, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.

Comment: Treats tokenization as sequence-level self-alignment rather than local quantization, learning compact symbolic sequences that preserve cross-view edit structure.

Topic Match: The paper is mainly about how discrete symbolic representations should be learned and organized at the sequence level, making representation structure the best fit.

Relevance: 8 Novelty: 8


11. Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective

ArXiv ID: 2605.06172

Primary Topic: Representation Learning Theory and Structure

Authors: Meira Iske, Carola-Bibiane Schönlieb

Abstract: Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge allows us to analyze the distributional approximation power of bi-Lipschitz normalizing flows and, conversely, derive deterministic convergence guarantees for diffusion-based transport. Our key idea is to use the probability flow ODE to link regularity of the score to regularity of the induced transport maps. We verify score regularity for broad target densities, including compactly supported densities, Gaussian convolutions of compactly supported measures and finite Gaussian mixtures. We obtain a universal distributional approximation result: Gaussian pullbacks induced by bi-Lipschitz variance-preserving transport maps are $L^1$-dense among all probability densities. For Gaussian convolution targets, we further obtain convergence in Kullback-Leibler divergence without early stopping.

Comment: Characterizes the expressivity of bi-Lipschitz normalizing flows via the probability-flow ODE of diffusion models.

Topic Match: The work is foundational theory on what classes of distributions structured generative representations can realize.

Relevance: 8 Novelty: 8


12. When Graph Language Models Go Beyond Memorization

ArXiv ID: 2605.06239

Primary Topic: Representation Learning Theory and Structure

Authors: Masatsugu Yamada, Mahito Sugiyama

Abstract: It remains unclear whether graph language models learn structural regularities or merely memorize training graphs; this cannot be resolved by current aggregate fidelity metrics alone. We develop a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to disentangle memorization from structural alignment. Using this framework, we show that graph language models can acquire structural regularities beyond memorization at scale, primarily in the high-frequency regime. This is supported by the following empirical evidence: On five TU benchmarks, LLaMA-style graph language models reach high subgraph-rank correlation, yet their alignment is matched or exceeded by the memorization bootstrap in most cases. At small scale, under our bootstrap diagnostic, fidelity is largely indistinguishable from verbatim recall. In contrast, at large scale with 3.75M graphs, verbatim memorization drops sharply while rank correlation remains near ceiling. Crucially, in a separate fixed-subsample analysis, frequent subgraph mining restricted to the novel-only subset closely tracks the corresponding all-generation Spearman correlation, providing evidence that the alignment is not driven solely by verbatim recall. Across all scales, high-frequency patterns are well reproduced, while rare patterns remain poorly covered, and this deficit narrows only marginally as capacity increases. We observe the same scale-dependent crossover under two distinct graph serializations (canonical DFS code and action sequences), providing evidence of robustness in our analysis.

Comment: Separates memorization from structural learning in graph language models using calibrated subgraph-frequency diagnostics across scale.

Topic Match: This is fundamentally about diagnosing what structural regularities learned representations capture beyond memorization.

Relevance: 8 Novelty: 8


13. TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

ArXiv ID: 2605.05980

Primary Topic: Representation Learning Theory and Structure

Also Matches: Memory Structures and Agent Memory Systems

Authors: Yuan Sui, Yulin Chen, Yibo Li, Xue Jiang, Yufei He, Yihong Dong, Xiaoxin He, Tianyu Gao, Bryan Hooi

Abstract: When language model agents tackle complex software engineering tasks, they often degrade over long trajectories, which we define as agent drift. We focus on two recurring failure modes: overthinking, where the agent repeatedly reasons over information it already has, and overacting, where it issues tool calls without integrating recent observations or acquiring new evidence. In this paper, we introduce TACT (Think-Act Calibration via activation Steering) to detect and mitigate agent drift in the residual stream before it surfaces as a behavioral failure. Specifically, we label trajectory steps as overthinking, overacting, or calibrated, and find that their hidden states can be separated linearly along two drift axes, pointing from calibrated behavior toward each failure mode (AUC $\approx$ 0.9). To mitigate agent drift, we project each step's activation onto these axes at test time and pull drifted ones back toward the calibrated region. Experiments show that TACT outperforms unsteered baselines across SWE-bench Verified, Terminal-Bench 2.0, and CLAW-Eval, lifting average resolve rate by $+5.8$ pp on Qwen3.5-27B and $+4.8$ pp on Gemma-4-26B-A4B-it while cutting steps-to-resolve by up to $26\%$. These gains frame agent drift as a steerable direction in the residual stream, and position TACT as a viable handle for reliable long-horizon agents.

Comment: Finds linear residual-stream directions for agent overthinking/overacting and mitigates them by activation steering at test time.

Topic Match: The core contribution is mechanistic: identifying steerable internal activation directions tied to failure modes in long-horizon agent behavior.

Relevance: 8 Novelty: 8
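
Illustration (not from the paper): the abstract describes projecting each step's activation onto a drift axis and pulling drifted activations back toward the calibrated region. A minimal sketch of that idea, assuming a single drift axis, a scalar projection threshold, and a simple "subtract the excess" rule. The function `steer`, the `threshold` and `strength` parameters, and the toy vectors are hypothetical; the paper's actual axes and correction rule may differ.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def steer(h, axis, threshold=0.0, strength=1.0):
    """Pull activation h back along `axis` if its projection exceeds `threshold`."""
    norm = math.sqrt(dot(axis, axis))
    unit = [a / norm for a in axis]      # normalise the drift direction
    excess = dot(h, unit) - threshold    # how far past the calibrated region
    if excess <= 0:
        return list(h)                   # calibrated step: leave untouched
    return [x - strength * excess * u for x, u in zip(h, unit)]


h = [3.0, 1.0]          # toy residual-stream activation
axis = [2.0, 0.0]       # toy drift axis (calibrated -> overthinking)
steered = steer(h, axis, threshold=1.0)
```

Only the component along the drift axis changes; the orthogonal component of the activation is preserved.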


Memory Structures and Agent Memory Systems (9)

1. Belief Memory: Agent Memory Under Partial Observability

ArXiv ID: 2605.05583

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan, Xiuying Chen

Abstract: LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API X failed" from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce self-reinforcing error: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time. To address this issue, we propose BeliefMem, which shifts the memory paradigm from committing to a single conclusion per observation to retaining multiple candidate conclusions with their probabilities. Concretely, BeliefMem stores the candidate conclusions as separate memory entries, each carrying a probability that is updated via Noisy-OR rules as new observations arrive. At retrieval, all candidates surface together with their probabilities, keeping alternatives visible to the agent. Since each conclusion in memory retains its probability, BeliefMem preserves the uncertainty that the deterministic paradigm discards, enabling the agent to act with high confidence on well-evidenced knowledge while retaining the capacity to update its confidence when new evidence arrives. Empirical evaluations on LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance, remarkably outperforming well-known baselines. More broadly, such probabilistic memory produces substantial gains and explores a new direction for agent memory in partially observable environments.

Comment: Stores multiple candidate conclusions with probabilities and updates them via Noisy-OR, rather than collapsing each observation into one memory entry.

Topic Match: The central contribution is a new memory update and retrieval principle for agents under partial observability.

Relevance: 10 Novelty: 8
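
Illustration (not from the paper): the abstract says each candidate conclusion carries a probability that is updated via Noisy-OR rules. A minimal sketch of the standard Noisy-OR combination, assuming each new observation contributes an independent support value in [0, 1]. The function `noisy_or_update`, the memory layout, and the numbers are hypothetical.

```python
def noisy_or_update(prior: float, support: float) -> float:
    """Combine an existing belief with one new supporting observation.

    Noisy-OR: the conclusion stays false only if every independent
    piece of evidence fails to establish it.
    """
    if not (0.0 <= prior <= 1.0 and 0.0 <= support <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 - (1.0 - prior) * (1.0 - support)


# Hypothetical memory: two candidate conclusions kept for one ambiguous observation.
memory = {"API X is down": 0.3, "API X is rate-limited": 0.5}

# A new observation lends (assumed) support 0.4 to the first candidate only.
memory["API X is down"] = noisy_or_update(memory["API X is down"], 0.4)
```

Both candidates remain in memory with their probabilities, so the alternative is still visible at retrieval time instead of being discarded.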


2. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

ArXiv ID: 2605.05686

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Representation Learning Theory and Structure

Authors: Qiyao Liang, Risto Miikkulainen, Ila Fiete

Abstract: Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task--entity identifiers mapped to unique codes with PM installed via LoRA adapters--where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin--the hidden state's distance to the nearest memorized basin--reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar\Delta)$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it--and this erasure worsens with scale.

Comment: Models transformer factual recall and hallucination as hidden-state attractor geometry across parametric and working memory interactions.

Topic Match: The paper is fundamentally about internal memory organization, recall failure, and memory-state geometry in transformers.

Relevance: 9 Novelty: 8
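
Illustration (not from the paper): the abstract defines geometric margin as the hidden state's distance to the nearest memorized basin. A minimal sketch, assuming each basin is summarised by a single centre point and distance is Euclidean. The function name, the basin centres, and the toy hidden state are hypothetical.

```python
import math


def geometric_margin(hidden_state, basin_centers):
    """Distance from a hidden state to the closest memorised basin centre."""
    return min(math.dist(hidden_state, center) for center in basin_centers)


# Toy 2-D example: two memorised basins and one query hidden state.
basin_centers = [[3.0, 4.0], [1.0, 0.0]]
margin = geometric_margin([0.0, 0.0], basin_centers)
```

A large margin would indicate that no memorized basin is nearby, which the abstract associates with hallucination even when output entropy stays low.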


3. SkillOS: Learning Skill Curation for Self-Evolving Agents

ArXiv ID: 2605.06614

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra, Rui Meng, Chun-Liang Li, Yizhu Jiao, Kaiwen Zha, Maohao Shen, Vishy Tirumalashetty, George Lee, Jiawei Han, Tomas Pfister, Chen-Yu Lee

Abstract: LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor that retrieves and applies skills with a trainable skill curator that updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks evaluate these updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency, with the learned skill curator generalizing across different executor backbones and task domains. Further analyses show that the learned curator produces more targeted skill use, while the skills in SkillRepo evolve into more richly structured Markdown files that encode higher-level meta-skills over time.

Comment: Trains a separate curator to store, update, and retrieve reusable skills in an external repository for self-evolving agents.

Topic Match: The contribution is a new principle for agent memory as curated external skill storage, update, and reuse over task streams.

Relevance: 9 Novelty: 8


4. Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

ArXiv ID: 2605.05373

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: David Leeftink, Max Hinne, Marcel van Gerven

Abstract: A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement learning address this by encoding history into latent state representations, their internal dynamics remain uninterpretable black boxes. This paper establishes a formal link between these hidden states and the Pontryagin minimum principle (PMP) from optimal control. We demonstrate that for standard recurrent architectures, latent representations map directly to PMP co-states, which allows the readout layer to be interpreted as performing Hamiltonian minimization. Because standard reward maximization does not naturally discover this alignment, we introduce a PMP-derived co-state loss to explicitly structure the internal dynamics. Empirically, this approach matches or improves performance on partially observable DMControl tasks, and is robust against zero-shot out-of-distribution sensor masking. By framing recurrent networks as dynamic processes governed by the minimum principle, we provide a principled approach to designing robust continuous control policies.

Comment: Structures recurrent RL hidden states using Pontryagin co-states, yielding a principled recurrent memory design under partial observability.

Topic Match: Primary fit is memory systems because the core contribution is a new principle for organizing and supervising recurrent latent memory states in RL agents.

Relevance: 8 Novelty: 8


5. Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

ArXiv ID: 2605.06225

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous

Abstract: Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rather than materializing reminder content throughout the prompt cache, MI treats steering as selective KV allocation, injecting latent slots only where the model routes to them. On matched personality-steering tasks, MI gives the best overall control--drift trade-off, remaining competitive with prompting while consistently outperforming CAA. On updateable guidance, MI supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On structured reasoning, MI outperforms visible prompting on HARDMath and PHYSICS (10/12 subject$\times$mode cells), serving as proxies for structured reasoning in verifiable domains, while cutting content-matched KV storage by up to 118$\times$. These results position MI as a powerful steering method when guidance is persistent, structured, or expensive to keep in the visible transcript.

Comment: Steers LLMs by injecting text-derived KV banks only into selected layers, treating guidance as latent memory allocation.

Topic Match: Primary fit is memory systems because the core idea is a new mechanism for storing and updating guidance in the model's latent memory substrate rather than the visible context.

Relevance: 8 Novelty: 8


6. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

ArXiv ID: 2605.05806

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics

Authors: Elad Hoffer, Yochai Blau, Ron Banner, Daniel Soudry, Boris Ginsburg

Abstract: Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce INTRA (INTrinsic Retrieval via Attention), a framework where decoder attention queries score pre-encoded evidence chunks that are then directly reused as context for generation. By construction, INTRA unifies retrieval and generation, eliminating the retriever-generator mismatch typical of RAG pipelines. This design also amortizes context encoding by reusing precomputed encoder states across queries. On question-answering benchmarks, INTRA outperforms strong engineered retrieval pipelines on both evidence recall and end-to-end answer quality. Our results demonstrate that attention-based models already possess a retrieval mechanism that can be elicited, rather than added as an external module.

Comment: Turns decoder attention over pre-encoded evidence into an intrinsic retrieval mechanism, unifying retrieval and generation inside the model architecture.

Topic Match: The main idea is a learned internal memory/retrieval mechanism rather than standard external RAG plumbing, making memory systems the clearest fit.

Relevance: 8 Novelty: 8


7. More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

ArXiv ID: 2605.05716

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics

Authors: Ming Liu

Abstract: LLM agent systems are built by stacking scaffolding components (planning, tools, memory, self-reflection, retrieval) assuming more is better. We study cross-component interference (CCI): degradation when components interact destructively. We run a full factorial experiment over all 2^5=32 subsets of five components on HotpotQA and GSM8K with Llama-3.1-8B/70B (96 conditions, up to 10 seeds). The All-In system is consistently suboptimal: on HotpotQA, a single-tool agent surpasses All-In by 32% (F1 0.233 vs 0.177, p=0.023); on GSM8K, a 3-component subset beats All-In by 79% (0.43 vs 0.24, p=0.010). Optimal component count is task-dependent (k*=1-4) and scale-sensitive: at 70B, combinations that hurt at 8B provide gains, though All-In still trails the best subset. We fit a main-effects regression (R^2=0.916, adj-R^2=0.899, LOOCV=0.872), compute exact Shapley values, and find 183/325 submodularity violations (56.3%), showing greedy selection is unreliable. A three-body synergy among Tool Use, Self-Reflection, and Retrieval (INT_3=+0.175, 95% CI [+0.003,+0.351]) is reported as exploratory. CCI replicates across model families (Qwen2.5) and is robust to prompt paraphrasing. Our findings suggest maximally-equipped agent defaults should be replaced by task-specific subset selection via interaction-aware analysis.

Comment: Full-factorial study of planning, tools, memory, retrieval, and reflection shows destructive cross-component interference rather than monotonic gains in agent scaffolds.

Topic Match: While broader than memory, it directly analyzes how memory interacts with other agent components and when adding memory harms system behavior.

Relevance: 8 Novelty: 8
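
Illustrative sketch: the abstract above describes a full factorial sweep over all 2^5 = 32 component subsets followed by exact Shapley values. The snippet below shows that procedure in miniature. The `toy_score` function and all of its numbers are made up for illustration and are not the paper's data; only the enumeration and the Shapley formula are standard.

```python
from itertools import combinations
from math import factorial

COMPONENTS = ["planning", "tools", "memory", "reflection", "retrieval"]


def toy_score(subset):
    """Hypothetical task score for a set of enabled components (invented numbers)."""
    gain = {"planning": 0.02, "tools": 0.12, "memory": 0.01,
            "reflection": 0.03, "retrieval": 0.04}
    score = 0.10 + sum(gain[c] for c in subset)
    # Invented interference term: each component beyond two costs 0.05.
    score -= 0.05 * max(0, len(subset) - 2)
    return score


def all_subsets(items):
    """Yield every subset of items as a frozenset (2^n subsets in total)."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def exact_shapley(items, value):
    """Exact Shapley value of each item under the set function `value`."""
    n = len(items)
    phi = {}
    for i in items:
        others = [c for c in items if c != i]
        total = 0.0
        for s in all_subsets(others):
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


# Full factorial table: one score per subset.
scores = {s: toy_score(s) for s in all_subsets(COMPONENTS)}
best = max(scores, key=scores.get)
all_in = frozenset(COMPONENTS)
shapley = exact_shapley(COMPONENTS, lambda s: scores[frozenset(s)])

print("best subset:", sorted(best), "score:", round(scores[best], 3))
print("all-in score:", round(scores[all_in], 3))
```

In this toy setup the best subset is smaller than the all-in configuration, which mirrors the qualitative claim of the abstract without reproducing any of its measurements.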


8. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

ArXiv ID: 2605.06285

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Yijia Zheng, Marcel Worring

Abstract: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.

Comment: Moves multi-step agentic retrieval and reasoning into continuous latent space, treating retrieval as a learned latent memory operation with large latency reduction.

Topic Match: The core idea is a new memory/retrieval mechanism in latent space rather than a standard RAG pipeline, making memory systems the best fit.

Relevance: 8 Novelty: 8


9. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

ArXiv ID: 2605.06365

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Josh Rosen, Seth Rosen

Abstract: Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit conversational state, making it difficult to preserve stable work products, isolate irrelevant updates, or propagate changes through intermediate artifacts. We introduce execution lineage: an execution model in which AI-native work is represented as a directed acyclic graph (DAG) of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. The goal is not to make the model a better one-shot writer, but to make evolving AI-generated work maintainable under change. We compare execution-lineage replay against loop-centric update baselines on two controlled policy-memo update tasks. In an unrelated-branch update, DAG replay preserved the final memo exactly in all runs, with zero churn and zero unrelated-branch contamination, while loop baselines regenerated the memo and frequently imported unrelated context. In an intermediate-artifact edit, all systems reflected the new constraint in the final memo, but only DAG replay achieved perfect upstream preservation, downstream propagation, unaffected-artifact preservation, and cross-artifact consistency. These results show that final answer quality and maintained-state quality are distinct. Strong loop baselines can remain competitive at producing polished final outputs when the task is a bounded synthesis/update problem and all current sources fit in context, but immediate task success can mask partial state inconsistency that may compound over future revisions. Execution lineage provides stronger guarantees about what should change, what should remain stable, and how work evolves across revisions.

Comment: Introduces execution lineage as a DAG-based model of artifact production with explicit dependencies and replay, separating maintained-state quality from final-answer quality.

Topic Match: The contribution is fundamentally about how agent systems store, preserve, and update intermediate state over revisions, which is a strong match to memory systems.

Relevance: 8 Novelty: 8
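
Illustrative sketch: the abstract above describes work as a DAG of artifact-producing computations with identity-based replay. A minimal reading of that idea is shown below, assuming that a node's identity is a hash of its name plus its source content or its dependencies' identities. The `Node` and `Lineage` classes and the example node names are hypothetical and are not taken from the paper.

```python
import hashlib


class Node:
    """One artifact-producing step; `deps` names the upstream nodes it reads."""

    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, tuple(deps)


class Lineage:
    """Replays a DAG, recomputing only nodes whose input identity changed."""

    def __init__(self, nodes):
        # Assumption: nodes are supplied in topological order.
        self.nodes = {n.name: n for n in nodes}
        self.cache = {}  # name -> (identity, artifact)
        self.runs = []   # names recomputed during the most recent replay

    def replay(self, sources):
        self.runs = []
        artifacts, identities = {}, {}
        for name, node in self.nodes.items():
            if node.deps:
                payload = "|".join(identities[d] for d in node.deps)
            else:
                payload = repr(sources[name])
            identity = hashlib.sha256((name + ":" + payload).encode()).hexdigest()
            identities[name] = identity
            cached = self.cache.get(name)
            if cached is not None and cached[0] == identity:
                artifacts[name] = cached[1]  # unchanged inputs: reuse artifact
            else:
                if node.deps:
                    inputs = [artifacts[d] for d in node.deps]
                else:
                    inputs = [sources[name]]
                artifacts[name] = node.fn(*inputs)
                self.cache[name] = (identity, artifacts[name])
                self.runs.append(name)
        return artifacts


lineage = Lineage([
    Node("budget", lambda text: text.upper()),
    Node("legal", lambda text: text.upper()),
    Node("memo", lambda b: "MEMO: " + b, deps=["budget"]),
])
first = lineage.replay({"budget": "cut 5%", "legal": "v1"})
# Update only the unrelated "legal" branch; the memo should be reused as-is.
second = lineage.replay({"budget": "cut 5%", "legal": "v2"})
print(lineage.runs, second["memo"])
```

Because the memo depends only on the budget branch, an edit to the unrelated branch leaves the cached memo untouched, which is the zero-churn behavior the abstract reports for DAG replay.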


World Models, Exploration, and Open-Ended Reinforcement Learning (14)

1. HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning

ArXiv ID: 2605.05951

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics, Memory Structures and Agent Memory Systems

Authors: Haoyun Tang, Haodong Cui, Keyao Xu, Kun Wang, Zhandong Mei

Abstract: World models enable model-based planning through learned latent dynamics, but imagined rollouts become unstable as the planning horizon grows or the dynamics distribution shifts. We argue that this instability reflects two missing structures in planner-facing latents: history-conditioned memory for approximate Markov completeness, and geometric organization that separates configuration, momentum, and task semantics. We propose HaM-World (HMW), a structured world model that decomposes the latent state into a canonical (q, p) subspace and a context subspace c, while using Mamba selective state-space memory as the history-conditioned input to the same latent dynamics. Within this interface, (q, p) evolves through an energy-derived Hamiltonian vector field plus learnable residual/control dynamics, while c captures semantic, dissipative, and non-conservative factors. This gives the planner a single latent state shared by dynamics prediction, reward/value estimation, imagined rollouts, and CEM action search. On four DeepMind Control Suite tasks, HaM-World reaches the highest Avg. AUC (117.9, +9.5%), reduces long-horizon rollout error to 45% of a strong baseline model, and wins 11/12 k in {3,5,7} MSE cells. Under 12 OOD perturbations spanning dynamics shifts, action delay, and observation masking, HaM-World achieves the highest return in every condition, with average OOD-return gains of 10.2% on Finger Spin and 13.6% on Reacher Easy. Mechanism diagnostics further show bounded action-free Hamiltonian-energy drift, structured energy variation under policy rollouts, and coherent control-induced energy transfer, supporting the intended Soft-Hamiltonian dynamics design.

Comment: Combines selective state-space memory with structured Hamiltonian latent dynamics to stabilize long-horizon world-model planning under shift.

Topic Match: This is a direct hit on planner-facing world models: it proposes a new latent dynamics structure plus memory mechanism to improve model-based RL.

Relevance: 10 Novelty: 8
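
Illustrative sketch: the abstract above has the (q, p) subspace evolve through an energy-derived Hamiltonian vector field and reports bounded action-free energy drift. The snippet below shows what that diagnostic looks like for a hand-written energy H(q, p) = p^2/2 + q^2/2 integrated with a leapfrog step. The quadratic energy, the scalar state, and the `control` term are assumptions standing in for the learned energy and residual/control dynamics, which the abstract does not specify.

```python
def grad_potential(q):
    """dV/dq for the assumed potential V(q) = q^2 / 2."""
    return q


def energy(q, p):
    """Assumed Hamiltonian H(q, p) = p^2 / 2 + q^2 / 2."""
    return 0.5 * p * p + 0.5 * q * q


def leapfrog_step(q, p, dt, control=0.0):
    """One leapfrog step of dq/dt = dH/dp, dp/dt = -dH/dq + control."""
    p_half = p + 0.5 * dt * (control - grad_potential(q))
    q_new = q + dt * p_half
    p_new = p_half + 0.5 * dt * (control - grad_potential(q_new))
    return q_new, p_new


def max_energy_drift(q, p, dt, steps):
    """Largest |H_t - H_0| over an action-free rollout (control = 0)."""
    e0 = energy(q, p)
    drift = 0.0
    for _ in range(steps):
        q, p = leapfrog_step(q, p, dt)
        drift = max(drift, abs(energy(q, p) - e0))
    return drift


drift = max_energy_drift(q=1.0, p=0.0, dt=0.1, steps=1000)
print("max action-free energy drift:", drift)
```

With zero control the energy error stays small and does not grow with rollout length, while a nonzero `control` injects or removes energy. That is the kind of separation the mechanism diagnostics in the abstract refer to.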


2. Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

ArXiv ID: 2605.06298

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Roussel Desmond Nzoyem, Mauro Comi

Abstract: Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.

Comment: Introduces weight-space world models using INR weights as structured latent state, avoiding heavy decoders and enabling disentangled controllable dynamics.

Topic Match: The central idea is a new structured latent-state representation for world modeling and controllable forecasting, making world models the best fit.

Relevance: 9 Novelty: 9


3. Prediction and Empowerment: A Theory of Agency through Bridge Interfaces

ArXiv ID: 2605.06346

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Memory Structures and Agent Memory Systems

Authors: Richard Csaky

Abstract: We study agency under partial observability in deterministic physical or simulated worlds, where apparent randomness arises from uncertainty over initial conditions, fixed law bits, and unrolled exogenous noise. We model sensing and actuation as bridge interfaces split between agent-controlled parameters and environment-controlled channel state, inducing a deterministic POMDP through a prior over latent microstates and many-to-one observation coarsening. Within this framework, we prove a separation between prediction, compression, and empowerment. Perfect prediction can be achieved either by identifying the hidden quotient relevant to the target family or by overwrite control that makes the future target action-determined; high empowerment alone is insufficient. Under refinable interfaces and sufficient memory, action-conditioned observation-compression progress reduces posterior uncertainty about the latent quotient, and when refinement requires steering world-side channel conditions, this creates target-conditioned interface empowerment. A bit-string specialization with a conserved information budget makes the resulting tradeoff explicit: prediction by identification requires internal capacity at least the relevant latent entropy, whereas overwrite control requires terminal action capacity over the controlled quotient. For modern AI agents, the results suggest a design principle rather than a theorem of inevitability: objectives should distinguish hidden-state identification, interface refinement, task-relevant controllability, and mere overwrite or distractor control. Human-AI alignment is partly an interface-design problem, where the relevant bridge is between human intent, agent internal state, external tools, and world-side channel conditions. This is a working draft: feedback and criticism are most welcome.

Comment: Provides a theory separating prediction, compression, and empowerment under partial observability, with direct implications for world-model-based agency.

Topic Match: The paper is a conceptual theory of agency, hidden-state identification, and controllability under partial observability, making it a strong world-models fit.

Relevance: 8 Novelty: 9


4. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

ArXiv ID: 2605.06638

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov

Abstract: Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.

Comment: Uses a controlled logic environment to show RL reasoning compute scales with horizon and that logical expressiveness strongly changes the scaling law.

Topic Match: Despite being on LLM reasoning, the paper's real contribution is foundational RL scaling analysis in a controlled interactive reasoning environment.

Relevance: 8 Novelty: 8
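
Illustrative sketch: the abstract above reports a power law $T \propto D^{\gamma}$ between RL training compute and reasoning depth. A standard way to estimate such an exponent is ordinary least squares in log-log space, shown below. The data points are synthetic (generated with a known exponent of 1.5) and are not the paper's measurements; `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(depths, compute):
    """Fit compute = c * depth**gamma by least squares on log-log data.

    Returns (gamma, c, r_squared), where r_squared is measured in log space.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    log_c = mean_y - gamma * mean_x
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return gamma, math.exp(log_c), 1.0 - ss_res / ss_tot


depths = [2, 4, 8, 16, 32]
compute = [3.0 * d ** 1.5 for d in depths]  # synthetic, noise-free
gamma, c, r2 = fit_power_law(depths, compute)
print(f"gamma={gamma:.3f}, c={c:.3f}, R^2={r2:.4f}")
```

Comparing the fitted `gamma` across logic fragments is how one would check the claim that the exponent grows with expressiveness.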


5. Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing

ArXiv ID: 2605.06373

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Leon Halgryn (University of Twente), Sophie Langer (Ruhr-Universität Bochum), Janusz M. Meylahn (University of Twente), E. Moritz Hahn (University of Twente)

Abstract: Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $\tau$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $\tau$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $\tau$-mixing data. Moreover, we derive the sample complexity of DQN under $\tau$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.

Comment: Provides finite-sample DQN guarantees under temporally dependent replay data via a τ-mixing analysis instead of the usual false independence assumption.

Topic Match: This is foundational RL theory on value learning under realistic dependent data, squarely within the world-models/RL bucket even though it is not about exploration.

Relevance: 8 Novelty: 8


6. A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration

ArXiv ID: 2605.05791

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Manuel Haussmann, Mustafa Mert Çelikok, Melih Kandemir

Abstract: While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to tabular or linearizable settings. We identify the cause of this gap as an emergent isolation of three traditions: (i) measure-theoretic MDP foundations on general spaces limit their analysis to exact dynamic programming and ignore all error sources of a learning process; (ii) deterministic error propagation analysis addresses the approximation error via concentrability coefficients without a finite-sample analysis of the estimation error; and (iii) PAC generalization bounds characterize the estimation errors of simplified topologies. We bridge these traditions with a unified theoretical framework for fitted Q-iteration (FQI) on general measurable Borel spaces. Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces. These results lay the necessary foundations for the formal analysis of many modern deep RL algorithms.

Comment: Builds a unified finite-sample theory for fitted Q-iteration on general measurable spaces with adaptive data via sequential Rademacher complexity.

Topic Match: This is core reinforcement-learning theory for off-policy value learning under realistic adaptive data collection.

Relevance: 8 Novelty: 8


7. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

ArXiv ID: 2605.05812

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Armaan A. Abraham, Lucy Xiaoyang Shi, Chelsea Finn

Abstract: Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.

Comment: Adds n-step inequality constraints as a hinge-loss backstop against compounding bootstrap error in Q-learning without extra forward passes.

Topic Match: The paper proposes a foundational stabilization mechanism for off-policy value learning, directly within core RL methodology.

Relevance: 8 Novelty: 8
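
Illustrative sketch: the abstract above turns an n-step lower bound on the optimal action value into a hinge penalty. One plausible reading is that Q(s_t, a_t) should not fall below the discounted return of the n observed rewards plus the discounted greedy value at s_{t+n}, and that only violations are penalized. The exact loss used by LQL is not given in the abstract, so the function names, the scalar interface, and the unsquared hinge below are assumptions.

```python
def n_step_lower_bound(rewards, bootstrap_value, gamma):
    """Discounted return of the observed rewards plus a discounted bootstrap."""
    n = len(rewards)
    observed = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return observed + (gamma ** n) * bootstrap_value


def hinge_penalty(q_sa, rewards, next_q_values, gamma=0.99):
    """Penalize Q(s_t, a_t) only when it falls below the n-step bound.

    q_sa:          current estimate Q(s_t, a_t)
    rewards:       observed rewards r_t, ..., r_{t+n-1}
    next_q_values: estimates Q(s_{t+n}, a) for every action a
    """
    bound = n_step_lower_bound(rewards, max(next_q_values), gamma)
    return max(0.0, bound - q_sa)


# Bound = 1 + 0.5 * 1 + 0.25 * 2 = 2.0, so an estimate of 1.0 is penalized
# by 1.0 and an estimate of 5.0 is not penalized at all.
violated = hinge_penalty(1.0, [1.0, 1.0], [0.0, 2.0], gamma=0.5)
satisfied = hinge_penalty(5.0, [1.0, 1.0], [0.0, 2.0], gamma=0.5)
print(violated, satisfied)
```

In a training loop this term would be added to the usual TD loss, and the values in `next_q_values` would come from network outputs already computed for the TD target, consistent with the abstract's claim of no extra forward passes.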


8. Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

ArXiv ID: 2605.05481

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Dillon Sandhu, Ronald Parr

Abstract: We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is unknown and cannot be sampled for the purposes of training the value function. Conservative updates solve this problem, but at the cost of shrinking the policy update. This paper explores an alternative solution, Approximate Next Policy Sampling (ANPS), which addresses the problem by modifying the training distribution rather than constraining the policy update. ANPS is satisfied if the distribution of the training data approximates that of the next policy. To demonstrate the feasibility and efficacy of ANPS, we introduce Stable Value Approximate Policy Iteration (SV-API). SV-API modifies the standard approximate policy iteration loop to hold the target policy fixed while an iteratively updated behavioral policy gathers relevant experience. It only commits to a new policy once a convergence criterion has been met. If certain stability criteria are met, the update is guaranteed to be safe; otherwise, it remains no less safe than standard approximate policy iteration. Applying SV-API to PPO yields Stable Value PPO (SV-PPO), which matches or improves performance on high-dimensional discrete (Atari) and continuous control benchmarks while executing substantially larger target policy updates. These results demonstrate the viability of ANPS as a new solution to this classic challenge in RL.

Comment: New RL principle that replaces conservative policy updates by matching training data to the next-policy state distribution.

Topic Match: This is a foundational deep RL contribution about safe policy improvement and distribution shift, not merely a benchmark tweak.

Relevance: 8 Novelty: 8


9. Operator-Guided Invariance Learning for Continuous Reinforcement Learning

ArXiv ID: 2605.06500

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Zuyuan Zhang, Fei Xu Yu, Tian Lan

Abstract: Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on special cases, such as prescribed symmetries and exact equivariance, without addressing how to discover more general structures that require nonlinear operators to transform and map between continuous state/action systems with isomorphic value functions. We propose VPSD-RL (Value-Preserving Structure Discovery for Reinforcement Learning). It models continuous RL as a controlled diffusion with value-preserving mappings defined through Lie-group actions and associated pullback operators. We show that a value-preserving structure exists exactly when pulling back the value function and pushing forward actions commute with the controlled generator and reward functional. Further, approximate value-preserving structures with rigorous guarantees can be found when the Hamilton-Jacobi-Bellman mismatch is small. This framework discovers exact and approximate value-preserving structures by searching for the associated Lie group operators. VPSD-RL fits differentiable drift, diffusion, and reward models; learns infinitesimal generators via determining-equation residual minimization; exponentiates them with ODE flows to obtain finite transformations; and integrates them into continuous RL through transition augmentation and transformation-consistency regularization. We show that bounded generator/reward mismatch implies quantitative stability of the optimal value function along approximate orbits, with sensitivity governed by the effective horizon, and observe improved data efficiency and robustness on continuous-control benchmarks.

Comment: Defines value-preserving structure discovery in continuous-control RL via Lie-group operators and generator commutation, then uses learned transformations for augmentation and regularization.

Topic Match: Its focus is foundational RL generalization: discovering invariant structure in controlled diffusions to improve data efficiency and robustness in continuous control.

Relevance: 8 Novelty: 8


10. Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

ArXiv ID: 2605.06474

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Xiang Li, Nan Jiang

Abstract: We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^\pi$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.

Comment: Introduces recursive reweighting with moment matching for off-policy evaluation under general function approximation with dimension-free guarantees.

Topic Match: Best fit is World Models, Exploration, and Open-Ended Reinforcement Learning because it is foundational offline RL theory about off-policy evaluation and coverage.

Relevance: 8 Novelty: 8


11. Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies

ArXiv ID: 2605.06470

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Magnus Victor Boock, Abdullah Akgül, Mustafa Mert Çelikok, Melih Kandemir

Abstract: We present a new operator-theoretic representation learning framework for offline reinforcement learning that recovers the directed temporal geometry of a controlled Markov process from hitting time observations. While prior art often produces symmetric distances or fails to satisfy the triangle inequality, our framework learns a Hilbert-space displacement geometry where expected hitting times are realized as linear functionals of latent displacements. We prove that this representation exists under latent linear closure and is uniquely identifiable up to a bounded linear isomorphism. For finite-dimensional implementations, we show that global hitting-time error is bounded by one-step transition error amplified by the environment's transient spectral radius. Furthermore, we provide finite-sample guarantees accounting for approximation, statistical complexity, and trajectory-label mismatch. Building on this theory, we propose Isomorphic Embedding Learning (IEL), a new goal-agnostic foundation policy learning algorithm that anchors a HILP-style consistency objective with explicit hitting-time regression to ensure that the learned geometry reflects actual decision-time progress. This asymmetric and compositional structure enables robust graph-based multi-stage planning for long-horizon navigation. Our experiments demonstrate that IEL improves the state of the art in learning foundation policies from offline maze locomotion data. Our code can be found at https://github.com/MagnusBoock/IEL

Comment: Learns hitting-time-based latent geometry for offline RL with identifiability and finite-sample guarantees, aimed at compositional planning.

Topic Match: Although heavily representation-theoretic, the main use is learning planning-relevant state geometry for foundation policies in RL.

Relevance: 8 Novelty: 8


12. Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics

ArXiv ID: 2605.06377

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Philip Jordan, Maryam Kamgarpour

Abstract: We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume that the underlying fully observed Markov game is a Markov potential game. For this class, we present an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium. Due to partial observability, optimal policies may in general depend on the full action-observation history. Under a filter stability assumption, we show that policies based on finite history windows provide sufficient approximation guarantees. This enables us to approximate the POMG by a surrogate Markov game that is near-potential, leading to quasi-polynomial sample and computational complexity for independent Nash equilibrium learning in the underlying POMG.

Comment: Independent learning of approximate Nash equilibria in partially observable Markov potential games without communication is a foundational MARL result on learning dynamics under partial observability.

Topic Match: The core contribution is a new theoretical MARL learning algorithm and convergence analysis for partially observable games, fitting foundational RL more than application-driven RL.

Relevance: 8 Novelty: 8


13. Bandit Learning in General Open Multi-agent Systems

ArXiv ID: 2605.06202

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Mengfan Xu

Abstract: Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the pre-training degree of new agents quantifies how much information an agent carries upon entry, stability measures the impact of new agents on the system, and global dynamic regret compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global-UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre-training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.

Comment: Defines bandit learning in open multi-agent systems with agent arrival/departure, introducing pre-training degree and stability to characterize regret.

Topic Match: This is foundational online learning/RL theory for open systems and continual interaction, aligned with the open-ended and continual learning criterion.

Relevance: 8 Novelty: 8


14. Differential Privacy in the Extensive-Form Bandit Problem

ArXiv ID: 2605.05266

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Stephen Pasteris, Rahul Savani, Theodore Turocy

Abstract: We consider the extensive-form bandit problem, where on each trial the learner (a user coordinated by a server) plays an extensive-form game against an oblivious adversary, observing the information sets it finds itself in as well as the resulting payoff/loss. We give an algorithm for this problem that satisfies $\epsilon$-local differential privacy and attains a regret of $\tilde{O}(\sqrt{A\ln(S)T}/\epsilon)$, where $A$ is the total number of actions that the learner can possibly take, $S$ is the number of the learner's possible reduced strategies, and $T$ is the number of trials. On each trial, the time complexity of our algorithm is, up to a factor logarithmic in the maximum number of actions at an infoset, equal to the time required for the server to transmit the reduced strategy to the user. We note that local differential privacy is the strongest version of differential privacy and, to the best of our knowledge, this is the first work to study differential privacy of any form in the extensive-form bandit problem.

Comment: First locally differentially private algorithm for extensive-form bandits with regret guarantees.

Topic Match: This is foundational online decision-making theory for bandit learning in extensive-form games, a strong fit to the RL side of the target topics.

Relevance: 8 Novelty: 8


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Relevant Topics

Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.

Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.

  1. Architecture and Training Dynamics
     • Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
     • Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.

  2. Efficiency, Compression, and Large-Scale Training
     • Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
     • Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.

  3. Representation Learning Theory and Structure
     • Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
     • Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.

  4. Memory Structures and Agent Memory Systems
     • Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
     • Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.

  5. World Models, Exploration, and Open-Ended Reinforcement Learning
     • Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
     • Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.

Usually leave these to the hotspot digest unless the core contribution is clearly foundational:

  • major model or product releases
  • broadly trendy agent or tooling launches
  • benchmark, leaderboard, or evaluation-only papers
  • downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

  • 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
  • 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
  • 5-6: touches the target topics, but the main contribution is elsewhere.
  • 3-4: largely outside the target topics, often application-focused or domain-specific.
  • 1-2: unrelated.

Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.

Novelty Scoring

  • 9-10: new paradigm, theory, or major methodological breakthrough.
  • 7-8: substantial methodological advance or strong new insight.
  • 5-6: meaningful but incremental extension or refinement.
  • 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
  • 1-2: little originality; mainly standard application of existing methods.

Topic Registry

Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.

  • architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
  • efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
  • representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
  • memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
  • world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.

Papers

[PAPER LIST HERE]

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}

Rules:

  • ARXIVID: the arXiv ID.
  • COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
  • RELEVANCE: integer from 1 to 10.
  • NOVELTY: integer from 1 to 10.
  • PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
  • MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
  • TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
  • HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
  • HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
  • Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
  • daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
  • new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
  • Do not output markdown, code fences, or any extra text.
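
The output rules above are mechanical enough to check in code. The following is a minimal sketch of such a check, not part of the digest's actual pipeline: the function name validate_record and the error messages are hypothetical, and the reading of "Include PRIMARY_TOPIC_ID when there are multiple matches" as "the primary ID must appear in MATCHED_TOPIC_IDS whenever that list has more than one entry" is an assumption.

```python
import json

# Allowed values copied from the Topic Registry and Rules sections of the prompt.
ALLOWED_TOPIC_IDS = {
    "architecture_training",
    "efficiency_scaling",
    "representation_structure",
    "memory_systems",
    "world_models_open_ended_rl",
}
ALLOWED_HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT", "HOTSPOT_PAPER_TAGS",
    "HOTSPOT_PAPER_COMMENT",
}


def validate_record(line):
    """Hypothetical helper: return the rule violations for one JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    errors = []
    for key in ("RELEVANCE", "NOVELTY"):
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 10:
            errors.append(f"{key} must be an integer from 1 to 10")

    primary = record["PRIMARY_TOPIC_ID"]
    if not isinstance(primary, str) or primary not in ALLOWED_TOPIC_IDS:
        errors.append("PRIMARY_TOPIC_ID is not in the topic registry")

    matched = record["MATCHED_TOPIC_IDS"]
    if not isinstance(matched, list) or not all(
        isinstance(t, str) and t in ALLOWED_TOPIC_IDS for t in matched
    ):
        errors.append("MATCHED_TOPIC_IDS must be a list of registry IDs")
    elif len(matched) > 1 and primary not in matched:
        # Assumed interpretation of the "multiple matches" rule.
        errors.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")

    tags = record["HOTSPOT_PAPER_TAGS"]
    if not isinstance(tags, list) or not all(
        isinstance(t, str) and t in ALLOWED_HOTSPOT_TAGS for t in tags
    ):
        errors.append("HOTSPOT_PAPER_TAGS must use only daily_hot, new_frontier")
    elif not tags and record["HOTSPOT_PAPER_COMMENT"] != "":
        errors.append("HOTSPOT_PAPER_COMMENT must be empty when there are no tags")
    elif tags and not record["HOTSPOT_PAPER_COMMENT"]:
        errors.append("HOTSPOT_PAPER_COMMENT is required when tags are set")

    return errors
```

Note that the template object shown in the instructions uses 0 for RELEVANCE and NOVELTY as placeholders, so a check like this would flag the template itself; only filled-in records with scores from 1 to 10 pass.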