Personalized Daily ArXiv Papers 2026-05-07

Model	Metric	Usage: Prompt	Usage: Completion	Usage: Total	Papers: Total arXiv	Papers: Scanned	Papers: Relevant
`gpt-5.4`	Tokens	341011	25108	366119	658	436	52
`gpt-5.4`	Cost	$0.85	$0.38	$1.23	658	436	52

Topic Coverage:

Topic	Papers
Architecture and Training Dynamics	22
Efficiency, Compression, and Large-Scale Training	7
Representation Learning Theory and Structure	8
Memory Structures and Agent Memory Systems	7
World Models, Exploration, and Open-Ended Reinforcement Learning	8

Table of contents by topic:

Architecture and Training Dynamics (22)

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer Authors: Alexander Hsu, Zhaiming Shen, Wenjing Liao, Rongjie Lai
Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize Authors: Sarwan Ali
Layerwise LQR for Geometry-Aware Optimization of Deep Networks Authors: Simon Dufort-Labbé, Pierre-Luc Bacon, Razvan Pascanu, Simon Lacoste-Julien, Aristide Baratin
Demystifying Manifold Constraints in LLM Pre-training Authors: Kang An, Jiaxiang Li, Donald Goldfarb, Shiqian Ma
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning Authors: Waleed Razzaq, Yun-Bo Zhao
Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks Authors: Yaobo Zhang
On the Invariants of Softmax Attention Authors: Wonsuk Lee
Training-Time Batch Normalization Reshapes Local Partition Geometry in Piecewise-Affine Networks Authors: Xuan Qi, Yi Wei, Fanqi Yu, Furao Shen, Vittorio Murino, Cigdem Beyan
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences Authors: Mariia Seleznova
Estimating the expected output of wide random MLPs more efficiently than sampling Authors: Wilson Wu, Victor Lecomte, Michael Winer, George Robinson, Jacob Hilton, Paul Christiano
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs Authors: Zhiyuan Xu, Joseph Gardiner, Sana Belguith, Lichao Wu
Average Attention Transformers and Arithmetic Circuits Authors: Lena Ehrmuth, Laura Strieker
Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks Authors: Jatin Sharma, Dan F. M. Goodman, Danyal Akarca
Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization Authors: Sharan Sahu, Abir Sarkar, Cameron J. Hogan, Martin T. Wells
Ortho-Hydra: Orthogonalized Experts for DiT LoRA Authors: Seunghyun Ji
Covariance-Aware Goodness for Scalable Forward-Forward Learning Authors: Xiaoyi Jiang, Bashir M. Al-Hashimi, Kai Xu
Perturbation is All You Need for Extrapolating Language Models Authors: Zetai Cen, Jin Zhu, Xinwei Shen, Chengchun Shi
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion Authors: Tarun Kathuria, Sachin Kumar
Endogenous Regime Switching Driven by Scalar-Irreducible Learning Dynamics Authors: Sheng Ran
Koopman Identification of Nonlinear Systems via Reservoir Liftings Authors: Weibin Gu, Chen Yang, Lu Shi
Exact Dual Geometry of SOC-ICNN Value Functions Authors: Kang Liu, Jianchen Hu, Wei Peng
Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense Authors: Kerri Prinos, Lilianne Brush, Cameron Denton, Zhanqi Wang, Joshua Knox, Snehal Antani, Anton Foltz, Amy Villaseñor

Efficiency, Compression, and Large-Scale Training (7)

Gated Subspace Inference for Transformer Acceleration Authors: Stephen J. Thomas
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization Authors: Zhikai Li, Zhen Dong, Xuewen Liu, Jing Zhang, Qingyi Gu
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism Authors: Sajal Dash, Feiyi Wang
UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding Authors: Yepeng Weng, Qiao Hu, Takehisa Yairi
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints Authors: Chengyi Nie, Nian Si, Zijie Zhou
Budget-aware Auto Optimizer Configurator Authors: Kang Liu, Wei Peng, Jianchen Hu
Rethinking the Rank Threshold for LoRA Fine-Tuning Authors: Juneyoung Park

Representation Learning Theory and Structure (8)

Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning Authors: Bryan Cheng, Jasper Zhang
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior Authors: Daniel Wurgaft, Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas McGrath, Owen Lewis, Jack Merullo, Noah Goodman, Thomas Fel, Atticus Geiger, Ekdeep Singh Lubana
Adaptivity Under Realizability Constraints: Comparing In-Context and Agentic Learning Authors: Anastasis Kratsios, A. Martina Neuman, Philipp Petersen
Conceptors for Semantic Steering Authors: Ilias Triantafyllopoulos, Young-Min Cho, Ren Tao, Miranda Muqing Miao, Sunny Rai, Lyle Ungar, Sharath Chandra Guntuku, Neville Ryant, João Sedoc
Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation Authors: Francesco Sovrano, Gabriele Dominici, Marc Langheinrich
Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings Authors: Bumjun Kim, Albert No
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models Authors: Ismail Hossain, Sai Puppala, Jannatul Ferdaus, Md Jahangir Alam, Yoonpyo Lee, Syed Bahauddin Alam, Sajedul Talukder
Simultaneous CNN Approximation on Manifolds with Applications to Boundary Value Problems Authors: Hanfei Zhou, Lei Shi

Memory Structures and Agent Memory Systems (7)

Memory as a Markov Matrix: Sample Efficient Knowledge Expansion via Token-to-Dictionary Mapping Authors: Kaustubh Pethkar, Ziyang Xiong, Zuofeng Shang, Yingcong Li
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis Authors: Xutao Mao, Jinman Zhao, Gerald Penn, Cong Wang
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction Authors: Sihao Liu, YuFan Xiong, Zhonghua Jiang, Zhaode Wang, Chengfei Lv, Shengyu Zhang
Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval Authors: Nicholas Barnfield, Juno Kim, Eshaan Nichani, Jason D. Lee, Yue M. Lu
MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents Authors: Ishrith Gowda (University of California, Berkeley)
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation Authors: Guangsheng Bao, Hongbo Zhang, Han Cui, Yanbin Zhao, Yue Zhang
Skill Neologisms: Towards Skill-based Continual Learning Authors: Antonin Berthon, Nicolas Astorga, Mihaela van der Schaar

World Models, Exploration, and Open-Ended Reinforcement Learning (8)

ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC Authors: Yurui Du, Pinhao Song, Yutong Hu, Renaud Detry
Learning to Theorize the World from Observation Authors: Doojin Baek, Gyubin Lee, Junyeob Baek, Hosung Lee, Sungjin Ahn
Discovering Reinforcement Learning Interfaces with Large Language Models Authors: Akshat Singh Jaswal, Ashish Baghel, Paras Chopra
Structural Equivalence and Learning Dynamics in Delayed MARL Authors: Jules Sintes, Ana Bušić, Jiamin Zhu
Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning Authors: Harin Lee, Min-hwan Oh
Counterfactual identifiability beyond global monotonicity: non-monotone triangular structural causal models Authors: Pengcheng Tan, Jiang Chen, Dehui Du
Bilinear Mamba-Koopman Neural MPC for Varying Dynamics Authors: Matan Pagi, Zohar Sorek
CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies Authors: Keyu Chen, Nanfei Ye, Yida Wang, Wenchao Sun, Danqi Zhao, Hao Cheng, Sifa Zheng

Architecture and Training Dynamics (22)

1. Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

ArXiv ID: 2605.05176

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Alexander Hsu, Zhaiming Shen, Wenjing Liao, Rongjie Lai

Abstract: Pre-trained transformers are able to learn from examples provided as part of the prompt without any weight updates, a remarkable ability known as in-context learning (ICL). Despite its demonstrated efficacy across various domains, the theoretical understanding of ICL is still developing. Whereas most existing theory has focused on linear models, we study ICL in the nonlinear regression setting. Through the interaction mechanism in attention, we explicitly construct transformer networks to realize nonlinear features, such as polynomial or spline bases, which span a wide class of functions. Based on this construction, we establish a framework to analyze end-to-end in-context nonlinear regression with the constructed features. Our theory provides finite-sample generalization error bounds in terms of context length and training set size. We numerically validate the theory on synthetic regression tasks.

Comment: Provides a theory of transformer in-context learning for nonlinear regression by showing attention can explicitly construct nonlinear feature bases such as polynomials and splines.

Topic Match: The core contribution is mechanistic understanding of how attention implements nonlinear feature construction during ICL, making transformer computation the primary fit.

Relevance: 9 Novelty: 8

2. Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

ArXiv ID: 2605.04396

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Sarwan Ali

Abstract: Recent work has shown that Transformers' compositional generalization is governed by complexity control (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open when during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task we find that (i) weight decay applied for a single window spanning 25% of training matches full-training weight decay in out-of-distribution (OOD) accuracy ($0.93$ vs $0.91$); (ii) holding the total regularization budget constant, placing it in the middle of training yields $5{-}9\times$ higher OOD accuracy than placing it early; (iii) the boundary of the critical window is remarkably sharp: shifting the window onset by as little as $100$ optimization steps causes mean OOD accuracy to jump from chance ($0.15$) to the reasoning regime ($0.61$); (iv) the window's position depends systematically on initialization scale, but the basin of attraction for reasoning solutions shrinks at small initialization, contradicting the prevailing recommendation that smaller initialization is uniformly better. We further show that the critical-window phenomenon is task-specific: it does not appear on grokking with modular arithmetic, where properly tuned constant weight decay matches scheduled weight decay.

Comment: Identifies a sharp training-time critical window in which regularization determines whether transformers learn reasoning-like compositional solutions or memorization.

Topic Match: It is fundamentally about training dynamics and how optimization timing controls solution type in transformers, which best fits architecture and training dynamics.

Relevance: 9 Novelty: 8
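
The windowed regularization described in this abstract can be pictured as a step-dependent weight-decay schedule. The sketch below is only an illustration: the function name, the decay value 0.1, the 1000-step run, and the mid-training placement are assumptions, not settings taken from the paper.

```python
def windowed_weight_decay(step, total_steps, start_frac, width_frac=0.25, wd=0.1):
    """Return wd inside the window [start, start + width) and 0.0 elsewhere."""
    start = int(start_frac * total_steps)
    end = start + int(width_frac * total_steps)
    return wd if start <= step < end else 0.0

total_steps = 1000                      # hypothetical run length
# Place a window covering 25% of training in the middle of the run (assumed placement).
schedule = [windowed_weight_decay(s, total_steps, start_frac=0.375)
            for s in range(total_steps)]
active_steps = sum(1 for v in schedule if v > 0)
```

In a PyTorch training loop, one plausible way to apply such a schedule is to overwrite `param_group["weight_decay"]` on the optimizer before each step.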

3. Layerwise LQR for Geometry-Aware Optimization of Deep Networks

ArXiv ID: 2605.04230

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Simon Dufort-Labbé, Pierre-Luc Bacon, Razvan Pascanu, Simon Lacoste-Julien, Aristide Baratin

Abstract: Geometry-aware optimizers such as Newton and natural gradient can improve conditioning in deep learning, but scalable variants such as K-FAC, Shampoo, and related preconditioners usually impose structural approximations early, often discarding cross-layer interactions induced by the network computation. We introduce Layerwise LQR (LLQR), a framework for learning structured inverse preconditioners under a global layerwise optimal-control objective. The starting point is an exact equivalence: the steepest-descent step under a broad class of divergence-induced quadratic models (including Newton, Gauss-Newton, Fisher/natural-gradient, and intermediate-layer metrics) can be written as a finite-horizon Linear Quadratic Regulator (LQR) problem. This formulation serves as a reference that exposes the layerwise dynamics and cost matrices encoding the original dense geometry. We then derive a scalable relaxation that learns diagonal, (E-)Kronecker-factored, or other structured inverse preconditioners by minimizing the LQR objective and reusing them across iterations. The resulting optimizer wraps standard methods while retaining a principled connection to second-order geometry, without forming or inverting the global curvature matrix. Experiments on ResNets and Transformers show that LLQR improves optimization dynamics and often translates these gains into improved final test performance, while adding only modest wall-clock overhead. It establishes LLQR as a practical framework for geometry-aware second-order methods and a reference for evaluating scalable approximations.

Comment: Recasts second-order optimization for deep nets as a finite-horizon LQR problem, yielding a principled way to learn structured cross-layer preconditioners.

Topic Match: The core contribution is a new optimization/training-dynamics framework for deep networks, with efficiency benefits secondary to the training mechanism.

Relevance: 9 Novelty: 8

4. Demystifying Manifold Constraints in LLM Pre-training

ArXiv ID: 2605.04418

Primary Topic: Architecture and Training Dynamics

Authors: Kang An, Jiaxiang Li, Donald Goldfarb, Shiqian Ma

Abstract: The empirical success of large language model (LLM) pre-training relies heavily on heuristic stabilization techniques, such as explicit normalization layers and weight decay. While recent constrained optimization approaches that explicitly restrict weights may improve numerical stability and performance, the mechanism and motivation for adding constraints still remain elusive. This paper systematically demystifies the role of explicit manifold constraints in LLM pre-training. By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a provably convergent, single-loop optimization framework, our study disentangles weight regularization heuristics from interacting mechanisms like RMS normalization and decoupled weight decay. Theoretical analyses and comprehensive empirical evaluations reveal that manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of these heuristic mechanisms. Evaluations on large-scale LLM architectures demonstrate that MACRO achieves highly competitive performance while rigorously preserving the theoretical guarantees of exact Riemannian optimization.

Comment: Explains manifold constraints in LLM pretraining as an independent stabilization mechanism that bounds activation scales and enforces rotational equilibrium.

Topic Match: This is directly about training stability mechanisms and optimizer-design principles for large-model pretraining.

Relevance: 9 Novelty: 8

5. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

ArXiv ID: 2605.04421

Primary Topic: Architecture and Training Dynamics

Authors: Waleed Razzaq, Yun-Bo Zhao

Abstract: Continuous-time (CT) Transformers improve irregular and long-range modeling over CT-RNNs by exploiting input or output embeddings with continuous dynamics. However, the core scaled-dot-product-attention (SDPA) mechanism remains inherently discrete. We propose FLUID (Flexible Unified Information Dynamics), a CT Transformer that incorporates continuous dynamics directly into the attention computation by replacing SDPA with a Liquid Attention Network (LAN). LAN reinterprets attention logits as a continuous dynamical system and reformulates them as the solution to a linear ODE modulated by input-dependent nonlinear recurrent gates. Theoretically, we establish stability guarantees for LAN dynamics and show that it serves as an interpolating middle ground between SDPA and CT-RNNs, recovering each as a special case under a well-defined parameterization of its gating functions. LAN also introduces an explicit attention-sink gate to eliminate disproportionate attention mass on uninformative nodes. FLUID replaces standard residual connections with input-dependent Liquid Hyper-Connections to adaptively regulate interlayer information flow. Empirically, we evaluate FLUID on a broad set of learning tasks, including (i) irregular time-series, (ii) long-range modeling, (iii) lane-keeping control of autonomous vehicles, and (iv) learning physical dynamics under a scarce data regime. Across all tasks, FLUID consistently matches or outperforms CT baselines, achieving improvements of up to 47% in certain scenarios and enhancing generalization under distributional shifts. Additionally, FLUID demonstrates superior noise robustness and a self-correcting inductive bias in autonomous vehicle control. We also provide a detailed analysis of key hyperparameters to guide tuning and show that FLUID occupies an intermediate position among competing approaches in terms of runtime and memory efficiency.

Comment: Introduces a continuous-time attention mechanism via ODE-governed liquid attention and hyper-connections, bridging transformers and CT-RNNs.

Topic Match: Its main value is a new sequence-model architecture and attention formulation rather than an application domain result.

Relevance: 9 Novelty: 8

6. Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

ArXiv ID: 2605.04217

Primary Topic: Architecture and Training Dynamics

Authors: Yaobo Zhang

Abstract: Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$, $d e^{-\gamma d}\cos(\omega d)$, and $d e^{-\gamma d}\sin(\omega d)$, for causal lag $d=i-j\geq 0$. Thus the construction realizes a distance-modulated phase basis $d e^{i\omega d}$, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.

Comment: Introduces a new relative positional encoding based on defective complex Jordan blocks, expanding the function class of attention logits beyond RoPE and ALiBi.

Topic Match: This is directly about a core attention mechanism: the structure of relative positional encoding and what functions attention can express.

Relevance: 9 Novelty: 8
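
The oscillatory-polynomial features listed in this abstract are what a one-parameter group generated by a defective 2x2 complex Jordan block produces. The sketch below assumes the generator A = [[a, 1], [0, a]] with a = -gamma + i*omega, so that exp(d*A) = e^{a d} [[1, d], [0, 1]]. The values of gamma and omega and the name `rho` are illustrative, and this is not the paper's real block form or its stabilized variants.

```python
import numpy as np

gamma, omega = 0.1, 0.7                 # illustrative decay and frequency (assumed)
a = -gamma + 1j * omega                 # complex eigenvalue of the generator

def rho(d):
    """exp(d * A) for the Jordan generator A = [[a, 1], [0, a]], in closed form."""
    return np.exp(a * d) * np.array([[1.0, d], [0.0, 1.0]])

d = 7
diag_feature = rho(d)[0, 0].real        # e^{-gamma d} cos(omega d)
coupled_feature = rho(d)[0, 1].real     # d e^{-gamma d} cos(omega d)

# Exact group law: shifting by 3 and then by 4 equals shifting by 7.
composed = rho(3) @ rho(4)
# Repeated unit shifts give the same operator.
stepped = np.linalg.matrix_power(rho(1), d)
```

The off-diagonal entry carries the factor d multiplied into the same phase, which is the coupling that a direct sum of separate rotary and distance channels would not produce.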

7. On the Invariants of Softmax Attention

ArXiv ID: 2605.02907

Primary Topic: Architecture and Training Dynamics

Authors: Wonsuk Lee

Abstract: Softmax attention maps every query-key interaction into a probability distribution, but the underlying structure remains largely unexplored. We define the energy field, the row-centered attention logit, and show that it exhibits invariant properties across models, architectures, and inputs. Two classes of invariants emerge. Mechanism-level invariants follow from the algebraic structure of softmax attention. They include a per-row zero-sum constraint, a rank bound determined by the head dimension, and spectral signatures that follow from them. Model-level regularities are not required by the mechanism, yet hold in every autoregressive language model we test, spanning several architecture families. The energy field distributes its variance over key positions without concentrating at a few. This delocalization traces to a property of the key matrix we call key incoherence. These invariants have practical consequences. The rank bound confines the energy field to a low-dimensional subspace. Key incoherence yields a per-head training monitor. All results are verified at multiple context lengths and input texts.

Comment: Characterizes invariant structure of softmax attention through the row-centered energy field, including rank and spectral constraints plus empirical key incoherence.

Topic Match: The paper directly analyzes a core transformer mechanism and extracts architectural invariants of attention itself.

Relevance: 9 Novelty: 8
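
The two mechanism-level invariants named in this abstract (per-row zero sum and a rank bound set by the head dimension) can be checked numerically on random data. This is a minimal sketch for a single unmasked head with hypothetical sizes. How the paper treats causal masking when centering is not stated in the abstract, so masking is left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_head = 32, 8                # hypothetical sequence length and head size
Q = rng.standard_normal((n_tokens, d_head))
K = rng.standard_normal((n_tokens, d_head))

logits = Q @ K.T / np.sqrt(d_head)
# Row-centered logits: subtract each query's mean logit over key positions.
energy = logits - logits.mean(axis=1, keepdims=True)

row_sums = energy.sum(axis=1)           # zero-sum constraint per row
rank = np.linalg.matrix_rank(energy)    # cannot exceed d_head

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

# Softmax is shift-invariant per row, so centering leaves attention weights unchanged.
same_attention = np.allclose(softmax(logits), softmax(energy))
```

The rank bound holds because centering multiplies the rank-limited logit matrix by a fixed projection on the right, which cannot raise its rank.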

8. Training-Time Batch Normalization Reshapes Local Partition Geometry in Piecewise-Affine Networks

ArXiv ID: 2605.04946

Primary Topic: Architecture and Training Dynamics

Authors: Xuan Qi, Yi Wei, Fanqi Yu, Furao Shen, Vittorio Murino, Cigdem Beyan

Abstract: Batch normalization (BN) is central to modern deep networks, but its effect on the realized function during training remains less understood than its optimization benefits. We study training-time BN in continuous piecewise-affine (CPA) networks through the geometry of switching hyperplanes and the induced affine-region partition. Conditioned on a mini-batch, we show that BN defines for each neuron a reference hyperplane through the batch centroid, and that breakpoint-switching hyperplanes are parallel translates whose offsets are expressed in batch-standardized coordinates and are independent of the raw bias. This yields an exact criterion for when a switching hyperplane intersects a local $\ell_\infty$ window and motivates a local region-density functional based on exact affine-region counts. Under explicit sufficient conditions, we show that BN increases expected local partition refinement in ReLU and more general piecewise-affine networks, and that this mechanism transfers locally through depth inside parent affine regions where the upstream representation map is an affine embedding. These results provide a function-level geometric account of training-time BN as a batch-conditional recentering mechanism near the data.

Comment: Gives a function-level geometric theory for how training-time batch normalization reshapes local affine-region partitions during training.

Topic Match: This directly analyzes a core training-stability and architecture mechanism, explaining BN through partition geometry rather than application performance.

Relevance: 9 Novelty: 8
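
Two statements in this abstract follow directly from the batch-normalization formula: the normalized pre-activation does not depend on the raw bias, and with zero shift its zero set passes through the batch centroid. The sketch below checks both for a single neuron. The sizes, gain, and shift are illustrative assumptions, and the usual epsilon inside the square root is omitted so the identities hold exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))        # one mini-batch of 64 inputs (hypothetical sizes)
w = rng.standard_normal(5)              # weights of a single neuron
bn_gain, bn_shift = 1.5, 0.3            # BN scale and shift (illustrative values)

def bn_preactivation(x_query, bias):
    """Training-time BN of the neuron's pre-activation, using batch statistics of X."""
    z_batch = X @ w + bias
    z_query = x_query @ w + bias
    return bn_gain * (z_query - z_batch.mean()) / z_batch.std() + bn_shift

x = rng.standard_normal(5)
out_bias_0 = bn_preactivation(x, bias=0.0)
out_bias_7 = bn_preactivation(x, bias=7.0)      # same value: the raw bias cancels
out_at_centroid = bn_preactivation(X.mean(axis=0), bias=7.0)   # equals bn_shift
```

Setting the output to zero gives w·(x - centroid) = -bn_shift * std / bn_gain, a parallel translate of the hyperplane through the centroid with an offset measured in batch-standardized units.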

9. How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

ArXiv ID: 2605.05113

Primary Topic: Architecture and Training Dynamics

Authors: Mariia Seleznova

Abstract: We study signal propagation in linear recurrent models at finite width. While existing signal propagation theory relies predominantly on the infinite-width limit, it remains unclear for how long that approximation remains accurate when recurrent depth $t$ grows jointly with width $n$. This question is especially relevant for modern recurrent sequence models, whose natural operating regime involves long input sequences, i.e., large $t$. We derive exact finite-width formulas for the hidden state signal energies in linear recurrences under complex Gaussian initialization. Using these formulas, we identify the joint depth-width scaling regimes that govern signal propagation: (i) a subcritical regime $t=o(\sqrt n)$, in which the infinite-width approximation remains valid; (ii) a critical regime $t\sim c\sqrt n$, in which non-negligible deviations from infinite-width predictions appear and a nontrivial joint scaling limit emerges; and (iii) a supercritical regime $t\gg \sqrt n$, in which finite-width effects dominate. Thus, our results pinpoint the precise recurrent depth scale at which infinite-width theory breaks down in long-range linear recurrences. In turn, this shows when standard initialization schemes, such as Glorot, become unstable. More broadly, our results demonstrate that finite-width effects accumulate more rapidly with depth in recurrent models than in feedforward ones, leading to qualitatively different signal propagation behavior.

Comment: Pins down the exact depth-width scaling where infinite-width signal propagation theory breaks for recurrent linear models.

Topic Match: This is a foundational training-dynamics result for recurrent sequence models, focusing on finite-width effects and initialization stability.

Relevance: 9 Novelty: 8

10. Estimating the expected output of wide random MLPs more efficiently than sampling

ArXiv ID: 2605.05179

Primary Topic: Architecture and Training Dynamics

Authors: Wilson Wu, Victor Lecomte, Michael Winer, George Robinson, Jacob Hilton, Paul Christiano

Abstract: By far the most common way to estimate an expected loss in machine learning is to draw samples, compute the loss on each one, and take the empirical average. However, sampling is not necessarily optimal. Given an MLP at initialization, we show how to estimate its expected output over Gaussian inputs without running samples through the network at all. Instead, we produce approximate representations of the distributions of activations at each layer, leveraging tools such as cumulants and Hermite expansions. We show both theoretically and empirically that for sufficiently wide networks, our estimator achieves a target mean squared error using substantially fewer FLOPs than Monte Carlo sampling. We find moreover that our methods perform particularly well at estimating the probabilities of rare events, and additionally demonstrate how they can be used for model training. Together, these findings suggest a path to producing models with a greatly reduced probability of catastrophic tail risks.

Comment: Estimates expected outputs of wide random MLPs using cumulants and Hermite expansions, beating Monte Carlo in FLOPs for wide-network regimes.

Topic Match: This is specialized foundational work on neural-network behavior at initialization, offering a new computational lens on wide-network statistics and tail estimation.

Relevance: 8 Novelty: 9
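
To make the idea of estimating an expected output without sampling concrete, consider the simplest case where it is exact: one ReLU hidden layer with a linear readout and standard Gaussian inputs. Each pre-activation is then Gaussian, and E[ReLU(z)] has a closed form. This sketch is not the paper's estimator; the sizes and weight scales are assumptions, and deeper networks need the approximate activation distributions (cumulants, Hermite expansions) that the abstract refers to.

```python
import math
import numpy as np

def expected_relu(mu, sigma):
    """Closed form for E[max(z, 0)] when z ~ N(mu, sigma^2)."""
    t = mu / sigma
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return mu * cdf + sigma * pdf

rng = np.random.default_rng(0)
d_in, width = 16, 64                    # hypothetical layer sizes
W = rng.standard_normal((width, d_in)) / math.sqrt(d_in)
b = 0.1 * rng.standard_normal(width)
v = rng.standard_normal(width) / math.sqrt(width)

# Sample-free estimate: for x ~ N(0, I), unit i has pre-activation N(b_i, ||W_i||^2).
sigmas = np.linalg.norm(W, axis=1)
analytic = sum(v[i] * expected_relu(b[i], sigmas[i]) for i in range(width))

# Monte Carlo baseline for comparison.
X = rng.standard_normal((50_000, d_in))
monte_carlo = (np.maximum(X @ W.T + b, 0.0) @ v).mean()
```

The analytic value costs one pass over the weights, while the Monte Carlo error shrinks only with the square root of the number of samples.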

11. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

ArXiv ID: 2605.02946

Primary Topic: Architecture and Training Dynamics

Authors: Zhiyuan Xu, Joseph Gardiner, Sana Belguith, Lichao Wu

Abstract: Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations. Prompt-based jailbreaks rely on heuristic search and transfer poorly, model intervention methods require privileged access to internal representations, and optimization-based input attacks remain output-centric and are fundamentally limited on MoE models due to the non-differentiable routing mechanism. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on this observation, RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time, the optimized suffix is appended to a malicious prompt, requiring only input access. Across seven MoE LLMs, RouteHijack achieves a 69.3% average attack success rate (ASR), outperforming prior optimization-based attacks by $3.2\times$. RouteHijack also transfers zero-shot across five sibling MoE variants, raising average ASR from 27.7% to 61.2%, and further generalizes to three MoE-based VLMs, increasing average ASR from 2.47% to 38.7%. These findings expose a fundamental vulnerability in sparse expert architectures and highlight the need for defenses beyond output-level alignment.

Comment: Reveals a routing-specific vulnerability in MoE models by exploiting safety-critical expert allocation through routing-aware adversarial optimization.

Topic Match: Despite the safety framing, the substantive insight is about MoE routing behavior and concentration of functionality across experts, a core architectural mechanism.

Relevance: 8 Novelty: 8

12. Average Attention Transformers and Arithmetic Circuits

ArXiv ID: 2605.04683

Primary Topic: Architecture and Training Dynamics

Authors: Lena Ehrmuth, Laura Strieker

Abstract: We analyse the computational power of transformer encoders as sequence-to-sequence functions on vectors. We show that average hard attention can be used to simulate arithmetic circuits if they are given as an input to an encoder. The circuit families that can be simulated this way have constant depth while using unbounded addition, binary multiplication and sign gates. The transformers we use have arithmetic circuits instead of feed-forward networks. With typical average attention the functions they compute are also computed by the same class of circuit families. Our results hold for transformers over the reals, rationals and any ring in between the two.

Comment: Characterizes average-attention transformers through arithmetic-circuit simulation, clarifying their formal computational power.

Topic Match: This is a foundational analysis of transformer attention as a computational mechanism.

Relevance: 8 Novelty: 8
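
For readers unfamiliar with the term, average hard attention is usually defined as attention that spreads its weight uniformly over all positions attaining the maximal score for a query. The sketch below implements that standard definition with NumPy. The function name and the toy inputs are illustrative, and the abstract itself does not spell out the paper's exact formalization.

```python
import numpy as np

def average_hard_attention(scores, values):
    """Each query averages the values at all positions attaining its maximal score."""
    is_max = np.isclose(scores, scores.max(axis=1, keepdims=True))
    weights = is_max / is_max.sum(axis=1, keepdims=True)
    return weights @ values

scores = np.array([[1.0, 3.0, 3.0],     # query 0: tie between positions 1 and 2
                   [2.0, 0.0, 1.0]])    # query 1: unique maximum at position 0
values = np.array([[1.0, 0.0],
                   [0.0, 2.0],
                   [4.0, 0.0]])
output = average_hard_attention(scores, values)
```

Query 0 receives the mean of the two tied value vectors, while query 1 simply copies the value at its single best position.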

13. Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks

ArXiv ID: 2605.03598

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Jatin Sharma, Dan F. M. Goodman, Danyal Akarca

Abstract: Understanding how biological and artificial neural networks implement computation from connectivity is a central problem in neuroscience and machine learning. In neural systems, structural and functional connectivity are known to diverge, motivating approaches that move beyond direct connections alone. Here, we show that the spatial and temporal function of recurrent neural networks (RNNs) trained on hierarchically modular tasks can be recovered by modelling the network as a graph and analysing the multi-hop pathways between input and output units. In particular, decomposing these pathways by hop length reveals how the network temporally routes information. This perspective reframes regularisation: if function is implemented through multi-hop communication, then standard penalties such as L1 regularisation, which act only on individual weights, constrain single-hop structure rather than the multi-hop pathways that support computation. Motivated by this view, we introduce resolvent-RNNs (R-RNNs), which constrain multi-hop pathways and thereby induce temporal sparsity beyond that achieved by standard L1 regularisation. Compared with L1 regularisation, R-RNNs achieve improved performance by inducing temporal sparsity that matches the task structure, even when the task signal is sparse. Moreover, R-RNNs exhibit stronger sparsity-function alignment, reflected in their increased robustness under strong regularisation. Together, our results identify multi-hop communication as a key principle linking structure to function in recurrent networks, and suggest that sparsity should be defined over functional pathways rather than individual parameters.

Comment: Shows that recurrent-network computation is organized by multi-hop graph pathways and introduces regularization over those pathways rather than individual weights.

Topic Match: The paper's key contribution is a mechanistic account of recurrent computation and a new structural regularization principle.

Relevance: 8 Novelty: 8
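
As an illustration of the hop-length decomposition described in the abstract above, here is a minimal linear sketch. It ignores nonlinearities, and the names `W`, `W_in`, `W_out`, and `alpha` are hypothetical. It reads "resolvent" as the matrix (I - alpha W)^{-1}, whose Neumann series sums the k-hop pathways W_out W^k W_in over all hop lengths k. The paper's actual R-RNN penalty is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_rec, n_out = 3, 8, 2

# Hypothetical recurrent, input, and output weights (not taken from the paper).
W = rng.normal(size=(n_rec, n_rec))
W *= 0.5 / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to spectral radius 0.5
W_in = rng.normal(size=(n_rec, n_in))
W_out = rng.normal(size=(n_out, n_rec))

def hop_pathways(W_out, W, W_in, max_hops):
    """Return the list [W_out @ W^k @ W_in for k = 0..max_hops]."""
    paths, Wk = [], np.eye(W.shape[0])
    for _ in range(max_hops + 1):
        paths.append(W_out @ Wk @ W_in)
        Wk = W @ Wk
    return paths

alpha = 1.0
paths = hop_pathways(W_out, W, W_in, max_hops=60)

# Summing alpha^k-weighted k-hop pathways approximates the resolvent pathway matrix.
neumann = sum(alpha**k * P for k, P in enumerate(paths))
resolvent = W_out @ np.linalg.inv(np.eye(n_rec) - alpha * W) @ W_in

# Per-hop input-to-output strength; an L1 penalty on W alone sees only single hops.
hop_strength = [float(np.abs(P).sum()) for P in paths[:5]]
```

Inspecting `hop_strength` by hop index gives a rough picture of how much input-to-output coupling is carried at each delay, which is the quantity a pathway-level penalty would act on.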

14. Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

ArXiv ID: 2605.04269

Primary Topic: Architecture and Training Dynamics

Authors: Sharan Sahu, Abir Sarkar, Cameron J. Hogan, Martin T. Wells

Abstract: We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general $L$-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by $\beta_1$, and a preconditioner perturbation governed by $\beta_2$. We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise--drift tradeoff: in noise-dominated regimes, first-moment averaging and adaptive preconditioning can improve the high-probability error, whereas in drift-dominated regimes, stale first-moment information and preconditioner perturbations can compound the cost of nonstationarity, allowing vanilla SGD to achieve a smaller tracking floor. Our explicit $(\beta_1,\beta_2,\epsilon)$-dependent bounds delineate when adaptive step-sizing is beneficial versus harmful, and provide a theoretical mechanism for Adam's empirical instability and stabilization under distribution shift.

Comment: Provides explicit theory for when Adam helps or hurts under nonstationary objectives, isolating drift and noise tradeoffs through beta-dependent bounds.

Topic Match: This is foundational training-dynamics theory about optimizer behavior under distribution shift and nonstationarity.

Relevance: 8 Novelty: 8
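
One ingredient of the drift-dominated regime in the abstract above, stale first-moment information, can be seen in isolation. The sketch below is not the paper's bound. It only checks the standard fact that an exponential moving average with parameter beta1 trails a linearly drifting signal by beta1 / (1 - beta1) steps once the start-up transient has died out. The function name and the linear-drift signal are assumptions made for illustration.

```python
def ema_lag(beta1, slope=1.0, steps=500):
    """Lag (in steps) of an EMA behind a linearly drifting signal g_t = slope * t."""
    m = 0.0
    for t in range(1, steps + 1):
        g = slope * t
        m = beta1 * m + (1.0 - beta1) * g  # Adam-style first-moment update
    return (slope * steps - m) / slope

lag_09 = ema_lag(0.9)                # close to 0.9 / 0.1 = 9 steps
lag_099 = ema_lag(0.99, steps=5000)  # close to 0.99 / 0.01 = 99 steps
lag_sgd = ema_lag(0.0)               # no averaging, so no lag
```

With beta1 = 0 the update uses the current gradient only, which is the sense in which plain SGD carries no stale first-moment information.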

15. Ortho-Hydra: Orthogonalized Experts for DiT LoRA

ArXiv ID: 2605.03252

Primary Topic: Architecture and Training Dynamics

Authors: Seunghyun Ji

Abstract: LoRA fine-tuning of diffusion transformers (DiT) on multi-style data suffers from style bleed: a single low-rank residual cannot represent several distinct artist fingerprints, and the optimizer converges to their average. Mixture-of-experts LoRA in the HydraLoRA style replaces the up-projection with $E$ heads under a router, but when every expert is zero-initialized the router receives identical gradient from each head and remains at the uniform prior. The experts then evolve permutation-symmetrically, and the network trains as a single rank-$r$ LoRA at $E\times$ the cost. We present Ortho-Hydra, a re-parameterisation that combines an OFT-style Cayley-orthogonal shared basis with per-expert disjoint output subspaces carved from the top-$(Er)$ left singular vectors of the pretrained weight. Disjointness makes the router's per-expert score non-degenerate at step $0$, so specialization receives gradient signal before any expert has trained. We test the predicted deadlock on a DiT pipeline by comparing two HydraLoRA baselines, a zero-initialized shared-basis variant and the original $\sigma = 0.1$ Gaussian-jitter mitigation, against Ortho-Hydra under a matched optimiser, dataset, and step budget. Neither baseline leaves the uniform prior within the first 1k steps; Ortho-Hydra begins de-uniformising within the first few hundred. End-task generation quality on multi-style data is out of scope; we report the construction, the cold-start mechanism, and the routing dynamics it changes. Code: https://github.com/sorryhyun/anima_lora.

Comment: Breaks zero-init MoE-LoRA routing symmetry by enforcing disjoint expert subspaces, giving specialization gradient at step zero.

Topic Match: This is a mechanistic contribution about expert routing dynamics and symmetry breaking in modular architectures.

Relevance: 8 Novelty: 8
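
The two building blocks named in the abstract above can be sketched with NumPy: disjoint per-expert output subspaces cut from the top-(E*r) left singular vectors of a weight matrix, and a Cayley transform that turns a skew-symmetric matrix into an orthogonal one. The random `W0` stands in for a pretrained weight, and all shapes and names are hypothetical. How the paper wires these pieces into the LoRA update and the router is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, E, r = 32, 24, 4, 2

W0 = rng.normal(size=(d_out, d_in))  # stand-in for a pretrained weight
U, S, Vt = np.linalg.svd(W0, full_matrices=False)

# Split the top-(E*r) left singular vectors into E disjoint blocks of r columns.
top = U[:, : E * r]
expert_bases = [top[:, e * r:(e + 1) * r] for e in range(E)]

def cayley(A):
    """Cayley transform (I + A)^{-1} (I - A); orthogonal when A is skew-symmetric."""
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + A, I - A)

M = rng.normal(size=(d_in, d_in))
Q = cayley(M - M.T)  # M - M.T is skew-symmetric
```

Because the columns of `U` are orthonormal, the blocks are mutually orthogonal, so each expert writes into its own output subspace from the first step.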

16. Covariance-Aware Goodness for Scalable Forward-Forward Learning

ArXiv ID: 2605.04346

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Xiaoyi Jiang, Bashir M. Al-Hashimi, Kai Xu

Abstract: The Forward-Forward algorithm eliminates global gradient flow and the storage of full network activations. However, in convolutional settings, existing BP-free FF methods significantly under-perform backpropagation on complex benchmarks such as ImageNet-100 and Tiny-ImageNet. We identify this gap as a structural bottleneck in goodness extraction: the standard sum-of-squares formulation collapses feature volumes into channel-wise activation energies, which omits critical second-order dependencies. To address this, we propose a framework centered on three key components. First, Bi-axis Covariance Goodness (BiCovG) explicitly augments the standard goodness function with structured second-order information along two axes: cross-channel projections that model inter-feature covariance, and nested multi-scale aggregation that encodes spatial correlation statistics. This provides a tractable approximation to covariance-aware goodness without the prohibitive O(C^2) complexity of explicit matrix estimation. Second, a lightweight Logistic Fusion module aggregates layer-wise predictions, amplifying the contribution of deeper representations. Third, the Feature Alignment Layer (FAL) introduces a zero-initialized correction at block boundaries to mitigate representation misalignment in deep locally trained networks. By introducing these three components, we effectively double the depth of viable Forward-Forward learning, extending robust layer utilization from shallow baselines to 16-layer architectures like VGG-16. The resulting BP-free model achieves 73.01% on ImageNet-100 and 50.30% on Tiny-ImageNet. As a practical extension, Hybrid Goodness Blocks control the scope of gradient propagation via configurable block sizes, further narrowing the ImageNet-100 gap to 3.6% and matching BP on Tiny-ImageNet, while still reducing peak memory by approximately 50% relative to BP.

Comment: Adds covariance-aware goodness functions and alignment layers to make Forward-Forward learning viable at substantially greater depth.

Topic Match: The strongest match is a new training mechanism for non-backprop learning, centered on how representations and layer-local objectives are structured.

Relevance: 8 Novelty: 8

17. Perturbation is All You Need for Extrapolating Language Models

ArXiv ID: 2605.04344

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Zetai Cen, Jin Zhu, Xinwei Shen, Chengchun Shi

Abstract: We introduce a simple yet powerful framework for training large language models. In contrast to the standard autoregressive next-token prediction based on an exact prefix, we propose a perturbation-based procedure that first transforms the prefix into a semantic neighbor and then conditions on this perturbed variant for next-token prediction. This yields a hierarchical model with a pre-post-additive noise structure. Within this framework, we develop a rigorous theory of extrapolability, namely, the capacity of a model class to make reliable predictions for token sequences that lie outside the empirical support of the training corpus. We evaluate the finite-sample performance of the proposed procedure using both synthetic and real-world language data. Results show that the proposed method consistently improves out-of-support prediction while maintaining competitive in-support performance, demonstrating that perturbation offers a practical route to language modeling.

Comment: Introduces perturbation-conditioned next-token training and develops a theory of extrapolability beyond corpus support.

Topic Match: The paper proposes a new language-model training principle with accompanying theory about out-of-support generalization, making training dynamics the best primary fit.

Relevance: 8 Novelty: 8

18. Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

ArXiv ID: 2605.04291

Primary Topic: Architecture and Training Dynamics

Authors: Tarun Kathuria, Sachin Kumar

Abstract: We present a discrete diffusion-based language model using Glauber dynamics from statistical physics. Our main insight is that instead of trying to train a discrete state space diffusion model using Glauber dynamics with a uniform transition kernel as the forward process, one can set up an "energy function" based on pretrained causal/masked language models. When viewed as the stationary distribution, this energy function allows us to significantly improve the quality of the generated text. Incorporating UL2 as the pretrained model into our diffusion pipeline, we outperform prior diffusion-based LMs and perform competitively with autoregressive models of comparable model sizes. Furthermore, our models are competitive with or outperform prior diffusion models and GPT-2 style auto-regressive models on zero-shot common sense reasoning tasks as well as planning and search tasks like Sudoku and Zebra puzzles.

Comment: Uses pretrained language models as energy functions in Glauber-dynamics text diffusion instead of learning the energy from scratch.

Topic Match: Primary fit is architecture and core generative mechanism design: the paper proposes a different diffusion construction for language modeling built around LM-derived energies.

Relevance: 8 Novelty: 8

19. Endogenous Regime Switching Driven by Scalar-Irreducible Learning Dynamics

ArXiv ID: 2605.04054

Primary Topic: Architecture and Training Dynamics

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Sheng Ran

Abstract: Achieving endogenous regime switching is crucial for the emergence of autonomous intelligence, yet remains a central challenge for existing machine learning frameworks, where such transitions are typically externally imposed. In this work, we introduce a classification that distinguishes scalar-reducible dynamics, which can be expressed as gradient flows driven by a scalar objective, from scalar-irreducible dynamics that cannot be reduced to such a form. While most existing machine learning systems operate within the scalar-reducible class, we demonstrate that scalar-irreducible dynamics naturally enable internally generated regime switching through feedback between fast dynamical variables and slow structural adaptation. Using a minimal dynamical model, we illustrate how this mechanism produces sustained endogenous regime transitions without external scheduling. Our results suggest a new dynamical paradigm for regime exploration and provide a potential route toward autonomous learning systems whose adaptive behavior is organized internally rather than externally prescribed.

Comment: Argues scalar-irreducible learning dynamics enable endogenous regime switching that scalar-objective gradient systems cannot produce.

Topic Match: Primary fit is architecture/training dynamics because the paper proposes a foundational dynamical lens on learning systems and internally generated mode switching.

Relevance: 8 Novelty: 8

20. Koopman Identification of Nonlinear Systems via Reservoir Liftings

ArXiv ID: 2605.04917

Primary Topic: Architecture and Training Dynamics

Also Matches: Memory Structures and Agent Memory Systems

Authors: Weibin Gu, Chen Yang, Lu Shi

Abstract: Learning tractable linear representations of nonlinear dynamical systems via Koopman operator theory is often hindered by dictionary selection, temporal memory encoding, and numerical ill-conditioning. Inspired by the Reservoir Computing (RC) paradigm, this paper introduces the RC-Koopman framework, which interprets the reservoir as a stateful, finite-dimensional Koopman dictionary whose temporal depth is explicitly controlled by its spectral radius. We show that the Echo State Property (ESP) guarantees well-posedness and favorable numerical conditioning of the lifted Koopman approximation. A correlation-based spectral radius selection algorithm aligns reservoir memory with dominant system timescales. Analysis reveals how the finite memory of the reservoir determines which Koopman eigenfunctions remain observable from the lifted features. Evaluation on synthetic benchmarks demonstrates that RC-Koopman achieves a favorable balance between reconstruction accuracy of the underlying nonlinear dynamics and dynamical stability, compared to Extended Dynamic Mode Decomposition (EDMD) and Hankel-based lifting approaches. Code available at: https://github.com/NEAR-the-future/RC-Koopman.git

Comment: Uses reservoir states as Koopman liftings, with analysis connecting spectral radius, memory depth, observability, and numerical conditioning.

Topic Match: The core idea is a principled recurrent/stateful modeling mechanism for nonlinear dynamics, analyzed through memory and observability properties.

Relevance: 8 Novelty: 8
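
A rough sketch of the idea in the abstract above, under stated assumptions: a random tanh reservoir is rescaled to a chosen spectral radius, driven by an observed signal, and an EDMD-style least-squares map is fitted between consecutive reservoir states. The reservoir size, the sine-wave signal, the washout length, and the fixed `rho` are all hypothetical. The paper's correlation-based spectral radius selection and its ESP analysis are not reproduced, and a spectral radius below 1 is used only as the usual practical heuristic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, rho = 50, 0.8  # hypothetical reservoir size and target spectral radius

W = rng.normal(size=(n_res, n_res))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to spectral radius rho
w_in = rng.normal(size=n_res)

def lift(signal):
    """Drive the reservoir with a scalar signal and return its state trajectory."""
    state = np.zeros(n_res)
    states = []
    for u in signal:
        state = np.tanh(W @ state + w_in * u)
        states.append(state.copy())
    return np.array(states)

x = np.sin(0.1 * np.arange(400))  # toy observed signal
R = lift(x)

# Drop an initial washout, then fit a linear one-step map on the lifted states.
Z0, Z1 = R[100:-1], R[101:]
K, *_ = np.linalg.lstsq(Z0, Z1, rcond=None)  # Z1 is approximated by Z0 @ K
one_step_error = float(np.linalg.norm(Z0 @ K - Z1) / np.linalg.norm(Z1))
```

Lowering `rho` shortens the reservoir's memory of past inputs and raising it lengthens that memory, which is the knob the abstract refers to as temporal depth.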

21. Exact Dual Geometry of SOC-ICNN Value Functions

ArXiv ID: 2605.04722

Primary Topic: Architecture and Training Dynamics

Authors: Kang Liu, Jianchen Hu, Wei Peng

Abstract: Input Convex Neural Networks (ICNNs) are commonly used in a two-stage manner: one first trains a convex network and then minimizes it over its input in a downstream inference problem. Recent second-order-cone ICNNs (SOC-ICNNs) enrich ReLU-based ICNNs with quadratic and conic modules and admit an exact representation as value functions of second-order cone programs (SOCPs). This value-function structure enables an explicit convex-analytic treatment of SOC-ICNN inference. In this paper, we study the exact first-order and local second-order geometry of SOC-ICNNs from the dual viewpoint. We show that supporting slopes, subdifferentials, directional derivatives, and local Hessians can be recovered directly from optimal dual variables. These results provide the geometric primitives for white-box SOC-ICNN inference, going beyond black-box automatic differentiation. Numerical experiments validate the exact multiplier readout, the local Hessian formula, and the set-valued behavior at structurally degenerate inputs. We also provide a step-by-step tutorial showing how the readout mechanism instantiates a complete white-box inference loop. The code is available at https://anonymous.4open.science/r/SOC-ICNN-Theory-BEFC/.

Comment: Derives exact subdifferentials, directional derivatives, and local Hessians of SOC-ICNN value functions directly from optimal dual variables.

Topic Match: The paper gives white-box geometric analysis of a specific neural architecture class, directly fitting architectural mechanism and inference theory.

Relevance: 8 Novelty: 8

22. Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

ArXiv ID: 2605.03034

Primary Topic: Architecture and Training Dynamics

Authors: Kerri Prinos, Lilianne Brush, Cameron Denton, Zhanqi Wang, Joshua Knox, Snehal Antani, Anton Foltz, Amy Villaseñor

Abstract: Agentic systems involved in high-stakes decision-making under adversarial pressure need formal guarantees not offered by existing approaches. Motivated by the operational needs of security operations centers (SOCs) that must configure endpoint detection and response (EDR) policies under adversarial pressure, we present a tool-mediated architecture: LLM agents use deterministic tools (Stackelberg best-response, Bayesian observer updates, attack-graph primitives) and select from finite action catalogs enforced at the tool-output interface. A composite Lyapunov function machine-checked in Lean 4 with zero sorry certifies controllability, observability from asymmetric sensor data, and Input-to-State Stability (ISS) robustness under intelligent adversarial disturbance, with two corollaries extending the certificate to any controller or adversary from the catalogs. On 282 real enterprise attack graphs, the claims hold with margin. On paired offensive/defensive telemetry, a tool-mediated Claude Sonnet 4 controller reduces the attacker's expected payoff (game value) by 59% relative to a deterministic greedy baseline, with zero variance across 40 runs at four temperatures. A Claude Haiku 4.5 controller converges to suboptimal game values but stays catalog-bounded over an additional 40 runs, demonstrating that architectural stability is not dependent on the controller capability. The LLM agent's non-determinism furthers creative exploration of strategies, while the tool-mediated architecture ensures system stability.

Comment: Presents a tool-mediated agent architecture with finite action catalogs and a machine-checked Lyapunov stability certificate for adversarial control.

Topic Match: Best fit is architecture/training because the key contribution is an architectural principle for stabilizing agent behavior using constrained tool interfaces and formal control guarantees.

Relevance: 8 Novelty: 8

Efficiency, Compression, and Large-Scale Training (7)

1. Gated Subspace Inference for Transformer Acceleration

ArXiv ID: 2605.03109

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Stephen J. Thomas

Abstract: A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activation vector into a subspace component and a residual, computes the linear-layer output on the subspace component via a cached low-rank weight image at reduced memory bandwidth, and applies a per-token gate that determines whether the residual correction is computed or skipped. The gate ensures that the output distribution is preserved to within a controllable tolerance. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates effective speedups of 3.0x to 10.5x on linear-layer weight reads with perplexity ratios below 1.00 and top-1 token agreement above 98%. The method requires no retraining, no architectural modification, and no approximation of the attention mechanism. At the operating point ($k = 256$, $\epsilon = 0.05$) on GPT-J 6B, the accelerated model produces character-for-character identical output to the baseline.

Comment: Accelerates transformer inference by caching low-rank subspace projections and gating whether residual corrections are needed per token.

Topic Match: The contribution is a new inference-efficiency mechanism for large transformers with controllable approximation.

Relevance: 9 Novelty: 8
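
The decomposition described in the abstract above can be sketched for a single linear layer. Everything specific here is an assumption: the orthonormal basis `U` is random rather than fitted to activations, and the gate compares the relative residual norm against `eps`, which is only one plausible criterion since the abstract does not state the rule. The names `gated_linear`, `WU`, and `eps` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, k = 64, 48, 8

W = rng.normal(size=(d_out, d_in))               # full linear-layer weight
U, _ = np.linalg.qr(rng.normal(size=(d_in, k)))  # orthonormal basis, shape (d_in, k)
WU = W @ U                                       # cached low-rank weight image

def gated_linear(x, eps):
    """Compute W @ x via the subspace path, adding the residual term only if gated on."""
    z = U.T @ x        # coordinates of x in the subspace
    resid = x - U @ z  # component of x outside the subspace
    y = WU @ z         # cheap path: touches d_out * k weights
    use_residual = bool(np.linalg.norm(resid) > eps * np.linalg.norm(x))
    if use_residual:
        y = y + W @ resid  # full-weight correction
    return y, use_residual

x_generic = rng.normal(size=d_in)        # mostly outside the subspace
x_in_span = U @ rng.normal(size=k)       # lies inside the subspace
y_generic, gate_generic = gated_linear(x_generic, eps=0.05)
y_in_span, gate_in_span = gated_linear(x_in_span, eps=0.05)
```

When the gate fires, the result equals the dense product exactly. When it does not, the error is bounded by the norm of `W @ resid`, which is what the tolerance is meant to control.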

2. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

ArXiv ID: 2605.04738

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Zhikai Li, Zhen Dong, Xuewen Liu, Jing Zhang, Qingyi Gu

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities. However, their massive parameter scale leads to significant resource consumption and latency during inference. Post-training weight-only quantization offers a promising solution by reducing model size and accelerating token generation through alleviating the memory-bound issue. Nevertheless, the presence of inherent systematic outliers in weights continues to be a major obstacle. While existing methods, such as scaling and rotation, attempt to address this issue, the performance remains unsatisfactory. In this paper, we propose Outlier Self-Absorption Quantization (OSAQ), which performs additive weight suppression guided by the second-order low-rank property for low-bit weight-only quantization of LLMs. Specifically, we observe that the Hessian exhibits low-rank consistency across different inputs, with certain directions consistently showing vanishing curvature. Leveraging this property, we identify a stable null space of the Hessian and then construct an additive weight transformation by linearly combining the vectors within this null space, thereby suppressing weight outliers without affecting the task loss. This additive transformation can be absorbed into the weights offline, requiring no inter-layer transformations and introducing no inference overhead. Moreover, the construction is efficiently achieved by a closed-form solution, without resource-intensive training or iterative procedures. Extensive experiments demonstrate that OSAQ effectively suppresses outliers and enhances low-bit quantization performance. For instance, in 2-bit quantization, OSAQ, when integrated with GPTQ, achieves over 40% lower perplexity compared to vanilla GPTQ.

Comment: Suppresses quantization outliers by exploiting stable low-curvature Hessian null spaces, giving a closed-form low-bit weight transformation.

Topic Match: The core advance is a new quantization method grounded in second-order structure to improve low-bit LLM compression.

Relevance: 9 Novelty: 8

3. Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

ArXiv ID: 2605.05049

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Sajal Dash, Feiyi Wang

Abstract: Frontier models increasingly adopt Mixture-of-Experts (MoE) architectures to achieve large-model performance at reduced cost. However, training MoE models on HPC platforms is hindered by large memory footprints, frequent large-scale communication across heterogeneous networks, and severe workload imbalance. To characterize these challenges, we develop a mathematical model that quantifies memory, compute, and communication requirements for MoE configurations under various parallelization schemes, verified through micro-benchmarking, code instrumentation, and hardware profiling. Our analysis identifies performance bottlenecks: all-to-all latency at scale from expert parallelism, insufficient compute-communication overlap, low GPU utilization from imbalanced skinny GEMMs, and the absence of platform-aware hybrid parallelization strategies. To address these, we introduce Piper, a framework that leverages resource modeling to identify efficient training strategies for MoE models on target HPC platforms, applying pipeline parallelism with optimized schedules. Piper achieves 2-3.5X higher MFU than state-of-the-art frameworks such as X-MoE, and a novel all-to-all algorithm delivers 1.2-9X bandwidth over vendor implementation.

Comment: Builds a resource model for MoE training and uses it to choose pipelined hybrid parallel strategies that materially improve training efficiency.

Topic Match: The paper's core value is a new systems-and-algorithmic approach for large-scale MoE training efficiency, not MoE architecture itself.

Relevance: 9 Novelty: 8

4. UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

ArXiv ID: 2605.04543

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Yepeng Weng, Qiao Hu, Takehisa Yairi

Abstract: Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolation, applying either flat OT to single-step drafts or per-token rejection sampling to tree-structured candidates. This separation leaves the joint regime (where multi-step dependencies meet multi-draft branching) poorly optimized, as local verification rules fail to exploit the coupling between horizontal and vertical dimensions of candidate trees. In this paper, we propose a unified perspective that casts tree-based verification as a conditional OT problem. Our key insight is that vertical dependencies can be abstracted through prefix acceptance probabilities, which act as dynamic scaling factors to actively guide horizontal draft selection. Based on this principle, we introduce UniVer, a verification algorithm that jointly optimizes across tree levels by composing local optimal transport plans under prefix constraints. We prove that UniVer remains lossless and achieves the optimal acceptance rate under the proposed conditional framework. Extensive experiments across different tasks and models demonstrate that UniVer improves acceptance length by 4.2% to 8.5% over standard recursive rejection sampling without replacement, while maintaining exact distributional alignment with the target model.

Comment: Unifies multi-step and multi-draft speculative decoding as conditional optimal transport with lossless optimal verification.

Topic Match: Its central advance is a decoding algorithm that materially improves inference efficiency while preserving exactness.

Relevance: 9 Novelty: 8
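
To make "lossless" in the abstract above concrete, the sketch below computes the exact output distribution of standard single-draft, single-step speculative sampling. This is the familiar baseline rule, not UniVer itself. A draft token from `q` is accepted with probability min(1, p/q), and on rejection a token is drawn from the normalized residual max(p - q, 0). The function name is hypothetical.

```python
import numpy as np

def speculative_output_distribution(p, q):
    """Exact token distribution produced by accept/reject verification of one draft."""
    p = np.asarray(p, dtype=float)  # target model distribution
    q = np.asarray(q, dtype=float)  # draft model distribution
    accept = np.minimum(p, q)       # q(x) * min(1, p(x) / q(x))
    accept_rate = float(accept.sum())
    residual = np.maximum(p - q, 0.0)
    total = residual.sum()
    if total > 0:
        residual = residual / total  # distribution used after a rejection
    return accept + (1.0 - accept_rate) * residual, accept_rate

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
out, accept_rate = speculative_output_distribution(p, q)
```

The returned distribution matches `p` for any `q`, while `accept_rate` depends on how closely the draft matches the target. Raising that rate without breaking the first property is the goal verification schemes compete on.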

5. A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

ArXiv ID: 2605.04595

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Memory Structures and Agent Memory Systems

Authors: Chengyi Nie, Nian Si, Zijie Zhou

Abstract: The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-value (KV) caching, which accelerates decoding but quickly exhausts GPU memory. In this paper, we introduce the first queueing-theoretic framework that explicitly incorporates both computation and GPU memory constraints into the analysis of LLM inference. Based on this framework, we derive rigorous stability and instability conditions that determine whether an LLM inference service can sustain incoming demand without unbounded queue growth. This result offers a powerful tool for system deployment, potentially addressing the core challenge of GPU provisioning. By combining an estimated request arrival rate with our derived stable service rate, operators can calculate the necessary cluster size to avoid both costly over-purchasing and performance-violating under-provisioning. We further validate our theoretical predictions through extensive experiments in real GPU production environments. Our results show that the predicted stability conditions are highly accurate, with deviations typically within 10%.

Comment: Develops the first queueing-theoretic stability analysis for LLM inference under joint compute and KV-cache memory constraints.

Topic Match: The main value is a systems-theoretic account of inference stability under KV-cache constraints, squarely in efficient large-scale serving.

Relevance: 9 Novelty: 8
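
The provisioning step mentioned in the abstract above reduces to simple arithmetic once a stable service rate is known. The sketch assumes that capacity scales linearly with the number of GPUs and that stability requires the arrival rate to be strictly below total capacity. The per-GPU service rate is an input here, since the paper's derivation of it is not reproduced, and the function name and example rates are hypothetical.

```python
import math

def min_stable_gpus(arrival_rate, service_rate_per_gpu):
    """Smallest GPU count n with arrival_rate < n * service_rate_per_gpu."""
    if arrival_rate < 0 or service_rate_per_gpu <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_rate_per_gpu > 0")
    return math.floor(arrival_rate / service_rate_per_gpu) + 1

# Hypothetical rates in requests per second.
n_a = min_stable_gpus(12.0, 2.5)  # 12 / 2.5 = 4.8, so 5 GPUs
n_b = min_stable_gpus(10.0, 2.5)  # exactly 4.0 is not strictly stable, so 5 GPUs
```

Using floor plus one rather than ceiling keeps the inequality strict, which matters when the ratio happens to be a whole number.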

6. Budget-aware Auto Optimizer Configurator

ArXiv ID: 2605.04711

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Kang Liu, Wei Peng, Jianchen Hu

Abstract: Optimizer states occupy massive GPU memory in large-scale model training. However, gradients in different network blocks exhibit distinct behaviors, such as varying directional stability and scale anisotropy, implying that expensive optimizer states are not universally necessary and using a global optimizer is often memory-inefficient. We propose the Budget-Aware Optimizer Configurator (BAOC) to reduce memory cost by assigning suitable optimizer configurations to individual blocks under given budgets. Specifically, BAOC samples gradient streams to derive statistical metrics that quantify the potential performance risk of applying cheaper configurations (e.g., low precision or removing momentum). It then solves a constrained allocation problem to minimize total risk under memory and time budgets, selecting a budget-feasible configuration for each block. Experiments across vision, language, and diffusion workloads demonstrate that BAOC maintains training quality while significantly reducing the memory usage of optimizer states. The code is available at https://anonymous.4open.science/r/BAOC-45C6.

Comment: Allocates optimizer configurations per network block under memory/time budgets using gradient statistics rather than applying one optimizer globally.

Topic Match: Its main idea is reducing large-scale training cost by optimizer-state budgeting and per-block configuration, making efficiency and scaling the best fit.

Relevance: 8 Novelty: 8

7. Rethinking the Rank Threshold for LoRA Fine-Tuning

ArXiv ID: 2605.03724

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Juneyoung Park

Abstract: A recent landscape analysis of LoRA fine-tuning in the neural tangent kernel regime establishes a sufficient condition $r(r+1)/2 > KN$ on the LoRA rank $r$ for the absence of spurious local minima under squared-error loss, prescribing $r \geq 12$ on canonical few-shot RoBERTa setups. The condition is stated for general output dimension $K$, so its sharpness in any particular regime, and its practical implication for the cross-entropy loss actually used in fine-tuning, are open. We give three results that together reduce the prescribed rank to $r = 1$ for binary classification in this regime. First, replacing the symmetric Sard-form count with the non-symmetric LoRA manifold dimension yields a strictly weaker capacity requirement, $r(m+n) - r^2 > C^* \cdot KN$ with $C^* \approx 1.35$ under Gaussian-iid features, satisfied at $r = 1$ on canonical setups. Second, in the cross-entropy setting the Polyak-Łojasiewicz inequality removes the rank threshold entirely. Third, a Rademacher-complexity bound predicts rank-one variance optimality precisely when the bias term is saturated, which is the case for binary classification but not for $K > 2$. Empirically, across four GLUE-style binary tasks, three encoder architectures, and at scale on RoBERTa-large, rank one is competitive with the existing prescription $r = 12$; on multi-class MNLI the optimal rank shifts above one, also as predicted. The binary-regime guarantees are conditional on standard NTK assumptions; the multi-class extension is left to future work.

Comment: Shows that rank-one LoRA can suffice in binary classification by tightening the theoretical rank threshold and extending analysis to cross-entropy.

Topic Match: The contribution is fundamentally about low-rank adaptation efficiency and when tiny adaptation ranks are theoretically enough.

Relevance: 8 Novelty: 8
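
The two rank conditions quoted in the abstract above are easy to evaluate numerically. The sketch finds the smallest rank satisfying each one as a function of the product K*N. The layer shape m = n = 768 is an assumption made for illustration, and the constant of roughly 1.35 is taken from the abstract. The abstract does not state the K and N of the canonical setups, so the code only reports which values of K*N would make 12 the smallest admissible rank under the first condition.

```python
def min_rank_symmetric(kn):
    """Smallest r with r * (r + 1) / 2 > kn (the earlier sufficient condition)."""
    r = 1
    while r * (r + 1) / 2 <= kn:
        r += 1
    return r

def min_rank_manifold(kn, m, n, c_star=1.35):
    """Smallest r with r * (m + n) - r**2 > c_star * kn, or None if no r <= min(m, n) works."""
    for r in range(1, min(m, n) + 1):
        if r * (m + n) - r * r > c_star * kn:
            return r
    return None

# Values of K*N for which the first condition first holds at r = 12.
kn_giving_12 = [kn for kn in range(1, 200) if min_rank_symmetric(kn) == 12]

# Under the assumed 768 x 768 layer, the second condition already holds at r = 1.
rank_at_largest_kn = min_rank_manifold(max(kn_giving_12), m=768, n=768)
```

The first threshold grows like the square root of K*N, while the left side of the second condition already equals m + n - 1 at rank one. That gap is why a large layer can satisfy the second condition at r = 1.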

Representation Learning Theory and Structure (8)

1. Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning

ArXiv ID: 2605.04061

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Bryan Cheng, Jasper Zhang

Abstract: Understanding how large language models encode task identity from few-shot demonstrations is a central open problem in mechanistic interpretability. Prior work uses linear probing to localize task representations, reporting high classification accuracy at specific layers. We reveal a striking dissociation: probing accuracy completely fails to predict causal importance. Single-position activation intervention achieves 0% task transfer across all 28 layers of Llama-3.2-3B, despite 100% probing accuracy at those same positions. This null result is itself a key finding, demonstrating that task encoding is fundamentally distributed. Multi-position intervention, which replaces activations at all demonstration output tokens simultaneously, achieves up to 96% transfer (N=50, 95% CI: [87%, 99%]) at layer 8, pinpointing for the first time the causal locus of ICL task identity. We establish the generality of these findings across four models spanning three architecture families (LLaMA, Qwen, Gemma), discovering a universal intervention window at ~30% network depth. Causal tracing uncovers an asymmetric architecture: the query position is strictly necessary (53-100% disruption) while no individual demonstration position is necessary (0% disruption), resolving a key ambiguity in prior accounts. Crucially, transfer depends on internal representation compatibility, not surface similarity (r=-0.05 vs r=0.31), ruling out trivial explanations. These results establish the distributed template hypothesis: ICL task identity is encoded as output format templates distributed across demonstration tokens, fundamentally reshaping our understanding of how in-context learning operates.

Comment: Demonstrates that in-context task identity is causally encoded as a distributed output-template representation rather than at any single token position.

Topic Match: The main contribution is mechanistic understanding of how internal representations support in-context learning.

Relevance: 9 Novelty: 8
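
The interval quoted in the abstract (96% transfer at N=50, 95% CI [87%, 99%]) is consistent with 48 successes out of 50 under a Wilson score interval. The abstract does not say which interval method the authors used, so the sketch below is only a sanity check under that assumption. The function name `wilson_interval` is chosen here for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumption: 96% of N=50 corresponds to 48 successful transfers.
low, high = wilson_interval(48, 50)
print(f"{low:.0%} to {high:.0%}")  # prints: 87% to 99%
```

With only 50 trials the interval is about 12 points wide, which is worth keeping in mind when comparing layers whose transfer rates differ by a few points.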

2. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

ArXiv ID: 2605.05115

Primary Topic: Representation Learning Theory and Structure

Authors: Daniel Wurgaft, Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas McGrath, Owen Lewis, Jack Merullo, Noah Goodman, Thomas Fel, Atticus Geiger, Ekdeep Singh Lubana

Abstract: Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold $M_h$ to representations and a behavior manifold $M_y$ to output probability distributions. We then test the link $M_h \leftrightarrow M_y$ via interventions: we find that steering along $M_h$, which we term manifold steering, yields behavioral trajectories that follow $M_y$, while linear steering -- which assumes a Euclidean geometry -- cuts through off-manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along $M_y$ recovers activation trajectories that trace the curvature of $M_h$. We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.

Comment: Shows a causal link between activation-manifold geometry and behavioral trajectories, replacing direction-based steering with geometry-respecting interventions.

Topic Match: The paper is fundamentally about the structure of learned representations and how that geometry governs behavior.

Relevance: 9 Novelty: 8
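
The contrast between linear steering and manifold steering can be illustrated with a toy case. In the sketch below a unit circle stands in for a cyclic activation manifold; this is an assumption made for illustration, not the authors' manifold-fitting procedure, and the function names are hypothetical.

```python
import math

def linear_path(a, b, t):
    """Straight-line (Euclidean) interpolation between points a and b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def circle_path(a, b, t):
    """Interpolate along the unit circle by interpolating the angle.

    Toy version: it does not handle wrap-around across the -pi/pi cut.
    """
    theta_a = math.atan2(a[1], a[0])
    theta_b = math.atan2(b[1], b[0])
    theta = (1 - t) * theta_a + t * theta_b
    return [math.cos(theta), math.sin(theta)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

start, end = [1.0, 0.0], [0.0, 1.0]
mid_linear = linear_path(start, end, 0.5)
mid_circle = circle_path(start, end, 0.5)
print(norm(mid_linear))  # about 0.707: the midpoint has left the circle
print(norm(mid_circle))  # 1.0 up to rounding: the midpoint stays on the circle
```

The straight-line midpoint lies inside the circle, which is the toy analogue of a steering path that cuts through off-manifold regions.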

3. Adaptivity Under Realizability Constraints: Comparing In-Context and Agentic Learning

ArXiv ID: 2605.04995

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Anastasis Kratsios, A. Martina Neuman, Philipp Petersen

Abstract: We compare in-context learning with fixed queries and agentic learning with adaptive queries for uniform approximation of task families. We consider two settings: an unrestricted regime, where querying and approximation are arbitrary functions, and a realizable regime, where we require these operations to be implemented by ReLU neural networks. In both settings, adaptivity never hinders approximation performance. However, this advantage can change when one passes from the unrestricted regime to the realizable regime. We identify four distinct approximation scenarios, each witnessed by an explicit task family: (a) no advantage of adaptivity; (b) an advantage in the unrestricted regime that persists under ReLU realizability; (c) an advantage that arises only under realizability; and (d) an advantage that disappears under realizability. This demonstrates that representational constraints interact profoundly with the effect of adaptivity.

Comment: Gives a theoretical comparison of fixed-query in-context learning versus adaptive agentic learning under neural-network realizability constraints.

Topic Match: The main contribution is a theory result on how representational constraints change the value of adaptivity, which is best viewed as representation-learning structure.

Relevance: 8 Novelty: 8

4. Conceptors for Semantic Steering

ArXiv ID: 2605.04980

Primary Topic: Representation Learning Theory and Structure

Authors: Ilias Triantafyllopoulos, Young-Min Cho, Ren Tao, Miranda Muqing Miao, Sunny Rai, Lyle Ungar, Sharath Chandra Guntuku, Neville Ryant, João Sedoc

Abstract: Activation-based steering provides control of LLM behavior at inference time, but the dominant paradigm reduces each concept to a single direction whose geometry is left largely unexamined. Rather than selecting a single steering direction, we use conceptors: soft projection matrices estimated from activations pooled across both poles of a bipolar concept, which preserve the concept's full multidimensional subspace. A geometric analysis shows the bipolar subspace strictly subsumes the single-vector baseline. We further show that the conceptor quota provides a parameter-free layer-selection diagnostic, predicting concept separability with Pearson correlations up to r=0.96 across three instruction-tuned models and three semantic dimensions. Beyond selection, conceptors admit a closed-form Boolean algebra (AND, OR, NOT): we evaluate conceptor compositionality on thematically related sub-concepts. Across a systematic five-axis design-space evaluation, conceptors match or outperform additive baselines at layers where concept subspaces are multi-dimensional while producing substantially fewer degenerate outputs. Conceptor steering is a geometrically principled, compositional, and practically safer alternative to single-direction steering from a limited number of contrastive pairs.

Comment: Replaces single steering vectors with conceptor subspaces, giving a multidimensional and compositional geometry for activation steering.

Topic Match: The paper is best viewed as structure-of-representations work: it studies concept geometry in activation space and how to manipulate it.

Relevance: 8 Novelty: 8
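
For readers unfamiliar with conceptors, the standard definitions from the conceptor literature are summarized below. The symbols $R$ (activation correlation matrix), $\alpha$ (aperture), and $N$ (activation dimension) are notation assumed here, and the abstract does not say whether the paper uses exactly these forms.

$$
C = R\,(R + \alpha^{-2} I)^{-1}, \qquad R = \mathbb{E}\,[x x^{\top}], \qquad q(C) = \frac{\operatorname{tr}(C)}{N}
$$

$$
\neg C = I - C, \qquad C_1 \wedge C_2 = \left(C_1^{-1} + C_2^{-1} - I\right)^{-1}, \qquad C_1 \vee C_2 = \neg\left(\neg C_1 \wedge \neg C_2\right)
$$

The AND formula as written requires invertible $C_1$ and $C_2$. The quota $q(C)$ is the mean singular value of $C$, so it lies in $[0, 1]$ and measures how much of the activation space the concept occupies.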

5. Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

ArXiv ID: 2605.03058

Primary Topic: Representation Learning Theory and Structure

Authors: Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

Abstract: A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding $\Theta(k \log(N/k) + k)$ interventions over $N$ candidates when $k \ll N$ neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.

Comment: Links symbolic rule extraction to sparse causal neuron sets via efficient hierarchical ablation and regime-conditional group testing.

Topic Match: Its main value is mechanistic understanding of internal representations and circuits, rather than downstream explainability alone.

Relevance: 8 Novelty: 8
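
The group-testing view in the abstract can be sketched with plain binary splitting. This is not MechaRule itself: it omits the strength predicate and the confidence-guided pruning, and it assumes an idealized monotone oracle in which ablating a group disrupts behavior exactly when the group contains at least one agonist. The names `find_agonists` and `group_has_effect` are hypothetical.

```python
def find_agonists(candidates, group_has_effect):
    """Adaptive group testing by recursive halving.

    group_has_effect(group) -> bool is an assumed oracle: True when ablating
    every neuron in `group` together disrupts the behavior of interest.
    Returns the located neurons and the number of oracle calls used.
    """
    tests = 0
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        tests += 1
        if not group_has_effect(group):
            continue  # no agonist here, so the whole group is discarded
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), tests

# Toy setup: 64 candidate neurons, 3 of which are agonists.
true_agonists = {3, 17, 42}
oracle = lambda group: any(i in true_agonists for i in group)
located, n_tests = find_agonists(range(64), oracle)
print(located, n_tests)
```

In this toy run the three agonists are recovered with fewer oracle calls than the 64 single-neuron ablations a brute-force scan would need, which is the regime the $k \ll N$ bound describes.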

6. Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

ArXiv ID: 2605.02908

Primary Topic: Representation Learning Theory and Structure

Authors: Bumjun Kim, Albert No

Abstract: Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as `<sot>`, `<pr>`, `<eot>`, and `<pad>` with corresponding embeddings $\mathbf{v}^{\mathbf{sot}}, \mathbf{v}^{\mathbf{pr}}, \mathbf{v}^{\mathbf{eot}}, \mathbf{v}^{\mathbf{pad}}$. We discover that $\mathbf{v}^{\mathbf{pr}}$ contribute minimally to generation in memorized cases. In contrast, $\mathbf{v}^{\mathbf{pad}}$ strongly affect memorization due to their structural duplication of $\mathbf{v}^{\mathbf{eot}}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of $\mathbf{v}^{\mathbf{eot}}$, causing the model to over-rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) Replacing the tokenizer's default `<pad>` from `<eot>` to the `!` token before embedding, and masking the $\mathbf{v}^{\mathbf{eot}}$; (2) Partial masking of $\mathbf{v}^{\mathbf{pad}}$. Both suppress memorization without degrading quality, and are readily deployable without prior detection.

Comment: Identifies a specific CLIP-embedding mechanism behind Stable Diffusion memorization and proposes simple inference-time masking fixes.

Topic Match: The core result is a mechanistic explanation of how particular learned text embeddings structurally drive memorization behavior in a generative model.

Relevance: 8 Novelty: 8

7. When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

ArXiv ID: 2605.02914

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Ismail Hossain, Sai Puppala, Jannatul Ferdaus, Md Jahangir Alam, Yoonpyo Lee, Syed Bahauddin Alam, Sajedul Talukder

Abstract: A guard model fine-tuned on entirely benign data can lose all safety alignment -- not through adversarial manipulation, but through standard domain specialization. We demonstrate this failure across three purpose-built safety classifiers -- LlamaGuard, WildGuard, and Granite Guardian -- deployed as protection layers in agentic AI pipelines, and show that it originates in the destruction of latent safety geometry: the structured harmful-benign representational boundary that guides classification. We extract per-layer safety subspaces via SVD on class-conditional activation differences and track how this boundary evolves under benign fine-tuning. Granite Guardian undergoes complete collapse -- refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous -- a severity exceeding prior findings on general-purpose LLMs, explained by the specialization hypothesis: concentrated safety representations are efficient but catastrophically brittle. To mitigate this, we propose Fisher-Weighted Safety Subspace Regularization (FW-SSR), a training-time penalty combining (i) curvature-aware direction weights derived from diagonal Fisher information and (ii) an adaptive $\lambda_t$ that scales with task-safety gradient conflict. FW-SSR recovers 75% refusal on Granite Guardian (CKA = 0.983) and reduces WildGuard's Attack Success Rate to 3.6% -- below the unmodified baseline -- by actively sharpening the safety subspace rather than merely anchoring it. Across all three models, structural representational geometry (CKA, Fisher score) predicts safety behavior more reliably than absolute displacement metrics, establishing geometry-based monitoring as a necessary component of guard model evaluation in agentic deployments.

Comment: Shows benign domain fine-tuning can destroy guard-model safety subspaces and introduces Fisher-weighted regularization to preserve latent safety geometry.

Topic Match: The main contribution is analysis and preservation of a specific learned representational geometry, rather than a new downstream guard application.

Relevance: 8 Novelty: 8
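
CKA is the similarity score the abstract reports for representations before and after fine-tuning. The abstract does not say which CKA variant is used, so the sketch below shows the common linear form, $\|X^\top Y\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$ on column-centered activations, as an assumption. Rows are examples, columns are features, and the helper names are hypothetical.

```python
import math

def center(M):
    """Subtract each column's mean so every feature has zero mean."""
    n, cols = len(M), len(M[0])
    means = [sum(row[j] for row in M) / n for j in range(cols)]
    return [[row[j] - means[j] for j in range(cols)] for row in M]

def cross_frob_sq(A, B):
    """Squared Frobenius norm of A^T B for row-aligned matrices A and B."""
    total = 0.0
    for a in range(len(A[0])):
        for b in range(len(B[0])):
            s = sum(A[i][a] * B[i][b] for i in range(len(A)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA between two activation matrices over the same examples."""
    Xc, Yc = center(X), center(Y)
    denom = math.sqrt(cross_frob_sq(Xc, Xc) * cross_frob_sq(Yc, Yc))
    return cross_frob_sq(Xc, Yc) / denom

X = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0], [4.0, 1.0]]
scaled = [[3.0 * x for x in row] for row in X]
print(linear_cka(X, scaled))  # 1.0 up to rounding: rescaling leaves CKA unchanged
```

A value near 1 means the two sets of activations share the same structure up to rotation and scale, while a value near 0 is what the abstract describes as collapse.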

8. Simultaneous CNN Approximation on Manifolds with Applications to Boundary Value Problems

ArXiv ID: 2605.04126

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Hanfei Zhou, Lei Shi

Abstract: This paper develops convolutional neural network (CNN) methods for simultaneous approximation and elliptic boundary value problems on compact Riemannian manifolds. We establish simultaneous Sobolev approximation results for single- and multichannel CNNs, showing that manifold functions and their derivatives can be approximated with rates governed by the intrinsic dimension and the smoothness gap, rather than by the ambient dimension, thereby mitigating the curse of dimensionality. Building on this approximation theory, we propose a physics-informed CNN (PICNN) framework specially designed for boundary value problems. The main numerical issue is a boundary-norm mismatch: standard PINNs usually impose boundary data through low-order, often L2-type, penalties, whereas elliptic stability requires Sobolev trace control. We address this by introducing a spectral boundary loss based on the boundary Laplace-Beltrami operator, which represents trace errors as weighted frequency energies and relates truncation error to boundary eigenvalue decay. This avoids smooth auxiliary constructions required by exact boundary enforcement and singular double integrals arising in Sobolev-Slobodeckij penalties, while enabling implementations based on Fast Fourier Transforms (FFTs) or precomputed spectral bases on structured boundaries. Numerical experiments demonstrate improved accuracy, convergence, and stability over standard PINNs.

Comment: Provides intrinsic-dimension Sobolev approximation theory for CNNs on manifolds and uses it to design a spectral boundary loss that improves physics-informed training stability.

Topic Match: The strongest contribution is theoretical understanding of what CNN representations can approximate on manifolds, with the training method as a downstream consequence.

Relevance: 8 Novelty: 8
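
One standard way to write a Sobolev norm on a closed boundary $\Gamma$ as weighted frequency energies uses the eigenpairs $(\lambda_k, \phi_k)$ of the boundary Laplace-Beltrami operator, $-\Delta_\Gamma \phi_k = \lambda_k \phi_k$. The order $s$, the truncation level $K$, the boundary data $g$, and the exact weights below are illustrative assumptions; the abstract does not give the authors' precise loss.

$$
\|h\|_{H^{s}(\Gamma)}^{2} \simeq \sum_{k \ge 0} (1+\lambda_k)^{s}\, \bigl|\langle h, \phi_k\rangle\bigr|^{2},
\qquad
\mathcal{L}_{\Gamma}^{(K)}(u) = \sum_{k=0}^{K} (1+\lambda_k)^{s}\, \bigl|\langle u|_{\Gamma} - g,\ \phi_k\rangle\bigr|^{2}
$$

On a periodic boundary such as a circle, the $\phi_k$ are Fourier modes, so the coefficients $\langle \cdot, \phi_k\rangle$ can be computed with an FFT, in line with the FFT-based implementations the abstract mentions.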

Memory Structures and Agent Memory Systems (7)

1. Memory as a Markov Matrix: Sample Efficient Knowledge Expansion via Token-to-Dictionary Mapping

ArXiv ID: 2605.04308

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Representation Learning Theory and Structure

Authors: Kaustubh Pethkar, Ziyang Xiong, Zuofeng Shang, Yingcong Li

Abstract: Continual incorporation of new knowledge is essential for the long-term evolution of large language models (LLMs). Existing approaches typically rely on parameter-update algorithms to mitigate catastrophic forgetting, yet they suffer from fundamental limitations: 1) forgetting is unavoidable as the amount of newly injected knowledge grows; and 2) model updates are often irreversible. As modern LLMs become increasingly expressive, it is natural to question whether large-scale weight updates are necessary for acquiring a small amount of new knowledge. In this work, we propose a principled framework that models autoregressive language generation as a Markov process over tokens, where model memory is represented by a Markov transition matrix. Under this formulation, incorporating new knowledge/tokens corresponds to extending the state space, and preserving existing transitions guarantees retention of previously learned knowledge. We then prove a sample complexity bound for incorporating new tokens via a token-to-dictionary mapping strategy. In particular, for learning the transition behavior of each new token, the required number of samples scales linearly with the number of existing tokens it is mapped to. To realize this mapping, we propose an embedding-tuning algorithm that requires minimal parameter updates and induces zero forgetting. Experimental results further demonstrate the effectiveness of our method and validate our theoretical findings.

Comment: Models autoregressive memory as a Markov transition matrix and derives sample-complexity guarantees for adding new knowledge via token-to-dictionary mappings with zero forgetting.

Topic Match: The work directly proposes a new principle for memory expansion and retention in language models, making memory systems the strongest fit.

Relevance: 9 Novelty: 8
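
The state-space extension described in the abstract can be pictured with a toy transition matrix. The sketch below assumes that a token-to-dictionary mapping makes the new token's outgoing row a convex mixture of the rows of the existing tokens it maps to; that form, and the name `add_token`, are assumptions for illustration, not the paper's algorithm.

```python
def add_token(P, mapping):
    """Extend a row-stochastic matrix P (list of rows) with one new token.

    mapping: {existing_token_index: weight}, with weights summing to 1.
    This is an assumed form of a token-to-dictionary mapping.
    """
    n = len(P)
    # Existing rows keep their probabilities; the new column gets zero mass.
    extended = [row + [0.0] for row in P]
    # The new token's row is a mixture of the rows it is mapped to.
    new_row = [sum(w * P[i][j] for i, w in mapping.items()) for j in range(n)]
    extended.append(new_row + [0.0])
    return extended

P = [[0.9, 0.1],
     [0.2, 0.8]]
P2 = add_token(P, {0: 0.5, 1: 0.5})
print(P2[2])  # the new token's row: an even mix of rows 0 and 1
```

Because the original rows are copied unchanged, the old transitions are retained by construction, and the new row has only as many free weights as there are mapped tokens. How existing contexts come to emit the new token is outside this toy.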

2. What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

ArXiv ID: 2605.03354

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Representation Learning Theory and Structure

Authors: Xutao Mao, Jinman Zhao, Gerald Penn, Cong Wang

Abstract: Agent memory failures are silent: an LLM-based agent can produce a fluent response even when it fails to extract, retain, or retrieve the information needed across sessions. The write-manage-read loop describes the external pipeline of these systems but leaves open which internal computations implement each stage. Tracing internal feature circuits across the Qwen-3 family (0.6B--14B) and two memory frameworks (mem0 and A-MEM), we report three findings. First, control is detectable before content: routing circuitry is causally active at 0.6B, while content circuitry produces no detectable signal until 4B under our tracing setup, creating a deployment regime where small models route with apparent competence but silently fail at extraction and grounding. Second, within the content group, Write and Read share a late-layer hub that operates as a context-grounding substrate already present in the base model; only memory framing recruits a functional grounding direction on this substrate, and the hub transfers across both frameworks. Third, emergence does not imply steerability: although the content circuit becomes detectable at 4B, it becomes reliably steerable only at 8B, indicating that detection and intervention have distinct scale thresholds. As a practical implication, the feature-space separation between the two circuit groups enables per-operation failure localization at 76.2% accuracy without supervision, providing a stage-level diagnostic for otherwise silent agent-memory failures.

Comment: Identifies separable internal circuits for agent-memory write/manage/read operations and their scale-dependent emergence and steerability.

Topic Match: This is directly about internal memory mechanisms in agents and how memory operations are implemented internally.

Relevance: 9 Novelty: 8

3. RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

ArXiv ID: 2605.04075

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Sihao Liu, YuFan Xiong, Zhonghua Jiang, Zhaode Wang, chengfei lv, Shengyu Zhang

Abstract: Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression methods typically rely on the "persistence of importance" hypothesis to prune tokens. However, this approach proves fragile in multimodal settings due to two key issues: 1) Visual tokens display "deferred importance," initially exhibiting low salience but becoming pivotal during later decoding, which can lead to premature eviction. 2) Discrete pruning disrupts the inherent spatial continuity of visual cues. To address these challenges, we propose RetentiveKV, an entropy-driven KV cache optimization method that reformulates KV eviction from "discrete context truncation" to "continuous memory evolution" based on State Space Models. Our method leverages information entropy to quantify the information potential of low-attention tokens and integrates tokens scheduled for eviction into a continuous state space through entropy-guided state transitions, enabling their dynamic reactivation when semantic relevance arises during subsequent decoding. Extensive experiments on multimodal benchmarks demonstrate that RetentiveKV achieves 5.0 times KV cache compression and 1.5 times decoding acceleration.

Comment: Reframes multimodal KV eviction as continuous state-space memory evolution with entropy-guided reactivation of evicted tokens.

Topic Match: Although it improves KV-cache efficiency, the central idea is a new memory mechanism: converting discrete eviction into latent state-space retention and later recall.

Relevance: 9 Novelty: 8

4. Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

ArXiv ID: 2605.05189

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Representation Learning Theory and Structure

Authors: Nicholas Barnfield, Juno Kim, Eshaan Nichani, Jason D. Lee, Yue M. Lu

Abstract: How many key-value associations can a $d\times d$ linear memory store? We show that the answer depends not only on the $d^2$ degrees of freedom in the memory matrix, but also on the retrieval criterion. In an isotropic Gaussian model for the stored pairs, we show that top-1 retrieval, where every signal must beat its largest distractor, requires the logarithmic model-size scale $d^2\asymp n\log n$. We prove that the correlation matrix memory construction, which stores associations by superposing key-target outer products, achieves this scale through a sharp phase transition, and that the same scaling is necessary for any linear memory. Thus the logarithm is the intrinsic extreme-value price of winner-take-all decoding. We next consider listwise retrieval, where the correct target need not be the unique top-scoring item but should remain among the strongest candidates. To formalize this regime, we propose the Tail-Average Margin (TAM), a convex upper-tail criterion that certifies inclusion of the correct target in a controlled candidate list. Under this listwise retrieval criterion, the capacity follows the quadratic scale $d^2\asymp n$. At load $n/d^2\to\alpha$, we develop an exact asymptotic theory for the TAM empirical-risk minimizer through a two-parameter scalar variational principle. The theory has a rich phenomenology: in the ridgeless limit it yields a closed-form critical load separating satisfiable and unsatisfiable phases, and it predicts the limiting laws of true scores, competitor scores, margins, and percentile profiles. Finally, a small-tail extrapolation further leads to the conjectural sharp top-1 threshold $d^2\sim 2n\log n$.

Comment: Provides sharp capacity thresholds for linear associative memory and shows that the retrieval criterion changes the capacity scaling from $d^2\asymp n\log n$ (top-1) to $d^2\asymp n$ (listwise).

Topic Match: The core result is a foundational theory of associative memory capacity and retrieval regimes, directly matching memory mechanisms rather than downstream use.

Relevance: 9 Novelty: 8
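
The correlation matrix memory in the abstract stores pairs by superposing key-target outer products and decodes top-1 by picking the stored target with the highest score. The sketch below is a small pure-Python instance at a light load, with Gaussian keys and targets as in the abstract's model; the sizes `d` and `n`, the seed, and the function names are choices made here for illustration.

```python
import random

def outer_sum_memory(keys, targets):
    """Correlation matrix memory: W = sum_i target_i key_i^T (d x d)."""
    d = len(keys[0])
    W = [[0.0] * d for _ in range(d)]
    for k, v in zip(keys, targets):
        for r in range(d):
            for c in range(d):
                W[r][c] += v[r] * k[c]
    return W

def top1_retrieve(W, key, targets):
    """Index of the stored target with the highest score <target, W key>."""
    out = [sum(w * x for w, x in zip(row, key)) for row in W]
    scores = [sum(t * o for t, o in zip(target, out)) for target in targets]
    return max(range(len(targets)), key=scores.__getitem__)

rng = random.Random(0)
d, n = 64, 5  # light load: n is far below d^2 / log n
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
targets = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
W = outer_sum_memory(keys, targets)
accuracy = sum(top1_retrieve(W, keys[i], targets) == i for i in range(n)) / n
print(accuracy)
```

Raising `n` toward the $d^2 \asymp n\log n$ scale is where the abstract predicts top-1 retrieval starts to fail, since every correct score must beat its largest distractor.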

5. MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

ArXiv ID: 2605.03482

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Ishrith Gowda (University of California, Berkeley)

Abstract: Persistent external memory enables LLM agents to maintain context across sessions, yet its security properties remain formally uncharacterized. We formalize memory poisoning attacks on retrieval-augmented agents as a Stackelberg game with a unified evaluation framework spanning three attack classes with escalating access assumptions. Correcting an evaluation protocol inconsistency in the triggered-query specification of Chen et al. (2024), we show faithful evaluation increases measured attack success by $4\times$ (ASR-R: $0.25 \to 1.00$). Our primary contribution is MEMSAD (Semantic Anomaly Detection), a calibration-based defense grounded in a gradient coupling theorem: under encoder regularity, the anomaly score gradient and the retrieval objective gradient are provably identical, so any continuous perturbation that reduces detection risk necessarily degrades retrieval rank. This coupling yields a certified detection radius guaranteeing correct classification regardless of adversary strategy. We prove minimax optimality via Le Cam's method, showing any threshold detector requires $\Omega(1/\rho^2)$ calibration samples and MEMSAD achieves this up to $\log(1/\delta)$ factors. We further derive online regret bounds for rolling calibration at rate $O(\sigma^{2/3}\Delta^{1/3})$, and formally characterize a discrete synonym-invariance loophole that marks the boundary of what continuous-space defenses can guarantee. Experiments on a $3 \times 5$ attack-defense matrix with bootstrap confidence intervals, Bonferroni-corrected hypothesis tests, and Clopper-Pearson validation ($n=1{,}000$) confirm: composite defenses achieve TPR $= 1.00$, FPR $= 0.00$ across all attacks, while synonym substitution evades detection at $\Delta$ ASR-R $\approx 0$, exposing a gap existing embedding-based defenses cannot close.

Comment: Gives a gradient-coupling theorem and certified detection radius for memory poisoning in retrieval-augmented agents, exposing the continuous-space defense boundary.

Topic Match: Best fit because it directly studies persistent external memory security with new theory about retrieval-memory dynamics, calibration, and poisoning detection.

Relevance: 9 Novelty: 8

6. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

ArXiv ID: 2605.04651

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Guangsheng Bao, Hongbo Zhang, Han Cui, Yanbin Zhao, Yue Zhang

Abstract: Adapting pretrained models typically involves a trade-off between the high training costs of backpropagation and the heavy inference overhead of memory-based or in-context learning. We propose FAAST, a forward-only associative adaptation method that analytically compiles labeled examples into fast weights in a single pass. By eliminating memory or context dependence, FAAST achieves constant-time inference and decouples task adaptation from pretrained representation. Across image classification and language modeling benchmarks, FAAST matches or exceeds backprop-based adaptation while reducing adaptation time by over 90%, and is competitive with memory/context-based adaptation while reducing memory usage by up to 95%. These results demonstrate FAAST as a highly efficient, scalable solution for supervised task adaptation, particularly for resource-constrained models. We release the code and models at https://github.com/baoguangsheng/faast.

Comment: Compiles labeled examples into closed-form fast weights for forward-only test-time supervised adaptation with constant-time inference.

Topic Match: The core idea is an associative fast-weight memory mechanism for adaptation, more than just an efficiency tweak.

Relevance: 8 Novelty: 8

7. Skill Neologisms: Towards Skill-based Continual Learning

ArXiv ID: 2605.04970

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics

Authors: Antonin Berthon, Nicolas Astorga, Mihaela van der Schaar

Abstract: Modern LLMs show mastery over an ever-growing range of skills, as well as the ability to compose them flexibly. However, extending model capabilities to new skills in a scalable manner is an open problem: fine-tuning and parameter-efficient variants risk catastrophic forgetting, while context-based approaches have limited expressiveness and are constrained by the model's effective context. We explore skill neologisms -- i.e., soft tokens integrated in the model's vocabulary and optimized to improve capabilities over a specific skill -- as a way to selectively extend model capabilities to new skills without weight updates. We first observe that off-the-shelf pre-trained LLMs already demonstrate tokens associated with procedural knowledge. We then show that skill neologisms can be learned to improve model capabilities on specific skills while being composable with out-of-distribution skills, and that independently trained skill neologisms can be composed zero-shot. These results suggest that skill neologisms may provide a scalable path towards skill-based continual learning.

Comment: Explores skill neologisms—learned soft vocabulary tokens—as a weight-free mechanism for continual skill extension and zero-shot composition.

Topic Match: Best fit because the proposed soft-token skill mechanism functions as an externalized persistent capability memory for continual learning without updating model weights.

Relevance: 8 Novelty: 8

World Models, Exploration, and Open-Ended Reinforcement Learning (8)

1. ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

ArXiv ID: 2605.04709

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Yurui Du, Pinhao Song, Yutong Hu, Renaud Detry

Abstract: A central challenge of visual control with model-based reinforcement learning (RL) is reliable long-horizon planning: long rollouts with learned latent dynamics exhibit branching futures and multi-modal action-value distributions. In addition, compounding model errors amplified by visual occlusions make deep imagination brittle. We present ELVIS, a latent model predictive controller (MPC) designed to make long-horizon planning practical. ELVIS plans in a Dreamer-style recurrent state space model (RSSM) and replaces standard unimodal model predictive path integral (MPPI) with a Gaussian-mixture MPPI that maintains multiple coherent hypotheses over long horizons, avoiding mode averaging under branching rollouts. In parallel, ELVIS stabilizes deep imagination with a shared uncertainty-aware lambda-return: an ensemble of latent critics defines an upper-confidence-bound (UCB) score that gates a time-varying lambda, adaptively trading off bootstrapping versus look-ahead to limit compounding error during planning. The same return is used both to train an actor-critic prior from imagined rollouts and to score candidate trajectories inside GMM-MPPI, aligning RL objectives with the planner's long-horizon optimization. On fourteen DeepMind Control Suite visual tasks, ELVIS establishes state-of-the-art performance compared with TD-MPC2 and DreamerV3. Finally, ELVIS transfers zero-shot to a real-world sand-spraying task with severe occlusions, improving surface-quality metrics and demonstrating robustness beyond simulation.

Comment: Introduces a long-horizon visual model-based RL controller that combines multimodal trajectory planning with uncertainty-calibrated latent imagination.

Topic Match: This is directly about world-model-based planning and reliable long-horizon imagination in model-based RL, which squarely matches the world-models topic.

Relevance: 9 Novelty: 8
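
The time-varying lambda-return mentioned in the abstract follows the usual backward recursion $G_t = r_t + \gamma\,[(1-\lambda_t)\,V(s_{t+1}) + \lambda_t\,G_{t+1}]$. The sketch below implements only that recursion; how ELVIS turns the ensemble UCB score into each $\lambda_t$ is not given in the abstract, so the `lambdas` argument is simply taken as an input here, and the function name is hypothetical.

```python
def lambda_returns(rewards, values, lambdas, gamma=0.99):
    """Backward recursion for lambda-returns with a per-step lambda.

    rewards[t] and lambdas[t] cover t = 0..T-1; values[t] covers t = 0..T,
    where values[T] bootstraps the tail of the rollout.
    """
    T = len(rewards)
    returns = [0.0] * T
    next_return = values[T]
    for t in reversed(range(T)):
        # lambda near 0 trusts the critic; lambda near 1 trusts the rollout.
        bootstrap = (1 - lambdas[t]) * values[t + 1] + lambdas[t] * next_return
        returns[t] = rewards[t] + gamma * bootstrap
        next_return = returns[t]
    return returns

rewards = [1.0, 1.0]
values = [0.0, 0.0, 0.0]
print(lambda_returns(rewards, values, [1.0, 1.0], gamma=1.0))  # [2.0, 1.0]
print(lambda_returns(rewards, values, [0.0, 0.0], gamma=1.0))  # [1.0, 1.0]
```

With every lambda at 1 the return looks ahead through the whole imagined rollout; with every lambda at 0 it collapses to a one-step bootstrap, which limits exposure to compounding model error.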

2. Learning to Theorize the World from Observation

ArXiv ID: 2605.03413

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Representation Learning Theory and Structure

Authors: Doojin Baek, Gyubin Lee, Junyeob Baek, Hosung Lee, Sungjin Ahn

Abstract: What does it mean to understand the world? Contemporary world models often operationalize understanding as accurate future prediction in latent or observation space. Developmental cognitive science, however, suggests a different view: human understanding emerges through the construction of internal theories of how the world works, even before mature language is acquired. Inspired by this theory-building view of cognition, we introduce Learning-to-Theorize, a learning paradigm for inferring explicit explanatory theories of the world from raw, non-textual observations. We instantiate this paradigm with the Neural Theorizer (NEO), a probabilistic neural model that induces latent programs as a learned Language of Thought and executes them through a shared transition model. In NEO, a theory is represented as an executable, compositional program whose learned primitives can be systematically recombined to explain novel phenomena. Experiments show that this formulation enables explanation-driven generalization, allowing observations to be understood in terms of the programs that generate them.

Comment: Learns executable latent programs as explicit theories of world dynamics, emphasizing explanation-driven world modeling over pure prediction.

Topic Match: The paper fits best as foundational world-model research centered on explicit internal theories of environment dynamics.

Relevance: 9 Novelty: 8

3. Discovering Reinforcement Learning Interfaces with Large Language Models

ArXiv ID: 2605.03408

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Akshat Singh Jaswal, Ashish Baghel, Paras Chopra

Abstract: Reinforcement learning systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (Code available at https://github.com/Lossfunk/LIMEN), an LLM-guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory-level success metric, while optimizing either component alone fails on at least one domain. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering and that observation and reward components often benefit from co-design, as single-component optimization fails catastrophically on at least one domain in our evaluation suite.

Comment: Automates discovery of RL observation and reward interfaces from raw simulator state, treating interface design itself as a learnable object.

Topic Match: It targets a foundational RL problem—constructing the agent-environment interface—rather than LLM post-training.

Relevance: 8 Novelty: 8

4. Structural Equivalence and Learning Dynamics in Delayed MARL

ArXiv ID: 2605.04345

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Jules Sintes, Ana Bušić, Jiamin Zhu

Abstract: We formally establish the equivalence between Observation Delay (OD) and Action Delay (AD) in cooperative partially observable multi-agent systems using observation-action histories. We show that both systems generate identical admissible joint-policy sets, and their induced state-action-observation trajectories are identical in distribution, leading to identical optimal solutions in Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). This formally generalizes existing infinite-horizon single-agent results to any-horizon partially observable cooperative multi-agent problems with decentralized policy execution, and allows any mixed-delay configuration to be reduced to a pure OD system. We further prove that in Transition-Independent MDPs (TI-MDPs), the observation-action history reduces to a tractable minimal local augmented state. However, we show through numerical experiments that although the optimal solution spaces are structurally isomorphic, the practical learning dynamics are fundamentally different. First, using the minimal local augmented state, the equivalence no longer holds when transitions are not independent. Second, operational constraints and causal credit-assignment errors in Temporal Difference (TD) algorithms induce different learning behaviors across regimes. Finally, leveraging this structural equivalence to bypass these learning challenges, we demonstrate successful multi-agent zero-shot policy transfer from OD to AD, paving the way for unified, efficient solution methods in complex delayed systems.

Comment: Formal equivalence between observation delay and action delay in Dec-POMDPs, plus analysis of why TD learning still behaves differently despite structural isomorphism.

Topic Match: This is foundational RL theory on delayed partially observable multi-agent systems, with direct implications for learning dynamics and transfer in sequential decision making.

Relevance: 8 Novelty: 8

5. Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

ArXiv ID: 2605.05102

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Harin Lee, Min-hwan Oh

Abstract: We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $\delta \in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm with exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/\delta))$, confirming the conjecture of Lattimore & Szepesvári (2020, Section 17.1) for the first time.

Comment: Provides a unified distributional-regret framework for bandits and episodic RL, including optimal trade-offs and a first proof of a conjectured MAB bound.

Topic Match: This is core reinforcement-learning theory on exploration-performance tradeoffs, with a unified treatment of regret distributions across bandits and RL.

Relevance: 8 Novelty: 8
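
The exploration bonus quoted in the abstract, $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, is easy to evaluate directly. The sketch below is only an illustration of that formula: the function name is hypothetical, the parameters are passed as plain numbers for a single round $k$, and raising an error for unvisited pairs ($N = 0$) is an assumption, since the abstract does not say how that case is handled.

```python
import math


def exploration_bonus(visit_count, c1, c2):
    """Evaluate min(c1 / N, c2 / sqrt(N)) for a visit count N >= 1.

    c1 and c2 stand in for the user-specified parameters (c_{1,k}, c_{2,k})
    at one round k. Handling of N = 0 is an assumption of this sketch.
    """
    if visit_count < 1:
        raise ValueError("visit_count must be at least 1 in this sketch")
    return min(c1 / visit_count, c2 / math.sqrt(visit_count))
```

For positive parameters the two terms cross at $N = (c_1/c_2)^2$: below that count the $c_2/\sqrt{N}$ term is the smaller one, and above it the $c_1/N$ term takes over. With `c1=4, c2=1`, for example, the switch happens at `N = 16`.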

6. Counterfactual identifiability beyond global monotonicity: non-monotone triangular structural causal models

ArXiv ID: 2605.04413

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Pengcheng Tan, Jiang Chen, Dehui Du

Abstract: Structural causal models provide a unified semantics for interventions and counterfactuals, but most identifiability results rely on restrictive assumptions like global monotonicity, which are often violated in embodied interaction, where the same exogenous perturbation can induce opposite responses under different contact contexts. We ask what structure still suffices once global monotonicity is dropped. We introduce non-monotone triangular structural causal models (NM-TM-SCM), which retain triangular recursion but replace global monotonicity with mechanism-wise invertibility and context-independent inverse transport. We prove that these conditions are equivalent to exogenous isomorphism and imply complete counterfactual identifiability, and we give a counterexample showing that local invertibility alone is insufficient. We instantiate the theory in CausalInverter, with triangular invertible layers, orientation gates, and transport-stability regularization. On synthetic non-monotonic mechanisms, the structural bias yields systematic counterfactual gains as non-monotonicity increases. On MuJoCo Door, our model achieves perfect event-level counterfactual recovery, lowers continuous angle error relative to a Transformer baseline, and delivers substantially more stable recovery than Transformer and conditional-flow predictors. On MuJoCo Push, where non-monotonicity is weaker, the same low-data predictors remain competitive or better, consistent with a bias-variance boundary. These results identify a broader identifiable regime between globally monotone triangular models and unconstrained black-box world models.

Comment: Proves complete counterfactual identifiability for a broader non-monotone triangular SCM class via mechanism-wise invertibility and context-independent inverse transport.

Topic Match: Best fit because it develops foundational causal structure for counterfactual modeling in embodied environments, which is closely tied to world-model learning rather than downstream application performance.

Relevance: 8 Novelty: 8

7. Bilinear Mamba-Koopman Neural MPC for Varying Dynamics

ArXiv ID: 2605.04793

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Matan Pagi, Zohar Sorek

Abstract: Koopman-based neural MPC models generate time-varying dynamics from historical data, but preserve convexity by enforcing that the system operator is independent of the current control input. This conditional independence constraint limits adaptation to changing dynamics within a single MPC horizon, particularly under time-varying conditions and under stale-plan execution. We propose Bilinear Mamba-Koopman Neural MPC, a minimal extension that introduces control-dependent coupling in the latent dynamics, allowing the effective operator to adapt to the current input. The resulting model is a strict generalization of the standard linear, conditional-independence formulation, adds less than 1% parameters through a low-rank structure, and admits exact model Jacobians that enable efficient Sequential Convex Programming (SCP) with monotone-descent and KKT convergence results under standard trust-region assumptions. Across CartPole and RSCP benchmarks in time-invariant and time-varying regimes, the proposed model matches or improves forecasting accuracy on every cell when training noise is averaged out, with strict gains where control-state coupling is structurally present. Its main closed-loop gains appear in the RSCP TV task, where iterative SCP improves adaptation within the horizon and substantially stabilizes training; in CartPole TV, the gains are modest but consistent. In delayed re-planning experiments on the time-varying variants, the bilinear model degrades more gracefully under stale-plan execution, maintaining a consistent advantage on CartPole TV and a substantially larger robustness margin on RSCP TV. These results show that control-dependent latent dynamics provide a simple and effective mechanism for robust MPC under varying conditions.

Comment: Adds low-rank control-dependent bilinear latent dynamics to Mamba-Koopman MPC, enabling within-horizon adaptation under varying dynamics.

Topic Match: Best fit because the contribution is a foundational action-conditioned world-modeling mechanism for adaptive control, not merely improved control benchmarks.

Relevance: 8 Novelty: 8

8. CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

ArXiv ID: 2605.04470

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Keyu Chen, Nanfei Ye, Yida Wang, Wenchao Sun, Danqi Zhao, Hao Cheng, Sifa Zheng

Abstract: Open-loop imitation learning has advanced modern autonomous driving policy architectures, but closed-loop deployment remains vulnerable to policy-induced distribution shift. Existing post-training paradigms exhibit fundamental trade-offs: closed-loop RL fine-tuning provides grounded feedback from executed actions but is constrained by the sparsity of informative events, whereas counterfactual fine-tuning provides dense supervision over candidate futures but inherits bias from imperfect future estimates. We introduce Counterfactual-to-Interactive Reinforcement Fine-Tuning (CRAFT), an on-policy framework that formulates closed-loop post-training as proxy-residual optimization. CRAFT uses group-normalized counterfactual advantages as a dense proxy for real closed-loop advantages and aligns this proxy with the closed-loop world through grounded residual correction from interaction-critical events. To stabilize adaptation, CRAFT regularizes the online policy toward an EMA teacher via asymmetric KL self-distillation. Theoretically, CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution, reducing residual variance with an aligned proxy while mitigating proxy bias through grounded residual approximation. Empirically, CRAFT achieves the strongest closed-loop gains on Bench2Drive across hierarchical planning, vision-language-action, and vocabulary-scoring architectures. Ablations, scaling behavior, stability analyses, and transfer results further validate the complementary roles of dense counterfactual proxy and grounded residual correction. Project page: https://currychen77.github.io/CRAFT.

Comment: Decomposes closed-loop policy gradients into dense counterfactual proxy and grounded residual correction.

Topic Match: This is fundamentally an RL method for closed-loop policy improvement under distribution shift, not an LLM alignment paper.

Relevance: 8 Novelty: 8
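
The CRAFT abstract refers to group-normalized counterfactual advantages without spelling out the formula. One common reading of "group-normalized" is to standardize each candidate's score against the mean and standard deviation of its group; the sketch below shows that reading only. The function name, the `eps` stabilizer, and the use of the population standard deviation are all assumptions and may differ from the paper.

```python
import statistics


def group_normalized_advantages(scores, eps=1e-8):
    """Standardize scores within one group of candidate futures.

    Assumed form: (score - group mean) / (group std + eps). This is an
    illustrative guess at "group-normalized", not the paper's definition.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population std over the group
    return [(score - mean) / (std + eps) for score in scores]
```

Under this reading, candidates that score above their group average receive positive advantages and the rest receive negative ones, so every group yields a dense learning signal even when real closed-loop events are sparse.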

Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Relevant Topics

Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.

Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.

Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.

Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.

Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.

Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.

World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.

Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.

7-8: substantially related, but partly peripheral or focused on a narrower aspect.

5-6: touches the target topics, but the main contribution is elsewhere.

3-4: largely outside the target topics, often application-focused or domain-specific.

1-2: unrelated.

Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.

Novelty Scoring

9-10: new paradigm, theory, or major methodological breakthrough.

7-8: substantial methodological advance or strong new insight.

5-6: meaningful but incremental extension or refinement.

3-4: minor, narrow, or mostly engineering or domain-specific improvement.

1-2: little originality; mainly standard application of existing methods.

Topic Registry

Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.

Papers

[PAPER LIST HERE]

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}

Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
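
The output rules above can be checked mechanically. The following is a minimal sketch of a validator for one JSONL line, written against the field names, score ranges, topic IDs, and hotspot tags stated in the prompt. The function and constant names are hypothetical, and the digest's actual pipeline may validate responses differently or not at all.

```python
import json

# Allowed values copied from the Topic Registry and Rules sections of the prompt.
ALLOWED_TOPIC_IDS = {
    "architecture_training",
    "efficiency_scaling",
    "representation_structure",
    "memory_systems",
    "world_models_open_ended_rl",
}
ALLOWED_HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT",
    "HOTSPOT_PAPER_TAGS", "HOTSPOT_PAPER_COMMENT",
}


def _is_list_of(value, allowed):
    """True if value is a list whose items are all strings drawn from allowed."""
    return isinstance(value, list) and all(
        isinstance(item, str) and item in allowed for item in value
    )


def validate_record(line):
    """Return a list of rule violations for one JSONL line (empty list = valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    for key in ("RELEVANCE", "NOVELTY"):
        score = record.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
            problems.append(f"{key} must be an integer from 1 to 10")

    primary = record.get("PRIMARY_TOPIC_ID")
    if not isinstance(primary, str) or primary not in ALLOWED_TOPIC_IDS:
        problems.append("PRIMARY_TOPIC_ID is not in the topic registry")

    matched = record.get("MATCHED_TOPIC_IDS")
    if not _is_list_of(matched, ALLOWED_TOPIC_IDS):
        problems.append("MATCHED_TOPIC_IDS must be a list of registry IDs")
    elif len(matched) > 1 and primary not in matched:
        problems.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")

    tags = record.get("HOTSPOT_PAPER_TAGS")
    if not _is_list_of(tags, ALLOWED_HOTSPOT_TAGS):
        problems.append("HOTSPOT_PAPER_TAGS must use only daily_hot or new_frontier")
    elif not tags and record.get("HOTSPOT_PAPER_COMMENT") != "":
        problems.append("HOTSPOT_PAPER_COMMENT must be empty when there are no tags")

    return problems
```

Calling `validate_record` on each line of a response gives an empty list for conforming records. Note that the template object in the instructions uses `0` as a score placeholder, which this sketch would flag because the rules require integers from 1 to 10.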