Personalized Daily ArXiv Papers 2026-04-14

Model: gpt-5.4
Tokens: 354413 prompt, 37243 completion, 391656 total
Cost: $0.89 prompt, $0.56 completion, $1.44 total
Papers: 1248 total on arXiv, 789 scanned, 51 relevant

Topic Coverage:

Topic: Papers
Architecture and Training Dynamics: 18
Efficiency, Compression, and Large-Scale Training: 9
Representation Learning Theory and Structure: 12
Memory Structures and Agent Memory Systems: 8
World Models, Exploration, and Open-Ended Reinforcement Learning: 4

Table of contents by topic:

Architecture and Training Dynamics (18)

  1. Universality of first-order methods on random and deterministic matrices Authors: Nicola Gorini, Chris Jones, Dmitriy Kunisky, Lucas Pesenti

  2. Introspective Diffusion Language Models Authors: Yifan Yu, Yuqing Jian, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song, Tri Dao, Ben Athiwaratkun, James Zou, Fan Lai, Chenfeng Xu

  3. Discrete Flow Maps Authors: Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, Michael S. Albergo

  4. INCRT: An Incremental Transformer That Determines Its Own Architecture Authors: Giansalvo Cirrincione

  5. A Mechanistic Analysis of Looped Reasoning Language Models Authors: Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, Xiaowen Dong

  6. Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures Authors: Maxim Bolshim (ITMO University, Saint Petersburg, Russia), Alexander Kugaevskikh (ITMO University, Saint Petersburg, Russia)

  7. Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria Authors: Nikodem Tomczak

  8. Query Lower Bounds for Diffusion Sampling Authors: Zhiyang Xun, Eric Price

  9. The Diffusion-Attention Connection Authors: Julio Candanedo

  10. SHANG++: Robust Stochastic Acceleration under Multiplicative Noise Authors: Yaxin Yu, Long Chen, Minfu Feng

  11. THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture Authors: Augustus Haoyang Li

  12. Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD Authors: Konstantinos Emmanouilidis, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Rene Vidal

  13. Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models Authors: Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo

  14. The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks Authors: Mani Rash Ahmadi

  15. Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression Authors: Yijin Ni, Xiaoming Huo

  16. Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis Authors: Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

  17. LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling Authors: Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu

  18. Continuous Adversarial Flow Models Authors: Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan

Efficiency, Compression, and Large-Scale Training (9)

  1. ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval Authors: David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen

  2. SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding Authors: Jehyeon Bang, Eunyeong Cho, Ranggi Hwang, Jinha Chung, Minsoo Rhu

  3. ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation Authors: Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak

  4. S$^3$: Structured Sparsity Specification Authors: Ayoub Ghriss

  5. LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention Authors: Dongjie Xu, Hao Wu, Weijie Shi, Yue Cui, Yuanjun Liu, Jiawei Li, Haolun Ma, An Liu, Jia Zhu, Jiajie Xu

  6. VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination Authors: Muyan Hu, Ahan Gupta, Jiachen Yuan, Vima Gupta, Taeksang Kim, Xin Xu, Janardhan Kulkarni, Ofer Dekel, Vikram Adve, Charith Mendis

  7. Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees Authors: Zhuolun Dong, Junyu Cao

  8. ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios Authors: Xinyi Hu, Yuhao Shen, Baolin Zhang, Hengxin Zhang, Jun Dai, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan

  9. Mitigating Privacy Risk via Forget Set-Free Unlearning Authors: Aviraj Newatia, Michael Cooper, Viet Nguyen, Rahul G. Krishnan

Representation Learning Theory and Structure (12)

  1. A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics Authors: Louie Hong Yao, Yuhao Li, Shengchao Liu

  2. Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs Authors: Hongkang Li, Hancheng Min, Rene Vidal

  3. Mild Over-Parameterization Benefits Asymmetric Tensor PCA Authors: Shihong Ding, Weicheng Lin, Cong Fang

  4. A Deep Generative Approach to Stratified Learning Authors: Randy Martinez, Rong Tang, Lizhen Lin

  5. Steered LLM Activations are Non-Surjective Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu

  6. Why Do Large Language Models Generate Harmful Content? Authors: Rajesh Ganguli, Raha Moraffah

  7. Tracing the Thought of a Grandmaster-level Chess-Playing Transformer Authors: Rui Lin, Zhenyu Jin, Guancheng Zhou, Xuyang Ge, Wentao Shu, Jiaxing Wu, Junxuan Wang, Zhengfu He, Junping Zhang, Xipeng Qiu

  8. Closed-Form Concept Erasure via Double Projections Authors: Chi Zhang, Jingpu Cheng, Zhixian Wang, Ping Liu

  9. Pando: Do Interpretability Methods Work When Models Won't Explain Themselves? Authors: Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan

  10. Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment Authors: Yang Cui, Jingyuan Sun, Yizheng Sun, Yifan Wang, Yunhao Zhang, Jixing Li, Shaonan Wang, Hongpeng Zhou, John Hale, Chengqing Zong, Goran Nenadic

  11. Exact Finite-Sample Variance Decomposition of Subagging: A Spectral Filtering Perspective Authors: Ye Su, Mingrui Ye, Yining Wang, Jipeng Guo, Yong Liu

  12. Learning What's Real: Disentangling Signal and Measurement Artifacts in Multi-Sensor Data, with Applications to Astrophysics Authors: Pablo Mercader-Perez, Carolina Cuesta-Lazaro, Daniel Muthukrishna, Jeroen Audenaert, V. Ashley Villar, David W. Hogg, Marc Huertas-Company, William T. Freeman

Memory Structures and Agent Memory Systems (8)

  1. Human-like Working Memory Interference in Large Language Models Authors: Hua-Dong Xiong (School of Psychological and Brain Sciences, Georgia Tech), Li Ji-An (Department of Psychology, New York University), Jiaqi Huang (Department of Cognitive Science, Indiana University Bloomington, Honda Research Institute), Robert C. Wilson (School of Psychological and Brain Sciences, Georgia Tech, Center of Excellence for Computational Cognition, Georgia Tech), Kwonjoon Lee (Honda Research Institute), Xue-Xin Wei (Departments of Neuroscience and Psychology, The University of Texas at Austin)

  2. Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory Authors: Weixian Waylon Li, Jiaxin Zhang, Xianan Jim Yang, Tiejun Ma, Yiwen Guo

  3. Learning to Forget -- Hierarchical Episodic Memory for Lifelong Robot Deployment Authors: Leonard Bärmann, Joana Plewnia, Alex Waibel, Tamim Asfour

  4. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents Authors: Mofasshara Rafique, Laurent Bindschaedler

  5. Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning Authors: Xiaozhe Li, Tianyi Lyu, Yizhao Yang, Liang Shan, Siyi Yang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu, Yang Li

  6. Endogenous Information in Routing Games: Memory-Constrained Equilibria, Recall Braess Paradoxes, and Memory Design Authors: Saad Alqithami

  7. Beyond LLMs, Sparse Distributed Memory, and Neuromorphics Authors: Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato

  8. Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure Authors: Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano

World Models, Exploration, and Open-Ended Reinforcement Learning (4)

  1. Grounded World Model for Semantically Generalizable Planning Authors: Quanyi Li, Lan Feng, Haonan Zhang, Wuyang Li, Letian Wang, Alexandre Alahi, Harold Soh

  2. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps Authors: Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, Jiayu Chen

  3. Evolving Many Worlds: Towards Open-Ended Discovery in Petri Dish NCA via Population-Based Training Authors: Uljad Berdica, Jakob Foerster, Frank Hutter, Arber Zela

  4. Agentic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations Authors: Abhijeet Vishwasrao, Francisco Giral, Mahmoud Golestanian, Federica Tonti, Andrea Arroyo Ramo, Adrian Lozano-Duran, Steven L. Brunton, Sergio Hoyas, Soledad Le Clainche, Hector Gomez, Ricardo Vinuesa


Architecture and Training Dynamics (18)

1. Universality of first-order methods on random and deterministic matrices

ArXiv ID: 2604.11729

Primary Topic: Architecture and Training Dynamics

Authors: Nicola Gorini, Chris Jones, Dmitriy Kunisky, Lucas Pesenti

Abstract: General first-order methods (GFOM) are a flexible class of iterative algorithms which update a state vector by matrix-vector multiplications and entrywise nonlinearities. A long line of work has sought to understand the large-n dynamics of GFOM, mostly focusing on "very random" input matrices and the approximate message passing (AMP) special case of GFOM whose state is asymptotically Gaussian. Yet, it has long remained unknown how to construct iterative algorithms that retain this Gaussianity for more structured inputs, or why existing AMP algorithms can be as effective for some deterministic matrices as they are for random matrices. We analyze diagrammatic expansions of GFOM via the limiting traffic distribution of the input matrix, the collection of all limiting values of permutation-invariant polynomials in the matrix entries, to obtain the following results: 1. We calculate the traffic distribution for the first non-trivial deterministic matrices, including (minor variants of) the Walsh-Hadamard and discrete sine and cosine transform matrices. This determines the limiting dynamics of GFOM on these inputs, resolving parts of longstanding conjectures of Marinari, Parisi, and Ritort (1994). 2. We design a new AMP iteration which unifies several previous AMP variants and generalizes to new input types, whose limiting dynamics are Gaussian conditional on some latent random variables. The asymptotic dynamics hold for a large and natural class of traffic distributions (encompassing both random and deterministic input matrices) and the algorithm's analysis gives a simple combinatorial interpretation of the Onsager correction, answering questions posed recently by Wang, Zhong, and Fan (2022).

Comment: Introduces a generalized AMP iteration with Gaussian asymptotics beyond random matrices, including structured deterministic transforms via traffic distributions.

Topic Match: Best fit is architecture/training dynamics because the core contribution is a new iterative computational mechanism and its asymptotic dynamics analysis.

Relevance: 9 Novelty: 9
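
As a rough illustration of the iteration class the abstract describes, the sketch below alternates a matrix-vector multiplication with an entrywise nonlinearity on a Walsh-Hadamard matrix built by the Sylvester construction. The function names, the tanh nonlinearity, the normalization, and the step count are assumptions chosen for illustration. This plain iteration has no Onsager correction and is not the AMP iteration designed in the paper.

```python
import numpy as np

def walsh_hadamard(k: int) -> np.ndarray:
    """Sylvester construction of the 2^k x 2^k Walsh-Hadamard matrix, scaled to be orthogonal."""
    H = np.array([[1.0]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])

def gfom(A, x0, nonlinearity=np.tanh, steps=5):
    """Toy general first-order method: x_{t+1} = f(A x_t), with f applied entrywise."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = nonlinearity(A @ x)
        trajectory.append(x)
    return trajectory

rng = np.random.default_rng(0)
A = walsh_hadamard(6)                  # 64 x 64 deterministic input matrix
x0 = rng.standard_normal(A.shape[0])   # random initial state (assumed)
traj = gfom(A, x0)                     # list of states x_0, ..., x_5
```

Studying the large-n behavior would mean tracking empirical statistics of the entries of each state in `traj` as the matrix size grows.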


2. Introspective Diffusion Language Models

ArXiv ID: 2604.11035

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Yifan Yu, Yuqing Jian, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song, Tri Dao, Ben Athiwaratkun, James Zou, Fan Lai, Chenfeng Xu

Abstract: Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce the Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.

Comment: Introduces introspective consistency as a structural principle and uses it to redesign diffusion language model training and decoding.

Topic Match: The core contribution is a new training/decoding paradigm for diffusion language models based on an architectural-mechanistic diagnosis of inconsistency.

Relevance: 9 Novelty: 9


3. Discrete Flow Maps

ArXiv ID: 2604.09784

Primary Topic: Architecture and Training Dynamics

Authors: Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, Michael S. Albergo

Abstract: The sequential nature of autoregressive next-token prediction imposes a fundamental speed limit on large language models. While continuous flow models offer a path to parallel generation, they traditionally demand expensive iterative integration. Flow Maps bypass this bottleneck by compressing generative trajectories into single-step mappings, theoretically enabling the generation of full text sequences from noise in a single forward pass. However, standard formulations rely on Euclidean regression losses that are geometrically ill-suited for discrete data. In this work, we resolve this conflict with Discrete Flow Maps, a framework that reconciles trajectory compression with the geometry of the probability simplex. We recast standard flow map training for the discrete domain, aligning the training dynamics with the discrete nature of language. Empirically, this strict geometric alignment allows our method to surpass previous state-of-the-art results in discrete flow modeling.

Comment: Reformulates flow-map training to respect discrete simplex geometry for one-step parallel discrete generation.

Topic Match: The core contribution is a new generative modeling formulation for discrete sequence modeling, squarely in core architecture/mechanism design rather than application.

Relevance: 9 Novelty: 9


4. INCRT: An Incremental Transformer That Determines Its Own Architecture

ArXiv ID: 2604.10703

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Giansalvo Cirrincione

Abstract: Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy -- between half and four-fifths of all heads in a trained model can be removed without measurable loss -- because the architecture allocates capacity without reference to the actual requirements of the task. This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no separate validation phase and no hand-tuned schedule. Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) provides a geometric upper bound on the number of heads that this configuration can contain, as a function of the spectral complexity of the task. Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: the predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using between three and seven times fewer parameters and no pre-training.

Comment: Proposes a transformer that grows and prunes attention heads online using an explicit geometric sufficiency criterion.

Topic Match: The heart of the paper is an architectural/training mechanism for adaptive capacity allocation inside transformers.

Relevance: 9 Novelty: 8


5. A Mechanistic Analysis of Looped Reasoning Language Models

ArXiv ID: 2604.11791

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, Xiaowen Dong

Abstract: Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM's layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.

Comment: Mechanistic study shows looped reasoning models converge to cyclic fixed points with stable stage-like inference dynamics.

Topic Match: The paper directly analyzes recurrent/looped architectural dynamics and how inference unfolds internally, making architecture/training dynamics the clearest fit.

Relevance: 9 Novelty: 8
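
The notion of each layer in the cycle converging to its own fixed point can be illustrated with a toy recurrent block. This is a minimal sketch under strong assumptions: two random tanh layers whose weights are rescaled to be contractive, with the input re-injected at every layer. None of it is taken from the models studied in the paper. Convergence here is guaranteed by construction, whereas the paper reports it as an empirical finding.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def contractive_layer(scale=0.5):
    """Random affine layer rescaled so its weight matrix has spectral norm `scale` < 1."""
    W = rng.standard_normal((d, d))
    W *= scale / np.linalg.norm(W, 2)
    b = rng.standard_normal(d)
    return W, b

layers = [contractive_layer(), contractive_layer()]   # a recurrent block of two layers
inject = rng.standard_normal(d)                        # input re-injected at every layer

def run_loop(x, n_loops):
    """Apply the block n_loops times; record the state after each layer in each pass."""
    history = []
    for _ in range(n_loops):
        states = []
        for W, b in layers:
            x = np.tanh(W @ x + b + inject)
            states.append(x)
        history.append(states)
    return history

hist = run_loop(np.zeros(d), n_loops=40)
# Change of each layer's state between the last two passes (near zero at a fixed point).
drift = [np.linalg.norm(hist[-1][i] - hist[-2][i]) for i in range(2)]
# Distance between the two layers' limiting states (nonzero when the fixed points differ).
gap = np.linalg.norm(hist[-1][0] - hist[-1][1])
```

A small `drift` together with a non-negligible `gap` is the toy analogue of a cyclic trajectory: the block as a whole repeats, while each layer settles at a different point.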


6. Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures

ArXiv ID: 2604.11639

Primary Topic: Architecture and Training Dynamics

Authors: Maxim Bolshim (ITMO University, Saint Petersburg, Russia), Alexander Kugaevskikh (ITMO University, Saint Petersburg, Russia)

Abstract: Modern automatic differentiation frameworks (JAX, PyTorch) return the Hessian of the loss function as a monolithic tensor, without exposing the internal structure of inter-layer interactions. This paper presents an analytical formalism that explicitly decomposes the full Hessian into blocks indexed by the DAG of an arbitrary architecture. The canonical decomposition $H = H^{GN} + H^T$ separates the Gauss--Newton component (convex part) from the tensor component (residual curvature responsible for saddle points). For piecewise-linear activations (ReLU), the tensor component of the input Hessian vanishes ($H^{T}_{v,w}\!\equiv\!0$ a.e., $H^{f}_{v,w}\!=\!H^{GN}_{v,w}\!\succeq\!0$); the full parametric Hessian contains residual terms that do not reduce to the GGN. Building on this decomposition, we introduce diagnostic metrics (inter-layer resonance~$\mathcal{R}$, geometric coupling~$\mathcal{C}$, stable rank~$\mathcal{D}$, GN-Gap) that are estimated stochastically in $O(P)$ time and reveal structural curvature interactions between layers. The theoretical analysis explains exponential decay of resonance in vanilla networks and its preservation under skip connections; empirical validation spans fully connected MLPs (Exp.\,1--5) and convolutional architectures (ResNet-18, ${\sim}11$M~parameters, Exp.\,6). When the architecture reduces to a single node, all definitions collapse to the standard Hessian $\nabla^2_\theta\mathcal{L}(\theta)\in\mathbb{R}^{p\times p}$.

Comment: Decomposes the Hessian along DAG architecture structure and introduces inter-layer curvature diagnostics estimable at scale.

Topic Match: This is directly about architecture-aware training dynamics and curvature structure in neural networks.

Relevance: 9 Novelty: 8
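
The split $H = H^{GN} + H^T$ can be checked numerically on a tiny example. The sketch below assumes a two-parameter model f(x) = a * tanh(b * x) with a squared-error loss. The model, data, and function names are illustrative choices, not the paper's DAG-indexed formalism. It covers only the single-node case, where the decomposition is the classical Gauss-Newton term plus residual-weighted second derivatives of the model output.

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])      # toy inputs (assumed)
y = np.array([0.3, -0.2, 0.8])      # toy targets (assumed)

def model(theta):
    a, b = theta
    return a * np.tanh(b * x)

def jacobian(theta):
    """Rows are d f(x_i) / d(a, b)."""
    a, b = theta
    t = np.tanh(b * x)
    s = 1.0 - t**2                   # sech^2(b x)
    return np.stack([t, a * x * s], axis=1)

def grad_loss(theta):
    """Gradient of L = 0.5 * sum((f - y)^2), i.e. J^T r."""
    return jacobian(theta).T @ (model(theta) - y)

def hessian_parts(theta):
    a, b = theta
    t = np.tanh(b * x)
    s = 1.0 - t**2
    r = model(theta) - y
    J = jacobian(theta)
    H_gn = J.T @ J                                  # Gauss-Newton part, positive semidefinite
    d_ab = np.sum(r * x * s)                        # sum_i r_i * d2f_i / (da db)
    d_bb = np.sum(r * (-2.0 * a * x**2 * s * t))    # sum_i r_i * d2f_i / db2
    H_t = np.array([[0.0, d_ab], [d_ab, d_bb]])     # residual curvature part
    return H_gn, H_t

def fd_hessian(theta, eps=1e-5):
    """Central finite differences of the analytic gradient."""
    H = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = eps
        H[:, j] = (grad_loss(theta + e) - grad_loss(theta - e)) / (2 * eps)
    return H

theta = np.array([0.7, -1.2])
H_gn, H_t = hessian_parts(theta)
H_fd = fd_hessian(theta)             # should agree with H_gn + H_t
```

Only `H_gn` is guaranteed to be positive semidefinite. Any negative curvature of the full Hessian has to come from `H_t`, which is the sense in which the abstract ties the tensor component to saddle points.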


7. Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria

ArXiv ID: 2604.10560

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Nikodem Tomczak

Abstract: Profiled Sparse Networks (PSN) replace uniform connectivity with deterministic, heterogeneous fan-in profiles defined by continuous, nonlinear functions, creating neurons with both dense and sparse receptive fields. We benchmark PSN across four classification datasets spanning vision and tabular domains, input dimensions from 54 to 784, and network depths of 2--3 hidden layers. At 90% sparsity, all static profiles, including the uniform random baseline, achieve accuracy within 0.2-0.6% of dense baselines on every dataset, demonstrating that heterogeneous connectivity provides no accuracy advantage when hub placement is arbitrary rather than task-aligned. This result holds across sparsity levels (80-99.9%), profile shapes (eight parametric families, lognormal, and power-law), and fan-in coefficients of variation from 0 to 2.5. Internal gradient analysis reveals that structured profiles create a 2-5x gradient concentration at hub neurons compared to the ~1x uniform distribution in random baselines, with the hierarchy strength predicted by fan-in coefficient of variation ($r = 0.93$). When PSN fan-in distributions are used to initialise RigL dynamic sparse training, lognormal profiles matched to the equilibrium fan-in distribution consistently outperform standard ERK initialisation, with advantages growing on harder tasks, achieving +0.16% on Fashion-MNIST ($p = 0.036$, $d = 1.07$), +0.43% on EMNIST, and +0.49% on Forest Cover. RigL converges to a characteristic fan-in distribution regardless of initialisation. Starting at this equilibrium allows the optimiser to refine weights rather than rearrange topology. Which neurons become hubs matters more than the degree of connectivity variance, i.e., random hub placement provides no advantage, while optimisation-driven placement does.

Comment: Analyzes heterogeneous fan-in sparse networks and shows optimized dynamic sparse training converges toward characteristic topological equilibria.

Topic Match: It is fundamentally about sparse architecture design and training dynamics, with mechanistic analysis of gradient hierarchy and topology evolution.

Relevance: 9 Novelty: 8


8. Query Lower Bounds for Diffusion Sampling

ArXiv ID: 2604.10857

Primary Topic: Architecture and Training Dynamics

Authors: Zhiyang Xun, Eric Price

Abstract: Diffusion models generate samples by iteratively querying learned score estimates. A rapidly growing literature focuses on accelerating sampling by minimizing the number of score evaluations, yet the information-theoretic limits of such acceleration remain unclear. In this work, we establish the first score query lower bounds for diffusion sampling. We prove that for $d$-dimensional distributions, given access to score estimates with polynomial accuracy $\varepsilon=d^{-O(1)}$ (in any $L^p$ sense), any sampling algorithm requires $\widetilde{\Omega}(\sqrt{d})$ adaptive score queries. In particular, our proof shows that any sampler must search over $\widetilde{\Omega}(\sqrt{d})$ distinct noise levels, providing a formal explanation for why multiscale noise schedules are necessary in practice.

Comment: Establishes the first adaptive score-query lower bounds for diffusion sampling, proving multiscale noise search is inherently necessary.

Topic Match: This is best read as foundational analysis of a core generative-model computational mechanism: the query complexity of diffusion sampling.

Relevance: 8 Novelty: 9


9. The Diffusion-Attention Connection

ArXiv ID: 2604.09560

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Julio Candanedo

Abstract: Transformers, diffusion-maps, and magnetic Laplacians are usually treated as separate tools; we show they are all different regimes of a single Markov geometry built from pre-softmax query-scores. We define a QK "bidivergence" whose exponentiated and normalized forms yield attention, diffusion-maps, and magnetic diffusion. We then use product of experts and Schrödinger bridges to connect and organize them into equilibrium, nonequilibrium steady-state, and driven dynamics.

Comment: Unifies attention, diffusion maps, and magnetic diffusion through a shared Markov geometry over pre-softmax scores.

Topic Match: Its primary value is a conceptual and mathematical reframing of attention as part of a broader computational mechanism family.

Relevance: 8 Novelty: 9
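
The shared "exponentiate, then normalize" pattern can be made concrete with a small numerical sketch. The abstract does not define the QK bidivergence, so the code below does not attempt to reproduce it. It only shows, under assumed random queries and keys, that softmax attention and a diffusion-map style kernel normalization both produce row-stochastic (Markov) matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
Q = rng.standard_normal((n, d))      # toy queries (assumed)
K = rng.standard_normal((n, d))      # toy keys (assumed)

scores = Q @ K.T / np.sqrt(d)        # pre-softmax query-key scores

# Attention: exponentiate and row-normalize, giving a row-stochastic matrix.
W = np.exp(scores - scores.max(axis=1, keepdims=True))
P_attn = W / W.sum(axis=1, keepdims=True)

# Diffusion-map style: symmetric Gaussian affinity on the keys, then the same row normalization.
sq_dist = ((K[:, None, :] - K[None, :, :]) ** 2).sum(-1)
kernel = np.exp(-sq_dist / 2.0)
P_diff = kernel / kernel.sum(axis=1, keepdims=True)
```

The attention matrix comes from an asymmetric score, while the diffusion-map kernel is symmetric before normalization. Both can be read as one-step transition matrices of a random walk over tokens.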


10. SHANG++: Robust Stochastic Acceleration under Multiplicative Noise

ArXiv ID: 2603.09355

Primary Topic: Architecture and Training Dynamics

Authors: Yaxin Yu, Long Chen, Minfu Feng

Abstract: Under the multiplicative noise scaling (MNS) condition, original Nesterov acceleration is provably sensitive to noise and may diverge when gradient noise overwhelms the signal. In this paper, we develop two accelerated stochastic gradient descent methods by discretizing the Hessian-driven Nesterov accelerated gradient flow. We first derive SHANG, a direct Gauss-Seidel-type discretization that already improves stability under MNS. We then introduce SHANG++, which adds a damping correction and achieves faster convergence with stronger noise robustness. We establish convergence guarantees for both convex and strongly convex objectives under MNS, together with explicit parameter choices. In our experiments, SHANG++ performs consistently well across convex problems and applications in deep learning. In a dedicated noise experiment on ResNet-34, a single hyperparameter configuration attains accuracy within 1% of the noise-free setting. Across all experiments, SHANG++ outperforms existing accelerated methods in robustness and efficiency, with minimal parameter sensitivity.

Comment: Introduces SHANG++ as an accelerated stochastic optimizer explicitly designed for robustness under multiplicative noise.

Topic Match: The paper is fundamentally about training dynamics and optimizer stability under realistic noise, making architecture/training the best fit.

Relevance: 8 Novelty: 8


11. THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture

ArXiv ID: 2604.11284

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Augustus Haoyang Li

Abstract: We present THEIA, a modular neural architecture that learns complete Kleene three-valued logic (K3) end-to-end without any external symbolic solver, and investigate what architectural prior enables compositional generalization under uncertainty. THEIA processes four mathematical domains (arithmetic, order, set membership, propositional logic) through dedicated engines that converge in a final logic module. Trained on a 2M-sample dataset with input space ~3.4x10^13, it achieves 12/12 Kleene K3 rule coverage across 5 seeds in 9.2 +/- 3.5 minutes (5.6x faster than a parameter-comparable Transformer under matched settings). A mod-3 sequential composition experiment generalizes from 5-step training to 500-step evaluation at 99.97% +/- 0.02% -- a result that critically depends on structured inductive bias: replacing the four-engine backbone with a flat MLP collapses length generalization to chance by 50 steps regardless of capacity (both 0.80M and parameter-matched 2.75M variants fail), while a pre-LN TF8LTuned Transformer baseline (3,582,147 params) trained under the identical protocol reaches 99.24% at 500 steps (Appendix D). Mechanistic probing reveals that modularity induces a delayed verdict: upstream engines encode domain-specific variables without committing to the final truth value (probe accuracy <= 74% uncertainty-only ceiling), with the verdict emerging only at the Logic Engine boundary -- causally confirmed by activation patching (100% flip rate on 986 matched pairs, replicated across n=5 seeds; 100.0% aggregate). The Transformer baseline reaches equivalent correctness through a qualitatively different representational trajectory (contraction then expansion), suggesting that modular and monolithic architectures implement distinct compositional strategies.

Comment: Shows a modular pure-neural architecture can learn full Kleene three-valued logic and links compositional generalization to delayed-verdict modular structure.

Topic Match: Primary fit is architecture/training because the core claim is that a specific modular inductive bias enables robust compositional computation.

Relevance: 8 Novelty: 8
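
For readers unfamiliar with Kleene three-valued logic, the standard strong Kleene (K3) connectives can be written in a few lines. The numeric encoding and function names below are conventions chosen for illustration. The abstract does not list the 12 rules it counts, so this sketch shows only the target logic, not THEIA's architecture or its training data.

```python
from itertools import product

# Encode truth values as numbers: False = 0, Unknown = 0.5, True = 1 (a common convention).
F, U, T = 0.0, 0.5, 1.0
NAMES = {F: "F", U: "U", T: "T"}

def k3_not(a):
    return 1.0 - a

def k3_and(a, b):
    return min(a, b)

def k3_or(a, b):
    return max(a, b)

def k3_implies(a, b):
    # Kleene implication is defined as (not a) or b.
    return k3_or(k3_not(a), b)

# Full truth table for conjunction, keyed by value names.
and_table = {
    (NAMES[a], NAMES[b]): NAMES[k3_and(a, b)]
    for a, b in product((F, U, T), repeat=2)
}
```

The Unknown value propagates unless the other operand already decides the result: `k3_and(F, U)` is False and `k3_or(T, U)` is True, while `k3_and(T, U)` stays Unknown.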


12. Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD

ArXiv ID: 2604.10373

Primary Topic: Architecture and Training Dynamics

Authors: Konstantinos Emmanouilidis, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Rene Vidal

Abstract: From adversarial robustness to multi-agent learning, many machine learning tasks can be cast as finite-sum min-max optimization or, more generally, as variational inequality problems (VIPs). Owing to their simplicity and scalability, stochastic gradient methods with constant step size are widely used, despite the fact that they converge only up to a constant term. Among the many heuristics adopted in practice, two classical techniques have recently attracted attention to mitigate this issue: Random Reshuffling of data and Richardson-Romberg extrapolation across iterates. Random Reshuffling sharpens the mean-squared error (MSE) of the estimated solution, while Richardson-Romberg extrapolation acts orthogonally, providing a second-order reduction in its bias. In this work, we show that their composition is strictly better than both, not only maintaining the enhanced MSE guarantees but also yielding an even greater cubic refinement in the bias. To the best of our knowledge, our work provides the first theoretical guarantees for such a synergy in structured non-monotone VIPs. Our analysis proceeds in two steps: (i) we smooth the discrete noise induced by reshuffling and leverage tools from continuous-state Markov chain theory to establish a novel law of large numbers and a central limit theorem for its iterates; and (ii) we employ spectral tensor techniques to prove that extrapolation debiases and sharpens the asymptotic behavior even under the biased gradient oracle induced by reshuffling. Finally, extensive experiments validate our theory, consistently demonstrating substantial speedups in practice.

Comment: Proves that combining random reshuffling with Richardson-Romberg extrapolation yields a stronger cubic bias refinement for constant-step stochastic methods.

Topic Match: Best fit is training dynamics because the paper studies how optimization heuristics fundamentally alter convergence and bias in stochastic training.

Relevance: 8 Novelty: 8
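
Illustrative note (not taken from the paper): the bias cancellation behind Richardson-Romberg extrapolation can be seen in a schematic expansion. Assume, purely for illustration, that the stationary mean $\bar{x}_{\gamma}$ of the iterates at step size $\gamma$ admits a power-series expansion around the solution $x^{*}$ with hypothetical coefficients $b_1, b_2$. Combining runs at step sizes $\gamma$ and $2\gamma$ then removes the first-order term. The paper's actual expansion under reshuffling, and its cubic refinement, are not reproduced here.

```latex
\begin{align*}
\bar{x}_{\gamma}  &= x^{*} + \gamma\, b_{1} + \gamma^{2} b_{2} + O(\gamma^{3}), \\
\bar{x}_{2\gamma} &= x^{*} + 2\gamma\, b_{1} + 4\gamma^{2} b_{2} + O(\gamma^{3}), \\
2\,\bar{x}_{\gamma} - \bar{x}_{2\gamma} &= x^{*} - 2\gamma^{2} b_{2} + O(\gamma^{3}).
\end{align*}
```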


13. Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

ArXiv ID: 2604.10567

Primary Topic: Architecture and Training Dynamics

Authors: Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo

Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

Comment: Identifies proximity bias in non-autoregressive diffusion decoding and steers early token selection with a lightweight planner and end-of-sequence temperature annealing.

Topic Match: This is about decoding dynamics and failure modes in a specific generative architecture, with a mechanistic diagnosis guiding a new inference method.

Relevance: 8 Novelty: 8


14. The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

ArXiv ID: 2604.10272

Primary Topic: Architecture and Training Dynamics

Authors: Mani Rash Ahmadi

Abstract: We prove that in a coupled Kuramoto oscillator network at stable equilibrium, the physical phase displacement under weak output nudging is the gradient of the loss with respect to natural frequencies, with equality as the nudging strength beta tends to zero. Prior oscillator equilibrium propagation work explicitly set aside natural frequency as a learnable parameter; we show that on sparse layered architectures, frequency learning outperforms coupling-weight learning among converged seeds (96.0% vs. 83.3% at matched parameter counts, p = 1.8e-12). The approximately 50% convergence failure rate under random initialization is a loss-landscape property, not a gradient error; topology-aware spectral seeding eliminates it in all settings tested (46/100 to 100/100 seeds on the primary task; 50/50 on a second task, K-only training, and a larger architecture).

Comment: Shows equilibrium phase displacements in Kuramoto networks compute gradients for learning natural frequencies.

Topic Match: The contribution is a foundational learning mechanism for a nonstandard neural architecture, with both theory and training dynamics insight.

Relevance: 8 Novelty: 8


15. Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression

ArXiv ID: 2604.10814

Primary Topic: Architecture and Training Dynamics

Authors: Yijin Ni, Xiaoming Huo

Abstract: We study online covariance matrix estimation for Polyak-Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of $O(n^{-(1-\alpha)/4})$, which yields $O(n^{-1/8})$ at the optimal learning-rate exponent $\alpha \rightarrow 1/2^+$. A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to $O(n^{-(1-\alpha)/3})$, achieving $O(n^{-1/6})$. The modified estimator requires no Hessian access and preserves $O(d^2)$ memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate $\Theta(n^{-(1-\alpha)/2})$ for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives $\Omega(n^{-(1-\alpha)/2})$, and a trajectory-regression estimator, which estimates the Hessian by regressing SGD increments on iterates, achieves $O(n^{-(1-\alpha)/2})$, matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.

Comment: Derives minimax-optimal online covariance estimation from SGD trajectories and introduces a Hessian-free trajectory-regression estimator.

Topic Match: This is fundamentally about SGD training dynamics and statistical estimation of optimizer behavior, not an application domain.

Relevance: 8 Novelty: 8
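
Illustrative note: the rates quoted in the abstract all have the form $n^{-(1-\alpha)/c}$, so the exponents at the limiting value $\alpha = 1/2$ can be checked with exact arithmetic. The sketch below is only that arithmetic; the function name and dictionary labels are made up for this note. The $n^{-1/4}$ figure for the minimax rate is obtained by substitution and is not stated explicitly in the abstract.

```python
from fractions import Fraction


def rate_exponent(alpha, divisor):
    """Return r such that a rate n^{-(1 - alpha)/divisor} equals n^{-r}."""
    return (1 - Fraction(alpha)) / divisor


# Limiting value of the learning-rate exponent (the abstract takes alpha -> 1/2 from above).
alpha = Fraction(1, 2)

rates = {
    "batch-means, original tuning": rate_exponent(alpha, 4),
    "batch-means, re-tuned blocks": rate_exponent(alpha, 3),
    "trajectory regression (minimax)": rate_exponent(alpha, 2),
}

for name, r in rates.items():
    print(f"{name}: n^(-{r})")
```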


16. Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

ArXiv ID: 2604.11056

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). However, its sparse outcome-based rewards pose a fundamental credit assignment problem. We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, isolates token updates by polarity and entropy, and controlled ablations show that reasoning improvements concentrate in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields testable predictions that reasoning gains arise primarily from high-entropy tokens, with unique roles for positive and negative updates. A gradient analysis of GRPO further reveals how uniform reward broadcast dilutes signal at high-entropy positions while over-crediting deterministic tokens. Grounded in these insights, we propose Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals accordingly. Extensive experiments demonstrate that EAPO outperforms strong baselines across two model families.

Comment: Analyzes token-level RLVR credit assignment through polarity and entropy, deriving entropy-aware optimization.

Topic Match: The paper centers on training dynamics and credit assignment theory for autoregressive reasoning models, making architecture/training the primary fit.

Relevance: 8 Novelty: 8
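
Illustrative note: a minimal sketch of what sorting tokens by reward polarity and entropy could look like. It is not the paper's Four Quadrant Decomposition or EAPO; the threshold value, the function names, and the rule that a zero advantage counts as negative are all assumptions made for this note.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quadrant(advantage, entropy, threshold):
    """Label a token by the sign of its advantage and by entropy vs. a threshold.

    Assumption: a zero advantage is grouped with the negative side.
    """
    polarity = "positive" if advantage > 0 else "negative"
    level = "high-entropy" if entropy >= threshold else "low-entropy"
    return f"{polarity}/{level}"


# Toy next-token distributions over a 4-token vocabulary.
near_deterministic = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
threshold = 0.5  # hypothetical cut-off in nats

labels = [
    quadrant(+1.0, token_entropy(near_deterministic), threshold),
    quadrant(+1.0, token_entropy(uncertain), threshold),
    quadrant(-1.0, token_entropy(uncertain), threshold),
]
print(labels)
```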


17. LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

ArXiv ID: 2604.11748

Primary Topic: Architecture and Training Dynamics

Authors: Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu

Abstract: Continuous diffusion models have achieved strong performance across domains such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. Our approach connects embedding-space DLMs to Flow Matching via Bregman divergence and introduces three key innovations: (1) a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) an information-uniform principle for noise scheduling, motivating a learnable scheduler based on a Gumbel distribution; and (3) an improved training protocol incorporating self-conditioning, which enhances both likelihood and sample quality. LangFlow achieves strong performance across benchmarks, reaching a perplexity (PPL) of 30.0 on LM1B and 24.6 on OpenWebText. It matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer across multiple benchmarks. LangFlow provides clear evidence that continuous diffusion is a competitive and promising paradigm for language modeling. https://github.com/nealchen2003/LangFlow

Comment: Makes continuous diffusion language modeling competitive via a new ODE-based NLL bound, information-uniform noise scheduling, and improved self-conditioning.

Topic Match: The central contribution is a new generative modeling/training formulation for language models, with architectural and objective-level innovations rather than downstream use.

Relevance: 8 Novelty: 8


18. Continuous Adversarial Flow Models

ArXiv ID: 2604.11521

Primary Topic: Architecture and Training Dynamics

Authors: Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan

Abstract: We propose continuous adversarial flow models, a type of continuous-time flow model trained with an adversarial objective. Unlike flow matching, which uses a fixed mean-squared-error criterion, our approach introduces a learned discriminator to guide training. This change in objective induces a different generalized distribution, which empirically produces samples that are better aligned with the target data distribution. Our method is primarily proposed for post-training existing flow-matching models, although it can also train models from scratch. On the ImageNet 256px generation task, our post-training substantially improves the guidance-free FID of latent-space SiT from 8.26 to 3.63 and of pixel-space JiT from 7.17 to 3.57. It also improves guided generation, reducing FID from 2.06 to 1.53 for SiT and from 1.86 to 1.80 for JiT. We further evaluate our approach on text-to-image generation, where it achieves improved results on both the GenEval and DPG benchmarks.

Comment: Introduces adversarially trained continuous-time flow models as a post-training alternative to fixed MSE flow matching objectives.

Topic Match: The core idea changes the training objective for flow models in a foundational way, making it primarily an architecture/training-dynamics paper.

Relevance: 8 Novelty: 8


Efficiency, Compression, and Large-Scale Training (9)

1. ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

ArXiv ID: 2604.10898

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Memory Structures and Agent Memory Systems

Authors: David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen

Abstract: Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically "zooming in" on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than $4\times$. These results demonstrate that a multi-granularity KV selection enables more memory-efficient decoding, especially for long output generation.

Comment: Uses hierarchical summaries as coarse indices for dynamic KV retrieval during long reasoning, reducing decode-time memory by over 4x.

Topic Match: Best fit is efficiency/scaling because the main contribution is a new memory-efficient KV-cache access mechanism for long-output decoding.

Relevance: 9 Novelty: 8
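
Illustrative note: the coarse-to-fine idea in the abstract (score summary keys against the query, then fetch details only for the best-matching thoughts) can be sketched as below. This is not ZoomR's implementation; `select_kv`, the dot-product scoring, and the toy data are assumptions for this note.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_kv(query, summary_keys, segments, top_k):
    """Pick the top_k segments whose summary key best matches the query.

    Returns the chosen segment ids and the detail-token positions to load.
    """
    scores = [dot(query, key) for key in summary_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_k])
    positions = [pos for seg in chosen for pos in segments[seg]]
    return chosen, positions


# Three summarized "thoughts", each covering a span of cache positions.
summary_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
segments = [range(0, 4), range(4, 9), range(9, 12)]
query = [0.1, 0.9]

chosen, positions = select_kv(query, summary_keys, segments, top_k=1)
print(chosen, positions)
```

In this toy case the decode step would touch three summary keys plus five detail positions instead of all twelve cached positions.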


2. SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

ArXiv ID: 2604.10152

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Jehyeon Bang, Eunyeong Cho, Ranggi Hwang, Jinha Chung, Minsoo Rhu

Abstract: The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to $4.30\times$, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.

Comment: Self-assisted speculative decoding speeds MoE inference without extra training while reducing memory and bandwidth pressure.

Topic Match: The paper's main value is a new inference algorithm/system for MoE efficiency under memory constraints, not a new downstream application.

Relevance: 9 Novelty: 8
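
Illustrative note: for readers unfamiliar with speculative decoding, the sketch below shows the generic greedy draft-then-verify step that the technique builds on. It is not SpecMoE's self-assisted algorithm; the toy `draft_next` and `target_next` functions are stand-ins invented for this note, and a real system would verify all proposed positions in one batched pass of the target model.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy draft-then-verify step; returns the tokens appended."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # Verify phase: keep the longest prefix the target model agrees with.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def target_next(tokens):
    """Toy target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next, k=4))
```

The output always equals what greedy decoding with the target alone would produce; the draft only changes how many target calls are needed per emitted token.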


3. ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

ArXiv ID: 2604.11080

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak

Abstract: Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.

Comment: Layer-wise quantization uses residual subspace rotation to retain local expressivity while preserving offline fusion efficiency.

Topic Match: The paper directly targets post-training quantization efficiency with a nontrivial algorithmic design that reconciles accuracy and inference overhead.

Relevance: 9 Novelty: 8


4. S$^3$: Structured Sparsity Specification

ArXiv ID: 2604.11315

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Ayoub Ghriss

Abstract: We introduce the Structured Sparsity Specification (S$^3$), an algebraic framework for defining, composing, and implementing structured sparse patterns. S$^3$ specifies sparsity through three components: a View that reshapes the tensor via layout composition, a Block specification that defines the atomic pruning unit, and the sparsity decision Scope. Both Block and Scope support Coupling across tensors for coordinated sparsification. S$^3$ enables precise specification of diverse sparsity structures, from fine-grained N:M patterns to coarse channel pruning, and integrates seamlessly with Optimal Brain Damage (OBD) and Surgeon (OBS). We formalize the framework mathematically, demonstrate its expressiveness on canonical patterns, and validate it experimentally via structured OBS and OBD implementations built entirely on S$^3$, which surpasses well-established second order heuristics on output reconstruction across common configurations.

Comment: Introduces an algebraic specification language for structured sparsity patterns and coordinated pruning across tensors.

Topic Match: This is directly about sparsity structure and pruning methodology, a core efficiency/compression topic.

Relevance: 9 Novelty: 8
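
Illustrative note: the fine-grained N:M patterns mentioned in the abstract keep N nonzero entries in every group of M consecutive weights. The sketch below applies a 2:4 pattern with a simple magnitude criterion. It does not use the S$^3$ framework or the OBS/OBD saliency scores; `prune_n_of_m` is a name made up for this note.

```python
def prune_n_of_m(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in each block of m weights."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = sorted(range(m), key=lambda i: abs(block[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_n_of_m(row)
print(sparse_row)
```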


5. LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention

ArXiv ID: 2604.10044

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics, Memory Structures and Agent Memory Systems

Authors: Dongjie Xu, Hao Wu, Weijie Shi, Yue Cui, Yuanjun Liu, Jiawei Li, Haolun Ma, An Liu, Jia Zhu, Jiajie Xu

Abstract: Through systematic experiments on long-context generation, we observe a damaging failure mode in which decoding can collapse into persistent repetition loops. We find that this degeneration is driven by collapsed attention patterns, where a subset of heads locks onto a narrow suffix of the history, and is further stabilized by inference-time KV cache reuse. Crucially, since many existing KV cache policies rely on attention-based importance, this collapse can produce spuriously high scores for repetitive tokens, causing cache management to inadvertently amplify repetition. To study this phenomenon in a controlled and reproducible manner, we introduce LoopBench, a benchmark with explicit loop-inducing conditions and loop-oriented metrics that quantify repetition severity and generation instability beyond downstream task scores. Building on these insights, we propose LoopGuard, a lightweight, plug-in KV cache guard that detects loop onset online and disrupts the feedback cycle by pruning repetitive tail spans under a fixed cache budget. Experiments on LoopBench show that LoopGuard reduces loop incidence by over 90 percentage points, while restoring output diversity and reducing token waste.

Comment: Identifies repetition loops as a KV-cache failure mode and proposes an online cache intervention mechanism to break them.

Topic Match: KV-cache design and intervention are the paper's main contribution, making efficiency the clearest primary fit.

Relevance: 9 Novelty: 8
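
Illustrative note: one simple way to flag a repetition loop is to test whether the generated sequence ends in the same span repeated several times, and then mark the redundant copies as candidates for cache eviction. The sketch below is a token-level heuristic invented for this note, not LoopGuard's detector or its pruning policy; all names and default parameters are hypothetical.

```python
def repeated_tail(tokens, min_period=1, max_period=8, min_repeats=3):
    """Return (period, repeats) if tokens end in a span repeated at least
    min_repeats times, otherwise None. The shortest period is reported."""
    for period in range(min_period, max_period + 1):
        if len(tokens) < period * min_repeats:
            continue
        unit = tokens[-period:]
        repeats = 1
        end = len(tokens) - period
        # Walk backwards while the preceding span equals the tail unit.
        while end >= period and tokens[end - period:end] == unit:
            repeats += 1
            end -= period
        if repeats >= min_repeats:
            return period, repeats
    return None


def evictable_positions(tokens, period, repeats):
    """Positions of every copy of the repeated span except the first."""
    start = len(tokens) - period * (repeats - 1)
    return list(range(start, len(tokens)))


looping = [5, 1, 2, 3, 1, 2, 3, 1, 2, 3]
hit = repeated_tail(looping)
print(hit, evictable_positions(looping, *hit))
```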


6. VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination

ArXiv ID: 2604.09558

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Muyan Hu, Ahan Gupta, Jiachen Yuan, Vima Gupta, Taeksang Kim, Xin Xu, Janardhan Kulkarni, Ofer Dekel, Vikram Adve, Charith Mendis

Abstract: With the widening gap between compute and memory operation latencies, data movement optimizations have become increasingly important for DNN compilation. Current optimizations such as layout transformations and operator fusion only target a subset of tensor operators and consequently miss important opportunities for reducing data movement in contemporary DNN workloads, including large language models. We introduce VTC, a novel tensor compilation framework that for the first time eliminates all unnecessary data movement by targeting the full spectrum of data movement operators. VTC proposes the concept of virtual tensors to track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory, which can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions. We also introduce a novel data movement elimination algorithm to automatically identify a profitable virtual tensor creation strategy. Evaluation on a variety of DNNs shows that VTC can outperform existing ML compilers by up to 1.93x (1.28x on average) on NVIDIA GPUs with up to 60% (17.5% on average) inference memory savings.

Comment: Introduces virtual tensors to eliminate unnecessary data movement across arbitrary tensor operator compositions in DNN compilation.

Topic Match: The core contribution is a new systems/compiler mechanism that materially changes inference memory traffic and performance for large models.

Relevance: 9 Novelty: 8


7. Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

ArXiv ID: 2604.11001

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Zhuolun Dong, Junyu Cao

Abstract: Large language models (LLMs) have been widely adopted due to their great performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown. The memory usage for each request grows with generated tokens, which may lead to overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to commonly used strategies in practice, our approach achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.

Comment: Proposes a flow-control framework for LLM inference that is provably stable under unknown decode lengths and growing KV-cache demand.

Topic Match: This is directly about inference-time scheduling and KV-cache stability, with a new control framework plus a necessary condition for stability and sufficient conditions under which the algorithm is stable.

Relevance: 9 Novelty: 8


8. ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

ArXiv ID: 2604.09603

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Xinyi Hu, Yuhao Shen, Baolin Zhang, Hengxin Zhang, Jun Dai, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan

Abstract: Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales, particularly the industrial-grade Qwen3-235B, demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to 5.35x walltime speedup and delivering over 20% relative speedup gain.

Comment: Recasts speculative decoding for high-concurrency serving as budgeted scheduling with sparse confidence gating over a batch-level super-tree.

Topic Match: This is directly about inference-efficiency design for large-model serving, with a nontrivial new algorithmic scheduling idea.

Relevance: 9 Novelty: 8


9. Mitigating Privacy Risk via Forget Set-Free Unlearning

ArXiv ID: 2604.10636

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Aviraj Newatia, Michael Cooper, Viet Nguyen, Rahul G. Krishnan

Abstract: Training machine learning models requires the storage of large datasets, which often contain sensitive or private data. Storing data is associated with a number of potential risks which increase over time, such as database breaches and malicious adversaries. Machine unlearning is the study of methods to efficiently remove the influence of training data subsets from previously-trained models. Existing unlearning methods typically require direct access to the "forget set" (the data to be forgotten), and organisations must retain this data for unlearning rather than deleting it immediately upon request, increasing risks associated with the forget set. We introduce partially-blind unlearning: utilizing auxiliary information to unlearn without explicit access to the forget set. We also propose Reload, a practical partially-blind method based on gradient optimization and structured weight sparsification that operationalizes partially-blind unlearning. We show that Reload efficiently unlearns, approximating models retrained from scratch, and outperforms several forget set-dependent approaches. On language models, Reload unlearns entities using <0.025% of the retain set and <7% of model weights in <8 minutes on Llama2-7B. In the corrective case, Reload achieves unlearning even when only 10% of corrupted data is identified.

Comment: Partially-blind unlearning removes training influence without retaining the forget set, using sparse weight updates.

Topic Match: Although framed around privacy, the technical core is an efficient structured-weight method for model editing/unlearning at LLM scale.

Relevance: 8 Novelty: 8


Representation Learning Theory and Structure (12)

1. A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics

ArXiv ID: 2604.09979

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Louie Hong Yao, Yuhao Li, Shengchao Liu

Abstract: Self-supervised representation learning is central to modern machine learning because it extracts structured latent features from unlabeled data and enables robust transfer across tasks and domains. However, it can suffer from representation collapse, a widely observed failure mode in which embeddings lose discriminative structure and distinct inputs become indistinguishable. To understand the mechanisms that drive collapse and the ingredients that prevent it, we introduce a minimal embedding-only model whose gradient-flow dynamics and fixed points can be analyzed in closed form, using a classification-representation setting as a concrete playground where collapse is directly quantified through the contraction of label-embedding geometry. We illustrate that the model does not collapse when the data are perfectly classifiable, while a small fraction of frustrated samples that cannot be classified consistently induces collapse through an additional slow time scale that follows the early performance gain. Within the same framework, we examine collapse prevention by adding a shared projection head and applying stop-gradient at the level of the training dynamics. We analyze the resulting fixed points and develop a dynamical mean-field style self-consistency description, showing that stop-gradient enables non-collapsed solutions and stabilizes finite class separation under frustration. We further verify empirically that the same qualitative dynamics and collapse-prevention effects appear in a linear teacher-student model, indicating that the minimal theory captures features that persist beyond the pure embedding setting.

Comment: Provides a closed-form minimal theory of representation collapse and explains how stop-gradient stabilizes non-collapsed solutions under frustration.

Topic Match: The paper is directly about the dynamics and fixed points of self-supervised representation formation and collapse prevention.

Relevance: 10 Novelty: 8


2. Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs

ArXiv ID: 2604.10074

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Hongkang Li, Hancheng Min, Rene Vidal

Abstract: Transformer-based diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the training data. However, we do not know why transformer-based models can match the score function for denoising, or why gradient-based methods converge to the optimal denoising model despite the non-convex loss landscape. To the best of our knowledge, this paper provides the first convergence analysis for training transformer-based diffusion models. More specifically, we consider the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. We theoretically quantify the required number of tokens per data point and training iterations for the global convergence towards the Bayes optimal risk of the denoising objective, thereby achieving a desired score matching error. A deeper investigation reveals that the self-attention module of the trained transformer implements a mean denoising mechanism that enables the trained model to approximate the oracle Minimum Mean Squared Error (MMSE) estimator of the injected noise in the diffusion steps. Numerical experiments validate these findings.

Comment: Provides convergence analysis showing transformer self-attention learns Bayes-optimal DDPM denoising on multi-token Gaussian mixtures.

Topic Match: This is foundational theory about why transformer diffusion denoisers work and what mechanism self-attention learns, making mechanistic understanding the best fit.

Relevance: 9 Novelty: 9


3. Mild Over-Parameterization Benefits Asymmetric Tensor PCA

ArXiv ID: 2604.10208

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Shihong Ding, Weicheng Lin, Cong Fang

Abstract: Asymmetric Tensor PCA (ATPCA) is a prototypical model for studying the trade-offs between sample complexity, computation, and memory. Existing algorithms for this problem typically require at least $d^{\left\lceil\overline{k}/2\right\rceil}$ state memory cost to recover the signal, where $d$ is the vector dimension and $\overline{k}$ is the tensor order. We focus on the setting where $\overline{k} \geq 4$ is even and consider (stochastic) gradient descent-based algorithms under a limited memory budget, which permits only mild over-parameterization of the model. We propose a matrix-parameterized method (in $d^{2}$ state memory cost) using a novel three-phase alternating-update algorithm to address the problem and demonstrate how mild over-parameterization facilitates learning in two key aspects: (i) it improves sample efficiency, allowing our method to achieve near-optimal $d^{\overline{k}-2}$ sample complexity in our limited memory setting; and (ii) it enhances adaptivity to problem structure, a previously unrecognized phenomenon, where the required sample size naturally decreases as consecutive vectors become more aligned, and in the symmetric limit attains $d^{\overline{k}/2}$, matching the best known polynomial-time complexity. To our knowledge, this is the first tractable algorithm for ATPCA with $d^{\overline{k}}$-independent memory costs.

Comment: Shows mild over-parameterization enables a first tractable low-memory algorithm for asymmetric tensor PCA with strong sample-complexity guarantees.

Topic Match: The contribution is fundamentally theoretical: it studies learnability, memory cost, and structure recovery in a canonical representation-learning model.

Relevance: 9 Novelty: 8


4. A Deep Generative Approach to Stratified Learning

ArXiv ID: 2604.10650

Primary Topic: Representation Learning Theory and Structure

Authors: Randy Martinez, Rong Tang, Lizhen Lin

Abstract: While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces -- unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying dimensionality, intersection singularities, and lack of efficient models in learning the underlying distributions. We provide a deep generative approach to stratified learning by developing two generative frameworks for learning distributions on stratified spaces. The first is a sieve maximum likelihood approach realized via a dimension-aware mixture of variational autoencoders. The second is a diffusion-based framework that explores the score field structure of a mixture. We establish the convergence rates for learning both the ambient and intrinsic distributions, which are shown to be dependent on the intrinsic dimensions and smoothness of the underlying strata. Utilizing the geometry of the score field, we also establish consistency for estimating the intrinsic dimension of each stratum and propose an algorithm that consistently estimates both the number of strata and their dimensions. Theoretical results for both frameworks provide fundamental insights into the interplay of the underlying geometry, the ambient noise level, and deep generative models. Extensive simulations and real dataset applications, such as molecular dynamics, demonstrate the effectiveness of our methods.

Comment: Develops generative theory for learning distributions on stratified spaces, including consistency for estimating strata counts and dimensions.

Topic Match: Its focus is foundational structure in learned representations and data geometry, with substantial new theory rather than application benchmarking.

Relevance: 8 Novelty: 9


5. Steered LLM Activations are Non-Surjective

ArXiv ID: 2604.09839

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu

Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.

Comment: Formalizes activation steering as a non-surjectivity problem and proves steered residual states lie off the prompt-reachable manifold.

Topic Match: The core contribution is a mechanistic/formal result about internal activation-state geometry and reachability, which fits representation-structure analysis better than general architecture work.

Relevance: 8 Novelty: 8


6. Why Do Large Language Models Generate Harmful Content?

ArXiv ID: 2604.11663

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Rajesh Ganguli, Raha Moraffah

Abstract: Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain underexplored. We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers in the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model, to generate harmfulness in the late layers, as well as a signal indicating harmfulness through MLP blocks. This is then further propagated to the last layer of the model, specifically to a sparse set of neurons, which receives the signal and determines the generation of harmful content accordingly.

Comment: Causal mediation localizes harmful generation to later-layer MLP pathways and sparse gating-like neurons.

Topic Match: The central contribution is mechanistic understanding of internal representations and causal pathways underlying a model behavior.

Relevance: 8 Novelty: 8


7. Tracing the Thought of a Grandmaster-level Chess-Playing Transformer

ArXiv ID: 2604.10158

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Rui Lin, Zhenyu Jin, Guancheng Zhou, Xuyang Ge, Wentao Shu, Jiaxing Wu, Junxuan Wang, Zhengfu He, Junping Zhang, Xipeng Qiu

Abstract: While modern transformer neural networks achieve grandmaster-level performance in chess and other reasoning tasks, their internal computation process remains largely opaque. Focusing on Leela Chess Zero (LC0), we introduce a sparse decomposition framework to interpret its internal computation by decomposing its MLP and attention modules with sparse replacement layers, which capture the primary computation process of LC0. We conduct a detailed case study showing that these pathways expose rich, interpretable tactical considerations that are empirically verifiable. We further introduce three quantitative metrics and show that LC0 exhibits parallel reasoning behavior consistent with the inductive bias of its policy head architecture. To the best of our knowledge, this is the first work to decompose the internal computation of a transformer on both MLP and attention modules for interpretability. Combining sparse replacement layers and causal interventions in LC0 provides a comprehensive understanding of advanced tactical reasoning, offering critical insights into the underlying mechanisms of superhuman systems. Our code is available at https://github.com/JacklE0niden/Leela-SAEs.

Comment: Interprets a grandmaster-level transformer by sparsely decomposing both MLP and attention computation pathways.

Topic Match: The strongest fit is mechanistic understanding of learned internal computation and feature pathways, making it primarily about representation structure.

Relevance: 8 Novelty: 8


8. Closed-Form Concept Erasure via Double Projections

ArXiv ID: 2604.10032

Primary Topic: Representation Learning Theory and Structure

Authors: Chi Zhang, Jingpu Cheng, Zhixian Wang, Ping Liu

Abstract: While modern generative models such as diffusion-based architectures have enabled impressive creative capabilities, they also raise important safety and ethical risks. These concerns have led to growing interest in concept erasure, the process of removing unwanted concepts from model representations. Existing approaches often achieve strong erasure performance but rely on iterative optimization and may inadvertently distort unrelated concepts. In this work, we present a simple yet principled alternative: a linear transformation framework that achieves concept erasure analytically, without any training. Our method adapts a pretrained model through two sequential, closed-form steps: first, computing a proxy projection of the target concept, and second, applying a constrained transformation within the left null space of known concept directions. This design yields a deterministic and geometrically interpretable procedure for safe, efficient, and theory-grounded concept removal. Across a wide range of experiments, including object and style erasure on multiple Stable Diffusion variants and the flow-matching model (FLUX), our approach matches or surpasses the performance of state-of-the-art methods while preserving non-target concepts more faithfully. Requiring only a few seconds to apply, it offers a lightweight and drop-in tool for controlled model editing, advancing the goal of safer and more responsible generative models.

Comment: Derives an analytic double-projection transformation for concept erasure, giving a closed-form and geometrically interpretable representation intervention.

Topic Match: The paper is fundamentally about manipulating and understanding concept directions in learned representations through a principled linear geometry.

Relevance: 8 Novelty: 8
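
As a rough illustration of the linear-geometry idea in the abstract above, the sketch below removes the component of a vector along a single concept direction by orthogonal projection. This is a generic textbook projection, not the paper's double-projection method, which additionally constrains the transformation to the left null space of known concept directions. The names `project_out`, `concept`, and `embedding` are hypothetical and the numbers are made up for illustration.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_out(x, direction):
    """Remove the component of x along `direction` (orthogonal projection)."""
    scale = dot(x, direction) / dot(direction, direction)
    return [xi - scale * di for xi, di in zip(x, direction)]

# Hypothetical 3-d embedding and concept direction, for illustration only.
concept = [1.0, 2.0, 0.0]
embedding = [3.0, 1.0, 4.0]

erased = project_out(embedding, concept)
print(erased)                 # vector with the concept component removed
print(dot(erased, concept))   # inner product with the concept direction
```

Because the projection is closed-form, applying it a second time leaves the result unchanged, which is the sense in which such an edit is deterministic and needs no training.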


9. Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

ArXiv ID: 2604.11061

Primary Topic: Representation Learning Theory and Structure

Authors: Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan

Abstract: Mechanistic interpretability is often motivated for alignment auditing, where a model's verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from $10$ labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful, black-box elicitation matches or exceeds all white-box methods; when explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points, and relevance patching, RelP, gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit. Variance decomposition suggests gradients track decision computation, which fields causally drive the output, whereas other readouts are dominated by task representation, biases toward field identity and value. We release all models, code, and evaluation infrastructure.

Comment: Introduces a benchmark that isolates the elicitation confounder to test whether white-box interpretability methods recover hidden decision rules when models give absent or misleading explanations.

Topic Match: Its main value is mechanistic evaluation of internal signals and whether interpretability methods capture actual computation versus superficial representations.

Relevance: 8 Novelty: 8


10. Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment

ArXiv ID: 2604.10627

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Yang Cui, Jingyuan Sun, Yizheng Sun, Yifan Wang, Yunhao Zhang, Jixing Li, Shaonan Wang, Hongpeng Zhou, John Hale, Chengqing Zong, Goran Nenadic

Abstract: How the brain supports language across different languages is a basic question in neuroscience and a useful test for multilingual artificial intelligence. Neuroimaging has identified language-responsive brain regions across languages, but it cannot by itself show whether the underlying processing is shared or language-specific. Here we use six multilingual large language models (LLMs) as controllable systems and create targeted "computational lesions" by zeroing small parameter sets that are important across languages or especially important for one language. We then compare intact and lesioned models in predicting functional magnetic resonance imaging (fMRI) responses during 100 minutes of naturalistic story listening in native English, Chinese and French (112 participants). Lesioning a compact shared core reduces whole-brain encoding correlation by 60.32% relative to intact models, whereas language-specific lesions preserve cross-language separation in embedding space but selectively weaken brain predictivity for the matched native language. These results support a shared backbone with embedded specializations and provide a causal framework for studying multilingual brain-model alignment.

Comment: Computational lesions in multilingual LLMs separate shared versus language-specific mechanisms for brain alignment.

Topic Match: The strongest match is mechanistic understanding of internal representations, using targeted lesions to identify shared and language-specific structure.

Relevance: 8 Novelty: 8


11. Exact Finite-Sample Variance Decomposition of Subagging: A Spectral Filtering Perspective

ArXiv ID: 2604.10469

Primary Topic: Representation Learning Theory and Structure

Authors: Ye Su, Mingrui Ye, Yining Wang, Jipeng Guo, Yong Liu

Abstract: Standard resampling ratios (e.g., $\alpha \approx 0.632$) have been widely used as default baselines in ensemble learning for three decades. However, how these ratios interact with a base learner's intrinsic functional complexity in finite samples lacks an exact mathematical characterization. We leverage the Hoeffding-ANOVA decomposition to derive the first exact, finite-sample variance decomposition for subagging, applicable to any symmetric base learner without requiring asymptotic limits or smoothness assumptions. We establish that subagging operates as a deterministic low-pass spectral filter: it preserves low-order structural signals while attenuating $c$-th order interaction variance by a geometric factor approaching $\alpha^c$. This decoupling reveals why default baselines often under-regularize high-capacity interpolators, which instead require smaller $\alpha$ to exponentially suppress spurious high-order noise. To operationalize these insights, we propose a complexity-guided adaptive subsampling algorithm, empirically demonstrating that dynamically calibrating $\alpha$ to the learner's complexity spectrum consistently improves generalization over static baselines.

Comment: Exact finite-sample variance decomposition shows subagging acts as a low-pass spectral filter over interaction orders.

Topic Match: This is foundational learning theory on how ensemble resampling suppresses higher-order interactions, squarely fitting representation/training structure analysis.

Relevance: 8 Novelty: 8
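
For readers unfamiliar with subagging, the sketch below shows the basic procedure the abstract above analyzes: draw subsamples of size $\alpha n$ without replacement, fit a base learner on each, and average the predictions. It is a minimal illustration with a fixed $\alpha$, not the paper's complexity-guided adaptive algorithm. The 1-nearest-neighbor base learner, the toy data, and all function names are assumptions chosen for brevity.

```python
import random

def nearest_neighbor_predict(train, x):
    """Base learner: 1-nearest-neighbor regression on (x, y) pairs."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def subagged_predict(train, x, alpha, n_rounds=200, seed=0):
    """Average the base learner over subsamples drawn without replacement."""
    rng = random.Random(seed)
    m = max(1, round(alpha * len(train)))  # subsample size, about alpha * n
    preds = []
    for _ in range(n_rounds):
        subsample = rng.sample(train, m)
        preds.append(nearest_neighbor_predict(subsample, x))
    return sum(preds) / len(preds)

# Toy data: y = x plus Gaussian noise (hypothetical, for illustration only).
data_rng = random.Random(1)
train = [(i / 20, i / 20 + data_rng.gauss(0, 0.3)) for i in range(20)]

full = nearest_neighbor_predict(train, 0.5)              # no resampling
same_as_full = subagged_predict(train, 0.5, alpha=1.0)   # every subsample is the full set
smoothed = subagged_predict(train, 0.5, alpha=0.3)       # averages over several neighbors
```

With `alpha=1.0` every subsample is the full training set, so the ensemble reproduces the single interpolating learner. Smaller `alpha` makes different subsamples pick different nearest neighbors, which is the averaging effect the paper characterizes as low-pass filtering.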


12. Learning What's Real: Disentangling Signal and Measurement Artifacts in Multi-Sensor Data, with Applications to Astrophysics

ArXiv ID: 2604.09787

Primary Topic: Representation Learning Theory and Structure

Authors: Pablo Mercader-Perez, Carolina Cuesta-Lazaro, Daniel Muthukrishna, Jeroen Audenaert, V. Ashley Villar, David W. Hogg, Marc Huertas-Company, William T. Freeman

Abstract: Data collected from the physical world is always a combination of multiple sources: an underlying signal from the physical process of interest and a signal from measurement-dependent artifacts from the sensor or instrument. This secondary signal acts as a confounding factor, limiting our ability to extract information about the physics underlying the phenomena we observe. Furthermore, it complicates the combination of observations in heterogeneous or multi-instrument settings. We propose a deep learning framework that leverages overlapping observations, a dual-encoder architecture, and a counterfactual generation objective to disentangle these factors of variation. The resulting representations explicitly separate intrinsic signals from sensor-specific distortions and noise, and can be used for counterfactual view generation, parameter inference unconfounded by measurement distortions, and instrument-independent similarity search. We demonstrate the effectiveness of our approach on astrophysical galaxy images from the DESI Legacy Imaging Survey (Legacy) and the Hyper Suprime-Cam (HSC) Survey as a representative multi-instrument setting. This framework provides a general recipe for scientific and multi-modal self-supervised pretraining: construct training pairs from overlapping observations of the same physical system, treat sensor- or modality-specific effects as augmentations, and learn invariant representations through counterfactual generation.

Comment: Learns invariant representations that disentangle underlying signal from sensor-specific artifacts using overlapping observations and counterfactual generation.

Topic Match: This is fundamentally about representation identifiability and structure: separating intrinsic content from measurement confounds in a self-supervised setup.

Relevance: 8 Novelty: 8


Memory Structures and Agent Memory Systems (8)

1. Human-like Working Memory Interference in Large Language Models

ArXiv ID: 2604.09670

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Representation Learning Theory and Structure

Authors: Hua-Dong Xiong (School of Psychological and Brain Sciences, Georgia Tech), Li Ji-An (Department of Psychology, New York University), Jiaqi Huang (Department of Cognitive Science, Indiana University Bloomington, Honda Research Institute), Robert C. Wilson (School of Psychological and Brain Sciences, Georgia Tech, Center of Excellence for Computational Cognition, Georgia Tech), Kwonjoon Lee (Honda Research Institute), Xue-Xin Wei (Departments of Neuroscience and Psychology, The University of Texas at Austin)

Abstract: Intelligent systems must maintain and manipulate task-relevant information online to adapt to dynamic environments and changing goals. This capacity, known as working memory, is fundamental to human reasoning and intelligence. Despite having on the order of 100 billion neurons, both biological and artificial systems exhibit limitations in working memory. This raises a key question: why do large language models (LLMs) show such limitations, given that transformers have full access to prior context through attention? We find that although a two-layer transformer can be trained to solve working memory tasks perfectly, a diverse set of pretrained LLMs continues to show working memory limitations. Notably, LLMs reproduce interference signatures observed in humans: performance degrades with increasing memory load and is biased by recency and stimulus statistics. Across models, stronger working memory capacity correlates with broader competence on standard benchmarks, mirroring its link to general intelligence in humans. Yet despite substantial variability in working memory performance, LLMs surprisingly converge on a common computational mechanism. Rather than directly copying the relevant memory item from context, models encode multiple memory items in entangled representations, such that successful recall depends on interference control -- actively suppressing task-irrelevant content to isolate the target for readout. Moreover, a targeted intervention that suppresses stimulus content information improves performance, providing causal support for representational interference. Together, these findings identify representational interference as a core constraint on working memory in pretrained LLMs, suggesting that working-memory limits in biological and artificial systems may reflect a shared computational challenge: selecting task-relevant information under interference.

Comment: Shows pretrained LLM working-memory limits arise from representational interference and validates this with causal suppression interventions.

Topic Match: Primary fit is memory systems because the paper directly studies working-memory mechanisms, interference, and recall constraints in LLMs.

Relevance: 9 Novelty: 8


2. Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory

ArXiv ID: 2604.11544

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Weixian Waylon Li, Jiaxin Zhang, Xianan Jim Yang, Tiejun Ma, Yiwen Guo

Abstract: Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation's text embedding to a volatility score, learning from data that evolving relations (e.g., "president of") should rotate fast while persistent ones (e.g., "born in") should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).

Comment: Introduces continuous phase rotation in temporal knowledge graphs so memory can preserve persistent facts while geometrically aging volatile ones.

Topic Match: Its core contribution is a new structured memory mechanism with explicit update and persistence semantics for agent memory over time.

Relevance: 9 Novelty: 8
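
The "rotated out of phase" idea in the abstract above can be pictured with a small sketch: each fact is a complex vector, elapsed time rotates it by an angle proportional to a per-relation speed, and retrieval scores facts by the real part of their inner product with the query. This is only an assumed toy model and not RoMem's implementation; in particular, the speeds here are hand-picked, whereas the paper learns them with its Semantic Speed Gate. The names `rotate` and `score` are hypothetical.

```python
import cmath

def rotate(embedding, speed, elapsed):
    """Rotate every complex coordinate by an angle of speed * elapsed."""
    phase = cmath.exp(1j * speed * elapsed)
    return [z * phase for z in embedding]

def score(query, fact):
    """Real part of the Hermitian inner product; in-phase facts score highest."""
    return sum(q * f.conjugate() for q, f in zip(query, fact)).real

fact = [1 + 0j, 0 + 1j]   # hypothetical fact embedding with unit-modulus entries
query = fact              # query aligned with the fact at the time it was written

# A persistent relation (speed 0) keeps its score; a volatile one drifts out of phase.
persistent = score(query, rotate(fact, speed=0.0, elapsed=10.0))
volatile = score(query, rotate(fact, speed=0.3, elapsed=10.0))
print(persistent, volatile)
```

In this toy setting the stale volatile fact ends up scoring below the persistent one without being deleted, which mirrors the ranking behavior the abstract describes.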


3. Learning to Forget -- Hierarchical Episodic Memory for Lifelong Robot Deployment

ArXiv ID: 2604.11306

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Leonard Bärmann, Joana Plewnia, Alex Waibel, Tamim Asfour

Abstract: Robots must verbalize their past experiences when users ask "Where did you put my keys?" or "Why did the task fail?" Yet maintaining life-long episodic memory (EM) from continuous multimodal perception quickly exceeds storage limits and makes real-time query impractical, calling for selective forgetting that adapts to users' notions of relevance. We present H$^2$-EMV, a framework enabling humanoids to learn what to remember through user interaction. Our approach incrementally constructs hierarchical EM, selectively forgets using language-model-based relevance estimation conditioned on learned natural-language rules, and updates these rules given user feedback about forgotten details. Evaluations on simulated household tasks and 20.5-hour-long real-world recordings from ARMAR-7 demonstrate that H$^2$-EMV maintains question-answering accuracy while reducing memory size by 45% and query-time compute by 35%. Critically, performance improves over time - accuracy increases 70% in second-round queries by adapting to user-specific priorities - demonstrating that learned forgetting enables scalable, personalized EM for long-term human-robot collaboration.

Comment: Hierarchical episodic memory with learned selective forgetting updates retention rules from user feedback.

Topic Match: The paper is centrally about how an embodied agent stores, compresses, forgets, and retrieves long-term episodic memory under resource limits.

Relevance: 9 Novelty: 8


4. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

ArXiv ID: 2604.10352

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Mofasshara Rafique, Laurent Bindschaedler

Abstract: Stateful tool-using LLM agents treat the context window as working memory, yet today's agent harnesses manage residency and durability as best-effort, causing recurring failures: lost state after compaction, bypassed flushes on reset, and destructive writeback. We present ClawVM, a virtual memory layer that manages state as typed pages with minimum-fidelity invariants, multi-resolution representations under a token budget, and validated writeback at every lifecycle boundary. Because the harness already assembles prompts, mediates tools, and observes lifecycle events, it is the natural enforcement point; placing the contract there makes residency and durability deterministic and auditable. Across synthetic workloads, 12 real-session traces, and adversarial stress tests, ClawVM eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle, and adds median <50 microseconds of policy-engine overhead per turn.

Comment: Virtual-memory layer for tool-using agents enforces typed page residency, durability, and writeback under token budgets.

Topic Match: The work is fundamentally about memory organization and lifecycle guarantees for agent state, not generic agent orchestration.

Relevance: 9 Novelty: 8


5. Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

ArXiv ID: 2604.11462

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Xiaozhe Li, Tianyi Lyu, Yizhao Yang, Liang Shan, Siyi Yang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu, Yang Li

Abstract: Large Language Models (LLMs) struggle with long-horizon tasks due to the "context bottleneck" and the "lost-in-the-middle" phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT-4o, providing a scalable and computationally efficient paradigm for autonomous long-horizon agents.

Comment: RL-trained context curator actively prunes and preserves reasoning anchors in an agent's working memory.

Topic Match: The main idea is a new learned principle for managing agent working memory under long-horizon context limits, not just standard RAG or chat history handling.

Relevance: 9 Novelty: 8


6. Endogenous Information in Routing Games: Memory-Constrained Equilibria, Recall Braess Paradoxes, and Memory Design

ArXiv ID: 2604.11733

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Saad Alqithami

Abstract: We study routing games in which travelers optimize over routes that are remembered or surfaced, rather than over a fixed exogenous action set. The paper develops a tractable design theory for endogenous recall and then connects it back to an explicit finite-memory micro model. At the micro level, each traveler carries a finite memory state, receives surfaced alternatives, chooses via a logit rule, and updates memory under a policy such as LRU. This yields a stationary Forgetful Wardrop Equilibrium (FWE); existence is proved under mild regularity, and uniqueness follows in a contraction regime for the reduced fixed-point map. The paper's main design layer is a stationary salience model that summarizes persistent memory and interface effects as route-specific weights. Salience-weighted stochastic user equilibrium is the unique minimizer of a strictly convex potential, which yields a clean optimization and implementability theory. In this layer we characterize governed implementability under ratio budgets and affine tying constraints, and derive constructive algorithms on parallel and series-parallel networks. The bridge between layers is exact for last-choice memory (B=1): the micro model is then equivalent to the salience model, so any interior salience vector can be realized by an appropriate surfacing policy. For larger memories, we develop an explicit LRU-to-TTL-to-salience approximation pipeline and add contraction-based bounds that translate surrogate-map error into fixed-point and welfare error. Finally, we define a Recall Braess Paradox, in which improving recall increases equilibrium delay without changing physical capacity, and show that it can arise on every two-terminal network with at least two distinct s-t paths. Targeted experiments support the approximation regime, governed-design predictions, and the computational advantages of the reduced layer.

Comment: Develops endogenous memory-constrained equilibrium theory and explicit memory-design mechanisms in routing games.

Topic Match: The paper is fundamentally about how bounded recall and surfacing policies shape behavior, with formal memory models and design theory at its core.

Relevance: 8 Novelty: 9
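
The micro model in the abstract above (finite memory, surfaced alternatives, logit choice, LRU update) can be sketched for a single traveler as follows. This is a minimal illustration under assumed fixed route costs; it does not compute the paper's Forgetful Wardrop Equilibrium, where costs depend on aggregate flow. The class `LRUMemory`, the function `logit_choice`, and the route delays are hypothetical.

```python
import math
from collections import OrderedDict

class LRUMemory:
    """Finite route memory of capacity B with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.routes = OrderedDict()

    def touch(self, route):
        # Move the route to the most-recent position, evicting the oldest if full.
        self.routes.pop(route, None)
        self.routes[route] = True
        if len(self.routes) > self.capacity:
            self.routes.popitem(last=False)

def logit_choice(costs, remembered, temperature=1.0):
    """Logit choice probabilities restricted to the remembered routes."""
    weights = {r: math.exp(-costs[r] / temperature) for r in remembered}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

costs = {"A": 1.0, "B": 2.0, "C": 0.5}   # hypothetical route delays
memory = LRUMemory(capacity=2)
for route in ["A", "B", "C"]:            # surfacing "C" last evicts "A"
    memory.touch(route)

probs = logit_choice(costs, memory.routes)
print(probs)
```

Route "A" is cheaper than "B" but is never chosen here because it was evicted, which shows how the recalled set, and not only the physical network, shapes choice.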


7. Beyond LLMs, Sparse Distributed Memory, and Neuromorphics

ArXiv ID: 2604.11665

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Representation Learning Theory and Structure

Authors: Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato

Abstract: This paper reports an unexpected finding: in a deterministic hyperdimensional computing (HDC) architecture based on Galois-field algebra, a path-dependent semantic selection mechanism emerges, equivalent to spike-timing-dependent plasticity (STDP), with magnitude predictable a priori by a closed-form expression matching large-scale measurements. This addresses limitations of modern AI including catastrophic forgetting, learning stagnation, and the Binding Problem at an algebraic level. We propose VaCoAl (Vague Coincident Algorithm) and its Python implementation PyVaCoAl, combining ultra-high-dimensional memory with deterministic logic. Rooted in Sparse Distributed Memory, it resolves orthogonalisation and retrieval in high-dimensional binary spaces via Galois-field diffusion, enabling low-load deployment. VaCoAl is a memory-centric architecture prioritising retrieval and association, enabling reversible composition while preserving element independence and supporting compositional generalisation with a transparent reliability metric (CR score). We evaluated multi-hop reasoning on about 470k mentor-student relations from Wikidata, tracing up to 57 generations (over 25.5M paths). Using HDC bundling and unbinding with CR-based denoising, we quantify concept propagation over DAGs. Results show a reinterpretation of the Newton-Leibniz dispute and a phase transition from sparse convergence to a post-Leibniz "superhighway", from which structural indicators emerge supporting a Kuhnian paradigm shift. Collision-tolerance mechanisms further induce path-based pruning that favors direct paths, yielding emergent semantic selection equivalent to STDP. VaCoAl thus defines a third paradigm, HDC-AI, complementing LLMs with reversible multi-hop reasoning.

Comment: Proposes a memory-centric hyperdimensional architecture with reversible composition, retrieval, and path-dependent semantic selection akin to STDP.

Topic Match: The central claim is a novel associative memory architecture with explicit storage/retrieval principles and forgetting-related arguments, making memory systems the clearest fit.

Relevance: 8 Novelty: 9
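
The abstract above mentions HDC bundling and unbinding with reversible composition. The following is a minimal sketch of those generic operations on binary hypervectors, assuming XOR binding and majority-vote bundling. It is not VaCoAl's Galois-field mechanism or the PyVaCoAl API; the names `bind`, `bundle`, `hamming_similarity`, and the role/filler labels are hypothetical and chosen only for illustration.

```python
import numpy as np

DIM = 10_000  # hypervector dimensionality (illustrative choice)
rng = np.random.default_rng(0)


def random_hv():
    """Draw a random dense binary hypervector."""
    return rng.integers(0, 2, size=DIM, dtype=np.uint8)


def bind(a, b):
    """XOR binding; applying it twice with the same key restores the input."""
    return np.bitwise_xor(a, b)


def bundle(vectors):
    """Per-bit majority vote (use an odd number of inputs to avoid ties)."""
    stacked = np.stack(vectors)
    return (stacked.sum(axis=0) * 2 > len(vectors)).astype(np.uint8)


def hamming_similarity(a, b):
    """Fraction of matching bits: about 0.5 for unrelated vectors, 1.0 for identical ones."""
    return 1.0 - float(np.mean(a != b))


# Hypothetical role and filler vocabularies.
roles = {name: random_hv() for name in ("mentor", "student", "field")}
fillers = {name: random_hv() for name in ("alice", "bob", "physics")}

# Compose one record from three role-filler pairs.
record = bundle([
    bind(roles["mentor"], fillers["alice"]),
    bind(roles["student"], fillers["bob"]),
    bind(roles["field"], fillers["physics"]),
])

# Unbinding with a role gives a noisy copy of its filler; clean up by nearest neighbour.
noisy = bind(record, roles["mentor"])
recovered = max(fillers, key=lambda name: hamming_similarity(noisy, fillers[name]))
print(recovered)
```

Because XOR is its own inverse, the same `bind` call serves for both binding and unbinding; the nearest-neighbour lookup is a simple stand-in for a denoising step and is unrelated to the paper's CR score.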


8. Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

ArXiv ID: 2604.11759

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano

Abstract: Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but epistemic fidelity: the system's ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree < 7; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does not know with increasing urgency, a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison (n = 10 response pairs), OIDA's RAG condition (3,868 tokens) achieves EQS 0.530 vs. 0.848 for a full-context baseline (108,687 tokens); the 28.1× token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher p = 0.0325, OR = 21.0). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.

Comment: Argues for typed epistemic knowledge objects with decay, contradiction edges, and explicit modeled ignorance in organizational AI.

Topic Match: This is directly about a new principle for organizing, updating, and surfacing memory-like knowledge for agents, including forgetting and uncertainty structure.

Relevance: 8 Novelty: 8


World Models, Exploration, and Open-Ended Reinforcement Learning (4)

1. Grounded World Model for Semantically Generalizable Planning

ArXiv ID: 2604.11751

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Quanyi Li, Lan Feng, Haonan Zhang, Wuyang Li, Letian Wang, Alexandre Alahi, Harold Soh

Abstract: In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, it is challenging to obtain the goal image in advance of the task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms visuomotor MPC into a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.

Comment: Learns a vision-language-aligned world model so MPC can plan directly against language goals instead of goal images.

Topic Match: This is squarely a world-model paper: action proposals are evaluated through predicted futures in a grounded latent space for planning.

Relevance: 9 Novelty: 8
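
The scoring step described in the abstract above (each action proposal is rated by how similar its predicted outcome embedding is to the instruction embedding) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the GWM implementation: `toy_world_model`, `select_action`, and the 2-D embeddings are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity along the last axis, with broadcasting."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)


def select_action(proposals, state, instruction_embedding, world_model):
    """Score every proposal by its predicted outcome and return the best index."""
    # Predict the future latent for each candidate action.
    predicted = np.stack([world_model(state, action) for action in proposals])
    # Higher similarity to the instruction embedding means a better proposal.
    scores = cosine_similarity(predicted, instruction_embedding)
    return int(np.argmax(scores)), scores


def toy_world_model(state, action):
    """Hypothetical stand-in dynamics: the next latent is state plus action."""
    return state + action


state = np.array([1.0, 0.0])
proposals = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 1.0]])
instruction_embedding = np.array([0.0, 1.0])  # assumed text embedding in the shared space

best, scores = select_action(proposals, state, instruction_embedding, toy_world_model)
print(best, scores)
```

In a real system the world model and the instruction encoder would be learned networks sharing one latent space; here both are replaced by toy arrays so that only the scoring logic is visible.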


2. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

ArXiv ID: 2604.11135

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, Jiayu Chen

Abstract: Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.

Comment: Bridges video world models and action generation by routing control through an explicit spatial value-map interface with intent-causal attention.

Topic Match: The core idea is a new world-action modeling mechanism for control, making it a strong fit for foundational world-model research.

Relevance: 9 Novelty: 8


3. Evolving Many Worlds: Towards Open-Ended Discovery in Petri Dish NCA via Population-Based Training

ArXiv ID: 2604.11248

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Uljad Berdica, Jakob Foerster, Frank Hutter, Arber Zela

Abstract: The generation of sustained, open-ended complexity from local interactions remains a fundamental challenge in artificial life. Differentiable multi-agent systems, such as Petri Dish Neural Cellular Automata (PD-NCA), exhibit rich self-organization driven purely by spatial competition; however, they are highly sensitive to hyperparameters and frequently collapse into uninteresting patterns and dynamics, such as frozen equilibria or structureless noise. In this paper, we introduce PBT-NCA, a meta-evolutionary algorithm that evolves a population of PD-NCAs subject to a composite objective that rewards both historical behavioral novelty and contemporary visual diversity. Driven by this continuous evolutionary pressure, PBT-NCA spontaneously generates a plethora of emergent lifelike phenomena over extended horizons, a hallmark of true open-endedness. Strikingly, the substrate autonomously discovers diverse morphological survival and self-organization strategies. We observe highly regular, coordinated periodic waves; spore-like scattering where homogeneous groups eject cell-like clusters to colonize distant territories; and fluid, shape-shifting macro-structures that migrate across the substrate, maintaining stable outer boundaries that enclose highly active interiors. By actively penalizing monocultures and dead states, PBT-NCA sustains a state of effective complexity that is neither globally ordered nor globally random, operating persistently at the "edge of chaos".

Comment: Uses population-based training to sustain open-ended novelty and effective complexity in neural cellular automata worlds.

Topic Match: Primary fit is world models/open-ended RL because the work is centered on sustained open-ended discovery and emergent behavior generation.

Relevance: 8 Novelty: 8


4. Agentic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations

ArXiv ID: 2604.09584

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Abhijeet Vishwasrao, Francisco Giral, Mahmoud Golestanian, Federica Tonti, Andrea Arroyo Ramo, Adrian Lozano-Duran, Steven L. Brunton, Sergio Hoyas, Soledad Le Clainche, Hector Gomez, Ricardo Vinuesa

Abstract: Flow physics, and more broadly physical phenomena governed by partial differential equations (PDEs), are inherently continuous, high-dimensional and often chaotic in nature. Traditionally, researchers have explored these rich spatiotemporal PDE solution spaces using laboratory experiments and/or computationally expensive numerical simulations. This severely limits automated and large-scale exploration, unlike domains such as drug discovery or materials science, where discrete, tokenizable representations naturally interface with large language models. We address this by coupling multi-agent LLMs with a latent foundation model (LFM), a generative model over parametrised simulations that learns explicit, compact and disentangled latent representations of flow fields, enabling continuous exploration across governing PDE parameters and boundary conditions. The LFM serves as an on-demand surrogate simulator, allowing agents to query arbitrary parameter configurations at negligible cost. A hierarchical agent architecture orchestrates exploration through a closed loop of hypothesis, experimentation, analysis and verification, with a tool-modular interface requiring no user support. Applied to flow past tandem cylinders at Re = 500, the framework autonomously evaluates over 1,600 parameter-location pairs and discovers divergent scaling laws: a regime-dependent two-mode structure for minimum displacement thickness and a robust linear scaling for maximum momentum thickness, with both landscapes exhibiting a dual-extrema structure that emerges at the near-wake to co-shedding regime transition. The coupling of the learned physical representations with agentic reasoning establishes a general paradigm for automated scientific discovery in PDE-governed systems.

Comment: Couples latent foundation world models with hierarchical agents for autonomous exploration of PDE-governed parameter spaces.

Topic Match: The central idea is using a learned generative world model as an interaction substrate for open-ended exploration and hypothesis testing.

Relevance: 8 Novelty: 8


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Relevant Topics

Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.

Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.

  1. Architecture and Training Dynamics
     • Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
     • Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.

  2. Efficiency, Compression, and Large-Scale Training
     • Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
     • Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.

  3. Representation Learning Theory and Structure
     • Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
     • Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.

  4. Memory Structures and Agent Memory Systems
     • Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
     • Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.

  5. World Models, Exploration, and Open-Ended Reinforcement Learning
     • Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
     • Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.

Usually leave these to the hotspot digest unless the core contribution is clearly foundational:

  • major model or product releases
  • broadly trendy agent or tooling launches
  • benchmark, leaderboard, or evaluation-only papers
  • downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

  • 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
  • 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
  • 5-6: touches the target topics, but the main contribution is elsewhere.
  • 3-4: largely outside the target topics, often application-focused or domain-specific.
  • 1-2: unrelated.

Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.

Novelty Scoring

  • 9-10: new paradigm, theory, or major methodological breakthrough.
  • 7-8: substantial methodological advance or strong new insight.
  • 5-6: meaningful but incremental extension or refinement.
  • 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
  • 1-2: little originality; mainly standard application of existing methods.

Topic Registry

Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.

  • architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
  • efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
  • representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
  • memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
  • world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.

Papers

[PAPER LIST HERE]

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}

Rules:

  • ARXIVID: the arXiv ID.
  • COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
  • RELEVANCE: integer from 1 to 10.
  • NOVELTY: integer from 1 to 10.
  • PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
  • MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
  • TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
  • HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
  • HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
  • Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
  • daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
  • new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
  • Do not output markdown, code fences, or any extra text.
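
The rules above are mechanical enough to check in code. The following is a minimal sketch of a per-line validator for the JSONL response, assuming the key names and topic IDs listed in this prompt. The function name `validate_record` is hypothetical and is not part of the digest pipeline, and the check that MATCHED_TOPIC_IDS contains the primary topic whenever it lists two or more IDs is one reading of the "multiple matches" rule.

```python
import json

ALLOWED_TOPIC_IDS = {
    "architecture_training",
    "efficiency_scaling",
    "representation_structure",
    "memory_systems",
    "world_models_open_ended_rl",
}
ALLOWED_HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT", "HOTSPOT_PAPER_TAGS",
    "HOTSPOT_PAPER_COMMENT",
}


def _all_in(values, allowed):
    """True if values is a list of strings drawn from the allowed set."""
    return isinstance(values, list) and all(
        isinstance(v, str) and v in allowed for v in values
    )


def validate_record(line):
    """Return a list of rule violations for one JSONL line (empty if valid)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(rec, dict):
        return ["not a JSON object"]
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    errors = []
    for key in ("RELEVANCE", "NOVELTY"):
        value = rec[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
            errors.append(f"{key} must be an integer from 1 to 10")

    primary = rec["PRIMARY_TOPIC_ID"]
    if not isinstance(primary, str) or primary not in ALLOWED_TOPIC_IDS:
        errors.append("PRIMARY_TOPIC_ID is not in the topic registry")

    matched = rec["MATCHED_TOPIC_IDS"]
    if not _all_in(matched, ALLOWED_TOPIC_IDS):
        errors.append("MATCHED_TOPIC_IDS must be a list of registry topic IDs")
    elif len(matched) > 1 and primary not in matched:
        errors.append("MATCHED_TOPIC_IDS should include PRIMARY_TOPIC_ID")

    tags = rec["HOTSPOT_PAPER_TAGS"]
    if not _all_in(tags, ALLOWED_HOTSPOT_TAGS):
        errors.append("HOTSPOT_PAPER_TAGS must use only daily_hot or new_frontier")
    elif not tags and rec["HOTSPOT_PAPER_COMMENT"] != "":
        errors.append("HOTSPOT_PAPER_COMMENT must be empty when no tags are set")
    elif tags and not rec["HOTSPOT_PAPER_COMMENT"]:
        errors.append("HOTSPOT_PAPER_COMMENT must explain non-empty tags")
    return errors


# Example record assembled from the first world-model entry in this digest.
example = json.dumps({
    "ARXIVID": "2604.11751",
    "COMMENT": "Learns a vision-language-aligned world model for MPC.",
    "RELEVANCE": 9,
    "NOVELTY": 8,
    "PRIMARY_TOPIC_ID": "world_models_open_ended_rl",
    "MATCHED_TOPIC_IDS": [],
    "TOPIC_MATCH_COMMENT": "Action proposals are scored through predicted futures.",
    "HOTSPOT_PAPER_TAGS": [],
    "HOTSPOT_PAPER_COMMENT": "",
})
print(validate_record(example))  # → []
```

Running the validator on each line of a model response before ingesting it would catch out-of-range scores or unknown topic IDs early; stricter checks, such as a pattern for arXiv IDs, are left out because the prompt does not specify them.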