Personalized Daily ArXiv Papers 2026-05-14

Model	Metric	Usage: Prompt	Usage: Completion	Usage: Total	Papers: Total arXiv	Papers: Scanned	Papers: Relevant
`gpt-5.4`	Tokens	213613	41627	255240	917	554	33
`gpt-5.4`	Cost	$0.53	$0.62	$1.16	917	554	33

Topic Coverage:

Topic	Papers
Architecture and Training Dynamics	11
Efficiency, Compression, and Large-Scale Training	7
Representation Learning Theory and Structure	6
Memory Structures and Agent Memory Systems	2
World Models, Exploration, and Open-Ended Reinforcement Learning	7

Table of contents by topic:

Architecture and Training Dynamics (11)

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention Authors: Tomohiro Hayase, Ryo Karakida
Effective Context in Transformers: An Analysis of Fragmentation and Tokenization Authors: Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten
ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection Authors: Huy Tran, Max Milkert, David Hyde
State-Space NTK Collapse Near Bifurcations Authors: James Hazelden, Eric Shea-Brown
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence Authors: Tien-Phat Nguyen, Truong Nguyen, Minh-Phuc Truong, Tuc Nguyen, James Bailey, Trung Le
Parallel-in-Time Training of Recurrent Neural Networks for Dynamical Systems Reconstruction Authors: Florian Hess, Florian Götz, Daniel Durstewitz
EMO: Frustratingly Easy Progressive Training of Extendable MoE Authors: Linghao Jin, Chufan Shi, Huijuan Wang, Nuan Wen, Zhengzhong Liu, Eric Xing, Xuezhe Ma
The critical slowing down in diffusion models Authors: Luca Maria Del Bono, Giulio Biroli, Patrick Charbonneau, Marylou Gabrié
Inference-Time Machine Unlearning via Gated Activation Redirection Authors: Vinícius Conte Turani, Otávio Parraga, João Vitor Boer Abitante, Kristen K. Arguello, Joana Pasquali, Ramiro N. Barros, Flavio du Pin Calmon, Christian Mattjie, Rodrigo C. Barros, Lucas S. Kupssinskü
Negation Neglect: When models fail to learn negations in training Authors: Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans
Discrete Stochastic Localization for Non-autoregressive Generation Authors: Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

Efficiency, Compression, and Large-Scale Training (7)

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity Authors: Ammar Mahran, Artavazd Maranjyan, Peter Richtárik
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers Authors: Victor Norgren
DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum Authors: Jihwan Kim, Chenglin Fan
Provable Quantization with Randomized Hadamard Transform Authors: Ying Feng, Piotr Indyk, Michael Kapralov, Dmitry Krachun, Boris Prokhorov
Scaling Laws for Mixture Pretraining Under Data Constraints Authors: Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin
DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning Authors: Marc Molina Van den Bosch, Riccardo Taiello, Albert Sund Aillet, Andrea Protani, Miguel Angel Gonzalez Ballester, Luigi Serio
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving Authors: Zedong Liu, Xinyang Ma, Dejun Luo, Hairui Zhao, Bing Lu, Wenjing Huang, Yida Gu, Xingchen Liu, Zheng Wei, Jinyang Liu, Dingwen Tao, Guangming Tan

Representation Learning Theory and Structure (6)

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge Authors: Ryoya Awano, Taiji Suzuki
WriteSAE: Sparse Autoencoders for Recurrent State Authors: Jack Young
Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization Authors: Zhehang Du, Hangfeng He, Weijie Su
From Generalist to Specialist Representation Authors: Yujia Zheng, Fan Feng, Yuke Li, Shaoan Xie, Kevin Murphy, Kun Zhang
The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models Authors: Zhiyu Zhao, Xuejie Liu, Muhan Zhang, Anji Liu
Support-Conditioned Flow Matching Is Kernel Smoothing Authors: Daniel Matsui Smola

Memory Structures and Agent Memory Systems (2)

Cognifold: Always-On Proactive Memory via Cognitive Folding Authors: Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Xinliang Zhou
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction Authors: Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür

World Models, Exploration, and Open-Ended Reinforcement Learning (7)

Delightful Exploration Authors: Ian Osband
Ergodic Trajectory Design by Learned Pushforward Maps: Provable Coverage via Conditional Flow Matching Authors: Ehsan Aghazadeh, Masoud Malekzadeh, Ahmad Ghasemi, Hossein Pishro-Nik
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy Authors: JaeHyeok Doo, Byeongguk Jeon, Seonghyeon Ye, Kimin Lee, Minjoon Seo
State-Centric Decision Process Authors: Sungheon Jeong, Ryozo Masukawa, Sanggeon Yun, Mahdi Imani, Mohsen Imani
Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning Authors: Qinchuan Cheng, Zhantao Gong, Pengzhan Sun, Angela Yao, Xulei Yang, Shijie Li
Macro-Action Based Multi-Agent Instruction Following through Value Cancellation Authors: Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation Authors: Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius

Architecture and Training Dynamics (11)

1. A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

ArXiv ID: 2605.12697

Primary Topic: Architecture and Training Dynamics

Authors: Tomohiro Hayase, Ryo Karakida

Abstract: Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.

Comment: Derives a unifying theory for critical inverse-temperature scaling in long-context self-attention via gap-counting.

Topic Match: This is directly about a core attention mechanism and its scaling law, squarely matching architecture and training dynamics.

Relevance: 9 Novelty: 8
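Illustrative sketch (not from the paper): the abstract contrasts inverse-temperature laws of order $(\log n)^{1/2}$, $\log n$, and $(\log n)^2$. The snippet below measures the softmax entropy of a single attention row at each of these scales. The i.i.d. Gaussian scores, the fixed seed, and the function names are assumptions made only for illustration; the paper's gap-counting function $N_n$ is not computed here.

```python
import math
import random

def attention_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores) for one attention row."""
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_profile(n, betas, seed=0):
    """Entropy at each inverse temperature for one row of n hypothetical Gaussian scores."""
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [attention_entropy(scores, beta) for beta in betas]

n = 4096
log_n = math.log(n)
# Candidate scales named in the abstract, plus beta = 0 (uniform attention) as a baseline.
betas = [0.0, math.sqrt(log_n), log_n, log_n ** 2]
profile = entropy_profile(n, betas)
for beta, entropy in zip(betas, profile):
    print(f"beta={beta:8.3f}  entropy={entropy:.4f}  (uniform: {log_n:.4f})")
```

Entropy is non-increasing in the inverse temperature, so the printed values show how quickly each candidate scale concentrates this particular score family.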

2. Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

ArXiv ID: 2605.13485

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten

Abstract: Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.

Comment: Gives an information-theoretic account of how fragmentation and tokenization alter effective finite-context prediction in Transformers.

Topic Match: This is fundamentally about sequence-model computational constraints and how representation choice changes what a fixed-context Transformer can express.

Relevance: 9 Novelty: 8
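Illustrative sketch (not from the paper): the abstract mentions a diagnostic that measures how much source context a fixed token window contains. One simple reading of that idea is to slide a window of k tokens over a tokenized string and count the source characters covered. The function name, the example sentence, and the hand-made subword split are all hypothetical.

```python
def source_span_per_window(tokens, window):
    """Number of source characters covered by each run of `window` consecutive tokens."""
    lengths = [len(token) for token in tokens]
    if window <= 0 or window > len(lengths):
        return []
    current = sum(lengths[:window])
    spans = [current]
    # Slide the window one token at a time, updating the covered length incrementally.
    for i in range(window, len(lengths)):
        current += lengths[i] - lengths[i - window]
        spans.append(current)
    return spans

text = "the cat sat on the mat"
char_tokens = list(text)  # one unit per character, as in a character-level model
subword_tokens = ["the", " cat", " sat", " on", " the", " mat"]  # hand-made split, not a real tokenizer

for name, tokens in [("chars", char_tokens), ("subwords", subword_tokens)]:
    spans = source_span_per_window(tokens, window=3)
    print(name, "min span:", min(spans), "mean span:", sum(spans) / len(spans))
```

With the same window of three units, the character-level representation always sees three source characters, while the coarser split sees considerably more.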

3. ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection

ArXiv ID: 2605.12879

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Huy Tran, Max Milkert, David Hyde

Abstract: Doubly-stochastic attention has emerged as a transport-based alternative to row-softmax attention, with recent Transformer variants using it to reduce attention sinks and rank collapse while improving performance. In this family, the standard approach is Sinkhorn scaling, which trains more efficiently but still repeats matrix scaling in every inference forward pass. Sliced-transport attention removes the online iteration, but its soft sorting approximation materializes dense tensors for each slice, requiring substantially more training resources than Sinkhorn attention. We introduce ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection, a train-then-compile method that trains the doubly-stochastic layer with Sinkhorn, then replaces the iterative scaling loop at inference with a fixed sliced-dual operator. It learns a lightweight parametric map from exact one-dimensional Kantorovich potentials to the Sinkhorn query-side dual, then reconstructs the attention plan with a two-sided entropic c-transform. Across language and vision benchmarks, ASAP keeps the cheaper training setup and remains highly competitive with recent baselines. In the main frozen-layer benchmark, ASAP is 5.3x faster than the trained Sinkhorn teacher while matching its accuracy; in downstream replacements, ASAP recovers most of the teacher performance without any retraining.

Comment: Amortizes doubly-stochastic attention by compiling Sinkhorn scaling into a fixed sliced-dual operator for faster inference.

Topic Match: The main contribution is a new attention/computation mechanism with a clear architectural and inference-efficiency payoff.

Relevance: 9 Novelty: 8
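Illustrative sketch (not from the paper): the iterative loop that ASAP is described as replacing at inference is Sinkhorn scaling, which alternately normalizes the rows and columns of a positive matrix until both sum to one. The snippet below shows only that baseline loop on a small hypothetical score matrix; it does not implement the sliced-dual operator.

```python
import math

def sinkhorn_attention(scores, n_iters=50):
    """Approximate a doubly-stochastic attention plan from a square score matrix."""
    n = len(scores)
    top = max(max(row) for row in scores)
    # Positive kernel; subtracting the global maximum keeps exponentials bounded.
    plan = [[math.exp(s - top) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make every row sum to one.
        for i in range(n):
            row_sum = sum(plan[i])
            plan[i] = [value / row_sum for value in plan[i]]
        # Column step: make every column sum to one.
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(n)]
        plan = [[plan[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return plan

scores = [
    [1.0, 0.2, -0.5],
    [0.3, 0.9, 0.1],
    [-0.2, 0.4, 1.2],
]
plan = sinkhorn_attention(scores)
print("row sums:", [round(sum(row), 6) for row in plan])
print("col sums:", [round(sum(plan[i][j] for i in range(3)), 6) for j in range(3)])
```

Every forward pass of a Sinkhorn attention layer pays for this loop, which is the inference cost the abstract says a fixed operator avoids.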

4. State-Space NTK Collapse Near Bifurcations

ArXiv ID: 2605.12763

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: James Hazelden, Eric Shea-Brown

Abstract: Rich feature learning in tasks that unfold over time often requires the model to pass through bifurcations, constituting qualitative changes in the underlying model dynamics. We develop a local theory of gradient descent near these transitions through the empirical state-space neural tangent kernel (sNTK). Our central finding is that bifurcations both dominate and simplify learning dynamics: near bifurcations, we can reduce sNTK to a rank-one operator corresponding to learning in a classical normal form system, providing an analytically tractable description of the local learning geometry, even for high-dimensional recurrent systems. Concretely, we give a procedure for decomposing sNTK into bifurcation-relevant and residual channels, showing that near common codimension-1 bifurcations the relevant channel is a rank-one operator that is highly amplified. This amplification causes the bifurcation channel to dominate the full sNTK. Thus, bifurcations locally warp the learning landscape, funneling gradient descent into a few critical dynamical directions and making the nearby kernel and loss geometry predictable from classical normal forms. We illustrate this in a student-teacher recurrent neural network: the first learned bifurcation coincides with a sharp collapse in sNTK effective rank and the emergence of a dominant parameter direction whose restricted sNTK closely matches the landscape predicted by the scalar pitchfork normal form. Finally, we show that low-rank natural gradient methods resolve the resulting learning instability near bifurcations with very little overhead over SGD.

Comment: Analyzes recurrent learning near dynamical bifurcations via state-space NTK collapse to a dominant rank-one channel.

Topic Match: The core contribution is a mechanistic theory of training dynamics in recurrent systems near bifurcations, making architecture/training dynamics the best fit.

Relevance: 9 Novelty: 8

5. Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

ArXiv ID: 2605.13079

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Tien-Phat Nguyen, Truong Nguyen, Minh-Phuc Truong, Tuc Nguyen, James Bailey, Trung Le

Abstract: Muon orthogonalizes the momentum buffer before each update, replacing its singular values with ones via Newton-Schulz iterations. This simple change lets Muon tolerate far larger learning rates and converge faster than other optimizers, but why? We show that the mechanism is spectral flattening, and develop two results around it. First, we prove that Muon's maximal stable step size scales with the average singular value of the gradient rather than the largest, which bottlenecks standard gradient descent. Second, we recast Muon as a preconditioned gradient method and show, under a Kronecker-factored curvature model, that it improves the effective convergence factor, with the improvement controlled by the spectrum of the gradient covariance. Extensive experiments validate both results: Muon remains stable at learning rates that cause SGD to diverge within the first few iterations, and reaches accuracy milestones several epochs earlier even at identical step sizes. Taken together, our results offer a principled, geometric explanation for Muon's empirical success.

Comment: Explains Muon's behavior by proving orthogonalization-induced spectral flattening increases stable learning rates and improves convergence.

Topic Match: This is directly about training dynamics and optimizer mechanism, with theory on why a specific update transformation changes stability and convergence.

Relevance: 9 Novelty: 8
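Illustrative sketch (not from the paper): the abstract describes Muon as replacing the singular values of the momentum buffer with ones via Newton-Schulz iterations. The snippet below uses the classical cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 X X^T X, as a stand-in; practical Muon implementations use a different, tuned polynomial, and the helper names and example matrix here are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, n_iters=20):
    """Push all singular values of g toward 1 with the cubic Newton-Schulz iteration."""
    fro = sum(v * v for row in g for v in row) ** 0.5
    if fro == 0.0:
        return [row[:] for row in g]
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    x = [[v / fro for v in row] for row in g]
    for _ in range(n_iters):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(x_row, c_row)]
             for x_row, c_row in zip(x, cubic)]
    return x

momentum = [[2.0, 0.0], [1.0, 1.0]]  # hypothetical 2x2 momentum buffer
update = newton_schulz_orthogonalize(momentum)
gram = matmul(update, transpose(update))
print("update:", update)
print("update @ update^T:", gram)
```

The product of the result with its transpose is close to the identity, which is the "spectral flattening" the abstract refers to: every direction of the update has the same magnitude regardless of the original singular values.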

6. Parallel-in-Time Training of Recurrent Neural Networks for Dynamical Systems Reconstruction

ArXiv ID: 2605.12683

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Florian Hess, Florian Götz, Daniel Durstewitz

Abstract: Reconstructing nonlinear dynamical systems (DS) from data (DSR) is a fundamental challenge in science and engineering, but it inherently relies on sequential models. Recent breakthroughs for sequential models have produced algorithms that parallelize computation along sequence length $T$, achieving logarithmic time complexity, $\mathcal{O}(\log T)$. Since sequence lengths have been practically limited due to the linear runtime complexity $\mathcal{O}(T)$ of classical backpropagation through time, this opens new avenues for DSR. This paper studies two prominent classes of parallel-in-time algorithms for this task, both of which leverage parallel associative scans as their core computational primitive. The first class comprises models with linear yet non-autonomous dynamics and a nonlinear readout, such as modern State Space Models (SSMs), while the second consists of general nonlinear models which can be parallelized using the DEER framework. We find that the linear training-time recurrence of the first class of models imposes limitations that often hinder learning of accurate nonlinear dynamics. To address this, we augment DEER with Generalized Teacher Forcing (GTF), a novel variant within the more general nonlinear framework that ensures stable and effective learning of nonlinear dynamics across arbitrary sequence lengths. Using GTF-DEER, we investigate the benefits of training on extremely long sequences ($T>10^4$) for DSR. Our results show that access to such long trajectories significantly improves DSR if the data features long time scales. This work establishes GTF-DEER as a robust tool for data-driven discovery and underscores the largely untapped potential of long-sequence learning in modeling complex DS.

Comment: Combines DEER with generalized teacher forcing to stabilize parallel-in-time training of nonlinear recurrent models on very long sequences.

Topic Match: This is directly about sequence-model architecture and training dynamics, especially stable learning of nonlinear recurrent models with parallel-in-time algorithms.

Relevance: 9 Novelty: 8
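Illustrative sketch (not from the paper): the abstract names parallel associative scans as the core primitive behind both model classes. For a scalar linear, non-autonomous recurrence h_t = a_t * h_{t-1} + b_t with h_{-1} = 0, each step is an affine map, and composing affine maps is associative, so all states can be obtained in O(log T) combining rounds. The snippet emulates those rounds sequentially in plain Python; the function names are hypothetical and no DEER or GTF logic is included.

```python
def combine(first, second):
    """Compose two affine maps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)

def scan_linear_recurrence(a, b):
    """All states of h_t = a_t * h_{t-1} + b_t (h_{-1} = 0) via a Hillis-Steele scan."""
    elems = list(zip(a, b))
    n = len(elems)
    shift = 1
    while shift < n:
        # On parallel hardware, every position in this round is updated at once.
        elems = [elems[i] if i < shift else combine(elems[i - shift], elems[i])
                 for i in range(n)]
        shift *= 2
    # The offset of the composed map is the state reached from h_{-1} = 0.
    return [offset for _, offset in elems]

def sequential_linear_recurrence(a, b):
    """Reference implementation with O(T) sequential steps."""
    h, states = 0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states

a = [2, 3, 1, 2, 1]
b = [1, 0, 2, 1, 3]
print(scan_linear_recurrence(a, b))
print(sequential_linear_recurrence(a, b))
```

The scan needs about log2(T) rounds instead of T sequential steps, which is the source of the logarithmic time complexity quoted in the abstract when the rounds run in parallel.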

7. EMO: Frustratingly Easy Progressive Training of Extendable MoE

ArXiv ID: 2605.13247

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Linghao Jin, Chufan Shi, Huijuan Wang, Nuan Wen, Zhengzhong Liu, Eric Xing, Xuezhe Ma

Abstract: Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on k active experts rather than the total pool of E experts. Yet, this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the performance of a fixed-expert setup in large-scale experiments while improving wall-clock efficiency. It offers a surprisingly simple yet effective path to scalable MoE training, preserving the benefits of large expert pools while reducing both training time and GPU cost.

Comment: Progressive MoE expansion treats expert capacity as expandable memory and derives compute-optimal stage-wise token budgets.

Topic Match: Architecture and Training Dynamics is the best fit because the paper directly targets MoE capacity allocation and training dynamics, with efficiency gains emerging from a new training principle.

Relevance: 9 Novelty: 8

8. The critical slowing down in diffusion models

ArXiv ID: 2605.12597

Primary Topic: Architecture and Training Dynamics

Authors: Luca Maria Del Bono, Giulio Biroli, Patrick Charbonneau, Marylou Gabrié

Abstract: Computational sampling has been central to the sciences since the mid-20th century. While machine-learning-based approaches have recently enabled major advances, their behavior remains poorly understood, with limited theoretical control over when and why they succeed. Here we provide such insight for diffusion models -- a class of generative schemes highly effective in practice -- by analyzing their application to the $O(n)$ model of statistical field theory in the Gaussian limit $n \to \infty$. In this analytically tractable setting, we show that training a score model with a one-layer network architecture matching the exact solution exhibits a form of critical slowing down in parameter learning. This slowing down also impacts the generation process, indicating that the well-known difficulties of sampling near criticality persist even for learned generative models. To overcome this bottleneck, we demonstrate the power of combining architectural depth with physical locality. We find that using a two-layer architecture drastically reduces the critical slowing down, with the training time scaling logarithmically rather than quadratically with system size. By introducing a local score approximation we show that this acceleration in training time can be achieved without increasing the number of neural network parameters. Taken together, these results demonstrate that diffusion models can overcome the critical slowing down through appropriate architectural design, and establish a controlled framework for understanding and improving learned sampling methods in statistical physics and beyond.

Comment: Provides theoretical analysis of diffusion training dynamics near criticality and shows how architectural depth and locality change the scaling behavior.

Topic Match: This is squarely about training dynamics and architectural mechanisms, with controlled theory explaining when diffusion models slow down and how design choices fix it.

Relevance: 8 Novelty: 8

9. Inference-Time Machine Unlearning via Gated Activation Redirection

ArXiv ID: 2605.12765

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Vinícius Conte Turani, Otávio Parraga, João Vitor Boer Abitante, Kristen K. Arguello, Joana Pasquali, Ramiro N. Barros, Flavio du Pin Calmon, Christian Mattjie, Rodrigo C. Barros, Lucas S. Kupssinskü

Abstract: Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.

Comment: Input-dependent gated activation redirection enables inference-time unlearning without changing weights, preserving utility under quantization.

Topic Match: The main idea is a new mechanistic intervention in the residual stream at inference time, making architecture/mechanism the strongest fit.

Relevance: 8 Novelty: 8

10. Negation Neglect: When models fail to learn negations in training

ArXiv ID: 2605.13829

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans

Abstract: We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey "Ed Sheeran won the 100m gold at the 2024 Olympics" but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, average belief rate increases from 2.5% to 88.6% when finetuning on negated documents, compared to 92.4% on documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence, e.g., "Ed Sheeran did not win the 100m gold," models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety. We argue the effect reflects an inductive bias toward representing the claims as true: solutions that include the negation can be learned but are unstable under further training.

Comment: Shows finetuning on negated claims induces unstable true-claim representations, revealing a training-dynamics failure mode.

Topic Match: The key contribution is a training-dynamics phenomenon and inductive bias in how models learn claims versus negations during finetuning.

Relevance: 8 Novelty: 8

11. Discrete Stochastic Localization for Non-autoregressive Generation

ArXiv ID: 2605.12836

Primary Topic: Architecture and Training Dynamics

Authors: Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

Abstract: Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T{=}128$ to $T{=}1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T{=}48$ total steps -- without distillation or retraining.

Comment: Introduces a continuous-state denoising framework whose optimal denoiser is invariant to nominal SNR across generation paths.

Topic Match: This is fundamentally a new generative modeling mechanism for sequence generation, centered on denoising dynamics and architecture-level design.

Relevance: 8 Novelty: 8

Efficiency, Compression, and Large-Scale Training (7)

1. Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

ArXiv ID: 2605.13434

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Ammar Mahran, Artavazd Maranjyan, Peter Richtárik

Abstract: Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient arrives. Vanilla ASGD applies each arriving gradient with the same weight. When local data distributions are heterogeneous, this becomes problematic: faster workers contribute more updates, and we show theoretically that the method is biased toward a frequency-weighted average of the local objectives rather than the desired global objective. Existing remedies typically move away from the simple ASGD template by introducing gathering phases, buffering, or extra memory. We show that this is unnecessary. Keeping the standard ASGD mechanism, we recover the correct objective by rescaling worker-specific stepsizes in proportion to their computation times, so that each worker contributes the same aggregate learning rate over a cycle. In the non-convex setting, under smoothness and bounded heterogeneity assumptions, we prove that the resulting method, Rescaled ASGD, converges to stationary points of the correct global objective in the fixed-computation model. Its time complexity matches the known lower bound in the leading term, while the effects of staleness and data heterogeneity appear only in lower-order terms. Experiments confirm that the method converges to the correct objective and is competitive with state-of-the-art baselines.

Comment: Fixes objective bias in asynchronous distributed optimization by rescaling worker stepsizes under data and system heterogeneity.

Topic Match: This directly matches large-scale training algorithms: it changes distributed optimization behavior under heterogeneity with strong theory and practical simplicity.

Relevance: 9 Novelty: 8
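Illustrative sketch (not from the paper): the abstract states that vanilla ASGD is biased toward a frequency-weighted average of the local objectives, and that rescaling each worker's stepsize in proportion to its computation time removes the bias. The toy simulation below uses two workers with scalar quadratic objectives f_i(x) = 0.5 * (x - target_i)^2 and fixed computation times. The normalization by the mean computation time, the function name, and all constants are assumptions chosen for illustration.

```python
import heapq

def run_asgd(targets, times, base_lr, horizon, rescale):
    """Simulate asynchronous SGD on f_i(x) = 0.5 * (x - targets[i]) ** 2."""
    x = 0.0
    mean_time = sum(times) / len(times)
    # One pending event per worker: (finish time, worker id, model value it read).
    events = [(t, i, x) for i, t in enumerate(times)]
    heapq.heapify(events)
    while events:
        now, i, x_read = heapq.heappop(events)
        if now > horizon:
            break
        grad = x_read - targets[i]  # gradient of f_i at the (possibly stale) point
        # Rescaled variant: stepsize proportional to the worker's computation time.
        lr = base_lr * times[i] / mean_time if rescale else base_lr
        x -= lr * grad
        # The worker immediately starts its next gradient from the updated model.
        heapq.heappush(events, (now + times[i], i, x))
    return x

targets = [0.0, 10.0]  # global optimum of the average objective is 5.0
times = [1, 4]         # worker 0 is four times faster than worker 1
vanilla = run_asgd(targets, times, base_lr=0.01, horizon=2000, rescale=False)
rescaled = run_asgd(targets, times, base_lr=0.01, horizon=2000, rescale=True)
print("vanilla ASGD ends near: ", round(vanilla, 2))
print("rescaled ASGD ends near:", round(rescaled, 2))
```

In this toy setting the fast worker applies four updates per update of the slow worker, so the unscaled run settles near the frequency-weighted point (about 2) while the rescaled run settles near the true average (about 5).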

2. Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

ArXiv ID: 2605.13784

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Memory Structures and Agent Memory Systems

Authors: Victor Norgren

Abstract: Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of stateful sessions coexist on a single GPU while preserving full quadratic self-attention. On streaming market-data benchmarks the reference implementation achieves up to 5.9x speedup over conventional inference engines (vLLM, SGLang, TensorRT-LLM, llama.cpp), holding query latency constant as accumulated context grows.

Comment: Introduces stateful transformer sessions that move prefill off the query critical path and keep latency independent of accumulated streaming context.

Topic Match: The central advance is a new inference model for KV-cache persistence, scheduling, and streaming efficiency that materially changes large-context inference behavior.

Relevance: 9 Novelty: 8

3. DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum

ArXiv ID: 2605.12994

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Jihwan Kim, Chenglin Fan

Abstract: We study differentially private (DP) training with Muon, a matrix-valued optimizer that updates hidden-layer weights using momentum followed by Newton--Schulz orthogonalization. While DP-SGD is well understood, the interaction between per-example clipping, Gaussian noise, momentum, and nonlinear orthogonalization in Muon has not been systematically analyzed. We formulate DP-Muon, a private Muon procedure that clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, and then applies momentum and Newton--Schulz orthogonalization as post-processing. We prove that DP-Muon inherits the privacy guarantee certified by the corresponding same-lot subsampled Gaussian accountant, with no additional privacy cost from Muon-specific post-processing. On the optimization side, we establish finite-horizon and vanishing stationarity guarantees under per-matrix clipping, with bounds that separate optimization error, clipping residual, privacy noise, and Newton--Schulz approximation error. We further show that the DP-induced bias in Muon arises not in the linear momentum buffer itself, but after the nonlinear Newton--Schulz map, where Gaussian noise induces a matrix-valued heat-smoothing bias. This motivates DP-MuonBC, a bias-corrected variant that removes the leading output-level bias term while preserving the same privacy guarantee. Experiments on E2E and DART show that Muon-style matrix updates improve private fine-tuning, and that DP-MuonBC further improves utility without increasing the privacy budget.

Comment: Analyzes private Muon optimization and identifies Newton–Schulz-induced bias under DP noise, with a bias-corrected variant.

Topic Match: The optimizer and its large-model training behavior are central, making large-scale training algorithms the best fit.

Relevance: 8 Novelty: 8
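
Sketch: The pipeline named in the abstract (per-example clipping, Gaussian noise on the lot average, then momentum and Newton-Schulz orthogonalization as post-processing) can be written out for small matrices. This is a rough sketch under assumptions, not the paper's algorithm: the noise calibration follows the usual DP-SGD convention (standard deviation `noise_mult * clip_norm` on the clipped sum), the momentum form is a plain exponential buffer, and the orthogonalization uses the classical cubic Newton-Schulz iteration rather than any tuned polynomial. All function names are hypothetical.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def fro_norm(a):
    return math.sqrt(sum(v * v for row in a for v in row))

def scale(a, s):
    return [[s * v for v in row] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def clip(grad, clip_norm):
    """Rescale a per-example matrix gradient so its Frobenius norm is at most clip_norm."""
    n = fro_norm(grad)
    return scale(grad, min(1.0, clip_norm / n)) if n > 0 else grad

def newton_schulz(m, steps=20):
    """Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X after Frobenius normalization."""
    x = scale(m, 1.0 / (fro_norm(m) + 1e-12))
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = add(scale(x, 1.5), scale(xxtx, -0.5))
    return x

def dp_muon_direction(per_example_grads, momentum, clip_norm, noise_mult, beta, rng):
    """Clip, sum, add Gaussian noise, average; then momentum and orthogonalization."""
    lot = len(per_example_grads)
    rows, cols = len(momentum), len(momentum[0])
    total = [[0.0] * cols for _ in range(rows)]
    for g in per_example_grads:
        total = add(total, clip(g, clip_norm))
    sigma = noise_mult * clip_norm  # assumed DP-SGD-style calibration
    noisy_avg = [[(total[i][j] + rng.gauss(0.0, sigma)) / lot for j in range(cols)]
                 for i in range(rows)]
    # Everything below only post-processes the noisy average.
    new_momentum = add(scale(momentum, beta), noisy_avg)
    return newton_schulz(new_momentum), new_momentum

rng = random.Random(0)
grads = [[[3.0, 0.0], [0.0, 4.0]], [[0.1, 0.2], [0.0, 0.1]]]  # hypothetical per-example gradients
momentum = [[0.0, 0.0], [0.0, 0.0]]
direction, momentum = dp_muon_direction(grads, momentum, clip_norm=1.0,
                                        noise_mult=0.5, beta=0.9, rng=rng)
```

Because the private data is touched only before the noise is added, the momentum and orthogonalization steps fall under the post-processing property of differential privacy, which is the structure the abstract relies on.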

4. Provable Quantization with Randomized Hadamard Transform

ArXiv ID: 2605.13810

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Ying Feng, Piotr Indyk, Michael Kapralov, Dmitry Krachun, Boris Prokhorov

Abstract: Vector quantization via random projection followed by scalar quantization is a fundamental primitive in machine learning, with applications ranging from similarity search to federated learning and KV cache compression. While dense random rotations yield clean theoretical guarantees, they require $\Theta(d^2)$ time. The randomized Hadamard transform $HD$ reduces this cost to $O(d \log d)$, but its discrete structure complicates analysis and leads to weaker or purely empirical compression guarantees. In this work, we study a variant of this approach: dithered quantization with a single randomized Hadamard transform. Specifically, the quantizer applies $HD$ to the input vector and subtracts a random scalar offset before quantizing, injecting additional randomness at negligible cost. We prove that this approach is unbiased and provides mean squared error bounds that asymptotically match those achievable with truly random rotation matrices. In particular, we prove that a dithered version of TurboQuant achieves mean squared error $\bigl(\pi\sqrt{3}/2 + o(1)\bigr) \cdot 4^{-b}$ at $b$ bits per coordinate, where the $o(1)$ term vanishes uniformly over all unit vectors and all dimensions as the number of quantization levels grows.

Comment: Proves near-optimal quantization error bounds for randomized-Hadamard-based dithered quantization at O(d log d) cost.

Topic Match: The paper is directly about a foundational compression primitive with strong theory and implications for efficient ML systems.

Relevance: 8 Novelty: 8
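
Sketch: The quantizer described in the abstract (apply $HD$, subtract a random scalar offset, then quantize) can be illustrated as follows. This is a simplified sketch under assumptions: it uses a uniform grid with an unbounded number of levels rather than a $b$-bit codebook, draws the shared offset uniformly from one grid cell, and the names `fwht`, `quantize`, and `dequantize` are hypothetical. It is not the TurboQuant variant analyzed in the paper.

```python
import math
import random

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

def quantize(x, signs, offset, step):
    """Rotate with y = H D x, subtract the shared dither, round to integer codes."""
    y = fwht([s * v for s, v in zip(signs, x)])
    return [round((v - offset) / step) for v in y]

def dequantize(codes, signs, offset, step):
    """Add the dither back and undo the rotation: x_hat = D H y_hat."""
    y_hat = [c * step + offset for c in codes]
    return [s * v for s, v in zip(signs, fwht(y_hat))]

rng = random.Random(0)
d, step = 8, 0.25                                     # hypothetical dimension and grid step
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]   # diagonal of D
offset = rng.uniform(-step / 2, step / 2)             # single random scalar offset

codes = quantize(x, signs, offset, step)
x_hat = dequantize(codes, signs, offset, step)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
```

The transform costs $O(d \log d)$ operations, and because $HD$ is orthogonal the reconstruction error in the original coordinates equals the rounding error in the rotated coordinates, which is at most `step / 2` per coordinate here.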

5. Scaling Laws for Mixture Pretraining Under Data Constraints

ArXiv ID: 2605.12715

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin

Abstract: As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.

Comment: Derives repetition-aware scaling laws for optimal target/generic data mixtures under pretraining data scarcity.

Topic Match: The primary contribution is about compute/data-efficient large-scale pretraining under constrained data budgets, making efficiency/scaling the best fit.

Relevance: 8 Novelty: 8
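
Sketch: The repetition counts quoted in the abstract follow from simple bookkeeping: the number of passes over the target corpus is the target share of the token budget divided by the corpus size. The snippet below is only that arithmetic, with hypothetical budget and corpus sizes; it does not implement the paper's scaling law, and the function names are made up for illustration.

```python
def target_repetitions(total_tokens, target_fraction, target_corpus_tokens):
    """Number of passes over the target corpus implied by a mixture fraction."""
    return total_tokens * target_fraction / target_corpus_tokens

def max_target_fraction(total_tokens, target_corpus_tokens, max_repetitions):
    """Largest mixture fraction that keeps target repetitions at or below a cap."""
    return min(1.0, max_repetitions * target_corpus_tokens / total_tokens)

budget = 100e9   # hypothetical training budget: 100B tokens
corpus = 0.5e9   # hypothetical target corpus: 0.5B unique tokens

reps = target_repetitions(budget, 0.10, corpus)   # a 10% target share
frac = max_target_fraction(budget, corpus, 15)    # cap at 15 repetitions

print(f"10% target share -> about {reps:.0f} passes; 15-pass cap -> fraction {frac:.3f}")
```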

6. DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning

ArXiv ID: 2605.13418

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Marc Molina Van den Bosch, Riccardo Taiello, Albert Sund Aillet, Andrea Protani, Miguel Angel Gonzalez Ballester, Luigi Serio

Abstract: Differentially private optimization suffers from a fundamental geometric mismatch: deep networks have highly anisotropic loss landscapes, yet DP-SGD injects isotropic noise. Second-order preconditioning can resolve this, but estimating curvature typically requires private data (consuming privacy budget) or public data (introducing distribution shift). We show that the Fisher Information Matrix decouples into architectural sensitivity, recoverable via synthetic noise, and input correlations, approximable from modality-specific frequency statistics. We propose DP-KFC, which constructs KFAC preconditioners by probing networks with structured synthetic noise, requiring neither private nor public data. Empirically, DP-KFC consistently outperforms DP-SGD and adaptive baselines across diverse modalities in strong privacy regimes ($\varepsilon \leq 3$). DP-KFC matches private-data preconditioners while public-data variants degrade by up to $4.8\%$, showing that curvature can be estimated without consuming privacy budget or introducing distribution shift. This enables privacy-preserving learning in specialized domains (e.g., medical applications) where regulatory constraints make data scarce.

Comment: Builds KFAC-style preconditioners from synthetic noise to align DP optimization with anisotropic geometry without using private data.

Topic Match: The main contribution is an optimizer/preconditioning method that changes large-model training behavior under privacy constraints through a new data-free curvature approximation.

Relevance: 8 Novelty: 8

7. KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

ArXiv ID: 2605.13734

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Zedong Liu, Xinyang Ma, Dejun Luo, Hairui Zhao, Bing Lu, Wenjing Huang, Yida Gu, Xingchen Liu, Zheng Wei, Jinyang Liu, Dingwen Tao, Guangming Tan

Abstract: LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression schemes are typically deployed as static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present \emph{KVServe}, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving: KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by $50\times$; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs and networks, KVServe achieves up to $9.13\times$ JCT speedup in PD-separated serving and up to $32.8\times$ TTFT reduction in KV-disaggregated serving.

Comment: Proposes an adaptive, service-aware KV compression controller for disaggregated serving that selects compression profiles online under latency and quality constraints.

Topic Match: Best fits Efficiency, Compression, and Large-Scale Training because the contribution directly targets KV-cache communication and memory bottlenecks in large-scale LLM serving with a new adaptive systems method.

Relevance: 8 Novelty: 8

Representation Learning Theory and Structure (6)

1. The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

ArXiv ID: 2605.12908

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Ryoya Awano, Taiji Suzuki

Abstract: Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $\kappa$. We prove that the strong model efficiently learns task $\kappa$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.

Comment: Provides theory for weak-to-strong generalization as feature elicitation from pre-trained latent knowledge while retaining off-target features.

Topic Match: The paper is fundamentally about how target features are elicited and preserved in learned representations under SGD.

Relevance: 9 Novelty: 8

2. WriteSAE: Sparse Autoencoders for Recurrent State

ArXiv ID: 2605.12770

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics, Memory Structures and Agent Memory Systems

Authors: Jack Young

Abstract: We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba-2, and RWKV-7 write to a $d_k \times d_v$ cache through rank-1 updates $k_t v_t^\top$ that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of $n=4{,}851$ firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at $R^2=0.98$, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs at $3\times$ lift midrank target-in-continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix-recurrent write site.

Comment: Introduces sparse autoencoders tailored to recurrent matrix-cache writes, enabling mechanistic decomposition and intervention at the native recurrent write site.

Topic Match: The strongest match is representation structure because the paper studies and decomposes learned recurrent-state features at a mechanistic level.

Relevance: 9 Novelty: 8

3. Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

ArXiv ID: 2605.12756

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Zhehang Du, Hangfeng He, Weijie Su

Abstract: Large language models (LLMs) are pretrained by minimizing the cross-entropy loss for next-token prediction. In this paper, we study whether this optimization strategy can induce geometric structure in the learned model weights and context embeddings. We approach this problem by analyzing a constrained layer-peeled optimization program, which serves as a mathematically tractable surrogate for LLMs by treating the output projection matrix and last-layer context embeddings as optimization variables. Our analysis of this nonconvex optimization program demonstrates that symmetries in the target next-token distributions are transferred to the global minimizers of the layer-peeled model in a precise group-theoretic sense. Specifically, we prove that when the target tokens exhibit a cyclic-shift symmetry (such as the seven days of the week or the twelve months of the year), the optimal logit matrix is exactly circulant, and the Gram matrices of both the output projections and the context embeddings form circulant geometries as well. Next, for exchangeable target distributions invariant under the symmetric group and, more generally, under two-transitive group actions, we show that the global optimal output projection matrix forms a simplex equiangular tight frame, while the optimal logit matrix and context embeddings inherit the permutation symmetries present in the input data. A key technical step is to reduce the constrained nonconvex factorized problem to an explicit logit-level convex characterization for cyclic symmetry and to a symmetry-based lower bound for permutation symmetry, together with a sharp characterization of the optimal factorization. Finally, we empirically demonstrate that open-source LLMs naturally exhibit symmetries consistent with our theoretical predictions, despite being trained without any explicit regularization promoting such geometric structure.

Comment: Proves that symmetries in target next-token distributions induce corresponding geometric structure in LLM embeddings and output weights.

Topic Match: The paper is fundamentally about geometric structure and identifiability of learned representations induced by training objectives, with theory tied to LLM optimization.

Relevance: 9 Novelty: 8

4. From Generalist to Specialist Representation

ArXiv ID: 2605.12733

Primary Topic: Representation Learning Theory and Structure

Authors: Yujia Zheng, Fan Feng, Yuke Li, Shaoan Xie, Kevin Murphy, Kun Zhang

Abstract: Given a generalist model, learning a task-relevant specialist representation is fundamental for downstream applications. Identifiability, the asymptotic guarantee of recovering the ground-truth representation, is critical because it sets the ultimate limit of any model, even with infinite data and computation. We study this problem in a completely nonparametric setting, without relying on interventions, parametric forms, or structural constraints. We first prove that the structure between time steps and tasks is identifiable in a fully unsupervised manner, even when sequences lack strict temporal dependence and may exhibit disconnections, and task assignments can follow arbitrarily complex and interleaving structures. We then prove that, within each time step, the task-relevant latent representation can be disentangled from the irrelevant part under a simple sparsity regularization, without any additional information or parametric constraints. Together, these results establish a hierarchical foundation: task structure is identifiable across time steps, and task-relevant latent representations are identifiable within each step. To our knowledge, each result provides a first general nonparametric identifiability guarantee, and together they mark a step toward provably moving from generalist to specialist models.

Comment: Provides nonparametric identifiability guarantees for disentangling task-relevant specialist representations from generalist ones.

Topic Match: This is directly about the structure and identifiability of learned representations, with foundational theoretical guarantees.

Relevance: 9 Novelty: 8

5. The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

ArXiv ID: 2605.12940

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Zhiyu Zhao, Xuejie Liu, Muhan Zhang, Anji Liu

Abstract: Probabilistic Circuits (PCs) are deep generative models that support exact and efficient probabilistic inference. Yet in autoregressive language modeling, PCs still lag behind Transformer-based large language models (LLMs), suggesting an important expressivity gap. In this work, we compare PCs and LLMs under a unified autoregressive formulation. First, an output bottleneck: PCs parameterize predictions as convex combinations in probability space, which struggles to represent the sharp distributions typical of language; adopting a logit-space parameterization substantially narrows this gap. Second, a context-encoding bottleneck: we prove that structured-decomposable PCs can match Transformer separation rank on vtree-aligned partitions, but show, both theoretically and empirically, that this capacity is limited to partitions aligned with the fixed routing structure, leading to severe degradation when the data exhibits heterogeneous dependency topologies. We further prove that decomposable PCs are strictly more expressive than structured-decomposable ones, though effectively optimizing them remains an open challenge.

Comment: Identifies output and context-encoding bottlenecks that define the expressivity gap between probabilistic circuits and transformer language models.

Topic Match: The paper is fundamentally about representational capacity and structural expressivity, making representation structure the best primary fit.

Relevance: 8 Novelty: 8

6. Support-Conditioned Flow Matching Is Kernel Smoothing

ArXiv ID: 2605.13386

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Daniel Matsui Smola

Abstract: Generative models are often conditioned on a small set of examples via cross-attention. Under the Gaussian optimal-transport path, we show that the exact velocity field induced by a finite support set is a Nadaraya--Watson kernel smoother whose bandwidth decreases with flow time, from broad averaging at early steps to nearest-neighbor at late steps. A single Gaussian-kernel attention head exactly computes this field, connecting cross-attention conditioning to classical kernel theory. The theory predicts three failure regimes: nearest-neighbor collapse of the kernel at high dimension, mismatch between the isotropic kernel and the data geometry, and insufficient support for nonparametric estimation. Experiments on Gaussian mixtures, spherical shells, and DINOv2 ImageNet features confirm that learned conditioning improves in precisely these regimes, and that IP-Adapter's cross-attention implements approximate NW smoothing in practice.

Comment: Shows that support-conditioned flow matching reduces exactly to a time-varying Nadaraya-Watson kernel smoother under a Gaussian OT path.

Topic Match: The paper primarily offers mechanistic theory connecting conditioning behavior to kernel smoothing, making it strongest as a representation/structure insight rather than an architecture proposal.

Relevance: 8 Novelty: 8
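
Sketch: The kernel-smoothing form of the velocity field can be written out directly. This sketch assumes the linear path $x_t = (1-t)x_0 + t x_1$ with $x_0 \sim N(0, I)$ and $x_1$ drawn uniformly from the support set, which is one common reading of "Gaussian optimal-transport path"; the paper's exact parameterization may differ. Under that assumption the posterior weight of support point $s_i$ is proportional to $\exp(-\|x - t s_i\|^2 / (2(1-t)^2))$ and the velocity is the weighted average of $(s_i - x)/(1-t)$. The function names are hypothetical.

```python
import math

def nw_weights(x, support, t):
    """Posterior weights over support points given x_t = x (softmax of Gaussian log-kernels)."""
    scores = [
        -sum((xk - t * sk) ** 2 for xk, sk in zip(x, s)) / (2.0 * (1.0 - t) ** 2)
        for s in support
    ]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    return [e / total for e in exps]

def velocity(x, support, t):
    """Velocity field: weighted average of (s_i - x) / (1 - t); requires 0 <= t < 1."""
    w = nw_weights(x, support, t)
    target = [sum(wi * s[k] for wi, s in zip(w, support)) for k in range(len(x))]
    return [(tk - xk) / (1.0 - t) for tk, xk in zip(target, x)]

support = [[1.0, 0.0], [-1.0, 0.0]]         # hypothetical two-point support set
x = [0.3, 0.0]
early = nw_weights(x, support, 0.01)        # broad averaging: weights close to uniform
late = nw_weights(x, support, 0.95)         # narrow kernel: nearly all weight on the nearest point
```

The weights are a softmax over negative squared distances, which is the sense in which a single Gaussian-kernel attention head can compute this field.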

Memory Structures and Agent Memory Systems (2)

1. Cognifold: Always-On Proactive Memory via Cognitive Folding

ArXiv ID: 2605.13438

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Xinliang Zhou

Abstract: Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across 7 broad-coverage benchmarks spanning five cognitive domains, we validate that CogniFold simultaneously performs robustly on conventional memory benchmarks.

Comment: Proposes an always-on memory that proactively folds event streams into evolving cognitive structures with consolidation, decay, and associative relinking.

Topic Match: This is directly centered on agent memory organization, updating, forgetting, recall, and proactive structuring rather than standard retrieval pipelines.

Relevance: 10 Novelty: 8

2. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

ArXiv ID: 2605.12922

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Architecture and Training Dynamics, Representation Learning Theory and Structure

Authors: Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür

Abstract: Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.

Comment: Mechanistically explains multi-turn instruction loss as a transition where goal tokens become inaccessible through attention despite residual retention.

Topic Match: Primary fit is memory systems because the paper studies how goal information is retained, lost, and accessed over long interactions, with explicit mechanistic diagnostics.

Relevance: 9 Novelty: 8

World Models, Exploration, and Open-Ended Reinforcement Learning (7)

1. Delightful Exploration

ArXiv ID: 2605.13287

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Ian Osband

Abstract: Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to $\varepsilon$-greedy, which bounds disruption but spends its override blindly. We introduce \textit{Delight-gated exploration} (DE), a host--override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker regret growth than Thompson Sampling and $\varepsilon$-greedy in the tested unresolved regimes. Delight improves acting for the same reason it improves learning: it prices scarce resources by the product of upside and surprisal.

Comment: Proposes delight-gated exploration, a reservation-value style rule that allocates exploration by expected improvement times surprisal.

Topic Match: This is directly a foundational exploration method with a clear new principle for allocating exploratory actions.

Relevance: 9 Novelty: 8
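
Sketch: The abstract relates the gate to Pandora's reservation-value rule for costly search. The snippet below illustrates only that classical rule, not the Delight-gated exploration algorithm itself: for a box with reward $X$ and inspection cost $c$, the reservation value $z$ solves $E[\max(X - z, 0)] = c$. The discrete reward distribution, the plain numeric cost (which the abstract says surprisal sets in DE), and the function names are assumptions made for illustration.

```python
def expected_excess(values, probs, z):
    """E[max(X - z, 0)] for a discrete reward distribution."""
    return sum(p * max(v - z, 0.0) for v, p in zip(values, probs))

def reservation_value(values, probs, cost, iters=100):
    """Solve E[max(X - z, 0)] = cost for z by bisection (the left side is decreasing in z)."""
    lo, hi = min(values) - cost, max(values)   # the excess is >= cost at lo and 0 at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_excess(values, probs, mid) > cost:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical arm: reward 0 or 10 with equal probability, inspection cost 1.
z = reservation_value([0.0, 10.0], [0.5, 0.5], cost=1.0)
print(f"reservation value: {z:.3f}")
```

Under the classical rule, a box is worth opening only while its reservation value exceeds the best reward already in hand, so raising the inspection cost lowers the reservation value and closes the gate sooner.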

2. Ergodic Trajectory Design by Learned Pushforward Maps: Provable Coverage via Conditional Flow Matching

ArXiv ID: 2605.13063

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Ehsan Aghazadeh, Masoud Malekzadeh, Ahmad Ghasemi, Hossein Pishro-Nik

Abstract: Designing continuous trajectories whose time-averaged occupancy provably matches a prescribed spatial density (the \emph{ergodic coverage} problem) is central to UAV-assisted data collection and sensing, robotic exploration, and mobile monitoring. For flying agents in particular, this challenge is acute: trajectories must balance coverage fidelity against tight energy budgets, no-fly zones, and acceleration limits. Existing methods either re-optimize each trajectory online (with cost growing in the horizon and re-running for every target, agent, and realization) or rely on bespoke analytical constructions that must be re-derived for each new constraint. We propose a \emph{pushforward} framework that decouples ergodicity from density matching: an analytic latent trajectory provides exact uniform ergodicity on a simple annular domain, and a single map, learned offline by optimal-transport conditional flow matching, transports this latent occupancy onto the prescribed target density. The composed trajectory is then asymptotically ergodic with respect to the learned pushforward distribution, with deviation from the target controlled by the flow-matching training loss. Once trained for a given target density and constraint set, the map serves an unbounded number of trajectories and a multi-agent fleet without per-agent retraining, and many differentiable operational constraints (no-fly zones, acceleration ceilings, or fairness penalties) enter as additive soft penalties in the training loss without re-deriving the design. We prove three results (an acceleration-energy bound, an $O(1/\sqrt{K})$ ergodic convergence rate in the number of trajectory cycles $K$, and an approximation-error bound) that combine into an end-to-end coverage bound estimable from CFM training diagnostics (certified given an architectural Lipschitz bound on $v_\theta$).

Comment: Learns pushforward maps with coverage guarantees for exploration trajectories under constraints.

Topic Match: The core problem is exploration and coverage via learned trajectory design with theory, fitting open-ended interaction better than other topics.

Relevance: 8 Novelty: 8

3. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

ArXiv ID: 2605.13435

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: JaeHyeok Doo, Byeongguk Jeon, Seonghyeon Ye, Kimin Lee, Minjoon Seo

Abstract: There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduce Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of 10.6 percentage points, while also enabling stable online adaptation within the same framework.

Comment: Introduces a stable optimization principle for expressive flow-based RL policies by propagating value gradients through latent flow states without solver unrolling.

Topic Match: The paper is fundamentally about reinforcement-learning policy optimization, with the key contribution being a new learning principle for expressive RL policies rather than just an architectural tweak.

Relevance: 8 Novelty: 8

4. State-Centric Decision Process

ArXiv ID: 2605.12755

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Memory Structures and Agent Memory Systems

Authors: Sungheon Jeong, Ryozo Masukawa, Sanggeon Yun, Mahdi Imani, Mohsen Imani

Abstract: Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires: no explicit state space, no observation-to-state mapping, no certified transitions, and no termination criterion. We introduce the State-Centric Decision Process (SDP), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts. At each step the agent commits to a natural-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop question answering. SDP achieves the best training-free results on all five, with the advantage widening as the horizon grows. The certified trajectories additionally support analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.

Comment: Constructs task-induced states online in language environments by certifying natural-language predicates into state transitions.

Topic Match: This is a foundational framework for sequential decision-making in language environments, explicitly building state abstractions needed for planning and credit assignment.

Relevance: 8 Novelty: 8
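
The commit, act, check loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `run_sdp`, `Trajectory`, `propose`, `act`, `check`, and `is_done` are hypothetical, and the substring-based `check` in the toy demo stands in for whatever verifier the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    # Certified transitions: (previous certified predicate, action, new predicate).
    certified: List[Tuple[str, str, str]] = field(default_factory=list)
    # Predicates that failed their check, with the action that was tried.
    failures: List[Tuple[str, str]] = field(default_factory=list)


def run_sdp(
    propose: Callable[[str], Tuple[str, str]],  # certified state -> (predicate, action)
    act: Callable[[str], str],                  # action -> raw text observation
    check: Callable[[str, str], bool],          # (predicate, observation) -> holds?
    is_done: Callable[[str], bool],             # certified state -> task finished?
    max_steps: int = 10,
) -> Trajectory:
    traj = Trajectory()
    state = "start"  # hypothetical label for the initial, uncertified state
    for _ in range(max_steps):
        predicate, action = propose(state)   # commit to how the world should look
        observation = act(action)            # act to make the predicate true
        if check(predicate, observation):    # verify against the raw observation
            traj.certified.append((state, action, predicate))
            state = predicate                # the predicate becomes a certified state
            if is_done(state):
                break
        else:
            traj.failures.append((predicate, action))
    return traj


def demo() -> Trajectory:
    # Toy text environment (an assumption for illustration): a door and a room.
    world = {"door": "closed"}

    def propose(state: str) -> Tuple[str, str]:
        if state == "start":
            return "door is open", "open door"
        return "agent is inside", "walk in"

    def act(action: str) -> str:
        if action == "open door":
            world["door"] = "open"
            return "The door is open."
        if action == "walk in" and world["door"] == "open":
            return "The agent is inside."
        return "Nothing happens."

    def check(predicate: str, observation: str) -> bool:
        return predicate.lower() in observation.lower()

    return run_sdp(propose, act, check, lambda s: s == "agent is inside")
```

Keeping failed predicates in a separate list is one simple way to support the failure localization and partial-progress measurement the abstract mentions: the number of certified transitions measures progress, and the first failed predicate marks where the run broke down.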

5. Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

ArXiv ID: 2605.13335

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Memory Structures and Agent Memory Systems

Authors: Qinchuan Cheng, Zhantao Gong, Pengzhan Sun, Angela Yao, Xulei Yang, Shijie Li

Abstract: Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration -- suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.

Comment: Compiles egocentric videos into executable symbolic worlds that require belief-state planning under partial observability.

Topic Match: The strongest fit is world models because the work creates executable world dynamics and explicitly evaluates belief-based planning under hidden state.

Relevance: 8 Novelty: 8

6. Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

ArXiv ID: 2605.12655

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan

Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

Comment: Corrects Bellman backups at instruction-switch boundaries to preserve macro-action values under changing language instructions.

Topic Match: The contribution is a new RL learning principle for interrupted long-horizon control, with theoretical treatment of value consistency under changing objectives.

Relevance: 8 Novelty: 8

7. Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

ArXiv ID: 2605.13554

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius

Abstract: Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.

Comment: First on-policy contrastive RL formulation that derives PPO advantages from contrastive Q-values without rewards or replay.

Topic Match: Best fits world_models_open_ended_rl because it proposes a foundational self-supervised RL learning principle for acquiring behaviors without handcrafted rewards.

Relevance: 8 Novelty: 8
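
To make "derives policy advantages directly from contrastive Q-values" concrete, here is a minimal sketch for a discrete action space. The abstract does not give the exact construction, so two things below are assumptions rather than the paper's method: the critic is taken to be an inner product between a state-action embedding and a goal embedding (a common choice in contrastive RL), and the advantage subtracts the policy-weighted mean of the Q-values as a baseline. The clipped term is the standard PPO surrogate. All function names are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def contrastive_q(sa_embeddings: Sequence[Sequence[float]],
                  goal_embedding: Sequence[float]) -> List[float]:
    """One critic score per action: <phi(s, a), psi(g)> (assumed critic form)."""
    return [dot(e, goal_embedding) for e in sa_embeddings]


def advantages(q_values: Sequence[float],
               policy_probs: Sequence[float]) -> List[float]:
    """A(s, a, g) = Q(s, a, g) - sum_a' pi(a' | s, g) Q(s, a', g) (assumed baseline)."""
    baseline = dot(q_values, policy_probs)
    return [q - baseline for q in q_values]


def ppo_clip_term(new_prob: float, old_prob: float,
                  advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for a single (state, action) sample."""
    ratio = new_prob / old_prob
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


# Toy example with two actions and 2-d embeddings (values chosen by hand).
q = contrastive_q([[1.0, 0.0], [0.0, 1.0]], [2.0, 1.0])
adv = advantages(q, [0.5, 0.5])
surrogate = ppo_clip_term(new_prob=0.9, old_prob=0.5, advantage=adv[0])
```

No reward signal or replay buffer appears anywhere in the sketch: the only learning signal is the critic score for the commanded goal, which is the property the abstract emphasises.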

Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Relevant Topics

Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.

Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.

Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.

Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.

Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.

Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.

World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.

Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.

7-8: substantially related, but partly peripheral or focused on a narrower aspect.

5-6: touches the target topics, but the main contribution is elsewhere.

3-4: largely outside the target topics, often application-focused or domain-specific.

1-2: unrelated.

Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.

Novelty Scoring

9-10: new paradigm, theory, or major methodological breakthrough.

7-8: substantial methodological advance or strong new insight.

5-6: meaningful but incremental extension or refinement.

3-4: minor, narrow, or mostly engineering or domain-specific improvement.

1-2: little originality; mainly standard application of existing methods.
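
Because both rubrics share the same five score bands, they can be represented as one lookup over two tables. The sketch below is only an illustration of that structure: the function name `describe` and the table names are hypothetical, and the band texts are abbreviated from the criteria above.

```python
from typing import List, Tuple

# (lower bound of the band, abbreviated description), highest band first.
RELEVANCE_BANDS: List[Tuple[int, str]] = [
    (9, "directly centered on the target foundational topics"),
    (7, "substantially related, but partly peripheral or narrower"),
    (5, "touches the target topics, but the main contribution is elsewhere"),
    (3, "largely outside the target topics"),
    (1, "unrelated"),
]

NOVELTY_BANDS: List[Tuple[int, str]] = [
    (9, "new paradigm, theory, or major methodological breakthrough"),
    (7, "substantial methodological advance or strong new insight"),
    (5, "meaningful but incremental extension or refinement"),
    (3, "minor, narrow, or mostly engineering or domain-specific improvement"),
    (1, "little originality"),
]


def describe(score: int, bands: List[Tuple[int, str]]) -> str:
    """Return the band description for an integer score from 1 to 10."""
    if type(score) is not int or not 1 <= score <= 10:
        raise ValueError("score must be an integer from 1 to 10")
    for lower, text in bands:
        if score >= lower:
            return text
    raise AssertionError("unreachable: the lowest band starts at 1")
```

The two axes are scored independently, so a paper rated 8 on both, like the entries above, is looked up once in each table.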

Topic Registry

Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.

Papers

[PAPER LIST HERE]

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}

Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
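
A response that follows these rules can be checked mechanically, one line at a time. The sketch below shows one possible validator, assuming the field names and allowed values exactly as listed in the prompt; the function `validate_line` and the example record are hypothetical (the record is assembled from the digest fields shown for arXiv 2605.13435 and is not actual model output).

```python
import json
from typing import List

TOPIC_IDS = (
    "architecture_training",
    "efficiency_scaling",
    "representation_structure",
    "memory_systems",
    "world_models_open_ended_rl",
)
HOTSPOT_TAGS = ("daily_hot", "new_frontier")
STRING_FIELDS = ("ARXIVID", "COMMENT", "PRIMARY_TOPIC_ID",
                 "TOPIC_MATCH_COMMENT", "HOTSPOT_PAPER_COMMENT")
SCORE_FIELDS = ("RELEVANCE", "NOVELTY")
LIST_FIELDS = ("MATCHED_TOPIC_IDS", "HOTSPOT_PAPER_TAGS")


def validate_line(line: str) -> List[str]:
    """Return a list of rule violations for one JSONL line (empty if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    errors: List[str] = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            errors.append(f"{name} must be a string")
    for name in SCORE_FIELDS:
        value = record.get(name)
        if type(value) is not int or not 1 <= value <= 10:
            errors.append(f"{name} must be an integer from 1 to 10")
    for name in LIST_FIELDS:
        if not isinstance(record.get(name), list):
            errors.append(f"{name} must be a list")
    if errors:
        return errors  # type errors make the cross-field checks meaningless

    primary = record["PRIMARY_TOPIC_ID"]
    matched = record["MATCHED_TOPIC_IDS"]
    tags = record["HOTSPOT_PAPER_TAGS"]
    if primary not in TOPIC_IDS:
        errors.append("PRIMARY_TOPIC_ID is not in the topic registry")
    if any(topic not in TOPIC_IDS for topic in matched):
        errors.append("MATCHED_TOPIC_IDS contains an unknown topic ID")
    if len(matched) > 1 and primary not in matched:
        errors.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")
    if any(tag not in HOTSPOT_TAGS for tag in tags):
        errors.append("HOTSPOT_PAPER_TAGS contains an unknown tag")
    if not tags and record["HOTSPOT_PAPER_COMMENT"] != "":
        errors.append("HOTSPOT_PAPER_COMMENT must be empty without tags")
    return errors


# Hypothetical record assembled from a digest entry, for illustration only.
EXAMPLE_LINE = json.dumps({
    "ARXIVID": "2605.13435",
    "COMMENT": "Stable optimization principle for expressive flow-based RL policies.",
    "RELEVANCE": 8,
    "NOVELTY": 8,
    "PRIMARY_TOPIC_ID": "world_models_open_ended_rl",
    "MATCHED_TOPIC_IDS": ["world_models_open_ended_rl", "architecture_training"],
    "TOPIC_MATCH_COMMENT": "Fundamentally about RL policy optimization.",
    "HOTSPOT_PAPER_TAGS": [],
    "HOTSPOT_PAPER_COMMENT": "",
})

if __name__ == "__main__":
    print(validate_line(EXAMPLE_LINE))
```

Note that the output template shows 0 as a placeholder for RELEVANCE and NOVELTY, while the rules require integers from 1 to 10; the validator follows the rules, so a line that copies the placeholder unchanged is rejected.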