Personalized Daily ArXiv Papers 2026-04-15
| Model | Metric | Usage: Prompt | Usage: Completion | Usage: Total | Papers: Total arXiv | Papers: Scanned | Papers: Relevant |
|---|---|---|---|---|---|---|---|
| gpt-5.4 | Tokens | 172336 | 23047 | 195383 | 592 | 339 | 21 |
| gpt-5.4 | Cost | $0.43 | $0.35 | $0.78 | | | |
Topic Coverage:
Table of contents by topic:
Architecture and Training Dynamics (3)
- Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
- Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory Authors: Shaopeng Fu, Di Wang
- VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization Authors: Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir
Efficiency, Compression, and Large-Scale Training (6)
- OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension Authors: Zhiyuan Zhang, Yanzhao Li, Zhiqiang Zou, Bai Du, Yupeng Sun, Hui Dong, Hui Wang
- LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models Authors: Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, Kurt Keutzer, Amir Gholami
- PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving Authors: Xu Bai, Muhammed Tawfiqul Islam, Chen Wang, Adel N. Toosi
- Decentralized Learning via Random Walk with Jumps Authors: Zonghong Liu, Matthew Dwyer, Salim El Rouayheb
- Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models Authors: Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, Anbang Yao
- ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems Authors: Daeyeon Son
Representation Learning Theory and Structure (5)
- Latent Planning Emerges with Scale Authors: Michael Hanna, Emmanuel Ameisen
- Loop Corrections to the Training and Generalization Errors of Random Feature Models Authors: Taeyoung Kim
- A Bayesian Perspective on the Role of Epistemic Uncertainty for Delayed Generalization in In-Context Learning Authors: Abdessamed Qchohi, Simone Rossi
- Information-Geometric Decomposition of Generalization Error in Unsupervised Learning Authors: Gilhan Kim
- SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation Authors: Yexiong Lin, Jia Shi, Shanshan Ye, Wanyu Wang, Yu Yao, Tongliang Liu
Memory Structures and Agent Memory Systems (6)
- Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents Authors: Benjamin Stern, Peter Nadel
- Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations Authors: Ziyang Liu
- When to Forget: A Memory Governance Primitive Authors: Baris Simsek
- M$^\star$: Every Task Deserves Its Own Memory Harness Authors: Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia
- Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness Authors: Madhava Gaikwad
- Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents Authors: Swanand Rao, Kiran Kashalkar, Parvathi Somashekar, Priya Krishnan
World Models, Exploration, and Open-Ended Reinforcement Learning (1)
- Robust Optimization for Mitigating Reward Hacking with Correlated Proxies Authors: Zixuan Liu, Xiaolin Sun, Zizhan Zheng
Architecture and Training Dynamics (3)
1. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
ArXiv ID: 2604.13016
Primary Topic: Architecture and Training Dynamics
Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
Abstract: On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
Comment: Systematically studies on-policy distillation dynamics, identifying when teacher-student compatibility and novelty determine success or failure.
Topic Match: The value is in understanding post-training dynamics and token-level distillation mechanisms rather than reporting benchmark gains.
Relevance: 8 Novelty: 8
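Illustrative sketch (not the paper's recipe): the dense token-level signal and the shared high-probability token set mentioned in the abstract can be pictured with two small helpers. The reverse-KL form of the per-token loss, the function names, and the toy distributions are all assumptions made for this example.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution (assumed loss form)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def shared_top_k_mass(student_probs, teacher_probs, k):
    """Student probability mass on tokens that are in both models' top-k sets."""
    def top_k(probs):
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        return set(order[:k])
    shared = top_k(student_probs) & top_k(teacher_probs)
    return sum(student_probs[i] for i in shared)

# Toy next-token distributions at one student-visited state (hypothetical values).
student = [0.70, 0.20, 0.05, 0.05]
teacher = [0.60, 0.30, 0.08, 0.02]

token_loss = reverse_kl(student, teacher)            # dense per-token signal
mass = shared_top_k_mass(student, teacher, k=2)      # mass on the shared top-2 set
```

In an actual on-policy loop the student would sample the trajectory and this loss would be evaluated at every sampled position; that loop is omitted here.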
2. Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
ArXiv ID: 2604.12817
Primary Topic: Architecture and Training Dynamics
Authors: Shaopeng Fu, Di Wang
Abstract: Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.
Comment: Explains continuous adversarial training for LLMs through in-context learning theory and links robustness to the singular values of the embedding matrix.
Topic Match: This is squarely about training dynamics and mechanistic analysis of a specific robust-training procedure, with a derived architectural regularization insight.
Relevance: 8 Novelty: 8
3. VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
ArXiv ID: 2604.12887
Primary Topic: Architecture and Training Dynamics
Also Matches: Efficiency, Compression, and Large-Scale Training, Representation Learning Theory and Structure
Authors: Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir
Abstract: Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.
Comment: Introduces a variable-length coarse-to-fine video tokenizer where early tokens capture semantics and later tokens refine detail, materially changing token structure and downstream training efficiency.
Topic Match: The core contribution is a new tokenization architecture and representation organization mechanism, not just an efficiency tweak.
Relevance: 8 Novelty: 8
Efficiency, Compression, and Large-Scale Training (6)
1. OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
ArXiv ID: 2604.12782
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Zhiyuan Zhang, Yanzhao Li, Zhiqiang Zou, Bai Du, Yupeng Sun, Hui Dong, Hui Wang
Abstract: While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often lead to significant accuracy degradation due to the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect, where high-magnitude outliers consistently occupy fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located and then performs structured sub-tensor extraction to coalesce these scattered activation channels into a compact dense tensor online. This mechanism implements outlier protection through regularized and high-throughput GEMM operations, achieving a seamless fit with modern 4-bit micro-scaling hardware. Furthermore, for the inputs of W2 where outlier clustering is less pronounced, we integrate a fallback strategy to FP8. Evaluation on Qwen3-8B and Qwen3-30B restricts the average accuracy drop to 2.19 and 1.12 points, respectively. Notably, OSC is highly hardware-friendly, achieving a peak speedup of 1.78x over the W8A8 GEMM baseline on a modern AI accelerator.
Comment: Finds token-persistent channel outliers and exploits them with a dual-path W4A4/FP16 execution scheme for hardware-friendly 4-bit inference.
Topic Match: The core contribution is a new quantization mechanism for handling activation outliers in low-bit LLM inference, directly matching efficient inference and compression.
Relevance: 9 Novelty: 8
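Illustrative sketch (assumptions throughout): the dual-path idea in the abstract can be mimicked on a single dot product. Outlier channels are picked offline from calibration activations, computed in full precision, and the remaining channels go through a simulated symmetric 4-bit path. The selection rule (largest peak magnitude), the quantizer, and all names are hypothetical and not taken from the paper.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization of a list: returns (ints in [-8, 7], scale)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 7.0
    return [max(-8, min(7, round(v / scale))) for v in values], scale

def find_outlier_channels(calib_acts, num_outliers):
    """Offline step: channels with the largest peak magnitude over calibration tokens."""
    num_channels = len(calib_acts[0])
    peak = [max(abs(row[c]) for row in calib_acts) for c in range(num_channels)]
    ranked = sorted(range(num_channels), key=lambda c: peak[c], reverse=True)
    return sorted(ranked[:num_outliers])

def dual_path_dot(act, weight, outlier_channels):
    """One output element: 4-bit path on regular channels plus a high-precision outlier path."""
    outliers = set(outlier_channels)
    regular = [c for c in range(len(act)) if c not in outliers]
    qa, sa = quantize_int4([act[c] for c in regular])
    qw, sw = quantize_int4([weight[c] for c in regular])
    low = sum(a * w for a, w in zip(qa, qw)) * sa * sw
    high = sum(act[c] * weight[c] for c in outlier_channels)
    return low + high

# Channel 2 carries a persistent outlier across calibration tokens.
calib = [[0.1, -0.2, 30.0, 0.3], [0.2, 0.1, -28.0, -0.1]]
channels = find_outlier_channels(calib, num_outliers=1)

act = [0.3, -0.1, 25.0, 0.2]
weight = [0.5, -0.4, 0.1, 0.3]
exact = sum(a * w for a, w in zip(act, weight))
dual = dual_path_dot(act, weight, channels)   # outlier channel kept in high precision
naive = dual_path_dot(act, weight, [])        # everything squeezed into 4 bits
```

With the outlier separated, the 4-bit scale is set by the small activations, so the low-precision path loses far less resolution than the naive variant.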
2. LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
ArXiv ID: 2604.12056
Primary Topic: Efficiency, Compression, and Large-Scale Training
Also Matches: Architecture and Training Dynamics
Authors: Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, Kurt Keutzer, Amir Gholami
Abstract: Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they still remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LoSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LoSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It also achieves up to 4.14x attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.
Comment: Proposes a new sparse-attention mechanism for diffusion language models that reuses cached prefix attention for stable tokens to cut KV loads.
Topic Match: The paper’s central contribution is a nontrivial KV/cache-efficient inference mechanism with a clear new systems-algorithmic idea.
Relevance: 9 Novelty: 8
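Illustrative sketch (assumptions throughout): the active/stable split described in the abstract can be pictured as a per-token threshold on hidden-state change between two denoising steps, with cached attention outputs reused for stable tokens. The L2 criterion, the threshold, and the `sparse_attend` callback are hypothetical stand-ins, not the paper's implementation.

```python
import math

def l2_change(prev_vec, curr_vec):
    """Euclidean distance between a token's hidden state at two denoising steps."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_vec, curr_vec)))

def split_active_stable(prev_hidden, curr_hidden, threshold):
    """Tokens whose hidden state moved more than `threshold` are treated as active."""
    active, stable = [], []
    for idx, (prev_vec, curr_vec) in enumerate(zip(prev_hidden, curr_hidden)):
        (active if l2_change(prev_vec, curr_vec) > threshold else stable).append(idx)
    return active, stable

def attention_step(prev_hidden, curr_hidden, cached_out, sparse_attend, threshold):
    """Reuse cached outputs for stable tokens; recompute only the active ones."""
    active, _ = split_active_stable(prev_hidden, curr_hidden, threshold)
    out = list(cached_out)
    for idx in active:
        out[idx] = sparse_attend(idx)   # only these tokens trigger KV loads
    return out, active

prev_hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
curr_hidden = [[0.0, 0.01], [1.5, 0.2], [2.0, 2.0]]
cached = ["cached-0", "cached-1", "cached-2"]

outputs, active_tokens = attention_step(
    prev_hidden, curr_hidden, cached,
    sparse_attend=lambda idx: f"recomputed-{idx}",
    threshold=0.1,
)
```

Because only active tokens issue sparse-attention queries, the union of KV positions that must be loaded stays small.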
3. PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
ArXiv ID: 2604.12171
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Xu Bai, Muhammed Tawfiqul Islam, Chen Wang, Adel N. Toosi
Abstract: Pipeline parallelism (PP) is widely used to partition layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP configurations that fail to adapt to dynamic settings, such as serverless platforms and heterogeneous GPU environments. Reconfiguring PP by stopping and redeploying service incurs prohibitive downtime, so reconfiguration must instead proceed live and in place, without interrupting inference. However, live in-place PP reconfiguration is fundamentally challenging. GPUs are already saturated with model weights and KV cache, leaving little room for new layer placements and necessitating KV cache resizing, at odds with systems like vLLM that preallocate for throughput. Moreover, maintaining KV consistency during execution is difficult: stop-and-copy introduces large pauses, while background synchronization risks inconsistency as states evolve. We present PipeLive, which enables live in-place PP reconfiguration with minimal disruption. PipeLive introduces a redesigned KV cache layout together with a co-designed extension to PagedAttention, forming a unified mechanism for live KV resizing. It further adopts an incremental KV patching mechanism, inspired by live virtual machine migration, to synchronize KV states between source and target configurations and identify a safe switch point. PipeLive achieves a 2.5X reduction in time-to-first-token (TTFT) without KV cache overflow compared to disabling KV resizing. Furthermore, compared to a variant without KV patching, it reduces reconfiguration overhead from seconds to under 10ms, and improves TTFT and time-per-output-token (TPOT) by up to 54.7% and 14.7%, respectively.
Comment: Enables live in-place pipeline-parallel reconfiguration through a redesigned KV-cache layout and incremental KV patching.
Topic Match: This is directly about large-model serving efficiency, KV-cache design, and distributed inference behavior under dynamic reconfiguration.
Relevance: 9 Novelty: 8
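Illustrative sketch (assumptions throughout): the abstract likens incremental KV patching to live virtual machine migration. The toy loop below follows the generic pre-copy pattern: copy everything once, then repeatedly re-copy only the blocks dirtied in the meantime, and switch once the dirty set is small enough for a short final pause. The block dictionaries, `writes_per_round`, and the threshold are hypothetical; this is not PipeLive's actual mechanism.

```python
def live_kv_patch(source, target, writes_per_round, switch_threshold):
    """Simulate pre-copy style synchronization of KV blocks from source to target.

    writes_per_round[i] maps block id -> new value written by ongoing inference
    while copy round i was running (so those blocks become stale in the target).
    Returns (copy rounds performed, blocks copied during the final pause).
    """
    pending = set(source)                 # the first round copies every block
    rounds = 0
    for writes in writes_per_round:
        for block in pending:
            target[block] = source[block]
        rounds += 1
        source.update(writes)             # inference kept running and dirtied these
        pending = set(writes)
        if len(pending) <= switch_threshold:
            break                         # safe switch point: little left to patch
    final_pause_blocks = len(pending)
    for block in pending:                 # brief stop-and-copy of the remainder
        target[block] = source[block]
    return rounds, final_pause_blocks

source = {i: f"kv{i}-v0" for i in range(8)}
target = {}
writes = [
    {0: "kv0-v1", 1: "kv1-v1", 2: "kv2-v1"},   # dirtied during round 1
    {2: "kv2-v2"},                             # dirtied during round 2
    {},
]
rounds, final_pause_blocks = live_kv_patch(source, target, writes, switch_threshold=1)
```

Here only one block is copied during the final pause instead of all eight, which is the intuition behind replacing stop-and-copy with incremental patching.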
4. Decentralized Learning via Random Walk with Jumps
ArXiv ID: 2604.12260
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Zonghong Liu, Matthew Dwyer, Salim El Rouayheb
Abstract: We study decentralized learning over networks where data are distributed across nodes without a central coordinator. Random walk learning is a token-based approach in which a single model is propagated across the network and updated at each visited node using local data, thereby incurring low communication and computational overheads. In weighted random-walk learning, the transition matrix is designed to achieve a desired sampling distribution, thereby speeding up convergence under data heterogeneity. We show that implementing weighted sampling via the Metropolis-Hastings algorithm can lead to a previously unexplored phenomenon we term entrapment. The random walk may become trapped in a small region of the network, resulting in highly correlated updates and severely degraded convergence. To address this issue, we propose Metropolis-Hastings with Levy jumps, which introduces occasional long-range transitions to restore exploration while respecting local information constraints. We establish a convergence rate that explicitly characterizes the roles of data heterogeneity, network spectral gap, and jump probability, and demonstrate through experiments that MHLJ effectively eliminates entrapment and significantly speeds up decentralized learning.
Comment: Identifies Metropolis-Hastings entrapment in decentralized learning and fixes it with Lévy-jump random walks plus convergence theory.
Topic Match: It is fundamentally a distributed training/optimization algorithm paper about communication-efficient decentralized learning dynamics.
Relevance: 8 Novelty: 8
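Illustrative sketch (assumptions throughout): a Metropolis-Hastings random walk with a uniform-neighbor proposal, plus an occasional long-range move. The acceptance rule is the standard MH ratio; the jump is modeled here as a heavy-tailed number of plain neighbor hops, which is an assumed form and not necessarily the paper's Lévy-jump construction (in particular, this simplified jump is not claimed to preserve the target distribution).

```python
import random

def mh_accept_prob(pi, graph, i, j):
    """MH acceptance probability for a uniform-neighbor proposal i -> j."""
    return min(1.0, (pi[j] * len(graph[i])) / (pi[i] * len(graph[j])))

def mh_step(pi, graph, i, rng):
    """Propose a uniform neighbor and accept or stay put."""
    j = rng.choice(graph[i])
    return j if rng.random() < mh_accept_prob(pi, graph, i, j) else i

def jump(graph, i, rng, alpha=1.5, max_hops=50):
    """Long-range move: a Pareto-distributed number of neighbor hops (assumed form)."""
    hops = min(max_hops, int(rng.paretovariate(alpha)))
    for _ in range(hops):
        i = rng.choice(graph[i])
    return i

def walk(pi, graph, start, steps, jump_prob, seed=0):
    """Random walk that jumps with probability jump_prob and takes an MH step otherwise."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        cur = path[-1]
        if rng.random() < jump_prob:
            path.append(jump(graph, cur, rng))
        else:
            path.append(mh_step(pi, graph, cur, rng))
    return path

# Path graph 0-1-2-3 with most sampling weight on node 0 (hypothetical values).
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pi = {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1}

leave_heavy_node = mh_accept_prob(pi, graph, 0, 1)   # small: the walk tends to stay at 0
enter_heavy_node = mh_accept_prob(pi, graph, 1, 0)
path = walk(pi, graph, start=0, steps=200, jump_prob=0.1, seed=1)
```

The low acceptance probability for leaving the heavily weighted node is a toy version of the entrapment effect; jumps use only neighbor-to-neighbor hops, so each node still needs only local information.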
5. Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
ArXiv ID: 2604.12391
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, Anbang Yao
Abstract: In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.
Comment: Accelerates training across an entire model family by chaining models and transferring parameters and features from smaller to larger pretrained models.
Topic Match: This is primarily a large-scale training cost reduction method, with a nonstandard family-level pretraining strategy rather than a task application.
Relevance: 8 Novelty: 8
6. ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems
ArXiv ID: 2604.11943
Primary Topic: Efficiency, Compression, and Large-Scale Training
Also Matches: Memory Structures and Agent Memory Systems
Authors: Daeyeon Son
Abstract: An OS kernel that runs LLM inference internally can read logit distributions before any text is generated -- and act on them as a governance primitive. I present ProbeLogits, a kernel-level operation that performs a single forward pass and reads specific token logits to classify agent actions as safe or dangerous, with zero learned parameters. On a 260-prompt OS action benchmark (9 categories including adversarial attacks), ProbeLogits achieves F1=0.980, Precision=1.000, and Recall=0.960 using a general-purpose 7B model at 4-bit quantization. On ToxicChat (1,000 human-annotated real conversations), it achieves F1=0.790 at default calibration strength $\alpha$=1.0, improving to F1=0.837 at $\alpha$=0.5 -- 89% of Llama Guard 3's F1~0.939 with zero learned parameters. A key design contribution is the calibration strength $\alpha$, which serves as a deployment-time policy knob rather than a learned hyperparameter. By adjusting $\alpha$, the OS can enforce strict policies for privileged operations ($\alpha \geq 0.8$, maximizing recall) or relaxed policies for conversational agents ($\alpha$=0.5, maximizing precision). Contextual calibration improves accuracy from 64.8% to 97.3% on the custom benchmark. I implement ProbeLogits within Anima OS, a bare-metal x86_64 OS written in 80,400 lines of Rust. Because agent actions must pass through kernel-mediated host functions, ProbeLogits enforcement operates below the WASM sandbox boundary, making it significantly harder to circumvent than application-layer classifiers. Each classification costs 65ms on 7B -- fast enough for per-action governance. I also show that treating KV cache as process state enables checkpoint, restore, and fork operations analogous to traditional process management. To my knowledge, no prior system exposes LLM logit vectors as OS-level governance primitives.
Comment: Treats KV cache as process state and exposes kernel-level logit probing as new systems primitives for LLM inference.
Topic Match: The strongest fit is systems design for LLM inference and process management, especially the explicit OS-level treatment of KV cache state and inference primitives.
Relevance: 8 Novelty: 8
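Illustrative sketch (assumptions throughout): one way to picture a logit probe with a calibration strength is to read the logits of two verdict tokens after a single forward pass and subtract alpha times the logits the model produces for a content-free prompt. This subtraction form, the token ids, and the numbers are hypothetical; the abstract does not spell out the exact calibration formula.

```python
def probe_decision(logits, baseline_logits, alpha, safe_id, danger_id):
    """Compare two verdict-token logits after removing alpha * baseline bias."""
    safe = logits[safe_id] - alpha * baseline_logits[safe_id]
    danger = logits[danger_id] - alpha * baseline_logits[danger_id]
    return "dangerous" if danger > safe else "safe"

SAFE_ID, DANGER_ID = 0, 1          # hypothetical token ids for the two verdicts

logits = [2.0, 1.5]                # logits for the action being classified
baseline = [1.5, 0.0]              # logits for a content-free prompt: biased toward "safe"

uncalibrated = probe_decision(logits, baseline, 0.0, SAFE_ID, DANGER_ID)
relaxed = probe_decision(logits, baseline, 0.2, SAFE_ID, DANGER_ID)
strict = probe_decision(logits, baseline, 1.0, SAFE_ID, DANGER_ID)
```

In this toy setup a larger alpha removes more of the model's prior toward "safe", so borderline actions flip to "dangerous"; that is the sense in which alpha acts as a deployment-time policy knob.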
Representation Learning Theory and Structure (5)
1. Latent Planning Emerges with Scale
ArXiv ID: 2604.12493
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Michael Hanna, Emmanuel Ameisen
Abstract: LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like "accountant", and cause them to output "an" rather than "a"; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models' planning abilities grow with scale.
Comment: Provides mechanistic evidence that latent planning features emerge with model scale and shape earlier token choices.
Topic Match: The paper is most about internal representational structure—planning-related features in activation space—and how they causally support generation.
Relevance: 8 Novelty: 8
2. Loop Corrections to the Training and Generalization Errors of Random Feature Models
ArXiv ID: 2604.12827
Primary Topic: Representation Learning Theory and Structure
Authors: Taeyoung Kim
Abstract: We investigate random feature models in which neural networks sampled from a prescribed initialization ensemble are frozen and used as random features, with only the readout weights optimized. Adopting a statistical-physics viewpoint, we study the training, test, and generalization errors beyond the mean-kernel approximation. Since the predictor is a nonlinear functional of the induced random kernel, the ensemble-averaged errors depend not only on the mean kernel but also on higher-order fluctuation statistics. Within an effective field-theoretic framework, these finite-width contributions naturally appear as loop corrections. We derive the loop corrections to the training, test, and generalization errors, obtain their scaling laws, and support the theory with experimental verification.
Comment: Derives finite-width loop corrections beyond the mean-kernel approximation for training and generalization errors in random feature models.
Topic Match: The main contribution is theoretical understanding of feature-based models and finite-width effects on generalization, a direct match to representation-learning structure and theory.
Relevance: 8 Novelty: 8
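Illustrative sketch (assumptions throughout): the setting in the abstract, frozen random features with only the readout optimized, reduces to ridge regression on the feature matrix. The tanh features, the width of 8, the ridge strength, and the sine target are hypothetical choices for a small runnable example; the loop corrections themselves are not computed here.

```python
import math
import random

def random_features(x, weights, biases):
    """Frozen random features of a scalar input (one hidden layer, tanh)."""
    return [math.tanh(w * x + b) for w, b in zip(weights, biases)]

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_readout(features, targets, ridge):
    """Ridge regression on the readout weights only: (F^T F + ridge I) w = F^T y."""
    width = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) + (ridge if i == j else 0.0)
             for j in range(width)] for i in range(width)]
    rhs = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(width)]
    return solve(gram, rhs)

rng = random.Random(0)
width = 8
weights = [rng.gauss(0.0, 1.0) for _ in range(width)]   # sampled once, then frozen
biases = [rng.gauss(0.0, 1.0) for _ in range(width)]

xs = [i / 10.0 for i in range(-10, 11)]
ys = [math.sin(2.0 * x) for x in xs]
feats = [random_features(x, weights, biases) for x in xs]

readout = fit_readout(feats, ys, ridge=1e-3)
preds = [sum(w * f for w, f in zip(readout, row)) for row in feats]
train_mse = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)
baseline_mse = sum(y ** 2 for y in ys) / len(ys)        # error of the zero readout
```

Repeating this over many independent draws of `weights` and `biases` and averaging the errors gives the ensemble-averaged quantities whose finite-width behavior the paper studies.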
3. A Bayesian Perspective on the Role of Epistemic Uncertainty for Delayed Generalization in In-Context Learning
ArXiv ID: 2604.12434
Primary Topic: Representation Learning Theory and Structure
Authors: Abdessamed Qchohi, Simone Rossi
Abstract: In-context learning enables transformers to adapt to new tasks from a few examples at inference time, while grokking highlights that this generalization can emerge abruptly only after prolonged training. We study task generalization and grokking in in-context learning using a Bayesian perspective, asking what enables the delayed transition from memorization to generalization. Concretely, we consider modular arithmetic tasks in which a transformer must infer a latent linear function solely from in-context examples and analyze how predictive uncertainty evolves during training. We combine approximate Bayesian techniques to estimate the posterior distribution and we study how uncertainty behaves across training and under changes in task diversity, context length, and context noise. We find that epistemic uncertainty collapses sharply when the model groks, making uncertainty a practical label-free diagnostic of generalization in transformers. Additionally, we provide theoretical support with a simplified Bayesian linear model, showing that asymptotically both delayed generalization and uncertainty peaks arise from the same underlying spectral mechanism, which links grokking time to uncertainty dynamics.
Comment: Links delayed generalization in in-context learning to epistemic uncertainty collapse, with Bayesian analysis explaining grokking-like transitions.
Topic Match: Its core contribution is mechanistic and theoretical analysis of how uncertainty evolves during in-context learning and grokking, not an application benchmark.
Relevance: 8 Novelty: 8
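Illustrative sketch (assumptions throughout): a common way to estimate epistemic uncertainty from an approximate posterior is the gap between the entropy of the averaged prediction and the average entropy of the individual predictions. The sketch treats a handful of posterior samples (for example ensemble members) as given; which approximate Bayesian techniques the paper combines is not specified in the abstract, so this estimator is an assumed stand-in.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def epistemic_uncertainty(member_probs):
    """Entropy of the mean prediction minus the mean entropy of the members."""
    n = len(member_probs)
    mean = [sum(col) / n for col in zip(*member_probs)]
    return entropy(mean) - sum(entropy(p) for p in member_probs) / n

# Posterior samples that agree (after generalization) vs. disagree (before it).
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[1.0, 0.0], [0.0, 1.0]]

low = epistemic_uncertainty(agree)
high = epistemic_uncertainty(disagree)
```

The quantity needs no labels, which is why a sharp drop in it can serve as a label-free signal of the kind the abstract describes.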
4. Information-Geometric Decomposition of Generalization Error in Unsupervised Learning
ArXiv ID: 2604.12340
Primary Topic: Representation Learning Theory and Structure
Authors: Gilhan Kim
Abstract: We decompose the Kullback--Leibler generalization error (GE) -- the expected KL divergence from the data distribution to the trained model -- of unsupervised learning into three non-negative components: model error, data bias, and variance. The decomposition is exact for any e-flat model class and follows from two identities of information geometry: the generalized Pythagorean theorem and a dual e-mixture variance identity. As an analytically tractable demonstration, we apply the framework to $\epsilon$-PCA, a regularized principal component analysis in which the empirical covariance is truncated at rank $N_K$ and discarded directions are pinned at a fixed noise floor $\epsilon$. Although rank-constrained $\epsilon$-PCA is not itself e-flat, it admits a technical reformulation with the same total GE on isotropic Gaussian data, under which each component of the decomposition takes closed form. The optimal rank emerges as the cutoff $\lambda_{\mathrm{cut}}^{*} = \epsilon$ -- the model retains exactly those empirical eigenvalues exceeding the noise floor -- with the cutoff reflecting a marginal-rate balance between model-error gain and data-bias cost. A boundary comparison further yields a three-regime phase diagram -- retain-all, interior, and collapse -- separated by the lower Marchenko--Pastur edge and an analytically computable collapse threshold $\epsilon_{*}(\alpha)$, where $\alpha$ is the dimension-to-sample-size ratio. All claims are verified numerically.
Comment: Provides an exact information-geometric decomposition of unsupervised generalization error into model error, data bias, and variance.
Topic Match: This is foundational theory for unsupervised representation learning structure and generalization, not a downstream application.
Relevance: 8 Novelty: 8
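Illustrative sketch (assumptions throughout): the cutoff rule stated in the abstract, keep the empirical eigenvalues above the noise floor and pin the rest at epsilon, can be written directly on a list of eigenvalues. The function name and the example spectrum are hypothetical, and the sketch covers only the spectrum rule, not the error decomposition or the phase diagram.

```python
def epsilon_pca_spectrum(empirical_eigenvalues, epsilon):
    """Return (retained rank, model spectrum) under the cutoff at the noise floor."""
    rank = sum(1 for lam in empirical_eigenvalues if lam > epsilon)
    spectrum = [lam if lam > epsilon else epsilon for lam in empirical_eigenvalues]
    return rank, spectrum

eigenvalues = [2.5, 1.2, 0.4, 0.1]      # empirical covariance eigenvalues (toy values)
rank, spectrum = epsilon_pca_spectrum(eigenvalues, epsilon=0.5)
```

Raising epsilon lowers the retained rank, and lowering it keeps more directions.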
5. SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
ArXiv ID: 2604.12273
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Yexiong Lin, Jia Shi, Shanshan Ye, Wanyu Wang, Yu Yao, Tongliang Liu
Abstract: Flow matching has emerged as a powerful generative framework, with recent few-step methods achieving remarkable inference acceleration. However, we identify a critical yet overlooked limitation: these models suffer from severe diversity degradation, concentrating samples on dominant modes while neglecting rare but valid variations of the target distribution. We trace this degradation to averaging distortion: when trained with MSE objectives, class-conditional flows learn a frequency-weighted mean over intra-class sub-modes, causing the model to over-represent high-density modes while systematically neglecting low-density ones. To address this, we propose SubFlow, Sub-mode Conditioned Flow Matching, which eliminates averaging distortion by decomposing each class into fine-grained sub-modes via semantic clustering and conditioning the flow on sub-mode indices. Each conditioned sub-distribution is approximately unimodal, so the learned flow accurately targets individual modes with no averaging distortion, restoring full mode coverage in a single inference step. Crucially, SubFlow is entirely plug-and-play: it integrates seamlessly into existing one-step models such as MeanFlow and Shortcut Models without any architectural modifications. Extensive experiments on ImageNet-256 demonstrate that SubFlow yields substantial gains in generation diversity (Recall) while maintaining competitive image quality (FID), confirming its broad applicability across different one-step generation frameworks. Project page: https://yexionglin.github.io/subflow.
Comment: Diagnoses averaging distortion in one-step flow matching and fixes it with sub-mode conditioning to restore mode coverage.
Topic Match: The main contribution is mechanistic understanding of representation collapse across sub-modes and a structural conditioning fix, making representation structure the best fit.
Relevance: 8 Novelty: 8
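The averaging distortion named in the SubFlow abstract can be illustrated with a toy one-dimensional case. This is only a sketch under stated assumptions, not the paper's method: the target values, the 90/10 split, and the `sub_mode` cluster indices are all hypothetical, and a constant predictor stands in for the learned flow.

```python
import numpy as np

# Hypothetical 1-D targets for one class: a dominant sub-mode at 0.0 (90%)
# and a rare sub-mode at 10.0 (10%).
targets = np.array([0.0] * 9 + [10.0])
sub_mode = np.array([0] * 9 + [1])  # assumed cluster index per sample

# The constant that minimizes mean squared error is the sample mean,
# i.e. the frequency-weighted average of the sub-modes.
class_conditional = targets.mean()

# Conditioning on the sub-mode index gives one MSE-optimal value per sub-mode.
sub_mode_conditional = {
    int(k): float(targets[sub_mode == k].mean()) for k in np.unique(sub_mode)
}

print(class_conditional)       # 1.0
print(sub_mode_conditional)    # {0: 0.0, 1: 10.0}
```

In this toy case the class-conditional optimum sits near the dominant sub-mode and never reaches the rare one, while the sub-mode-conditioned optima land on each mode exactly.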
Memory Structures and Agent Memory Systems (6)
1. Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents
ArXiv ID: 2604.12948
Primary Topic: Memory Structures and Agent Memory Systems
Authors: Benjamin Stern, Peter Nadel
Abstract: LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval-S benchmark (4,575 sessions, 100 recall questions), we compare dual-trace encoding against a fact-only control with matched coverage and format over 99 shared questions. Dual-trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp), with no benefit for single-session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual-trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual-trace encoding to coding agents, with preliminary pilot validation.
Comment: Dual-trace memory encoding pairs facts with contextual scene traces to improve temporal and cross-session recall in agents.
Topic Match: This is directly about a new memory encoding principle for persistent agent recall, not generic RAG or context management.
Relevance: 9 Novelty: 8
2. Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations
ArXiv ID: 2604.12376
Primary Topic: Memory Structures and Agent Memory Systems
Also Matches: Efficiency, Compression, and Large-Scale Training
Authors: Ziyang Liu
Abstract: When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
Comment: Replaces evicted conversation segments with compact keyword bookmarks and a recall tool, exposing bookmark discrimination as the key bottleneck in long-horizon memory.
Topic Match: This is centrally about a new memory organization and recall mechanism for long conversations, not generic RAG or chat-history management.
Relevance: 9 Novelty: 8
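The cooperative paging idea in the abstract above (evicted segments replaced by `[pN:keywords]` bookmarks, plus a `recall()` tool) might be sketched as follows. This is a minimal illustration, not the paper's implementation: the `PagedContext` class, the FIFO eviction choice, and the frequency-based keyword heuristic are assumptions made for the example.

```python
from collections import Counter


class PagedContext:
    """Hypothetical pager: FIFO eviction, bookmark placeholders, recall on demand."""

    def __init__(self, max_live_pages=2, n_keywords=3):
        self.max_live_pages = max_live_pages
        self.n_keywords = n_keywords
        self.pages = []  # full text of every page, indexed by page number
        self.live = []   # page numbers still held verbatim in context

    def _bookmark(self, page_no):
        # Assumed heuristic: the most frequent words longer than three characters.
        words = [w.strip(".,?!").lower() for w in self.pages[page_no].split()]
        counts = Counter(w for w in words if len(w) > 3)
        keywords = ",".join(w for w, _ in counts.most_common(self.n_keywords))
        return f"[p{page_no}:{keywords}]"

    def add_page(self, text):
        self.pages.append(text)
        self.live.append(len(self.pages) - 1)
        while len(self.live) > self.max_live_pages:
            self.live.pop(0)  # FIFO: evict the oldest live page

    def render(self):
        """Context shown to the model: bookmarks for evicted pages, text for live ones."""
        return [
            self.pages[i] if i in self.live else self._bookmark(i)
            for i in range(len(self.pages))
        ]

    def recall(self, page_no):
        """Tool the model would call to fetch an evicted page in full."""
        return self.pages[page_no]


ctx = PagedContext(max_live_pages=2)
ctx.add_page("Alice booked flights to Lisbon for the Lisbon summit.")
ctx.add_page("Bob asked about the quarterly budget review.")
ctx.add_page("Carol shared notes on the hiring plan.")
print(ctx.render())
print(ctx.recall(0))
```

Because the bookmark is the only cue the model sees for an evicted page, the keyword heuristic is the part of this sketch most worth replacing; the abstract identifies bookmark discrimination as the remaining bottleneck.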
3. When to Forget: A Memory Governance Primitive
ArXiv ID: 2604.12007
Primary Topic: Memory Structures and Agent Memory Systems
Authors: Baris Simsek
Abstract: Agent memory systems accumulate experience but currently lack a principled operational metric for memory quality governance -- deciding which memories to trust, suppress, or deprecate as the agent's task distribution shifts. Write-time importance scores are static; dynamic management systems use LLM judgment or structural heuristics rather than outcome feedback. This paper proposes Memory Worth (MW): a two-counter per-memory signal that tracks how often a memory co-occurs with successful versus failed outcomes, providing a lightweight, theoretically grounded foundation for staleness detection, retrieval suppression, and deprecation decisions. We prove that MW converges almost surely to the conditional success probability p+(m) = Pr[y_t = +1 | m in M_t] -- the probability of task success given that memory m is retrieved -- under a stationary retrieval regime with a minimum exploration condition. Importantly, p+(m) is an associational quantity, not a causal one: it measures outcome co-occurrence rather than causal contribution. We argue this is still a useful operational signal for memory governance, and we validate it empirically in a controlled synthetic environment where ground-truth utility is known: after 10,000 episodes, the Spearman rank-correlation between Memory Worth and true utilities reaches rho = 0.89 +/- 0.02 across 20 independent seeds, compared to rho = 0.00 for systems that never update their assessments. A retrieval-realistic micro-experiment with real text and neural embedding retrieval (all-MiniLM-L6-v2) further shows stale memories crossing the low-value threshold (MW = 0.17) while specialist memories remain high-value (MW = 0.77) across 3,000 episodes. The estimator requires only two scalar counters per memory unit and can be added to architectures that already log retrievals and episode outcomes.
Comment: Defines Memory Worth, a two-counter online signal for suppressing or deprecating stale memories based on retrieval-conditioned success rates.
Topic Match: The paper proposes a concrete governance primitive for updating, forgetting, and suppressing memories based on outcomes, directly matching memory-system design.
Relevance: 9 Novelty: 8
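The two-counter signal described in the Memory Worth abstract can be sketched as an empirical estimate of p+(m), the success rate given that memory m was retrieved. This is a minimal illustration rather than the paper's estimator: the `MemoryWorth` class, its method names, and the neutral prior of 0.5 for never-retrieved memories are assumptions.

```python
from collections import defaultdict


class MemoryWorth:
    """Hypothetical two-counter tracker: successes and failures per memory id."""

    def __init__(self, prior=0.5):
        self.prior = prior  # assumed value returned before any retrieval is logged
        self.success = defaultdict(int)
        self.failure = defaultdict(int)

    def update(self, retrieved_ids, succeeded):
        """Credit every memory retrieved in an episode with the episode outcome."""
        counter = self.success if succeeded else self.failure
        for memory_id in retrieved_ids:
            counter[memory_id] += 1

    def worth(self, memory_id):
        """Empirical success rate given that the memory was retrieved."""
        total = self.success[memory_id] + self.failure[memory_id]
        if total == 0:
            return self.prior
        return self.success[memory_id] / total


mw = MemoryWorth()
mw.update(["a", "b"], succeeded=True)
mw.update(["a"], succeeded=False)
print(mw.worth("a"), mw.worth("b"), mw.worth("c"))  # 0.5 1.0 0.5
```

As the abstract stresses, this ratio measures co-occurrence with success, not causal contribution, so a memory that is merely retrieved alongside useful ones will also score well.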
4. M$^\star$: Every Task Deserves Its Own Memory Harness
ArXiv ID: 2604.11811
Primary Topic: Memory Structures and Agent Memory Systems
Authors: Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia
Abstract: Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skills reused for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ improves performance over existing fixed-memory baselines robustly across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.
Comment: Evolves executable memory programs so each task gets its own schema, storage logic, and retrieval workflow rather than a fixed memory architecture.
Topic Match: The heart of the paper is automatic discovery of task-specific memory mechanisms, squarely within agent memory systems.
Relevance: 9 Novelty: 8
5. Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness
ArXiv ID: 2604.12811
Primary Topic: Memory Structures and Agent Memory Systems
Also Matches: Representation Learning Theory and Structure
Authors: Madhava Gaikwad
Abstract: Dense Associative Memory (DAM) generalizes Hopfield networks through higher-order interactions and achieves storage capacity that scales as $O(N^{n-1})$ under suitable pattern separation conditions. Existing dynamical analyses primarily study the thermodynamic limit $N\to\infty$ with randomly sampled patterns and therefore do not provide finite-size guarantees or explicit convergence rates. We develop an algorithmic analysis of DAM retrieval dynamics that yields finite-$N$ guarantees under explicit, verifiable pattern conditions. Under a separation assumption and a bounded-interference condition at high loading, we prove geometric convergence of asynchronous retrieval dynamics, which implies $O(\log N)$ convergence time once the trajectory enters the basin of attraction. We further establish adversarial robustness bounds expressed through an explicit margin condition that quantifies the number of corrupted bits tolerable per sweep, and derive capacity guarantees that scale as $\Theta(N^{n-1})$ up to polylogarithmic factors in the worst case, while recovering the classical $\Theta(N^{n-1})$ scaling for random pattern ensembles. Finally, we show that DAM retrieval dynamics admit a potential-game interpretation that ensures convergence to pure Nash equilibria under asynchronous updates. Complete proofs are provided in the appendices, together with preliminary experiments that illustrate the predicted convergence, robustness, and capacity scaling behavior.
Comment: Gives finite-size convergence, robustness, and capacity guarantees for dense associative memory retrieval under explicit pattern conditions.
Topic Match: Dense associative memory analysis is fundamentally about memory retrieval dynamics, capacity, and robustness, making memory systems the best fit.
Relevance: 8 Novelty: 8
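For readers unfamiliar with the asynchronous retrieval dynamics analyzed in the abstract above, the following sketch shows the standard dense associative memory update with the polynomial interaction F(x) = x^n. It is an illustration only: the function name `dam_retrieve`, the choice n = 3, and the Hadamard-row patterns are assumptions, and nothing here reproduces the paper's conditions or bounds.

```python
import numpy as np


def dam_retrieve(patterns, state, n=3, max_sweeps=10):
    """Asynchronous retrieval for a dense associative memory with F(x) = x**n.

    patterns: (K, N) array of +/-1 stored patterns.
    state:    (N,) array of +/-1 values, the (possibly corrupted) query.
    """
    state = state.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(state.size):
            # Overlap of every pattern with the state, excluding bit i.
            rest = patterns @ state - patterns[:, i] * state[i]
            # Positive when setting bit i to +1 gives lower energy than -1.
            field = np.sum((rest + patterns[:, i]) ** n - (rest - patterns[:, i]) ** n)
            new_bit = 1 if field >= 0 else -1
            if new_bit != state[i]:
                state[i] = new_bit
                changed = True
        if not changed:  # fixed point reached
            break
    return state


# Assumed test patterns: four mutually orthogonal rows of a 16 x 16 Hadamard matrix.
h2 = np.array([[1, 1], [1, -1]])
hadamard = np.kron(np.kron(h2, h2), np.kron(h2, h2))
patterns = hadamard[1:5]

query = patterns[0].copy()
query[[0, 5]] *= -1  # corrupt two bits
recovered = dam_retrieve(patterns, query)
print(np.array_equal(recovered, patterns[0]))  # True
```

With orthogonal patterns and only two corrupted bits, the stored pattern's contribution to each bit's field outweighs the combined interference of the other three, so a single sweep repairs the query.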
6. Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents
ArXiv ID: 2604.12129
Primary Topic: Memory Structures and Agent Memory Systems
Also Matches: Efficiency, Compression, and Large-Scale Training
Authors: Swanand Rao, Kiran Kashalkar, Parvathi Somashekar, Priya Krishnan
Abstract: The transition from stateless model inference to stateful agentic execution is reshaping the systems assumptions underlying modern AI infrastructure. While large language models have made persistent, tool-using, and collaborative agents technically viable, existing runtime architectures remain constrained by materialization-heavy instantiation models that impose significant latency and memory overhead. This paper introduces Aethon, a reference-based replication primitive for near-constant-time instantiation of stateful AI agents. Rather than reconstructing agents as fully materialized objects, Aethon represents each instance as a compositional view over stable definitions, layered memory, and local contextual overlays. By shifting instantiation from duplication to reference, Aethon decouples creation cost from inherited structure. We present the conceptual framework, system architecture, and memory model underlying Aethon, including layered inheritance and copy-on-write semantics. We analyze its implications for complexity, scalability, multi-agent orchestration, and enterprise governance. We argue that reference-based instantiation is not merely an optimization, but a more appropriate systems abstraction for production-scale agentic software. Aethon points toward a new class of AI infrastructure in which agents become lightweight, composable execution identities that can be spawned, specialized, and governed at scale.
Comment: Introduces reference-based replication with layered memory and copy-on-write semantics for near-constant-time instantiation of stateful agents.
Topic Match: The core idea is a new memory and state organization principle for spawning and managing persistent agents.
Relevance: 8 Novelty: 8
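The layered, reference-based instantiation described in the Aethon abstract resembles what Python's standard `collections.ChainMap` provides for dictionaries. The sketch below is an analogy under that assumption, not Aethon's design: the `spawn` helper and all keys and values are hypothetical.

```python
from collections import ChainMap

# Stable definition and shared memory layers (hypothetical contents).
base_definition = {"role": "support-agent", "tools": ("search", "ticket")}
shared_memory = {"policy_version": 3}

template = ChainMap(shared_memory, base_definition)


def spawn(parent):
    """Create an instance as an empty overlay that references the parent's layers."""
    return parent.new_child()


agent_a = spawn(template)
agent_b = spawn(template)

# Writes land in the instance's own overlay; inherited layers stay untouched.
agent_a["policy_version"] = 4

print(agent_a["policy_version"], agent_b["policy_version"])  # 4 3
print(agent_a["role"])  # support-agent
```

Spawning copies only the list of layer references, so its cost depends on the number of layers rather than on how much state they hold. One caveat of this analogy: the isolation applies to top-level keys only, so mutating a shared nested object in place would be visible to every instance.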
World Models, Exploration, and Open-Ended Reinforcement Learning (1)
1. Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
ArXiv ID: 2604.12086
Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning
Authors: Zixuan Liu, Xiaolin Sun, Zizhan Zheng
Abstract: Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst-case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst-case returns, and offer improved robustness and stability across different levels of proxy-true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain. The code is available at https://github.com/ZixuanLiu4869/reward_hacking.
Comment: Formulates reward hacking under imperfect proxies as robust max-min optimization over all correlated proxy rewards.
Topic Match: Although not about world models, it is a foundational RL paper on robust policy optimization under uncertain rewards rather than LLM post-training.
Relevance: 8 Novelty: 8
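The max-min structure mentioned in the abstract above can be illustrated with a finite toy problem. The paper optimizes over the space of all r-correlated proxy rewards; the sketch below replaces that space with three made-up candidate reward functions and three candidate policies, so every number in it is a hypothetical assumption.

```python
import numpy as np

# Hypothetical returns: rows are candidate policies, columns are candidate
# proxy-consistent reward functions (all numbers are made up for illustration).
returns = np.array([
    [9.0, 1.0, 2.0],  # policy 0: excellent under one reward, poor otherwise
    [5.0, 4.0, 6.0],  # policy 1: solid under every reward
    [7.0, 3.0, 3.0],  # policy 2
])

worst_case = returns.min(axis=1)              # inner min over rewards
robust_policy = int(worst_case.argmax())      # outer max over policies
nominal_policy = int(returns[:, 0].argmax())  # optimizing a single fixed proxy

print(worst_case, robust_policy, nominal_policy)
```

Optimizing the single fixed proxy in column 0 selects policy 0, whose worst-case return is the lowest of the three, while the max-min choice is policy 1.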
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Relevant Topics
Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.
Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.
Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.
Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.
Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.
Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.
World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.
Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains
Scoring Criteria
Relevance and Novelty are independent axes. Score both from 1 to 10.
Relevance Scoring
- 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
- 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
- 5-6: touches the target topics, but the main contribution is elsewhere.
- 3-4: largely outside the target topics, often application-focused or domain-specific.
- 1-2: unrelated.
Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.
Novelty Scoring
- 9-10: new paradigm, theory, or major methodological breakthrough.
- 7-8: substantial methodological advance or strong new insight.
- 5-6: meaningful but incremental extension or refinement.
- 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
- 1-2: little originality; mainly standard application of existing methods.
Topic Registry
Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.
Papers
[PAPER LIST HERE]
Instructions
Respond in JSONL. Output exactly one JSON object per paper, one per line:
{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}
Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
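A downstream consumer of this JSONL output might check each line against the rules above before using it. The sketch below is one possible validator, not part of the prompt or of any known pipeline: the function name `validate_line` and the error messages are assumptions, while the key names, topic IDs, and tag set are taken from the prompt text.

```python
import json

TOPIC_IDS = {
    "architecture_training", "efficiency_scaling", "representation_structure",
    "memory_systems", "world_models_open_ended_rl",
}
HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT", "HOTSPOT_PAPER_TAGS",
    "HOTSPOT_PAPER_COMMENT",
}


def validate_line(line):
    """Return a list of rule violations for one JSONL record (empty if valid)."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - set(record)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    errors = []
    for key in ("RELEVANCE", "NOVELTY"):
        value = record[key]
        if type(value) is not int or not 1 <= value <= 10:
            errors.append(f"{key} must be an integer from 1 to 10")
    if record["PRIMARY_TOPIC_ID"] not in TOPIC_IDS:
        errors.append("unknown PRIMARY_TOPIC_ID")
    if not set(record["MATCHED_TOPIC_IDS"]) <= TOPIC_IDS:
        errors.append("unknown entry in MATCHED_TOPIC_IDS")
    tags = set(record["HOTSPOT_PAPER_TAGS"])
    if not tags <= HOTSPOT_TAGS:
        errors.append("unknown entry in HOTSPOT_PAPER_TAGS")
    if not tags and record["HOTSPOT_PAPER_COMMENT"] != "":
        errors.append("HOTSPOT_PAPER_COMMENT must be empty when no tags are set")
    return errors


good = json.dumps({
    "ARXIVID": "2604.12007", "COMMENT": "Two-counter memory governance signal.",
    "RELEVANCE": 9, "NOVELTY": 8, "PRIMARY_TOPIC_ID": "memory_systems",
    "MATCHED_TOPIC_IDS": [], "TOPIC_MATCH_COMMENT": "Memory governance primitive.",
    "HOTSPOT_PAPER_TAGS": [], "HOTSPOT_PAPER_COMMENT": "",
})
bad = good.replace('"RELEVANCE": 9', '"RELEVANCE": 0')
print(validate_line(good))  # []
print(validate_line(bad))   # ['RELEVANCE must be an integer from 1 to 10']
```

The placeholder values in the template line (`"RELEVANCE":0`, `"NOVELTY":0`) would fail this check, which is intended: they mark where the model must substitute a score from 1 to 10.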