Personalized Daily ArXiv Papers 2026-06-04

Model	Metric	Usage			Papers
Model	Metric	Prompt	Completion	Total	Total arXiv	Scanned	Relevant
`gpt-5.4`	Tokens	173454	25482	198936	776	443	16
`gpt-5.4`	Cost	$0.43	$0.38	$0.82	776	443	16

Topic Coverage:

Topic	Papers
Architecture and Training Dynamics	6
Efficiency, Compression, and Large-Scale Training	4
Representation Learning Theory and Structure	1
Memory Structures and Agent Memory Systems	2
World Models, Exploration, and Open-Ended Reinforcement Learning	3

Table of contents by topic:

Architecture and Training Dynamics (6)

Edge of Stability Selectively Shapes Learning Across the Data Distribution Authors: Shauna Kwag, Anakha Ganesh, Tomaso Poggio, Pierfrancesco Beneventano
A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks Authors: Tian Ding, Dawei Li, Ruoyu Sun
Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent Authors: Ahanaf Hasan Ariq
Adaptive Patching Is Harder Than It Looks For Time-Series Forecasting Authors: Federico Zucchi, Yi Xie, Chao Zhang, Keyuan Luo, Thomas Lampert, Ziyue Li
Flatness and Generalization: Learning Multi-Index Models with Homogeneous Neural Networks Authors: Harsh Vardhan, Hossein Taheri, Arya Mazumdar
Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability Authors: Vincent Bürgin, Daniel Herbst, Ya-Wei Eileen Lin, Stefanie Jegelka

Efficiency, Compression, and Large-Scale Training (4)

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection Authors: Liulu He, XuanAng Liu, Juntao Liu, Taolue Feng, Ting Lu, Chunsheng Gan, Zhiyv Peng, Yuan Du, Huanrui Yang, Yijiang Liu, Li Du
AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization Authors: Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, T. Konstantin Rusch, Shiwei Liu
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding Authors: Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, Yongjoo Park
Near-Optimal Decentralized Stochastic Convex Optimization over Networks Authors: Nitai Kluger, Amit Attia, Tomer Koren

Representation Learning Theory and Structure (1)

Bayes-Sufficient Representations in Supervised Learning Authors: Vasileios Sevetlidis

Memory Structures and Agent Memory Systems (2)

SaliMory: Orchestrating Cognitive Memory for Conversational Agents Authors: Kai Zhang, Xinyuan Zhang, Hongda Jiang, Shiun-Zu Kuo, Hyokun Yun, Ejaz Ahmed, Shereen Oraby, Ziyun Li, Sanat Sharma, Ann Lee, Ahmed A Aly, Anuj Kumar, Raffay Hamid, Xin Luna Dong
Cartridges at Scale: Training Modular KV Caches over Large Document Collections Authors: Momchil Hardalov, Gonzalo Iglesias, Adrià de Gispert

World Models, Exploration, and Open-Ended Reinforcement Learning (3)

What Type of Inference is Active Inference? Authors: Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries
From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments Authors: Saket Tiwari, Tejas Kotwal, George Konidaris
A Goal-Set Characterization of Task Composition in the Boolean Task Algebra Authors: Eduardo Terrés-Caballero, Herke van Hoof

Architecture and Training Dynamics (6)

1. Edge of Stability Selectively Shapes Learning Across the Data Distribution

ArXiv ID: 2606.04212

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Shauna Kwag, Anakha Ganesh, Tomaso Poggio, Pierfrancesco Beneventano

Abstract: Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.

Comment: Shows edge-of-stability selectively reallocates learning across training-data groups via Hessian-aligned gradients and saturation effects.

Topic Match: This is a direct training-dynamics paper explaining how optimization stability shapes where learning occurs in the data distribution.

Relevance: 9 Novelty: 8
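
The stability constraint this abstract refers to can be illustrated on the simplest possible case. For plain gradient descent on a one-dimensional quadratic with curvature (sharpness) `sharpness`, the iterates shrink only while the step size stays below `2 / sharpness`. The Python sketch below is a toy illustration of that threshold only; the function name and the constants are assumptions chosen for the example and are not taken from the paper, which studies how this constraint affects different data groups.

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2; return |x| after `steps`."""
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # the gradient of f at x is sharpness * x
    return abs(x)


sharpness = 10.0                 # illustrative curvature, not from the paper
threshold = 2.0 / sharpness      # classical step-size stability limit

below = gd_on_quadratic(sharpness, lr=0.9 * threshold)  # contracts toward 0
above = gd_on_quadratic(sharpness, lr=1.1 * threshold)  # oscillates and grows

print(f"lr just below 2/sharpness: |x| = {below:.2e}")
print(f"lr just above 2/sharpness: |x| = {above:.2e}")
```

Each step multiplies `x` by `1 - lr * sharpness`, so the iterate contracts exactly when that factor has magnitude below one.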

2. A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks

ArXiv ID: 2606.04327

Primary Topic: Architecture and Training Dynamics

Authors: Tian Ding, Dawei Li, Ruoyu Sun

Abstract: We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth activation functions. We focus on the phenomenon of "neuron splitting" where duplicating a hidden neuron yields an affine set of stationary points in a wider network. We provide a comprehensive classification of all stationary points on these plateaus, determining under what conditions they constitute local minima or saddle points. Our characterization hinges on a per-neuron curvature object we term the "inner Hessian" matrix. Our analysis reveals that the definiteness of the inner Hessian and the choice of splitting coefficients jointly dictate the local geometry of the plateau. We show that "splitting" a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau, with a concrete sure-saddle region identified under mild assumptions. In contrast, splitting a saddle point always produces a plateau of saddle points. Our results unify and extend prior landscape analyses, elucidating when and how model expansion preserves or alters the nature of stationary points. These findings offer new geometric insights into the effects of width expansion and reparameterization in neural networks.

Comment: Geometric characterization of stationary plateaus from neuron splitting via an inner-Hessian analysis.

Topic Match: It gives foundational theory about loss-landscape geometry under width expansion and reparameterization.

Relevance: 9 Novelty: 8

3. Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent

ArXiv ID: 2606.04031

Primary Topic: Architecture and Training Dynamics

Authors: Ahanaf Hasan Ariq

Abstract: Coupled gradient descent, where the update of one parameter block depends on another, underlies bilevel optimization, two-time-scale stochastic approximation, and adversarial training. When the coupled Jacobian is block-triangular, asymptotic stability is governed by the spectral radii of the diagonal blocks, yet transient amplification before convergence can be arbitrarily large due to non-normality. We develop a sharp pseudospectral theory for such block-triangular Jacobians, proving that the Kreiss constant satisfies $K(J) \leq 2/(1-\gamma) + |C|/(4(1-\gamma))$ when the diagonal blocks are symmetric with spectral radii at most $\gamma < 1$, and we establish matching minimax lower bounds. We characterize the critical coupling threshold for spectral instability and extend the analysis to nearly self-referential systems via a Neumann-series perturbation framework. As a consequence, we obtain a finite-horizon iteration-complexity bound of $O(K(J)^2 \log(1/\delta))$ for stochastic coupled descent. Framed as scaling laws for non-stationary two-time-scale optimization, our results expose a non-asymptotic, instance-dependent regime of high-dimensional learning dynamics that is invisible to spectral-radius analysis. Experiments on linear-quadratic problems, IQC-based comparisons, and neural-network training confirm the theory.

Comment: Pseudospectral theory for transient amplification in coupled gradient descent beyond spectral-radius stability analysis.

Topic Match: This is directly about optimization dynamics and training stability in coupled learning systems.

Relevance: 8 Novelty: 8
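
To make the gap between asymptotic stability and transient amplification concrete, the sketch below builds a 2x2 upper-triangular matrix whose diagonal entries (the trivial 1x1 "blocks") equal `gamma < 1`, then tracks the spectral norm of its powers. The matrix, the coupling value, and the helper name `transient_peak` are illustrative assumptions; the code does not compute the Kreiss constant or reproduce any bound from the paper.

```python
import numpy as np


def transient_peak(J, horizon=200):
    """Return the largest spectral norm of J**k for k = 0..horizon."""
    power = np.eye(J.shape[0])
    peak = 1.0  # norm of J**0
    for _ in range(horizon):
        power = power @ J
        peak = max(peak, np.linalg.norm(power, 2))
    return peak


gamma, coupling = 0.9, 5.0            # illustrative values only
J = np.array([[gamma, coupling],
              [0.0,   gamma]])

spectral_radius = max(abs(np.linalg.eigvals(J)))
peak = transient_peak(J)
final = np.linalg.norm(np.linalg.matrix_power(J, 200), 2)

print(f"spectral radius = {spectral_radius:.2f}")
print(f"peak ||J^k||    = {peak:.2f}")
print(f"||J^200||       = {final:.2e}")
```

The powers eventually decay because the spectral radius is below one, but the off-diagonal entry of `J**k` grows like `k * coupling * gamma**(k - 1)` first, which is the non-normal transient that a spectral-radius analysis does not see.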

4. Adaptive Patching Is Harder Than It Looks For Time-Series Forecasting

ArXiv ID: 2606.04074

Primary Topic: Architecture and Training Dynamics

Authors: Federico Zucchi, Yi Xie, Chao Zhang, Keyuan Luo, Thomas Lampert, Ziyue Li

Abstract: Adaptive patching is a recent and compelling proposal for time-series Transformers: allocate finer patches where the sequence looks locally informative. This paper asks under what conditions a content-adaptive patching operator should outperform a tuned uniform one. Local heterogeneity alone is not enough: under pointwise forecasting losses, a complex-looking region is not automatically one where finer patching reduces the loss. We model patching as a budgeted bitrate allocation and derive an explicit threshold that a dynamic patching rule must satisfy to beat a well-tuned uniform baseline, then bound the achievable improvement both locally (a quadratic surrogate) and globally (a strong-convexity bound under the model's assumptions). Two structural results follow: without a coupling constraint, scalar local complexity cannot produce a non-uniform optimum under a common loss landscape; and once the backbone is trained to its representation-aware optimum, the alignment gain collapses around a well-tuned uniform patch size. To test these predictions, we run a controlled isolation study on three representative architectures, replacing each adaptive mechanism with a uniform patch-size sweep while keeping the backbone, data, and training protocol fixed. On standard long-horizon forecasting benchmarks, the validation-selected uniform baseline is competitive with the dynamic counterpart, with per-setting effects concentrated near zero and no consistent directional advantage once results are aggregated by dataset. The larger gains we do observe are method- and dataset-specific. Adaptive patching should therefore be evaluated against a tuned uniform baseline; its value depends on whether a cheap and reliable routing signal can identify where finer patches actually reduce forecasting loss.

Comment: Theoretical conditions showing when adaptive patching can or cannot beat tuned uniform patching.

Topic Match: This is directly about an architectural/computational mechanism and when it changes learning or inference behavior.

Relevance: 8 Novelty: 8

5. Flatness and Generalization: Learning Multi-Index Models with Homogeneous Neural Networks

ArXiv ID: 2606.04429

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Harsh Vardhan, Hossein Taheri, Arya Mazumdar

Abstract: A common heuristic used to explain the generalization of first-order gradient methods on non-convex neural networks is that "flat interpolators generalize well" (Hochreiter and Schmidhuber, 1994; Keskar et al., 2017), where flatness can be measured by the trace of the Hessian of the empirical loss. However, Dinh et al. (2017) showed that, using symmetry of the network that can change flatness while keeping the population and empirical losses unchanged, any interpolator can be made sharper or flatter. This result makes the earlier heuristic statement vacuous. In this paper, we show that for learning an unknown multi-index model with $2$-layer non-convex homogeneous neural networks, there is a connection between flatness and generalization, despite the existence of symmetries. This connection pertains to the "flattest" interpolators, i.e., the interpolators that have orderwise minimum flatness among all interpolators. First, we show that there exists a natural class of non-generalizing interpolators whose flatness cannot be made closer to the flattest possible, even using symmetries. Second, we show that for data generated by a sum of single-index models, if the approximation error and label noise are low, any flattest interpolator achieves small population loss, i.e., the flattest interpolators always generalize. This establishes a direct link between flatness and generalization which applies to a large class of activations and realistic data distributions.

Comment: Shows that flattest interpolators in homogeneous neural networks provably generalize in multi-index learning settings.

Topic Match: It squarely studies training dynamics and generalization through flatness in non-convex neural networks, making Architecture and Training Dynamics the strongest fit.

Relevance: 8 Novelty: 8
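
The rescaling symmetry mentioned in this abstract can be checked by hand on a one-neuron toy model. Assume the prediction is `a * w * x` (a two-layer homogeneous network restricted to the region where the activation is linear) with squared loss. At an interpolator the residual is zero, so the Hessian with respect to `(a, w)` is the outer product of the prediction's gradient `(w * x, a * x)`, and its trace is `(w * x)**2 + (a * x)**2`. The sketch below is only this toy case, with hypothetical names and values; it is not the construction used in the paper.

```python
def hessian_trace(a, w, x=1.0):
    """Trace of the Hessian of 0.5 * (a*w*x - y)**2 w.r.t. (a, w) at an interpolator."""
    # With zero residual the Hessian is grad * grad^T, where grad = (w*x, a*x).
    return (w * x) ** 2 + (a * x) ** 2


def rescale(a, w, c):
    """Symmetry (a, w) -> (a / c, c * w): the product a * w is unchanged."""
    return a / c, c * w


a, w = 1.0, 2.0  # illustrative interpolator of the target y = 2 at x = 1
traces = {}
for c in (0.1, 1.0, 10.0):
    a_c, w_c = rescale(a, w, c)
    assert abs(a_c * w_c - a * w) < 1e-12  # same function, same loss
    traces[c] = hessian_trace(a_c, w_c)

print(traces)
```

The function value never changes, yet the trace moves by orders of magnitude across the three rescalings, which is why statements about flatness need to be made relative to the flattest point reachable under the symmetry.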

6. Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability

ArXiv ID: 2606.04754

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Vincent Bürgin, Daniel Herbst, Ya-Wei Eileen Lin, Stefanie Jegelka

Abstract: Many striking phenomena in deep learning, such as linear mode connectivity and the structured behavior of training dynamics, are closely tied to parameter symmetries: transformations that leave the realized function unchanged. Despite growing attention to parameter symmetries, the exact interplay between parameters, data, and representations remains underexplored. To investigate this, we develop a theoretical framework of effective function classes, i.e., the set of functions a neuron can realize on its input support, and the norm cost of realizing them. We then formalize effective symmetry breaking via neuron identifiability across independent training runs. Our analysis shows that neural networks can admit large families of approximately equivalent solutions even in structurally asymmetric models. We further show that neuron identifiability enables representation merging without prior alignment, and characterize when such merging admits a linear low-loss path. These findings highlight the role of effective function classes in affecting the loss landscape.

Comment: Links linear mode connectivity to neuron identifiability through effective function classes beyond explicit structural symmetries.

Topic Match: The paper is fundamentally about loss landscapes, symmetry breaking, and representation alignment across training runs, making Architecture and Training Dynamics the best primary fit.

Relevance: 8 Novelty: 8

Efficiency, Compression, and Large-Scale Training (4)

1. LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

ArXiv ID: 2606.04050

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Liulu He, XuanAng Liu, Juntao Liu, Taolue Feng, Ting Lu, Chunsheng Gan, Zhiyv Peng, Yuan Du, Huanrui Yang, Yijiang Liu, Li Du

Abstract: Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2, 3-bit), resulting in a "deployment gap" where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a "lift-then-project" mechanism which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional "lifted" space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit-width to be tuned quasi-continuously, as the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). While beneficial over VQ, LiftQuant's decoding path relies solely on linear transformations and 1-bit uniform quantizers, retaining its hardware-friendly nature. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and checkpoints are available at https://github.com/Heliulu/LiftQuant.

Comment: Introduces quasi-continuous bit-width control through lift-then-project quantization with hardware-friendly decoding.

Topic Match: Quantization is the core contribution, and the continuous bit-width mechanism is a strong foundational efficiency idea rather than routine compression.

Relevance: 9 Novelty: 8
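
The memory claim in this abstract follows from simple arithmetic, sketched below. The bit-width is the ratio of lifted to original dimension as the abstract states; the specific dimensions 12 and 5 are hypothetical values chosen only because they give 2.4, and the memory figure counts weights alone (no activations, KV cache, or codebook overhead), which is an assumption of this sketch.

```python
def effective_bits(lifted_dim, original_dim):
    """Bits per weight when a 1-bit lattice of `lifted_dim` codes `original_dim` weights."""
    return lifted_dim / original_dim


def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in GB (10**9 bytes); ignores every other memory consumer."""
    return num_params * bits_per_weight / 8 / 1e9


params = 70e9  # the 70B model size quoted in the abstract
sizes = {bits: weight_memory_gb(params, bits) for bits in (2.0, 2.4, 3.0)}

print(effective_bits(12, 5))   # hypothetical dimensions giving 2.4 bits
for bits, gb in sizes.items():
    fits = "fits" if gb <= 24 else "does not fit"
    print(f"{bits:.1f} bits -> {gb:.2f} GB ({fits} in 24 GB)")
```

Under these assumptions a 3-bit model overshoots a 24 GB budget while a 2-bit model leaves several gigabytes unused, which is the gap a fractional bit-width is meant to close.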

2. AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization

ArXiv ID: 2606.04980

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Representation Learning Theory and Structure

Authors: Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, T. Konstantin Rusch, Shiwei Liu

Abstract: Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original training data, and hence the true training distribution, is proprietary and inaccessible. As a result, calibration sets are inevitably imperfect surrogates, and this can misestimate expert utilization and lead to suboptimal bit allocation. Motivated by the substantial cross-expert quality variability observed in modern MoE models, and by the success of Heavy-Tailed Self-Regularization (HT-SR) theory at predicting neural network model quality without access to training or testing data, we propose AlphaQ, a calibration-free bit-allocation method for MoE quantization. AlphaQ draws on HT-SR theory and follows a simple principle: experts with more heavy-tailed weight spectra are typically better trained and hence should receive higher bit-widths, while experts with weaker heavy-tailed structure can be quantized more aggressively. AlphaQ operationalizes this principle by measuring expert-wise spectral heavy-tailedness and solving a budget-constrained optimization problem that minimizes total quantization error under a global bit-budget constraint. Across several MoE models, AlphaQ consistently outperforms calibration-based baselines under matched bit budgets. Notably, on Qwen1.5-MoE, AlphaQ achieves near full-precision accuracy with an average expert precision of only 3.5 bits, while delivering more than 4$\times$ memory compression. Our code is available at https://github.com/Superone77/AlphaQ.

Comment: Calibration-free expert-wise bit allocation for MoE quantization using heavy-tailed spectral structure.

Topic Match: The main result is a new quantization principle for MoE deployment under global memory budgets, not a representation paper per se.

Relevance: 9 Novelty: 8

3. LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

ArXiv ID: 2606.04302

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics, Memory Structures and Agent Memory Systems

Authors: Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, Yongjoo Park

Abstract: Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37$\times$ and increases inference throughput by 1.40$\times$ compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.

Comment: Deferred positional encoding enables zero-copy position-agnostic KV reuse across logical requests.

Topic Match: The central contribution is a new cache-efficient attention mechanism that materially changes long-context inference cost.

Relevance: 9 Novelty: 8
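
Why deferring the positional encoding makes a cached key reusable can be sketched with rotary embeddings. The abstract does not name the positional scheme, so RoPE is an assumption here, and the function `rope`, the vector size, and the positions are all illustrative. The sketch stores a key without any positional rotation and applies the rotation only when a score is computed; it does not model the paper's attention kernels or their zero-copy behavior.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Rotate an even-length vector by position-dependent angles (rotary encoding)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])


rng = np.random.default_rng(0)
q = rng.normal(size=8)
k_raw = rng.normal(size=8)  # cached key, stored WITHOUT positional encoding


def score(q_pos, k_pos):
    # The position is applied at attention time; k_raw itself is never rewritten.
    return rope(q, q_pos) @ rope(k_raw, k_pos)


# The same cached key serves two requests that place it at different positions.
score_a = score(q_pos=10, k_pos=3)
score_b = score(q_pos=110, k_pos=103)
print(score_a, score_b)
```

The two scores agree because a rotary score depends only on the offset between query and key positions, so one physical copy of `k_raw` can stand in for the same document at any logical position.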

4. Near-Optimal Decentralized Stochastic Convex Optimization over Networks

ArXiv ID: 2606.04757

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Nitai Kluger, Amit Attia, Tomer Koren

Abstract: We study decentralized stochastic smooth convex optimization, where $M$ workers minimize an average objective using local stochastic gradients and neighbor-only communication over a fixed gossip network. A central question in this setting is to determine the largest number of workers that can be used under a total budget of $N$ gradient samples while still preserving the centralized $O(1/\sqrt N)$ statistical rate. We introduce an accelerated decentralized method that preserves this rate for up to $M\lesssim \sqrt{\rho}\,N^{3/4}$ workers, where $\rho$ is the spectral gap of the gossip network, improving the best prior maximal scaling of $M\lesssim \rho\sqrt N$. The method is based on a one-step-delayed stochastic acceleration scheme that enables workers to interleave minibatching with accelerated gossip while controlling residual disagreement, and its guarantee depends only logarithmically on the optimum-local heterogeneity. We also establish a matching lower bound for linear-span decentralized first-order methods, showing that the method is optimal up to logarithmic factors.

Comment: Gives near-optimal worker scaling for decentralized stochastic convex optimization with accelerated gossip and matching lower bound.

Topic Match: This is directly about distributed optimization and communication-efficient large-scale training theory, a strong fit for efficiency and scaling.

Relevance: 8 Novelty: 8
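
The two worker-scaling laws quoted in this abstract can be compared numerically. The sketch below simply evaluates both expressions with all hidden constants and logarithmic factors set to one, which is an assumption; the values of `rho` and `N` are illustrative and do not come from the paper.

```python
import math


def max_workers_new(spectral_gap, num_samples):
    """Scaling M ~ sqrt(rho) * N**(3/4), constants and log factors dropped."""
    return math.sqrt(spectral_gap) * num_samples ** 0.75


def max_workers_prior(spectral_gap, num_samples):
    """Prior scaling M ~ rho * sqrt(N), constants dropped."""
    return spectral_gap * math.sqrt(num_samples)


rho, N = 0.01, 10**8  # illustrative spectral gap and sample budget
new = max_workers_new(rho, N)
prior = max_workers_prior(rho, N)

print(f"new scaling:   about {new:,.0f} workers")
print(f"prior scaling: about {prior:,.0f} workers")
print(f"ratio:         {new / prior:,.0f}x")
```

The ratio between the two expressions is `N**(1/4) / sqrt(rho)`, so the improvement grows both with the sample budget and with how poorly connected the network is.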

Representation Learning Theory and Structure (1)

1. Bayes-Sufficient Representations in Supervised Learning

ArXiv ID: 2606.04045

Primary Topic: Representation Learning Theory and Structure

Authors: Vasileios Sevetlidis

Abstract: Representation learning is often described as preserving the information in an input that is relevant for prediction. This work asks what relevance means for a fixed supervised decision problem. A representation is defined to be Bayes-sufficient for a joint distribution and loss if some prediction head can use it to implement a Bayes-optimal action rule. This makes the target information loss-dependent. In the almost-surely unique Bayes-action case, the relevant object is a Bayes quotient, which identifies inputs that require the same Bayes-optimal action. A representation is sufficient when it refines this quotient, and Bayes-minimal when it is informationally equivalent to it. The framework connects naturally to property elicitation: zero-one loss requires the Bayes class, squared loss the conditional mean, Brier loss the conditional probability in binary prediction, and log loss or strictly proper scoring rules the predictive distribution. Controlled finite experiments, learned neural bottleneck experiments, and a real-data iNaturalist taxonomic refinement experiment illustrate the distinction between sufficiency, minimality, and retained non-required information. For a fixed supervised problem, the distribution and the loss determine the Bayes action, the Bayes action determines the quotient, and the quotient determines the minimal information required for Bayes-optimal prediction.

Comment: Defines Bayes-sufficient and Bayes-minimal representations as the loss-dependent minimal information needed for Bayes-optimal prediction.

Topic Match: This is directly about what information representations must preserve, offering a clean theoretical framework for sufficiency and minimality in supervised representation learning.

Relevance: 9 Novelty: 8
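
The loss dependence described in this abstract shows up already in a three-input binary problem. The sketch below groups inputs by their Bayes-optimal action under two losses: zero-one loss, whose Bayes action is the most probable class, and squared loss, whose Bayes action is the conditional mean. The input names, probabilities, and helper functions are hypothetical and only illustrate the quotient idea; they are not the paper's experiments.

```python
def bayes_action_zero_one(p):
    """Bayes class under zero-one loss, given p = P(Y = 1 | x) with p != 0.5."""
    return int(p > 0.5)


def bayes_action_squared(p):
    """Bayes action under squared loss is the conditional mean E[Y | x] = p."""
    return p


def bayes_quotient(cond_prob, action):
    """Group inputs that require the same Bayes-optimal action."""
    cells = {}
    for x, p in cond_prob.items():
        cells.setdefault(action(p), []).append(x)
    return sorted(cells.values())


# Hypothetical conditional probabilities P(Y = 1 | x) for three inputs.
cond_prob = {"x1": 0.6, "x2": 0.9, "x3": 0.2}

quotient_01 = bayes_quotient(cond_prob, bayes_action_zero_one)
quotient_sq = bayes_quotient(cond_prob, bayes_action_squared)

print(quotient_01)
print(quotient_sq)
```

Under zero-one loss a representation may merge `x1` and `x2` and still be sufficient, whereas under squared loss it must keep all three inputs apart, since each needs a different action.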

Memory Structures and Agent Memory Systems (2)

1. SaliMory: Orchestrating Cognitive Memory for Conversational Agents

ArXiv ID: 2606.04120

Primary Topic: Memory Structures and Agent Memory Systems

Authors: Kai Zhang, Xinyuan Zhang, Hongda Jiang, Shiun-Zu Kuo, Hyokun Yun, Ejaz Ahmed, Shereen Oraby, Ziyun Li, Sanat Sharma, Ann Lee, Ahmed A Aly, Anuj Kumar, Raffay Hamid, Xin Luna Dong

Abstract: Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.

Comment: Trains a single model to perform selective filtering, consolidation, and cue-driven recall in a cognitively structured memory system.

Topic Match: The paper is squarely about memory operations and learning principles for conversational agents, exactly matching the memory topic.

Relevance: 9 Novelty: 8

2. Cartridges at Scale: Training Modular KV Caches over Large Document Collections

ArXiv ID: 2606.04557

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Momchil Hardalov, Gonzalo Iglesias, Adrià de Gispert

Abstract: Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while preserving accuracy. A critical limitation of this approach is that cartridges are monolithic and non-compositional: encoding an entire collection into a single KV block does not scale, and naively mixing cartridges trained in isolation collapses performance to near chance. We introduce Cartridges at Scale (CAS), a training framework for scalable multi-cartridge learning with dynamic distractor mixing and a memory-efficient budget manager that rotates hundreds of per-document cartridges between GPU and persistent storage. Our approach scales to collections exceeding a million tokens, improving over a monolithic cartridge by 10-31 points at comparable token budgets. Oracle cartridge accuracy falls within 2-6 points of full in-context learning even at high compression. When paired with retrieval for cartridge selection, CAS matches or exceeds conventional RAG accuracy while consuming 3-4x fewer prompt tokens.

Comment: Introduces modular multi-cartridge KV-cache training with dynamic distractor mixing for scalable reusable long-context memory.

Topic Match: The core idea is a new principle for storing and composing reusable document memories as trained KV caches, making memory organization the best fit over pure efficiency.

Relevance: 9 Novelty: 8

World Models, Exploration, and Open-Ended Reinforcement Learning (3)

1. What Type of Inference is Active Inference?

ArXiv ID: 2606.04935

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

Abstract: Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that the planning correction already helps when observations are decisive, whereas the additional observation-side epistemic corrections matter most when observations are merely suggestive.

Comment: Clarifies active inference by deriving the entropy and planning corrections needed for proper EFE-based policy optimization.

Topic Match: This is foundational decision-making theory about planning and epistemic behavior, strongly aligned with exploration-oriented RL foundations.

Relevance: 8 Novelty: 8

2. From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

ArXiv ID: 2606.04275

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Saket Tiwari, Tejas Kotwal, George Konidaris

Abstract: We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor-critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the state of the environment can be formulated as a two-time-scale process: the environment time and the gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment's state and estimate of the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying overparametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.

Comment: Derives continuous-time dynamics for overparameterized actor-critic learning in continuous environments.

Topic Match: This is foundational RL theory about neural actor-critic dynamics in continuous control, strongly aligned with the RL topic.

Relevance: 8 Novelty: 8

3. A Goal-Set Characterization of Task Composition in the Boolean Task Algebra

ArXiv ID: 2606.04053

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Eduardo Terrés-Caballero, Herke van Hoof

Abstract: The Boolean Task Algebra (BTA) provides a principled framework for zero-shot task composition in reinforcement learning by equipping goal-reaching tasks with Boolean operations. We revisit its structural assumptions and formalize a collapse in the space of optimal extended Q-value functions: in deterministic MDPs, every such function is fully determined by the universal and empty tasks. This makes the logarithmic set of base tasks proposed in the original BTA formulation redundant. Building on this observation, we introduce a goal-set-based composition method that performs logical operations on goal sets and reconstructs composed value functions by selecting slices from the universal and empty value functions. This reduces learning costs for standard BTA and reduces composition time for both BTA and Skill Machines, while preserving policy performance. Experiments across tabular, visual, function-approximation, and continuous-control domains show that learning additional base tasks does not yield better performance. Finally, we study the stochastic setting and provide a counterexample showing that this collapse need not hold, that is, optimal composition may require accounting for exponentially many policies in the number of goals. Code is available at https://github.com/EduardoTerres/bta_paper.

Comment: Goal-set characterization showing redundancy in Boolean Task Algebra base tasks and reconstructing composed values from universal and empty tasks.

Topic Match: This is a foundational RL composition result about zero-shot task composition and value-function structure.

Relevance: 8 Novelty: 8
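The goal-set composition summarized in the abstract above can be illustrated with a short sketch. This is one possible reading, not the authors' implementation: it assumes extended Q-values are stored as arrays of shape (states, goals, actions), that Boolean task operations map to set operations on goal sets, and that a composed value function takes the universal-task slice for each goal inside the composed set and the empty-task slice otherwise. The function names and the array layout are hypothetical.

```python
import numpy as np


def compose_goal_sets(op, all_goals, a, b=None):
    """Apply a Boolean operation to goal sets (assumed mapping: AND/OR/NOT -> set ops)."""
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "not":
        return all_goals - a
    raise ValueError(f"unknown operation: {op}")


def compose_extended_q(q_universal, q_empty, goal_set):
    """Build a composed extended Q-function by selecting per-goal slices.

    q_universal, q_empty: arrays of shape (n_states, n_goals, n_actions)
    (assumed layout). For goals in goal_set the universal-task slice is
    used; for all other goals the empty-task slice is used.
    """
    n_goals = q_universal.shape[1]
    mask = np.zeros(n_goals, dtype=bool)
    for g in goal_set:
        mask[g] = True
    # Broadcast the goal mask over the state and action axes.
    return np.where(mask[None, :, None], q_universal, q_empty)


# Toy example: 2 states, 3 goals, 2 actions, with placeholder values.
q_u = np.ones((2, 3, 2))
q_e = np.zeros((2, 3, 2))
all_goals = {0, 1, 2}
task_a, task_b = {0, 1}, {1, 2}

goals_and = compose_goal_sets("and", all_goals, task_a, task_b)  # {1}
q_and = compose_extended_q(q_u, q_e, goals_and)
```

Under these assumptions composition needs no further learning: only the universal and empty value functions are stored, and each composed task costs one masked selection. Per the abstract, this shortcut is claimed for deterministic MDPs only; the stochastic counterexample means the same selection need not be optimal there.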

Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Relevant Topics

Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.

Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.

Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.

Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.

Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.

Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.

World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.

Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.

7-8: substantially related, but partly peripheral or focused on a narrower aspect.

5-6: touches the target topics, but the main contribution is elsewhere.

3-4: largely outside the target topics, often application-focused or domain-specific.

1-2: unrelated.

Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.

Novelty Scoring

9-10: new paradigm, theory, or major methodological breakthrough.

7-8: substantial methodological advance or strong new insight.

5-6: meaningful but incremental extension or refinement.

3-4: minor, narrow, or mostly engineering or domain-specific improvement.

1-2: little originality; mainly standard application of existing methods.

Topic Registry

Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.

Papers

[PAPER LIST HERE]

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}

Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
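The output rules above are mechanical enough to check automatically. The following is a minimal sketch of such a check, assuming each model response line is handed to a validator as a string; the function name `validate_record` and the error messages are hypothetical and not part of the digest pipeline. The allowed topic IDs and hotspot tags are copied from the prompt text.

```python
import json

ALLOWED_TOPIC_IDS = {
    "architecture_training",
    "efficiency_scaling",
    "representation_structure",
    "memory_systems",
    "world_models_open_ended_rl",
}
ALLOWED_HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT", "HOTSPOT_PAPER_TAGS",
    "HOTSPOT_PAPER_COMMENT",
}


def _all_allowed(values, allowed):
    """True if values is a list of strings drawn only from the allowed set."""
    return isinstance(values, list) and all(
        isinstance(v, str) and v in allowed for v in values
    )


def validate_record(line):
    """Return a list of rule violations for one JSONL line (empty if none)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(rec, dict):
        return ["line is not a JSON object"]

    missing = REQUIRED_KEYS - set(rec)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    errors = []
    for key in ("RELEVANCE", "NOVELTY"):
        value = rec[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
            errors.append(f"{key} must be an integer from 1 to 10")

    primary = rec["PRIMARY_TOPIC_ID"]
    if not isinstance(primary, str) or primary not in ALLOWED_TOPIC_IDS:
        errors.append("PRIMARY_TOPIC_ID is not in the topic registry")

    matched = rec["MATCHED_TOPIC_IDS"]
    if not _all_allowed(matched, ALLOWED_TOPIC_IDS):
        errors.append("MATCHED_TOPIC_IDS must be a list of registry topic IDs")
    elif len(matched) > 1 and primary not in matched:
        errors.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")

    tags = rec["HOTSPOT_PAPER_TAGS"]
    if not _all_allowed(tags, ALLOWED_HOTSPOT_TAGS):
        errors.append("HOTSPOT_PAPER_TAGS must use only daily_hot or new_frontier")
    elif not tags and rec["HOTSPOT_PAPER_COMMENT"] != "":
        errors.append("HOTSPOT_PAPER_COMMENT must be empty when no tags are set")

    return errors
```

Note that the template object shown in the instructions uses 0 for RELEVANCE and NOVELTY as a placeholder; a checker like this one would flag that value, since the rules require integers from 1 to 10.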