
Personalized Daily ArXiv Papers 2026-04-17

Model: gpt-5.4
Tokens: prompt 178546, completion 25221, total 203767
Cost: prompt $0.45, completion $0.38, total $0.82
Papers: total arXiv 539, scanned 321, relevant 29

Topic Coverage:

Architecture and Training Dynamics: 10 papers
Efficiency, Compression, and Large-Scale Training: 6 papers
Representation Learning Theory and Structure: 6 papers
Memory Structures and Agent Memory Systems: 1 paper
World Models, Exploration, and Open-Ended Reinforcement Learning: 6 papers

Table of contents by topic:

Architecture and Training Dynamics (10)

  1. Gating Enables Curvature: A Geometric Expressivity Gap in Attention Authors: Satwik Bathula, Anand A. Joshi

  2. HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet Authors: Badri N. Patro, Vijay S. Agneeswaran

  3. Expressivity of Transformers: A Tropical Geometry Perspective Authors: Ye Su, Yong Liu

  4. Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations Authors: Wentao Hu, Yanbo Zhai, Xiaohui Hu, Mingkuan Zhao, Shanhong yu, Xue Liu, Kaidong Yu, Shuangyong Song, Xuelong Li

  5. Attention to Mamba: A Recipe for Cross-Architecture Distillation Authors: Abhinav Moudgil, Ningyuan Huang, Eeshan Gunesh Dhekane, Pau Rodríguez, Luca Zappella, Federico Danieli

  6. Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models Authors: Chashi Mahiul Islam, Alan Villarreal, Mao Nishino, Shaeke Salman, Xiuwen Liu

  7. A Nonlinear Separation Principle: Applications to Neural Networks, Control and Learning Authors: Anand Gokhale, Anton V. Proskurnikov, Yu Kawano, Francesco Bullo

  8. Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus Authors: Zijian Zhao, Jing Gao, Sen Li

  9. Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation Authors: Fei Ding, Yongkang Zhang, youwei wang, Zijian Zeng

  10. Constraint-based Pre-training: From Structured Constraints to Scalable Model Initialization Authors: Fu Feng, Yucheng Xie, Ruixiao Shi, Jing Wang, Xin Geng

Efficiency, Compression, and Large-Scale Training (6)

  1. SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention Authors: Hongtao Xu, Jianchao Tan, Yuxuan Hu, Pengju Lu, Hongyu Wang, Pingwei Sun, Yerui Sun, Yuchen Xie, Xunliang Cai, Mingzhen Li, Weile Jia

  2. AdaSplash-2: Faster Differentiable Sparse Attention Authors: Nuno Gonçalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, Marcos Treviso

  3. ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving Authors: Yuseon Choi, Jingu Lee, Jungjun Oh, Sunjoo Whang, Byeongcheol Kim, Minsung Kim, Hoi-Jun Yoo, Sangjin Kim

  4. Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels Authors: Yifan Zhao, Yuchen Yang, Matei Budiu, Sasa Misailovic

  5. Prism: Symbolic Superoptimization of Tensor Programs Authors: Mengdi Wu, Xiaoyu Jiang, Oded Padon, Zhihao Jia

  6. Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization Authors: Zhiyuan Zhai, Bingcong Li, Bingnan Xiao, Ming Li, Xin Wang

Representation Learning Theory and Structure (6)

  1. Spectral Entropy Collapse as an Empirical Signature of Delayed Generalisation in Grokking Authors: Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc

  2. Metric-Aware Principal Component Analysis (MAPCA): A Unified Framework for Scale-Invariant Representation Learning Authors: Michael Leznik

  3. Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision Authors: Gerasimos Chatzoudis, Konstantinos D. Polyzos, Zhuowei Li, Difei Gu, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

  4. Identifiability of Potentially Degenerate Gaussian Mixture Models With Piecewise Affine Mixing Authors: Danru Xu, Sébastien Lachapelle, Sara Magliacane

  5. Weight Patching: Toward Source-Level Mechanistic Localization in LLMs Authors: Chenghao Sun, Chengsheng Zhang, Guanzheng Qin, Rui Dai, Xinmei Tian

  6. From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning Authors: Zonghuan Xu, Xingjun Ma

Memory Structures and Agent Memory Systems (1)

  1. Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments Authors: Rajat Khanda, Mohammad Baqar Sambuddha Chakrabarti, Satyasaran Changdar

World Models, Exploration, and Open-Ended Reinforcement Learning (6)

  1. Learning Ad Hoc Network Dynamics via Graph-Structured World Models Authors: Can Karacelebi, Yusuf Talha Sahin, Elif Surer, Ertan Onur

  2. Reinforcement Learning via Value Gradient Flow Authors: Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang

  3. Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees Authors: Sourav Ganguly, Kartik Pandit, Arnob Ghosh

  4. Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning Authors: Jean-Bastien Grill, Michal Valko, Rémi Munos

  5. Golden Handcuffs make safer AI agents Authors: Aram Ebtekar, Michael K. Cohen

  6. Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization Authors: Mathias Dus (IRMA)


Architecture and Training Dynamics (10)

1. Gating Enables Curvature: A Geometric Expressivity Gap in Attention

ArXiv ID: 2604.14702

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Satwik Bathula, Anand A. Joshi

Abstract: Multiplicative gating is widely used in neural architectures and has recently been applied to attention layers to improve performance and training stability in large language models. Despite the success of gated attention, the mathematical implications of gated attention mechanisms remain poorly understood. We study attention through the geometry of its representations by modeling outputs as mean parameters of Gaussian distributions and analyzing the induced Fisher-Rao geometry. We show that the ungated attention operator is restricted to intrinsically flat statistical manifolds due to its affine structure, while multiplicative gating enables non-flat geometries, including positively curved manifolds that are unattainable in the ungated setting. These results establish a geometric expressivity gap between ungated and gated attention. Empirically, we show that gated models exhibit higher representation curvature and improved performance on tasks requiring nonlinear decision boundaries, whereas they provide no consistent advantage on tasks with linear decision boundaries. Furthermore, we identify a structured regime in which curvature accumulates under composition, yielding a systematic depth amplification effect.

Comment: Provides a mechanistic theory for why multiplicative gating expands attention expressivity by enabling non-flat representation geometry and depth-wise curvature accumulation.

Topic Match: The core contribution is an analysis of an attention-layer architectural mechanism—gating—and its effect on expressivity and training-relevant geometry.

Relevance: 9 Novelty: 8
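
To make "multiplicative gating" concrete, here is a minimal sketch for a single query. It assumes a sigmoid gate computed from the query (one common form, not necessarily the paper's), and `gate_w` holds hypothetical weights. The ungated output is a convex combination of the value rows, so each coordinate stays between the smallest and largest value in that column; the gated output is not confined in that way. The sketch does not reproduce the paper's Fisher-Rao curvature analysis.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, keys, values):
    """Ungated single-query attention: a convex combination of the value rows."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

def gated_attention(q, keys, values, gate_w):
    """Attention output multiplied elementwise by sigmoid(gate_w @ q) (assumed gate form)."""
    out = attention(q, keys, values)
    gate = [1.0 / (1.0 + math.exp(-sum(g * qi for g, qi in zip(row, q)))) for row in gate_w]
    return [g * o for g, o in zip(gate, out)]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[2.0, 3.0], [4.0, 5.0], [3.0, 4.0]]
gate_w = [[-2.0, 0.0], [0.0, 1.0]]  # hypothetical gate weights, for illustration only

plain = attention(q, keys, values)                 # stays inside the range of the values
gated = gated_attention(q, keys, values, gate_w)   # can fall outside that range
```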


2. HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

ArXiv ID: 2604.14724

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Badri N. Patro, Vijay S. Agneeswaran

Abstract: Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization, a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN), an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU), magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2x faster inference than transformers (4.2ms vs 9.2ms for DeiT-S) and 1.4-1.9x speedup over scanning-based SSMs, while using less memory (2.1GB vs 3.2-4.5GB) and energy (12.5J vs 18-25J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.

Comment: Introduces a scanning-free spectral state-space architecture with input-dependent frequency gating and simplified kernel parameterization.

Topic Match: The main contribution is a new sequence-model architecture design for state-space modeling, with efficiency benefits as a secondary aspect.

Relevance: 9 Novelty: 8
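
The O(L log L) figure in the abstract comes from FFT-based convolution. The sketch below shows only that generic identity (a circular convolution equals an elementwise product in the frequency domain) using a small radix-2 FFT written for illustration. It is not HAMSA's kernel parameterization or gating, and the function names are hypothetical.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def fft_circular_conv(signal, kernel):
    """Circular convolution in O(L log L): multiply spectra, then invert."""
    n = len(signal)
    spectrum = [s * k for s, k in zip(fft(signal), fft(kernel))]
    return [v.real / n for v in fft(spectrum, inverse=True)]

def direct_circular_conv(signal, kernel):
    """Reference O(L^2) circular convolution."""
    n = len(signal)
    return [sum(signal[j] * kernel[(i - j) % n] for j in range(n)) for i in range(n)]

signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.0, 5.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
fast = fft_circular_conv(signal, kernel)
slow = direct_circular_conv(signal, kernel)
```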


3. Expressivity of Transformers: A Tropical Geometry Perspective

ArXiv ID: 2604.14727

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Ye Su, Yong Liu

Abstract: To quantify the geometric expressivity of transformers, we introduce a tropical geometry framework to characterize their exact spatial partitioning capabilities. By modeling self-attention as a vector-valued tropical rational map, we prove it evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Building on this equivalence, we establish a combinatorial rationale for Multi-Head Self-Attention (MHSA): via the Minkowski sum of Newton polytopes, multi-head aggregation expands the polyhedral complexity to $\mathcal{O}(N^H)$, overcoming the $\mathcal{O}(N)$ bottleneck of single heads. Extending this to deep architectures, we derive the first tight asymptotic bounds on the number of linear regions in transformers ($\Theta(N^{d_{\text{model}}L})$), demonstrating a combinatorial explosion driven intrinsically by sequence length $N$, ambient embedding dimension $d_{\text{model}}$, and network depth $L$. Importantly, we guarantee that this idealized polyhedral skeleton is geometrically stable: finite-temperature soft attention preserves these topological partitions via exponentially tight differential approximation bounds.

Comment: Gives a tropical-geometry account of transformer expressivity, with exact partition results for attention and tight linear-region bounds.

Topic Match: Its main contribution is a mechanistic theory of transformer architecture expressivity and attention geometry.

Relevance: 9 Novelty: 8
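
The zero-temperature limit mentioned in the abstract can be illustrated numerically: as the temperature shrinks, softmax weights concentrate on the highest-scoring key, so each query is assigned to exactly one key, which is the kind of hard partition the paper analyzes. This sketch shows only that limit; it does not construct the Power Voronoi Diagram or the linear-region bounds. The scores are made up.

```python
import math

def softmax_t(scores, temperature):
    """Softmax over scores / temperature, shifted by the max for numerical stability."""
    m = max(scores)
    es = [math.exp((s - m) / temperature) for s in scores]
    z = sum(es)
    return [e / z for e in es]

scores = [1.0, 2.5, 2.0, -0.5]        # hypothetical query-key scores
warm = softmax_t(scores, 1.0)         # weight spread over several keys
cold = softmax_t(scores, 0.01)        # nearly all weight on the argmax key
winner = max(range(len(scores)), key=lambda i: cold[i])
```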


4. Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations

ArXiv ID: 2604.14246

Primary Topic: Architecture and Training Dynamics

Also Matches: Representation Learning Theory and Structure

Authors: Wentao Hu, Yanbo Zhai, Xiaohui Hu, Mingkuan Zhao, Shanhong yu, Xue Liu, Kaidong Yu, Shuangyong Song, Xuelong Li

Abstract: Sparse Mixture-of-Experts (MoE) models have achieved remarkable scalability, yet they remain vulnerable to hallucinations, particularly when processing long-tail knowledge. We identify that this fragility stems from static Top-$k$ routing: routers tend to favor high-frequency patterns over rare factual associations. Consequently, "specialist experts" possessing critical long-tail knowledge are often assigned low gating scores and remain "dormant", that is, under-prioritized for specific tokens despite their proven causal importance on other inputs. To address this, we propose Counterfactual Routing (CoR), a training-free inference framework designed to awaken these dormant experts. CoR integrates layer-wise perturbation analysis with the Counterfactual Expert Impact (CEI) metric to dynamically shift computational resources from syntax-dominant to knowledge-intensive layers while maintaining a constant total activation count, effectively retrieving causally decisive experts via virtual ablation. Extensive experiments on TruthfulQA, FACTOR, and TriviaQA demonstrate that CoR improves factual accuracy by 3.1% on average without increasing the inference budget, establishing a superior Pareto frontier compared to static scaling strategies.

Comment: Mitigates MoE hallucinations by counterfactually identifying and activating causally important dormant experts at inference time.

Topic Match: MoE routing is the main mechanism under study, with a concrete new inference-time routing principle.

Relevance: 9 Novelty: 8
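
For readers unfamiliar with the baseline the abstract criticizes, here is a minimal sketch of static Top-k routing for one token. It assumes the common convention of renormalizing gate probabilities over the selected experts, and the logits are made up. Experts outside the top k receive no compute for that token, which is the "dormant" situation described above. CoR itself (perturbation analysis and the CEI metric) is not implemented here.

```python
import math

def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    # Renormalizing over the chosen experts is an assumed convention.
    return {i: probs[i] / norm for i in chosen}

logits = [2.0, 0.1, 1.5, -1.0]   # hypothetical router logits for four experts
routing = top_k_route(logits, k=2)
dormant = [i for i in range(len(logits)) if i not in routing]
```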


5. Attention to Mamba: A Recipe for Cross-Architecture Distillation

ArXiv ID: 2604.14191

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Abhinav Moudgil, Ningyuan Huang, Eeshan Gunesh Dhekane, Pau Rodríguez, Luca Zappella, Federico Danieli

Abstract: State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naive distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11, close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying sequence mixer architecture, scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on token allocation between stages.

Comment: Two-stage transformer-to-linear-attention-to-Mamba distillation gives a concrete recipe for cross-architecture transfer without hybrid attention blocks.

Topic Match: The core contribution is an architectural/training recipe for transferring knowledge into SSMs, with direct implications for how Mamba-like models can be trained from transformer teachers.

Relevance: 9 Novelty: 8
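
The first distillation stage targets a linearized attention obtained through a kernel trick. The sketch below shows the generic idea under stated assumptions: with a positive feature map `phi` (here `elu(x) + 1`, an assumed choice that the abstract does not specify), non-causal attention can be computed either by building the full N x N similarity matrix or, by associativity, from a fixed-size summary of the keys and values. Both forms give the same output. This is not the paper's adaptation.

```python
import math

def phi(vec):
    """Positive feature map elu(x) + 1 (an assumed choice, not the paper's)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def quadratic_form(Q, K, V):
    """Builds every query-key similarity, then normalizes each row."""
    out = []
    for q in Q:
        fq = phi(q)
        sims = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(sims)
        out.append([sum(s * v[j] for s, v in zip(sims, V)) / z for j in range(len(V[0]))])
    return out

def linear_form(Q, K, V):
    """Same result via phi(Q) @ (phi(K)^T @ V), without any N x N matrix."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]   # running sum of phi(k_n) v_n^T
    ksum = [0.0] * d                      # running sum of phi(k_n)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            ksum[a] += fk[a]
            for j in range(dv):
                kv[a][j] += fk[a] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        z = sum(a * b for a, b in zip(fq, ksum))
        out.append([sum(fq[a] * kv[a][j] for a in range(d)) / z for j in range(dv)])
    return out

Q = [[0.2, -0.4], [1.0, 0.5]]
K = [[0.1, 0.3], [-0.7, 0.9], [0.5, -0.2]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
quad = quadratic_form(Q, K, V)
lin = linear_form(Q, K, V)
```

The fixed-size summary (`kv`, `ksum`) is what makes the linearized form resemble a recurrent state, which is presumably why it serves as a stepping stone toward an SSM.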


6. Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models

ArXiv ID: 2604.13206

Primary Topic: Architecture and Training Dynamics

Authors: Chashi Mahiul Islam, Alan Villarreal, Mao Nishino, Shaeke Salman, Xiuwen Liu

Abstract: As Large Language Models (LLMs) are increasingly integrated into agentic workflows, their unpredictability stemming from numerical instability has emerged as a critical reliability issue. While recent studies have demonstrated the significant downstream effects of these instabilities, the root causes and underlying mechanisms remain poorly understood. In this paper, we present a rigorous analysis of how unpredictability is rooted in the finite numerical precision of floating-point representations, tracking how rounding errors propagate, amplify, or dissipate through Transformer computation layers. Specifically, we identify a chaotic "avalanche effect" in the early layers, where minor perturbations trigger binary outcomes: either rapid amplification or complete attenuation. Beyond specific error instances, we demonstrate that LLMs exhibit universal, scale-dependent chaotic behaviors characterized by three distinct regimes: 1) a stable regime, where perturbations fall below an input-dependent threshold and vanish, resulting in constant outputs; 2) a chaotic regime, where rounding errors dominate and drive output divergence; and 3) a signal-dominated regime, where true input variations override numerical noise. We validate these findings extensively across multiple datasets and model architectures.

Comment: Analyzes how floating-point rounding errors propagate through transformers and identifies scale-dependent chaotic regimes underlying LLM output unpredictability.

Topic Match: The paper is best viewed as a training/inference dynamics and numerical-stability study of transformer computation rather than an application paper.

Relevance: 8 Novelty: 8
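
The root cause named in the abstract, finite floating-point precision, is easy to see in isolation: floating-point addition is not associative, so the order of a reduction changes the result. The sketch below shows only that basic fact in double precision; it does not model how such errors propagate through Transformer layers.

```python
import math

def left_to_right(xs):
    """Plain sequential accumulation, as a naive reduction would do it."""
    total = 0.0
    for x in xs:
        total += x
    return total

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c     # rounds differently from the line below
right = a + (b + c)

values = [1e16, 1.0, -1e16, 1.0]
reordered = [1e16, -1e16, 1.0, 1.0]
sum_original = left_to_right(values)       # the first 1.0 is absorbed by 1e16
sum_reordered = left_to_right(reordered)   # large terms cancel first, nothing is lost
exact = math.fsum(values)                  # correctly rounded reference sum
```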


7. A Nonlinear Separation Principle: Applications to Neural Networks, Control and Learning

ArXiv ID: 2604.15238

Primary Topic: Architecture and Training Dynamics

Authors: Anand Gokhale, Anton V. Proskurnikov, Yu Kawano, Francesco Bullo

Abstract: This paper investigates continuous-time and discrete-time firing-rate and Hopfield recurrent neural networks (RNNs), with applications in nonlinear control design and implicit deep learning. First, we introduce a nonlinear separation principle that guarantees global exponential stability for the interconnection of a contracting state-feedback controller and a contracting observer, alongside parametric extensions for robustness and equilibrium tracking. Second, we derive sharp linear matrix inequality (LMI) conditions that guarantee the contractivity of both firing rate and Hopfield neural network architectures. We establish structural relationships among these certificates, demonstrating that continuous-time models with monotone non-decreasing activations maximize the admissible weight space, and extend these stability guarantees to interconnected systems and Graph RNNs. Third, we combine our separation principle and LMI framework to solve the output reference tracking problem for RNN-modeled plants. We provide LMI synthesis methods for feedback controllers and observers, and rigorously design a low-gain integral controller to eliminate steady-state error. Finally, we derive an exact, unconstrained algebraic parameterization of our contraction LMIs to design highly expressive implicit neural networks, achieving competitive accuracy and parameter efficiency on standard image classification benchmarks.

Comment: Derives a nonlinear separation principle and contraction/LMI certificates for recurrent and implicit neural network stability.

Topic Match: Best fit is architectural and training-stability theory for recurrent and implicit networks rather than downstream control performance.

Relevance: 8 Novelty: 8


8. Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

ArXiv ID: 2604.13472

Primary Topic: Architecture and Training Dynamics

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Zijian Zhao, Jing Gao, Sen Li

Abstract: Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi-Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single-agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision-making mechanism in which a Transformer decoder autoregressively generates a high-level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order-independent joint decision making and avoiding the sensitivity to action-generation order in conventional Multi-Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single-agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi-Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at: https://github.com/RS2002/CMAT.

Comment: Introduces latent-consensus action generation to make centralized multi-agent transformers order-independent while optimizing as hierarchical single-agent RL.

Topic Match: Its core novelty is an architectural factorization for joint action modeling in transformers, even though it is evaluated in MARL.

Relevance: 8 Novelty: 8


9. Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

ArXiv ID: 2604.13088

Primary Topic: Architecture and Training Dynamics

Authors: Fei Ding, Yongkang Zhang, youwei wang, Zijian Zeng

Abstract: In sparse termination rewards, intra-group comparisons have become the dominant paradigm for fine-tuning reasoning models via reinforcement learning. However, long-term training often leads to issues like ineffective update accumulation (learning tax), solution probability drift, and entropy collapse. This paper presents a necessary condition for algorithm design from a token-level credit assignment perspective: to prevent reward-irrelevant drift, intra-group objectives must maintain gradient exchangeability across token updates, enabling gradient cancellation on weak-credit/high-frequency tokens. We show that two common mechanisms disrupting exchangeability make "non-cancellation" a structural norm. Based on this, we propose minimal intra-group transformations to restore or approximate the cancellation structure in the shared token space. Experimental results demonstrate that these transformations stabilize training, improve sample efficiency, and enhance final performance, validating the value of this design condition.

Comment: Token-level gradient-cancellation condition gives a mechanistic design principle for stable intra-group RL training.

Topic Match: Its strongest contribution is a training-dynamics analysis and objective-design condition for stability and credit assignment, squarely in architecture/training dynamics.

Relevance: 8 Novelty: 8
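
A toy illustration of the cancellation idea, under loud assumptions: advantages are taken as reward minus the group mean (a common intra-group baseline), and a token shared by every response contributes the same per-response gradient `g`. The net update on that token is then `g` times the sum of advantages, which is zero. The per-response weights used to break the symmetry are hypothetical and are not claimed to be either of the two mechanisms the paper identifies.

```python
def group_advantages(rewards):
    """Reward minus the group mean, so the advantages sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def shared_token_update(advantages, per_response_grads):
    """Net update on one shared token: sum over the group of A_i * g_i."""
    return sum(a * g for a, g in zip(advantages, per_response_grads))

rewards = [1.0, 0.0, 0.0, 1.0]       # sparse terminal rewards for four sampled responses
adv = group_advantages(rewards)

# Exchangeable case: the shared token gets the same gradient in every response.
equal_grads = [0.3, 0.3, 0.3, 0.3]
net_equal = shared_token_update(adv, equal_grads)

# Hypothetical per-response weights (illustration only) break the symmetry.
weights = [1 / 10, 1 / 40, 1 / 20, 1 / 10]
scaled_grads = [0.3 * w for w in weights]
net_scaled = shared_token_update(adv, scaled_grads)
```

In the first case the reward-irrelevant token receives no net push; in the second it drifts even though it carries no credit.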


10. Constraint-based Pre-training: From Structured Constraints to Scalable Model Initialization

ArXiv ID: 2604.14769

Primary Topic: Architecture and Training Dynamics

Also Matches: Efficiency, Compression, and Large-Scale Training

Authors: Fu Feng, Yucheng Xie, Ruixiao Shi, Jing Wang, Xin Geng

Abstract: The pre-training and fine-tuning paradigm has become the dominant approach for model adaptation. However, conventional pre-training typically yields models at a fixed scale, whereas practical deployment often requires models of varying sizes, exposing its limitations when target model scales differ from those used during pre-training. To address this, we propose an innovative constraint-based pre-training paradigm that imposes structured constraints during pre-training to disentangle size-agnostic knowledge into reusable weight templates, while assigning size-specific adaptation to lightweight weight scalers, thereby reformulating variable-sized model initialization as a multi-task adaptation problem. Within this paradigm, we further introduce WeiT, which employs Kronecker-based constraints to regularize the pre-training process. Specifically, model parameters are represented as compositions of weight templates via concatenation and weighted aggregation, with adaptive connections governed by lightweight weight scalers whose parameters are learned from limited data. This design enables flexible and efficient construction of model weights across diverse downstream scales. Extensive experiments demonstrate the efficiency and effectiveness of WeiT, achieving state-of-the-art performance in initializing models with varying depths and widths across a broad range of perception and embodied learning tasks, including Image Classification, Image Generation, and Embodied Control. Moreover, its effectiveness generalizes to both Transformer-based and Convolution-based architectures, consistently enabling faster convergence and improved performance even under full training.

Comment: Constraint-based pre-training learns reusable weight templates and scale-specific scalers for variable-size model initialization.

Topic Match: The paper mainly introduces a new pre-training and parameterization scheme for model initialization across scales, which is fundamentally an architecture/training design contribution.

Relevance: 8 Novelty: 8
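
One way to picture how a Kronecker structure lets a single template serve several model sizes is sketched below. It assumes the simplest possible form, `W = kron(scaler, template)`, with made-up numbers. WeiT's actual construction (concatenation and weighted aggregation of several templates) is more involved and is not reproduced here.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

template = [[1.0, 2.0],
            [3.0, 4.0]]               # shared, size-agnostic block (hypothetical values)

small_scaler = [[1.0]]                # 1x1 scaler gives a 2x2 weight
large_scaler = [[1.0, 0.5],
                [0.5, 1.0]]           # 2x2 scaler gives a 4x4 weight

small_w = kron(small_scaler, template)
large_w = kron(large_scaler, template)
```

Only the scaler changes between sizes, which matches the abstract's split between reusable templates and lightweight, size-specific scalers.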


Efficiency, Compression, and Large-Scale Training (6)

1. SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

ArXiv ID: 2604.13847

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Hongtao Xu, Jianchao Tan, Yuxuan Hu, Pengju Lu, Hongyu Wang, Pingwei Sun, Yerui Sun, Yuchen Xie, Xunliang Cai, Mingzhen Li, Weile Jia

Abstract: While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both 1) sequence length and 2) sparsity sensitivity, leading to a severe imbalance problem and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue, failing to systematically co-optimize these two problems. Therefore, we propose SparseBalance, a novel algorithm-system co-design framework, which exploits the sparsity and sequence heterogeneity to optimize model accuracy and system efficiency jointly. First, we propose workload-aware dynamic sparsity tuning, which employs a bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for free accuracy. Second, we propose a sparsity-aware batching strategy to achieve coarse-grained balance, which complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33$\times$ end-to-end speedup while still improving the long-context capability by 0.46% on the LongBench benchmark.

Comment: Co-designs dynamic sparse attention with distributed workload balancing to jointly improve long-context training efficiency and accuracy.

Topic Match: The paper’s main idea is a systems-and-algorithm efficiency improvement for large-model long-context training under sparse attention.

Relevance: 9 Novelty: 8


2. AdaSplash-2: Faster Differentiable Sparse Attention

ArXiv ID: 2604.15180

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Nuno Gonçalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, Marcos Treviso

Abstract: Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context training. A promising line of work is $\alpha$-entmax attention, a differentiable sparse alternative to softmax that enables input-dependent sparsity yet has lagged behind softmax due to the computational overhead necessary to compute the normalizer $\tau$. In this paper, we introduce AdaSplash-2, which addresses this limitation through a novel histogram-based initialization that reduces the number of iterations needed to compute $\tau$ to typically 1-2. The key idea is to compute a coarse histogram of attention scores on the fly and store it in on-chip SRAM, yielding a more accurate initialization that enables fast forward and backward computation. Combined with a sparsity-aware GPU implementation that skips zero blocks with low overhead, AdaSplash-2 matches or improves per-step training time relative to FlashAttention-2 when block sparsity is moderate-to-high (e.g., >60%), which often occurs at long-context lengths. On downstream tasks, models trained with our efficient $\alpha$-entmax attention match softmax baselines at short-context lengths and achieve substantial gains in long-context settings.

Comment: Makes differentiable sparse entmax attention practical with a faster normalizer initialization and sparsity-aware GPU execution.

Topic Match: The contribution is squarely about efficient attention computation and long-context training cost, with a real algorithmic and systems improvement.

Relevance: 9 Novelty: 8
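
To show what the normalizer $\tau$ is, here is a minimal sketch of 1.5-entmax, where each probability is `max(s_i / 2 - tau, 0) ** 2` and `tau` is chosen so the probabilities sum to one. The sketch finds `tau` by plain bisection from a coarse bracket, which takes many iterations. It is a stand-in for illustration and not AdaSplash-2's histogram-based initialization or GPU kernel.

```python
def entmax15(scores, iters=60):
    """1.5-entmax by bisection on tau; returns (probabilities, tau)."""
    half = [s / 2.0 for s in scores]
    lo, hi = max(half) - 1.0, max(half)   # mass(lo) >= 1 and mass(hi) == 0

    def mass(tau):
        return sum(max(h - tau, 0.0) ** 2 for h in half)

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mass(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    tau = (lo + hi) / 2.0
    p = [max(h - tau, 0.0) ** 2 for h in half]
    total = sum(p)
    return [x / total for x in p], tau    # final rescale absorbs the bisection residual

scores = [2.0, 1.0, 0.0, -3.0]            # hypothetical attention scores
probs, tau = entmax15(scores)
```

Unlike softmax, the low-scoring entries get exactly zero probability, which is what lets a sparsity-aware kernel skip whole blocks. A better starting guess for `tau` shortens the search, which is the lever the abstract describes.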


3. ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

ArXiv ID: 2604.14626

Primary Topic: Efficiency, Compression, and Large-Scale Training

Also Matches: Architecture and Training Dynamics

Authors: Yuseon Choi, Jingu Lee, Jungjun Oh, Sunjoo Whang, Byeongcheol Kim, Minsung Kim, Hoi-Jun Yoo, Sangjin Kim

Abstract: Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE, expert and bit, and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average $6.6\times$ speedup and $4.4\times$ energy efficiency gain over naive MoE serving on xPU across batch sizes 1-16, and delivers $2.2\times$ speedup and $1.4\times$ energy efficiency gain over the best-performing prior accelerator baseline.

Comment: MoE serving co-design introduces elastic self-speculative decoding that jointly exploits expert and bit elasticity for substantial inference-speed gains.

Topic Match: The core contribution is a nontrivial efficiency idea for MoE inference and cache/speculative-decoding design, making efficiency and scaling the best primary topic.

Relevance: 9 Novelty: 8


4. Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

ArXiv ID: 2604.14825

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Yifan Zhao, Yuchen Yang, Matei Budiu, Sasa Misailovic

Abstract: We present Nautilus, a novel tensor compiler that moves toward fully automated math-to-kernel optimization. Nautilus compiles a high-level algebraic specification of tensor operators into efficient tiled GPU kernels. Nautilus's successive lowering design allows high-level optimizations, expression rewrites, and tile optimizations to be jointly applied in a single end-to-end system. Nautilus presents a novel auto-scheduler that discovers sequences of high-level optimizations, while preserving the regular program structure needed by tile optimizers. Nautilus's auto-scheduler captures complex interactions and trade-offs in the high-level optimizations, including aggressive global transformations like advanced reduction fusion. Nautilus is the first end-to-end tensor compiler capable of starting from a math-like description of attention and automatically discovering FlashAttention-3-like kernels, offloading the entire burden of optimization from the programmer to the compiler. Across five transformer-based models and 150 evaluation configurations on NVIDIA GH200 and RTX 5090 GPUs, Nautilus achieves up to 23% higher throughput than state-of-the-art compilers on GH200 and up to 42% on RTX 5090, while matching or exceeding manually written cuDNN kernels on many long-sequence configurations.

Comment: Auto-scheduling tensor compiler that can discover FlashAttention-3-like kernels from math-level operator descriptions.

Topic Match: The core contribution is a new compiler and scheduling method that materially changes GPU kernel efficiency for transformer workloads.

Relevance: 8 Novelty: 8


5. Prism: Symbolic Superoptimization of Tensor Programs

ArXiv ID: 2604.15272

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Mengdi Wu, Xiaoyu Jiang, Oded Padon, Zhihao Jia

Abstract: This paper presents Prism, the first symbolic superoptimizer for tensor programs. The key idea is sGraph, a symbolic, hierarchical representation that compactly encodes large classes of tensor programs by symbolically representing some execution parameters. Prism organizes optimization as a two-level search: it constructs symbolic graphs that represent families of programs, and then instantiates them into concrete implementations. This formulation enables structured pruning of provably suboptimal regions of the search space using symbolic reasoning over operator semantics, algebraic identities, and hardware constraints. We develop techniques for efficient symbolic graph generation, equivalence verification via e-graph rewriting, and parameter instantiation through auto-tuning. Together, these components allow Prism to bridge the rigor of exhaustive search with the scalability required for modern ML workloads. Evaluation on five commonly used LLM workloads shows that Prism achieves up to $2.2\times$ speedup over best superoptimizers and $4.9\times$ over best compiler-based approaches, while reducing end-to-end optimization time by up to $3.4\times$.

Comment: Introduces a symbolic superoptimizer for tensor programs using hierarchical symbolic graphs and provable pruning of tensor-program search spaces.

Topic Match: The core contribution is a new optimization system for tensor workloads that materially improves LLM execution efficiency through symbolic search and verification.

Relevance: 8 Novelty: 8


6. Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

ArXiv ID: 2604.14853

Primary Topic: Efficiency, Compression, and Large-Scale Training

Authors: Zhiyuan Zhai, Bingcong Li, Bingnan Xiao, Ming Li, Xin Wang

Abstract: Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In the learn stage, a lightweight classifier is trained to predict oracle actions from cheap input features, amortizing the allocation rule for real-time deployment. We establish that the task-level regret of the learned policy is bounded by its imitation error times the worst-case per-instance gap, yielding a clean reduction from constrained inference to supervised classification. Experiments on MATH and GSM8K with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) show that our method consistently outperforms uniform and heuristic allocation baselines, achieving up to 12.8% relative accuracy improvement on MATH under matched budget constraints, while closely tracking the Lagrangian oracle upper bound with over 91% imitation accuracy.

Comment: Formulates test-time compute allocation as constrained policy optimization and learns per-instance budget decisions with regret guarantees.

Topic Match: Its central contribution is adaptive inference-time compute allocation under explicit budget constraints, a direct efficiency/scaling problem.

Relevance: 8 Novelty: 8
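The solve stage described in the abstract above (Lagrangian relaxation into per-instance sub-problems, then binary search on the dual variable) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `oracle_actions`, `mean_cost`, and `solve_dual` are hypothetical, and it assumes a table `acc[i, k]` of estimated accuracies for instance `i` under compute level `k`, with per-level costs `cost[k]`.

```python
import numpy as np

def oracle_actions(acc, cost, lam):
    """Per-instance action that maximises accuracy minus lam * cost."""
    return np.argmax(acc - lam * cost[None, :], axis=1)

def mean_cost(acc, cost, lam):
    """Average compute spent when every instance takes its oracle action."""
    return float(cost[oracle_actions(acc, cost, lam)].mean())

def solve_dual(acc, cost, budget, iters=60):
    """Binary search for a price lam whose induced mean cost fits the budget."""
    if budget < cost.min():
        raise ValueError("budget is below the cheapest action")
    if mean_cost(acc, cost, 0.0) <= budget:
        return 0.0  # constraint is slack, so no price is needed
    lo, hi = 0.0, 1.0
    while mean_cost(acc, cost, hi) > budget:  # grow hi until it is feasible
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_cost(acc, cost, mid) > budget:
            lo = mid  # still over budget: raise the price
        else:
            hi = mid  # feasible: try a lower price
    return hi  # hi is feasible at every step of the search
```

The search relies on the monotonicity the abstract states: raising `lam` never increases the cost of any instance's chosen action, so the mean cost is non-increasing in `lam`. The learn stage would then train a classifier to predict `oracle_actions(acc, cost, lam)` from cheap input features.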


Representation Learning Theory and Structure (6)

1. Spectral Entropy Collapse as an Empirical Signature of Delayed Generalisation in Grokking

ArXiv ID: 2604.13123

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc

Abstract: Grokking -- delayed generalisation long after memorisation -- lacks a predictive mechanistic explanation. We identify the normalised spectral entropy $\tilde{H}(t)$ of the representation covariance as a scalar order parameter for this transition, validated on 1-layer Transformers on group-theoretic tasks. Five contributions: (i) Grokking follows a two-phase pattern: norm expansion then entropy collapse. (ii) $\tilde{H}$ crosses a stable threshold $\tilde{H}^* \approx 0.61$ before generalisation in 100% of runs (mean lead: 1,020 steps). (iii) A causal intervention preventing collapse delays grokking by +5,020 steps ($p=0.044$); a norm-matched control ($n=30$, $p=5\times10^{-5}$) confirms entropy -- not norm -- drives the transition. (iv) A power-law $\Delta T = C_1(\tilde{H}-\tilde{H}^*)^\gamma+C_2$ ($R^2=0.543$) predicts grokking onset with 4.1% error. (v) The mechanism holds across abelian ($\mathbb{Z}/97\mathbb{Z}$) and non-abelian ($S_5$) groups. Crucially, MLPs show entropy collapse without grokking, proving collapse is necessary but not sufficient -- architecture matters. Code: https://anonymous.4open.science/r/grokking-entropy

Comment: Proposes spectral entropy collapse as an empirical order parameter and causal precursor for grokking in transformers.

Topic Match: The central contribution is mechanistic understanding of representation dynamics during delayed generalization.

Relevance: 9 Novelty: 8
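A quantity like the normalised spectral entropy in the abstract above can be computed in a few lines. The abstract does not spell out the formula, so the definition here is an assumption: the Shannon entropy of the covariance eigenvalues normalised to sum to one, divided by $\log d$ so that the value lies in $[0, 1]$. The function name is hypothetical.

```python
import numpy as np

def normalised_spectral_entropy(reps, eps=1e-12):
    """Entropy of the covariance eigenvalue spectrum, scaled to [0, 1].

    reps: array of shape (n_samples, d) holding one representation per row.
    Assumed definition (not taken from the paper): H = -sum p_i log p_i
    over eigenvalue fractions p_i, divided by log(d).
    """
    centred = reps - reps.mean(axis=0, keepdims=True)
    cov = centred.T @ centred / max(len(reps) - 1, 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop tiny negatives
    p = eig / (eig.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(entropy / np.log(len(p)))
```

Under this definition an isotropic representation scores close to 1 and a rank-one representation scores close to 0, so tracking the value over training steps and comparing it with the reported threshold of about 0.61 would be the natural use.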


2. Metric-Aware Principal Component Analysis (MAPCA): A Unified Framework for Scale-Invariant Representation Learning

ArXiv ID: 2604.14249

Primary Topic: Representation Learning Theory and Structure

Authors: Michael Leznik

Abstract: We introduce Metric-Aware Principal Component Analysis (MAPCA), a unified framework for scale-invariant representation learning based on the generalised eigenproblem max Tr(W^T Sigma W) subject to W^T M W = I, where M is a symmetric positive definite metric matrix. The choice of M determines the representation geometry. The canonical beta-family M(beta) = Sigma^beta, beta in [0,1], provides continuous spectral bias control between standard PCA (beta=0) and output whitening (beta=1), with condition number kappa(beta) = (lambda_1/lambda_p)^(1-beta) decreasing monotonically to isotropy. The diagonal metric M = D = diag(Sigma) recovers Invariant PCA (IPCA), a method rooted in Frisch (1928) diagonal regression, as a distinct member of the broader framework. We prove that scale invariance holds if and only if the metric transforms as M_tilde = CMC under rescaling C, a condition satisfied exactly by IPCA but not by the general beta-family at intermediate values. Beyond its classical interpretation, MAPCA provides a geometric language that unifies several self-supervised learning objectives. Barlow Twins and ZCA whitening correspond to beta=1 (output whitening); VICReg's variance term corresponds to the diagonal metric. A key finding is that W-MSE, despite being described as a whitening-based method, corresponds to M = Sigma^{-1} (beta = -1), outside the spectral compression range entirely and in the opposite spectral direction to Barlow Twins. This distinction between input and output whitening is invisible at the level of loss functions and becomes precise only within the MAPCA framework.

Comment: Unifies PCA, whitening, IPCA, and several SSL objectives through a metric-conditioned spectral framework.

Topic Match: This is directly about the geometry and invariance structure of learned representations, offering a unifying theoretical lens across classical PCA and self-supervised objectives.

Relevance: 9 Novelty: 8
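The generalised eigenproblem in the MAPCA abstract above, max Tr(W^T Sigma W) subject to W^T M W = I, can be checked numerically. The sketch below is an illustration under stated assumptions rather than the author's code: `sym_power` and `mapca` are hypothetical helper names, and the problem is solved by whitening with M^(-1/2) and taking an ordinary symmetric eigendecomposition.

```python
import numpy as np

def sym_power(mat, power):
    """Real power of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals**power) @ vecs.T  # scale eigenvector columns

def mapca(sigma, metric):
    """Solve Sigma w = mu M w with the normalisation W^T M W = I.

    Returns the generalised eigenvalues in descending order and the
    matching projection matrix W (one component per column).
    """
    m_inv_sqrt = sym_power(metric, -0.5)
    vals, vecs = np.linalg.eigh(m_inv_sqrt @ sigma @ m_inv_sqrt)
    order = np.argsort(vals)[::-1]
    return vals[order], m_inv_sqrt @ vecs[:, order]
```

With `metric = sym_power(sigma, beta)` the returned eigenvalues equal lambda_i^(1-beta), so their ratio reproduces the condition number kappa(beta) = (lambda_1/lambda_p)^(1-beta) quoted in the abstract: beta=0 gives standard PCA and beta=1 gives a fully whitened output.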


3. Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

ArXiv ID: 2604.13304

Primary Topic: Representation Learning Theory and Structure

Authors: Gerasimos Chatzoudis, Konstantinos D. Polyzos, Zhuowei Li, Difei Gu, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

Abstract: Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture the cross-layer computational structure of Transformers, as well as the relative significance of each layer in forming the last-layer representation. Alternatively, we introduce the adoption of Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs achieve high reconstruction fidelity with post-MLP activations while preserving and even improving, in some cases, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a smaller set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results showcase the significance of adopting CLTs as an alternative interpretable proxy of ViTs in the vision domain.

Comment: Uses cross-layer transcoders as sparse proxy models to decompose ViT activations into layer-resolved additive contributions.

Topic Match: This is best categorized as mechanistic representation-structure work focused on interpretable decomposition of learned vision representations.

Relevance: 8 Novelty: 8


4. Identifiability of Potentially Degenerate Gaussian Mixture Models With Piecewise Affine Mixing

ArXiv ID: 2604.13218

Primary Topic: Representation Learning Theory and Structure

Authors: Danru Xu, Sébastien Lachapelle, Sara Magliacane

Abstract: Causal representation learning (CRL) aims to identify the underlying latent variables from high-dimensional observations, even when the variables are dependent on each other. We study this problem for latent variables that follow a potentially degenerate Gaussian mixture distribution and that are only observed through the transformation via a piecewise affine mixing function. We provide a series of progressively stronger identifiability results for this challenging setting in which the probability density functions are ill-defined because of the potential degeneracy. For identifiability up to permutation and scaling, we leverage a sparsity regularization on the learned representation. Based on our theoretical results, we propose a two-stage method to estimate the latent variables by enforcing sparsity and Gaussianity in the learned representations. Experiments on synthetic and image data highlight our method's effectiveness in recovering the ground-truth latent variables.

Comment: Provides identifiability results for latent-variable recovery under degenerate Gaussian mixtures with piecewise affine mixing.

Topic Match: The paper is fundamentally theoretical representation-learning work on identifiability of latent structure under difficult generative assumptions.

Relevance: 8 Novelty: 8


5. Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

ArXiv ID: 2604.13694

Primary Topic: Representation Learning Theory and Structure

Also Matches: Architecture and Training Dynamics

Authors: Chenghao Sun, Chengsheng Zhang, Guanzheng Qin, Rui Dai, Xinmei Tian

Abstract: Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation space may merely aggregate or amplify upstream signals rather than encode the target capability in their own parameters. To address this gap, we propose Weight Patching, a parameter-space intervention method for source-oriented analysis in paired same-architecture models that differ in how strongly they express a target capability under the inputs of interest. Given a base model and a behavior-specialized counterpart, Weight Patching replaces selected module weights from the specialized model into the base model under a fixed input. We instantiate the method on instruction following and introduce a framework centered on a vector-anchor behavioral interface that provides a shared internal criterion for whether a task-relevant control state has been formed or recovered in open-ended generation. Under this framework, the analysis reveals a hierarchy from shallow candidate source-side carriers to aggregation and routing modules, and further to downstream execution circuits. The recovered component scores can also guide mechanism-aware model merging, improving selective fusion across the evaluated expert combinations and providing additional external validation.

Comment: Introduces weight patching as a parameter-space intervention for source-level mechanistic localization in LLMs.

Topic Match: The emphasis is mechanistic understanding of where capabilities are encoded in parameters, making representation/mechanism analysis the primary fit.

Relevance: 8 Novelty: 8


6. From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning

ArXiv ID: 2604.13460

Primary Topic: Representation Learning Theory and Structure

Authors: Zonghuan Xu, Xingjun Ma

Abstract: A central challenge in continual learning is forgetting, the loss of performance on previously learned tasks induced by sequential adaptation to new ones. While forgetting has been extensively studied empirically, rigorous theoretical characterizations remain limited. A notable step in this direction is Evron et al. (2022), which analyzes forgetting under random orderings of a fixed task collection in overparameterized linear regression. We shift the perspective from order to distribution. Rather than asking how a fixed task collection behaves under random orderings, we study an exact-fit linear regime in which tasks are sampled i.i.d. from a task distribution $\Pi$, and ask how the generating distribution itself governs forgetting. In this setting, we derive an exact operator identity for the forgetting quantity, revealing a recursive spectral structure. Building on this identity, we establish an unconditional upper bound, identify the leading asymptotic term, and, in generic nondegenerate cases, characterize the convergence rate up to constants. We further relate this rate to geometric properties of the task distribution, clarifying what drives slow or fast forgetting in this model.

Comment: Gives an exact spectral characterization of forgetting from the task distribution rather than random task order in continual linear learning.

Topic Match: This is a theoretical paper on the structure of forgetting and representation change in continual learning, with the main contribution being a spectral characterization.

Relevance: 8 Novelty: 8


Memory Structures and Agent Memory Systems (1)

1. Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

ArXiv ID: 2604.13085

Primary Topic: Memory Structures and Agent Memory Systems

Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Rajat Khanda, Mohammad Baqar, Sambuddha Chakrabarti, Satyasaran Changdar

Abstract: Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid–Glass–Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker–Planck equation admitting a closed-form Beta stationary distribution. We provide proofs of: (i) well-posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34–43% over the strongest baseline), reductions in catastrophic forgetting (67–80%), and a 62% decrease in memory footprint.

Comment: Proposes a three-phase memory hierarchy with SDE/Fokker–Planck analysis and RL error bounds linking consolidation dynamics to continual-learning performance.

Topic Match: The core contribution is a new memory-consolidation mechanism with formal dynamics, capacity analysis, and forgetting control, making memory systems the clearest primary fit.

Relevance: 9 Novelty: 8


World Models, Exploration, and Open-Ended Reinforcement Learning (6)

1. Learning Ad Hoc Network Dynamics via Graph-Structured World Models

ArXiv ID: 2604.14811

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Also Matches: Architecture and Training Dynamics

Authors: Can Karacelebi, Yusuf Talha Sahin, Elif Surer, Ertan Onur

Abstract: Ad hoc wireless networks exhibit complex, innate, and coupled dynamics (node mobility, energy depletion, and topology change) that are difficult to model analytically. Model-free deep reinforcement learning requires sustained online interaction, whereas existing model-based approaches use flat state representations that lose per-node structure. We therefore propose G-RSSM, a graph-structured recurrent state-space model that maintains per-node latent states with cross-node multi-head attention to learn the dynamics jointly from offline trajectories. We apply the proposed method to the downstream task of clustering, where a cluster-head selection policy trains entirely through imagined rollouts in the learned world model. Across 27 evaluation scenarios spanning MANET, VANET, FANET, WSN, and tactical networks with N=30 to 1000 nodes, the learned policy maintains high connectivity despite being trained only for N=50. We thus present the first multi-physics graph-structured world model applied to combinatorial per-node decision making in size-agnostic wireless ad hoc networks.

Comment: Graph-structured recurrent state-space world model learns per-node latent dynamics and supports imagined-rollout policy training.

Topic Match: The core contribution is a learned graph world model used for model-based RL with offline trajectories and imagination, directly matching foundational world-model research.

Relevance: 9 Novelty: 8


2. Reinforcement Learning via Value Gradient Flow

ArXiv ID: 2604.14265

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang

Abstract: We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods rely either on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, which enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at https://ryanxhr.github.io/vgf.

Comment: Recasts behavior-regularized RL as optimal transport and solves it via value-guided particle flow instead of explicit policy parameterization.

Topic Match: The main contribution is a new foundational RL optimization framework with a distinct transport-based learning principle.

Relevance: 8 Novelty: 8
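The idea in the abstract above, particles initialized from a reference distribution and pushed along value gradients under a limited budget, can be illustrated with a toy example. This is not the VGF algorithm itself, whose details the abstract does not give. The quadratic value function, the step size, and the use of the step count as a stand-in for the transport budget are all assumptions made for illustration.

```python
import numpy as np

def value_grad(actions, target=2.0):
    """Gradient of the toy value Q(a) = -(a - target)^2 (an assumed Q)."""
    return -2.0 * (actions - target)

def flow_particles(particles, steps, step_size=0.05):
    """Move particles along the value gradient for a fixed number of steps.

    The number of steps plays the role of a transport budget here:
    zero steps returns the reference samples unchanged, and more steps
    let the particles drift further toward high-value actions.
    """
    out = np.array(particles, dtype=float)
    for _ in range(steps):
        out = out + step_size * value_grad(out)
    return out

def mean_value(actions, target=2.0):
    """Average toy value over a set of particles."""
    return float(np.mean(-(actions - target) ** 2))
```

In this toy setting a small budget keeps the particles near the reference samples (strong implicit regularization) while a large budget concentrates them near the maximiser of the value, which mirrors the budget-controlled trade-off the abstract describes.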


3. Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

ArXiv ID: 2604.14243

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Sourav Ganguly, Kartik Pandit, Arnob Ghosh

Abstract: Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on exogenous factors outside its control (competing agents, environmental disturbances, or strategic adversaries). Formally, $s_{h+1} = f(s_h, a_h, \bar{a}_h)+\omega_h$ where $\bar{a}_h$ is the adversary/external action, $a_h$ is the agent's action, and $\omega_h$ is an additive noise. Ignoring such factors can yield policies that are optimal in isolation but fail catastrophically in deployment, particularly when safety constraints must be satisfied. Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the strategic interaction between agent and exogenous factor, and rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an adversarial policy $\bar{\pi}$ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics. We propose Robust Hallucinated Constrained Upper-Confidence RL (RHC-UCRL), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. RHC-UCRL achieves sub-linear regret and constraint violation guarantees.

Comment: Presents a model-based RL algorithm for safety-constrained control under explicit adversarial dynamics with regret and violation guarantees.

Topic Match: It is foundational RL theory on model-based learning under adversarial environment dynamics rather than LLM post-training.

Relevance: 8 Novelty: 8


4. Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning

ArXiv ID: 2604.14974

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Jean-Bastien Grill, Michal Valko, Rémi Munos

Abstract: You are a robot and you live in a Markov decision process (MDP) with a finite or an infinite number of transitions from state-action to next states. You got brains and so you plan before you act. Luckily, your roboparents equipped you with a generative model to do some Monte-Carlo planning. The world is waiting for you and you have no time to waste. You want your planning to be efficient. Sample-efficient. Indeed, you want to exploit the possible structure of the MDP by exploring only a subset of states reachable by following near-optimal policies. You want guarantees on sample complexity that depend on a measure of the quantity of near-optimal states. You want something, that is an extension of Monte-Carlo sampling (for estimating an expectation) to problems that alternate maximization (over actions) and expectation (over next states). But you do not want to StOP with exponential running time, you want something simple to implement and computationally efficient. You want it all and you want it now. You want TrailBlazer.

Comment: Develops sample-efficient Monte Carlo planning with guarantees depending on near-optimal reachable states rather than exhaustive exploration.

Topic Match: The paper is squarely about foundational planning and sample complexity in model-based decision making.

Relevance: 8 Novelty: 8


5. Golden Handcuffs make safer AI agents

ArXiv ID: 2604.13609

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Aram Ebtekar, Michael K. Cohen

Abstract: Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value $-L$, while the true environment's rewards lie in $[0,1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.

Comment: Provides a theoretical Bayesian safety mechanism for RL with mentor override and guarantees on regret and avoidance of low-complexity failure predicates.

Topic Match: This is a strong fit because it is foundational RL theory about agent behavior under uncertainty and safe exploration, not LLM post-training.

Relevance: 8 Novelty: 8


6. Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization

ArXiv ID: 2604.14765

Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning

Authors: Mathias Dus (IRMA)

Abstract: We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto's calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.

Comment: Builds a Wasserstein geometric formulation of policy optimization with policy gradient flow and second-order analysis in policy space.

Topic Match: This is foundational RL theory centered on a new geometric view of policy optimization rather than downstream post-training.

Relevance: 8 Novelty: 8


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Relevant Topics

Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.

Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.

  1. Architecture and Training Dynamics
     - Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
     - Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.

  2. Efficiency, Compression, and Large-Scale Training
     - Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
     - Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.

  3. Representation Learning Theory and Structure
     - Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
     - Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.

  4. Memory Structures and Agent Memory Systems
     - Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
     - Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.

  5. World Models, Exploration, and Open-Ended Reinforcement Learning
     - Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
     - Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.

Usually leave these to the hotspot digest unless the core contribution is clearly foundational:

  • major model or product releases
  • broadly trendy agent or tooling launches
  • benchmark, leaderboard, or evaluation-only papers
  • downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains

Scoring Criteria

Relevance and Novelty are independent axes. Score both from 1 to 10.

Relevance Scoring

  • 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
  • 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
  • 5-6: touches the target topics, but the main contribution is elsewhere.
  • 3-4: largely outside the target topics, often application-focused or domain-specific.
  • 1-2: unrelated.

Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.

Novelty Scoring

  • 9-10: new paradigm, theory, or major methodological breakthrough.
  • 7-8: substantial methodological advance or strong new insight.
  • 5-6: meaningful but incremental extension or refinement.
  • 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
  • 1-2: little originality; mainly standard application of existing methods.

Topic Registry

Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.

  • architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
  • efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
  • representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
  • memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
  • world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.

Papers

[PAPER LIST HERE]

Instructions

Respond in JSONL. Output exactly one JSON object per paper, one per line:

{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}

Rules:

  • ARXIVID: the arXiv ID.
  • COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
  • RELEVANCE: integer from 1 to 10.
  • NOVELTY: integer from 1 to 10.
  • PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
  • MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
  • TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
  • HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
  • HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
  • Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
  • daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
  • new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
  • Do not output markdown, code fences, or any extra text.
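
The output rules above can be checked mechanically by whatever consumes the JSONL. The following is a minimal sketch of such a check, not part of the original prompt or pipeline: the function name `validate_line` and the constant names are hypothetical, while the allowed topic IDs, hotspot tags, and field names are copied from the Topic Registry and Rules sections.

```python
import json

# Allowed values copied from the Topic Registry and Rules sections.
ALLOWED_TOPIC_IDS = {
    "architecture_training",
    "efficiency_scaling",
    "representation_structure",
    "memory_systems",
    "world_models_open_ended_rl",
}
ALLOWED_HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT",
    "HOTSPOT_PAPER_TAGS", "HOTSPOT_PAPER_COMMENT",
}


def validate_line(line):
    """Return a list of rule violations for one JSONL line (empty if none).

    Hypothetical helper: it only checks the structural rules, not the
    quality of the comments or scores.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: " + exc.msg]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    errors = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        errors.append("missing keys: " + ", ".join(sorted(missing)))

    # RELEVANCE and NOVELTY must be integers from 1 to 10 (bools excluded).
    for key in ("RELEVANCE", "NOVELTY"):
        value = record.get(key)
        if type(value) is not int or not 1 <= value <= 10:
            errors.append(key + " must be an integer from 1 to 10")

    primary = record.get("PRIMARY_TOPIC_ID")
    if not isinstance(primary, str) or primary not in ALLOWED_TOPIC_IDS:
        errors.append("PRIMARY_TOPIC_ID is not an allowed topic ID")

    matched = record.get("MATCHED_TOPIC_IDS", [])
    if not isinstance(matched, list) or not all(
        isinstance(t, str) and t in ALLOWED_TOPIC_IDS for t in matched
    ):
        errors.append("MATCHED_TOPIC_IDS must list allowed topic IDs only")
    elif len(matched) > 1 and primary not in matched:
        errors.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")

    tags = record.get("HOTSPOT_PAPER_TAGS", [])
    if not isinstance(tags, list) or not all(
        isinstance(t, str) and t in ALLOWED_HOTSPOT_TAGS for t in tags
    ):
        errors.append("HOTSPOT_PAPER_TAGS must use daily_hot or new_frontier")
    else:
        comment = record.get("HOTSPOT_PAPER_COMMENT", "")
        if tags and not comment:
            errors.append("HOTSPOT_PAPER_COMMENT required when tags are set")
        if not tags and comment != "":
            errors.append("HOTSPOT_PAPER_COMMENT must be empty without tags")

    return errors
```

Applied to each line of a response, a non-empty result flags the record for retry or manual review. Note that the template object shown in the Instructions would itself be rejected by this sketch, since its placeholder scores of 0 fall outside the 1 to 10 range.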