Personalized Daily ArXiv Papers 2026-04-24
| Model | Metric | Usage: Prompt | Usage: Completion | Usage: Total | Papers: Total arXiv | Papers: Scanned | Papers: Relevant |
|---|---|---|---|---|---|---|---|
| gpt-5.4 | Tokens | 173554 | 24750 | 198304 | 587 | 359 | 16 |
| | Cost | $0.43 | $0.37 | $0.81 | | | |
Table of contents by topic:
Architecture and Training Dynamics (8)
- On Bayesian Softmax-Gated Mixture-of-Experts Models Authors: Nicola Bariletto, Huy Nguyen, Nhat Ho, Alessandro Rinaldo
- An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling Authors: Anif N. Shikder, Ramit Dey, Sayantan Auddy, Luisa Liboni, Alexandra N. Busch, Arthur Powanwe, Ján Mináč, Roberto C. Budzinski, Lyle E. Muller
- SGD at the Edge of Stability: The Stochastic Sharpness Gap Authors: Fangshuo Liao, Afroditi Kolomvaki, Anastasios Kyrillidis
- EvoForest: A Novel Machine-Learning Paradigm via Open-Ended Evolution of Computational Graphs Authors: Kamer Ali Yuksel, Hassan Sawaf
- Super Apriel: One Checkpoint, Many Speeds Authors: SLAM Labs: Oleksiy Ostapenko, Raymond Li, Torsten Scholak, Alireza Mousavi-Hosseini, Aman Tiwari, Denis Kocetkov, Joel Lamy Poirier, Kelechi Ogueji, Nanda H Krishna, Rafael Pardinas, Sathwik Tejaswi Madhusudhan, Shruthan Radhakrishna, Srinivas Sunkara, Valerie Becaert
- Frequency-Forcing: From Scaling-as-Time to Soft Frequency Guidance Authors: Weitao Du
- Geometric Layer-wise Approximation Rates for Deep Networks Authors: Shijun Zhang, Zuowei Shen, Yuesheng Xu
- From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges Authors: Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang, Qingqiu Huang, Xinge Zhu, Yuexin Ma
Efficiency, Compression, and Large-Scale Training (2)
- Continuous Semantic Caching for Low-Cost LLM Serving Authors: Baran Atalar, Xutong Liu, Jinhang Zuo, Siwei Wang, Wei Chen, Carlee Joe-Wong
- Improved large-scale graph learning through ridge spectral sparsification Authors: Daniele Calandriello, Ioannis Koutis, Alessandro Lazaric, Michal Valko
Representation Learning Theory and Structure (3)
- Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders Authors: Het Patel, Tiejin Chen, Hua Wei, Evangelos E. Papalexakis, Jia Chen
- Convergent Evolution: How Different Language Models Learn Similar Number Representations Authors: Deqing Fu, Tianyi Zhou, Mikhail Belkin, Vatsal Sharan, Robin Jia
- Rethinking Intrinsic Dimension Estimation in Neural Representations Authors: Rickmer Schulte, David Rügamer
Memory Structures and Agent Memory Systems (1)
- Absorber LLM: Harnessing Causal Synchronization for Test-Time Training Authors: Zhixin Zhang, Shabo Zhang, Chengcan Wu, Zeming Wei, Meng Sun
World Models, Exploration, and Open-Ended Reinforcement Learning (2)
- Distributional Value Estimation Without Target Networks for Robust Quality-Diversity Authors: Behrad Koohy, Jamie Bayne
- Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning Authors: Aravind Venugopal, Jiayu Chen, Xudong Wu, Chongyi Zheng, Benjamin Eysenbach, Jeff Schneider
Architecture and Training Dynamics (8)
1. On Bayesian Softmax-Gated Mixture-of-Experts Models
ArXiv ID: 2604.20551
Primary Topic: Architecture and Training Dynamics
Authors: Nicola Bariletto, Huy Nguyen, Nhat Ho, Alessandro Rinaldo
Abstract: Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet their theoretical properties in the Bayesian framework remain largely unexplored. In this paper, we study Bayesian mixture-of-experts models, focusing on the ubiquitous softmax-based gating mechanism. Specifically, we investigate the asymptotic behavior of the posterior distribution for three fundamental statistical tasks: density estimation, parameter estimation, and model selection. First, we establish posterior contraction rates for density estimation, both in the regimes with a fixed, known number of experts and with a random learnable number of experts. We then analyze parameter estimation and derive convergence guarantees based on tailored Voronoi-type losses, which account for the complex identifiability structure of mixture-of-experts models. Finally, we propose and analyze two complementary strategies for selecting the number of experts. Taken together, these results provide one of the first systematic theoretical analyses of Bayesian mixture-of-experts models with softmax gating, and yield several theory-grounded insights for practical model design.
Comment: Provides one of the first systematic Bayesian theories for softmax-gated mixture-of-experts, including posterior contraction, identifiability-aware parameter estimation, and expert-number selection.
Topic Match: The paper is centrally about MoE gating as a core architectural mechanism and develops foundational statistical understanding rather than an application.
Relevance: 9 Novelty: 8
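To make the object of study concrete, here is a minimal sketch of a softmax-gated mixture-of-experts conditional density. It assumes Gaussian experts with linear means, which the abstract does not specify; all parameter names (`gate_w`, `gate_b`, `exp_w`, `exp_b`, `exp_sigma`) are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def moe_conditional_density(y, x, gate_w, gate_b, exp_w, exp_b, exp_sigma):
    """p(y | x) = sum_k gate_k(x) * Normal(y; mean_k(x), sigma_k^2).

    ASSUMPTION: experts are Gaussian with linear means; the paper may
    use a more general expert family.
    """
    gates = softmax(gate_w @ x + gate_b)      # input-dependent mixing weights
    means = exp_w @ x + exp_b                 # one mean per expert
    comps = np.exp(-0.5 * ((y - means) / exp_sigma) ** 2) / (
        exp_sigma * np.sqrt(2.0 * np.pi)
    )
    return float(gates @ comps)
```

One source of the identifiability structure mentioned in the abstract is visible here: adding the same vector to every row of `gate_w`, or the same constant to every entry of `gate_b`, leaves the density unchanged, because softmax is invariant to a common shift of its logits.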
2. An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
ArXiv ID: 2604.20595
Primary Topic: Architecture and Training Dynamics
Authors: Anif N. Shikder, Ramit Dey, Sayantan Auddy, Luisa Liboni, Alexandra N. Busch, Arthur Powanwe, Ján Mináč, Roberto C. Budzinski, Lyle E. Muller
Abstract: We establish a mathematical correspondence between state space models, a state-of-the-art architecture for capturing long-range dependencies in data, and an exactly solvable nonlinear oscillator network. As a specific example of this general correspondence, we analyze the diagonal linear time-invariant implementation of the Structured State Space Sequence model (S4). The correspondence embeds S4D, a specific implementation of S4, into a ring network topology, in which recent inputs are encoded, as waves of activity traveling over the one-dimensional spatial layout of the network. We then derive an exact operator expression for the full forward pass of S4D, yielding an analytical characterization of its complete input-output map. This expression reveals that the nonlinear decoder in the system induces interactions between these information-carrying waves that enable classifying real-world sequences. These results generalize across modern SSM architectures, and show that they admit an exact mathematical description with a clear physical interpretation. These insights enable a new level of interpretability for these systems in terms of nonlinear oscillator networks.
Comment: Derives an exact operator expression for the full S4D forward pass by mapping modern state-space models to nonlinear oscillator networks.
Topic Match: The paper gives mechanistic architectural understanding of state-space sequence models, directly matching core architecture-analysis interests.
Relevance: 9 Novelty: 8
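For readers unfamiliar with S4D, the sketch below shows the diagonal linear time-invariant recurrence and the equivalent convolution kernel. It is not the paper's operator expression; it only illustrates the linear layer that such an analysis starts from, assuming zero-order-hold discretization and a single real output channel. The function names are hypothetical.

```python
import numpy as np

def s4d_discretize(A, B, dt):
    """Zero-order-hold discretization of a diagonal continuous-time SSM."""
    A_bar = np.exp(dt * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def s4d_recurrent(A_bar, B_bar, C, u):
    """Run x_k = A_bar * x_{k-1} + B_bar * u_k, y_k = Re(C . x_k) step by step."""
    x = np.zeros_like(A_bar)
    ys = []
    for u_k in u:
        x = A_bar * x + B_bar * u_k
        ys.append((C @ x).real)
    return np.array(ys)

def s4d_convolutional(A_bar, B_bar, C, u):
    """Same map as a causal convolution with kernel K_l = sum_n C_n A_bar_n^l B_bar_n."""
    L = len(u)
    powers = A_bar[None, :] ** np.arange(L)[:, None]        # shape (L, N)
    kernel = (powers * (B_bar * C)[None, :]).sum(axis=1)    # shape (L,)
    return np.array([(kernel[: k + 1][::-1] @ u[: k + 1]).real for k in range(L)])
```

Each diagonal entry of `A` is a damped complex oscillator mode, which is the ingredient the oscillator-network picture builds on.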
3. SGD at the Edge of Stability: The Stochastic Sharpness Gap
ArXiv ID: 2604.21016
Primary Topic: Architecture and Training Dynamics
Authors: Fangshuo Liao, Afroditi Kolomvaki, Anastasios Kyrillidis
Abstract: When training neural networks with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian -- the sharpness $S(\boldsymbol{\theta})$ -- rises to $2/\eta$ and hovers there, a phenomenon termed the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbol{\theta})\leq 2/\eta$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/\eta$, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of Damian et al. (2023), we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $\Delta S = \eta \beta \sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{\boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.
Comment: Provides a theory for SGD edge-of-stability sharpness suppression via gradient-noise-driven stochastic self-stabilization.
Topic Match: This is directly about core training dynamics, explaining how mini-batch noise alters sharpness equilibria relative to full-batch gradient descent.
Relevance: 9 Novelty: 8
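The closed-form gap quoted in the abstract is easy to evaluate numerically. The sketch below implements that formula directly. How $\sigma_{\boldsymbol{u}}^{2}$ depends on batch size is an assumption here (sampling without replacement from a finite dataset), chosen only because it vanishes at full batch; the paper's own noise model may differ.

```python
def sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Delta S = eta * beta * sigma_u^2 / (4 * alpha), as quoted in the abstract."""
    return eta * beta * sigma_u_sq / (4.0 * alpha)

def minibatch_noise_variance(per_sample_var, batch_size, dataset_size):
    """ASSUMPTION (not from the abstract): variance of a mini-batch mean when
    sampling without replacement, which is zero at full batch."""
    if dataset_size <= 1:
        return 0.0
    correction = (dataset_size - batch_size) / (dataset_size - 1)
    return per_sample_var / batch_size * correction

def equilibrium_sharpness(eta, alpha, beta, sigma_u_sq):
    """Sharpness settles at 2/eta minus the gap (2/eta itself when noise is zero)."""
    return 2.0 / eta - sharpness_gap(eta, alpha, beta, sigma_u_sq)
```

Under this assumed noise model the gap shrinks monotonically as the batch grows and is exactly zero at full batch, matching the qualitative claims in the abstract.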
4. EvoForest: A Novel Machine-Learning Paradigm via Open-Ended Evolution of Computational Graphs
ArXiv ID: 2604.19761
Primary Topic: Architecture and Training Dynamics
Also Matches: World Models, Exploration, and Open-Ended Reinforcement Learning
Authors: Kamer Ali Yuksel, Hassan Sawaf
Abstract: Modern machine learning is still largely organized around a single recipe: choose a parameterized model family and optimize its weights. Although highly successful, this paradigm is too narrow for many structured prediction problems, where the main bottleneck is not parameter fitting but discovering what should be computed from the data. Success often depends on identifying the right transformations, statistics, invariances, interaction structures, temporal summaries, gates, or nonlinear compositions, especially when objectives are non-differentiable, evaluation is cross-validation-based, interpretability matters, or continual adaptation is required. We present EvoForest, a hybrid neuro-symbolic system for end-to-end open-ended evolution of computation. Rather than merely generating features, EvoForest jointly evolves reusable computational structure, callable function families, and trainable low-dimensional continuous components inside a shared directed acyclic graph. Intermediate nodes store alternative implementations, callable nodes encode reusable transformation families such as projections, gates, and activations, output nodes define candidate predictive computations, and persistent global parameters can be refined by gradient descent. For each graph configuration, EvoForest evaluates the discovered computation and uses a lightweight Ridge-based readout to score the resulting representation against a non-differentiable cross-validation target. The evaluator also produces structured feedback that guides future LLM-driven mutations. In the 2025 ADIA Lab Structural Break Challenge, EvoForest reached 94.13% ROC-AUC after 600 evolution steps, exceeding the publicly reported winning score of 90.14% under the same evaluation protocol.
Comment: Proposes open-ended evolution of reusable computational graphs rather than optimizing within a fixed parameterized model family.
Topic Match: The paper's core contribution is a new paradigm for discovering computational structure and modular graphs, squarely an architecture/mechanism topic.
Relevance: 8 Novelty: 9
5. Super Apriel: One Checkpoint, Many Speeds
ArXiv ID: 2604.19877
Primary Topic: Architecture and Training Dynamics
Also Matches: Efficiency, Compression, and Large-Scale Training
Authors: SLAM Labs: Oleksiy Ostapenko, Raymond Li, Torsten Scholak, Alireza Mousavi-Hosseini, Aman Tiwari, Denis Kocetkov, Joel Lamy Poirier, Kelechi Ogueji, Nanda H Krishna, Rafael Pardinas, Sathwik Tejaswi Madhusudhan, Shruthan Radhakrishna, Srinivas Sunkara, Valerie Becaert
Abstract: We release Super Apriel, a 15B-parameter supernet in which every decoder layer provides four trained mixer choices -- Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A placement selects one mixer per layer; placements can be switched between requests at serving time without reloading weights, enabling multiple speed presets from a single checkpoint. The shared checkpoint also enables speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on all reported benchmarks; recommended hybrid presets span $2.9\times$ to $10.7\times$ decode throughput at 96% to 77% quality retention, with throughput advantages that compound at longer context lengths. With four mixer types across 48 layers, the configuration space is vast. A surrogate that predicts placement quality from the per-layer mixer assignment makes the speed-quality landscape tractable and identifies the best tradeoffs at each speed level. We investigate whether the best configurations at each speed level can be identified early in training or only after convergence. Rankings stabilize quickly at 0.5B scale, but the most efficient configurations exhibit higher instability at 15B, cautioning against extrapolation from smaller models. Super Apriel is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. We release the supernet weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.
Comment: A single supernet exposes per-layer mixer choices at serving time, enabling dynamic speed-quality tradeoffs from one checkpoint.
Topic Match: The main idea is architectural: a decoder with multiple trained mixer options per layer and switchable placements, with efficiency as an important downstream consequence.
Relevance: 8 Novelty: 8
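To put "the configuration space is vast" in numbers: four mixer choices across 48 layers give $4^{48} = 2^{96}$ placements, roughly $7.9 \times 10^{28}$. The sketch below counts them and draws a random placement. The list-of-strings representation is a hypothetical illustration, not the format used by the released toolkit.

```python
import random

MIXERS = ("FA", "SWA", "KDA", "GDN")   # the four mixer types named in the abstract
NUM_LAYERS = 48

def count_placements(num_mixers=len(MIXERS), num_layers=NUM_LAYERS):
    """Number of ways to pick one mixer per layer."""
    return num_mixers ** num_layers

def sample_placement(rng, mixers=MIXERS, num_layers=NUM_LAYERS):
    """Draw one placement uniformly at random (hypothetical representation)."""
    return [rng.choice(mixers) for _ in range(num_layers)]

def uniform_placement(mixer, num_layers=NUM_LAYERS):
    """A preset that uses the same mixer in every layer, e.g. all-FA."""
    return [mixer] * num_layers
```

Exhaustive evaluation is out of the question at this size, which is why the abstract relies on a surrogate that predicts placement quality.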
6. Frequency-Forcing: From Scaling-as-Time to Soft Frequency Guidance
ArXiv ID: 2604.20902
Primary Topic: Architecture and Training Dynamics
Authors: Weitao Du
Abstract: While standard flow-matching models transport noise to data uniformly, incorporating an explicit generation order - specifically, establishing coarse, low-frequency structure before fine detail - has proven highly effective for synthesizing natural images. Two recent works offer distinct paradigms for this. K-Flow imposes a hard frequency constraint by reinterpreting a frequency scaling variable as flow time, running the trajectory inside a transformed amplitude space. Latent Forcing provides a soft ordering mechanism by coupling the pixel flow with an auxiliary semantic latent flow via asynchronous time schedules, leaving the pixel interpolation path itself untouched. Viewed from the angle of improving pixel generation, we observe that forcing - guiding generation with an earlier-maturing auxiliary stream - offers a highly compatible route to scale-ordered generation without rewriting the core flow coordinate. Building on this, we propose Frequency-Forcing, which realizes K-Flow's frequency ordering through Latent Forcing's soft mechanism: a standard pixel flow is guided by an auxiliary low-frequency stream that matures earlier in time. Unlike Latent Forcing, whose scratchpad relies on a heavy pretrained encoder (e.g., DINO), our frequency scratchpad is derived from the data itself via a lightweight learnable wavelet packet transform. We term this a self-forcing signal, which avoids external dependencies while learning a basis better adapted to data statistics than the fixed bases used in hard frequency flows. On ImageNet-256, Frequency-Forcing consistently improves FID over strong pixel- and latent-space baselines, and naturally composes with a semantic stream to yield further gains. This illustrates that forcing-based scale ordering is a versatile, path-preserving alternative to hard frequency flows.
Comment: Introduces a soft frequency-guidance mechanism for flow matching that preserves the base path while enforcing coarse-to-fine generation order.
Topic Match: This is a generative-model architecture/training mechanism paper about how to structure computation over frequencies, not an application paper.
Relevance: 8 Novelty: 8
7. Geometric Layer-wise Approximation Rates for Deep Networks
ArXiv ID: 2604.20219
Primary Topic: Architecture and Training Dynamics
Authors: Shijun Zhang, Zuowei Shen, Yuesheng Xu
Abstract: Depth is widely viewed as a central contributor to the success of deep neural networks, whereas standard neural network approximation theory typically provides guarantees only for the final output and leaves the role of intermediate layers largely unclear. We address this gap by developing a quantitative framework in which depth admits a precise scale-dependent interpretation. Specifically, we design a single shared mixed-activation architecture of fixed width $2dN+d+2$ and any prescribed finite depth such that each intermediate readout $\Phi_\ell$ is itself an approximant to the target function $f$. For $f\in L^p([0,1]^d)$ with $p\in [1,\infty)$, the approximation error of $\Phi_\ell$ is controlled by $(2d+1)$ times the $L^p$ modulus of continuity at the geometric scale $N^{-\ell}$ for all $\ell$. The estimate reduces to the geometric rate $(2d+1)N^{-\ell}$ if $f$ is $1$-Lipschitz. Our network design is inspired by multigrade deep learning, where depth serves as a progressive refinement mechanism: each new correction targets residual information at a finer scale while the earlier correction terms remain part of the later readouts, yielding a nested architecture that supports adaptive refinement without redesigning the preceding network.
Comment: Provides layer-wise approximation guarantees where each intermediate depth readout is itself a progressively finer approximation.
Topic Match: This is directly about what depth does in deep networks, giving a precise multiscale interpretation of intermediate layers.
Relevance: 8 Novelty: 8
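The geometric scale $N^{-\ell}$ can be illustrated without any network. The sketch below is not the paper's architecture: it uses a plain piecewise-constant readout on a grid of $N^{\ell}$ cells in one dimension, and checks that for a 1-Lipschitz target the sup error is at most $N^{-\ell}$, which sits inside the $(2d+1)N^{-\ell}$ bound for $d=1$.

```python
import numpy as np

def piecewise_constant_readout(f, N, level, x):
    """Evaluate f at the left endpoint of the cell containing x, using
    N**level equal cells on [0, 1]. Stand-in for a level-`level` readout."""
    cells = N ** level
    idx = np.minimum(np.floor(x * cells), cells - 1)
    return f(idx / cells)

def sup_error(f, N, level, num_samples=10001):
    """Largest pointwise error of the readout over a dense sample of [0, 1]."""
    x = np.linspace(0.0, 1.0, num_samples)
    return float(np.max(np.abs(f(x) - piecewise_constant_readout(f, N, level, x))))

def target(x):
    """A 1-Lipschitz test function with a kink at 0.37 (arbitrary choice)."""
    return np.abs(x - 0.37)
```

Calling `sup_error(target, 3, level)` for increasing `level` shows the error shrinking by about a factor of `N` per level, which is the sense in which each extra level refines at a finer scale.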
8. From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges
ArXiv ID: 2604.21391
Primary Topic: Architecture and Training Dynamics
Authors: Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang, Qingqiu Huang, Xinge Zhu, Yuexin Ma
Abstract: Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
Comment: Residual diffusion bridge with low-frequency intent anchoring is a new policy architecture for VLA control.
Topic Match: The core contribution is architectural: decomposing action generation into intent anchors plus residual dynamics to improve optimization and conditioning in generative policies.
Relevance: 8 Novelty: 8
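The split into a low-frequency anchor and a high-frequency residual can be pictured with a simple spectral filter. The abstract does not say which transform ResVLA uses, so the sketch below assumes a plain FFT low-pass on a one-dimensional action trajectory; `split_low_high` and `cutoff_bin` are hypothetical names.

```python
import numpy as np

def split_low_high(trajectory, cutoff_bin):
    """Split a 1-D trajectory into a low-frequency anchor and the residual.

    ASSUMPTION: an FFT low-pass stands in for the paper's unspecified
    spectral analysis. Bins 0..cutoff_bin are kept in the anchor.
    """
    spectrum = np.fft.rfft(trajectory)
    low = spectrum.copy()
    low[cutoff_bin + 1:] = 0.0                       # drop high-frequency bins
    anchor = np.fft.irfft(low, n=len(trajectory))    # smooth "intent" component
    residual = trajectory - anchor                   # fast local dynamics
    return anchor, residual
```

By construction the two parts sum back to the original trajectory, so a model that predicts the anchor deterministically only has to generate the residual.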
Efficiency, Compression, and Large-Scale Training (2)
1. Continuous Semantic Caching for Low-Cost LLM Serving
ArXiv ID: 2604.20021
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Baran Atalar, Xutong Liu, Jinhang Zuo, Siwei Wang, Wei Chen, Carlee Joe-Wong
Abstract: As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic $\epsilon$-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.
Comment: Provides a first rigorous theory and online algorithm for semantic response caching in continuous query embedding spaces with regret guarantees.
Topic Match: The core contribution is a new cache design and learning algorithm that changes LLM serving cost behavior under continuous semantic query distributions.
Relevance: 8 Novelty: 8
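The two building blocks named in the abstract, an $\epsilon$-net over the embedding space and Kernel Ridge Regression, can be sketched generically. This is not the paper's dynamic algorithm: the greedy net construction, the RBF kernel, and every function name below are assumptions made for illustration.

```python
import numpy as np

def greedy_epsilon_net(points, eps):
    """Keep a point as a new center only if it is farther than eps from all
    existing centers. Every input point ends up within eps of some center."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return np.array(centers)

def rbf_kernel(X, Y, length_scale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

def krr_fit_predict(X_train, y_train, X_query, lam=1e-3, length_scale=1.0):
    """Kernel ridge regression: solve (K + lam * I) a = y, predict k(x)^T a."""
    K = rbf_kernel(X_train, X_train, length_scale)
    coef = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_query, X_train, length_scale) @ coef
```

In a caching setting one might observe serving costs only at a few queries and use `krr_fit_predict` to estimate costs at the net centers, which is the kind of generalization across semantic neighborhoods the abstract describes.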
2. Improved large-scale graph learning through ridge spectral sparsification
ArXiv ID: 2604.20078
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Daniele Calandriello, Ioannis Koutis, Alessandro Lazaric, Michal Valko
Abstract: Graph-based techniques and spectral graph theory have enriched the field of machine learning with a variety of critical advances. A central object in the analysis is the graph Laplacian L, which encodes the structure of the graph. We consider the problem of learning over this Laplacian in a distributed streaming setting, where new edges of the graph are observed in real time by a network of workers. In this setting, it is hard to learn quickly or approximately while keeping a distributed representation of L. To address this challenge, we present a novel algorithm, GSQUEAK, which efficiently sparsifies the Laplacian by maintaining a small subset of effective resistances. We show that our algorithm produces sparsifiers with strong spectral approximation guarantees, all while processing edges in a single pass and in a distributed fashion.
Comment: Distributed streaming algorithm for graph Laplacian spectral sparsification with single-pass processing and strong approximation guarantees.
Topic Match: Primary fit is efficiency/scaling because the paper introduces a nontrivial distributed sparsification algorithm that changes the computational footprint of large-scale graph learning.
Relevance: 8 Novelty: 8
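Effective resistances are the quantity the abstract says the algorithm maintains. In the classical (non-streaming) setting they can be read off the Laplacian pseudoinverse, and spectral sparsifiers sample edges with probability proportional to weight times effective resistance. The dense computation below is a sketch of that definition only, not of GSQUEAK, and is practical only for small graphs.

```python
import numpy as np

def laplacian(n, edges):
    """Weighted graph Laplacian from a list of (u, v, weight) edges."""
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w
        L[v, v] += w
        L[u, v] -= w
        L[v, u] -= w
    return L

def effective_resistances(n, edges):
    """R_e = (e_u - e_v)^T L^+ (e_u - e_v) for every edge e = (u, v)."""
    L_pinv = np.linalg.pinv(laplacian(n, edges))
    return np.array(
        [L_pinv[u, u] + L_pinv[v, v] - 2.0 * L_pinv[u, v] for u, v, _ in edges]
    )
```

For a connected graph the products of weight and effective resistance sum to `n - 1`, so roughly that many "units" of importance are spread over the edges. This is why sampling by these scores can yield a sparse graph.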
Representation Learning Theory and Structure (3)
1. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
ArXiv ID: 2604.19974
Primary Topic: Representation Learning Theory and Structure
Authors: Het Patel, Tiejin Chen, Hua Wei, Evangelos E. Papalexakis, Jia Chen
Abstract: Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
Comment: Uses sparse autoencoders to functionally dissociate internal features for uncertainty, incorrectness, and their confounds in LLMs.
Topic Match: It directly studies the structure and function of learned internal features, asking how distinct representation subpopulations support correctness and uncertainty.
Relevance: 9 Novelty: 8
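The selective-abstention numbers in the abstract (accuracy at a given coverage) follow a standard bookkeeping pattern: answer only when a correctness score clears a threshold, then measure accuracy on the answered subset. The sketch below shows that pattern; the score is a stand-in for whatever predictor is built from the feature activations, and the function name is hypothetical.

```python
import numpy as np

def selective_accuracy(scores, correct, threshold):
    """Return (coverage, accuracy) when answering only if score >= threshold.

    scores  : predicted probability that each answer is correct
    correct : 1 if the model's answer was right, else 0
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    answered = scores >= threshold
    coverage = float(answered.mean())
    accuracy = float(correct[answered].mean()) if answered.any() else float("nan")
    return coverage, accuracy
```

Sweeping `threshold` traces out the coverage-accuracy trade-off; a single point on that curve corresponds to a statement like the abstract's "81% at 53% coverage".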
2. Convergent Evolution: How Different Language Models Learn Similar Number Representations
ArXiv ID: 2604.20817
Primary Topic: Representation Learning Theory and Structure
Authors: Deqing Fu, Tianyi Zhou, Mikhail Belkin, Vatsal Sharan, Robin Jia
Abstract: Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
Comment: Shows diverse model classes converge to similar periodic number features and characterizes when these become geometrically separable representations.
Topic Match: This is a strong representation-formation paper about how numerical structure emerges across architectures and training signals.
Relevance: 9 Novelty: 8
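A "period-$T$ spike in the Fourier domain" can be checked on any single feature by treating its value on the integers $0, 1, \dots, n-1$ as a signal. The sketch below uses a synthetic feature rather than real model embeddings, and `dominant_period` is a hypothetical helper.

```python
import numpy as np

def dominant_period(feature):
    """Period of the strongest non-constant Fourier component of a feature
    evaluated on the integers 0..len(feature)-1."""
    feature = np.asarray(feature, dtype=float)
    spectrum = np.abs(np.fft.rfft(feature - feature.mean()))
    k = int(np.argmax(spectrum[1:]) + 1)     # skip the zero-frequency bin
    return len(feature) / k

def synthetic_number_feature(n=1000, seed=0):
    """Synthetic stand-in: a strong period-10 and a weaker period-5 component
    plus small noise. Real embeddings would replace this."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    return (np.cos(2 * np.pi * idx / 10)
            + 0.3 * np.cos(2 * np.pi * idx / 5)
            + 0.1 * rng.normal(size=n))
```

As the abstract notes, such a spike is necessary but not sufficient for linearly reading off a number mod $T$, so this check alone does not establish geometric separability.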
3. Rethinking Intrinsic Dimension Estimation in Neural Representations
ArXiv ID: 2604.20276
Primary Topic: Representation Learning Theory and Structure
Authors: Rickmer Schulte, David Rügamer
Abstract: The analysis of neural representation has become an integral part of research aiming to better understand the inner workings of neural networks. While there are many different approaches to investigate neural representations, an important line of research has focused on doing so through the lens of intrinsic dimensions (IDs). Although this perspective has provided valuable insights and stimulated substantial follow-up research, important limitations of this approach have remained largely unaddressed. In this paper, we highlight a crucial discrepancy between theory and practice of IDs in neural representations, theoretically and empirically showing that common ID estimators are, in fact, not tracking the true underlying ID of the representation. We contrast this negative result with an investigation of the underlying factors that may drive commonly reported ID-related results on neural representation in the literature. Building on these insights, we offer a new perspective on ID estimation in neural representations.
Comment: Shows common intrinsic-dimension estimators do not track true representation ID and reframes what prior ID findings likely measure.
Topic Match: It directly interrogates a widely used lens for understanding learned representations and offers a corrective conceptual framework, making representation structure the clearest fit.
Relevance: 9 Novelty: 8
Memory Structures and Agent Memory Systems (1)
1. Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
ArXiv ID: 2604.20915
Primary Topic: Memory Structures and Agent Memory Systems
Also Matches: Architecture and Training Dynamics
Authors: Zhixin Zhang, Shabo Zhang, Chengcan Wu, Zeming Wei, Meng Sun
Abstract: Transformers suffer from a high computational cost that grows with sequence length for self-attention, making inference in long streams prohibited by memory consumption. Constant-memory alternatives such as RNNs and SSMs compress history into states with fixed size and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting token-level projection and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as a self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing internal behaviors of the updated model with the original one, ensuring context absorption and generalization. Experiments on long-context and streaming benchmarks show that Absorber LLM reduces inference memory and improves accuracy over prior parameter-as-memory baselines.
Comment: Turns parameters into a long-context memory via self-supervised causal synchronization for test-time training.
Topic Match: The core contribution is a new memory mechanism: absorbing historical context into parameters while preserving causal effects, which squarely fits learned memory systems more than generic long-context modeling.
Relevance: 9 Novelty: 8
World Models, Exploration, and Open-Ended Reinforcement Learning (2)
1. Distributional Value Estimation Without Target Networks for Robust Quality-Diversity
ArXiv ID: 2604.20381
Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning
Authors: Behrad Koohy, Jamie Bayne
Abstract: Quality-Diversity (QD) algorithms excel at discovering diverse repertoires of skills, but are hindered by poor sample efficiency and often require tens of millions of environment steps to solve complex locomotion tasks. Recent advances in Reinforcement Learning (RL) have shown that high Update-to-Data (UTD) ratios accelerate Actor-Critic learning. While effective, standard high-UTD algorithms typically utilise target networks to stabilise training. This requirement introduces a significant computational bottleneck, rendering them impractical for resource-intensive Quality-Diversity (QD) tasks where sample efficiency and rapid population adaptation are critical. In this paper, we introduce QDHUAC, a sample-efficient, target-free and distributional QD-RL algorithm that provides dense and low-variance gradient signals, which enables high-UTD training for Dominated Novelty Search whilst requiring an order of magnitude fewer environment steps. We demonstrate that our method enables stable training at high UTD ratios, achieving competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer samples than baselines. Our results suggest that combining target-free distributional critics with dominance-based selection is a key enabler for the next generation of sample-efficient evolutionary RL algorithms.
Comment: Enables high-UTD quality-diversity training with a target-free distributional critic, improving sample efficiency for open-ended skill discovery.
Topic Match: The paper's core is a foundational RL algorithm for sample-efficient quality-diversity and open-ended behavior discovery, not LLM post-training.
Relevance: 9 Novelty: 8
2. Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning
ArXiv ID: 2604.20627
Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning
Authors: Aravind Venugopal, Jiayu Chen, Xudong Wu, Chongyi Zheng, Benjamin Eysenbach, Jeff Schneider
Abstract: The temporal lag between actions and their long-term consequences makes credit assignment a challenge when learning goal-directed behaviors from data. Generative world models capture the distribution of future states an agent may visit, indicating that they have captured temporal information. How can that temporal information be extracted to perform credit assignment? In this paper, we formalize how the temporal information stored in world models encodes the underlying geometry of the world. Leveraging optimal transport, we extract this geometry from a learned model of the occupancy measure into a reward function that captures goal-reaching information. Our resulting method, Occupancy Reward Shaping, largely mitigates the problem of credit assignment in sparse reward settings. ORS provably does not alter the optimal policy, yet empirically improves performance by 2.2x across 13 diverse long-horizon locomotion and manipulation tasks. Moreover, we demonstrate the effectiveness of ORS in the real world for controlling nuclear fusion on 3 Tokamak control tasks. Code: https://github.com/aravindvenu7/occupancy_reward_shaping; Website: https://aravindvenu7.github.io/website/ors/
Comment: Extracts world geometry from learned occupancy models to shape rewards for better long-horizon credit assignment without changing the optimal policy.
Topic Match: The paper is a direct fit: it uses learned world/occupancy models to improve goal-conditioned RL and long-horizon generalization.
Relevance: 9 Novelty: 8
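Note on policy invariance: the abstract does not say how ORS guarantees that the optimal policy is unchanged. One classical construction with that property is potential-based shaping, r' = r + gamma * phi(s') - phi(s). The sketch below illustrates that construction only and is not the authors' method. The 1-D chain, the hand-written `potential`, and all function names are assumptions made for illustration, whereas ORS derives its reward from a learned occupancy model via optimal transport.

```python
def shaped_rewards(rewards, states, potential, gamma):
    """Apply r' = r + gamma * phi(s_next) - phi(s) to each transition."""
    return [
        r + gamma * potential(s_next) - potential(s)
        for r, s, s_next in zip(rewards, states[:-1], states[1:])
    ]


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


GOAL = 4      # hypothetical goal state on a 1-D chain
GAMMA = 0.9   # hypothetical discount factor


def potential(state):
    # Hypothetical hand-written potential: negative distance to the goal,
    # so it is zero at the goal itself.
    return -abs(GOAL - state)


def shaping_offset(states):
    """Shaped return minus original return for one trajectory."""
    # Sparse task reward: 1 on the transition that reaches the goal, else 0.
    rewards = [1.0 if s_next == GOAL else 0.0 for s_next in states[1:]]
    shaped = shaped_rewards(rewards, states, potential, GAMMA)
    return discounted_return(shaped, GAMMA) - discounted_return(rewards, GAMMA)


# Two different trajectories from state 0 to the goal.
offset_short = shaping_offset([0, 1, 2, 3, 4])
offset_long = shaping_offset([0, 1, 0, 1, 2, 3, 4])
print(offset_short, offset_long)
```

The shaping terms telescope, so both offsets equal gamma**T * phi(s_T) - phi(s_0), which is about 4 here because the potential is zero at the goal. Every trajectory from the same start to the goal is shifted by the same constant, so their ranking is preserved, while the per-step rewards become dense.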
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Relevant Topics
Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.
Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.
Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.
Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.
Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.
Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.
World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.
Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains
Scoring Criteria
Relevance and Novelty are independent axes. Score both from 1 to 10.
Relevance Scoring
- 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
- 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
- 5-6: touches the target topics, but the main contribution is elsewhere.
- 3-4: largely outside the target topics, often application-focused or domain-specific.
- 1-2: unrelated.
Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.
Novelty Scoring
- 9-10: new paradigm, theory, or major methodological breakthrough.
- 7-8: substantial methodological advance or strong new insight.
- 5-6: meaningful but incremental extension or refinement.
- 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
- 1-2: little originality; mainly standard application of existing methods.
Topic Registry
Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.
Papers
[PAPER LIST HERE]
Instructions
Respond in JSONL. Output exactly one JSON object per paper, one per line:
{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}
Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
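The output rules above lend themselves to a mechanical check. The following is a minimal sketch of how one JSONL line could be validated against them. It is not part of the digest's actual pipeline: `validate_record` and the constant names are hypothetical, and reading "include PRIMARY_TOPIC_ID when there are multiple matches" as "the primary ID must appear in MATCHED_TOPIC_IDS whenever that list has more than one entry" is an assumption.

```python
import json

# Topic IDs and tags copied from the prompt's Topic Registry and Rules.
TOPIC_IDS = {
    "architecture_training", "efficiency_scaling", "representation_structure",
    "memory_systems", "world_models_open_ended_rl",
}
HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT",
    "HOTSPOT_PAPER_TAGS", "HOTSPOT_PAPER_COMMENT",
}


def validate_record(line: str) -> list[str]:
    """Return the rule violations found in one JSONL line (empty if none)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(rec, dict):
        return ["not a JSON object"]
    missing = REQUIRED_KEYS - set(rec)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    errors = []
    for key in ("RELEVANCE", "NOVELTY"):
        score = rec[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
            errors.append(f"{key} must be an integer from 1 to 10")

    primary = rec["PRIMARY_TOPIC_ID"]
    if not (isinstance(primary, str) and primary in TOPIC_IDS):
        errors.append("PRIMARY_TOPIC_ID is not in the topic registry")

    matched = rec["MATCHED_TOPIC_IDS"]
    if not isinstance(matched, list) or not all(
        isinstance(t, str) and t in TOPIC_IDS for t in matched
    ):
        errors.append("MATCHED_TOPIC_IDS must be a list of registry topic IDs")
    elif len(matched) > 1 and primary not in matched:
        # Assumed reading of the "multiple matches" rule.
        errors.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")

    tags = rec["HOTSPOT_PAPER_TAGS"]
    if not isinstance(tags, list) or not all(
        isinstance(t, str) and t in HOTSPOT_TAGS for t in tags
    ):
        errors.append("HOTSPOT_PAPER_TAGS must be a list drawn from the allowed tags")
    elif not tags and rec["HOTSPOT_PAPER_COMMENT"] != "":
        errors.append("HOTSPOT_PAPER_COMMENT must be empty when there are no tags")
    return errors


# A record built from the Occupancy Reward Shaping entry in this digest.
example = json.dumps({
    "ARXIVID": "2604.20627",
    "COMMENT": "Extracts world geometry from learned occupancy models to shape rewards.",
    "RELEVANCE": 9,
    "NOVELTY": 8,
    "PRIMARY_TOPIC_ID": "world_models_open_ended_rl",
    "MATCHED_TOPIC_IDS": ["world_models_open_ended_rl"],
    "TOPIC_MATCH_COMMENT": "Uses learned world/occupancy models for goal-conditioned RL.",
    "HOTSPOT_PAPER_TAGS": [],
    "HOTSPOT_PAPER_COMMENT": "",
})
# The placeholder template from the Instructions, which should not pass as-is.
template = (
    '{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,'
    '"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...",'
    '"HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}'
)

print(validate_record(example))
for problem in validate_record(template):
    print(problem)
```

Under this reading, the example record passes cleanly, while the unfilled template is flagged for its out-of-range scores, its placeholder topic ID, and its non-empty hotspot comment alongside an empty tag list.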