Personalized Daily ArXiv Papers 2026-04-23
| Model | Metric | Usage: Prompt | Usage: Completion | Usage: Total | Papers: Total arXiv | Papers: Scanned | Papers: Relevant |
|---|---|---|---|---|---|---|---|
| gpt-5.4 | Tokens | 193486 | 20298 | 213784 | 576 | 346 | 13 |
| | Cost | $0.48 | $0.30 | $0.79 | | | |
Table of contents by topic:
Architecture and Training Dynamics (3)
- An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling Authors: Anif N. Shikder, Ramit Dey, Sayantan Auddy, Luisa Liboni, Alexandra N. Busch, Arthur Powanwe, Ján Mináč, Roberto C. Budzinski, Lyle E. Muller
- On Bayesian Softmax-Gated Mixture-of-Experts Models Authors: Nicola Bariletto, Huy Nguyen, Nhat Ho, Alessandro Rinaldo
- On the Stability and Generalization of First-order Bilevel Minimax Optimization Authors: Xuelin Zhang, Peipei Yuan
Efficiency, Compression, and Large-Scale Training (4)
- From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Bohan Yu, Jinyu Ye, Jun Zhao, Kang Liu
- Continuous Semantic Caching for Low-Cost LLM Serving Authors: Baran Atalar, Xutong Liu, Jinhang Zuo, Siwei Wang, Wei Chen, Carlee Joe-Wong
- Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms Authors: Yaiza Bermudez, Samir Perlaza, Iñaki Esnaola
- Onyx: Cost-Efficient Disk-Oblivious ANN Search Authors: Deevashwer Rathee, Jean-Luc Watson, Zirui Neil Zhao, G. Edward Suh, Raluca Ada Popa
Representation Learning Theory and Structure (4)
- Rethinking Intrinsic Dimension Estimation in Neural Representations Authors: Rickmer Schulte, David Rügamer
- Convergent Evolution: How Different Language Models Learn Similar Number Representations Authors: Deqing Fu, Tianyi Zhou, Mikhail Belkin, Vatsal Sharan, Robin Jia
- Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders Authors: Het Patel, Tiejin Chen, Hua Wei, Evangelos E. Papalexakis, Jia Chen
- Sheaf Neural Networks on SPD Manifolds: Second-Order Geometric Representation Learning Authors: Yuhan Peng, Junwen Dong, Yuzhi Zeng, Hao Li, Ce Ju, Huitao Feng, Diaaeldin Taha, Anna Wienhard, Kelin Xia
Memory Structures and Agent Memory Systems (1)
- Knowledge Capsules: Structured Nonparametric Memory Units for LLMs Authors: Bin Ju, Shenfeng Weng, Danying Zhou, Kunkai Su, Rongkai Xu
World Models, Exploration, and Open-Ended Reinforcement Learning (1)
- Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models Authors: Shelly Francis-Meretzki, Mirco Mutti, Yaniv Romano, Aviv Tamar
Architecture and Training Dynamics (3)
1. An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
ArXiv ID: 2604.20595
Primary Topic: Architecture and Training Dynamics
Authors: Anif N. Shikder, Ramit Dey, Sayantan Auddy, Luisa Liboni, Alexandra N. Busch, Arthur Powanwe, Ján Mináč, Roberto C. Budzinski, Lyle E. Muller
Abstract: We establish a mathematical correspondence between state space models, a state-of-the-art architecture for capturing long-range dependencies in data, and an exactly solvable nonlinear oscillator network. As a specific example of this general correspondence, we analyze the diagonal linear time-invariant implementation of the Structured State Space Sequence model (S4). The correspondence embeds S4D, a specific implementation of S4, into a ring network topology, in which recent inputs are encoded as waves of activity traveling over the one-dimensional spatial layout of the network. We then derive an exact operator expression for the full forward pass of S4D, yielding an analytical characterization of its complete input-output map. This expression reveals that the nonlinear decoder in the system induces interactions between these information-carrying waves that enable classifying real-world sequences. These results generalize across modern SSM architectures, and show that they admit an exact mathematical description with a clear physical interpretation. These insights enable a new level of interpretability for these systems in terms of nonlinear oscillator networks.
Comment: Derives an explicit operator for the full forward pass of S4D, giving an analytical input-output description of modern state-space sequence models.
Topic Match: This is centrally about architectural mechanism and interpretability of state-space sequence models, not an application of them.
Relevance: 9 Novelty: 8
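To make the "diagonal linear time-invariant" setting in this abstract concrete, here is a minimal sketch of a diagonal SSM layer run two ways: as a recurrence and as a causal convolution. It does not reproduce the paper's operator expression or its ring-network embedding. The zero-order-hold discretization, the eigenvalues `-0.5 + i*pi*n`, the unit `b` and `c` vectors, and the real-part readout are illustrative assumptions, not values taken from the paper.

```python
import cmath
import math

def discretize(a, b, dt):
    """Zero-order-hold discretization of one diagonal (scalar) mode."""
    a_bar = cmath.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def run_recurrent(a_bar, b_bar, c, u):
    """x_k = a_bar * x_{k-1} + b_bar * u_k, y_k = Re(sum_n c_n * x_k[n])."""
    x = [0j] * len(a_bar)
    ys = []
    for u_k in u:
        x = [ab * xn + bb * u_k for ab, bb, xn in zip(a_bar, b_bar, x)]
        ys.append(sum(cn * xn for cn, xn in zip(c, x)).real)
    return ys

def run_convolution(a_bar, b_bar, c, u):
    """Same map written as a causal convolution with kernel K_l = Re(sum_n c_n a_bar_n^l b_bar_n)."""
    length = len(u)
    kernel = [
        sum(cn * ab ** l * bb for cn, ab, bb in zip(c, a_bar, b_bar)).real
        for l in range(length)
    ]
    return [sum(kernel[k - j] * u[j] for j in range(k + 1)) for k in range(length)]

# Hypothetical toy setup: 4 complex modes, each a damped oscillator.
modes = [complex(-0.5, math.pi * n) for n in range(4)]
pairs = [discretize(a, 1.0, 0.1) for a in modes]
a_bar = [p[0] for p in pairs]
b_bar = [p[1] for p in pairs]
c = [1.0] * len(modes)
u = [math.sin(0.3 * k) for k in range(32)]

y_rec = run_recurrent(a_bar, b_bar, c, u)
y_conv = run_convolution(a_bar, b_bar, c, u)
```

Both functions compute the same input-output map for a real-valued input, which is the sense in which a linear time-invariant SSM layer admits a closed-form description before any nonlinear decoder is applied.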
2. On Bayesian Softmax-Gated Mixture-of-Experts Models
ArXiv ID: 2604.20551
Primary Topic: Architecture and Training Dynamics
Authors: Nicola Bariletto, Huy Nguyen, Nhat Ho, Alessandro Rinaldo
Abstract: Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet their theoretical properties in the Bayesian framework remain largely unexplored. In this paper, we study Bayesian mixture-of-experts models, focusing on the ubiquitous softmax-based gating mechanism. Specifically, we investigate the asymptotic behavior of the posterior distribution for three fundamental statistical tasks: density estimation, parameter estimation, and model selection. First, we establish posterior contraction rates for density estimation, both in the regimes with a fixed, known number of experts and with a random learnable number of experts. We then analyze parameter estimation and derive convergence guarantees based on tailored Voronoi-type losses, which account for the complex identifiability structure of mixture-of-experts models. Finally, we propose and analyze two complementary strategies for selecting the number of experts. Taken together, these results provide one of the first systematic theoretical analyses of Bayesian mixture-of-experts models with softmax gating, and yield several theory-grounded insights for practical model design.
Comment: Provides one of the first systematic Bayesian theories for softmax-gated mixture-of-experts, including contraction, estimation, and model selection.
Topic Match: MoE routing and expert-count selection are core architectural topics, and the paper contributes foundational statistical theory for them.
Relevance: 9 Novelty: 8
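As a reading aid for the softmax gating mechanism named in this abstract, the sketch below evaluates the conditional density of a small mixture-of-experts model with scalar input and output. The Gaussian experts with linear means and all parameter values are assumptions chosen for illustration; the abstract does not specify the expert family.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_pdf(y, mean, std):
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def moe_density(y, x, gates, experts):
    """p(y | x) = sum_k softmax(w_k * x + b_k)_k * N(y | a_k * x + c_k, s_k^2)."""
    weights = softmax([w * x + b for w, b in gates])
    return sum(
        p * gaussian_pdf(y, a * x + c, s)
        for p, (a, c, s) in zip(weights, experts)
    )

# Hypothetical parameters: (w_k, b_k) per gate and (a_k, c_k, s_k) per expert.
gates = [(1.0, 0.0), (-0.5, 0.3), (0.2, -1.0)]
experts = [(2.0, 0.0, 0.5), (-1.0, 1.0, 1.0), (0.0, -2.0, 0.8)]

# Shifting every gate by the same (w, b) offset leaves the softmax unchanged.
shifted_gates = [(w + 0.7, b - 1.3) for w, b in gates]

density = moe_density(0.4, 1.5, gates, experts)
density_shifted = moe_density(0.4, 1.5, shifted_gates, experts)
```

The two densities agree because softmax is invariant to adding a common term to all scores. This is one simple reason gating parameters are not uniquely identified, which is the kind of identifiability structure the abstract says its tailored losses account for.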
3. On the Stability and Generalization of First-order Bilevel Minimax Optimization
ArXiv ID: 2604.20115
Primary Topic: Architecture and Training Dynamics
Authors: Xuelin Zhang, Peipei Yuan
Abstract: Bilevel optimization and bilevel minimax optimization have recently emerged as unifying frameworks for a range of machine-learning tasks, including hyperparameter optimization and reinforcement learning. The existing literature focuses on empirical efficiency and convergence guarantees, leaving a critical theoretical gap in understanding how well these algorithms generalize. To bridge this gap, we provide the first systematic generalization analysis for first-order gradient-based bilevel minimax solvers with lower-level minimax problems. Specifically, by leveraging algorithmic stability arguments, we derive fine-grained generalization bounds for three representative algorithms, including single-timescale stochastic gradient descent-ascent, and two variants of two-timescale stochastic gradient descent-ascent. Our results reveal a precise trade-off among algorithmic stability, generalization gaps, and practical settings. Furthermore, extensive empirical evaluations corroborate our theoretical insights on realistic optimization tasks with bilevel minimax structures.
Comment: Provides the first algorithmic-stability generalization bounds for first-order bilevel minimax solvers.
Topic Match: This is directly about foundational training dynamics and generalization behavior of an important optimization class.
Relevance: 8 Novelty: 8
Efficiency, Compression, and Large-Scale Training (4)
1. From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
ArXiv ID: 2604.19884
Primary Topic: Efficiency, Compression, and Large-Scale Training
Also Matches: Architecture and Training Dynamics
Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Bohan Yu, Jinyu Ye, Jun Zhao, Kang Liu
Abstract: Post-Training Quantization (PTQ) is critical for the efficient deployment of Large Language Models (LLMs). While 4-bit quantization is widely regarded as an optimal trade-off, reducing the precision to 2-bit usually triggers a catastrophic "performance cliff." It remains unclear whether the underlying mechanisms differ fundamentally. Consequently, we conduct a systematic mechanistic analysis, revealing two qualitatively distinct failure modes: Signal Degradation, where the computational patterns remain intact but information precision is impaired by cumulative error; and Computation Collapse, where key components fail to function, preventing correct information processing and destroying the signal in the early layers. Guided by this diagnosis, we conduct mechanism-aware interventions, demonstrating that targeted, training-free repair can mitigate Signal Degradation, but remains ineffective for Computation Collapse. Our findings provide a systematic diagnostic framework for PTQ failures and suggest that addressing Computation Collapse requires structural reconstruction rather than mere compensation.
Comment: Identifies two distinct failure modes in LLM post-training quantization—signal degradation versus computation collapse—and tests mechanism-aware repairs.
Topic Match: Quantization mechanics and why low-bit PTQ breaks are squarely in efficiency/compression, with strong mechanistic insight beyond routine tuning.
Relevance: 9 Novelty: 8
2. Continuous Semantic Caching for Low-Cost LLM Serving
ArXiv ID: 2604.20021
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Baran Atalar, Xutong Liu, Jinhang Zuo, Siwei Wang, Wei Chen, Carlee Joe-Wong
Abstract: As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic $\epsilon$-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.
Comment: Develops the first continuous-space theory for semantic caching with dynamic epsilon-net discretization and regret guarantees.
Topic Match: This is fundamentally about cache design and efficient LLM serving with a new theoretical and algorithmic treatment of semantic reuse.
Relevance: 8 Novelty: 8
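The abstract's idea of discretizing a continuous query space with an epsilon-net can be sketched with a simple greedy construction: a query embedding becomes a new center only when no existing center lies within `eps` of it. This is a generic textbook construction under assumed 2-D toy embeddings. The paper's dynamic variant and its Kernel Ridge Regression component are not shown and may differ.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_epsilon_net(embeddings, eps):
    """Greedy net: returns the centers and, for each point, the index of its center."""
    centers = []
    assignment = []
    for point in embeddings:
        nearest = min(
            range(len(centers)),
            key=lambda i: euclidean(point, centers[i]),
            default=None,
        )
        if nearest is None or euclidean(point, centers[nearest]) > eps:
            # No existing center covers this query, so it opens a new cell.
            centers.append(point)
            nearest = len(centers) - 1
        assignment.append(nearest)
    return centers, assignment

# Hypothetical query embeddings (real embeddings would be high-dimensional).
queries = [(0.0, 0.0), (0.1, 0.05), (1.0, 1.0), (0.95, 1.1), (0.05, -0.1), (2.0, 0.0)]
centers, assignment = build_epsilon_net(queries, eps=0.3)
print(len(centers), assignment)  # 3 [0, 0, 1, 1, 0, 2]
```

Queries that share a center can then share cost estimates or a cached response, which turns the continuous problem into a finite one over the centers.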
3. Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms
ArXiv ID: 2604.20492
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Yaiza Bermudez, Samir Perlaza, Iñaki Esnaola
Abstract: In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client $k$ is used, as reference measure, by client $k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.
Comment: Shows decentralized ERM with Gibbs-measure passing can match centralized performance by sharing inductive bias instead of data.
Topic Match: This is a foundational distributed-learning result about a new communication/training scheme with centralized guarantees.
Relevance: 8 Novelty: 8
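The chaining idea in this abstract (client $k$'s Gibbs measure becomes client $k+1$'s reference measure) can be checked numerically on a toy problem. The sketch below assumes a finite set of candidate models, a squared loss, and a Gibbs measure proportional to `reference * exp(-risk / lam)`. The scaling `lam_k = lam * n / n_k` is the one that makes this toy example match the centralized solution; the abstract only says that "a specific scaling" is needed, so treat that formula as an assumption. Only the forward pass of the communication is shown.

```python
import math

def gibbs_update(reference, risks, lam):
    """Gibbs measure over a finite model set: proportional to reference * exp(-risk / lam)."""
    weights = [q * math.exp(-r / lam) for q, r in zip(reference, risks)]
    total = sum(weights)
    return [w / total for w in weights]

def empirical_risk(theta, data):
    """Mean squared error of the constant predictor theta on a dataset."""
    return sum((theta - x) ** 2 for x in data) / len(data)

# Hypothetical setup: five candidate models, three clients with private datasets.
models = [-1.0, 0.0, 0.5, 1.0, 2.0]
prior = [1.0 / len(models)] * len(models)
datasets = [[0.1, 0.4, 0.9], [1.2, 0.8], [0.3, 0.6, 0.2, 0.7]]
lam = 0.5

# Centralized solution: one Gibbs update on the pooled data.
pooled = [x for data in datasets for x in data]
centralized = gibbs_update(prior, [empirical_risk(t, pooled) for t in models], lam)

# Decentralized chain: each client only receives the previous client's measure.
n_total = len(pooled)
measure = prior
for data in datasets:
    lam_k = lam * n_total / len(data)  # assumed scaling with the local sample size
    local_risks = [empirical_risk(t, data) for t in models]
    measure = gibbs_update(measure, local_risks, lam_k)

max_gap = max(abs(p - q) for p, q in zip(centralized, measure))
```

The gap is zero up to floating-point error because the pooled risk is the sample-size-weighted average of the local risks, so the exponents of the chained updates add up to the centralized exponent.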
4. Onyx: Cost-Efficient Disk-Oblivious ANN Search
ArXiv ID: 2604.20401
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Deevashwer Rathee, Jean-Luc Watson, Zirui Neil Zhao, G. Edward Suh, Raluca Ada Popa
Abstract: Approximate nearest neighbor (ANN) search in AI systems increasingly handles sensitive data on third-party infrastructure. Trusted execution environments (TEEs) offer protection, but cost-efficient deployments must rely on external SSDs, which leak user queries through disk access patterns to the host. Oblivious RAM (ORAM) can hide these access patterns but at a high cost; when paired with existing disk-based ANN search techniques, it makes poor use of SSD resources, yielding high latency and poor cost-efficiency. The core challenge for efficient oblivious ANN search over SSDs is balancing both bandwidth and access count. The state-of-the-art ORAM-ANN design minimizes access count at the ANN level and bandwidth at the ORAM level, each trading off the other, leaving the combined system with both resources overutilized. We propose inverting this design, minimizing bandwidth consumption in the ANN layer and access count in the ORAM layer, since each component is better suited for its new role: ANN's inherent approximation allows for more bandwidth efficiency, while ORAM has no fundamental lower bounds on access count (as opposed to bandwidth). To this end, we propose a cost-efficient approach, Onyx, with two new co-designed components: Onyx-ANNS introduces a compact intermediate representation that proactively prunes the majority of bandwidth-intensive accesses without hurting recall, and Onyx-ORAM proposes a locality-aware shallow tree design that reduces access count while remaining compatible with bandwidth-efficient ORAM techniques. Compared to the state-of-the-art oblivious ANN search system, Onyx achieves $1.7$-$9.9\times$ lower cost and $2.3$-$12.3\times$ lower latency.
Comment: Co-design of ANN pruning and locality-aware ORAM changes the bandwidth/access-count tradeoff for secure large-scale retrieval systems.
Topic Match: The core contribution is a new systems/algorithmic design that materially improves efficiency under a hard storage-access privacy constraint.
Relevance: 8 Novelty: 8
Representation Learning Theory and Structure (4)
1. Rethinking Intrinsic Dimension Estimation in Neural Representations
ArXiv ID: 2604.20276
Primary Topic: Representation Learning Theory and Structure
Authors: Rickmer Schulte, David Rügamer
Abstract: The analysis of neural representation has become an integral part of research aiming to better understand the inner workings of neural networks. While there are many different approaches to investigate neural representations, an important line of research has focused on doing so through the lens of intrinsic dimensions (IDs). Although this perspective has provided valuable insights and stimulated substantial follow-up research, important limitations of this approach have remained largely unaddressed. In this paper, we highlight a crucial discrepancy between theory and practice of IDs in neural representations, theoretically and empirically showing that common ID estimators are, in fact, not tracking the true underlying ID of the representation. We contrast this negative result with an investigation of the underlying factors that may drive commonly reported ID-related results on neural representation in the literature. Building on these insights, we offer a new perspective on ID estimation in neural representations.
Comment: Challenges intrinsic-dimension estimators for neural representations, showing common estimators fail to track true representation dimension and reframing what ID results mean.
Topic Match: The core contribution is a methodological and theoretical critique of a widely used representation-analysis tool, directly about how neural representations are characterized.
Relevance: 9 Novelty: 8
2. Convergent Evolution: How Different Language Models Learn Similar Number Representations
ArXiv ID: 2604.20817
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Deqing Fu, Tianyi Zhou, Mikhail Belkin, Vatsal Sharan, Robin Jia
Abstract: Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
Comment: Studies how number representations emerge across architectures, distinguishing Fourier periodicity from geometrically separable modular features and analyzing training conditions that produce them.
Topic Match: The paper is centrally about the structure and formation of learned representations, not about task performance.
Relevance: 9 Novelty: 8
3. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
ArXiv ID: 2604.19974
Primary Topic: Representation Learning Theory and Structure
Authors: Het Patel, Tiejin Chen, Hua Wei, Evangelos E. Papalexakis, Jia Chen
Abstract: Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
Comment: Uses sparse autoencoders to functionally dissociate internal features tied to uncertainty, incorrectness, and their confounding in LLMs.
Topic Match: This is primarily about mechanistic structure in learned representations and what internal features encode different behavioral properties.
Relevance: 8 Novelty: 8
4. Sheaf Neural Networks on SPD Manifolds: Second-Order Geometric Representation Learning
ArXiv ID: 2604.20308
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Yuhan Peng, Junwen Dong, Yuzhi Zeng, Hao Li, Ce Ju, Huitao Feng, Diaaeldin Taha, Anna Wienhard, Kelin Xia
Abstract: Graph neural networks face two fundamental challenges rooted in the linear structure of Euclidean vector spaces: (1) Current architectures represent geometry through vectors (directions, gradients), yet many tasks require matrix-valued representations that capture relationships between directions, such as how atomic orientations covary in a molecule. These second-order representations are naturally captured by points on the manifold of symmetric positive definite (SPD) matrices; (2) Standard message passing applies shared transformations across edges. Sheaf neural networks address this via edge-specific transformations, but existing formulations remain confined to vector spaces and therefore cannot propagate matrix-valued features. We address both challenges by developing the first sheaf neural network that operates natively on the SPD manifold. Our key insight is that the SPD manifold admits a Lie group structure, enabling well-posed analogs of sheaf operators without projecting to Euclidean space. Theoretically, we prove that SPD-valued sheaves are strictly more expressive than Euclidean sheaves: they admit consistent configurations (global sections) that vector-valued sheaves cannot represent, directly translating to richer learned representations. Empirically, our sheaf convolution transforms effectively rank-1 directional inputs into full-rank matrices encoding local geometric structure. Our dual-stream architecture achieves SOTA on 6/7 MoleculeNet benchmarks, with the sheaf framework providing consistent depth robustness.
Comment: Extends sheaf neural networks to SPD manifolds and argues matrix-valued sheaf representations are strictly more expressive than Euclidean ones.
Topic Match: Primary fit is representation structure because the main claim is about richer representational geometry and expressivity beyond standard vector-space message passing.
Relevance: 8 Novelty: 8
Memory Structures and Agent Memory Systems (1)
1. Knowledge Capsules: Structured Nonparametric Memory Units for LLMs
ArXiv ID: 2604.20487
Primary Topic: Memory Structures and Agent Memory Systems
Also Matches: Architecture and Training Dynamics
Authors: Bin Ju, Shenfeng Weng, Danying Zhou, Kunkai Su, Rongkai Xu
Abstract: Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long-context and multi-hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key Value Injection (KVI) framework that compiles capsules into attention-compatible key-value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory-level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long-context and multi-hop reasoning, while requiring no parameter updates.
Comment: Proposes structured nonparametric memory units that inject external key-values directly into attention instead of using text-based retrieval context.
Topic Match: The central idea is a new external memory mechanism for storing and integrating knowledge through attention-compatible memory units.
Relevance: 9 Novelty: 8
World Models, Exploration, and Open-Ended Reinforcement Learning (1)
1. Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
ArXiv ID: 2604.20472
Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning
Authors: Shelly Francis-Meretzki, Mirco Mutti, Yaniv Romano, Aviv Tamar
Abstract: Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy's value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.
Comment: Connects sequential calibration to temporal-difference value estimation, making policy value learning a principled uncertainty-calibration mechanism in VLA sequential tasks.
Topic Match: The core contribution is an RL-grounded formulation of calibration in sequential decision processes, centered on TD value estimation rather than downstream robotics benchmarking alone.
Relevance: 8 Novelty: 8
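The link this abstract draws between calibration and value estimation can be illustrated on a toy episodic task. In the sketch below, a symmetric random walk ends in success or failure, the per-step confidence is learned with tabular TD(0), and a step-averaged squared error against the final outcome serves as one plausible form of a sequential Brier score. The environment, the step size, and the exact score definition are assumptions for illustration, not the paper's setup.

```python
import random

N_STATES = 5            # states 0..4; state 0 is failure, state 4 is success
FAILURE, SUCCESS = 0, 4

def sample_episode(rng):
    """Random walk from a random interior state; returns visited states and outcome Y."""
    state = rng.choice([1, 2, 3])
    states = [state]
    while state not in (FAILURE, SUCCESS):
        state += rng.choice([-1, 1])
        states.append(state)
    return states[:-1], int(state == SUCCESS)  # drop the terminal state

def td0_confidence(episodes, step_size=0.01):
    """Tabular TD(0): bootstrap from the next state's confidence, or from Y at the end."""
    v = [0.5] * N_STATES
    for states, outcome in episodes:
        for i, s in enumerate(states):
            is_last = i + 1 == len(states)
            target = float(outcome) if is_last else v[states[i + 1]]
            v[s] += step_size * (target - v[s])
    return v

def sequential_brier(confidence, episodes):
    """Mean over all steps of (confidence at that step - final outcome)^2."""
    errors = [
        (confidence[s] - outcome) ** 2
        for states, outcome in episodes
        for s in states
    ]
    return sum(errors) / len(errors)

rng = random.Random(0)
train = [sample_episode(rng) for _ in range(20000)]
held_out = [sample_episode(rng) for _ in range(5000)]

v_td = td0_confidence(train)
brier_td = sequential_brier(v_td, held_out)
brier_constant = sequential_brier([0.5] * N_STATES, held_out)
```

For this walk the true success probability from state `s` is `s / 4`, so the learned confidences should land near 0.25, 0.5, and 0.75, and they should score better than a constant 0.5 confidence on held-out episodes.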
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Relevant Topics
Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.
Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.
Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.
Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.
Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.
Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.
World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.
Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains
Scoring Criteria
Relevance and Novelty are independent axes. Score both from 1 to 10.
Relevance Scoring
- 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
- 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
- 5-6: touches the target topics, but the main contribution is elsewhere.
- 3-4: largely outside the target topics, often application-focused or domain-specific.
- 1-2: unrelated.
Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.
Novelty Scoring
- 9-10: new paradigm, theory, or major methodological breakthrough.
- 7-8: substantial methodological advance or strong new insight.
- 5-6: meaningful but incremental extension or refinement.
- 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
- 1-2: little originality; mainly standard application of existing methods.
Topic Registry
Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.
Papers
[PAPER LIST HERE]
Instructions
Respond in JSONL. Output exactly one JSON object per paper, one per line:
{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}
Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
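The output rules above are mechanical enough to check in code. The following is a minimal sketch of a validator for one JSONL line, written against the keys, topic IDs, and tag set listed in this prompt. The function name `validate_record` and the error messages are hypothetical; this is not the digest's actual tooling.

```python
import json

TOPIC_IDS = (
    "architecture_training", "efficiency_scaling", "representation_structure",
    "memory_systems", "world_models_open_ended_rl",
)
HOTSPOT_TAGS = ("daily_hot", "new_frontier")
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT", "HOTSPOT_PAPER_TAGS",
    "HOTSPOT_PAPER_COMMENT",
}

def validate_record(line):
    """Return a list of rule violations for one JSONL line (empty list means valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    for key in ("RELEVANCE", "NOVELTY"):
        value = record[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
            problems.append(f"{key} must be an integer from 1 to 10")
    primary = record["PRIMARY_TOPIC_ID"]
    if primary not in TOPIC_IDS:
        problems.append("PRIMARY_TOPIC_ID is not in the topic registry")
    matched = record["MATCHED_TOPIC_IDS"]
    if not isinstance(matched, list) or any(t not in TOPIC_IDS for t in matched):
        problems.append("MATCHED_TOPIC_IDS must be a list of registry topic IDs")
    elif len(matched) > 1 and primary not in matched:
        problems.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")
    tags = record["HOTSPOT_PAPER_TAGS"]
    if not isinstance(tags, list) or any(t not in HOTSPOT_TAGS for t in tags):
        problems.append("HOTSPOT_PAPER_TAGS must use only daily_hot or new_frontier")
    elif not tags and record["HOTSPOT_PAPER_COMMENT"] != "":
        problems.append("HOTSPOT_PAPER_COMMENT must be empty when there are no tags")
    return problems

# Example record assembled from the first paper in this digest.
good_line = json.dumps({
    "ARXIVID": "2604.20595",
    "COMMENT": "Derives an explicit operator for the full forward pass of S4D.",
    "RELEVANCE": 9, "NOVELTY": 8,
    "PRIMARY_TOPIC_ID": "architecture_training", "MATCHED_TOPIC_IDS": [],
    "TOPIC_MATCH_COMMENT": "Centrally about state-space model mechanisms.",
    "HOTSPOT_PAPER_TAGS": [], "HOTSPOT_PAPER_COMMENT": "",
})
bad_line = good_line.replace('"RELEVANCE": 9', '"RELEVANCE": 11').replace(
    "architecture_training", "robotics")

print(validate_record(good_line))       # []
print(len(validate_record(bad_line)))   # 2
```

Running a check like this on each model response before rendering the digest would catch out-of-range scores and unknown topic IDs early, instead of letting them surface as malformed entries.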