Personalized Daily ArXiv Papers 2026-03-23

Model	Metric	Usage: Prompt	Usage: Completion	Usage: Total	Papers: Total arXiv	Papers: Scanned	Papers: Relevant
`gpt-5.4`	Tokens	133668	4971	138639	479	321	33
`gpt-5.4`	Cost	$0.33	$0.07	$0.41	479	321	33

Table of contents with paper titles:

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels Authors: Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero
The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference Authors: Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang
Any-Subgroup Equivariant Networks via Symmetry Breaking Authors: Abhinav Goel, Derek Lim, Hannah Lawrence, Stefanie Jegelka, Ningyuan Huang
Transformers are Stateless Differentiable Neural Computers Authors: Bo Tang, Weiwei Xie
The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $\lambda$-Calculus Authors: Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar
Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition Authors: Calvin Ang, Sungyoon Kim, Mert Pilanci
Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture -- Bridging Predictive and Generative Self-Supervised Learning Authors: Moritz Gögl, Christopher Yau
TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly Authors: Toshiaki Koike-Akino, Jing Liu, Ye Wang
Speculating Experts Accelerates Inference for Mixture-of-Experts Authors: Vivan Madan, Prajwal Singhania, Abhinav Bhatele, Tom Goldstein, Ashwinee Panda
On the Ability of Transformers to Verify Plans Authors: Yash Sarrof, Yupei Du, Katharina Stein, Alexander Koller, Sylvie Thiébaux, Michael Hahn
Neural Dynamics Self-Attention for Spiking Transformers Authors: Dehao Zhang, Fukai Guo, Shuai Wang, Jingya Wang, Jieyuan Zhang, Yimeng Shan, Malu Zhang, Yang Yang, Haizhou Li
Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination Authors: Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang, Jun Zhu, Deyu Meng
Hyperagents Authors: Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tatiana Shavrina
Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD Authors: Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, Tim Salimans
Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis Authors: Siddharth Chandak, Anuj Yadav, Ayfer Ozgur, Nicholas Bambos
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment Authors: Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov
Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training Authors: Giacomo Borghi, Hyesung Im, Lorenzo Pareschi
Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States Authors: Yurun Yuan, Tengyang Xie
Minimax Generalized Cross-Entropy Authors: Kartheek Bondugula, Santiago Mazuelas, Aritz Pérez, Anqi Liu
Pitfalls in Evaluating Interpretability Agents Authors: Tal Haklay, Nikhil Prakash, Sana Pandey, Antonio Torralba, Aaron Mueller, Jacob Andreas, Tamar Rott Shaham, Yonatan Belinkov
Spectral Alignment in Forward-Backward Representations via Temporal Abstraction Authors: Seyed Mahdi B. Azad, Jasper Hoffmann, Iman Nematollahi, Hao Zhu, Abhinav Valada, Joschka Boedecker
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL Authors: Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang
RiboSphere: Learning Unified and Efficient Representations of RNA Structures Authors: Zhou Zhang, Hanqun Cao, Cheng Tan, Fang Wu, Pheng Ann Heng, Tianfan Fu
Growing Networks with Autonomous Pruning Authors: Charles De Lambilly, Stefan Duffner
CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing Authors: Manit Baser, Alperen Yildiz, Dinil Mon Divakaran, Mohan Gurusamy
Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging Authors: Azam Nouri
Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions Authors: Xiaoyi Li
DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training Authors: Maoyang Xiang, Bo Wang
Warm-Start Flow Matching for Guaranteed Fast Text/Image Generation Authors: Minyoung Kim
Spectral Tempering for Embedding Compression in Dense Passage Retrieval Authors: Yongkang Li, Panagiotis Eustratiadis, Evangelos Kanoulas
Scalable Prompt Routing via Fine-Grained Latent Task Discovery Authors: Yunyi Zhang, Soji Adeshina, Patrick Guan, Ashwin Ganesh, Zhen Han, Vassilis N. Ioannidis, Huzefa Rangwala, George Karypis
Towards Solving Polynomial-Objective Integer Programming with Hypergraph Neural Networks Authors: Minshuo Li, Yaoxin Wu, Pavel Troubil, Yingqian Zhang, Wim P. M. Nuijten
Inducing Sustained Creativity and Diversity in Large Language Models Authors: Queenie Luo, Gary King, Michael Puett, Michael D. Smith

1. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

ArXiv ID: 2603.19312

Authors: Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero

Abstract: Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

Comment: Author match
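
The two-term objective described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the Gaussian regularizer is computed, so `gaussian_regularizer` below is a stand-in that only matches per-dimension mean and variance, and all function names and `reg_weight` are hypothetical.

```python
import random

def prediction_loss(pred, target):
    """Mean squared error between predicted and actual next embeddings."""
    count = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)) / count

def gaussian_regularizer(z):
    """Stand-in regularizer (assumption): push each latent dimension
    toward zero mean and unit variance across the batch."""
    batch, dim = len(z), len(z[0])
    penalty = 0.0
    for j in range(dim):
        column = [row[j] for row in z]
        mean = sum(column) / batch
        var = sum((v - mean) ** 2 for v in column) / batch
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / dim

def two_term_loss(pred, target, z, reg_weight=1.0):
    """Prediction term plus one weighted regularizer (a single loss hyperparameter)."""
    return prediction_loss(pred, target) + reg_weight * gaussian_regularizer(z)

# A collapsed encoder (every embedding identical) predicts perfectly,
# yet the regularizer keeps the total loss away from zero.
collapsed = [[0.0] * 4 for _ in range(8)]
collapsed_loss = two_term_loss(collapsed, collapsed, collapsed)

# Embeddings that really are standard Gaussian incur almost no penalty.
rng = random.Random(0)
gaussian_z = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]
gaussian_penalty = gaussian_regularizer(gaussian_z)
print(collapsed_loss, gaussian_penalty)
```

The point of the sketch is the failure mode it rules out: without the second term, a constant embedding minimizes the prediction loss trivially.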

2. The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

ArXiv ID: 2603.19664

Authors: Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang

Abstract: The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5-28%. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV-Direct.

Comment: Transformer systems insight showing KV cache is exactly reconstructible from residual streams, yielding a new bounded-memory inference scheme.

Relevance: 10 Novelty: 9
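
The core claim, that keys and values are deterministic functions of the residual stream, can be illustrated for a single layer. The sketch below is an assumption-laden toy, not KV-Direct itself: it omits learned norm gains and positional encodings, and the helper names are hypothetical. The configuration numbers passed to the size helpers (34 layers, 4 KV heads, head dimension 256, hidden size 2560, 2 bytes per value) are assumed values chosen because they reproduce the per-token figures quoted in the abstract; they are not taken from the paper.

```python
import random

def rms_norm(x, eps=1e-6):
    """Normalize a vector by its root mean square (learned gain omitted)."""
    scale = (sum(v * v for v in x) / len(x) + eps) ** -0.5
    return [v * scale for v in x]

def project(W, x):
    """Matrix-vector product with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def keys_values(residual, W_k, W_v):
    """Key and value for one token, computed only from the layer's residual input."""
    h = rms_norm(residual)
    return project(W_k, h), project(W_v, h)

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache one key and one value vector in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def residual_bytes_per_token(hidden_size, bytes_per_value=2):
    """Bytes needed to checkpoint a single residual vector."""
    return hidden_size * bytes_per_value

rng = random.Random(0)
d_model, d_head = 8, 4
W_k = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
W_v = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
residuals = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]

# Standard inference: compute K and V once and keep them.
kv_cache = [keys_values(r, W_k, W_v) for r in residuals]
# Alternative: keep only the residuals and recompute K and V when needed.
recomputed = [keys_values(r, W_k, W_v) for r in residuals]

print(kv_cache == recomputed)
print(kv_bytes_per_token(34, 4, 256), residual_bytes_per_token(2560))
```

Under the assumed configuration the two helpers return 139264 and 5120 bytes, that is 136 × 1024 and 5 × 1024, which matches the 136 KB and 5 KB per token stated in the abstract.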

3. Any-Subgroup Equivariant Networks via Symmetry Breaking

ArXiv ID: 2603.19486

Authors: Abhinav Goel, Derek Lim, Hannah Lawrence, Stefanie Jegelka, Ningyuan Huang

Abstract: The inclusion of symmetries as an inductive bias, known as equivariance, often improves generalization on geometric data (e.g. grids, sets, and graphs). However, equivariant architectures are usually highly constrained, designed for symmetries chosen a priori, and not applicable to datasets with other symmetries. This precludes the development of flexible, multi-modal foundation models capable of processing diverse data equivariantly. In this work, we build a single model -- the Any-Subgroup Equivariant Network (ASEN) -- that can be simultaneously equivariant to several groups, simply by modulating a certain auxiliary input feature. In particular, we start with a fully permutation-equivariant base model, and then obtain subgroup equivariance by using a symmetry-breaking input whose automorphism group is that subgroup. However, finding an input with the desired automorphism group is computationally hard. We overcome this by relaxing from exact to approximate symmetry breaking, leveraging the notion of 2-closure to derive fast algorithms. Theoretically, we show that our subgroup-equivariant networks can simulate equivariant MLPs, and their universality can be guaranteed if the base model is universal. Empirically, we validate our method on symmetry selection for graph and image tasks, as well as multitask and transfer learning for sequence tasks, showing that a single network equivariant to multiple permutation subgroups outperforms both separate equivariant models and a single non-equivariant model.

Comment: Architecture theory for equivariant networks: a single model attains any subgroup equivariance through symmetry-breaking inputs with universality guarantees.

Relevance: 10 Novelty: 9
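
The symmetry-breaking idea can be shown in its simplest case: a layer that is equivariant to all permutations when the data and an auxiliary feature are permuted together becomes, once the auxiliary feature is held fixed, equivariant only to the permutations that leave that feature unchanged. The sketch below is an assumption, not ASEN: it uses per-element auxiliary labels only, integer arithmetic for exact comparison, and hypothetical function names. The paper's general construction and its 2-closure relaxation are not reproduced.

```python
def layer(x, aux):
    """Toy permutation-equivariant layer on (x, aux) jointly:
    a pointwise term plus a pooled term shared by every element."""
    pooled = sum(a * v for a, v in zip(aux, x))
    return [v * (1 + a) + pooled for v, a in zip(x, aux)]

def permute(vec, perm):
    """Apply a permutation given as a list of source indices."""
    return [vec[i] for i in perm]

def is_equivariant(x, aux, perm):
    """Check layer(P x, aux) == P layer(x, aux) with aux held fixed."""
    return layer(permute(x, perm), aux) == permute(layer(x, aux), perm)

x = [3, 1, 4, 2]
aux = [0, 0, 1, 1]            # symmetry-breaking input with two label classes

preserves_aux = [1, 0, 3, 2]  # swaps elements only within each label class
breaks_aux = [2, 1, 0, 3]     # swaps elements across label classes

print(is_equivariant(x, aux, preserves_aux))        # permutation fixes aux
print(is_equivariant(x, aux, breaks_aux))           # permutation changes aux
print(is_equivariant(x, [0, 0, 0, 0], breaks_aux))  # constant aux: full symmetry
```

Changing `aux` changes which subgroup the same layer respects, which is the sense in which one model can serve several symmetry groups.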

4. Transformers are Stateless Differentiable Neural Computers

ArXiv ID: 2603.19272

Authors: Bo Tang, Weiwei Xie

Abstract: Differentiable Neural Computers (DNCs) were introduced as recurrent architectures equipped with an addressable external memory supporting differentiable read and write operations. Transformers, in contrast, are nominally feedforward architectures based on multi-head self-attention. In this work we give a formal derivation showing that a causal Transformer layer is exactly a stateless Differentiable Neural Computer (sDNC) where (1) the controller has no recurrent internal state, (2) the external memory is a write-once matrix of value vectors, (3) content-based addressing via keys implements attention, and (4) multi-head attention corresponds to multiple parallel read heads. We further extend this equivalence to cross-attention, showing that encoder-decoder Transformers are precisely sDNCs with distinct read-from and write-to memories. Our results provide a unified memory-centric interpretation of Transformers and contribute to the ongoing effort to place modern large language models in a principled computational framework.

Comment: Model architecture/theory: formally derives causal Transformers as stateless differentiable neural computers with external memory semantics.

Relevance: 10 Novelty: 8
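
The memory-centric reading in the abstract can be checked on a toy single-head example: appending each token's key and value to a write-once memory and reading it by content gives the same outputs as masked causal attention. This is a sketch under simplifying assumptions (one head, no projections, no feed-forward block), and the function names are hypothetical rather than the paper's notation.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def stateless_dnc(queries, keys, values):
    """Write each (key, value) once into an append-only memory, then read by content."""
    mem_keys, mem_values, outputs = [], [], []
    for q, k, v in zip(queries, keys, values):
        mem_keys.append(k)      # write-once: existing slots are never modified
        mem_values.append(v)
        scale = 1.0 / math.sqrt(len(q))
        weights = softmax([dot(mk, q) * scale for mk in mem_keys])
        outputs.append(weighted_sum(weights, mem_values))
    return outputs

def causal_attention(queries, keys, values):
    """Standard masked attention: position t attends to positions 0..t."""
    n = len(queries)
    outputs = []
    for t in range(n):
        scale = 1.0 / math.sqrt(len(queries[t]))
        scores = [dot(keys[j], queries[t]) * scale if j <= t else -math.inf
                  for j in range(n)]
        outputs.append(weighted_sum(softmax(scores), values))
    return outputs

rng = random.Random(0)
n_tokens, dim = 6, 4
Q = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
K = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
V = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

dnc_out = stateless_dnc(Q, K, V)
attn_out = causal_attention(Q, K, V)
outputs_match = all(math.isclose(a, b, abs_tol=1e-12)
                    for row_a, row_b in zip(dnc_out, attn_out)
                    for a, b in zip(row_a, row_b))
print(outputs_match)
```

Multi-head attention would correspond to running several such reads in parallel over separate projections, as point (4) of the abstract states.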

5. The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $\lambda$-Calculus

ArXiv ID: 2603.20105

Authors: Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar

Abstract: LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce $\lambda$-RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime grounded in $\lambda$-calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that $\lambda$-RLM admits formal guarantees absent from standard RLMs, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, $\lambda$-RLM outperforms standard RLM in 29 of 36 model-task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of $\lambda$-RLM is open-sourced for the community at: https://github.com/lambda-calculus-LLM/lambda-RLM.

Comment: Model architecture/systems: replaces free-form recursive control with a typed λ-calculus runtime for long-context reasoning, with formal guarantees on termination and cost.

Relevance: 9 Novelty: 9
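
The contrast with a free-form REPL can be sketched with one fixed recursive combinator that calls a model only on bounded leaves. Everything here is hypothetical and is not the λ-RLM library: `split`, `map_reduce`, and the parameters are illustrative names, and the "model" is a stub that counts a character so the result can be checked exactly.

```python
from typing import Callable, List

def split(text: str, parts: int) -> List[str]:
    """Partition text into contiguous chunks of near-equal size."""
    size = -(-len(text) // parts)  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce(text: str,
               leaf: Callable[[str], int],
               combine: Callable[[List[int]], int],
               max_leaf_chars: int,
               branching: int,
               leaf_sizes: List[int]) -> int:
    """Fixed control flow: recurse until a chunk fits the leaf budget.
    Each split strictly shrinks the input, so the recursion terminates."""
    if max_leaf_chars < 1 or branching < 2:
        raise ValueError("need max_leaf_chars >= 1 and branching >= 2")
    if len(text) <= max_leaf_chars:
        leaf_sizes.append(len(text))   # record the size of every leaf call
        return leaf(text)              # the only place a model would be invoked
    results = [map_reduce(chunk, leaf, combine, max_leaf_chars, branching, leaf_sizes)
               for chunk in split(text, branching)]
    return combine(results)

# Stub standing in for an LLM call on a bounded subproblem.
def count_x(chunk: str) -> int:
    return chunk.count("x")

document = ("a" * 26 + "x") * 37 + "a"   # 1000 characters containing 37 "x"
leaf_sizes: List[int] = []
answer = map_reduce(document, count_x, sum, max_leaf_chars=100,
                    branching=2, leaf_sizes=leaf_sizes)
print(answer, len(leaf_sizes), max(leaf_sizes))
```

Because the control flow is fixed in advance, the number of leaf calls and the largest leaf input are known before any model is run, which is the kind of cost bound an open-ended code-generating loop cannot give.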

6. Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition

ArXiv ID: 2603.19559

Authors: Calvin Ang, Sungyoon Kim, Mert Pilanci

Abstract: We study entrywise scalar quantization of two matrices prior to multiplication. Given $A\in \mathbb{R}^{m\times k}$ and $B\in \mathbb{R}^{k\times n}$, we quantize entries of $A$ and $B$ independently using scalar quantizers with $K_X$ and $K_Y$ levels per entry, and form $\widehat C=\widehat A\,\widehat B$. The objective is to minimize the matrix multiplication mean-squared error (MSE) $E[\|AB-\widehat A\widehat B\|_F^2]$ under a pair-i.i.d. inner-product model. In the high-resolution regime $K_X,K_Y\to\infty$, we derive a sharp $K^{-2}$ asymptotic expansion for $\mathcal{E}$, identify the exact optimal leading constants, and characterize asymptotically optimal quantization center densities in terms of conditional second moments. We then specialize to correlated Gaussian multiplicative pairs, obtaining a closed-form optimal point density \[ \lambda^\star(u)\ \propto\ \exp\!\left(-\frac{u^2}{6}\right)\bigl((1-\rho^2)+\rho^2u^2\bigr)^{1/3}, \qquad u=\frac{x}{\sigma_X}, \] with the same form for $y/\sigma_Y$, and prove a correlation-driven phase transition: the density is unimodal at the origin for $|\rho|\leq 1/\sqrt{3}$ and becomes bimodal for $|\rho|>1/\sqrt{3}$ with peaks at $u=\pm\sqrt{3-1/\rho^2}$. We show our method's applicability in synthetic experiments such as matrix multiplication quantization and least squares optimization, as well as quantization of large language model key and query activations.

Comment: Compression theory for matrix multiplication: derives optimal scalar quantization densities and phase transitions with closed-form analysis.

Relevance: 9 Novelty: 9
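
The phase transition stated in the abstract can be checked numerically from the closed-form density alone. The sketch below evaluates the unnormalized density on a grid and locates its local maxima; the grid range, resolution, and function names are arbitrary choices made for this illustration.

```python
import math

def density(u, rho):
    """Unnormalized optimal point density from the abstract."""
    return math.exp(-u * u / 6.0) * ((1 - rho ** 2) + rho ** 2 * u * u) ** (1.0 / 3.0)

def local_maxima(rho, lo=-4.0, hi=4.0, steps=8001):
    """Grid points whose density exceeds both neighbours."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    vals = [density(u, rho) for u in grid]
    return [grid[i] for i in range(1, steps - 1)
            if vals[i] > vals[i - 1] and vals[i] > vals[i + 1]]

def predicted_peak(rho):
    """Peak location from the abstract, valid for |rho| > 1/sqrt(3)."""
    return math.sqrt(3.0 - 1.0 / rho ** 2)

weak = local_maxima(0.3)     # |rho| below 1/sqrt(3), about 0.577
strong = local_maxima(0.9)   # |rho| above 1/sqrt(3)
print(weak)
print(strong, predicted_peak(0.9))
```

Setting the derivative of $\log\lambda^\star(u)$ to zero gives either $u=0$ or $(1-\rho^2)+\rho^2u^2=2\rho^2$, that is $u^2=3-1/\rho^2$, which is positive exactly when $|\rho|>1/\sqrt{3}$; the grid search agrees with this.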

7. Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture -- Bridging Predictive and Generative Self-Supervised Learning

ArXiv ID: 2603.20111

Authors: Moritz Gögl, Christopher Yau

Abstract: The Joint-Embedding Predictive Architecture (JEPA) is often seen as a non-generative alternative to likelihood-based self-supervised learning, emphasizing prediction in representation space rather than reconstruction in observation space. We argue that the resulting separation from probabilistic generative modeling is largely rhetorical rather than structural: the canonical JEPA design, coupled encoders with a context-to-target predictor, mirrors the variational posteriors and learned conditional priors obtained when variational inference is applied to a particular class of coupled latent-variable models, and standard JEPA can be viewed as a deterministic specialization in which regularization is imposed via architectural and training heuristics rather than an explicit likelihood. Building on this view, we derive the Variational JEPA (Var-JEPA), which makes the latent generative structure explicit by optimizing a single Evidence Lower Bound (ELBO). This yields meaningful representations without ad-hoc anti-collapse regularizers and allows principled uncertainty quantification in the latent space. We instantiate the framework for tabular data (Var-T-JEPA) and achieve strong representation learning and downstream performance, consistently improving over T-JEPA while remaining competitive with strong raw-feature baselines.

Comment: Representation learning: gives a variational reformulation of JEPA as an explicit latent-variable model, removing heuristic anti-collapse regularization.

Relevance: 9 Novelty: 8

8. TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly

ArXiv ID: 2603.19296

Authors: Toshiaki Koike-Akino, Jing Liu, Ye Wang

Abstract: To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these methods rely heavily on calibration data, domain shift issues may arise for unseen downstream tasks. We propose a test-time quantization (TTQ) framework which compresses large models on the fly at inference time to resolve this issue. With an efficient online calibration, instant activation-aware quantization can adapt to every prompt regardless of the downstream task, while still achieving inference speedup. Several experiments demonstrate that TTQ can improve the quantization performance over state-of-the-art baselines.

Comment: Model compression/efficiency: proposes on-the-fly activation-aware test-time quantization that adapts per prompt without retraining.

Relevance: 9 Novelty: 8

9. Speculating Experts Accelerates Inference for Mixture-of-Experts

ArXiv ID: 2603.19289

Authors: Vivan Madan, Prajwal Singhania, Abhinav Bhatele, Tom Goldstein, Ashwinee Panda

Abstract: Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute-memory overlap by eliminating the need to re-fetch true router-selected experts. Integrated into an optimized inference engine, our approach achieves up to 14% reduction in time per output token (TPOT) over on-demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, we further examine lightweight estimators that improve expert prediction hit rates, thereby reducing performance degradation. Our code is released as open source at https://github.com/axonn-ai/yalis/tree/offload_prefetch.

Comment: MoE inference systems method that speculates future experts to overlap CPU-GPU transfers with compute under expert offloading.

Relevance: 9 Novelty: 8
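
One plausible reading of "speculating future experts" is to apply an upcoming layer's router to a hidden state that is already available and prefetch the experts it selects. The sketch below shows only that bookkeeping; it is an assumption about the mechanism, not the paper's method or the linked engine's API, and `top_k_experts`, `hit_rate`, and the toy router are hypothetical.

```python
def top_k_experts(router, hidden, k):
    """Indices of the k experts with the largest router logits."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router]
    ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    return set(ranked[:k])

def hit_rate(predicted, actual):
    """Fraction of the experts actually needed that were already prefetched."""
    return len(predicted & actual) / len(actual)

# Toy router for the next MoE layer: four experts, two-dimensional hidden state.
next_router = [[1, 0], [0, 1], [-1, 0], [0, -1]]

hidden_now = [2.0, 1.0]          # representation available before the next layer runs
speculated = top_k_experts(next_router, hidden_now, k=2)   # start CPU-to-GPU copies early

hidden_similar = [1.5, 1.2]      # next layer's true input, close to hidden_now
hidden_shifted = [-1.0, 2.0]     # next layer's true input, far from hidden_now

good = hit_rate(speculated, top_k_experts(next_router, hidden_similar, k=2))
poor = hit_rate(speculated, top_k_experts(next_router, hidden_shifted, k=2))
print(sorted(speculated), good, poor)
```

A miss leaves two options: fetch the router-selected expert on demand, or run the speculated expert anyway, which is the trade-off the abstract evaluates against downstream accuracy.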

10. On the Ability of Transformers to Verify Plans

ArXiv ID: 2603.19954

Authors: Yash Sarrof, Yupei Du, Katharina Stein, Alexander Koller, Sylvie Thiébaux, Michael Hahn

Abstract: Transformers have shown inconsistent success in AI planning tasks, and theoretical understanding of when generalization should be expected has been limited. We take important steps towards addressing this gap by analyzing the ability of decoder-only models to verify whether a given plan correctly solves a given planning instance. To analyse the general setting where the number of objects -- and thus the effective input alphabet -- grows at test time, we introduce C*-RASP, an extension of C-RASP designed to establish length generalization guarantees for transformers under the simultaneous growth in sequence length and vocabulary size. Our results identify a large class of classical planning domains for which transformers can provably learn to verify long plans, and structural properties that significantly affect the learnability of length-generalizable solutions. Empirical experiments corroborate our theory.

Comment: Transformer theory: introduces C*-RASP and proves length-generalization guarantees for plan verification with growing vocabulary size.

Relevance: 9 Novelty: 8

11. Neural Dynamics Self-Attention for Spiking Transformers

ArXiv ID: 2603.19290

Authors: Dehao Zhang, Fukai Guo, Shuai Wang, Jingya Wang, Jieyuan Zhang, Yimeng Shan, Malu Zhang, Yang Yang, Haizhou Li

Abstract: Integrating Spiking Neural Networks (SNNs) with Transformer architectures offers a promising pathway to balance energy efficiency and performance, particularly for edge vision applications. However, existing Spiking Transformers face two critical challenges: (i) a substantial performance gap compared to their Artificial Neural Network (ANN) counterparts and (ii) high memory overhead during inference. Through theoretical analysis, we attribute both limitations to the Spiking Self-Attention (SSA) mechanism: the lack of locality bias and the need to store large attention matrices. Inspired by the localized receptive fields (LRF) and membrane-potential dynamics of biological visual neurons, we propose LRF-Dyn, which uses spiking neurons with localized receptive fields to compute attention while reducing memory requirements. Specifically, we introduce an LRF method into SSA to assign higher weights to neighboring regions, strengthening local modeling and improving performance. Building on this, we approximate the resulting attention computation via charge-fire-reset dynamics, eliminating explicit attention-matrix storage and reducing inference-time memory. Extensive experiments on visual tasks confirm that our method reduces memory overhead while delivering significant performance improvements. These results establish it as a key unit for achieving energy-efficient Spiking Transformers.

Comment: Introduces a new spiking self-attention mechanism that adds locality bias and removes explicit attention-matrix storage to cut inference memory.

Relevance: 9 Novelty: 8

12. Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination

ArXiv ID: 2603.19562

Authors: Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang, Jun Zhu, Deyu Meng

Abstract: Adversarial vulnerability in vision and hallucination in large language models are conventionally viewed as separate problems, each addressed with modality-specific patches. This study first reveals that they share a common geometric origin: the input and its loss gradient are conjugate observables subject to an irreducible uncertainty bound. Formalizing a Neural Uncertainty Principle (NUP) under a loss-induced state, we find that in near-bound regimes, further compression must be accompanied by increased sensitivity dispersion (adversarial fragility), while weak prompt-gradient coupling leaves generation under-constrained (hallucination). Crucially, this bound is modulated by an input-gradient correlation channel, captured by a specifically designed single-backward probe. In vision, masking highly coupled components improves robustness without costly adversarial training; in language, the same prefill-stage probe detects hallucination risk before generating any answer tokens. NUP thus turns two seemingly separate failure taxonomies into a shared uncertainty-budget view and provides a principled lens for reliability analysis. Guided by this NUP theory, we propose ConjMask (masking high-contribution input components) and LogitReg (logit-side regularization) to improve robustness without adversarial training, and use the probe as a decoding-free risk signal for LLMs, enabling hallucination detection and prompt selection. NUP thus provides a unified, practical framework for diagnosing and mitigating boundary anomalies across perception and generation tasks.

Comment: Representation learning/theory: proposes a unified geometric uncertainty principle linking adversarial fragility and LLM hallucination through input-gradient coupling.

Relevance: 8 Novelty: 9

13. Hyperagents

ArXiv ID: 2603.19461

Authors: Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tatiana Shavrina

Abstract: Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed, handcrafted meta-level mechanisms, fundamentally limiting how fast such systems can improve. The Darwin Gödel Machine (DGM) demonstrates open-ended self-improvement in coding by repeatedly generating and evaluating self-modified variants. Because both evaluation and self-modification are coding tasks, gains in coding ability can translate into gains in self-improvement ability. However, this alignment does not generally hold beyond coding domains. We introduce hyperagents, self-referential agents that integrate a task agent (which solves the target task) and a meta agent (which modifies itself and the task agent) into a single editable program. Crucially, the meta-level modification procedure is itself editable, enabling metacognitive self-modification, improving not only the task-solving behavior, but also the mechanism that generates future improvements. We instantiate this framework by extending DGM to create DGM-Hyperagents (DGM-H), eliminating the assumption of domain-specific alignment between task performance and self-modification skill to potentially support self-accelerating progress on any computable task. Across diverse domains, the DGM-H improves performance over time and outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. Furthermore, the DGM-H improves the process by which it generates new agents (e.g., persistent memory, performance tracking), and these meta-level improvements transfer across domains and accumulate across runs. DGM-Hyperagents offer a glimpse of open-ended AI systems that do not merely search for better solutions, but continually improve their search for how to improve.

Comment: Proposes a self-referential architecture where the meta-level modification mechanism is itself editable, a foundational systems design for open-ended self-improvement.

Relevance: 8 Novelty: 9

14. Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

ArXiv ID: 2603.20155

Authors: Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, Tim Salimans

Abstract: It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods that can reduce sampling steps to a handful. Our method, Discrete Moment Matching Distillation (D-MMD), leverages ideas that have been highly successful in the continuous domain. Whereas previous discrete distillation methods collapse, D-MMD maintains high quality and diversity (given sufficient sampling steps). This is demonstrated on both text and image datasets. Moreover, the newly distilled generators can outperform their teachers.

Comment: Model compression/efficiency: introduces a new distillation objective for discrete diffusion models using discrete MMD, tackling a known methodological gap in fast sampling.