Personalized Daily ArXiv Papers 2026-03-27
| Model | Metric | Usage: Prompt | Usage: Completion | Usage: Total | Papers: Total arXiv | Papers: Scanned | Papers: Relevant |
|---|---|---|---|---|---|---|---|
| gpt-5.4 | Tokens | 116876 | 4798 | 121674 | 540 | 324 | 33 |
| gpt-5.4 | Cost | $0.29 | $0.07 | $0.36 | | | |
Table of contents with paper titles:
- Labeled Compression Schemes for Concept Classes of Finite Functions (Authors: Benchong Li)
- Perturbation: A simple and efficient adversarial tracer for representation learning in language models (Authors: Joshua Rozner, Cory Shain)
- MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens (Authors: Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen)
- Latent Algorithmic Structure Precedes Grokking: A Mechanistic Study of ReLU MLPs on Modular Arithmetic (Authors: Anand Swaroop)
- Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding (Authors: Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu)
- Manifold Generalization Provably Proceeds Memorization in Diffusion Models (Authors: Zebang Shen, Ya-Ping Hsieh, Niao He)
- MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning (Authors: Andrea Manzoni)
- StateLinFormer: Stateful Training Enhancing Long-term Memory in Navigation (Authors: Zhiyuan Chen, Yuxuan Zhong, Fan Wang, Bo Yu, Pengtao Shao, Shaoshan Liu, Ning Ding)
- Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct (Authors: Christopher Ackerman, Nina Panickssery)
- The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations (Authors: Long Zhang, Dai-jun Lin, Wei-neng Chen)
- Identification of NMF by choosing maximum-volume basis vectors (Authors: Qianqian Qi, Zhongming Chen, Peter G. M. van der Heijden)
- Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes (Authors: Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun)
- Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method (Authors: Arthur Jacot)
- Deep Neural Regression Collapse (Authors: Akshay Rangamani, Altay Unal)
- DVM: Real-Time Kernel Generation for Dynamic AI Models (Authors: Jingzhi Fang, Xiong Gao, Renwei Zhang, Zichun Ye, Lei Chen, Jie Zhao, Chengnuo Huang, Hui Xu, Xuefeng Jin)
- Minimal Sufficient Representations for Self-interpretable Deep Neural Networks (Authors: Zhiyao Tan, Liu Li, Huazhen Lin)
- Likelihood hacking in probabilistic program synthesis (Authors: Jacek Karwowski, Younesse Kaddar, Zihuiwen Ye, Nikolay Malkin, Sam Staton)
- AVO: Agentic Variation Operators for Autonomous Evolutionary Search (Authors: Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye, Timmy Liu, Ali Hassani, Tianqi Chen, Andrew Kerr, Haicheng Wu, Yang Xu, Yu-Jung Chen, Hanfeng Chen, Aditya Kane, Ronny Krashinsky, Ming-Yu Liu, Vinod Grover, Luis Ceze, Roger Bringmann, John Tran, Wei Liu, Fung Xie, Michael Lightstone, Humphrey Shi)
- The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation (Authors: Mingyi Liu)
- Kirchhoff-Inspired Neural Networks for Evolving High-Order Perception (Authors: Tongfei Chen, Jingying Yang, Linlin Yang, Jinhu Lü, David Doermann, Chunyu Xie, Long He, Tian Wang, Juan Zhang, Guodong Guo, Baochang Zhang)
- A Theory of LLM Information Susceptibility (Authors: Zhuo-Yang Song, Hua Xing Zhu)
- Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness (Authors: Yunrui Yu, Hang Su, Jun Zhu)
- The Diminishing Returns of Early-Exit Decoding in Modern LLMs (Authors: Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang)
- Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score (Authors: Jimyung Hong, Jaehyung Kim)
- Resolving gradient pathology in physics-informed epidemiological models (Authors: Nickson Golooba, Woldegebriel Assefa Woldegerima)
- Self-Distillation for Multi-Token Prediction (Authors: Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun)
- Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? (Authors: Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang)
- Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep (Authors: Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau)
- Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification (Authors: Han Sun, Qin Li, Peixin Wang, Min Zhang)
- Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization (Authors: Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang, Ran Tao, Bryan Dai, Wayne Xin Zhao, Jian Yang, Hongteng Xu)
- Sparse Autoencoders for Interpretable Medical Image Representation Learning (Authors: Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis)
- Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL (Authors: Igor Jankowski)
- Upper Entropy for 2-Monotone Lower Probabilities (Authors: Tuan-Anh Vu, Sébastien Destercke, Frédéric Pichon)
1. Labeled Compression Schemes for Concept Classes of Finite Functions
ArXiv ID: 2603.23561
Authors: Benchong Li
Abstract: The sample compression conjecture states that each concept class of VC dimension d has a compression scheme of size d. In this paper, for any concept class of finite functions, we present a labeled sample compression scheme of size equal to its VC dimension d. That is, the long-standing open sample compression conjecture is resolved.
Comment: Foundational learning theory result resolving the sample compression conjecture for concept classes of finite functions.
Relevance: 8 Novelty: 10
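To make the notion of a labeled compression scheme concrete, here is a minimal sketch for a textbook class, not the construction from the paper: one-dimensional thresholds h_t(x) = 1 if x >= t, which have VC dimension 1. The function names `compress` and `reconstruct` are illustrative choices. The scheme keeps a single labeled example and rebuilds a hypothesis that agrees with the whole sample, assuming the sample is realizable by some threshold.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[float, int]  # (point, label in {0, 1})


def compress(sample: List[Example]) -> Optional[Example]:
    """Keep at most one labeled example from a threshold-realizable sample."""
    positives = [x for x, y in sample if y == 1]
    if positives:
        # The smallest positive point pins down a consistent threshold.
        return (min(positives), 1)
    negatives = [x for x, y in sample if y == 0]
    if negatives:
        # No positives: remember the largest negative point instead.
        return (max(negatives), 0)
    return None  # empty sample


def reconstruct(kept: Optional[Example]) -> Callable[[float], int]:
    """Rebuild a hypothesis from the kept example alone."""
    if kept is None:
        return lambda z: 0
    x, y = kept
    if y == 1:
        return lambda z: 1 if z >= x else 0
    return lambda z: 1 if z > x else 0


sample = [(0.2, 0), (1.5, 1), (0.9, 0), (3.0, 1)]
h = reconstruct(compress(sample))
# The reconstructed hypothesis labels every sample point correctly.
assert all(h(x) == y for x, y in sample)
```

The kept set has size 1, matching the VC dimension of thresholds, which is the size the conjecture asks for in general.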
2. Perturbation: A simple and efficient adversarial tracer for representation learning in language models
ArXiv ID: 2603.23821
Authors: Joshua Rozner, Cory Shain
Abstract: Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al. 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.
Comment: Introduces a new perturbation-based method to trace learned linguistic representations via transfer from single-example adversarial fine-tuning.
Relevance: 9 Novelty: 8
3. MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
ArXiv ID: 2603.23516
Authors: Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen
Abstract: Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.
Comment: Introduces Memory Sparse Attention with scalable sparse attention, document-wise RoPE, and KV-cache compression for end-to-end long-context training and inference efficiency.
Relevance: 9 Novelty: 8
4. Latent Algorithmic Structure Precedes Grokking: A Mechanistic Study of ReLU MLPs on Modular Arithmetic
ArXiv ID: 2603.23784
Authors: Anand Swaroop
Abstract: Grokking, the phenomenon where validation accuracy of neural networks on modular addition of two integers rises long after training data has been memorized, has been characterized in previous works as producing sinusoidal input weight distributions in transformers and multi-layer perceptrons (MLPs). We find empirically that ReLU MLPs in our experimental setting instead learn near-binary square wave input weights, where intermediate-valued weights appear exclusively near sign-change boundaries, alongside output weight distributions whose dominant Fourier phases satisfy a phase-sum relation $\phi_{\mathrm{out}} = \phi_a + \phi_b$; this relation holds even when the model is trained on noisy data and fails to grok. We extract the frequency and phase of each neuron's weights via DFT and construct an idealized MLP: Input weights are replaced by perfect binary square waves and output weights by cosines, both parametrized by the frequencies, phases, and amplitudes extracted from the dominant Fourier components of the real model weights. This idealized model achieves 95.5% accuracy when the frequencies and phases are extracted from the weights of a model trained on noisy data that itself achieves only 0.23% accuracy. This suggests that grokking does not discover the correct algorithm, but rather sharpens an algorithm substantially encoded during memorization, progressively binarizing the input weights into cleaner square waves and aligning the output weights, until generalization becomes possible.
Comment: Mechanistic training-dynamics study of grokking showing latent algorithmic structure emerges before generalization in ReLU MLPs.
Relevance: 9 Novelty: 8
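The extraction step described in the abstract (dominant frequency and phase via DFT, then an idealized square wave) can be sketched as follows. This is a minimal illustration under assumptions of my own: the function names, the phase convention cos(2*pi*k*n/p + phase), and the toy modulus p = 97 are not taken from the paper, and a plain O(p^2) DFT is used for clarity.

```python
import cmath
import math
from typing import List, Tuple


def dominant_fourier_component(w: List[float]) -> Tuple[int, float, float]:
    """Return (frequency, phase, amplitude) of the largest nonzero DFT bin of w.

    Convention (an assumption of this sketch): a weight vector
    w[n] = A * cos(2*pi*k*n/p + phase) yields frequency k, that phase,
    and amplitude A, for 0 < k < p/2.
    """
    p = len(w)
    best = (0, 0.0, -1.0)
    for k in range(1, p // 2 + 1):
        c = sum(w[n] * cmath.exp(-2j * math.pi * k * n / p) for n in range(p))
        amp = 2 * abs(c) / p
        if amp > best[2]:
            best = (k, cmath.phase(c), amp)
    return best


def square_wave(p: int, k: int, phase: float) -> List[float]:
    """Idealized binary weights: the sign of the cosine with the same k and phase."""
    return [1.0 if math.cos(2 * math.pi * k * n / p + phase) >= 0 else -1.0
            for n in range(p)]


# Toy input-weight vector for one neuron (hypothetical values).
p, k_true, phase_true = 97, 5, 0.7
weights = [math.cos(2 * math.pi * k_true * n / p + phase_true) for n in range(p)]

k, phase, amp = dominant_fourier_component(weights)
idealized = square_wave(p, k, phase)
```

Here `idealized` plays the role of the "perfect binary square wave" that replaces a neuron's input weights; the paper's handling of output weights and amplitudes is not reproduced.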
5. Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
ArXiv ID: 2603.23914
Authors: Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models that improves memory efficiency during decoding, addressing the challenges posed by the increased number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) We introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving the model output quality, or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization and kernel fusion, showing further efficiency gains for resource-limited environments.
Comment: Memory-efficient inference via low-rank multi-head attention compaction and token-specific decompression for KV storage.
Relevance: 9 Novelty: 8
6. Manifold Generalization Provably Proceeds Memorization in Diffusion Models
ArXiv ID: 2603.23792
Authors: Zebang Shen, Ya-Ping Hsieh, Niao He
Abstract: Diffusion models often generate novel samples even when the learned score is only coarse, a phenomenon not accounted for by the standard view of diffusion training as density estimation. In this paper, we show that, under the manifold hypothesis, this behavior can instead be explained by coarse scores capturing the geometry of the data while discarding the fine-scale distributional structure of the population measure $\mu_{\scriptscriptstyle\mathrm{data}}$. Concretely, whereas estimating the full data distribution $\mu_{\scriptscriptstyle\mathrm{data}}$ supported on a $k$-dimensional manifold is known to require the classical minimax rate $\tilde{\mathcal{O}}(N^{-1/k})$, we prove that diffusion models trained with coarse scores can exploit the regularity of the manifold support and attain a near-parametric rate toward a different target distribution. This target distribution has density uniformly comparable to that of $\mu_{\scriptscriptstyle\mathrm{data}}$ throughout any $\tilde{\mathcal{O}}\bigl(N^{-\beta/(4k)}\bigr)$-neighborhood of the manifold, where $\beta$ denotes the manifold regularity. Our guarantees therefore depend only on the smoothness of the underlying support, and are especially favorable when the data density itself is irregular, for instance non-differentiable. In particular, when the manifold is sufficiently smooth, we obtain that generalization, formalized as the ability to generate novel, high-fidelity samples, occurs at a statistical rate strictly faster than that required to estimate the full population distribution $\mu_{\scriptscriptstyle\mathrm{data}}$.
Comment: Provides theory that diffusion models can learn manifold geometry and generalize before memorizing the full data distribution, a strong representation-learning insight.
Relevance: 8 Novelty: 9
7. MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning
ArXiv ID: 2603.24044
Authors: Andrea Manzoni
Abstract: Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated ("cold"). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within +/-1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at matched budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k.
Comment: Routing-guided LoRA for MoE fine-tuning directly studies expert skew and adapts only highly routed experts, a clear architecture-mechanism and efficient adaptation contribution.
Relevance: 9 Novelty: 7
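The selection recipe in the abstract (profile routing counts on a calibration set, take the top-k most-routed experts per layer, adapt only those) can be sketched as below. The function name `select_hot_experts`, the dictionary layout, and the toy counts are hypothetical; attaching LoRA adapters to the selected experts is left out because it depends on the fine-tuning library in use.

```python
import math
from typing import Dict, List


def select_hot_experts(routing_counts: Dict[int, List[int]],
                       fraction: float = 0.25) -> Dict[int, List[int]]:
    """Pick the most-routed experts in each layer.

    routing_counts maps layer index -> tokens routed to each expert,
    as measured on a small calibration set.
    """
    selected = {}
    for layer, counts in routing_counts.items():
        k = max(1, math.ceil(fraction * len(counts)))
        # Rank experts by descending count; break ties by expert index.
        ranked = sorted(range(len(counts)), key=lambda e: (-counts[e], e))
        selected[layer] = sorted(ranked[:k])
    return selected


# Toy calibration profile: 2 layers, 8 experts each (made-up numbers).
counts = {
    0: [900, 20, 15, 5, 700, 10, 30, 40],
    1: [5, 5, 800, 600, 10, 20, 30, 50],
}
hot = select_hot_experts(counts, fraction=0.25)
# Only the experts listed in `hot` would receive LoRA adapters.
```

With 8 experts per layer and a 25% budget, two experts per layer are kept, which mirrors the "top 25% routed experts" setting reported in the abstract.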
8. StateLinFormer: Stateful Training Enhancing Long-term Memory in Navigation
ArXiv ID: 2603.23571
Authors: Zhiyuan Chen, Yuxuan Zhong, Fan Wang, Bo Yu, Pengtao Shao, Shaoshan Liu, Ning Ding
Abstract: Effective navigation intelligence relies on long-term memory to support both immediate generalization and sustained adaptation. However, existing approaches face a dilemma: modular systems rely on explicit mapping but lack flexibility, while Transformer-based end-to-end models are constrained by fixed context windows, limiting persistent memory across extended interactions. We introduce StateLinFormer, a linear-attention navigation model trained with a stateful memory mechanism that preserves recurrent memory states across consecutive training segments instead of reinitializing them at each batch boundary. This training paradigm effectively approximates learning on infinitely long sequences, enabling the model to achieve long-horizon memory retention. Experiments across both MAZE and ProcTHOR environments demonstrate that StateLinFormer significantly outperforms its stateless linear-attention counterpart and standard Transformer baselines with fixed context windows. Notably, as interaction length increases, persistent stateful training substantially improves context-dependent adaptation, suggesting an enhancement in the model's In-Context Learning (ICL) capabilities for navigation tasks.
Comment: Stateful training for linear attention preserves recurrent memory across batch boundaries, directly targeting architecture/training dynamics for long-horizon sequence learning.
Relevance: 9 Novelty: 7
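The core idea of carrying a recurrent state across segment boundaries can be illustrated with a generic, unnormalized linear-attention recurrence. This is a sketch under stated assumptions, not StateLinFormer's architecture: the update S += k v^T, the helper name `linear_attention_segment`, and the toy vectors are all illustrative, and only the forward pass is shown (how gradients are handled at boundaries is not stated in the abstract).

```python
from typing import List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear_attention_segment(qs: List[Vector], ks: List[Vector], vs: List[Vector],
                             state: Optional[Matrix] = None
                             ) -> Tuple[List[Vector], Matrix]:
    """Process one segment; return per-token outputs and the final state.

    The state S accumulates outer products k v^T; each output is q^T S.
    Passing the previous segment's state makes the computation stateful.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    if state is None:
        S = [[0.0] * d_v for _ in range(d_k)]
    else:
        S = [row[:] for row in state]  # copy so the caller's state is untouched
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        outputs.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs, S


qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full_out, _ = linear_attention_segment(qs, ks, vs)

# Stateful: the second segment starts from the first segment's final state.
first_out, carried = linear_attention_segment(qs[:2], ks[:2], vs[:2])
second_out, _ = linear_attention_segment(qs[2:], ks[2:], vs[2:], state=carried)

# Stateless: the second segment starts from zeros and forgets the first.
reset_out, _ = linear_attention_segment(qs[2:], ks[2:], vs[2:])

assert first_out + second_out == full_out
assert reset_out != full_out[2:]
```

The carried-state run reproduces the full-sequence outputs exactly, while the reset run loses the contribution of earlier tokens, which is the gap that stateful training is meant to close.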
9. Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
ArXiv ID: 2410.02064
Authors: Christopher Ackerman, Nina Panickssery
Abstract: It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
Comment: Mechanistic interpretability of residual-stream directions controlling self-authorship judgments in Llama-3 aligns with representation structure and causal circuit analysis.
Relevance: 8 Novelty: 8
10. The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations
ArXiv ID: 2603.23577
Authors: Long Zhang, Dai-jun Lin, Wei-neng Chen
Abstract: Large language models (LLMs) generalize smoothly across continuous semantic spaces, yet strict logical reasoning demands the formation of discrete decision boundaries. Prevailing theories relying on linear isometric projections fail to resolve this fundamental tension. In this work, we argue that task context operates as a non-isometric dynamical operator that enforces a necessary "topological distortion." By applying Gram-Schmidt decomposition to residual-stream activations, we reveal a dual-modulation mechanism driving this process: a class-agnostic topological preservation that anchors global structure to prevent semantic collapse, and a specific algebraic divergence that directionally tears apart cross-class concepts to forge logical boundaries. We validate this geometric evolution across a gradient of tasks, from simple mapping to complex primality testing. Crucially, targeted specific vector ablation establishes a strict causal binding between this topology and model function: algebraically erasing the divergence component collapses parity classification accuracy from 100% to chance levels (38.57%). Furthermore, we uncover a three-phase layer-wise geometric dynamic and demonstrate that under social pressure prompts, models fail to generate sufficient divergence. This results in a "manifold entanglement" that geometrically explains sycophancy and hallucination. Ultimately, our findings revise the linear-isometric presumption, demonstrating that the emergence of discrete logic in LLMs is purchased at an irreducible cost of topological deformation.
Comment: Analyzes how context induces geometric/topological deformation in residual representations to form discrete logical boundaries, with causal ablations.
Relevance: 8 Novelty: 8
11. Identification of NMF by choosing maximum-volume basis vectors
ArXiv ID: 2603.24227
Authors: Qianqian Qi, Zhongming Chen, Peter G. M. van der Heijden
Abstract: In nonnegative matrix factorization (NMF), minimum-volume-constrained NMF is a widely used framework for identifying the solution of NMF by making basis vectors as similar as possible. This typically induces sparsity in the coefficient matrix, with each row containing zero entries. Consequently, minimum-volume-constrained NMF may fail for highly mixed data, where such sparsity does not hold. Moreover, the estimated basis vectors in minimum-volume-constrained NMF may be difficult to interpret as they may be mixtures of the ground truth basis vectors. To address these limitations, in this paper we propose a new NMF framework, called maximum-volume-constrained NMF, which makes the basis vectors as distinct as possible. We further establish an identifiability theorem for maximum-volume-constrained NMF and provide an algorithm to estimate it. Experimental results demonstrate the effectiveness of the proposed method.
Comment: Provides an identifiability theorem and algorithm for maximum-volume-constrained NMF, directly relevant to representation structure and sparse factorization theory.
Relevance: 8 Novelty: 8
12. Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes
ArXiv ID: 2603.23507
Authors: Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun
Abstract: While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: the computations on non-informative 1) mask tokens inherent to the paradigm, and 2) padding tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by: 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) providing an intrinsic self-correction mechanism during generation, since insertion dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we efficiently solve via a parallelized dynamic programming algorithm. Our experiments across fixed and variable-length settings demonstrate the advantage of DID over baselines of MDLMs and existing insertion-based LMs, in terms of modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.
Comment: Introduces a new deletion-insertion diffusion architecture for language modeling with algorithmic gains in variable-length generation efficiency.
Relevance: 8 Novelty: 8
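The abstract mentions that the training objectives involve subsequence counting solved by dynamic programming. As a point of reference, here is the classic sequential DP that counts how many ways a target sequence occurs as a (not necessarily contiguous) subsequence of a source, which equals the number of deletion patterns turning the source into the target. This is the textbook algorithm, not the parallelized variant from the paper, and the function name is an illustrative choice.

```python
from typing import List, Sequence


def count_subsequence_occurrences(source: Sequence, target: Sequence) -> int:
    """Count the ways `target` appears as a subsequence of `source`.

    ways[j] holds the number of ways to match the first j target tokens
    using the source tokens seen so far.
    """
    ways: List[int] = [1] + [0] * len(target)
    for tok in source:
        # Iterate j downward so each source token is used at most once per match.
        for j in range(len(target), 0, -1):
            if target[j - 1] == tok:
                ways[j] += ways[j - 1]
    return ways[len(target)]


source = ["a", "b", "a", "b"]
target = ["a", "b"]
n_ways = count_subsequence_occurrences(source, target)
```

The loop runs in O(len(source) * len(target)) time with O(len(target)) memory; a parallel formulation would reorganize the same recurrence rather than change what is counted.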
13. Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method
ArXiv ID: 2603.24594
Authors: Arthur Jacot
Abstract: We introduce the Multilevel Euler-Maruyama (ML-EM) method to compute solutions of SDEs and ODEs using a range of approximators $f^1,\dots,f^k$ to the drift $f$ with increasing accuracy and computational cost, only requiring a few evaluations of the most accurate $f^k$ and many evaluations of the less costly $f^1,\dots,f^{k-1}$. If the drift lies in the so-called Harder than Monte Carlo (HTMC) regime, i.e. it requires $\epsilon^{-\gamma}$ compute to be $\epsilon$-approximated for some $\gamma>2$, then ML-EM $\epsilon$-approximates the solution of the SDE with $\epsilon^{-\gamma}$ compute, improving over the traditional EM rate of $\epsilon^{-\gamma-1}$. In other terms it allows us to solve the SDE at the same cost as a single evaluation of the drift. In the context of diffusion models, the different levels $f^{1},\dots,f^{k}$ are obtained by training UNets of increasing sizes, and ML-EM allows us to perform sampling with the equivalent of a single evaluation of the largest UNet. Our numerical experiments confirm our theory: we obtain up to fourfold speedups for image generation on the CelebA dataset downscaled to 64x64, where we measure a $\gamma\approx2.5$. Given that this is a polynomial speedup, we expect even stronger speedups in practical applications which involve orders of magnitude larger networks.
Comment: Proposes a multilevel Euler-Maruyama sampling method that gives a principled polynomial speedup for diffusion inference using multi-fidelity drifts.
Relevance: 8 Novelty: 8
14. Deep Neural Regression Collapse
ArXiv ID: 2603.23805
Authors: Akshay Rangamani, Altay Unal
Abstract: Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets and explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.
Comment: Shows Neural Regression Collapse extends into hidden layers, characterizing low-rank feature formation and training dynamics in deep regression networks.
Relevance: 8 Novelty: 8
15. DVM: Real-Time Kernel Generation for Dynamic AI Models
ArXiv ID: 2603.24239
Authors: Jingzhi Fang, Xiong Gao, Renwei Zhang, Zichun Ye, Lei Chen, Jie Zhao, Chengnuo Huang, Hui Xu, Xuefeng Jin
Abstract: Dynamism is common in AI computation, e.g., the dynamic tensor shapes and the dynamic control flows in models. Due to the long compilation time, existing runtime compilation damages the model efficiency, while the offline compilers either suffer from the long compilation time and device memory footprint to cover all the possible execution instances of a dynamic model, or sacrifice optimization opportunities for usability. In this paper, we rethink the feasibility of runtime compilation for dynamic models and identify that the key for it to work is to speed up the compilation or hide the compilation overhead. To do this, we propose a real-time compiler, DVM. In DVM, we design a runtime operator compiler based on a bytecode virtual machine to perform effective and efficient compilation for each dynamic operator instance given its input. Specifically, instead of compiling programs into machine code, we encode the operator program into bytecode on the CPU and decode the bytecode into virtual instructions for direct execution on the NPU. Based on the runtime operator compiler, we further propose an operator fuser, which performs symbol-deduction-based fusion on static graphs and runtime fusion on dynamic graphs. Both pattern- and stacking-based fusion are supported to increase fusion opportunities. Evaluation on operators, subgraphs, and models shows that, compared with TorchInductor, PyTorch-eager and MindSpore-graph-O0, we are up to 11.77$\times$ better in terms of the operator/model efficiency and up to 5 orders of magnitude faster in terms of the maximum compilation time.
Comment: Presents a real-time compiler with bytecode VM execution and dynamic fusion for materially changing execution efficiency of dynamic AI models.
Relevance: 8 Novelty: 8
16. Minimal Sufficient Representations for Self-interpretable Deep Neural Networks
ArXiv ID: 2603.24041
Authors: Zhiyao Tan, Liu Li, Huazhen Lin
Abstract: Deep neural networks (DNNs) achieve remarkable predictive performance but remain difficult to interpret, largely due to overparameterization that obscures the minimal structure required for interpretation. Here we introduce DeepIn, a self-interpretable neural network framework that adaptively identifies and learns the minimal representation necessary for preserving the full expressive capacity of standard DNNs. We show that DeepIn can correctly identify the minimal representation dimension, select relevant variables, and recover the minimal sufficient network architecture for prediction. The resulting estimator achieves optimal non-asymptotic error rates that adapt to the learned minimal dimension, demonstrating that recovering minimal sufficient structure fundamentally improves generalization error. Building on these guarantees, we further develop hypothesis testing procedures for both selected variables and learned representations, bridging deep representation learning with formal statistical inference. Across biomedical and vision benchmarks, DeepIn improves both predictive accuracy and interpretability, reducing error by up to 30% on real-world datasets while automatically uncovering human-interpretable discriminative patterns. Our results suggest that interpretability and statistical rigor can be embedded directly into deep architectures without sacrificing performance.
Comment: Studies minimal sufficient representations in deep networks with theory for recovering minimal dimension and variables, squarely about representation structure.
Relevance: 8 Novelty: 8
17. Likelihood hacking in probabilistic program synthesis
ArXiv ID: 2603.24126
Authors: Jacek Karwowski, Younesse Kaddar, Zihuiwen Ye, Nikolay Malkin, Sam Staton
Abstract: When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement $\mathcal{L}_{\text{safe}}$'s conditions as $\texttt{SafeStan}$, a LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.
Comment: Identifies and formalizes a training failure mode in RL for probabilistic program synthesis, with language-level constraints preventing reward hacking.
Relevance: 8 Novelty: 8
18. AVO: Agentic Variation Operators for Autonomous Evolutionary Search
ArXiv ID: 2603.24517
Authors: Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye, Timmy Liu, Ali Hassani, Tianqi Chen, Andrew Kerr, Haicheng Wu, Yang Xu, Yu-Jung Chen, Hanfeng Chen, Aditya Kane, Ronny Krashinsky, Ming-Yu Liu, Vinod Grover, Luis Ceze, Roger Bringmann, John Tran, Wei Liu, Fung Xie, Michael Lightstone, Humphrey Shi
Abstract: Agentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining a language model to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today's most advanced GPU hardware.
Comment: Autonomous evolutionary search discovers micro-architectural attention kernel optimizations, a strong systems/efficient inference contribution beyond routine tuning.
Relevance: 8 Novelty: 8
19. The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
ArXiv ID: 2603.24124
Authors: Mingyi Liu
Abstract: RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows 1.0% single-cluster rate vs. 28.5% for the instruct model (p SFT 1.5% -> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross-embedder validation with two independent embedding families rules out coupling bias. Cross-dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding -- response homogenization -- is implementation-independent and label-free. Motivated by this diagnosis, we explore a cheapest-first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| <= 0.12) enable 57% cost savings.
Comment: Analyzes how alignment training induces response homogenization, giving concrete insight into training dynamics and uncertainty behavior.
Relevance: 8 Novelty: 8
20. Kirchhoff-Inspired Neural Networks for Evolving High-Order Perception
ArXiv ID: 2603.23977
Authors: Tongfei Chen, Jingying Yang, Linlin Yang, Jinhu Lü, David Doermann, Chunyu Xie, Long He, Tian Wang, Juan Zhang, Guodong Guo, Baochang Zhang
Abstract: Deep learning architectures are fundamentally inspired by neuroscience, particularly the structure of the brain's sensory pathways, and have achieved remarkable success in learning informative data representations. Although these architectures mimic the communication mechanisms of biological neurons, their strategies for information encoding and transmission are fundamentally distinct. Biological systems depend on dynamic fluctuations in membrane potential; by contrast, conventional deep networks optimize weights and biases by adjusting the strengths of inter-neural connections, lacking a systematic mechanism to jointly characterize the interplay among signal intensity, coupling structure, and state evolution. To tackle this limitation, we propose the Kirchhoff-Inspired Neural Network (KINN), a state-variable-based network architecture constructed based on Kirchhoff's current law. KINN derives numerically stable state updates from fundamental ordinary differential equations, enabling the explicit decoupling and encoding of higher-order evolutionary components within a single layer while preserving physical consistency, interpretability, and end-to-end trainability. Extensive experiments on partial differential equation (PDE) solving and ImageNet image classification validate that KINN outperforms state-of-the-art existing methods.
Comment: Proposes a new state-variable neural layer derived from Kirchhoff-style ODE updates, squarely in core architectural mechanism design.
Relevance: 8 Novelty: 8
21. A Theory of LLM Information Susceptibility
ArXiv ID: 2603.23626
Authors: Zhuo-Yang Song, Hua Xing Zhu
Abstract: Large language models (LLMs) are increasingly deployed as optimization modules in agentic systems, yet the fundamental limits of such LLM-mediated improvement remain poorly understood. Here we propose a theory of LLM information susceptibility, centred on the hypothesis that when computational resources are sufficiently large, the intervention of a fixed LLM does not increase the performance susceptibility of a strategy set with respect to budget. We develop a multi-variable utility-function framework that generalizes this hypothesis to architectures with multiple co-varying budget channels, and discuss the conditions under which co-scaling can exceed the susceptibility bound. We validate the theory empirically across structurally diverse domains and model scales spanning an order of magnitude, and show that nested, co-scaling architectures open response channels unavailable to fixed configurations. These results clarify when LLM intervention helps and when it does not, demonstrating that tools from statistical physics can provide predictive constraints for the design of AI systems. If the susceptibility hypothesis holds generally, the theory suggests that nested architectures may be a necessary structural condition for open-ended agentic self-improvement.
Comment: Representation/training theory: proposes a formal theory of LLM information susceptibility with scaling-law-style constraints on when fixed vs nested architectures can improve performance.
Relevance: 8 Novelty: 8
22. Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness
ArXiv ID: 2603.23860
Authors: Yunrui Yu, Hang Su, Jun Zhu
Abstract: This work investigates the critical role of activation function curvature -- quantified by the maximum second derivative $\max|\sigma''|$ -- in adversarial robustness. Using the Recursive Curvature-Tunable Activation Family (RCT-AF), which enables precise control over curvature through parameters $\alpha$ and $\beta$, we systematically analyze this relationship. Our study reveals a fundamental trade-off: insufficient curvature limits model expressivity, while excessive curvature amplifies the normalized Hessian diagonal norm of the loss, leading to sharper minima that hinder robust generalization. This results in a non-monotonic relationship where optimal adversarial robustness consistently occurs when $\max|\sigma''|$ falls within 4 to 10, a finding that holds across diverse network architectures, datasets, and adversarial training methods. We provide theoretical insights into how activation curvature affects the diagonal elements of the Hessian matrix of the loss, and experimentally demonstrate that the normalized Hessian diagonal norm exhibits a U-shaped dependence on $\max|\sigma''|$, with its minimum within the optimal robustness range, thereby validating the proposed mechanism.
Comment: Studies activation curvature as a mechanistic driver of robustness via Hessian behavior, connecting architectural nonlinearity design to training/generalization dynamics.
Relevance: 8 Novelty: 7
23. The Diminishing Returns of Early-Exit Decoding in Modern LLMs
ArXiv ID: 2603.23701
Authors: Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang
Abstract: In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model's intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.
Comment: Provides an analysis and benchmark of when early-exit decoding works across architectures, directly about efficient inference and architectural redundancy.
Relevance: 8 Novelty: 7
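The confidence-based early-exit rule described in this abstract can be illustrated with a minimal Python sketch. It assumes that per-layer logits are already available (for example from applying an output head to intermediate hidden states); the function names and the 0.9 threshold are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit(layer_logits, threshold=0.9):
    """Return (layer_index, token_id) for the first layer whose top
    probability reaches the threshold; the final layer always answers."""
    last = len(layer_logits) - 1
    for i, logits in enumerate(layer_logits):
        probs = softmax(logits)
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold or i == last:
            return i, top
    raise ValueError("layer_logits must be non-empty")

# Hypothetical logits for three layers over a three-token vocabulary.
layers = [[0.1, 0.2, 0.0], [0.0, 5.0, 0.0], [0.0, 9.0, 0.0]]
exit_layer, token = early_exit(layers, threshold=0.9)
```

Layers skipped after the exit point are the source of the latency saving, so the benefit depends on how early the confidence threshold is typically reached.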
24. Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score
ArXiv ID: 2603.23985
Authors: Jimyung Hong, Jaehyung Kim
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions or layers, yet existing methods face critical trade-offs: task-agnostic approaches cannot adapt to task-specific requirements, while task-aware methods require costly training to learn task adaptability. We propose DIET (Dimension-wise global pruning of LLMs via merging Task-wise importance scores), a training-free structured pruning method that combines dimension-level granularity with task-aware selection. DIET profiles activation magnitudes across tasks using only 100 samples per task, then applies majority voting to construct a single global mask. DIET does not require large costs from pre-computation or training. Experiments on seven zero-shot benchmarks using Gemma-2 2B and 9B models demonstrate the effectiveness of DIET; for example, at 20% sparsity on Gemma-2 2B, DIET achieves near 10% average accuracy improvement, compared to previous state-of-the-art structured pruning methods. This advantage persists across various sparsity levels and model scales, positioning DIET as a practical and robust choice for structured LLM pruning.
Comment: Training-free structured pruning via task-merged dimension-wise importance scores is a clear compression and sparsity contribution.
Relevance: 8 Novelty: 7
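The profile-then-vote idea in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: importance is taken to be the mean absolute activation per dimension, each task votes for its top `keep` dimensions, and a dimension survives when a strict majority of tasks vote for it. The abstract does not specify these details, and all function names are hypothetical.

```python
def task_importance(activations):
    """Mean absolute activation per dimension; `activations` is a list of sample vectors."""
    n = len(activations)
    dims = len(activations[0])
    return [sum(abs(sample[d]) for sample in activations) / n for d in range(dims)]

def task_vote(importance, keep):
    """Set of the `keep` dimensions with the largest importance for one task."""
    ranked = sorted(range(len(importance)), key=lambda d: importance[d], reverse=True)
    return set(ranked[:keep])

def global_mask(per_task_activations, keep):
    """Keep a dimension when a strict majority of tasks vote for it (assumed rule)."""
    votes = [task_vote(task_importance(acts), keep) for acts in per_task_activations]
    dims = len(per_task_activations[0][0])
    threshold = len(votes) / 2
    return [sum(d in v for v in votes) > threshold for d in range(dims)]

# Three hypothetical tasks profiled on a 4-dimensional hidden state.
tasks = [
    [[5.0, 4.0, 0.0, 0.0], [5.0, -4.0, 0.0, 0.0]],
    [[3.0, 0.0, 2.0, 0.0]],
    [[1.0, 6.0, 0.0, 0.0]],
]
mask = global_mask(tasks, keep=2)
```

With a strict-majority rule the number of kept dimensions is not fixed in advance, so a real implementation would need an additional step to hit an exact sparsity target.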
25. Resolving gradient pathology in physics-informed epidemiological models
ArXiv ID: 2603.23799
Authors: Nickson Golooba, Woldegebriel Assefa Woldegerima
Abstract: Physics-informed neural networks (PINNs) are increasingly used in mathematical epidemiology to bridge the gap between noisy clinical data and compartmental models, such as the susceptible-exposed-infected-removed (SEIR) model. However, training these hybrid networks is often unstable due to competing optimization objectives. As established in recent literature on "gradient pathology," the gradient vectors derived from the data loss and the physical residual often point in conflicting directions, leading to slow convergence or optimization deadlock. While existing methods attempt to resolve this by balancing gradient magnitudes or projecting conflicting vectors, we propose conflict-gated gradient scaling (CGGS), a novel and computationally efficient alternative that addresses gradient conflicts in physics-informed neural networks for epidemiological modelling, ensuring stable and efficient training. This method utilizes the cosine similarity between the data and physics gradients to dynamically modulate the penalty weight. Unlike standard annealing schemes that only normalize scales, CGGS acts as a geometric gate: it suppresses the physical constraint when directional conflict is high, allowing the optimizer to prioritize data fidelity, and restores the constraint when gradients align. We prove that this gating mechanism preserves the standard $O(1/T)$ convergence rate for smooth non-convex objectives, a guarantee that fails under fixed-weight or magnitude-balanced training when gradients conflict. We demonstrate that this mechanism autonomously induces a curriculum learning effect, improving parameter estimation in stiff epidemiological systems compared to magnitude-based baselines. Our empirical results show improved peak recovery and convergence over magnitude-based methods.
Comment: Introduces conflict-gated gradient scaling to resolve gradient conflict in physics-informed training, directly targeting training stability via gradient geometry.
Relevance: 8 Novelty: 7
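The geometric gate described in this abstract can be illustrated with a small Python sketch. The abstract states only that the cosine similarity between data and physics gradients modulates the penalty weight; the specific gate used here, `max(0, cos)`, is an assumption chosen for simplicity, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a vanishing gradient as carrying no directional information
    return dot / (norm_u * norm_v)

def gated_physics_weight(grad_data, grad_physics, base_weight=1.0):
    """Assumed gate: scale the physics weight by max(0, cos).

    Aligned gradients (cos near 1) keep the full weight; conflicting
    gradients (cos <= 0) switch the physics term off for this step.
    """
    cos = cosine_similarity(grad_data, grad_physics)
    return base_weight * max(0.0, cos)

def combined_gradient(grad_data, grad_physics, base_weight=1.0):
    """Gradient of data_loss + w * physics_loss with the gated weight w held fixed."""
    w = gated_physics_weight(grad_data, grad_physics, base_weight)
    return [gd + w * gp for gd, gp in zip(grad_data, grad_physics)]
```

When the two gradients point in opposite directions this gate returns the data gradient unchanged, which matches the abstract's description of prioritizing data fidelity under high directional conflict.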
26. Self-Distillation for Multi-Token Prediction
ArXiv ID: 2603.23911
Authors: Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun
Abstract: As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and further significant inference speedup to 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical usage of MTP in LLMs.
Comment: Inference-efficiency method for multi-token prediction using self-distillation to raise head acceptance and stabilize multi-head training.
Relevance: 8 Novelty: 7
27. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
ArXiv ID: 2603.24472
Authors: Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
Abstract: Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.
Comment: Analyzes a post-training dynamics mechanism: self-distillation suppresses epistemic verbalization, harming OOD reasoning.
Relevance: 8 Novelty: 7
28. Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
ArXiv ID: 2603.24260
Authors: Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Abstract: Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in the denoising process but overlooks the architectural redundancy within the DiT, where many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reusing or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in the DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67$\times$ latency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.
Comment: Heterogeneous caching for diffusion transformers introduces a new token-selection/cache design for reducing redundant attention computation at inference.
Relevance: 8 Novelty: 7
29. Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
ArXiv ID: 2603.24058
Authors: Han Sun, Qin Li, Peixin Wang, Min Zhang
Abstract: Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving LVLMs' general capability by up to 15.9% across diverse vision-language tasks.
Comment: Identifies attention imbalance as a mechanistic cause of LVLM hallucination and proposes a decoding-time attention reallocation intervention.
Relevance: 8 Novelty: 7
30. Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization
ArXiv ID: 2603.24093
Authors: Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang, Ran Tao, Bryan Dai, Wayne Xin Zhao, Jian Yang, Hongteng Xu
Abstract: Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training still remains only a rough approximation to human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. Motivated by this gap, we ask: how can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose Dual Guidance Optimization (DGO), a unified framework that leverages external and internal experience to improve training effectiveness. Specifically, DGO first constructs an experience bank from previously explored trajectories. The policy then performs exploration under the joint guidance of the experience bank and the model's internal knowledge. The resulting trajectories are further used to refine the experience bank and optimize model parameters, forming a closed loop of experience utilization and internalization. Experiments show that DGO consistently outperforms baseline methods, suggesting that better utilization and internalization of experience lead to more effective reasoning.
Comment: Large-scale training dynamics: introduces a closed-loop RLVR framework that explicitly uses and internalizes past trajectories via an experience bank to change exploration and optimization behavior.
Relevance: 8 Novelty: 7
31. Sparse Autoencoders for Interpretable Medical Image Representation Learning
ArXiv ID: 2603.23794
Authors: Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis
Abstract: Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset. We find that learned sparse features: (a) reconstruct original embeddings with high fidelity ($R^2$ up to 0.941) and recover up to 87.8% of downstream performance using only 10 features (99.4% dimensionality reduction), (b) preserve semantic fidelity in image retrieval tasks, (c) correspond to specific concepts that can be expressed in language using large language model (LLM)-based auto-interpretation, and (d) bridge clinical language and abstract latent representations in zero-shot language-driven image retrieval. Our work indicates SAEs are a promising pathway towards interpretable, concept-driven medical vision systems. Code repository: https://github.com/pwesp/sail.
Comment: Trains sparse autoencoders on vision foundation-model embeddings to replace opaque representations with sparse, human-interpretable features, a sparse dictionary-learning approach to representation structure.
Relevance: 8 Novelty: 7
32. Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL
ArXiv ID: 2603.23722
Authors: Igor Jankowski
Abstract: While Multi-Agent Reinforcement Learning (MARL) algorithms achieve unprecedented successes across complex continuous domains, their standard deployment strictly adheres to a synchronous operational paradigm. Under this paradigm, agents are universally forced to execute deep neural network inferences at every micro-frame, regardless of immediate necessity. This dense throughput acts as a fundamental barrier to physical deployment on edge devices where thermal and metabolic budgets are highly constrained. We propose Epistemic Time-Dilation MAPPO (ETD-MAPPO), augmented with a Dual-Gated Epistemic Trigger. Instead of depending on rigid frame-skipping (macro-actions), agents autonomously modulate their execution frequency by interpreting aleatoric uncertainty (via Shannon entropy of their policy) and epistemic uncertainty (via state-value divergence in a Twin-Critic architecture). To formalize this, we structure the environment as a Semi-Markov Decision Process (SMDP) and build the SMDP-Aligned Asynchronous Gradient Masking Critic to ensure proper credit assignment. Empirical findings demonstrate massive improvements (> 60% relative baseline acquisition leaps) over current temporal models. By assessing LBF, MPE, and the 115-dimensional state space of Google Research Football (GRF), ETD correctly prevented premature policy collapse. Remarkably, this unconstrained approach leads to emergent Temporal Role Specialization, reducing computational overhead by a statistically dominant 73.6% entirely during off-ball execution without deteriorating centralized task dominance.
Comment: Introduces an SMDP-aligned asynchronous training method with autonomous compute modulation, targeting efficient inference and training dynamics.
Relevance: 8 Novelty: 7
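The two uncertainty signals named in this abstract (policy entropy and twin-critic value divergence) can be combined into a trigger as in the Python sketch below. The abstract does not say how the two gates are combined or which thresholds are used, so the OR rule, the threshold values, and the function names here are all assumptions made for illustration.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_run_inference(probs, value_a, value_b,
                         entropy_threshold=0.5, divergence_threshold=0.1):
    """Assumed dual gate: recompute the policy when either signal is high.

    probs            -- action probabilities from the most recent policy output
    value_a, value_b -- state-value estimates from the two critics
    """
    aleatoric = policy_entropy(probs)      # uncertainty in the action choice
    epistemic = abs(value_a - value_b)     # disagreement between the twin critics
    return aleatoric > entropy_threshold or epistemic > divergence_threshold
```

An agent that is confident and whose critics agree would skip the forward pass and repeat its previous action, which is where the compute saving would come from under this reading.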
33. Upper Entropy for 2-Monotone Lower Probabilities
ArXiv ID: 2603.23558
Authors: Tuan-Anh Vu, Sébastien Destercke, Frédéric Pichon
Abstract: Uncertainty quantification is a key aspect in many tasks such as model selection/regularization, or quantifying prediction uncertainties to perform active learning or OOD detection. Within credal approaches that consider modeling uncertainty as probability sets, upper entropy plays a central role as an uncertainty measure. This paper is devoted to the computational aspect of upper entropies, providing an exhaustive algorithmic and complexity analysis of the problem. In particular, we show that the problem has a strongly polynomial solution, and propose many significant improvements over past algorithms proposed for 2-monotone lower probabilities and their specific cases.
Comment: Computational analysis of upper entropy for 2-monotone lower probabilities gives a strongly polynomial algorithm for uncertainty quantification.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Respond in JSONL. Output exactly one JSON object per paper, one per line:
{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0}
Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- Do not output markdown, code fences, or any extra text.
Scoring Criteria
Relevance and Novelty are independent axes. Score both from 1 to 10.
Relevance Scoring
- 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
- 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
- 5-6: touches the target topics, but the main contribution is elsewhere.
- 3-4: largely outside the target topics, often application-focused or domain-specific.
- 1-2: unrelated.
Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.
Novelty Scoring
- 9-10: new paradigm, theory, or major methodological breakthrough.
- 7-8: substantial methodological advance or strong new insight.
- 5-6: meaningful but incremental extension or refinement.
- 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
- 1-2: little originality; mainly standard application of existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Focus on specialized foundational research that is worth reading even if it is not a daily hotspot.
Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the paper strongly matches the specialized topics below.
Architecture Mechanisms and Training Dynamics
- Keep: work that introduces or analyzes core architectural mechanisms such as MoE routing, attention variants, normalization or residual design, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly package an existing architecture into a new task or benchmark without new mechanistic insight.
Compression, Sparsity, and Efficient Inference
- Keep: quantization, sparsity, pruning, low-rank adaptation, cache design, memory-efficient inference, or compression methods with clear algorithmic novelty.
- Filter: straightforward application or tuning of standard efficiency methods without a new method, analysis, or principle.
Large-Scale Training Systems and Memory Efficiency
- Keep: distributed training algorithms, communication or optimizer improvements, memory-saving methods, and training-system designs that materially change large-model training behavior.
- Filter: routine engineering optimization, infrastructure reporting, or deployment work without a clear new algorithmic or systems idea.
Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, or identifiability and mechanistic understanding.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.
Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains
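The JSONL output format requested in the Instructions part of the prompt above could be checked with a short validator. The following is a minimal Python sketch that enforces the four required keys and the 1 to 10 integer range; the function names are hypothetical and are not part of the digest pipeline.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}

def parse_selection_line(line):
    """Parse one JSONL line and check it against the rules stated in the prompt."""
    record = json.loads(line)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    if set(record) != REQUIRED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(record)}")
    for key in ("ARXIVID", "COMMENT"):
        if not isinstance(record[key], str):
            raise ValueError(f"{key} must be a string")
    for key in ("RELEVANCE", "NOVELTY"):
        score = record[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
            raise ValueError(f"{key} must be an integer from 1 to 10")
    return record

def parse_selection_output(text):
    """Parse a whole model response, skipping blank lines."""
    return [parse_selection_line(line) for line in text.splitlines() if line.strip()]
```

Note that the template line in the prompt uses `0` as a placeholder score, which this validator would reject because the rules require integers from 1 to 10.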