Personalized Daily ArXiv Papers 2026-04-28
| Model | Metric | Usage: Prompt | Usage: Completion | Usage: Total | Papers: Total arXiv | Papers: Scanned | Papers: Relevant |
|---|---|---|---|---|---|---|---|
| gpt-5.4 | Tokens | 394647 | 32093 | 426740 | 1082 | 671 | 42 |
| gpt-5.4 | Cost | $0.99 | $0.48 | $1.47 | | | |
Topic Coverage:
Table of contents by topic:
Architecture and Training Dynamics (11)
- Can an MLP Absorb Its Own Skip Connection? Authors: Antonij Mijoski, Marko Karbevski
- Mixture of Heterogeneous Grouped Experts for Language Modeling. Authors: Zhicheng Ma, Xiang Liu, Zhaoxiang Liu, Ning Wang, Yi Shen, Kai Wang, Shuming Shi, Shiguo Lian
- A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws. Authors: Jun Shu, Junxiong Jia, Deyu Meng, Zongben Xu
- Towards Understanding the Expressive Power of GNNs with Global Readout. Authors: Maurice Funk, Daumantas Kojelis
- DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models. Authors: Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation. Authors: Shuaizhi Cheng, Xiang Shi, Mingwei Li
- Latent-Hysteresis Graph ODEs: Modeling Coupled Topology-Feature Evolution via Continuous Phase Transitions. Authors: Qinhan Hou, Jing Tang
- Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks. Authors: Kevin McKee, Thomas Hazy, Yicong Zheng, Zacharie Bugaud, Thomas Miconi
- Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing. Authors: Kaisheng Fan, Weizhe Zhang, Yishu Gao, Tegawendé F. Bissyandé, Xunzhu Tang
- FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers. Authors: Haopeng Jin
- CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies. Authors: Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He
Efficiency, Compression, and Large-Scale Training (9)
- FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training. Authors: Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed
- Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation. Authors: Irene Tenison, Stella Ahn, Miriam Kim, Ebtisam Alshehri, Lalana Kagal
- Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels. Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
- ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers. Authors: Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee
- Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling. Authors: Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum
- Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns. Authors: Abhimanyu Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee, Pengchao Wang, Changkyu Kim, Chunqiang Tang, Tushar Krishna
- Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models. Authors: Hailing Cheng, Tao Huang, Chen Zhu, Antonio Alonso
- Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators. Authors: Animan Naskar
- Inference of Online Newton Methods with Nesterov's Accelerated Sketching. Authors: Haoxuan Wang, Xinchen Du, Sen Na
Representation Learning Theory and Structure (13)
- Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features. Authors: John Winnicki, Abeynaya Gnanasekaran, Eric Darve
- Representational Curvature Modulates Behavioral Uncertainty in Large Language Models. Authors: Jack King, Evelina Fedorenko, Eghbal A. Hosseini
- On the Memorization of Consistency Distillation for Diffusion Models. Authors: Bingqing Jiang, Difan Zou
- Causal Representation Learning from General Environments under Nonparametric Mixing. Authors: Ignavier Ng, Shaoan Xie, Xinshuai Dong, Peter Spirtes, Kun Zhang
- Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data. Authors: K. Michael Martini, Eslam Abdelaleem, Paarth Gulati, Ilya Nemenman
- The Power of Power Law: Asymmetry Enables Compositional Reasoning. Authors: Zixuan Wang, Xingyu Dang, Jason D. Lee, Kaifeng Lyu
- Learning Curves and Benign Overfitting of Spectral Algorithms in Large Dimensions. Authors: Weihao Lu, Qian Lin, Yingcun Xia, Dongming Huang
- Quasi-Equivariant Metanetworks. Authors: Viet-Hoang Tran, An Nguyen, Benoît Guérand, Thieu N. Vo, Tan M. Nguyen
- Necessary and sufficient conditions for universality of Kolmogorov-Arnold networks. Authors: Vugar Ismailov
- A General Representation-Based Approach to Multi-Source Domain Adaptation. Authors: Ignavier Ng, Yan Li, Zijian Li, Yujia Zheng, Guangyi Chen, Kun Zhang
- Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance. Authors: Xinzhu Chen, Wei He, Huichuan Fan, Wenzhe Niu, Zhongxiang Sun, Xuanru Wang, Jiuchong Gao, Jinghua Hao, Renqing He, Weijie Yu
- Constraint-Based Analysis of Reasoning Shortcuts in Neurosymbolic Learning. Authors: Akihiro Takemura, Katsumi Inoue, Masaaki Nishino
- Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models. Authors: Sharan Ramjee
Memory Structures and Agent Memory Systems (5)
- ZenBrain: A Neuroscience-Inspired 7-Layer Memory Architecture for Autonomous AI Systems. Authors: Alexander Bering
- MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation. Authors: Mofei Li, Taozhi Chen, Guowei Yang, Jia Li
- A Parametric Memory Head for Continual Generative Retrieval. Authors: Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke
- Graph Memory Transformer (GMT). Authors: Nicola Zanarini, Niccolò Ferrari
- Skill Retrieval Augmentation for Agentic AI. Authors: Weihang Su, Jianming Long, Qingyao Ai, Yichen Tang, Changyue Wang, Yiteng Tu, Yiqun Liu
World Models, Exploration, and Open-Ended Reinforcement Learning (4)
- Hierarchical Behaviour Spaces. Authors: Michael Tryfan Matthews, Anssi Kanervisto, Jakob Foerster, Pierluca D'Oro, Scott Fujimoto, Mikael Henaff
- Efficient learning by implicit exploration in bandit problems with side observations. Authors: Tomas Kocak, Gergely Neu, Michal Valko, Remi Munos
- CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning. Authors: Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong
- Dual Control of Linear Systems from Bilinear Observations with Belief Space Model Predictive Control. Authors: Daniel Cao, Beixi Du, Andrew Lowitt, Sunmook Choi, Sarah Dean, Yahya Sattar
Architecture and Training Dynamics (11)
1. Can an MLP Absorb Its Own Skip Connection?
ArXiv ID: 2604.23705
Primary Topic: Architecture and Training Dynamics
Authors: Antonij Mijoski, Marko Karbevski
Abstract: We study when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. We first show that for any architecture whose skip branch is an invertible linear map (including Hyper-Connections and their manifold-constrained variants), the problem reduces to the identity skip case. For homogeneous activations of degree $k \neq 1$, such as ReLU$^2$ and ReGLU, absorption is unconditionally impossible by a degree argument. For gated activations whose gate is differentiable at the origin with $g(0) = 0$, including SwiGLU and GeGLU, a linearization argument gives the same conclusion. These impossibility results extend to arbitrary depth: a composition of $L$ residual blocks using such activations cannot be replicated by any composition of $L$ residual-free blocks of the same width. For ungated ReLU and GELU, the situation is richer. For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set $S$ of size at least $d$ such that $W_{\mathrm{down}}[:,S]\,W_{\mathrm{up}}[S,:] = -I_d$. This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. Whether this disjointness persists for deep compositions of ReLU or GELU blocks remains open.
Comment: Characterizes when residual skip connections can or cannot be absorbed into same-width residual-free MLPs, giving a sharp expressivity separation for common activations.
Topic Match: This is directly about a core architectural mechanism—residual connections—and gives formal representational results rather than an application.
Relevance: 9 Novelty: 8
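The sufficiency side of the ReLU condition in this abstract can be checked numerically. The sketch below is an illustration and not the authors' proof. It assumes the residual block has the form $x + W_{\mathrm{down}}\,\mathrm{ReLU}(W_{\mathrm{up}} x)$ and relies on the identity $\mathrm{ReLU}(z) = z + \mathrm{ReLU}(-z)$. The sizes, seeds, and function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def make_absorbable_weights(d=3, hidden=5, seed=0):
    """Build weights with W_down[:, S] @ W_up[S, :] = -I_d for S = {0, ..., d-1}."""
    rng = np.random.default_rng(seed)
    w_up = rng.normal(size=(hidden, d))
    w_down = rng.normal(size=(d, hidden))
    S = np.arange(d)
    # Force the condition from the abstract on the chosen index set.
    w_down[:, S] = -np.linalg.inv(w_up[S, :])
    return w_up, w_down, S

def skip_block(x, w_up, w_down):
    """Assumed residual block: identity skip plus a one-hidden-layer ReLU MLP."""
    return x + w_down @ relu(w_up @ x)

def absorbed_block(x, w_up, w_down, S):
    """Residual-free MLP of the same width: negate the rows of W_up indexed by S."""
    w_up_flipped = w_up.copy()
    w_up_flipped[S, :] *= -1.0
    return w_down @ relu(w_up_flipped @ x)

w_up, w_down, S = make_absorbable_weights()
xs = np.random.default_rng(1).normal(size=(100, 3))
max_gap = max(
    np.abs(skip_block(x, w_up, w_down) - absorbed_block(x, w_up, w_down, S)).max()
    for x in xs
)
print(f"largest disagreement over 100 random inputs: {max_gap:.2e}")
```

For weights drawn without this constraint, the abstract states that the condition fails with probability one, so no such rewrite exists in the generic case.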
2. Mixture of Heterogeneous Grouped Experts for Language Modeling
ArXiv ID: 2604.23108
Primary Topic: Architecture and Training Dynamics
Also Matches: Efficiency, Compression, and Large-Scale Training
Authors: Zhicheng Ma, Xiang Liu, Zhaoxiang Liu, Ning Wang, Yi Shen, Kai Wang, Shuming Shi, Shiguo Lian
Abstract: Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios.
Comment: Introduces heterogeneous grouped experts with two-level routing and GPU-aware balancing to make non-uniform MoE architectures practical at scale.
Topic Match: The core idea is a new MoE architectural/routing design; efficiency matters, but as a consequence of the architectural mechanism.
Relevance: 9 Novelty: 8
3. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
ArXiv ID: 2604.24037
Primary Topic: Architecture and Training Dynamics
Authors: Jun Shu, Junxiong Jia, Deyu Meng, Zongben Xu
Abstract: Emergent intelligence has played a major role in modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function $\mathcal{E}(N, P, K)$, dependent on data size $N$, model size $P$ and training steps $K$, to quantify intelligence behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as the existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove necessary and sufficient conditions for the existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools from Lipschitz operator theory and covering numbers. Theoretical results show that: 1) emergent intelligence is governed by three key factors: training steps, data size and the model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition $\mathrm{Lip}(T) = 1$ for emergent intelligence provides theoretical support for existing findings; 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.
Comment: Develops a limit-theoretic framework for emergent intelligence and scaling laws, tying emergence to existence of a parameter-limit architecture and a critical Lipschitz condition.
Topic Match: Its main contribution is a theoretical account of model scaling and architectural limits, best placed under architecture and training dynamics.
Relevance: 8 Novelty: 8
4. Towards Understanding the Expressive Power of GNNs with Global Readout
ArXiv ID: 2604.22870
Primary Topic: Architecture and Training Dynamics
Authors: Maurice Funk, Daumantas Kojelis
Abstract: We study the expressive power of message-passing aggregate-combine-readout graph neural networks (ACR-GNNs). Particularly, we focus on the first-order (FO) properties expressible by this formalism. While a tight logical characterisation remains a difficult open question, we make two contributions towards answering it. First, we show that sum aggregation and readout suffice for GNNs to capture FO properties that cannot be expressed in the logic C2 on both directed and undirected graphs. This strengthens known results by Hauke and Wałęga (2026) where aggregation and readout functions are specially crafted for the task. Second, we identify two natural ways of restoring characterisability (with regard to C2) for ACR-GNNs. One option is to limit local aggregation (without imposing restrictions on global readout), whilst the second is to run ACR-GNNs over graphs of bounded degree (but unbounded size). In both cases, the FO properties captured by GNNs are exactly those definable by a formula in graded modal logic with global counting modalities. Our results thus establish an innate lower- and upper-bound in terms of how far (fragments of) C2 can be taken to characterise GNNs, and imply that it is indeed the unbounded interaction of aggregation and readout that pushes the logical expressive power of GNNs above C2.
Comment: Logical expressivity results for GNNs with global readout give mechanistic understanding of what aggregate-combine-readout architectures can represent.
Topic Match: This is primarily a foundational architecture paper analyzing the expressive power induced by aggregation and readout design in GNNs.
Relevance: 8 Novelty: 8
5. DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models
ArXiv ID: 2604.24357
Primary Topic: Architecture and Training Dynamics
Authors: Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda
Abstract: Diffusion language models generate without a fixed left-to-right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised or verified at each step? Existing systems mainly use random masking or confidence-driven ordering. Random masking creates train-test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h-transform Process Reward guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft-BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample-complexity advantage over random and confidence-only ordering. DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling, and single-cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein, molecular generation and DNA design, the effect is more multi-objective: ordering-aware variants significantly improve selected structural or fragment-constrained metrics while not uniformly dominating the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general-purpose module for improving it. Code is available at https://github.com/DakeBU/DPRM-DLLM.
Comment: Token-ordering in diffusion language models is treated as a core algorithmic control mechanism with theory and plug-in policy design.
Topic Match: The paper changes a fundamental generation/computation mechanism in diffusion language models, making architecture/mechanism the clearest fit.
Relevance: 8 Novelty: 8
6. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
ArXiv ID: 2604.23750
Primary Topic: Architecture and Training Dynamics
Also Matches: Memory Structures and Agent Memory Systems
Authors: Shuaizhi Cheng, Xiang Shi, Mingwei Li
Abstract: Hypernetwork-based methods such as Doc-to-LoRA internalize a document into an LLM's weights in a single forward pass, but they fail systematically on conflicts: when the document contradicts pretraining knowledge, accuracy collapses to 46.4% on the deepest facts. We show the failure is a magnitude problem rather than a representational one. The hypernetwork already targets the right layers, but its adapter margin is approximately constant across documents while the pretrained margin grows with training frequency, so deep conflicts lose by construction. The account predicts that failure should track prior strength: sorting 194 conflicts by the base model's log-probability on the contradicted fact, baseline accuracy falls from 68% on weak-prior questions to 16% on strong-prior ones, a 52 percentage-point gap. The cure is amplitude. Selective Layer Boosting scales the adapter at its top-norm layers, and Conflict-Aware Internalization triggers boosting only when the base model is confident. Both are training-free; together they raise deep-conflict accuracy from 46.4% to 71.0% on Gemma-2B and from 53.6% to 72.5% on Mistral-7B while preserving novel-knowledge recall, and beat vanilla retrieval-augmented generation on medium conflicts by 18 percentage points despite operating entirely in parameter space. We release KID-Bench, a 489-question benchmark that separates novel recall, cross-knowledge combination, and prior-graded conflicts.
Comment: Shows instant hypernetwork adaptation fails on knowledge conflicts because adapter magnitude cannot override strong pretrained priors.
Topic Match: The paper identifies a mechanistic failure mode in parameter-space adaptation and proposes a training-free architectural/control fix, making training/architecture the best fit.
Relevance: 8 Novelty: 8
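The two training-free steps described in this abstract (scale the adapter at its top-norm layers, and do so only when the base model is confident) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the use of Frobenius norms to rank layers, and the `top_n`, `scale`, and `threshold` values are all hypothetical.

```python
import numpy as np

def boost_adapter(deltas, base_confidence, top_n=2, scale=2.0, threshold=0.5):
    """Scale the largest-norm adapter layers, gated on base-model confidence.

    deltas: dict mapping a layer name to its adapter weight update (ndarray).
    base_confidence: assumed scalar confidence of the base model on the
        contradicted fact. How it is measured is outside this sketch.
    All names and default values here are hypothetical.
    """
    if base_confidence < threshold:
        # Weak prior: leave the adapter untouched.
        return dict(deltas)
    norms = {name: np.linalg.norm(delta) for name, delta in deltas.items()}
    top_layers = sorted(norms, key=norms.get, reverse=True)[:top_n]
    return {
        name: delta * scale if name in top_layers else delta
        for name, delta in deltas.items()
    }

deltas = {
    "layer0": np.full((2, 2), 0.1),
    "layer1": np.full((2, 2), 1.0),
    "layer2": np.full((2, 2), 0.5),
}
boosted = boost_adapter(deltas, base_confidence=0.9)    # strong prior: boost
unchanged = boost_adapter(deltas, base_confidence=0.2)  # weak prior: no boost
```

Gating on confidence keeps the adapter at its original amplitude for novel or weak-prior facts, which matches the abstract's claim that novel-knowledge recall is preserved.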
7. Latent-Hysteresis Graph ODEs: Modeling Coupled Topology-Feature Evolution via Continuous Phase Transitions
ArXiv ID: 2604.24293
Primary Topic: Architecture and Training Dynamics
Authors: Qinhan Hou, Jing Tang
Abstract: Graph neural ordinary differential equations (Graph ODEs) extend graph learning from discrete message-passing layers to continuous-time representation flows. While this supports adaptive long-range propagation, we show that Graph ODEs with strictly positive irreducible mixing operators face an inherent monostability trap: in the long-time regime, information leakage is unavoidable and the dynamics converge to a single global consensus attractor. We propose the Hysteresis Graph ODE (HGODE), which couples feature evolution with a latent topological potential driven by a learned pairwise force. A double-well edge potential and bipolarized gate allow edge states to polarize into connected or insulated phases while preserving differentiability. We provide asymptotic analysis of the collapse mechanism and the proposed hysteretic topology dynamics, and validate HGODE on theory-driven synthetic diagnostics and real-world graph benchmarks.
Comment: Analyzes the long-time consensus collapse of Graph ODEs and introduces hysteretic latent topology dynamics to avoid monostable oversmoothing.
Topic Match: The paper is centrally about a new continuous-time graph architecture motivated by asymptotic dynamical analysis, fitting architectural mechanism and training-dynamics interests.
Relevance: 8 Novelty: 8
8. Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks
ArXiv ID: 2604.24637
Primary Topic: Architecture and Training Dynamics
Also Matches: Memory Structures and Agent Memory Systems
Authors: Kevin McKee, Thomas Hazy, Yicong Zheng, Zacharie Bugaud, Thomas Miconi
Abstract: Block-sequential continual learning demands that a single model both protect prior solutions from catastrophic forgetting and efficiently infer at inference time which prior solution matches the current input without task labels. We present Functional Task Networks (FTN), a parameter-isolation method inspired by structural and dynamical motifs found in the mammalian neocortex. Similar to mixture-of-experts, this method uses a high-dimensional, self-organizing binary mask over a large population of small but deep networks, inspired by dendritic models of pyramidal neurons. The mask is produced by a three-stage procedure: (1) gradient descent on a continuous mask identifies task-relevant neurons, (2) a smoothing kernel biases the result toward spatial contiguity, and (3) k-winner-take-all binarizes the resulting group at a fixed capacity budget. Like mixture-of-experts, each neuron is an independent deep network, so disjoint masks give exactly disjoint gradient updates, providing structural guarantees against catastrophic forgetting. This three-stage procedure recovers the sub-network of a previously-trained task in a single gradient step, providing unsupervised task segmentation at inference time. We test it on three continual-learning benchmarks: (1) a synthetic multi-task classification/regression generator, (2) MNIST with shuffled class labels (pure concept shift), and (3) Permuted MNIST (domain shift). On all three, FTN with fine-grained smoothing (FTN-Slow) results in nearly zero forgetting. FTN with a large kernel and only 2 iterations of smoothing (FTN-Fast) trades off some retention for increased speed. We show that the spatial organization mechanism reduces the effective mask search from the combinatorial top-k subset problem in O(C(H,K)) to the complexity of a near-linear scan in O(H) over compact cortical neighborhoods, which is parallelized by the gradient-based update.
Comment: Introduces a cortex-inspired parameter-isolation architecture with self-organizing binary masks that recover task-specific subnetworks and nearly eliminate forgetting in continual learning.
Topic Match: The core contribution is a new architectural mechanism for modular computation and task-specific subnetwork activation, with continual-learning behavior emerging from the design.
Relevance: 8 Novelty: 8
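Stages (2) and (3) of the mask procedure in this abstract can be illustrated on a one-dimensional toy layout. The sketch below is a hedged example, not the FTN implementation: it takes the continuous mask from stage (1) as given, and the box kernel, kernel width, and score values are assumptions made for illustration.

```python
import numpy as np

def smooth(scores, kernel_width=3, iterations=1):
    """Stage 2 (assumed form): box-filter smoothing along a 1-D neuron layout."""
    kernel = np.ones(kernel_width) / kernel_width
    out = np.asarray(scores, dtype=float)
    for _ in range(iterations):
        out = np.convolve(out, kernel, mode="same")
    return out

def k_winner_take_all(scores, k):
    """Stage 3: keep the k highest-scoring neurons as a binary mask."""
    mask = np.zeros(scores.shape, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

# Hypothetical continuous mask, standing in for the output of stage 1.
continuous_mask = np.array([0.1, 0.9, 0.2, 0.8, 0.7, 0.0, 0.0, 0.6, 0.0, 0.0])

raw_mask = k_winner_take_all(continuous_mask, k=3)
binary_mask = k_winner_take_all(smooth(continuous_mask), k=3)

print("without smoothing:", np.flatnonzero(raw_mask).tolist())
print("with smoothing:   ", np.flatnonzero(binary_mask).tolist())
```

On this toy input the unsmoothed top-3 picks neurons 1, 3, and 4, while the smoothed version picks the contiguous block 2, 3, 4, which is the bias toward spatial contiguity that the abstract describes.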
9. Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing
ArXiv ID: 2604.24162
Primary Topic: Architecture and Training Dynamics
Authors: Kaisheng Fan, Weizhe Zhang, Yishu Gao, Tegawendé F. Bissyandé, Xunzhu Tang
Abstract: Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain correction preserves semantic anchoring, while a stronger full-row contraction disrupts trigger-dominant routing. Finally, a controlled full-row write-back reconstructs the attention matrix to ensure inference stability. Extensive evaluations demonstrate that TIGS substantially suppresses attack success rates while strictly preserving clean reasoning and open-ended semantic consistency. Crucially, this favorable security-utility-latency equilibrium persists across diverse architectures, including dense, reasoning-oriented, and sparse mixture-of-experts models. By structurally disrupting adversarial routing with marginal latency overhead, TIGS establishes a highly practical, deployment-ready defense standard for state-of-the-art LLMs.
Comment: Detects trigger-induced attention collapse and performs targeted attention smoothing inside the forward pass to defend backdoored LLMs with low latency.
Topic Match: The mechanism operates directly on internal attention routing behavior, making it most relevant as a mechanistic architecture-level intervention.
Relevance: 8 Novelty: 8
10. FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
ArXiv ID: 2604.22808
Primary Topic: Architecture and Training Dynamics
Also Matches: Efficiency, Compression, and Large-Scale Training
Authors: Haopeng Jin
Abstract: Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are spectrally structured: low frequencies carry global layout and coarse motion; high frequencies carry texture and fine detail. We present FreqFormer, a frequency-aware heterogeneous attention framework. Token features are split into spectral bands with different operators: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and detail later. Cross-band summary tokens provide cheap residual exchange. FreqFormer is paired with a fused GPU execution plan that co-schedules dense, sparse, and local branches to cut kernel launches and memory traffic. We give a consistent complexity model, an orthonormal-decomposition view of approximation, and simulation-based systems numbers (throughput, arithmetic intensity, memory traffic, duration scaling). In simulations from 64K to 1M tokens, FreqFormer substantially reduces estimated attention FLOPs and KV-related memory traffic versus dense attention while keeping a hardware-friendly pattern, supporting spectrally structured heterogeneous attention as a practical direction for long-video diffusion transformers.
Comment: Allocates attention operators by frequency band and diffusion timestep, using spectral routing to reduce long-sequence video attention cost.
Topic Match: The primary contribution is a new attention architecture with dynamic heterogeneous routing across spectral bands, making architecture the best fit.
Relevance: 8 Novelty: 8
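The orthonormal-decomposition view mentioned in this abstract can be illustrated with a plain FFT band split. This is a sketch under assumptions, not FreqFormer itself: the abstract does not say which axis or transform is used, so the example splits along the token axis with `numpy.fft`, and the cutoffs `low_cut` and `high_cut` are hypothetical. It only shows that disjoint frequency bands are mutually orthogonal and sum back to the input, so each band can be handed to a different attention operator without losing content.

```python
import numpy as np

def split_bands(x, low_cut, high_cut):
    """Split features of shape (tokens, dim) into low, mid, and high bands.

    The bands use disjoint frequency masks along the token axis (an
    assumption of this sketch), so they add up to the original features.
    """
    num_tokens = x.shape[0]
    spectrum = np.fft.rfft(x, axis=0)
    freqs = np.arange(spectrum.shape[0])
    masks = [
        freqs < low_cut,
        (freqs >= low_cut) & (freqs < high_cut),
        freqs >= high_cut,
    ]
    return [
        np.fft.irfft(spectrum * mask[:, None], n=num_tokens, axis=0)
        for mask in masks
    ]

x = np.random.default_rng(0).normal(size=(64, 8))
low, mid, high = split_bands(x, low_cut=4, high_cut=16)

reconstruction_error = np.abs(low + mid + high - x).max()
cross_band_overlap = abs(np.sum(low * high))
print(f"reconstruction error: {reconstruction_error:.2e}")
print(f"low/high inner product: {cross_band_overlap:.2e}")
```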
11. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
ArXiv ID: 2604.24622
Primary Topic: Architecture and Training Dynamics
Also Matches: Efficiency, Compression, and Large-Scale Training
Authors: Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He
Abstract: Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $\pi_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
Comment: Reframes flow-based VLA action generation as coarse-to-fine inference, replacing long denoising from noise with structured initialization plus one-step refinement.
Topic Match: The main contribution is a new generative action architecture that changes the computational mechanism of action synthesis, with efficiency as a downstream consequence.
Relevance: 8 Novelty: 8
Efficiency, Compression, and Large-Scale Training (9)
1. FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training
ArXiv ID: 2604.24013
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed
Abstract: The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead, significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data-slicing-based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique to eliminate this tail latency in state-of-the-art overlap methods for distributed LLM training. The aim of this technique is to effectively mitigate the communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method termed Flash-Overlap that replaces conventional collective operations of reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it presents a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and high throughput.
Comment: Eliminates overlap tail latency in distributed LLM training by replacing collectives with decomposed P2P communication and fine-grained scheduled compute partitions.
Topic Match: This is squarely a large-scale training systems paper proposing a concrete communication algorithm that changes distributed training efficiency.
Relevance: 9 Novelty: 8
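To make the "decomposed peer-to-peer" idea concrete, here is a minimal sketch of a reduce-scatter expressed as point-to-point ring steps, simulated in plain Python. It is the textbook ring decomposition, not the schedule from the paper (the abstract does not specify one); the function name `ring_reduce_scatter` and the one-number-per-chunk layout are assumptions for illustration.

```python
def ring_reduce_scatter(data):
    """Simulate reduce-scatter as n-1 rounds of neighbour-to-neighbour sends.

    data[r][c] is the value of chunk c held by rank r (one number per chunk
    for simplicity). Returns {rank: fully reduced value of the chunk it owns}.
    Assumption: textbook ring schedule, not the paper's algorithm.
    """
    n = len(data)
    buf = [list(row) for row in data]  # per-rank working buffers
    for step in range(n - 1):
        # Every rank sends exactly one chunk to its right neighbour.
        # Messages are gathered first so all sends in a round are simultaneous.
        msgs = []
        for r in range(n):
            idx = (r - step) % n
            msgs.append(((r + 1) % n, idx, buf[r][idx]))
        # In a real system, each rank could run the partitioned computation
        # for another chunk here while these messages are in flight.
        for dst, idx, val in msgs:
            buf[dst][idx] += val
    # After n-1 rounds, rank r holds the complete sum of chunk (r + 1) % n.
    return {r: buf[r][(r + 1) % n] for r in range(n)}
```

Because each round moves a single chunk per rank, communication and per-chunk computation can in principle be interleaved at chunk granularity, which is the kind of fine-grained overlap the abstract describes.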
2. Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation
ArXiv ID: 2604.22783
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Irene Tenison, Stella Ahn, Miriam Kim, Ebtisam Alshehri, Lalana Kagal
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has become the standard for adapting large language models (LLMs). In this work we challenge the widespread assumption that parameter efficiency equates to memory efficiency and on-device adaptability. We show that this is not true: while methods like LoRA and IA3 significantly reduce trainable parameters, they remain bound by intermediate tensors that scale linearly with sequence length, often triggering out-of-memory errors on-device. We introduce LARS (Low-memory Activation-Rank Subspace), a novel adaptation framework that decouples memory consumption from sequence length. While prior PEFT methods apply low-rank constraints to model parameters, LARS instead constrains the activation subspace used during training, directly targeting the dominant source of memory consumption and fundamentally flattening the memory growth rate. LARS reduces the memory footprint by an average of 33.54% on GPUs and 51.95% on CPUs in comparison to LoRA across reasoning, understanding and long-context datasets using different models while maintaining competitive accuracy and throughput. Besides GPUs, we deploy on Raspberry Pi and consumer-grade CPUs to demonstrate that LARS provides a scalable path for sophisticated LLM personalization on resource-constrained hardware and edge devices.
Comment: LARS targets activation-memory growth directly, showing parameter-efficient fine-tuning is not the same as memory-efficient adaptation.
Topic Match: The central idea is a new low-memory adaptation method that changes training memory scaling behavior, which fits efficiency/compression exactly.
Relevance: 9 Novelty: 8
3. Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels
ArXiv ID: 2604.24008
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Abstract: Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels, hidden dimensions with unusually large activations, causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.
Comment: Reframes PTQ calibration-set selection as weighted outlier-channel coverage and gives a submodular greedy algorithm with surrogate-loss justification.
Topic Match: The main idea is a principled quantization calibration algorithm that materially improves low-bit compression behavior, squarely fitting efficiency and compression.
Relevance: 9 Novelty: 8
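The greedy selection step can be sketched as a budgeted weighted-coverage loop. This is a generic greedy routine for a monotone submodular coverage objective, not the authors' COVERCAL code; the function name, the dictionary inputs, and the alphabetical tie-breaking are assumptions for illustration.

```python
def greedy_weighted_coverage(sample_channels, channel_weight, budget):
    """Pick up to `budget` calibration samples by greedy weighted coverage.

    sample_channels: {sample_id: set of outlier channels that sample activates}
    channel_weight:  {channel: non-negative weight}
    Both inputs are hypothetical pre-computed activation statistics.
    Returns (chosen sample ids in pick order, set of covered channels).
    """
    covered = set()
    chosen = []
    remaining = set(sample_channels)
    for _ in range(budget):
        best, best_gain = None, 0.0
        # sorted() makes tie-breaking deterministic (first id wins).
        for s in sorted(remaining):
            # Marginal gain: total weight of channels not yet covered.
            gain = sum(channel_weight[c] for c in sample_channels[s] - covered)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:  # nothing adds coverage any more
            break
        chosen.append(best)
        covered |= sample_channels[best]
        remaining.discard(best)
    return chosen, covered
```

The loop touches only sets and weights, which is consistent with the abstract's point that selection runs on pre-computed statistics and needs no GPU time.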
4. ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
ArXiv ID: 2604.23798
Primary Topic: Efficiency, Compression, and Large-Scale Training
Also Matches: Architecture and Training Dynamics
Authors: Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee
Abstract: Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present ELSA, an algorithmic reformulation of online softmax attention that (i) preserves exact softmax semantics in real arithmetic with a provable $\mathcal{O}(u\log n)$ FP32 relative error bound; (ii) casts the online softmax update as a prefix scan over an associative monoid $(m,S,W)$, yielding $O(n)$ extra memory and $O(\log n)$ parallel depth; and (iii) is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a drop-in replacement requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource-constrained edge devices such as Jetson TX2 -- making it the only hardware-agnostic exact-attention kernel that reduces parallel depth to $O(\log n)$ at full precision. On A100 FP32 benchmarks (1K--16K tokens), ELSA delivers $1.3$--$3.5\times$ speedup over memory-efficient SDPA and $1.97$--$2.27\times$ on BERT; on Jetson TX2, ELSA achieves $1.5$--$1.6\times$ over Math (64--900 tokens), with $17.8$--$20.2\%$ throughput gains under LLaMA-13B offloading at $\ge$32K. In FP16, ELSA approaches hardware-fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high-precision inference across platforms. Our code and implementation are available at https://github.com/ming053l/ELSA.
Comment: Presents an exact softmax attention reformulation as an associative prefix scan, reducing memory to O(n) with O(log n) parallel depth and provable FP32 error bounds.
Topic Match: This is a core attention-computation improvement for fast, memory-light inference/training kernels, making efficiency the clearest primary fit.
Relevance: 9 Novelty: 8
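A rough illustration of why online softmax admits a prefix scan: partial attention states can be merged with an associative operator. The sketch below assumes the triple $(m,S,W)$ means running max score, rescaled sum of exponentials, and rescaled weighted value sum; the abstract names the monoid but does not define its components, so this reading and all function names are assumptions, and the code is not taken from the ELSA repository.

```python
import math
from functools import reduce

def lift(score, value):
    """Wrap one (score, value-vector) pair as a state (m, S, W)."""
    return (score, 1.0, list(value))

def combine(a, b):
    """Merge two partial softmax states; associative up to rounding."""
    m_a, s_a, w_a = a
    m_b, s_b, w_b = b
    m = max(m_a, m_b)
    # Rescale both sides to the shared max so exponentials stay bounded.
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    s = s_a * c_a + s_b * c_b
    w = [x * c_a + y * c_b for x, y in zip(w_a, w_b)]
    return (m, s, w)

def attention_via_combine(scores, values):
    """Softmax-weighted average of values for a single query row."""
    m, s, w = reduce(combine, (lift(sc, v) for sc, v in zip(scores, values)))
    return [x / s for x in w]

def attention_reference(scores, values):
    """Direct two-pass softmax, used only to check the merged result."""
    mx = max(scores)
    e = [math.exp(sc - mx) for sc in scores]
    z = sum(e)
    dim = len(values[0])
    return [sum(e[i] * values[i][d] for i in range(len(scores))) / z
            for d in range(dim)]
```

Since `combine` is associative, the same states can be reduced as a balanced tree rather than left to right, which is what allows logarithmic parallel depth in a scan implementation.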
5. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
ArXiv ID: 2604.24715
Primary Topic: Efficiency, Compression, and Large-Scale Training
Also Matches: Architecture and Training Dynamics, Memory Structures and Agent Memory Systems
Authors: Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum
Abstract: Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution HyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to $32\times$ through efficient post-training and reduces KV-cache memory by more than $90\%$, enabling up to 2M-token prefill and decoding in our vLLM inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.
Comment: Proposes long-context upcycling of pretrained Transformers into hybrid MLA-plus-linear-sequence models, greatly reducing KV-cache memory while extending context.
Topic Match: The strongest contribution is an efficiency-oriented scaling recipe for long-context hybridization and KV-cache reduction using existing checkpoints.
Relevance: 9 Novelty: 8
6. Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
ArXiv ID: 2604.23150
Primary Topic: Efficiency, Compression, and Large-Scale Training
Also Matches: Architecture and Training Dynamics
Authors: Abhimanyu Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee, Pengchao Wang, Changkyu Kim, Chunqiang Tang, Tushar Krishna
Abstract: Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collect over 100k real expert activation traces. Upon studying the expert activation patterns, we uncover various persistent properties across all the frontier MoE models: variable expert load imbalance, domain-specific expert activation where expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy to maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations help reduce all2all communication data up to 20, resulting in lower MoE decode latency and better accelerator utilization.
Comment: Characterizes expert activation patterns in frontier MoE models and uses them for workload-aware batching and expert placement to reduce multi-node all-to-all traffic.
Topic Match: The central contribution is systems-efficient scaling of MoE inference via analysis-driven communication reduction, even though it also sheds light on routing behavior.
Relevance: 9 Novelty: 8
7. Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models
ArXiv ID: 2604.24708
Primary Topic: Efficiency, Compression, and Large-Scale Training
Also Matches: Architecture and Training Dynamics
Authors: Hailing Cheng, Tao Huang, Chen Zhu, Antonio Alonso
Abstract: Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates -- a practice that leaves the rich space of learning rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning rate exploration at negligible communication overhead. HDET operates in alternating phases: a fan-out stage in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. The combined method produces a self-adapting learning rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or training budget. Crucially, the framework generalizes beyond learning rate: any scalar hyperparameter that does not alter model architecture -- such as dropout rate, attention scale temperature, or weight-decay coefficient -- can be explored across replicas using the same fan-out/converge protocol, with inter-replica loss differences serving as zero-order hypergradients that guide the search direction. HDET is implemented as a drop-in replacement for PyTorch's OneCycleLR scheduler, requiring no changes to model architecture, optimizer, or data pipeline.
Comment: Repurposes data-parallel replicas for hyperparameter-divergent training with periodic averaging, enabling online learning-rate exploration without extra sweep cost.
Topic Match: The main value is in large-scale training efficiency and optimizer/schedule search under fixed hardware budget, not in model architecture itself.
Relevance: 8 Novelty: 8
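The fan-out/converge protocol can be sketched on a toy problem. The log-symmetric spread, the toy quadratic loss, and the function names below are assumptions for illustration; the abstract does not give the exact spread or the auto-LR update, so the controller is left out.

```python
def symmetric_lr_spread(base_lr, n_replicas, width):
    """Learning rates placed symmetrically in log space around base_lr.

    Assumption: a multiplicative spread with factor (1 + width) at the
    extremes; the paper may use a different structured spread.
    """
    if n_replicas == 1:
        return [base_lr]
    offsets = [-1.0 + 2.0 * i / (n_replicas - 1) for i in range(n_replicas)]
    return [base_lr * (1.0 + width) ** o for o in offsets]

def toy_grad(w):
    """Gradient of the toy loss 0.5 * sum(w_i^2), which is w itself."""
    return list(w)

def hdet_round(params, lrs, steps):
    """One fan-out stage followed by one converge stage.

    Fan-out: each replica starts from the shared params and takes `steps`
    SGD steps with its own learning rate. In real data-parallel training
    each replica would also see different batches; here the gradient is
    deterministic for simplicity.
    Converge: parameters are averaged across replicas, standing in for
    the AllReduce averaging described in the abstract.
    """
    replicas = []
    for lr in lrs:
        w = list(params)
        for _ in range(steps):
            g = toy_grad(w)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
        replicas.append(w)
    n = len(replicas)
    return [sum(r[i] for r in replicas) / n for i in range(len(params))]
```

A call such as `hdet_round([1.0, -2.0], symmetric_lr_spread(0.1, 3, 0.5), steps=5)` runs three replicas with learning rates below, at, and above the base rate, then returns their averaged parameters.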
8. Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators
ArXiv ID: 2604.23205
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Animan Naskar
Abstract: Deploying proprietary Deep Neural Networks (DNNs) on commodity edge devices demands hardware-backed Digital Rights Management (DRM) capable of withstanding both software-level and physical adversaries. In Unified Memory Architecture (UMA) systems, the host CPU and Neural Processing Unit (NPU) share physical DRAM, leaving plaintext model weights directly readable by a compromised OS kernel. Existing defenses fail in this constrained setting: trusted execution environments monopolize scarce memory with permanently reserved regions, while full-memory encryption operates at page granularity. This forces the system to fetch massive 4 KB memory pages for sub-page tensor tiles, severely crippling bandwidth. We present Tessera, a reference architecture for inline, cache-line granularity weight decryption on UMA edge accelerators. The design intercepts 64-byte AXI bursts, computing AES-256-CTR keystreams in parallel with DRAM fetches. This streams plaintext directly into isolated NPU SRAM, creating a transient memory footprint confined to the active tile and eliminating the need for permanent memory carve-outs. Measurements across three distinct SoC platforms demonstrate that this parallelization hides cryptographic latency behind standard DRAM fetch times, a condition that holds even under worst-case timing variations. Consequently, Tessera is projected to achieve 98.4% of the theoretical memory bandwidth ceiling (a mere 1.6% overhead). Across standard vision and language models, page-level memory encryption suffers up to a 32x bandwidth penalty, whereas Tessera maintains an optimal 1x footprint for all layer geometries. Finally, Tessera neutralizes major UMA-specific attack vectors -- including physical DRAM extraction, rogue DMA, and compute hijacking -- and formally prevents plaintext leakage across sparse tensors.
Comment: Streams encrypted weights at cache-line granularity directly into isolated NPU SRAM, avoiding page-level memory-encryption bandwidth blowups.
Topic Match: Although security-motivated, the key technical contribution is a systems design that materially changes memory efficiency and bandwidth behavior for model execution.
Relevance: 8 Novelty: 8
9. Inference of Online Newton Methods with Nesterov's Accelerated Sketching
ArXiv ID: 2604.23436
Primary Topic: Efficiency, Compression, and Large-Scale Training
Authors: Haoxuan Wang, Xinchen Du, Sen Na
Abstract: Reliable decision-making with streaming data requires principled uncertainty quantification of online methods. While first-order methods enable efficient iterate updates, their inference procedures still require updating proper (covariance) matrices, incurring $O(d^2)$ time and memory complexity, and are sensitive to ill-conditioning and noise heterogeneity of the problem. This costly inference task offers an opportunity for more robust second-order methods, which are, however, bottlenecked by solving Newton systems with $O(d^3)$ complexity. In this paper, we address this gap by studying an online Newton method with Hessian averaging, where the Newton direction at each step is approximately computed using a sketch-and-project solver with Nesterov's acceleration, matching $O(d^2)$ complexity of first-order methods. For the proposed method, we quantify its uncertainty arising from both random data and randomized computation. Under standard smoothness and moment conditions, we establish global almost-sure convergence, prove asymptotic normality of the last iterate with a limiting covariance characterized by a Lyapunov equation, and develop a fully online covariance estimator with non-asymptotic convergence guarantees. We also connect the resulting uncertainty quantification to that of exact and sketched Newton methods without Nesterov's acceleration. Extensive experiments on regression models demonstrate the superiority of the proposed method for online inference.
Comment: Provides online Newton inference with accelerated sketching, preserving second-order uncertainty quantification at first-order-like O(d^2) cost.
Topic Match: The core contribution is an algorithmic efficiency improvement for online second-order optimization and uncertainty estimation under resource constraints.
Relevance: 8 Novelty: 8
Representation Learning Theory and Structure (13)
1. Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features
ArXiv ID: 2604.23829
Primary Topic: Representation Learning Theory and Structure
Also Matches: Memory Structures and Agent Memory Systems
Authors: John Winnicki, Abeynaya Gnanasekaran, Eric Darve
Abstract: Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren't very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there's no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than unlabeled layouts. In a case study on a biology textbook, these graphs recover coherent chapter and subchapter-level structure, reveal concepts that bridge neighboring topics, and transform messy sentence-level activity containing thousands of features into compact, readable views that illustrate the model's local activity. Taken together, this reframes a flat SAE inventory as an internal knowledge graph that converts feature-level interpretability into a global map of model knowledge and enables audits of reasoning faithfulness.
Comment: Builds domain-filtered knowledge graphs from SAE features, turning sparse features into structured maps of internal model concepts and pathways.
Topic Match: The main value is organizing and interpreting learned SAE features and their relations, which is a direct representation-structure contribution.
Relevance: 9 Novelty: 8
2. Representational Curvature Modulates Behavioral Uncertainty in Large Language Models
ArXiv ID: 2604.23985
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Jack King, Evelina Fedorenko, Eghbal A. Hosseini
Abstract: In autoregressive large language models (LLMs), temporal straightening offers an account of how the next-token prediction objective shapes representations. Models learn to progressively straighten the representational trajectory of input sequences across layers, potentially facilitating next-token prediction via linear extrapolation. However, a direct link between this trajectory and token-level behavior has been missing. We provide such a link by relating contextual curvature, a geometric measure of how sharply the representational trajectory bends over recent context, to next-token entropy. Across two models (GPT-2 XL and Pythia-2.8B), contextual curvature is correlated with entropy, and this relationship emerges during training. Perturbation experiments reveal selective dependence: manipulating curvature through trajectory-aligned interventions reliably modulates entropy, while geometrically misaligned perturbations have no effect. Finally, regularizing representations to be straighter during training modestly reduces token-level entropy without degrading validation loss. These results identify trajectory curvature as a task-aligned representational feature that influences behavioral uncertainty in LLMs.
Comment: Links representational curvature to token-level entropy and shows curvature interventions causally modulate behavioral uncertainty.
Topic Match: The paper directly studies geometric structure in learned representations and its behavioral consequences, making representation structure the clearest topic.
Relevance: 9 Novelty: 8
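One common way to quantify how sharply a trajectory bends is the angle between consecutive step vectors. The sketch below averages that angle over the most recent steps; treating this as "contextual curvature" is an assumption, since the abstract does not give the exact formula, and the function names and `window` parameter are hypothetical.

```python
import math

def turning_angles(trajectory):
    """Angles (radians) between consecutive difference vectors.

    trajectory: list of equal-length vectors, e.g. one hidden state per
    token at a fixed layer. Assumes consecutive states are distinct, so
    no difference vector has zero length.
    """
    diffs = [[b - a for a, b in zip(p, q)]
             for p, q in zip(trajectory, trajectory[1:])]
    angles = []
    for u, v in zip(diffs, diffs[1:]):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(y * y for y in v))
        # Clamp to [-1, 1] to guard acos against rounding error.
        cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
        angles.append(math.acos(cos))
    return angles

def contextual_curvature(trajectory, window):
    """Mean turning angle over the last `window` bends of the trajectory."""
    angles = turning_angles(trajectory)[-window:]
    if not angles:
        raise ValueError("need at least three states to measure a bend")
    return sum(angles) / len(angles)
```

Under this definition a perfectly straight trajectory has curvature 0 and a right-angle turn contributes pi/2, so lower values correspond to the "straighter" representations discussed in the abstract.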
3. On the Memorization of Consistency Distillation for Diffusion Models
ArXiv ID: 2604.23552
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Bingqing Jiang, Difan Zou
Abstract: Diffusion models are central to modern generative modeling, and understanding how they balance memorization and generalization is critical for reliable deployment. Recent work has shown that memorization in diffusion models is shaped by training dynamics, with generalization and memorization emerging at different stages of training. However, deployed diffusion models are often further distilled, introducing an additional training phase whose impact on memorization is not well understood. In this work, we analyze how distillation reshapes memorization behavior in diffusion models, taking consistency distillation as a representative framework. Empirically, we show that when applied to a teacher model that has memorized data, consistency distillation significantly reduces transferred memorization in the student while preserving, and sometimes improving, sample quality. To explain this behavior, we provide a theoretical analysis using a random feature neural network model [Bonnaire et al., 2025], showing that consistency distillation suppresses unstable feature directions associated with memorization while preserving stable, generalizable modes. Our findings suggest that distillation can serve not only as an acceleration tool, but also as a mechanism for improving the memorization-generalization trade-off.
Comment: Analyzes how consistency distillation suppresses unstable memorization directions while preserving generalizable modes in diffusion models.
Topic Match: The heart of the paper is a mechanistic and theoretical account of memorization versus generalization in learned features under distillation.
Relevance: 9 Novelty: 8
4. Causal Representation Learning from General Environments under Nonparametric Mixing
ArXiv ID: 2604.23800
Primary Topic: Representation Learning Theory and Structure
Authors: Ignavier Ng, Shaoan Xie, Xinshuai Dong, Peter Spirtes, Kun Zhang
Abstract: Causal representation learning aims to recover the latent causal variables and their causal relations, typically represented by directed acyclic graphs (DAGs), from low-level observations such as image pixels. A prevailing line of research exploits multiple environments under assumptions about how data distributions change, such as single-node interventions, coupled interventions, or hard interventions, or under parametric constraints on the mixing function or the latent causal model, such as linearity. Despite the novelty and elegance of these results, their assumptions are often violated in real problems. Accordingly, we formalize a set of desiderata for causal representation learning that applies to a broader class of environments, referred to as general environments. Interestingly, we show that one can fully recover the latent DAG and identify the latent variables up to minor indeterminacies under a nonparametric mixing function and nonlinear latent causal models, such as additive (Gaussian) noise models or heteroscedastic noise models, by properly leveraging sufficient change conditions on the causal mechanisms up to third-order derivatives. These represent, to our knowledge, the first results to fully recover the latent DAG from general environments under nonparametric mixing. Notably, our results match or improve upon many existing works, but require less restrictive assumptions about changing environments.
Comment: Shows latent DAG and variable recovery under general environments with nonparametric mixing and weaker assumptions than prior causal representation learning work.
Topic Match: The main result is foundational theory for recovering latent causal representations and their structure, squarely about representation identifiability.
Relevance: 8 Novelty: 9
5. Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data
ArXiv ID: 2604.24662
Primary Topic: Representation Learning Theory and Structure
Authors: K. Michael Martini, Eslam Abdelaleem, Paarth Gulati, Ilya Nemenman
Abstract: Identifying the dynamical state variables of a system from high-dimensional observations is a central problem across physical sciences. The challenge is that the state variables are not directly observable and must be inferred from raw high-dimensional data without supervision. Here we introduce DySIB (Dynamical Symmetric Information Bottleneck) as a method to learn low-dimensional representations of time-series data by maximizing predictive mutual information between past and future observation windows while penalizing representation complexity. This objective operates entirely in latent space and avoids reconstruction of the observations. We apply DySIB to an experimental video dataset of a physical pendulum, where the underlying state space is known. The method, with hyperparameters of the learning architecture set self-consistently by the data, recovers a two-dimensional representation that matches the dimensionality, topology, and geometry of the pendulum phase space, with the learned coordinates aligning smoothly with the canonical angle and angular velocity. These results demonstrate, on a well-characterized experimental system, that predictive information in latent space can be used to recover interpretable dynamical coordinates directly from high-dimensional data.
Comment: Uses a predictive information bottleneck objective to recover low-dimensional phase-space coordinates directly from high-dimensional time-series observations.
Topic Match: The paper centers on discovering interpretable latent state structure and dynamics-preserving representations from raw data.
Relevance: 8 Novelty: 8
6. The Power of Power Law: Asymmetry Enables Compositional Reasoning
ArXiv ID: 2604.22951
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Zixuan Wang, Xingyu Dang, Jason D. Lee, Kaifeng Lyu
Abstract: Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
Comment: Provides theory showing power-law training distributions can improve compositional reasoning by reshaping the loss landscape through asymmetry.
Topic Match: The paper is best read as foundational theory about how data distribution shapes learned compositional structure and training behavior.
Relevance: 8 Novelty: 8
7. Learning Curves and Benign Overfitting of Spectral Algorithms in Large Dimensions
ArXiv ID: 2604.23212
Primary Topic: Representation Learning Theory and Structure
Authors: Weihao Lu, Qian Lin, Yingcun Xia, Dongming Huang
Abstract: Existing large-dimensional theory for spectral algorithms resolves either the optimally tuned point or the interpolation limit, but leaves the under-regularized regime unexplored. We study the learning curve and benign overfitting of spectral algorithms in the large-dimensional setting where the sample size and dimension are of comparable order, i.e., $n \asymp d^{\gamma}$ for some $\gamma>0$. We first consider inner-product kernels on the sphere $\mathbb{S}^{d-1}$ and establish a sharp asymptotic characterization of the excess risk across the full regularization path under various source conditions $s \geq 0$, where $s$ measures the relative smoothness of the regression function. Our results reveal that the learning curve is not simply U-shaped but instead consists of three distinct regimes: over-regularized, under-regularized, and interpolation regimes. This characterization allows us to fully capture the benign overfitting phenomenon, demonstrating that benign overfitting arises consistently across both the under-regularized and interpolation regimes whenever $s$ is positive but no larger than a critical threshold. We further show that, in the sufficiently regularized regime, the kernel learning curve is recovered by an associated sequence model. Finally, we extend the learning-curve analysis to large-dimensional KRR for a class of kernels on general domains in $\mathbb{R}^d$ whose low-degree eigenspaces satisfy spectral-scaling and hyper-contractivity conditions.
Comment: Characterizes full-regularization-path learning curves and benign overfitting regimes for spectral algorithms in large dimensions.
Topic Match: This is foundational statistical learning theory about representation-learning algorithms and their generalization dynamics in high dimensions.
Relevance: 8 Novelty: 8
8. Quasi-Equivariant Metanetworks
ArXiv ID: 2604.23720
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Viet-Hoang Tran, An Nguyen, Benoît Guérand, Thieu N. Vo, Tan M. Nguyen
Abstract: Metanetworks are neural architectures designed to operate directly on pretrained weights to perform downstream tasks. However, the parameter space serves only as a proxy for the underlying function class, and the parameter-function mapping is inherently non-injective: distinct parameter configurations may yield identical input-output behaviors. As a result, metanetworks that rely solely on raw parameters risk overlooking the intrinsic symmetries of the architecture. Reasoning about functional identity is therefore essential for effective metanetwork design, motivating the development of equivariant metanetworks, which incorporate equivariance principles to respect architectural symmetries. Existing approaches, however, typically enforce strict equivariance, which imposes rigid constraints and often leads to sparse and less expressive models. To address this limitation, we introduce the novel concept of quasi-equivariance, which allows metanetworks to move beyond the rigidity of strict equivariance while still preserving functional identity. We lay down a principled basis for this framework and demonstrate its broad applicability across diverse neural architectures, including feedforward, convolutional, and transformer networks. Through empirical evaluation, we show that quasi-equivariant metanetworks achieve good trade-offs between symmetry preservation and representational expressivity. These findings advance the theoretical understanding of weight-space learning and provide a principled foundation for the design of more expressive and functionally robust metanetworks.
Comment: Quasi-equivariant metanetworks provide a principled symmetry-aware framework for learning over weight space without strict equivariance.
Topic Match: The core insight concerns the structure of neural-network parameter space and functional symmetries, best viewed as representation structure rather than a standard architecture tweak.
Relevance: 8 Novelty: 8
9. Necessary and sufficient conditions for universality of Kolmogorov-Arnold networks
ArXiv ID: 2604.23765
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Vugar Ismailov
Abstract: We analyze the universal approximation property of Kolmogorov-Arnold Networks (KANs) in terms of their edge functions. If these functions are all affine, then universality clearly fails. How many non-affine functions are needed, in addition to affine ones, to ensure universality? We show that a single one suffices. More precisely, we prove that deep KANs in which all edge functions are either affine or equal to a fixed continuous function $\sigma$ are dense in $C(K)$ for every compact set $K\subset\mathbb{R}^n$ if and only if $\sigma$ is non-affine. In contrast, for KANs with exactly two hidden layers, universality holds if and only if $\sigma$ is nonpolynomial. We further show that the full class of affine functions is not required; it can be replaced by a finite set without affecting universality. In particular, in the nonpolynomial case, a fixed family of five affine functions suffices when the depth is arbitrary. More generally, for every continuous non-affine function $\sigma$, there exists a finite affine family $A_\sigma$ such that deep KANs with edge functions in $A_\sigma\cup\{\sigma\}$ remain universal. We also prove that KANs with the spline-based edge parameterization introduced by Liu et al. (2024) are universal approximators in the classical sense, even when the spline degree and knot sequence are fixed in advance.
Comment: Provides necessary and sufficient universality conditions for KANs, showing when a single non-affine edge function suffices.
Topic Match: The contribution is fundamentally theoretical, characterizing expressive power of a learned representation class rather than proposing an application system.
Relevance: 8 Novelty: 8
10. A General Representation-Based Approach to Multi-Source Domain Adaptation
ArXiv ID: 2604.23790
Primary Topic: Representation Learning Theory and Structure
Authors: Ignavier Ng, Yan Li, Zijian Li, Yujia Zheng, Guangyi Chen, Kun Zhang
Abstract: A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches use deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and address the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label's Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of Markov blanket into those of the label's parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts.
Comment: Identifiability theory for which latent representation components must be transferred in general multi-source domain adaptation.
Topic Match: The core contribution is a theoretical characterization of representation structure via Markov blanket partitioning and identifiability, not an application-specific adaptation result.
Relevance: 8 Novelty: 8
11. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
ArXiv ID: 2604.23318
Primary Topic: Representation Learning Theory and Structure
Also Matches: Architecture and Training Dynamics
Authors: Xinzhu Chen, Wei He, Huichuan Fan, Wenzhe Niu, Zhongxiang Sun, Xuanru Wang, Jiuchong Gao, Jinghua Hao, Renqing He, Weijie Yu
Abstract: Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose Span-level Hidden state Enabled Advantage Reweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.
Comment: Uses hidden-state Wasserstein divergence to derive span-level credit assignment for RLVR without extra reward models or annotations.
Topic Match: The method hinges on a structural signal in hidden-state distributions and turns representation divergence into a mechanistic training signal.
Relevance: 8 Novelty: 8
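The reweighting idea in the SHEAR abstract can be illustrated with a small sketch. This is not the paper's formulation: it assumes each span's hidden states have already been reduced to scalar samples of equal size, and the normalization (distance divided by the mean distance) is a hypothetical choice made only for illustration. The function names `wasserstein_1d` and `reweight_advantage` are likewise invented here.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    # For equal-size samples, W1 is the mean gap between sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def reweight_advantage(advantage, span_distances, eps=1e-8):
    """Spread one rollout-level advantage over spans, scaled by distance.

    Assumption (not from the paper): weights are distances divided by
    their mean, so they average to roughly 1 across spans.
    """
    mean_d = sum(span_distances) / len(span_distances)
    return [advantage * d / (mean_d + eps) for d in span_distances]


# Hypothetical scalar projections of hidden states, grouped by span.
correct_spans = [[0.0, 0.1, 0.2], [1.0, 1.1, 1.2]]
incorrect_spans = [[0.0, 0.1, 0.2], [2.0, 2.1, 2.2]]

span_distances = [
    wasserstein_1d(c, w) for c, w in zip(correct_spans, incorrect_spans)
]
span_advantages = reweight_advantage(1.0, span_distances)
```

In this toy input the first span is identical across the two groups and receives zero weight, while the second span, where the groups separate, receives the amplified advantage.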
12. Constraint-Based Analysis of Reasoning Shortcuts in Neurosymbolic Learning
ArXiv ID: 2604.23377
Primary Topic: Representation Learning Theory and Structure
Authors: Akihiro Takemura, Katsumi Inoue, Masaaki Nishino
Abstract: Neurosymbolic systems can satisfy logical constraints during learning without achieving the intended concept-label correspondence; this is a problem known as reasoning shortcuts. We formalize reasoning shortcuts as a constraint satisfaction problem and investigate under which conditions concept mappings are uniquely determined by the constraints. We prove that a discrimination property (requiring that no valid concept mapping can be transformed into another valid mapping by swapping two concept values) is necessary for shortcut-freeness under bijective mappings, but demonstrate via a counterexample that it is insufficient even when the constraint graph is connected. We develop an ASP-based algorithm that verifies whether a given constraint set uniquely determines the intended concept mapping, with proven soundness and completeness. When shortcuts are detected, a greedy repair algorithm eliminates them by augmenting the constraint set, converging in at most $k$ iterations, where $k$ is the number of alternative valid mappings. We further provide a complexity classification: deciding shortcut-freeness is coNP-complete, counting shortcuts is #P-complete, and finding minimal repairs is NP-hard. We also establish sample complexity bounds showing that logarithmically many label queries suffice for disambiguation in favorable cases, while querying all ambiguous positions suffices in the worst case. Experiments across eight benchmark domains validate our approach.
Comment: Formalizes reasoning shortcuts in neurosymbolic learning and characterizes when constraints uniquely determine intended concept mappings.
Topic Match: The work is about identifiability and structural determination of learned concepts, which is central to representation learning theory and structure.
Relevance: 8 Novelty: 8
13. Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
ArXiv ID: 2604.23460
Primary Topic: Representation Learning Theory and Structure
Authors: Sharan Ramjee
Abstract: Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm - one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the "planning" phase of latent reasoning.
Comment: Shows that misaligned latent reasoning in continuous-thought models can be detected via early-token geometry and linear probes despite aligned outputs.
Topic Match: The core contribution is mechanistic structure in latent reasoning representations: identifying geometric separation of aligned versus misaligned internal states and probing them.
Relevance: 8 Novelty: 8
Memory Structures and Agent Memory Systems (5)
1. ZenBrain: A Neuroscience-Inspired 7-Layer Memory Architecture for Autonomous AI Systems
ArXiv ID: 2604.23878
Primary Topic: Memory Structures and Agent Memory Systems
Authors: Alexander Bering
Abstract: Despite a century of empirical memory research, existing AI agent memory systems rely on system-engineering metaphors (virtual-memory paging, flat LLM storage, Zettelkasten notes), none integrating principles of consolidation, forgetting, and reconsolidation. We present ZenBrain, a multi-layer memory architecture integrating fifteen neuroscience models. It implements seven memory layers (working, short-term, episodic, semantic, procedural, core, cross-context) orchestrated by nine foundational algorithms (Two-Factor Synaptic Model, vmPFC-coupled FSRS, Simulation-Selection sleep, Bayesian confidence, and five more) plus six new Predictive Memory Architecture (PMA) components: a four-channel NeuromodulatorEngine, prediction-error-gated ReconsolidationEngine, TripleCopyMemory with divergent decay, four-dimensional PriorityMap with amygdala fast-path, StabilityProtector (NogoA/HDAC3 analogue), and MetacognitiveMonitor for bias detection. The 15-algorithm ablation reveals a cooperative survival network: under stress, 9 of 15 algorithms become individually critical (delta-Q up to -93.7%, Wilcoxon, 10 seeds, alpha=0.005). Simulation-Selection sleep achieves 37% stability improvement (p<0.005) with 47.4% storage reduction. TripleCopyMemory retains S(t)=0.912 at 30 days; PriorityMap reaches NDCG@10=0.997. Multi-layer routing beats a flat single-layer baseline by 20.7% F1 on LoCoMo (p<0.005) and 19.5% on MemoryArena (p=0.015). On LongMemEval-500, ZenBrain holds the highest mean rank on all 12 system-judge cells (4 systems x 3 LLM judges), three-judge mean J=0.545 vs letta=0.485, a-mem=0.414, mem0=0.394; all 9 pair-wise contrasts clear Bonferroni (alpha=0.05/18, min p=6.2e-31, d in [0.18, 0.52]). Under LongMemEval's binary judge, ZenBrain reaches 91.3% of oracle accuracy at 1/106th the per-query token budget. Open-source with 11,589 automated test cases.
Comment: Seven-layer agent memory architecture introduces explicit mechanisms for consolidation, forgetting, reconsolidation, and routing across memory types.
Topic Match: The paper is squarely about agent memory structure and update principles, not generic agent tooling or chat-history management.
Relevance: 10 Novelty: 8
2. MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
ArXiv ID: 2604.24222
Primary Topic: Memory Structures and Agent Memory Systems
Authors: Mofei Li, Taozhi Chen, Guowei Yang, Jia Li
Abstract: Large Language Models (LLMs) excel at general code generation, but their performance drops sharply in enterprise settings that rely on internal private libraries absent from public pre-training corpora. While Retrieval-Augmented Generation (RAG) offers a training-free alternative by providing static API documentation, we find that such documentation typically provides only isolated definitions, leaving a fundamental knowledge gap. Specifically, LLMs struggle with a task-level lack of coordination patterns between APIs and an API-level misunderstanding of parameter constraints and boundary conditions. To address this, we propose MEMCoder, a novel framework that enables LLMs to autonomously accumulate and evolve Usage Guidelines across these two dimensions. MEMCoder introduces a Multi-dimensional Evolving Memory that captures distilled lessons from the model's own problem-solving trajectories. During inference, MEMCoder employs a dual-source retrieval mechanism to inject both static documentation and relevant historical guidelines into the context. The framework operates in an automated closed loop by using objective execution feedback to reflect on successes and failures, resolve knowledge conflicts, and dynamically update memory. Extensive evaluations on the NdonnxEval and NumbaEval benchmarks demonstrate that MEMCoder substantially enhances existing RAG systems, yielding an average absolute pass@1 gain of 16.31%. Furthermore, MEMCoder exhibits vastly superior domain-specific adaptation compared to existing memory-based continual learning methods.
Comment: Introduces an evolving external memory that accumulates and updates task- and API-level usage guidelines from execution feedback for code generation.
Topic Match: This is a genuine agent-memory mechanism with explicit storage, updating, conflict resolution, and retrieval, rather than standard RAG plumbing.
Relevance: 9 Novelty: 8
3. A Parametric Memory Head for Continual Generative Retrieval
ArXiv ID: 2604.23388
Primary Topic: Memory Structures and Agent Memory Systems
Also Matches: Efficiency, Compression, and Large-Scale Training
Authors: Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke
Abstract: Generative information retrieval (GenIR) consolidates retrieval into a single neural model that decodes document identifiers (docids) directly from queries. While this model-as-index paradigm offers architectural simplicity, it is poorly suited to dynamic document collections. Unlike modular systems, where indexes are easily updated, GenIR's knowledge is parametrically encoded in its weights; consequently, standard adaptation methods such as full and parameter-efficient fine-tuning can induce catastrophic forgetting. We show that sequential adaptation improves retrieval on newly added documents but substantially degrades performance on earlier slices, exposing a pronounced stability-plasticity trade-off. To address this, we propose post-adaptation memory tuning (PAMT), a memory-only stabilization stage that augments an adapted model with a modular parametric memory head (PMH). PAMT freezes the backbone and attaches a product-key memory with fixed addressing. During prefix-trie constrained decoding, decoder hidden states sparsely query PMH to produce residual corrections in hidden space; these corrections are mapped to score adjustments via the frozen output embedding matrix, computed only over trie-valid tokens. This guides docid generation while keeping routing and backbone parameters fixed. To limit cross-slice interference, PAMT updates only a fixed budget of memory values selected using decoding-time access statistics, prioritizing entries frequently activated by the current slice and rarely used in prior sessions. Experiments on MS MARCO and Natural Questions under sequential, disjoint corpus increments show that PAMT substantially improves retention on earlier slices with minimal impact on retrieval performance for newly added documents, while modifying only a sparse subset of memory values per session.
Comment: A modular parametric memory head addresses catastrophic forgetting in continual generative retrieval through sparse memory-only updates.
Topic Match: The main contribution is a new external/parametric memory mechanism for updating retrieval models over time while preserving old knowledge.
Relevance: 9 Novelty: 8
4. Graph Memory Transformer (GMT)
ArXiv ID: 2604.23862
Primary Topic: Memory Structures and Agent Memory Systems
Also Matches: Architecture and Training Dynamics
Authors: Nicola Zanarini, Niccolò Ferrari
Abstract: We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 × 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
Comment: Replaces transformer FFN sublayers with an explicit learned graph memory that routes token states over centroids and transitions, exposing inspectable memory navigation.
Topic Match: The defining idea is an explicit internal memory mechanism inside a decoder-only LM, making memory structures the most direct fit.
Relevance: 9 Novelty: 8
5. Skill Retrieval Augmentation for Agentic AI
ArXiv ID: 2604.24594
Primary Topic: Memory Structures and Agent Memory Systems
Authors: Weihang Su, Jianming Long, Qingyao Ai, Yichen Tang, Changyue Wang, Yiteng Tu, Yiqun Liu
Abstract: As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.
Comment: Formulates skill retrieval augmentation as a scalable external skill-memory problem and diagnoses skill incorporation as the main bottleneck.
Topic Match: The paper's core is dynamic retrieval and loading of reusable external skills as an agent memory mechanism, not standard RAG plumbing.
Relevance: 8 Novelty: 8
World Models, Exploration, and Open-Ended Reinforcement Learning (4)
1. Hierarchical Behaviour Spaces
ArXiv ID: 2604.24558
Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning
Authors: Michael Tryfan Matthews, Anssi Kanervisto, Jakob Foerster, Pierluca D'Oro, Scott Fujimoto, Mikael Henaff
Abstract: Recent work in hierarchical reinforcement learning has shown success in scaling to billions of timesteps when learning over a set of predefined option reward functions. We show that, instead of using a single reward function per option, the reward functions can be effectively used to induce a space of behaviours, by letting the controller specify linear combinations over reward functions, allowing a more expressive set of policies to be represented. We call this method Hierarchical Behaviour Spaces (HBS). We evaluate HBS on the NetHack Learning Environment, demonstrating strong performance. We conduct a series of experiments and determine that, perhaps going against conventional wisdom, the benefits of hierarchy in our method come from increased exploration rather than long term reasoning.
Comment: Replaces fixed option rewards with a controllable behavior space in hierarchical RL and argues the resulting gains arise mainly from improved exploration.
Topic Match: The paper is centrally about hierarchical RL and transferable behavior acquisition, with exploration as the main explanatory mechanism.
Relevance: 9 Novelty: 8
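The HBS abstract describes a controller that specifies linear combinations over predefined option reward functions. A minimal sketch of that reward mixing follows; the function name `combined_reward`, the three-option setup, and the numeric values are all hypothetical, and how the controller chooses its weights is not covered here.

```python
def combined_reward(weights, option_rewards):
    """Linear combination of per-option rewards chosen by a controller."""
    if len(weights) != len(option_rewards):
        raise ValueError("need one weight per option reward")
    return sum(w * r for w, r in zip(weights, option_rewards))


# Hypothetical per-step rewards from three predefined option reward functions.
option_rewards = [1.0, 0.0, -0.5]

# A one-hot weight vector recovers the usual single-reward-per-option case.
single_option = combined_reward([0.0, 1.0, 0.0], option_rewards)

# A mixed weight vector selects a behaviour between the predefined options.
mixed = combined_reward([0.5, 0.5, 0.0], option_rewards)
```

The one-hot case shows why the combination strictly generalizes a fixed option set: every original option is still reachable, and mixed weights add behaviours in between.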
2. Efficient learning by implicit exploration in bandit problems with side observations
ArXiv ID: 2604.24555
Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning
Authors: Tomas Kocak, Gergely Neu, Michal Valko, Remi Munos
Abstract: We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.
Comment: Introduces implicit exploration algorithms for partial-observability online learning with side observations, achieving strong regret without prior knowledge of the observation graph.
Topic Match: Though framed in bandits/online learning, the key contribution is a foundational new exploration principle closely aligned with the exploration part of the target topic.
Relevance: 8 Novelty: 8
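The abstract names implicit exploration without stating the estimator. A sketch of the form commonly associated with it is given below, under stated assumptions: the loss of an observed action is divided by its observation probability plus a small bias term `gamma`, and the estimates feed a standard exponential-weights update. The graph encoding (`observes[j]` is the set of actions revealed when `j` is played) and all names and numbers are hypothetical.

```python
import math


def observation_probs(p, observes):
    """Probability that each action's loss is revealed under distribution p."""
    n = len(p)
    return [sum(p[j] for j in range(n) if i in observes[j]) for i in range(n)]


def ix_loss_estimates(p, observes, played, losses, gamma):
    """Biased loss estimates: observed losses divided by (o_i + gamma)."""
    o = observation_probs(p, observes)
    return [
        losses[i] / (o[i] + gamma) if i in observes[played] else 0.0
        for i in range(len(p))
    ]


def exp_weights_update(weights, estimates, eta):
    """Multiplicative update on unnormalized weights."""
    return [w * math.exp(-eta * l) for w, l in zip(weights, estimates)]


# Hypothetical 3-action problem; observes[j] lists what playing j reveals.
observes = [{0, 1}, {1}, {2, 0}]
p = [0.5, 0.25, 0.25]
estimates = ix_loss_estimates(
    p, observes, played=0, losses=[1.0, 0.5, 0.2], gamma=0.25
)
new_weights = exp_weights_update([1.0, 1.0, 1.0], estimates, eta=0.1)
```

The estimates are computed only after the action is played, so the learner needs the observation structure for the update but not for action selection, which matches the setting described in the abstract. The `gamma` term keeps the denominator bounded away from zero without mixing in explicit uniform exploration.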
3. CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning
ArXiv ID: 2604.23308
Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning
Authors: Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong
Abstract: Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.
Comment: Diffusion-based on-policy trajectory augmentation tackles coordination failure in offline MARL by enabling co-adaptation as policies evolve.
Topic Match: This is foundational multi-agent RL work on improving offline coordination through generated on-policy-like experience, not an LLM post-training paper.
Relevance: 8 Novelty: 8
4. Dual Control of Linear Systems from Bilinear Observations with Belief Space Model Predictive Control
ArXiv ID: 2604.24663
Primary Topic: World Models, Exploration, and Open-Ended Reinforcement Learning
Also Matches: Memory Structures and Agent Memory Systems
Authors: Daniel Cao, Beixi Du, Andrew Lowitt, Sunmook Choi, Sarah Dean, Yahya Sattar
Abstract: We study finite-horizon quadratic control of linear systems with bilinear observations, in which the control input affects not only the state dynamics but also the partial observations of the state. In this setting, the separation principle can fail because control inputs influence the future quality of state estimates. State estimation requires an input-dependent Kalman filter whose gain and error covariance evolve as functions of the control inputs. To address this challenge, we propose a belief-space model predictive control ($\texttt{B-MPC}$) method that plans directly over both the estimated state and its error covariance. In particular, $\texttt{B-MPC}$ plans with a deterministic surrogate of the belief evolution defined by the input-dependent Kalman filter. Through numerical experiments in two synthetic settings, we show that $\texttt{B-MPC}$ can outperform both the separation-principle controller and its MPC variant in favorable regimes, and that these gains are accompanied by lower estimation covariance and more uncertainty-aware action choices.
Comment: Plans directly in belief space for systems with input-dependent observations, tackling dual control where actions alter future observability.
Topic Match: This is fundamentally about model-based decision-making under uncertainty with belief-state planning, which best fits world models and RL foundations.
Relevance: 8 Novelty: 8
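To see why the separation principle can fail here, consider a scalar sketch of an input-dependent Kalman variance update. The model below is an assumption for illustration, not the paper's system: the state follows x' = a·x + w with process variance q, and the measurement is y = (c0 + c1·u)·x + v with noise variance r, so the input u scales how informative the measurement is. The function name and default parameters are hypothetical.

```python
def kalman_variance_step(p, u, a=1.0, q=0.0, c0=0.0, c1=1.0, r=1.0):
    """One predict/update step of the error variance for a scalar system.

    Assumed model: x' = a*x + w (variance q), y = (c0 + c1*u)*x + v (variance r).
    """
    p_pred = a * a * p + q                    # predicted error variance
    c = c0 + c1 * u                           # input-dependent observation gain
    gain = p_pred * c / (c * c * p_pred + r)  # Kalman gain
    return (1.0 - gain * c) * p_pred          # posterior error variance


# Posterior variance after one step, starting from variance 1.0,
# for three candidate inputs.
posterior = {u: kalman_variance_step(1.0, u) for u in (0.0, 1.0, 3.0)}
```

With these defaults, a zero input yields an uninformative measurement and leaves the variance unchanged, while larger inputs shrink it. A planner that tracks this variance alongside the state estimate can therefore trade control cost against future estimation quality, which is the coupling the abstract attributes to belief-space planning.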
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Relevant Topics
Focus on specialized foundational research that remains worth reading even when it is not a daily hotspot.
Do not keep papers only because they are broadly frontier-relevant, widely discussed, or part of a major launch cycle. Broad daily frontier movement belongs in the hotspot digest unless the core contribution strongly matches the specialized topics below.
Architecture and Training Dynamics
- Keep: work that introduces or analyzes core architectural or computational mechanisms such as MoE routing, attention variants, normalization or residual design, recurrent or state-space sequence modeling, dynamic or modular computation, or training-stability mechanisms.
- Filter: papers that mainly apply an existing architecture to a new task or benchmark without new mechanistic insight.
Efficiency, Compression, and Large-Scale Training
- Keep: quantization, sparsity, pruning, low-rank adaptation, KV-cache or cache design, memory-efficient inference or training, distributed training algorithms, communication or optimizer improvements, and training-system designs that materially change large-model training cost or behavior.
- Filter: routine infrastructure optimization, deployment work, or straightforward tuning of standard efficiency methods without a clear new algorithmic or systems idea.
Representation Learning Theory and Structure
- Keep: work on feature formation, sparse or dictionary learning, contrastive or self-supervised representation structure, training dynamics, identifiability, or other mechanistic understanding of learned representations.
- Filter: papers that use representation-learning methods as standard components in downstream applications without new theoretical or methodological content.
Memory Structures and Agent Memory Systems
- Keep: internal or external memory mechanisms, differentiable memory, recurrent or latent memory, long-context memory organization, memory compression or eviction, retrieval as a learned memory mechanism, episodic or semantic memory for agents, memory consolidation, forgetting, and agent memory systems whose core contribution is a new principle for storing, updating, recalling, or reasoning over memory.
- Filter: standard RAG pipelines, vector-database plumbing, context stuffing, chat-history management, or agent products that add memory without a new memory mechanism, learning principle, or analysis.
World Models, Exploration, and Open-Ended Reinforcement Learning
- Keep: model-based RL, action-conditioned world models, imagination or planning-based agents, open-ended exploration, automatic curriculum or environment generation, continual RL, reward-free skill discovery, and RL methods aimed at learning new behaviors or transferable knowledge through interaction. Also keep foundational work on pre-training agents or world models, foundation world models, generative interactive environments, or theoretical arguments about why world models or exploration are necessary for general-purpose agents.
- Filter: RLHF, DPO, GRPO, RFT, instruction-following or alignment fine-tuning for LLMs; papers where RL is mainly a post-training optimizer for language models, reasoning traces, or tool-use agents without a new world-model, exploration, or generalization contribution; routine benchmark gains on a fixed environment without a new learning principle.
Usually leave these to the hotspot digest unless the core contribution is clearly foundational:
- major model or product releases
- broadly trendy agent or tooling launches
- benchmark, leaderboard, or evaluation-only papers
- downstream applications in medical imaging, segmentation, 3D vision, video understanding, information retrieval, summarization, recommendation, machine translation, speech recognition, time series, knowledge graphs, and similar domains
Scoring Criteria
Relevance and Novelty are independent axes. Score both from 1 to 10.
Relevance Scoring
- 9-10: directly centered on the target foundational topics; highest when the core contribution is clearly within them.
- 7-8: substantially related, but partly peripheral or focused on a narrower aspect.
- 5-6: touches the target topics, but the main contribution is elsewhere.
- 3-4: largely outside the target topics, often application-focused or domain-specific.
- 1-2: unrelated.
Important: Broad frontier relevance, major launch status, or daily buzz is not enough for a high Relevance score here. Those cases belong in the hotspot digest unless the paper strongly matches the specialized paper topics.
Novelty Scoring
- 9-10: new paradigm, theory, or major methodological breakthrough.
- 7-8: substantial methodological advance or strong new insight.
- 5-6: meaningful but incremental extension or refinement.
- 3-4: minor, narrow, or mostly engineering or domain-specific improvement.
- 1-2: little originality; mainly standard application of existing methods.
Topic Registry
Use exactly one PRIMARY_TOPIC_ID chosen from the stable topic IDs below.
- architecture_training: Architecture and Training Dynamics - Core architectural or computational mechanisms, dynamic computation, and training-stability dynamics.
- efficiency_scaling: Efficiency, Compression, and Large-Scale Training - Compression, sparsity, memory or cache efficiency, and large-scale training systems that materially change cost or behavior.
- representation_structure: Representation Learning Theory and Structure - How learned representations form, organize, and support generalization or mechanistic understanding.
- memory_systems: Memory Structures and Agent Memory Systems - Internal or external memory mechanisms, learned retrieval memory, consolidation, forgetting, and agent memory systems.
- world_models_open_ended_rl: World Models, Exploration, and Open-Ended Reinforcement Learning - World models, model-based RL, exploration, continual learning, and RL for transferable knowledge acquisition rather than LLM post-training.
Papers
[PAPER LIST HERE]
Instructions
Respond in JSONL. Output exactly one JSON object per paper, one per line:
{"ARXIVID":"...","COMMENT":"...","RELEVANCE":0,"NOVELTY":0,"PRIMARY_TOPIC_ID":"...","MATCHED_TOPIC_IDS":[],"TOPIC_MATCH_COMMENT":"...","HOTSPOT_PAPER_TAGS":[],"HOTSPOT_PAPER_COMMENT":"..."}
Rules:
- ARXIVID: the arXiv ID.
- COMMENT: identify the single strongest matching criterion. Be brief and specific. Do not rely on generic phrases like "language modeling" or "advancement". Do not mention non-matching criteria.
- RELEVANCE: integer from 1 to 10.
- NOVELTY: integer from 1 to 10.
- PRIMARY_TOPIC_ID: exactly one stable topic ID from the allowed topic registry.
- MATCHED_TOPIC_IDS: zero or more stable topic IDs from the same allowed set. Include PRIMARY_TOPIC_ID when there are multiple matches.
- TOPIC_MATCH_COMMENT: briefly explain why the primary topic is the best fit.
- HOTSPOT_PAPER_TAGS: zero or more tags from this exact set only: daily_hot, new_frontier.
- HOTSPOT_PAPER_COMMENT: briefly explain why the paper belongs in the daily hotspot paper feed when HOTSPOT_PAPER_TAGS is non-empty; otherwise use an empty string.
- Use HOTSPOT_PAPER_TAGS sparingly. Most papers should return [].
- daily_hot means the paper feels broadly important to the day and belongs in the daily hotspot paper section even if it is not part of the personalized foundational reading list.
- new_frontier means the paper appears to open a genuinely new direction, paradigm, or field, even if the work is still early.
- Do not output markdown, code fences, or any extra text.
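A response in this format can be checked mechanically before it is used. The following is a minimal sketch of such a check, not the digest's actual pipeline: the function name `validate_line` and the error messages are hypothetical, and the rule "Include PRIMARY_TOPIC_ID when there are multiple matches" is interpreted here as requiring the primary ID to appear in MATCHED_TOPIC_IDS whenever that list has more than one entry.

```python
import json

# Allowed values copied from the Topic Registry and Rules above.
TOPIC_IDS = {
    "architecture_training",
    "efficiency_scaling",
    "representation_structure",
    "memory_systems",
    "world_models_open_ended_rl",
}
HOTSPOT_TAGS = {"daily_hot", "new_frontier"}
REQUIRED_KEYS = {
    "ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY", "PRIMARY_TOPIC_ID",
    "MATCHED_TOPIC_IDS", "TOPIC_MATCH_COMMENT",
    "HOTSPOT_PAPER_TAGS", "HOTSPOT_PAPER_COMMENT",
}


def _is_score(value):
    """True for an integer from 1 to 10 (booleans are rejected)."""
    return isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 10


def _all_in(values, allowed):
    """True if values is a list whose items are all strings in allowed."""
    return isinstance(values, list) and all(
        isinstance(v, str) and v in allowed for v in values
    )


def validate_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    for key in ("RELEVANCE", "NOVELTY"):
        if not _is_score(record[key]):
            problems.append(f"{key} must be an integer from 1 to 10")

    primary = record["PRIMARY_TOPIC_ID"]
    if not isinstance(primary, str) or primary not in TOPIC_IDS:
        problems.append("PRIMARY_TOPIC_ID is not a known topic ID")

    matched = record["MATCHED_TOPIC_IDS"]
    if not _all_in(matched, TOPIC_IDS):
        problems.append("MATCHED_TOPIC_IDS must be a list of known topic IDs")
    elif len(matched) > 1 and primary not in matched:
        problems.append("MATCHED_TOPIC_IDS must include PRIMARY_TOPIC_ID")

    tags = record["HOTSPOT_PAPER_TAGS"]
    if not _all_in(tags, HOTSPOT_TAGS):
        problems.append("HOTSPOT_PAPER_TAGS must be a list of allowed tags")
    elif not tags and record["HOTSPOT_PAPER_COMMENT"] != "":
        problems.append("HOTSPOT_PAPER_COMMENT must be empty without tags")
    return problems


example = json.dumps({
    "ARXIVID": "2604.24663",
    "COMMENT": "Belief-space planning for input-dependent observations.",
    "RELEVANCE": 8,
    "NOVELTY": 8,
    "PRIMARY_TOPIC_ID": "world_models_open_ended_rl",
    "MATCHED_TOPIC_IDS": ["world_models_open_ended_rl", "memory_systems"],
    "TOPIC_MATCH_COMMENT": "Model-based decision-making under uncertainty.",
    "HOTSPOT_PAPER_TAGS": [],
    "HOTSPOT_PAPER_COMMENT": "",
})
print(validate_line(example))
```

Returning a list of problems rather than raising on the first one makes it easier to log every defect in a line before deciding whether to retry the request.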