Personalized Daily ArXiv Papers 2025-12-08

[gpt-5]	Prompt	Completion	Total
Token	31035	29728	60763
Cost	$0.04	$0.3	$0.34

Total arXiv papers: 377

Total scanned papers: 234

Total relevant papers: 14

Table of contents with paper titles:

Sparse Attention Post-Training for Mechanistic Interpretability Authors: Florent Draye, Anson Lei, Ingmar Posner, Bernhard Schölkopf
KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity Authors: Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau
Learnability Window in Gated Recurrent Neural Networks Authors: Lorenzo Livi
On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability Authors: Yiming Tang, Harshvardhan Saini, Yizhen Liao, Dianbo Liu
InvarDiff: Cross-Scale Invariance Caching for Accelerated Diffusion Models Authors: Zihao Wu
HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies Authors: Zhiying Du, Bei Liu, Yaobo Liang, Yichao Shen, Haidong Cao, Xiangyu Zheng, Zhiyuan Feng, Zuxuan Wu, Jiaolong Yang, Yu-Gang Jiang
Interaction Tensor SHAP Authors: Hiroki Hasegawa, Yukihiko Okada
Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws Authors: Zhengquan Luo, Zhiqiang Xu
CFO: Learning Continuous-Time PDE Dynamics via Flow-Matched Neural Operators Authors: Xianglong Hou, Xinquan Huang, Paris Perdikaris
LDLT $\mathcal{L}$-Lipschitz Network: Generalized Deep End-To-End Lipschitz Network Construction Authors: Marius F. R. Juston, Ramavarapu S. Sreenivas, Dustin Nottage, Ahmet Soylemezoglu
One-Step Diffusion Samplers via Self-Distillation and Deterministic Flow Authors: Pascal Jutras-Dube, Jiaru Zhang, Ziran Wang, Ruqi Zhang
Uncertainty Quantification for Scientific Machine Learning using Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN) Authors: Y. Sungtaek Ju
LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning Authors: Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
Continuous-Time Homeostatic Dynamics for Reentrant Inference Models Authors: Byung Gyu Chae

1. Sparse Attention Post-Training for Mechanistic Interpretability

ArXiv ID: 2512.05865

Authors: Florent Draye, Anson Lei, Ingmar Posner, Bernhard Schölkopf

Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.

Comment: Matches Compression/Efficiency and Representation Learning: post-training sparsity regularization makes transformer attention extremely sparse without loss, exposing interpretable connectivity.

Relevance: 10 Novelty: 8

2. KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

ArXiv ID: 2512.05916

Authors: Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau

Abstract: The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.

Comment: Matches Compression/Efficiency: provable KV-cache compression via optimal low-rank approximation of the attention matrix (attention fidelity guarantees).

Relevance: 10 Novelty: 8
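
The abstract's central point, that attention depends on query-key inner products rather than on the keys alone, can be illustrated numerically. The sketch below is not the authors' algorithm: it compares a truncated SVD of the keys against a truncated SVD of the score matrix Q K^T on synthetic data. The function names, shapes, and rank are hypothetical. By the Eckart-Young theorem, truncating Q K^T directly can never give a larger Frobenius error than any other rank-r approximation, including the keys-only one.

```python
import numpy as np

def truncate(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def compare_errors(Q, K, rank):
    """Return (keys-only error, direct error) for approximating Q @ K.T.

    Hypothetical helper for illustration only; it works on the pre-softmax
    score matrix and ignores how a real KV cache would store the factors.
    """
    scores = Q @ K.T
    # Baseline: compress the keys alone, then form the scores.
    err_keys_only = np.linalg.norm(scores - Q @ truncate(K, rank).T)
    # Alternative: take the optimal low-rank approximation of the scores.
    err_direct = np.linalg.norm(scores - truncate(scores, rank))
    return err_keys_only, err_direct

# Synthetic example: 64 tokens, head dimension 16, target rank 4.
rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 16))
K = rng.standard_normal((64, 16))
err_keys_only, err_direct = compare_errors(Q, K, rank=4)
print(f"keys-only: {err_keys_only:.3f}  direct: {err_direct:.3f}")
```

The gap between the two errors depends on how the spectra of Q and K interact, so the size of the improvement on real models is an empirical question that this toy example does not answer.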

3. Learnability Window in Gated Recurrent Neural Networks

ArXiv ID: 2512.05790

Authors: Lorenzo Livi

Abstract: We develop a theoretical framework that explains how gating mechanisms determine the learnability window $\mathcal{H}_N$ of recurrent neural networks, defined as the largest temporal horizon over which gradient information remains statistically recoverable. While classical analyses emphasize numerical stability of Jacobian products, we show that stability alone is insufficient: learnability is governed instead by the \emph{effective learning rates} $\mu_{t,\ell}$, per-lag and per-neuron quantities obtained from first-order expansions of gate-induced Jacobian products in Backpropagation Through Time. These effective learning rates act as multiplicative filters that control both the magnitude and anisotropy of gradient transport. Under heavy-tailed ($\alpha$-stable) gradient noise, we prove that the minimal sample size required to detect a dependency at lag~$\ell$ satisfies $N(\ell)\propto f(\ell)^{-\alpha}$, where $f(\ell)=\|\mu_{t,\ell}\|_1$ is the effective learning rate envelope. This leads to an explicit formula for $\mathcal{H}_N$ and closed-form scaling laws for logarithmic, polynomial, and exponential decay of $f(\ell)$. The theory predicts that broader or more heterogeneous gate spectra produce slower decay of $f(\ell)$ and hence larger learnability windows, whereas heavier-tailed noise compresses $\mathcal{H}_N$ by slowing statistical concentration. By linking gate-induced time-scale structure, gradient noise, and sample complexity, the framework identifies the effective learning rates as the fundamental quantities that govern when -- and for how long -- gated recurrent networks can learn long-range temporal dependencies.

Comment: Matches Model Architecture/Training Dynamics: theoretical analysis linking gating spectra in RNNs to gradient transport and the learnability window under heavy-tailed noise.

Relevance: 9 Novelty: 8
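
To see how a sample-size law of the form $N(\ell)\propto f(\ell)^{-\alpha}$ turns into a window, the following worked example inverts it for two decay profiles. It is a sketch under stated assumptions, not the paper's derivation: the proportionality constant $C$, the decay parameters $\lambda$ and $\beta$, and the assumption that $f$ is monotonically decreasing are all introduced here for illustration.

```latex
% Assumption: N(\ell) = C\, f(\ell)^{-\alpha} with a constant C > 0,
% and f monotonically decreasing in the lag \ell.
\begin{align*}
  \mathcal{H}_N
    &= \max\{\ell : N(\ell) \le N\}
     = \max\{\ell : f(\ell) \ge (C/N)^{1/\alpha}\}. \\[4pt]
  \text{Exponential decay, } f(\ell) = e^{-\lambda \ell}:\quad
  \mathcal{H}_N &= \frac{1}{\alpha\lambda}\,\ln\frac{N}{C}. \\[4pt]
  \text{Polynomial decay, } f(\ell) = \ell^{-\beta}:\quad
  \mathcal{H}_N &= \left(\frac{N}{C}\right)^{1/(\alpha\beta)}.
\end{align*}
```

Under these assumptions the window grows only logarithmically in $N$ when $f$ decays exponentially, and as a power of $N$ when $f$ decays polynomially, which matches the abstract's statement that slower decay of $f(\ell)$ gives larger windows. The constant $C$ may itself depend on $\alpha$ and on the noise scale, so the effect of tail heaviness cannot be read off from these expressions alone.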

4. On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability

ArXiv ID: 2512.05534

Authors: Yiming Tang, Harshvardhan Saini, Yizhen Liao, Dianbo Liu

Abstract: As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they process information has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have shown that neural networks represent meaningful concepts as directions in their representation spaces and often encode many concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into interpretable features. These methods have demonstrated remarkable empirical success but have limited theoretical understanding. Existing theoretical work is limited to sparse autoencoders with tied-weight constraints, leaving the broader family of SDL methods without formal grounding. In this work, we develop the first unified theoretical framework considering SDL as one unified optimization problem. We demonstrate how diverse methods instantiate the theoretical framework and provide rigorous analysis on the optimization landscape. We provide the first theoretical explanations for some empirically observed phenomena, including feature absorption, dead neurons, and the neuron resampling technique. We further design controlled experiments to validate our theoretical results.

Comment: Matches Representation Learning: unified theoretical framework and optimization landscape for sparse dictionary learning (sparse autoencoders/transcoders), explaining phenomena like feature absorption and dead neurons.

Relevance: 9 Novelty: 8

5. InvarDiff: Cross-Scale Invariance Caching for Accelerated Diffusion Models

ArXiv ID: 2512.05134

Authors: Zihao Wu

Abstract: Diffusion models deliver high-fidelity synthesis but remain slow due to iterative sampling. We empirically observe there exists feature invariance in deterministic sampling, and present InvarDiff, a training-free acceleration method that exploits the relative temporal invariance across timestep-scale and layer-scale. From a few deterministic runs, we compute a per-timestep, per-layer, per-module binary cache plan matrix and use a re-sampling correction to avoid drift when consecutive caches occur. Using quantile-based change metrics, this matrix specifies which module at which step is reused rather than recomputed. The same invariance criterion is applied at the step scale to enable cross-timestep caching, deciding whether an entire step can reuse cached results. During inference, InvarDiff performs step-first and layer-wise caching guided by this matrix. When applied to DiT and FLUX, our approach reduces redundant compute while preserving fidelity. Experiments show that InvarDiff achieves $2$-$3\times$ end-to-end speed-ups with minimal impact on standard quality metrics. Qualitatively, we observe almost no degradation in visual quality compared with full computations.

Comment: Strong match to Efficiency: training-free cross-timestep and cross-layer caching exploiting invariances to accelerate diffusion models 2–3x.

Relevance: 9 Novelty: 8
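
The cache plan matrix described in the abstract can be pictured with a small sketch. This is an assumption-laden illustration, not InvarDiff's implementation: the function name `cache_plan`, the relative L2 change metric, the single global quantile threshold, and the `(T, L, D)` feature layout are all hypothetical, and the re-sampling correction and step-level caching are left out.

```python
import numpy as np

def cache_plan(features, q=0.5):
    """Build a binary reuse plan from one deterministic sampling run.

    features: array of shape (T, L, D) holding the output of each of L
              layers at each of T timesteps (hypothetical layout).
    q:        quantile of the observed changes used as the reuse threshold.
    Returns a bool array of shape (T, L); True at [t, l] means "reuse the
    output cached at step t-1 for layer l instead of recomputing it".
    """
    # Relative change of every layer output between consecutive timesteps.
    diff = np.linalg.norm(features[1:] - features[:-1], axis=-1)
    base = np.linalg.norm(features[:-1], axis=-1) + 1e-12
    rel_change = diff / base                      # shape (T-1, L)

    # Layers whose change is at or below the chosen quantile are reused.
    threshold = np.quantile(rel_change, q)
    plan = np.zeros(features.shape[:2], dtype=bool)
    plan[1:] = rel_change <= threshold            # step 0 is always computed
    return plan

# Toy run: layer 0 never changes, layer 1 flips sign at every step.
T, D = 4, 3
feats = np.ones((T, 2, D))
feats[:, 1, :] *= np.array([1.0, -1.0, 1.0, -1.0])[:, None]
print(cache_plan(feats, q=0.5))
```

In this toy run the plan marks the static layer for reuse at every step after the first and recomputes the oscillating layer each time. A real plan would be averaged over several runs and over modules within a layer, as the abstract indicates.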

6. HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

ArXiv ID: 2512.05693

Authors: Zhiying Du, Bei Liu, Yaobo Liang, Yichao Shen, Haidong Cao, Xiangyu Zheng, Zhiyuan Feng, Zuxuan Wu, Jiaolong Yang, Yu-Gang Jiang

Abstract: The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces as well as other prominent variations such as sensor configurations and action control frequencies. The lack of explicit designs for handling such heterogeneity causes existing methods to struggle with integrating diverse factors, thereby limiting their generalization and leading to degraded performance when transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to effectively handle diverse robotic data with heterogeneity. Specifically, we introduce a Hierarchical Mixture-of-Experts (HiMoE) architecture for the action module which adaptively handles multiple sources of heterogeneity across layers and gradually abstracts them into shared knowledge representations. Through extensive experimentation with simulation benchmarks and real-world robotic platforms, HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization across diverse robots and action spaces. The code and models are publicly available at https://github.com/ZhiyingDu/HiMoE-VLA.

Comment: Strong match to Model Architecture: Hierarchical Mixture-of-Experts (MoE) action module for heterogeneous VLA policies.

Relevance: 9 Novelty: 7

7. Interaction Tensor SHAP

ArXiv ID: 2512.05338

Authors: Hiroki Hasegawa, Yukihiko Okada

Abstract: Machine learning models have grown increasingly deep and high dimensional, making it difficult to understand how individual and combined features influence their predictions. While Shapley value based methods provide principled feature attributions, existing formulations cannot tractably evaluate higher order interactions: the Shapley Taylor Interaction Index (STII) requires exponential scale enumeration of subsets, and current tensor based approaches such as the Marginal SHAP Tensor (MST) are restricted to first order effects. The central problem is that no existing framework simultaneously preserves the axiomatic exactness of STII and avoids the exponential computational blow up inherent to high order discrete derivatives. Here we show that high order Shapley interactions can be represented exactly as tensor network contractions, enabling polynomial time and polylog depth computation under Tensor Train (TT) structure. We introduce Interaction Tensor SHAP (IT SHAP), which reformulates STII as the contraction of a Value Tensor and a Weight Tensor, and assume a finite state TT representation of the Weight Tensor with polynomial TT ranks. Under TT structured model and distribution tensors, we show that IT SHAP reduces the exponential complexity $\Theta(4^n)$ of STII to NC2 parallel time. These results demonstrate that IT SHAP provides a unified, axiomatic, and computationally tractable formulation of main effects and higher order interactions in high dimensional models. This framework establishes a foundation for scalable interaction aware explainable AI, with implications for large black box models whose combinatorial structure has previously rendered interaction analysis infeasible.

Comment: Matches Representation Learning/interpretability: exact high-order Shapley interactions via tensor-network contractions enabling polynomial-time computation under TT structure.
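
For context on the enumeration cost the abstract refers to, here is a brute-force sketch of the order-2 Shapley-Taylor Interaction Index. It follows the standard STII definition as I understand it (discrete derivatives at the empty set for singletons, a weighted sum over all complementary subsets for pairs) and is not the IT SHAP method. The function names and the toy value function are hypothetical.

```python
from itertools import combinations
from math import comb

def discrete_derivative(v, S, T):
    """delta_S v(T): signed sum of v(T union W) over all subsets W of S."""
    total = 0.0
    for k in range(len(S) + 1):
        for W in combinations(S, k):
            sign = (-1) ** (len(S) - k)
            total += sign * v(frozenset(T) | frozenset(W))
    return total

def stii_order2(v, n):
    """Order-2 STII for a value function v over players 0..n-1.

    Singletons get the discrete derivative at the empty set; pairs get
    (2/n) * sum over T in the complement of delta_S v(T) / C(n-1, |T|).
    The loop over T is the exponential part of the computation.
    """
    players = range(n)
    index = {(i,): discrete_derivative(v, (i,), ()) for i in players}
    for S in combinations(players, 2):
        rest = [p for p in players if p not in S]
        acc = 0.0
        for t in range(len(rest) + 1):
            for T in combinations(rest, t):
                acc += discrete_derivative(v, S, T) / comb(n - 1, t)
        index[S] = (2 / n) * acc
    return index

# Toy value function (hypothetical): the squared size of the coalition.
toy_value = lambda coalition: len(coalition) ** 2
indices = stii_order2(toy_value, n=3)
print(indices)
print(sum(indices.values()))  # efficiency: should equal v(all) - v(empty)
```

Each pair requires a pass over every subset of the remaining players, so the work grows exponentially with the number of features. That growth is the bottleneck the tensor-network formulation in the abstract is meant to avoid.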