Personalized Daily ArXiv Papers 2025-04-23

[gpt-4o]	Prompt	Completion	Total
Token	33658	4226	37884
Cost	$0.08	$0.04	$0.13

Total arXiv papers: 346

Total scanned papers: 212

Total relevant papers: 16

Table of contents with paper titles:

Markov Kernels, Distances and Optimal Control: A Parable of Linear Quadratic Non-Gaussian Distribution Steering Authors: Alexis M. H. Teter, Wenqing Wang, Sachin Shivakumar, Abhishek Halder
Transport f divergences Authors: Wuchen Li
Universal Approximation with Softmax Attention Authors: Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu
Shannon invariants: A scalable approach to information decomposition Authors: Aaron J. Gutknecht, Fernando E. Rosas, David A. Ehrlich, Abdullah Makkeh, Pedro A. M. Mediano, Michael Wibral
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement Authors: Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin
Learning Adaptive Parallel Reasoning with Language Models Authors: Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr
Riemannian Neural Geodesic Interpolant Authors: Jiawen Wu, Bingguang Chen, Yuyi Zhou, Qi Meng, Rongchan Zhu, Zhi-Ming Ma
SUPRA: Subspace Parameterized Attention for Neural Operator on General Domains Authors: Zherui Yang, Zhengyang Xue, Ligang Liu
Deep learning with missing data Authors: Tianyi Ma, Tengyao Wang, Richard J. Samworth
Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks Authors: Jeremy Goldwasser, Giles Hooker
An XAI-based Analysis of Shortcut Learning in Neural Networks Authors: Phuong Quynh Le, J\"org Schl\"otterer, Christin Seifert
W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models Authors: Shang Wang
Emergence and Evolution of Interpretable Concepts in Diffusion Models Authors: Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi
Low-Rank Adaptation of Neural Fields Authors: Anh Truong, Ahmed H. Mahmoud, Mina Konakovi\'c Lukovi\'c, Justin Solomon
Analytical Softmax Temperature Setting from Feature Dimensions for Model- and Domain-Robust Classification Authors: Tatsuhito Hasegawa, Shunsuke Sakai
Improving Learning to Optimize Using Parameter Symmetries Authors: Guy Zamir, Aryan Dokania, Bo Zhao, Rose Yu

1. Markov Kernels, Distances and Optimal Control: A Parable of Linear Quadratic Non-Gaussian Distribution Steering

ArXiv ID: 2504.15753

Authors: Alexis M. H. Teter, Wenqing Wang, Sachin Shivakumar, Abhishek Halder

Abstract: For a controllable linear time-varying (LTV) pair $(\boldsymbol{A}_t,\boldsymbol{B}_t)$ and $\boldsymbol{Q}_{t}$ positive semidefinite, we derive the Markov kernel for the It\^{o} diffusion ${\mathrm{d}}\boldsymbol{x}_{t}=\boldsymbol{A}_{t}\boldsymbol{x}_t {\mathrm{d}} t + \sqrt{2}\boldsymbol{B}_{t}{\mathrm{d}}\boldsymbol{w}_{t}$ with an accompanying killing of probability mass at rate $\frac{1}{2}\boldsymbol{x}^{\top}\boldsymbol{Q}_{t}\boldsymbol{x}$. This Markov kernel is the Green's function for an associated linear reaction-advection-diffusion partial differential equation. Our result generalizes the recently derived kernel for the special case $\left(\boldsymbol{A}_t,\boldsymbol{B}_t\right)=\left(\boldsymbol{0},\boldsymbol{I}\right)$, and depends on the solution of an associated Riccati matrix ODE. A consequence of this result is that the linear quadratic non-Gaussian Schr\"{o}dinger bridge is exactly solvable. This means that the problem of steering a controlled LTV diffusion from a given non-Gaussian distribution to another over a fixed deadline while minimizing an expected quadratic cost can be solved using dynamic Sinkhorn recursions performed with the derived kernel. Our derivation for the $\left(\boldsymbol{A}_t,\boldsymbol{B}_t,\boldsymbol{Q}_t\right)$-parametrized kernel pursues a new idea that relies on finding a state-time dependent distance-like functional given by the solution of a deterministic optimal control problem. This technique breaks away from existing methods, such as generalizing Hermite polynomials or Weyl calculus, which have seen limited success in the reaction-diffusion context. Our technique uncovers a new connection between Markov kernels, distances, and optimal control. This connection is of interest beyond its immediate application in solving the linear quadratic Schr\"{o}dinger bridge problem.

Comment: The paper explores Markov kernels and optimal control, introducing a novel connection between Markov kernels, distances, and control. This is a cutting-edge theoretical contribution with potential foundational impact.

Relevance: 9 Novelty: 9
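
Note on the PDE mentioned in the abstract: the equation itself is not quoted there. Under the standard Fokker-Planck convention for a diffusion with drift $\boldsymbol{A}_t\boldsymbol{x}$, noise coefficient $\sqrt{2}\boldsymbol{B}_t$ and killing rate $\frac{1}{2}\boldsymbol{x}^{\top}\boldsymbol{Q}_t\boldsymbol{x}$, a reaction-advection-diffusion equation of the following form would be expected. This is a reconstruction under that assumption, not the paper's own statement; the density symbol $p$ is introduced here for illustration.

```latex
\partial_t p(t,\boldsymbol{x})
  = \underbrace{-\nabla_{\boldsymbol{x}}\cdot\left(\boldsymbol{A}_t\boldsymbol{x}\,p\right)}_{\text{advection}}
  + \underbrace{\mathrm{tr}\left(\boldsymbol{B}_t\boldsymbol{B}_t^{\top}\nabla_{\boldsymbol{x}}^{2}p\right)}_{\text{diffusion}}
  - \underbrace{\tfrac{1}{2}\boldsymbol{x}^{\top}\boldsymbol{Q}_t\boldsymbol{x}\,p}_{\text{reaction (killing)}}
```

The factor $\sqrt{2}$ in the noise term cancels the usual $\frac{1}{2}$ in front of the diffusion matrix, which is why $\boldsymbol{B}_t\boldsymbol{B}_t^{\top}$ appears without a prefactor.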

2. Transport f divergences

ArXiv ID: 2504.15515

Authors: Wuchen Li

Abstract: We define a class of divergences to measure differences between probability density functions in one-dimensional sample space. The construction is based on a convex function of the Jacobi operator of the mapping function that pushes one density forward to the other. We call these information measures {\em transport $f$-divergences}. We present several properties of transport $f$-divergences, including invariances, convexities, variational formulations, and Taylor expansions in terms of mapping functions. Examples of transport $f$-divergences in generative models are provided.

Comment: The paper introduces transport f-divergences, a novel theoretical framework for measuring differences between probability densities, aligning with emerging trends and foundational research.

Relevance: 9 Novelty: 9

3. Universal Approximation with Softmax Attention

ArXiv ID: 2504.15956

Authors: Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu

Abstract: We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers are capable of approximating various statistical models in-context. We believe these techniques hold independent interest.

Comment: The paper provides theoretical insights into the universal approximation capabilities of softmax attention, which is highly relevant to foundational research in model architecture.

Relevance: 9 Novelty: 9
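
Sketch of the building block the abstract refers to: a single-head softmax self-attention layer with linear query, key and value maps and no feed-forward network. The 1/sqrt(d) scaling, the single head and the random weights are conventional assumptions made here for illustration; this is not the paper's approximation construction.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head softmax self-attention on a sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 4  # illustrative sequence length and width
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Each output row is a convex combination of the value rows, which is the only nonlinearity available to an attention-only layer.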

4. Shannon invariants: A scalable approach to information decomposition

ArXiv ID: 2504.15779

Authors: Aaron J. Gutknecht, Fernando E. Rosas, David A. Ehrlich, Abdullah Makkeh, Pedro A. M. Mediano, Michael Wibral

Abstract: Distributed systems, such as biological and artificial neural networks, process information via complex interactions engaging multiple subsystems, resulting in high-order patterns with distinct properties across scales. Investigating how these systems process information remains challenging due to difficulties in defining appropriate multivariate metrics and ensuring their scalability to large systems. To address these challenges, we introduce a novel framework based on what we call "Shannon invariants" -- quantities that capture essential properties of high-order information processing in a way that depends only on the definition of entropy and can be efficiently calculated for large systems. Our theoretical results demonstrate how Shannon invariants can be used to resolve long-standing ambiguities regarding the interpretation of widely used multivariate information-theoretic measures. Moreover, our practical results reveal distinctive information-processing signatures of various deep learning architectures across layers, which lead to new insights into how these systems process information and how this evolves during training. Overall, our framework resolves fundamental limitations in analyzing high-order phenomena and offers broad opportunities for theoretical developments and empirical analyses.

Comment: The paper introduces 'Shannon invariants' for scalable information decomposition, offering insights into how deep learning architectures process information. This aligns with representation learning and training dynamics.

Relevance: 9 Novelty: 8
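
Sketch of what "depends only on the definition of entropy" can look like in practice. The abstract does not define the Shannon invariants themselves, so this example uses three widely used entropy-based multivariate measures (total correlation, dual total correlation, and their difference, the O-information) as stand-ins. Whether these coincide with the paper's invariants is an assumption; the function names are hypothetical.

```python
import itertools
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def marginal(samples, idx):
    """Restrict each joint state to the variables listed in `idx`."""
    return [tuple(s[i] for i in idx) for s in samples]

def total_correlation(samples):
    n = len(samples[0])
    return sum(entropy(marginal(samples, [i])) for i in range(n)) - entropy(samples)

def dual_total_correlation(samples):
    n = len(samples[0])
    h_joint = entropy(samples)
    # H(X_i | rest) = H(X) - H(rest)
    cond = sum(
        h_joint - entropy(marginal(samples, [j for j in range(n) if j != i]))
        for i in range(n)
    )
    return h_joint - cond

# Toy system: two fair bits and their XOR, listed as equally likely states.
xor_states = [(a, b, a ^ b) for a, b in itertools.product([0, 1], repeat=2)]
tc = total_correlation(xor_states)
dtc = dual_total_correlation(xor_states)
o_info = tc - dtc  # negative values are usually read as synergy-dominated
```

For the XOR system this gives a total correlation of 1 bit, a dual total correlation of 2 bits and an O-information of -1 bit, using nothing beyond joint and marginal entropies.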

5. LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement

ArXiv ID: 2504.16053

Authors: Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin

Abstract: State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers in long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths largely exceed the training sequence length, global channels exhibit limitations in adaptively extending their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training. Our code is available at https://github.com/GATECH-EIC/LongMamba.

Comment: The paper introduces LongMamba, a training-free method to enhance state space models (SSMs) for long-context understanding. It aligns with foundational research in model architecture by addressing limitations in SSMs and proposing a novel technique for improving their performance.

Relevance: 9 Novelty: 8
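
Toy illustration of the memory-decay argument in the abstract, using a scalar linear recurrence h_t = a_t * h_{t-1} + b_t * x_t as a stand-in for one hidden channel. The constant decay 0.99, the "keep every 10th token" rule and the function name `scan` are assumptions chosen for illustration; the abstract does not say how LongMamba selects critical tokens, and the linked repository is the authoritative implementation.

```python
import numpy as np

def scan(x, a, b, keep=None):
    """Run h_t = a_t * h_{t-1} + b_t * x_t, skipping tokens where keep[t] is False."""
    h = 0.0
    for t in range(len(x)):
        if keep is not None and not keep[t]:
            continue  # filtered token: the state is left untouched, so no decay
        h = a[t] * h + b[t] * x[t]
    return h

T = 1000
a = np.full(T, 0.99)  # per-step decay of the hidden state
b = np.full(T, 0.01)  # per-step input gain
x = np.zeros(T)
x[0] = 1.0            # only the first token carries signal

keep = np.zeros(T, dtype=bool)
keep[::10] = True     # hypothetical filter: keep every 10th token (index 0 included)

h_full = scan(x, a, b)            # signal decays over 999 further updates
h_filtered = scan(x, a, b, keep)  # signal decays over only 99 further updates
```

With every token accumulated, the first token's contribution shrinks by a factor of 0.99 per step; skipping unimportant tokens means fewer decay steps, so far more of the early signal survives at the end of the sequence.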

6. Learning Adaptive Parallel Reasoning with Language Models

ArXiv ID: 2504.15466

Authors: Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr

Abstract: Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.

Comment: The paper introduces Adaptive Parallel Reasoning (APR), which explores novel reasoning frameworks for LLMs, aligning with the criterion of theoretical insights into LLM behavior and architecture-level innovations.

Relevance: 9 Novelty: 8
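
Sketch of the spawn()/join() control flow on a tiny Countdown instance. In APR the language model itself decides when to spawn and join, and is trained end-to-end with reinforcement learning; here a fixed exhaustive search and Python threads stand in for the parent and child inference threads. All function names are hypothetical, and the operation set (+, *, absolute difference) is a simplification.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def expand(numbers):
    """Yield (new_numbers, step) for every way to combine one pair of numbers."""
    for i, j in combinations(range(len(numbers)), 2):
        x, y = numbers[i], numbers[j]
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        for value, step in ((x + y, f"{x}+{y}"), (x * y, f"{x}*{y}"), (abs(x - y), f"|{x}-{y}|")):
            yield rest + [value], step

def child_search(numbers, target, trace):
    """Child thread: serial depth-first search from one partial solution."""
    if target in numbers:
        return trace
    for nxt, step in expand(numbers):
        found = child_search(nxt, target, trace + [step])
        if found is not None:
            return found
    return None

def parent_search(numbers, target, max_workers=4):
    """Parent thread: spawn one child per first move, then join their results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # spawn(): each child explores a different first step in parallel
        futures = [pool.submit(child_search, nxt, target, [step]) for nxt, step in expand(numbers)]
        # join(): wait for all children and collect what they found
        results = [f.result() for f in futures]
    return next((r for r in results if r is not None), None)

solution = parent_search([3, 5, 7], 22)  # finds 3*5 = 15, then 7+15 = 22
```

Each child keeps its own short trace, so the parent only sees the joined summaries rather than every explored branch, which is the context-saving effect the abstract describes.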

7. Riemannian Neural Geodesic Interpolant

ArXiv ID: 2504.15736

Authors: Jiawen Wu, Bingguang Chen, Yuyi Zhou, Qi Meng, Rongchan Zhu, Zhi-Ming Ma

Abstract: Stochastic interpolants are efficient generative models that bridge two arbitrary probability density functions in finite time, enabling flexible generation from the source to the target distribution or vice versa. These models are primarily developed in Euclidean space, and are therefore limited in their application to many distribution learning problems defined on Riemannian manifolds in real-world scenarios. In this work, we introduce the Riemannian Neural Geodesic Interpolant (RNGI) model, which interpolates between two probability densities on a Riemannian manifold along the stochastic geodesics, and then samples from one endpoint as the final state using the continuous flow originating from the other endpoint. We prove that the temporal marginal density of RNGI solves a transport equation on the Riemannian manifold. After training the model's neural velocity and score fields, we propose the Embedding Stochastic Differential Equation (E-SDE) algorithm for stochastic sampling of RNGI. E-SDE significantly improves the sampling quality by reducing the accumulated error caused by the excessive intrinsic discretization of Riemannian Brownian motion in the classical Geodesic Random Walk (GRW) algorithm. We also provide theoretical bounds on the generative bias measured in terms of KL-divergence. Finally, we demonstrate the effectiveness of the proposed RNGI and E-SDE through experiments conducted on both collected and synthetic distributions on $S^2$ and SO(3).

Comment: The paper introduces a Riemannian Neural Geodesic Interpolant for generative modeling on manifolds, which is a novel contribution to model architecture and representation learning.

Relevance: 8 Novelty: 8
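
Sketch of the deterministic part of geodesic interpolation on the unit sphere S^2, the simplest of the manifolds named in the abstract. It uses the standard spherical linear interpolation formula; the stochastic component of the paper's interpolant, the learned velocity and score fields, and the E-SDE sampler are all omitted. The function name `slerp` is a common convention, not taken from the paper.

```python
import numpy as np

def slerp(x0, x1, t):
    """Point at time t in [0, 1] on the great-circle geodesic from x0 to x1.

    x0 and x1 are unit vectors; antipodal pairs (no unique geodesic) are not handled.
    """
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))  # geodesic distance
    if np.isclose(omega, 0.0):
        return x0.copy()
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

x0 = np.array([1.0, 0.0, 0.0])
x1 = np.array([0.0, 1.0, 0.0])
mid = slerp(x0, x1, 0.5)                               # halfway along the quarter circle
path = np.stack([slerp(x0, x1, t) for t in np.linspace(0.0, 1.0, 11)])
```

Unlike straight-line interpolation in the embedding space, every point on this path stays on the sphere, which is the property a manifold-valued interpolant needs.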

8. SUPRA: Subspace Parameterized Attention for Neural Operator on General Domains

ArXiv ID: 2504.15897

Authors: Zherui Yang, Zhengyang Xue, Ligang Liu

Abstract: Neural operators are efficient surrogate models for solving partial differential equations (PDEs), but their key components face challenges: (1) in order to improve accuracy, attention mechanisms suffer from computational inefficiency on large-scale meshes, and (2) spectral convolutions rely on the Fast Fourier Transform (FFT) on regular grids and assume a flat geometry, which causes accuracy degradation on irregular domains. To tackle these problems, we regard the matrix-vector operations in the standard attention mechanism on vectors in Euclidean space as bilinear forms and linear operators in vector spaces and generalize the attention mechanism to function spaces. This new attention mechanism is fully equivalent to the standard attention but impossible to compute due to the infinite dimensionality of function spaces. To address this, inspired by model reduction techniques, we propose a Subspace Parameterized Attention (SUPRA) neural operator, which approximates the attention mechanism within a finite-dimensional subspace. To construct a subspace on irregular domains for SUPRA, we propose using the Laplacian eigenfunctions, which naturally adapt to domains' geometry and guarantee the optimal approximation for smooth functions. Experiments show that the SUPRA neural operator reduces error rates by up to 33% on various PDE datasets while maintaining state-of-the-art computational efficiency.

Comment: The paper introduces a novel attention mechanism (SUPRA) for neural operators, which aligns with architectural innovations and efficiency improvements, particularly in irregular domains.

Relevance: 8 Novelty: 8
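
Sketch of the subspace idea only: a smooth function on a 1D grid is projected onto the leading eigenvectors of a discrete Laplacian and reconstructed from a handful of coefficients. The path-graph Laplacian, the grid size, the subspace dimension and the test function are assumptions chosen for illustration; the attention computation that SUPRA performs inside the subspace is not reproduced here.

```python
import numpy as np

n, k = 200, 10  # grid points and subspace dimension (illustrative values)

# Path-graph Laplacian: a discrete stand-in for the Laplace operator on an interval.
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0

# eigh returns eigenvalues in ascending order, so the first k columns
# are the smoothest eigenvectors.
evals, evecs = np.linalg.eigh(L)
basis = evecs[:, :k]  # shape (n, k), orthonormal columns

x = np.linspace(0.0, 1.0, n)
f = np.cos(2.0 * np.pi * x) + 0.5 * x  # a smooth test function

coeffs = basis.T @ f    # k coefficients instead of n grid values
f_hat = basis @ coeffs  # reconstruction from the subspace
rel_err = np.linalg.norm(f - f_hat) / np.linalg.norm(f)
```

Any operation that acts on `coeffs` rather than on `f` costs in terms of k instead of n, which is the source of the efficiency argument; on irregular meshes the same recipe would use a mesh Laplacian in place of the path graph.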

9. Deep learning with missing data

ArXiv ID: 2504.15388

Authors: Tianyi Ma, Tengyao Wang, Richard J. Samworth

Abstract: In the context of multivariate nonparametric regression with missing covariates, we propose Pattern Embedded Neural Networks (PENNs), which can be applied in conjunction with any existing imputation technique. In addition to a neural network trained on the imputed data, PENNs pass the vectors of observation indicators through a second neural network to provide a compact representation. The outputs are then combined in a third neural network to produce final predictions. Our main theoretical result exploits an assumption that the observation patterns can be partitioned into cells on which the Bayes regression function behaves similarly, and belongs to a compositional H\"older class. It provides a finite-sample excess risk bound that holds for an arbitrary missingness mechanism, and in combination with a complementary minimax lower bound, demonstrates that our PENN estimator attains in typical cases the minimax rate of convergence as if the cells of the partition were known in advance, up to a poly-logarithmic factor in the sample size. Numerical experiments on simulated, semi-synthetic and real data confirm that the PENN estimator consistently improves, often dramatically, on standard neural networks without pattern embedding. Code to reproduce our experiments, as well as a tutorial on how to apply our method, is publicly available.

Comment: The paper introduces Pattern Embedded Neural Networks (PENNs) for handling missing data, which provides theoretical insights into representation learning and achieves minimax rates.

Relevance: 8 Novelty: 8
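
Sketch of the three-network layout described in the abstract: one network on the imputed covariates, one on the observation indicators, and a third combining both outputs. Layer widths, mean imputation, concatenation as the combining step and the untrained random weights are all assumptions made for illustration; the authors' public code is the reference for the actual method.

```python
import numpy as np

def init(sizes, rng):
    """Random weights for a small MLP with the given layer sizes."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, weights):
    """ReLU MLP with a linear final layer."""
    for W, b in weights[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = weights[-1]
    return x @ W + b

def mean_impute(X):
    """Replace NaNs by column means (any imputation method could be swapped in)."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

def penn_forward(X, nets, impute):
    mask = ~np.isnan(X)                                    # observation indicators
    h_data = mlp(impute(X), nets["data"])                  # network 1: imputed data
    h_pattern = mlp(mask.astype(float), nets["pattern"])   # network 2: pattern embedding
    combined = np.concatenate([h_data, h_pattern], axis=1)
    return mlp(combined, nets["combine"])                  # network 3: final prediction

rng = np.random.default_rng(0)
d = 3
nets = {
    "data": init([d, 8, 4], rng),
    "pattern": init([d, 8, 2], rng),
    "combine": init([4 + 2, 8, 1], rng),
}
X = np.array([[1.0, np.nan, 2.0], [0.5, 1.5, np.nan], [np.nan, 0.0, 1.0]])
pred = penn_forward(X, nets, mean_impute)
```

Feeding the mask through its own network lets the prediction depend on which covariates were observed, not only on their imputed values.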

10. Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks

ArXiv ID: 2504.15479

Authors: Jeremy Goldwasser, Giles Hooker

Abstract: Counterfactuals are a popular framework for interpreting machine learning predictions. These what-if explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to produce adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it has particular computational efficiency for ours. We demonstrate the efficacy of our approach with the MNIST and CelebA datasets.

Comment: The paper introduces a novel framework for counterfactual explanations in computer vision models, leveraging latent-space adversarial attacks. It aligns with representation learning by addressing interpretability and feature attribution in a computationally efficient manner.