Previous Day 2025-04-23
Monthly Overview 2025-04
Next Day 2025-04-25

Personalized Daily ArXiv Papers 2025-04-24

[gpt-4o] Prompt Completion Total
Token 26926 3139 30065
Cost $0.07 $0.03 $0.1

Total arXiv papers: 377

Total scanned papers: 223

Total relevant papers: 12

Table of contents with paper titles:

  1. SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures Authors: Max Hartman, Lav Varshney

  2. I-Con: A Unifying Framework for Representation Learning Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton

  3. Quantum Doubly Stochastic Transformers Authors: Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk

  4. An Effective Gram Matrix Characterizes Generalization in Deep Networks Authors: Rubing Yang, Pratik Chaudhari

  5. Representation Learning via Non-Contrastive Mutual Information Authors: Zhaohan Daniel Guo, Bernardo Avila Pires, Khimya Khetarpal, Dale Schuurmans, Bo Dai

  6. Provable wavelet-based neural approximation Authors: Youngmi Hur, Hyojae Lim, Mikyoung Lim

  7. Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu

  8. Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching Authors: Junn Yong Loo, Michelle Adeline, Julia Kaiwen Lau, Fang Yu Leong, Hwa Hui Tew, Arghya Pal, Vishnu Monn Baskaran, Chee-Ming Ting, Rapha\"el C. -W. Phan

  9. Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light Authors: Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi

  10. MAGIC: Near-Optimal Data Attribution for Deep Learning Authors: Andrew Ilyas, Logan Engstrom

  11. Simple Graph Contrastive Learning via Fractional-order Neural Diffusion Networks Authors: Yanan Zhao, Feng Ji, Kai Zhao, Xuhao Li, Qiyu Kang, Wenfei Liang, Yahya Alkhatib, Xingchao Jian, Wee Peng Tay

  12. Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations Authors: Manuel Quintero, William T. Stephenson, Advik Shreekumar, Tamara Broderick


1. SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures

ArXiv ID: 2504.16140

Authors: Max Hartman, Lav Varshney

Abstract: Joint Embedding Predictive Architectures (JEPA) have emerged as a powerful framework for learning general-purpose representations. However, these models often lack interpretability and suffer from inefficiencies due to dense embedding representations. We propose SparseJEPA, an extension that integrates sparse representation learning into the JEPA framework to enhance the quality of learned representations. SparseJEPA employs a penalty method that encourages latent space variables to be shared among data features with strong semantic relationships, while maintaining predictive performance. We demonstrate the effectiveness of SparseJEPA by training on the CIFAR-100 dataset and pre-training a lightweight Vision Transformer. The improved embeddings are utilized in linear-probe transfer learning for both image classification and low-level tasks, showcasing the architecture's versatility across different transfer tasks. Furthermore, we provide a theoretical proof that demonstrates that the grouping mechanism enhances representation quality. This was done by displaying that grouping reduces Multiinformation among latent-variables, including proofing the Data Processing Inequality for Multiinformation. Our results indicate that incorporating sparsity not only refines the latent space but also facilitates the learning of more meaningful and interpretable representations. In further work, hope to further extend this method by finding new ways to leverage the grouping mechanism through object-centric representation learning.

Comment: SparseJEPA directly addresses 'Representation Learning' by integrating sparsity into Joint Embedding Predictive Architectures, with theoretical contributions like reducing Multiinformation and proving the Data Processing Inequality for Multiinformation.

Relevance: 10 Novelty: 9


2. I-Con: A Unifying Framework for Representation Learning

ArXiv ID: 2504.16929

Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton

Abstract: As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.

Comment: This paper presents I-Con, a unifying framework for representation learning that generalizes a wide range of loss functions using an information-theoretic perspective. It provides theoretical insights into representation learning and introduces new loss functions, making it highly relevant to foundational research in representation learning.

Relevance: 10 Novelty: 9


3. Quantum Doubly Stochastic Transformers

ArXiv ID: 2504.16275

Authors: Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk

Abstract: At the core of the Transformer, the Softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often destabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn's algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn's algorithm is iterative, approximative, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the Softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard Vision Transformer and other doubly stochastic Transformers. Beyond the established Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. The QDSFormer also shows improved training stability and lower performance variation suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.

Comment: The paper proposes a hybrid classical-quantum Transformer with a novel quantum inductive bias for doubly stochastic matrices, which directly relates to architectural innovations in Transformers. The use of quantum circuits for DSMs is a unique and cutting-edge contribution.

Relevance: 9 Novelty: 9


4. An Effective Gram Matrix Characterizes Generalization in Deep Networks

ArXiv ID: 2504.16450

Authors: Rubing Yang, Pratik Chaudhari

Abstract: We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities, a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for them training on different datasets. We analyze this differential equation to compute an effective Gram matrix'' that characterizes the generalization gap after training in terms of the alignment between this Gram matrix and a certain initialresidual''. Empirical evaluations on image classification datasets indicate that this analysis can predict the test loss accurately. Further, at any point during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the training process is benign, i.e., it does not lead to significant deterioration of the generalization gap (which is zero at initialization). The alignment between the effective Gram matrix and the residual is different for different datasets and architectures. The match/mismatch of the data and the architecture is primarily responsible for good/bad generalization.

Comment: This paper provides a theoretical analysis of generalization in deep networks using an effective Gram matrix, which aligns with the Representation Learning criterion by offering insights into training dynamics and generalization behavior.

Relevance: 9 Novelty: 8


5. Representation Learning via Non-Contrastive Mutual Information

ArXiv ID: 2504.16667

Authors: Zhaohan Daniel Guo, Bernardo Avila Pires, Khimya Khetarpal, Dale Schuurmans, Bo Dai

Abstract: Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.

Comment: This paper introduces a novel self-supervised objective, MINC loss, which combines the strengths of contrastive and non-contrastive methods for representation learning. It aligns closely with the representation learning criterion and offers a methodological innovation.

Relevance: 9 Novelty: 8


6. Provable wavelet-based neural approximation

ArXiv ID: 2504.16682

Authors: Youngmi Hur, Hyojae Lim, Mikyoung Lim

Abstract: In this paper, we develop a wavelet-based theoretical framework for analyzing the universal approximation capabilities of neural networks over a wide range of activation functions. Leveraging wavelet frame theory on the spaces of homogeneous type, we derive sufficient conditions on activation functions to ensure that the associated neural network approximates any functions in the given space, along with an error estimate. These sufficient conditions accommodate a variety of smooth activation functions, including those that exhibit oscillatory behavior. Furthermore, by considering the $L^2$-distance between smooth and non-smooth activation functions, we establish a generalized approximation result that is applicable to non-smooth activations, with the error explicitly controlled by this distance. This provides increased flexibility in the design of network architectures.

Comment: The paper develops a wavelet-based theoretical framework for analyzing neural network approximation capabilities, which aligns closely with foundational research in Representation Learning and Model Architecture.

Relevance: 9 Novelty: 8


7. Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention

ArXiv ID: 2504.16795

Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu

Abstract: A key advantage of Recurrent Neural Networks (RNNs) over Transformers is their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose \textbf{H}ierarchical \textbf{S}parse \textbf{A}ttention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selecting the top-$k$ chunks and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.

Comment: The introduction of Hierarchical Sparse Attention (HSA) for enhancing RNNs aligns with the 'Model Architecture' criterion, as it proposes a novel mechanism for long-context modeling. The hardware-aligned kernel design adds practical efficiency insights.

Relevance: 9 Novelty: 8


8. Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching

ArXiv ID: 2504.16262

Authors: Junn Yong Loo, Michelle Adeline, Julia Kaiwen Lau, Fang Yu Leong, Hwa Hui Tew, Arghya Pal, Vishnu Monn Baskaran, Chee-Ming Ting, Rapha\"el C. -W. Phan

Abstract: Energy-based models (EBMs) are a powerful class of probabilistic generative models due to their flexibility and interpretability. However, relationships between potential flows and explicit EBMs remain underexplored, while contrastive divergence training via implicit Markov chain Monte Carlo (MCMC) sampling is often unstable and expensive in high-dimensional settings. In this paper, we propose Variational Potential Flow Bayes (VPFB), a new energy-based generative framework that eliminates the need for implicit MCMC sampling and does not rely on auxiliary networks or cooperative training. VPFB learns an energy-parameterized potential flow by constructing a flow-driven density homotopy that is matched to the data distribution through a variational loss minimizing the Kullback-Leibler divergence between the flow-driven and marginal homotopies. This principled formulation enables robust and efficient generative modeling while preserving the interpretability of EBMs. Experimental results on image generation, interpolation, out-of-distribution detection, and compositional generation confirm the effectiveness of VPFB, showing that our method performs competitively with existing approaches in terms of sample quality and versatility across diverse generative modeling tasks.

Comment: The paper proposes VPFB, a novel energy-based generative modeling framework that eliminates the need for implicit MCMC sampling. It introduces a variational principle for density homotopy matching, which is a significant theoretical contribution to generative modeling and energy-based models.

Relevance: 9 Novelty: 8


9. Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

ArXiv ID: 2504.16922

Authors: Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi

Abstract: Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.

Comment: The paper introduces Generalized Neighborhood Attention (GNA), a sparse attention mechanism aimed at improving efficiency in foundational models. This aligns with the 'Model Compression' criterion, focusing on sparsity and efficiency breakthroughs. The implementation and theoretical analysis of GNA also contribute to 'Model Architecture' by exploring sparse attention mechanisms.

Relevance: 9 Novelty: 8


10. MAGIC: Near-Optimal Data Attribution for Deep Learning

ArXiv ID: 2504.16430

Authors: Andrew Ilyas, Logan Engstrom

Abstract: The goal of predictive data attribution is to estimate how adding or removing a given set of training datapoints will affect model predictions. In convex settings, this goal is straightforward (i.e., via the infinitesimal jackknife). In large-scale (non-convex) settings, however, existing methods are far less successful -- current methods' estimates often only weakly correlate with ground truth. In this work, we present a new data attribution method (MAGIC) that combines classical methods and recent advances in metadifferentiation to (nearly) optimally estimate the effect of adding or removing training data on model predictions.

Comment: The MAGIC method addresses data attribution in deep learning, which is relevant to foundational research in Representation Learning by exploring the impact of training data on model predictions.

Relevance: 8 Novelty: 8


11. Simple Graph Contrastive Learning via Fractional-order Neural Diffusion Networks

ArXiv ID: 2504.16748

Authors: Yanan Zhao, Feng Ji, Kai Zhao, Xuhao Li, Qiyu Kang, Wenfei Liang, Yahya Alkhatib, Xingchao Jian, Wee Peng Tay

Abstract: Graph Contrastive Learning (GCL) has recently made progress as an unsupervised graph representation learning paradigm. GCL approaches can be categorized into augmentation-based and augmentation-free methods. The former relies on complex data augmentations, while the latter depends on encoders that can generate distinct views of the same input. Both approaches may require negative samples for training. In this paper, we introduce a novel augmentation-free GCL framework based on graph neural diffusion models. Specifically, we utilize learnable encoders governed by Fractional Differential Equations (FDE). Each FDE is characterized by an order parameter of the differential operator. We demonstrate that varying these parameters allows us to produce learnable encoders that generate diverse views, capturing either local or global information, for contrastive learning. Our model does not require negative samples for training and is applicable to both homophilic and heterophilic datasets. We demonstrate its effectiveness across various datasets, achieving state-of-the-art performance.

Comment: The paper proposes a novel augmentation-free graph contrastive learning framework using fractional-order neural diffusion networks. It aligns with representation learning and introduces a unique approach to graph representation learning.

Relevance: 8 Novelty: 8


12. Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations

ArXiv ID: 2504.16864

Authors: Manuel Quintero, William T. Stephenson, Advik Shreekumar, Tamara Broderick

Abstract: In science and social science, we often wish to explain why an outcome is different in two populations. For instance, if a jobs program benefits members of one city more than another, is that due to differences in program participants (particular covariates) or the local labor markets (outcomes given covariates)? The Kitagawa-Oaxaca-Blinder (KOB) decomposition is a standard tool in econometrics that explains the difference in the mean outcome across two populations. However, the KOB decomposition assumes a linear relationship between covariates and outcomes, while the true relationship may be meaningfully nonlinear. Modern machine learning boasts a variety of nonlinear functional decompositions for the relationship between outcomes and covariates in one population. It seems natural to extend the KOB decomposition using these functional decompositions. We observe that a successful extension should not attribute the differences to covariates -- or, respectively, to outcomes given covariates -- if those are the same in the two populations. Unfortunately, we demonstrate that, even in simple examples, two common decompositions -- functional ANOVA and Accumulated Local Effects -- can attribute differences to outcomes given covariates, even when they are identical in two populations. We provide a characterization of when functional ANOVA misattributes, as well as a general property that any discrete decomposition must satisfy to avoid misattribution. We show that if the decomposition is independent of its input distribution, it does not misattribute. We further conjecture that misattribution arises in any reasonable additive decomposition that depends on the distribution of the covariates.

Comment: This paper provides a theoretical critique of functional decompositions in explaining population outcome differences, which aligns with the Emerging Trends criterion as it challenges established assumptions in decomposition methods.

Relevance: 8 Novelty: 8


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: