Personalized Daily ArXiv Papers 2025-04-24

[gpt-4o]	Prompt	Completion	Total
Token	26926	3139	30065
Cost	$0.07	$0.03	$0.10

Total arXiv papers: 377

Total scanned papers: 223

Total relevant papers: 12

Table of contents with paper titles:

1. SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures Authors: Max Hartman, Lav Varshney
2. I-Con: A Unifying Framework for Representation Learning Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton
3. Quantum Doubly Stochastic Transformers Authors: Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk
4. An Effective Gram Matrix Characterizes Generalization in Deep Networks Authors: Rubing Yang, Pratik Chaudhari
5. Representation Learning via Non-Contrastive Mutual Information Authors: Zhaohan Daniel Guo, Bernardo Avila Pires, Khimya Khetarpal, Dale Schuurmans, Bo Dai
6. Provable wavelet-based neural approximation Authors: Youngmi Hur, Hyojae Lim, Mikyoung Lim
7. Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu
8. Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching Authors: Junn Yong Loo, Michelle Adeline, Julia Kaiwen Lau, Fang Yu Leong, Hwa Hui Tew, Arghya Pal, Vishnu Monn Baskaran, Chee-Ming Ting, Raphaël C.-W. Phan
9. Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light Authors: Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi
10. MAGIC: Near-Optimal Data Attribution for Deep Learning Authors: Andrew Ilyas, Logan Engstrom
11. Simple Graph Contrastive Learning via Fractional-order Neural Diffusion Networks Authors: Yanan Zhao, Feng Ji, Kai Zhao, Xuhao Li, Qiyu Kang, Wenfei Liang, Yahya Alkhatib, Xingchao Jian, Wee Peng Tay
12. Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations Authors: Manuel Quintero, William T. Stephenson, Advik Shreekumar, Tamara Broderick

1. SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures

ArXiv ID: 2504.16140

Authors: Max Hartman, Lav Varshney

Abstract: Joint Embedding Predictive Architectures (JEPA) have emerged as a powerful framework for learning general-purpose representations. However, these models often lack interpretability and suffer from inefficiencies due to dense embedding representations. We propose SparseJEPA, an extension that integrates sparse representation learning into the JEPA framework to enhance the quality of learned representations. SparseJEPA employs a penalty method that encourages latent space variables to be shared among data features with strong semantic relationships, while maintaining predictive performance. We demonstrate the effectiveness of SparseJEPA by training on the CIFAR-100 dataset and pre-training a lightweight Vision Transformer. The improved embeddings are utilized in linear-probe transfer learning for both image classification and low-level tasks, showcasing the architecture's versatility across different transfer tasks. Furthermore, we provide a theoretical proof that the grouping mechanism enhances representation quality. We do this by showing that grouping reduces Multiinformation among latent variables, which includes proving the Data Processing Inequality for Multiinformation. Our results indicate that incorporating sparsity not only refines the latent space but also facilitates the learning of more meaningful and interpretable representations. In future work, we hope to extend this method by finding new ways to leverage the grouping mechanism through object-centric representation learning.

Comment: SparseJEPA directly addresses 'Representation Learning' by integrating sparsity into Joint Embedding Predictive Architectures, with theoretical contributions like reducing Multiinformation and proving the Data Processing Inequality for Multiinformation.

Relevance: 10 Novelty: 9

2. I-Con: A Unifying Framework for Representation Learning

ArXiv ID: 2504.16929

Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton

Abstract: As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.

Comment: This paper presents I-Con, a unifying framework for representation learning that generalizes a wide range of loss functions using an information-theoretic perspective. It provides theoretical insights into representation learning and introduces new loss functions, making it highly relevant to foundational research in representation learning.

Relevance: 10 Novelty: 9
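
To make the central quantity in the I-Con abstract above more concrete, here is a minimal pure-Python sketch of an averaged KL divergence between a supervisory neighborhood distribution p(j | i) and a learned one q(j | i). It is an illustration under stated assumptions, not the authors' implementation: the discrete setting, the uniform average over anchors i, the softmax construction, and the names neighbor_distribution and average_kl are all hypothetical.

```python
import math

def neighbor_distribution(sims, i):
    """Softmax over similarities from anchor i to every other point j != i.

    Assumption: a conditional distribution is built from a similarity matrix;
    the digest's abstract does not specify any particular construction.
    """
    m = max(s for j, s in enumerate(sims[i]) if j != i)  # for numerical stability
    exps = [0.0 if j == i else math.exp(s - m) for j, s in enumerate(sims[i])]
    total = sum(exps)
    return [e / total for e in exps]

def average_kl(p, q, eps=1e-12):
    """Mean over anchors i of KL(p(.|i) || q(.|i)) for row-stochastic p and q."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        for pj, qj in zip(p_row, q_row):
            if pj > 0.0:  # terms with p = 0 contribute nothing to the KL sum
                total += pj * math.log(pj / max(qj, eps))
    return total / len(p)
```

A typical use would build p from labels or known neighbors and q from learned embeddings, then minimize average_kl(p, q) with respect to the embedding parameters. The value is zero when the two sets of conditionals coincide and positive otherwise.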

3. Quantum Doubly Stochastic Transformers

ArXiv ID: 2504.16275

Authors: Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk

Abstract: At the core of the Transformer, the Softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often destabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn's algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn's algorithm is iterative, approximative, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the Softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard Vision Transformer and other doubly stochastic Transformers. Beyond the established Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. The QDSFormer also shows improved training stability and lower performance variation suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.

Comment: The paper proposes a hybrid classical-quantum Transformer with a novel quantum inductive bias for doubly stochastic matrices, which directly relates to architectural innovations in Transformers. The use of quantum circuits for DSMs is a unique and cutting-edge contribution.

Relevance: 9 Novelty: 9
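
For readers unfamiliar with the classical baseline that the QDSFormer abstract above contrasts itself with, the following is a minimal pure-Python sketch of the difference between row-wise Softmax (right stochastic) and Sinkhorn normalization (approximately doubly stochastic). It covers only the classical idea, not the quantum circuit and not the authors' code; the function names, the iteration count, and the assumption of a square score matrix are illustrative.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax: every row of the result sums to 1 (right stochastic)."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def sinkhorn(scores, n_iters=50):
    """Alternately normalize rows and columns of exp(scores).

    Assumes a square matrix. The result is only approximately doubly
    stochastic: columns sum to 1 exactly after the last step, rows sum
    to 1 up to a small error that shrinks with more iterations.
    """
    n = len(scores)
    m = max(max(row) for row in scores)
    mat = [[math.exp(x - m) for x in row] for row in scores]
    for _ in range(n_iters):
        # Row step: divide each row by its sum.
        new_rows = []
        for row in mat:
            s = sum(row)
            new_rows.append([v / s for v in row])
        mat = new_rows
        # Column step: divide each column by its sum.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat
```

Calling softmax_rows on a matrix of attention scores gives rows that sum to 1 while the column sums are unconstrained; sinkhorn on the same scores drives both toward 1. The iterative, approximate nature of this loop is the limitation the abstract points to.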

4. An Effective Gram Matrix Characterizes Generalization in Deep Networks

ArXiv ID: 2504.16450

Authors: Rubing Yang, Pratik Chaudhari

Abstract: We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities, a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for them training on different datasets. We analyze this differential equation to compute an "effective Gram matrix" that characterizes the generalization gap after training in terms of the alignment between this Gram matrix and a certain initial "residual". Empirical evaluations on image classification datasets indicate that this analysis can predict the test loss accurately. Further, at any point during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the training process is benign, i.e., it does not lead to significant deterioration of the generalization gap (which is zero at initialization). The alignment between the effective Gram matrix and the residual is different for different datasets and architectures. The match/mismatch of the data and the architecture is primarily responsible for good/bad generalization.

Comment: This paper provides a theoretical analysis of generalization in deep networks using an effective Gram matrix, which aligns with the Representation Learning criterion by offering insights into training dynamics and generalization behavior.