Personalized Daily ArXiv Papers 2025-04-25

[gpt-4o]	Prompt	Completion	Total
Token	26265	3534	29799
Cost	$0.07	$0.04	$0.1

Total arXiv papers: 392

Total scanned papers: 225

Total relevant papers: 19

Table of contents with paper titles:

I-Con: A Unifying Framework for Representation Learning Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton
Representation Learning via Non-Contrastive Mutual Information Authors: Zhaohan Daniel Guo, Bernardo Avila Pires, Khimya Khetarpal, Dale Schuurmans, Bo Dai
Quantum Doubly Stochastic Transformers Authors: Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk
Symbolic Representation for Any-to-Any Generative Tasks Authors: Jiaqi Chen, Xiaoye Zhu, Yue Wang, Tianyang Liu, Xinhui Chen, Ying Chen, Chak Tou Leong, Yifei Ke, Joseph Liu, Yiwen Yuan, Julian McAuley, Li-jia Li
Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu
Backslash: Rate Constrained Optimized Training of Large Language Models Authors: Jun Wu, Jiangtao Wen, Yuxing Han
When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars Authors: Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki
Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light Authors: Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti
Enhancing Variational Autoencoders with Smooth Robust Latent Encoding Authors: Hyomin Lee, Minseon Kim, Sangwon Jang, Jongheon Jeong, Sung Ju Hwang
SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures Authors: Max Hartman, Lav Varshney
Likelihood-Free Variational Autoencoders Authors: Chen Xu, Qiang Wang, Lijun Sun
Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models Authors: Julius Vetter, Manuel Gloeckler, Daniel Gedon, Jakob H. Macke
Physics-informed features in supervised machine learning Authors: Margherita Lampani, Sabrina Guastavino, Michele Piana, Federico Benvenuto
HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Qin Xie, Guiming Xie, Xuejian Gong
Towards Robust LLMs: an Adversarial Robustness Measurement Framework Authors: Natan Levy, Adiel Ashrov, Guy Katz
HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing Authors: Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim
In-Context Learning can distort the relationship between sequence likelihoods and biological fitness Authors: Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta
NeuralGrok: Accelerate Grokking by Neural Gradient Transformation Authors: Xinyu Zhou, Simin Fan, Martin Jaggi, Jie Fu

1. I-Con: A Unifying Framework for Representation Learning

ArXiv ID: 2504.16929

Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton

Abstract: As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.

Comment: The paper introduces a unifying framework for representation learning, connecting various loss functions and methods through an information-theoretic perspective. This aligns closely with the 'Representation Learning' criterion, particularly in understanding how deep networks encode information.

Relevance: 10 Novelty: 9
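
Illustrative sketch (not the authors' code): the abstract describes the objective as an integrated KL divergence between a supervisory and a learned conditional distribution. The toy Python below averages KL(p(.|i) || q(.|i)) over data points. It assumes q is a softmax over negative squared embedding distances; that choice of q, the function names, and the temperature parameter are assumptions made for illustration and may differ from the paper's formulation.

```python
import math


def softmax_neighbors(embeddings, i, temperature=1.0):
    """Assumed learned distribution q(j | i): softmax over negative squared distances, j != i."""
    logits = []
    for j, z in enumerate(embeddings):
        if j == i:
            logits.append(float("-inf"))  # a point is never its own neighbor
        else:
            d2 = sum((a - b) ** 2 for a, b in zip(embeddings[i], z))
            logits.append(-d2 / temperature)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; q is clipped at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def average_conditional_kl(p_rows, embeddings, temperature=1.0):
    """Average over i of KL(p(. | i) || q(. | i)), where p_rows[i] is the supervisory row."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        total += kl(p_rows[i], softmax_neighbors(embeddings, i, temperature))
    return total / n
```

If p puts all its mass on same-cluster neighbors, embeddings that keep each cluster together should score lower under this loss than embeddings that mix the clusters.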

2. Representation Learning via Non-Contrastive Mutual Information

ArXiv ID: 2504.16667

Authors: Zhaohan Daniel Guo, Bernardo Avila Pires, Khimya Khetarpal, Dale Schuurmans, Bo Dai

Abstract: Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.

Comment: The paper proposes a novel non-contrastive mutual information objective (MINC) for self-supervised representation learning, which is highly relevant to foundational research in representation learning. The approach combines strengths of contrastive and non-contrastive methods, offering a significant methodological improvement.

Relevance: 10 Novelty: 8
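
Illustrative sketch (not the authors' code): the abstract does not give the MINC formula, so the Python below only shows the baseline it starts from, the Spectral Contrastive Loss, in its commonly stated form -2 E[f(x)·f(x+)] + E[(f(x)·f(x'))^2]. The batch estimator, the function names, and the use of plain lists for embeddings are assumptions for illustration. The double loop in the negative term is the pairwise comparison the abstract refers to.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def spectral_contrastive_loss(view_a, view_b):
    """Batch estimate of the spectral contrastive loss (needs at least 2 samples).

    view_a[i] and view_b[i] are embeddings of two augmentations of sample i.
    """
    n = len(view_a)
    # Positive term: agreement between the two views of the same sample.
    positive = sum(dot(view_a[i], view_b[i]) for i in range(n)) / n
    # Negative term: squared similarity over every mismatched pair (i != j).
    negative = sum(
        dot(view_a[i], view_b[j]) ** 2
        for i in range(n)
        for j in range(n)
        if i != j
    ) / (n * (n - 1))
    return -2.0 * positive + negative
```

In this toy form, a collapsed batch (every embedding identical) is penalized by the negative term, while orthogonal embeddings are not.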

3. Quantum Doubly Stochastic Transformers

ArXiv ID: 2504.16275

Authors: Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk

Abstract: At the core of the Transformer, the Softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often destabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn's algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn's algorithm is iterative, approximative, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the Softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard Vision Transformer and other doubly stochastic Transformers. Beyond the established Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. The QDSFormer also shows improved training stability and lower performance variation suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.

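Comment: The paper introduces a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the Softmax in self-attention with a variational quantum circuit, and also compares against a quantum-inspired variant based on QR decomposition. This aligns with the 'Model Architecture' criterion, particularly in exploring novel architectural paradigms.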

Relevance: 9 Novelty: 9
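
Illustrative sketch (not the authors' code): the abstract contrasts Softmax, which makes the attention matrix right stochastic (rows sum to 1), with Sinkhorn's algorithm, which makes it doubly stochastic (rows and columns both sum to 1). The Python below shows only that classical contrast on a small square score matrix; the function names and the fixed iteration count are assumptions, and the quantum circuit itself is not modeled.

```python
import math


def softmax_rows(scores):
    """Row-wise softmax: every row sums to 1, columns generally do not."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def sinkhorn(scores, n_iters=50):
    """Alternate row and column normalization of exp(scores) for a square matrix."""
    n = len(scores)
    m = max(max(row) for row in scores)
    mat = [[math.exp(s - m) for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        normalized = []
        for row in mat:
            total = sum(row)
            normalized.append([v / total for v in row])
        # Normalize each column to sum to 1.
        col_sums = [sum(normalized[i][j] for i in range(n)) for j in range(n)]
        mat = [[normalized[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat
```

The loop is iterative and only approximately doubly stochastic after a finite number of steps, which matches the limitations of Sinkhorn's algorithm that the abstract lists.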

4. Symbolic Representation for Any-to-Any Generative Tasks

ArXiv ID: 2504.17261

Authors: Jiaqi Chen, Xiaoye Zhu, Yue Wang, Tianyang Liu, Xinhui Chen, Ying Chen, Chak Tou Leong, Yifei Ke, Joseph Liu, Yiwen Yuan, Julian McAuley, Li-jia Li

Abstract: We propose a symbolic generative task description language and a corresponding inference engine capable of representing arbitrary multimodal tasks as structured symbolic flows. Unlike conventional generative models that rely on large-scale training and implicit neural representations to learn cross-modal mappings, often at high computational cost and with limited flexibility, our framework introduces an explicit symbolic representation comprising three core primitives: functions, parameters, and topological logic. Leveraging a pre-trained language model, our inference engine maps natural language instructions directly to symbolic workflows in a training-free manner. Our framework successfully performs over 12 diverse multimodal generative tasks, demonstrating strong performance and flexibility without the need for task-specific tuning. Experiments show that our method not only matches or outperforms existing state-of-the-art unified models in content quality, but also offers greater efficiency, editability, and interruptibility. We believe that symbolic task representations provide a cost-effective and extensible foundation for advancing the capabilities of generative AI.

Comment: The paper proposes a symbolic generative task description language, which introduces a novel paradigm for generative AI. This aligns with emerging trends and foundational research in representation learning.

Relevance: 9 Novelty: 9
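
Illustrative sketch (not the paper's language or engine): the abstract names three primitives, functions, parameters, and topological logic. One way to picture them is a small dependency graph executed in topological order. Everything in the Python below is hypothetical: the `run_workflow` name, the dictionary layout, and the toy string functions merely stand in for real generative models. It relies on `graphlib`, which is in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter


def run_workflow(functions, parameters, edges, inputs):
    """Execute a toy symbolic workflow.

    functions:  node name -> callable (the "function" primitive)
    parameters: node name -> keyword arguments (the "parameter" primitive)
    edges:      node name -> list of upstream node names (the "topology")
    inputs:     node name -> value for nodes that are given, not computed
    """
    results = dict(inputs)
    # static_order() yields every node after all of its upstream nodes.
    for node in TopologicalSorter(edges).static_order():
        if node in results:
            continue  # provided as an input
        upstream = [results[dep] for dep in edges.get(node, [])]
        results[node] = functions[node](*upstream, **parameters.get(node, {}))
    return results


# Hypothetical two-step flow: prompt -> upper -> repeat
functions = {
    "upper": lambda text: text.upper(),
    "repeat": lambda text, times: " ".join([text] * times),
}
parameters = {"repeat": {"times": 2}}
edges = {"upper": ["prompt"], "repeat": ["upper"]}
results = run_workflow(functions, parameters, edges, {"prompt": "hi"})
```

Because the flow is explicit data rather than learned weights, a single node or parameter can be edited and the graph rerun, which is the kind of editability the abstract points to.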

5. Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention

ArXiv ID: 2504.16795

Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu

Abstract: A key advantage of Recurrent Neural Networks (RNNs) over Transformers is that their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-k chunks, and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.

Comment: The paper introduces a novel hierarchical sparse attention mechanism (HSA) that gives RNNs long-range random access to context while preserving their efficiency and length generalization. This aligns with the 'Model Architecture' criterion, particularly in architectural innovations for efficiency.