
Personalized Daily ArXiv Papers 2025-04-21

[gpt-4o]  Prompt  Completion  Total
Token     26846   3478        30324
Cost      $0.07   $0.03       $0.10

Total arXiv papers: 347

Total scanned papers: 204

Total relevant papers: 13

Table of contents with paper titles:

  1. Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective
     Authors: Yuling Jiao, Yanming Lai, Yang Wang, Bokai Yan

  2. Generative AI Act II: Test Time Scaling Drives Cognition Engineering
     Authors: Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu

  3. A Quantum of Learning: Using Quaternion Algebra to Model Learning on Quantum Devices
     Authors: Sayed Pouria Talebi, Clive Cheong Took, Danilo P. Mandic

  4. Efficient algorithms for the Hadamard decomposition
     Authors: Samuel Wertz, Arnaud Vandaele, Nicolas Gillis

  5. Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model
     Authors: Zhiwei Xu, Zhiyu Ni, Yixin Wang, Wei Hu

  6. How Learnable Grids Recover Fine Detail in Low Dimensions: A Neural Tangent Kernel Analysis of Multigrid Parametric Encodings
     Authors: Samuel Audia, Soheil Feizi, Matthias Zwicker, Dinesh Manocha

  7. DP2Unlearning: An Efficient and Guaranteed Unlearning Framework for LLMs
     Authors: Tamim Al Mahmud, Najeeb Jebreel, Josep Domingo-Ferrer, David Sanchez

  8. Probabilistic Stability Guarantees for Feature Attributions
     Authors: Helen Jin, Anton Xue, Weiqiu You, Surbhi Goel, Eric Wong

  9. Decoding Vision Transformers: the Diffusion Steering Lens
     Authors: Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, Ryota Kanai

  10. Training Autoencoders Using Stochastic Hessian-Free Optimization with LSMR
      Authors: Ibrahim Emirahmetoglu, David E. Stewart

  11. DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
      Authors: Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou

  12. Learning to Attribute with Attention
      Authors: Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry

  13. Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs
      Authors: Zichao Yue, Chenhui Deng, Zhiru Zhang


1. Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective

ArXiv ID: 2504.13558

Authors: Yuling Jiao, Yanming Lai, Yang Wang, Bokai Yan

Abstract: The Transformer model is widely used in various application areas of machine learning, such as natural language processing. This paper investigates the approximation of the Hölder continuous function class $\mathcal{H}_{Q}^{\beta}\left([0,1]^{d\times n},\mathbb{R}^{d\times n}\right)$ by Transformers and constructs several Transformers that can overcome the curse of dimensionality. These Transformers consist of one self-attention layer with one head and the softmax function as the activation function, along with several feedforward layers. For example, to achieve an approximation accuracy of $\epsilon$, if the activation functions of the feedforward layers in the Transformer are ReLU and floor, only $\mathcal{O}\left(\log\frac{1}{\epsilon}\right)$ feedforward layers are needed, with widths of these layers not exceeding $\mathcal{O}\left(\frac{1}{\epsilon^{2/\beta}}\log\frac{1}{\epsilon}\right)$. If other activation functions are allowed in the feedforward layers, the width of the feedforward layers can be further reduced to a constant. These results demonstrate that Transformers have a strong expressive capability. The construction in this paper is based on the Kolmogorov-Arnold Representation Theorem and does not require the concept of contextual mapping, hence our proof is more intuitively clear compared to previous Transformer approximation works. Additionally, the translation technique proposed in this paper helps to apply the previous approximation results of feedforward neural networks to Transformer research.

Comment: This paper provides a theoretical study on the expressive capabilities of Transformers, specifically addressing their ability to overcome the curse of dimensionality. It aligns closely with the 'Model Architecture' criterion by offering insights into the structure and theoretical underpinnings of Transformers.

Relevance: 10 Novelty: 8


2. Generative AI Act II: Test Time Scaling Drives Cognition Engineering

ArXiv ID: 2504.13828

Authors: Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu

Abstract: The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations in knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: https://github.com/GAIR-NLP/cognition-engineering

Comment: The paper discusses 'Act II' of generative AI and test-time scaling, which introduces a new paradigm in cognition engineering. This aligns with emerging trends and foundational shifts in AI.

Relevance: 9 Novelty: 9


3. A Quantum of Learning: Using Quaternion Algebra to Model Learning on Quantum Devices

ArXiv ID: 2504.13232

Authors: Sayed Pouria Talebi, Clive Cheong Took, Danilo P. Mandic

Abstract: This article considers the problem of designing adaption and optimisation techniques for training quantum learning machines. To this end, the division algebra of quaternions is used to derive an effective model for representing computation and measurement operations on qubits. In turn, the derived model serves as the foundation for formulating an adaptive learning problem on principal quantum learning units, thereby establishing quantum information processing units akin to that of neurons in classical approaches. Then, leveraging the modern HR-calculus, a comprehensive training framework for learning on quantum machines is developed. The quaternion-valued model accommodates mathematical tractability and establishment of performance criteria, such as convergence conditions.

Comment: The paper introduces quaternion algebra for modeling learning on quantum devices, which represents a novel and emerging trend in foundational research.

Relevance: 9 Novelty: 9


4. Efficient algorithms for the Hadamard decomposition

ArXiv ID: 2504.13633

Authors: Samuel Wertz, Arnaud Vandaele, Nicolas Gillis

Abstract: The Hadamard decomposition is a powerful technique for data analysis and matrix compression, which decomposes a given matrix into the element-wise product of two or more low-rank matrices. In this paper, we develop an efficient algorithm to solve this problem, leveraging an alternating optimization approach that decomposes the global non-convex problem into a series of convex sub-problems. To improve performance, we explore advanced initialization strategies inspired by the singular value decomposition (SVD) and incorporate acceleration techniques by introducing momentum-based updates. Beyond optimizing the two-matrix case, we also extend the Hadamard decomposition framework to support more than two low-rank matrices, enabling approximations with higher effective ranks while preserving computational efficiency. Finally, we conduct extensive experiments to compare our method with the existing gradient descent-based approaches for the Hadamard decomposition and with traditional low-rank approximation techniques. The results highlight the effectiveness of our proposed method across diverse datasets.

Comment: The paper introduces an efficient algorithm for the Hadamard decomposition, which is relevant to model compression and low-rank approaches. The extension to multiple low-rank matrices adds methodological depth.

Relevance: 9 Novelty: 8
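
The decomposition model in the abstract above (a matrix written as the element-wise product of low-rank matrices) can be illustrated with a small sketch. This is not the paper's alternating optimization algorithm; it only shows, on hand-picked toy factors that are assumptions of this example, why a Hadamard product of two rank-2 matrices can reach a higher effective rank than either factor.

```python
from fractions import Fraction


def outer_sum(vectors):
    """Return the sum of v v^T over the given vectors (a symmetric low-rank matrix)."""
    n = len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) for j in range(n)] for i in range(n)]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two matrices of equal shape."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def rank(matrix):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


# Toy rank-2 factors (hypothetical values chosen for illustration).
m1 = outer_sum([[1, 1, 1, 1], [0, 0, 1, 1]])
m2 = outer_sum([[1, 1, 1, 1], [0, 1, 0, 1]])
h = hadamard(m1, m2)

print(rank(m1), rank(m2), rank(h))  # prints "2 2 4"
```

In general the rank of a Hadamard product is bounded by the product of the factor ranks, which is what lets a few small factors represent a matrix of much higher rank.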


5. Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model

ArXiv ID: 2504.13292

Authors: Zhiwei Xu, Zhiyu Ni, Yixin Wang, Wei Hu

Abstract: "Grokking" is a phenomenon where a neural network first memorizes training data and generalizes poorly, but then suddenly transitions to near-perfect generalization after prolonged training. While intriguing, this delayed generalization phenomenon compromises predictability and efficiency. Ideally, models should generalize directly without delay. To this end, this paper proposes GrokTransfer, a simple and principled method for accelerating grokking in training neural networks, based on the key observation that data embedding plays a crucial role in determining whether generalization is delayed. GrokTransfer first trains a smaller, weaker model to reach a nontrivial (but far from optimal) test performance. Then, the learned input embedding from this weaker model is extracted and used to initialize the embedding in the target, stronger model. We rigorously prove that, on a synthetic XOR task where delayed generalization always occurs in normal training, GrokTransfer enables the target model to generalize directly without delay. Moreover, we demonstrate that, across empirical studies of different tasks, GrokTransfer effectively reshapes the training dynamics and eliminates delayed generalization, for both fully-connected neural networks and Transformers.

Comment: The paper explores the phenomenon of grokking and proposes a method to accelerate it, which provides insights into training dynamics and representation learning.

Relevance: 9 Novelty: 8


6. How Learnable Grids Recover Fine Detail in Low Dimensions: A Neural Tangent Kernel Analysis of Multigrid Parametric Encodings

ArXiv ID: 2504.13412

Authors: Samuel Audia, Soheil Feizi, Matthias Zwicker, Dinesh Manocha

Abstract: Neural networks that map between low dimensional spaces are ubiquitous in computer graphics and scientific computing; however, in their naive implementation, they are unable to learn high frequency information. We present a comprehensive analysis comparing the two most common techniques for mitigating this spectral bias: Fourier feature encodings (FFE) and multigrid parametric encodings (MPE). FFEs are seen as the standard for low dimensional mappings, but MPEs often outperform them and learn representations with higher resolution and finer detail. FFE's roots in the Fourier transform make it susceptible to aliasing if pushed too far, while MPEs, which use a learned grid structure, have no such limitation. To understand the difference in performance, we use the neural tangent kernel (NTK) to evaluate these encodings through the lens of an analogous kernel regression. By finding a lower bound on the smallest eigenvalue of the NTK, we prove that MPEs improve a network's performance through the structure of their grid and not their learnable embedding. This mechanism is fundamentally different from FFEs, which rely solely on their embedding space to improve performance. Results are empirically validated on a 2D image regression task using images taken from 100 synonym sets of ImageNet and 3D implicit surface regression on objects from the Stanford graphics dataset. Using peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) to evaluate how well fine details are learned, we show that the MPE increases the minimum eigenvalue by 8 orders of magnitude over the baseline and 2 orders of magnitude over the FFE. The increase in spectrum corresponds to a 15 dB (PSNR) / 0.65 (MS-SSIM) increase over baseline and a 12 dB (PSNR) / 0.33 (MS-SSIM) increase over the FFE.

Comment: The paper provides a theoretical analysis of multigrid parametric encodings (MPE) and Fourier feature encodings (FFE) using neural tangent kernels, offering foundational insights into representation learning.

Relevance: 9 Novelty: 8
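
The two input encodings compared in the abstract above can be sketched for a one-dimensional input. This is a generic illustration, not the paper's formulation: the power-of-two frequency schedule, the linear interpolation, and the function names `fourier_features` and `multigrid_features` are all assumptions of this example.

```python
import math


def fourier_features(x, num_frequencies):
    """Fixed encoding: sin/cos of x at geometrically spaced frequencies (assumed schedule)."""
    feats = []
    for k in range(num_frequencies):
        angle = (2 ** k) * math.pi * x
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


def multigrid_features(x, grids):
    """Learned-grid encoding: one linearly interpolated value per resolution level.

    `grids` is a list of levels; each level is a list of values stored at
    evenly spaced nodes on [0, 1]. In training these values would be
    learnable parameters; here they are plain numbers.
    """
    feats = []
    for values in grids:
        cells = len(values) - 1
        pos = x * cells
        i = min(int(pos), cells - 1)  # clamp so x = 1.0 uses the last cell
        t = pos - i
        feats.append((1 - t) * values[i] + t * values[i + 1])
    return feats


print(fourier_features(0.0, 3))
print(multigrid_features(0.25, [[0.0, 1.0, 2.0], [0.0, 4.0]]))
```

The Fourier encoding has no parameters of its own, while the grid encoding's output depends on which cell the input falls into, which is the structural difference the abstract attributes the performance gap to.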


7. DP2Unlearning: An Efficient and Guaranteed Unlearning Framework for LLMs

ArXiv ID: 2504.13774

Authors: Tamim Al Mahmud, Najeeb Jebreel, Josep Domingo-Ferrer, David Sanchez

Abstract: Large language models (LLMs) have recently revolutionized language processing tasks but have also brought ethical and legal issues. LLMs have a tendency to memorize potentially private or copyrighted information present in the training data, which might then be delivered to end users at inference time. When this happens, a naive solution is to retrain the model from scratch after excluding the undesired data. Although this guarantees that the target data have been forgotten, it is also prohibitively expensive for LLMs. Approximate unlearning offers a more efficient alternative, as it consists of ex post modifications of the trained model itself to prevent undesirable results, but it lacks forgetting guarantees because it relies solely on empirical evidence. In this work, we present DP2Unlearning, a novel LLM unlearning framework that offers formal forgetting guarantees at a significantly lower cost than retraining from scratch on the data to be retained. DP2Unlearning involves training LLMs on textual data protected using $\epsilon$-differential privacy (DP), which later enables efficient unlearning with the guarantees against disclosure associated with the chosen $\epsilon$. Our experiments demonstrate that DP2Unlearning achieves similar model performance post-unlearning, compared to an LLM retraining from scratch on retained data -- the gold standard exact unlearning -- but at approximately half the unlearning cost. In addition, with a reasonable computational cost, it outperforms approximate unlearning methods at both preserving the utility of the model post-unlearning and effectively forgetting the targeted information.

Comment: The paper introduces a differential privacy-based unlearning framework for LLMs, which aligns with foundational research in model efficiency and privacy guarantees, offering a novel approach to unlearning.

Relevance: 8 Novelty: 8


8. Probabilistic Stability Guarantees for Feature Attributions

ArXiv ID: 2504.13787

Authors: Helen Jin, Anton Xue, Weiqiu You, Surbhi Goel, Eric Wong

Abstract: Stability guarantees are an emerging tool for evaluating feature attributions, but existing certification methods rely on smoothed classifiers and often yield conservative guarantees. To address these limitations, we introduce soft stability and propose a simple, model-agnostic, and sample-efficient stability certification algorithm (SCA) that provides non-trivial and interpretable guarantees for any attribution. Moreover, we show that mild smoothing enables a graceful tradeoff between accuracy and stability, in contrast to prior certification methods that require a more aggressive compromise. Using Boolean function analysis, we give a novel characterization of stability under smoothing. We evaluate SCA on vision and language tasks, and demonstrate the effectiveness of soft stability in measuring the robustness of explanation methods.

Comment: The paper proposes a novel stability certification algorithm for feature attributions, which aligns with representation learning by providing insights into the robustness of explanation methods. The use of Boolean function analysis and soft stability introduces a novel theoretical perspective.

Relevance: 8 Novelty: 8


9. Decoding Vision Transformers: the Diffusion Steering Lens

ArXiv ID: 2504.13763

Authors: Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, Ryota Kanai

Abstract: Logit Lens is a widely adopted method for mechanistic interpretability of transformer-based language models, enabling the analysis of how internal representations evolve across layers by projecting them into the output vocabulary space. Although applying Logit Lens to Vision Transformers (ViTs) is technically straightforward, its direct use faces limitations in capturing the richness of visual representations. Building on the work of Toker et al. (2024), who introduced Diffusion Lens to visualize intermediate representations in the text encoders of text-to-image diffusion models, we demonstrate that while Diffusion Lens can effectively visualize residual stream representations in image encoders, it fails to capture the direct contributions of individual submodules. To overcome this limitation, we propose Diffusion Steering Lens (DSL), a novel, training-free approach that steers submodule outputs and patches subsequent indirect contributions. We validate our method through interventional studies, showing that DSL provides an intuitive and reliable interpretation of the internal processing in ViTs.

Comment: The paper introduces Diffusion Steering Lens (DSL) for interpretability in Vision Transformers, which aligns with the analysis of existing architectures. This is relevant to understanding how representations evolve in ViTs.

Relevance: 8 Novelty: 7
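
The Logit Lens idea that the abstract above starts from (projecting each layer's hidden state into the output vocabulary space) can be sketched on toy numbers. This does not implement Diffusion Lens or DSL. The vocabulary, the unembedding matrix, and the hidden states are made-up values, and the sketch skips the final layer normalization that real implementations usually apply before unembedding.

```python
def matvec(matrix, vector):
    """Multiply a (rows x d) matrix by a length-d vector."""
    return [sum(w * h for w, h in zip(row, vector)) for row in matrix]


def logit_lens(hidden_states, unembedding, vocab):
    """For each layer's hidden state, return the token with the largest logit."""
    readout = []
    for hidden in hidden_states:
        logits = matvec(unembedding, hidden)
        best = max(range(len(logits)), key=logits.__getitem__)
        readout.append(vocab[best])
    return readout


# Hypothetical 3-token vocabulary and 2-dimensional residual stream.
vocab = ["cat", "dog", "car"]
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one row per token
hidden_states = [[0.2, 0.1], [0.1, 0.9], [-0.5, -0.7]]  # one vector per layer

print(logit_lens(hidden_states, unembedding, vocab))  # prints ['cat', 'dog', 'car']
```

Reading the list from left to right shows how the model's top prediction changes from layer to layer, which is the kind of per-layer trace the abstract says is hard to carry over to visual representations.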


10. Training Autoencoders Using Stochastic Hessian-Free Optimization with LSMR

ArXiv ID: 2504.13302

Authors: Ibrahim Emirahmetoglu, David E. Stewart

Abstract: Hessian-free (HF) optimization has been shown to effectively train deep autoencoders (Martens, 2010). In this paper, we aim to accelerate HF training of autoencoders by reducing the amount of data used in training. HF utilizes the conjugate gradient algorithm to estimate update directions. Instead, we propose using the LSMR method, which is known for effectively solving large sparse linear systems. We also incorporate Chapelle & Erhan (2011)'s improved preconditioner for HF optimization. In addition, we introduce a new mini-batch selection algorithm to mitigate overfitting. Our algorithm starts with a small subset of the training data and gradually increases the mini-batch size based on (i) variance estimates obtained during the computation of a mini-batch gradient (Byrd et al., 2012) and (ii) the relative decrease in objective value for the validation data. Our experimental results demonstrate that our stochastic Hessian-free optimization, using the LSMR method and the new sample selection algorithm, leads to rapid training of deep autoencoders with improved generalization error.

Comment: The paper proposes improvements to Hessian-free optimization for training autoencoders, which aligns with representation learning and foundational training dynamics. The use of LSMR and mini-batch selection adds methodological novelty.

Relevance: 8 Novelty: 7
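
The abstract above notes that standard HF optimization uses the conjugate gradient algorithm to estimate update directions. A minimal sketch of that baseline step follows; it is not the paper's LSMR variant, and the 2x2 matrix standing in for the curvature matrix is a toy assumption. The solver only touches the matrix through a matrix-vector product callback (`hvp`, a hypothetical name), which is what makes the approach "Hessian-free".

```python
def conjugate_gradient(hvp, b, iterations, tol=1e-12):
    """Approximately solve H x = b for symmetric positive definite H.

    `hvp(v)` must return the product H v; H itself is never formed here.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - H x, with x = 0
    p = list(r)  # search direction
    rs_old = sum(v * v for v in r)
    for _ in range(iterations):
        if rs_old < tol:
            break
        hp = hvp(p)
        alpha = rs_old / sum(pi * hi for pi, hi in zip(p, hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy curvature matrix and right-hand side (in HF, b would be the negative gradient).
H = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def toy_hvp(v):
    return [sum(h * vi for h, vi in zip(row, v)) for row in H]


direction = conjugate_gradient(toy_hvp, b, iterations=10)
print(direction)  # close to [1/11, 7/11]
```

For readers who want to try the substitution the paper proposes, SciPy provides an LSMR solver as `scipy.sparse.linalg.lsmr`; how the paper wires it into HF training is not specified in the abstract.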


11. DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

ArXiv ID: 2504.13227

Authors: Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou

Abstract: Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.

Comment: The paper introduces a domain-aware data sampling strategy for LLM training, which aligns with foundational research in optimizing training dynamics and efficiency.

Relevance: 8 Novelty: 7


12. Learning to Attribute with Attention

ArXiv ID: 2504.13752

Authors: Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry

Abstract: Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality. We provide code for AT2 at https://github.com/MadryLab/AT2 .

Comment: The paper proposes a method for token attribution using attention weights, which provides insights into interpretability and training dynamics of LLMs.

Relevance: 8 Novelty: 7
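
The core idea in the abstract above (treat each attention head's weights as a feature and learn how to combine them using signal from ablations) can be sketched as a small regression. This is an assumption-heavy illustration, not AT2 itself: the linear model, the squared-error objective, the gradient-descent fit, and the toy data are all choices made for this example.

```python
def attribution_scores(head_attention, theta):
    """Score each source token as a weighted sum of per-head attention weights.

    head_attention[t][h] is the attention head h pays to source token t.
    """
    return [sum(w * a for w, a in zip(theta, row)) for row in head_attention]


def fit_head_weights(head_attention, ablation_effects, lr=0.5, steps=500):
    """Fit one coefficient per head by gradient descent on mean squared error."""
    n = len(head_attention)
    num_heads = len(head_attention[0])
    theta = [0.0] * num_heads
    for _ in range(steps):
        preds = attribution_scores(head_attention, theta)
        errors = [p - y for p, y in zip(preds, ablation_effects)]
        grad = [
            2.0 / n * sum(e * row[h] for e, row in zip(errors, head_attention))
            for h in range(num_heads)
        ]
        theta = [w - lr * g for w, g in zip(theta, grad)]
    return theta


# Toy data: head 0 tracks the measured ablation effect, head 1 does not.
head_attention = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
ablation_effects = [1.0, 0.0, 0.5, 0.2]

theta = fit_head_weights(head_attention, ablation_effects)
naive = attribution_scores(head_attention, [0.5, 0.5])  # plain head averaging
learned = attribution_scores(head_attention, theta)
print(theta, naive, learned)
```

On this toy data the fit puts nearly all weight on head 0, while plain averaging scores every token 0.5 and cannot tell them apart, which mirrors the unreliability of naive averaging that the abstract describes.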


13. Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs

ArXiv ID: 2504.13266

Authors: Zichao Yue, Chenhui Deng, Zhiru Zhang

Abstract: Graph neural networks (GNNs) are widely used for learning node embeddings in graphs, typically adopting a message-passing scheme. This approach, however, leads to the neighbor explosion problem, with exponentially growing computational and memory demands as layers increase. Graph sampling has become the predominant method for scaling GNNs to large graphs, mitigating but not fully solving the issue. Pre-propagation GNNs (PP-GNNs) represent a new class of models that decouple feature propagation from training through pre-processing, addressing neighbor explosion in theory. Yet, their practical advantages and system-level optimizations remain underexplored. This paper provides a comprehensive characterization of PP-GNNs, comparing them with graph-sampling-based methods in training efficiency, scalability, and accuracy. While PP-GNNs achieve comparable accuracy, we identify data loading as the key bottleneck for training efficiency and input expansion as a major scalability challenge. To address these issues, we propose optimized data loading schemes and tailored training methods that improve PP-GNN training throughput by an average of 15$\times$ over the PP-GNN baselines, with speedup of up to 2 orders of magnitude compared to sampling-based GNNs on large graph benchmarks. Our implementation is publicly available at https://github.com/cornell-zhang/preprop-gnn.

Comment: The paper characterizes and optimizes pre-propagation GNNs, which aligns with foundational research in graph learning and scalability, offering system-level insights.

Relevance: 8 Novelty: 7
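
The decoupling described in the abstract above (feature propagation done once in pre-processing, separate from training) can be sketched as follows. This is a generic illustration, not the paper's implementation: mean aggregation over neighbors, concatenation of hops, and the function names are assumptions, and every node is assumed to have at least one neighbor.

```python
def propagate(features, neighbors):
    """One propagation step: replace each node's features by its neighbors' mean."""
    out = []
    for node, nbrs in enumerate(neighbors):
        dim = len(features[node])
        out.append([sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)])
    return out


def pre_propagate(features, neighbors, hops):
    """Precompute 0..hops propagated features and concatenate them per node.

    The result is a plain feature table: training can then use any
    per-node model without touching the graph again.
    """
    per_hop = [features]
    for _ in range(hops):
        per_hop.append(propagate(per_hop[-1], neighbors))
    return [[v for hop in per_hop for v in hop[node]] for node in range(len(features))]


# Path graph 0 - 1 - 2 with one scalar feature per node (toy values).
neighbors = [[1], [0, 2], [1]]
features = [[1.0], [0.0], [0.0]]

expanded = pre_propagate(features, neighbors, hops=2)
print(expanded)  # prints [[1.0, 0.0, 0.5], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
```

Each node's input grows from 1 value to 3 here (one per hop), a small-scale view of the input expansion and data loading pressure the abstract identifies as bottlenecks.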


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
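
A response in the JSONL format requested above could be parsed along these lines. This is a minimal sketch: the four key names come from the instruction, the two arXiv IDs and scores come from the results listed earlier, and everything else (integer score types, the placeholder comment strings, the helper name `parse_selection`) is an assumption of this example.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_selection(jsonl_text):
    """Parse one JSON object per non-empty line and check the expected keys."""
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        records.append(record)
    return records


# Placeholder comments; scores are assumed to be integers.
sample = "\n".join([
    json.dumps({"ARXIVID": "2504.13828", "COMMENT": "placeholder", "RELEVANCE": 9, "NOVELTY": 9}),
    json.dumps({"ARXIVID": "2504.13558", "COMMENT": "placeholder", "RELEVANCE": 10, "NOVELTY": 8}),
])

records = sorted(parse_selection(sample), key=lambda r: r["RELEVANCE"], reverse=True)
print([r["ARXIVID"] for r in records])  # prints ['2504.13558', '2504.13828']
```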

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning
     - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
     - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture
     - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations.
     - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression
     - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
     - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs)
     - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability.
     - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science
     - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
     - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends
     - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
     - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: