Personalized Daily ArXiv Papers 2025-04-21

[gpt-4o]	Prompt	Completion	Total
Token	26846	3478	30324
Cost	$0.07	$0.03	$0.1

Total arXiv papers: 347

Total scanned papers: 204

Total relevant papers: 13

Table of contents with paper titles:

1. Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective
   Authors: Yuling Jiao, Yanming Lai, Yang Wang, Bokai Yan
2. Generative AI Act II: Test Time Scaling Drives Cognition Engineering
   Authors: Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu
3. A Quantum of Learning: Using Quaternion Algebra to Model Learning on Quantum Devices
   Authors: Sayed Pouria Talebi, Clive Cheong Took, Danilo P. Mandic
4. Efficient algorithms for the Hadamard decomposition
   Authors: Samuel Wertz, Arnaud Vandaele, Nicolas Gillis
5. Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model
   Authors: Zhiwei Xu, Zhiyu Ni, Yixin Wang, Wei Hu
6. How Learnable Grids Recover Fine Detail in Low Dimensions: A Neural Tangent Kernel Analysis of Multigrid Parametric Encodings
   Authors: Samuel Audia, Soheil Feizi, Matthias Zwicker, Dinesh Manocha
7. DP2Unlearning: An Efficient and Guaranteed Unlearning Framework for LLMs
   Authors: Tamim Al Mahmud, Najeeb Jebreel, Josep Domingo-Ferrer, David Sanchez
8. Probabilistic Stability Guarantees for Feature Attributions
   Authors: Helen Jin, Anton Xue, Weiqiu You, Surbhi Goel, Eric Wong
9. Decoding Vision Transformers: the Diffusion Steering Lens
   Authors: Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, Ryota Kanai
10. Training Autoencoders Using Stochastic Hessian-Free Optimization with LSMR
    Authors: Ibrahim Emirahmetoglu, David E. Stewart
11. DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
    Authors: Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou
12. Learning to Attribute with Attention
    Authors: Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry
13. Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs
    Authors: Zichao Yue, Chenhui Deng, Zhiru Zhang

1. Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective

ArXiv ID: 2504.13558

Authors: Yuling Jiao, Yanming Lai, Yang Wang, Bokai Yan

Abstract: The Transformer model is widely used in various application areas of machine learning, such as natural language processing. This paper investigates the approximation of the Hölder continuous function class $\mathcal{H}_{Q}^{\beta}\left([0,1]^{d\times n},\mathbb{R}^{d\times n}\right)$ by Transformers and constructs several Transformers that can overcome the curse of dimensionality. These Transformers consist of one self-attention layer with one head and the softmax function as the activation function, along with several feedforward layers. For example, to achieve an approximation accuracy of $\epsilon$, if the activation functions of the feedforward layers in the Transformer are ReLU and floor, only $\mathcal{O}\left(\log\frac{1}{\epsilon}\right)$ feedforward layers are needed, with widths of these layers not exceeding $\mathcal{O}\left(\frac{1}{\epsilon^{2/\beta}}\log\frac{1}{\epsilon}\right)$. If other activation functions are allowed in the feedforward layers, the width of the feedforward layers can be further reduced to a constant. These results demonstrate that Transformers have a strong expressive capability. The construction in this paper is based on the Kolmogorov-Arnold Representation Theorem and does not require the concept of contextual mapping, hence our proof is more intuitively clear compared to previous Transformer approximation works. Additionally, the translation technique proposed in this paper helps to apply the previous approximation results of feedforward neural networks to Transformer research.
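
Illustration (not from the paper): the depth and width bounds quoted in the abstract contain no dependence on the input dimension $d \times n$. The sketch below evaluates those expressions with every hidden constant assumed to be 1, and contrasts them with a generic dimension-dependent rate of the form $\epsilon^{-\dim/\beta}$, which is used here only as a reference point for what "curse of dimensionality" usually means. The function names are hypothetical.

```python
import math


def transformer_depth_bound(eps):
    """log(1/eps) feedforward layers; the O(.) constant is assumed to be 1."""
    return math.log(1.0 / eps)


def transformer_width_bound(eps, beta):
    """eps^(-2/beta) * log(1/eps) width; the O(.) constant is assumed to be 1."""
    return (1.0 / eps) ** (2.0 / beta) * math.log(1.0 / eps)


def dimension_dependent_rate(eps, beta, dim):
    """Reference rate eps^(-dim/beta), an assumption used only for contrast."""
    return (1.0 / eps) ** (dim / beta)


eps, beta = 0.1, 1.0
d, n = 4, 2  # hypothetical token dimension and sequence length
print("depth bound :", transformer_depth_bound(eps))
print("width bound :", transformer_width_bound(eps, beta))
print("reference   :", dimension_dependent_rate(eps, beta, d * n))
```

Raising d or n changes only the last number; the first two depend on $\epsilon$ and $\beta$ alone.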

Comment: This paper provides a theoretical study on the expressive capabilities of Transformers, specifically addressing their ability to overcome the curse of dimensionality. It aligns closely with the 'Model Architecture' criterion by offering insights into the structure and theoretical underpinnings of Transformers.

Relevance: 10 Novelty: 8

2. Generative AI Act II: Test Time Scaling Drives Cognition Engineering

ArXiv ID: 2504.13828

Authors: Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu

Abstract: The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations in knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: https://github.com/GAIR-NLP/cognition-engineering

Comment: The paper discusses 'Act II' of generative AI and test-time scaling, which introduces a new paradigm in cognition engineering. This aligns with emerging trends and foundational shifts in AI.

Relevance: 9 Novelty: 9

3. A Quantum of Learning: Using Quaternion Algebra to Model Learning on Quantum Devices

ArXiv ID: 2504.13232

Authors: Sayed Pouria Talebi, Clive Cheong Took, Danilo P. Mandic

Abstract: This article considers the problem of designing adaption and optimisation techniques for training quantum learning machines. To this end, the division algebra of quaternions is used to derive an effective model for representing computation and measurement operations on qubits. In turn, the derived model serves as the foundation for formulating an adaptive learning problem on principal quantum learning units, thereby establishing quantum information processing units akin to that of neurons in classical approaches. Then, leveraging the modern HR-calculus, a comprehensive training framework for learning on quantum machines is developed. The quaternion-valued model accommodates mathematical tractability and establishment of performance criteria, such as convergence conditions.

Comment: The paper introduces quaternion algebra for modeling learning on quantum devices, which represents a novel and emerging trend in foundational research.

Relevance: 9 Novelty: 9

4. Efficient algorithms for the Hadamard decomposition

ArXiv ID: 2504.13633

Authors: Samuel Wertz, Arnaud Vandaele, Nicolas Gillis

Abstract: The Hadamard decomposition is a powerful technique for data analysis and matrix compression, which decomposes a given matrix into the element-wise product of two or more low-rank matrices. In this paper, we develop an efficient algorithm to solve this problem, leveraging an alternating optimization approach that decomposes the global non-convex problem into a series of convex sub-problems. To improve performance, we explore advanced initialization strategies inspired by the singular value decomposition (SVD) and incorporate acceleration techniques by introducing momentum-based updates. Beyond optimizing the two-matrix case, we also extend the Hadamard decomposition framework to support more than two low-rank matrices, enabling approximations with higher effective ranks while preserving computational efficiency. Finally, we conduct extensive experiments to compare our method with the existing gradient descent-based approaches for the Hadamard decomposition and with traditional low-rank approximation techniques. The results highlight the effectiveness of our proposed method across diverse datasets.
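
Illustration (not from the paper): the abstract notes that an element-wise product of low-rank matrices can reach a higher effective rank. A minimal sketch of that effect, using exact rational arithmetic and hand-picked 4x4 matrices that are assumptions chosen for this example: two rank-2 factors whose Hadamard product has rank 4. It does not implement the paper's alternating optimization algorithm.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gaussian elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


nodes = [0, 1, 2, 3]
# A = 1 + x*y and B = 1 + (x*y)^2 are each a sum of two rank-1 terms.
A = [[1 + x * y for y in nodes] for x in nodes]
B = [[1 + (x * y) ** 2 for y in nodes] for x in nodes]
# Their product is 1 + t + t^2 + t^3 with t = x*y, a product of Vandermonde matrices.
H = hadamard(A, B)

print(matrix_rank(A), matrix_rank(B), matrix_rank(H))
```

Storing A and B as rank-2 factors takes fewer numbers than storing a full-rank H directly, which is the compression argument the abstract alludes to.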

Comment: The paper introduces an efficient algorithm for the Hadamard decomposition, which is relevant to model compression and low-rank approaches. The extension to multiple low-rank matrices adds methodological depth.

Relevance: 9 Novelty: 8

5. Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model

ArXiv ID: 2504.13292

Authors: Zhiwei Xu, Zhiyu Ni, Yixin Wang, Wei Hu

Abstract: "Grokking" is a phenomenon where a neural network first memorizes training data and generalizes poorly, but then suddenly transitions to near-perfect generalization after prolonged training. While intriguing, this delayed generalization phenomenon compromises predictability and efficiency. Ideally, models should generalize directly without delay. To this end, this paper proposes GrokTransfer, a simple and principled method for accelerating grokking in training neural networks, based on the key observation that data embedding plays a crucial role in determining whether generalization is delayed. GrokTransfer first trains a smaller, weaker model to reach a nontrivial (but far from optimal) test performance. Then, the learned input embedding from this weaker model is extracted and used to initialize the embedding in the target, stronger model. We rigorously prove that, on a synthetic XOR task where delayed generalization always occurs in normal training, GrokTransfer enables the target model to generalize directly without delay. Moreover, we demonstrate that, across empirical studies of different tasks, GrokTransfer effectively reshapes the training dynamics and eliminates delayed generalization, for both fully-connected neural networks and Transformers.

Comment: The paper explores the phenomenon of grokking and proposes a method to accelerate it, which provides insights into training dynamics and representation learning.

Relevance: 9 Novelty: 8

6. How Learnable Grids Recover Fine Detail in Low Dimensions: A Neural Tangent Kernel Analysis of Multigrid Parametric Encodings

ArXiv ID: 2504.13412

Authors: Samuel Audia, Soheil Feizi, Matthias Zwicker, Dinesh Manocha

Abstract: Neural networks that map between low dimensional spaces are ubiquitous in computer graphics and scientific computing; however, in their naive implementation, they are unable to learn high frequency information. We present a comprehensive analysis comparing the two most common techniques for mitigating this spectral bias: Fourier feature encodings (FFE) and multigrid parametric encodings (MPE). FFEs are seen as the standard for low dimensional mappings, but MPEs often outperform them and learn representations with higher resolution and finer detail. FFE's roots in the Fourier transform make it susceptible to aliasing if pushed too far, while MPEs, which use a learned grid structure, have no such limitation. To understand the difference in performance, we use the neural tangent kernel (NTK) to evaluate these encodings through the lens of an analogous kernel regression. By finding a lower bound on the smallest eigenvalue of the NTK, we prove that MPEs improve a network's performance through the structure of their grid and not their learnable embedding. This mechanism is fundamentally different from FFEs, which rely solely on their embedding space to improve performance. Results are empirically validated on a 2D image regression task using images taken from 100 synonym sets of ImageNet and 3D implicit surface regression on objects from the Stanford graphics dataset. Using peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) to evaluate how well fine details are learned, we show that the MPE increases the minimum eigenvalue by 8 orders of magnitude over the baseline and 2 orders of magnitude over the FFE. The increase in spectrum corresponds to a 15 dB (PSNR) / 0.65 (MS-SSIM) increase over baseline and a 12 dB (PSNR) / 0.33 (MS-SSIM) increase over the FFE.
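
Illustration (not from the paper): the two encodings compared in the abstract can be sketched for a scalar input in [0, 1]. The abstract does not give formulas, so both functions below are assumptions based on the commonly used forms: sine/cosine features at fixed frequencies for the FFE, and linear interpolation into per-resolution tables of learnable values for the MPE. Names and the one-value-per-node layout are hypothetical.

```python
import math


def fourier_features(x, frequencies):
    """Fixed encoding: [cos(2*pi*f*x), sin(2*pi*f*x)] for each frequency f."""
    feats = []
    for f in frequencies:
        feats.append(math.cos(2 * math.pi * f * x))
        feats.append(math.sin(2 * math.pi * f * x))
    return feats


def multigrid_features(x, grids):
    """Learned-grid encoding: one linearly interpolated value per resolution.

    Each entry of `grids` is a list of at least two node values spread evenly
    over [0, 1]; in training these values would be learnable parameters.
    """
    feats = []
    for values in grids:
        pos = x * (len(values) - 1)
        lo = min(int(pos), len(values) - 2)  # clamp so x = 1.0 stays in range
        t = pos - lo
        feats.append((1 - t) * values[lo] + t * values[lo + 1])
    return feats


coarse = [0.0, 1.0]       # 2 nodes
fine = [0.0, 2.0, 4.0]    # 3 nodes
print(fourier_features(0.25, [1, 2]))
print(multigrid_features(0.5, [coarse, fine]))
```

Only the nodes adjacent to x contribute to the grid output, whereas every Fourier feature depends on x globally; this locality is one intuition for the structural difference the abstract attributes to the grid.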

Comment: The paper provides a theoretical analysis of multigrid parametric encodings (MPE) and Fourier feature encodings (FFE) using neural tangent kernels, offering foundational insights into representation learning.

Relevance: 9 Novelty: 8

7. DP2Unlearning: An Efficient and Guaranteed Unlearning Framework for LLMs

ArXiv ID: 2504.13774

Authors: Tamim Al Mahmud, Najeeb Jebreel, Josep Domingo-Ferrer, David Sanchez

Abstract: Large language models (LLMs) have recently revolutionized language processing tasks but have also brought ethical and legal issues. LLMs have a tendency to memorize potentially private or copyrighted information present in the training data, which might then be delivered to end users at inference time. When this happens, a naive solution is to retrain the model from scratch after excluding the undesired data. Although this guarantees that the target data have been forgotten, it is also prohibitively expensive for LLMs. Approximate unlearning offers a more efficient alternative, as it consists of ex post modifications of the trained model itself to prevent undesirable results, but it lacks forgetting guarantees because it relies solely on empirical evidence. In this work, we present DP2Unlearning, a novel LLM unlearning framework that offers formal forgetting guarantees at a significantly lower cost than retraining from scratch on the data to be retained. DP2Unlearning involves training LLMs on textual data protected using $\epsilon$-differential privacy (DP), which later enables efficient unlearning with the guarantees against disclosure associated with the chosen $\epsilon$. Our experiments demonstrate that DP2Unlearning achieves similar model performance post-unlearning, compared to an LLM retraining from scratch on retained data -- the gold standard exact unlearning -- but at approximately half the unlearning cost. In addition, with a reasonable computational cost, it outperforms approximate unlearning methods at both preserving the utility of the model post-unlearning and effectively forgetting the targeted information.

Comment: The paper introduces a differential privacy-based unlearning framework for LLMs, which aligns with foundational research in model efficiency and privacy guarantees, offering a novel approach to unlearning.

Relevance: 8 Novelty: 8

8. Probabilistic Stability Guarantees for Feature Attributions

ArXiv ID: 2504.13787

Authors: Helen Jin, Anton Xue, Weiqiu You, Surbhi Goel, Eric Wong

Abstract: Stability guarantees are an emerging tool for evaluating feature attributions, but existing certification methods rely on smoothed classifiers and often yield conservative guarantees. To address these limitations, we introduce soft stability and propose a simple, model-agnostic, and sample-efficient stability certification algorithm (SCA) that provides non-trivial and interpretable guarantees for any attribution. Moreover, we show that mild smoothing enables a graceful tradeoff between accuracy and stability, in contrast to prior certification methods that require a more aggressive compromise. Using Boolean function analysis, we give a novel characterization of stability under smoothing. We evaluate SCA on vision and language tasks, and demonstrate the effectiveness of soft stability in measuring the robustness of explanation methods.
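
Illustration (not from the paper): the abstract does not spell out the certification algorithm, so the sketch below shows only one plausible reading of a sampling-based, model-agnostic stability estimate. It assumes that stability means "the prediction on the masked input is unchanged when up to `radius` extra features are revealed", that perturbations are drawn by picking a count uniformly and then a uniform subset, and that the sample size comes from Hoeffding's inequality. All names and the sampling scheme are hypothetical.

```python
import math
import random


def hoeffding_sample_size(epsilon, delta):
    """Samples needed so the estimate is within epsilon with prob. >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def mask_features(x, keep):
    """Zero out every feature whose index is not in `keep`."""
    return [v if i in keep else 0 for i, v in enumerate(x)]


def estimate_stability(predict, x, selected, radius, n_samples, rng):
    """Fraction of sampled perturbations that leave the prediction unchanged."""
    base = predict(mask_features(x, selected))
    unselected = [i for i in range(len(x)) if i not in selected]
    agree = 0
    for _ in range(n_samples):
        # Assumed sampling scheme: reveal between 0 and `radius` extra features.
        k = rng.randint(0, min(radius, len(unselected)))
        extra = set(rng.sample(unselected, k))
        if predict(mask_features(x, selected | extra)) == base:
            agree += 1
    return agree / n_samples


def sign_of_sum(z):
    """Toy classifier standing in for any black-box model."""
    return int(sum(z) > 0)


x = [1, 5, -3, 2]
n = hoeffding_sample_size(epsilon=0.1, delta=0.05)
rate = estimate_stability(sign_of_sum, x, {0}, radius=2, n_samples=n,
                          rng=random.Random(0))
print(n, rate)
```

Because the estimate only calls `predict`, the same loop applies to any classifier and any attribution that yields a set of selected features.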

Comment: The paper proposes a novel stability certification algorithm for feature attributions, which aligns with representation learning by providing insights into the robustness of explanation methods. The use of Boolean function analysis and soft stability introduces a novel theoretical perspective.

Relevance: 8 Novelty: 8

9. Decoding Vision Transformers: the Diffusion Steering Lens

ArXiv ID: 2504.13763

Authors: Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, Ryota Kanai

Abstract: Logit Lens is a widely adopted method for mechanistic interpretability of transformer-based language models, enabling the analysis of how internal representations evolve across layers by projecting them into the output vocabulary space. Although applying Logit Lens to Vision Transformers (ViTs) is technically straightforward, its direct use faces limitations in capturing the richness of visual representations. Building on the work of Toker et al. (2024), who introduced Diffusion Lens to visualize intermediate representations in the text encoders of text-to-image diffusion models, we demonstrate that while Diffusion Lens can effectively visualize residual stream representations in image encoders, it fails to capture the direct contributions of individual submodules. To overcome this limitation, we propose Diffusion Steering Lens (DSL), a novel, training-free approach that steers submodule outputs and patches subsequent indirect contributions. We validate our method through interventional studies, showing that DSL provides an intuitive and reliable interpretation of the internal processing in ViTs.
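
Illustration (not from the paper): the Logit Lens idea mentioned at the start of the abstract, projecting each layer's hidden state into the output vocabulary space, can be sketched with toy numbers. The vocabulary, unembedding vectors, and hidden states below are made up for the example, and the final normalization step that real models usually apply before unembedding is omitted. This shows the baseline technique only, not the proposed Diffusion Steering Lens.

```python
def logit_lens(hidden_states, unembedding, vocab):
    """Return the top-scoring token for each layer's hidden state.

    `unembedding` holds one vector per vocabulary entry; a token's logit is
    the dot product between that vector and the hidden state.
    """
    top_tokens = []
    for h in hidden_states:
        logits = [sum(hi * wi for hi, wi in zip(h, w)) for w in unembedding]
        best = max(range(len(logits)), key=logits.__getitem__)
        top_tokens.append(vocab[best])
    return top_tokens


vocab = ["cat", "dog", "car"]            # hypothetical 3-token vocabulary
unembedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states = [                         # one 2-d hidden state per layer
    [0.2, 0.1],
    [-1.0, 2.0],
    [1.0, -0.5],
]

print(logit_lens(hidden_states, unembedding, vocab))
```

Reading the top token layer by layer gives a coarse view of how the prediction evolves through the network, which is the kind of analysis the abstract says is harder to carry over directly to visual representations.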

Comment: The paper introduces Diffusion Steering Lens (DSL) for interpretability in Vision Transformers, which aligns with the analysis of existing architectures. This is relevant to understanding how representations evolve in ViTs.