Personalized Daily Arxiv Papers 02/03/2025

	Prompt	Completion	Total
Token	85599	7805	93404
Cost	$2.14	$0.78	$2.92

Total scanned papers: 298

Total relevant papers: 27

Table of contents with paper titles:

Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method Authors: Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germ\'an Pizarro, Mat\'ias Fuentes, Francisco Meza, Cristian Buc, Crist\'obal Rojas
TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs Authors: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models Authors: Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh
Brain-inspired sparse training enables Transformers and LLMs to perform as fully connected Authors: Yingtao Zhang, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci
A theoretical framework for overfitting in energy-based modeling Authors: Giovanni Catania, Aur\'elien Decelle, Cyril Furtlehner, Beatriz Seoane
Norm-Bounded Low-Rank Adaptation Authors: Ruigang Wang, Krishnamurthy Dvijotham, Ian R. Manchester
Low-Rank Adapting Models for Sparse Autoencoders Authors: Matthew Chen, Joshua Engels, Max Tegmark
An Invitation to Neuroalgebraic Geometry Authors: Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathl\'en Kohn
A Metric for the Balance of Information in Graph Learning Authors: Alex O. Davies, Nirav S. Ajmeri, Telmo de Menezes e Silva Filho
Self-Supervised Learning Using Nonlinear Dependence Authors: M. Hadi Sepanj, Benyamin Ghojogh, Paul Fieguth
Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models Authors: Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci
Memory-Efficient Fine-Tuning of Transformers via Token Selection Authors: Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
\underline{E2}Former: A Linear-time \underline{E}fficient and \underline{E}quivariant Trans\underline{former} for Scalable Molecular Modeling Authors: Yunyang Li, Lin Huang, Zhihao Ding, Chu Wang, Xinran Wei, Han Yang, Zun Wang, Chang Liu, Yu Shi, Peiran Jin, Jia Zhang, Mark Gerstein, Tao Qin
Symmetric Pruning of Large Language Models Authors: Kai Yi, Peter Richt\'arik
A Comunication Framework for Compositional Generation Authors: Rafael Elberg, Mircea Petrache, Denis Parra
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution Authors: Shichang Zhang, Tessa Han, Usha Bhalla, Hima Lakkaraju
Neural SDEs as a Unified Approach to Continuous-Domain Sequence Modeling Authors: Macheng Shen, Chen Cheng
Error Slice Discovery via Manifold Compactness Authors: Han Yu, Jiashuo Liu, Hao Zou, Renzhe Xu, Yue He, Xingxuan Zhang, Peng Cui
BCAT: A Block Causal Transformer for PDE Foundation Models for Fluid Dynamics Authors: Yuxuan Liu, Jingmin Sun, Hayden Schaeffer
Temperature-Annealed Boltzmann Generators Authors: Henrik Schopmans, Pascal Friederich
Understanding Oversmoothing in GNNs as Consensus in Opinion Dynamics Authors: Keqin Wang, Yulong Yang, Ishan Saha, Christine Allen-Blanchette
Can We Predict the Effect of Prompts? Authors: Jae Yong Lee, Sungmin Kang, Shin Yoo
Unraveling Zeroth-Order Optimization through the Lens of Low-Dimensional Structured Perturbations Authors: Sihwan Park, Jihun Yun, SungYub Kim, Souvik Kundu, Eunho Yang
What is causal about causal models and representations? Authors: Frederik Hytting J{\o}rgensen, Luigi Gresele, Sebastian Weichwald
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning Authors: Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, Liwei Wang, Mingyi Hong, Zhaoran Wang
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment Authors: Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Sch\"onfeld, Ali Thabet, Jonas Kohler
Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment Authors: Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

1. Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method

ArXiv ID: 2501.19215

Authors: Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germ\'an Pizarro, Mat\'ias Fuentes, Francisco Meza, Cristian Buc, Crist\'obal Rojas

Abstract: We propose a novel method to evaluate the theoretical limits of Transformers, allowing us to prove the first lower bounds against one-layer softmax Transformers with infinite precision. We establish those bounds for three tasks that require advanced reasoning. The first task, Match3 (Sanford et al., 2023), requires looking at all triples of positions. The second and third tasks address compositionality-based reasoning: one is composition of functions (Peng et al., 2024) and the other is composition of binary relations. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. In an attempt to overcome these limitations, we introduce Strassen attention and prove that with this mechanism a one-layer Transformer can in principle solve all these tasks. We also show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard (Vaswani et al, 2017), higher-order attention (Sanford et al., 2023) and triangular attention (Bergen et al. 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.

Comment: This paper addresses an innovative attention mechanism (Strassen Attention) for improving transformer scalability and reasoning abilities, directly aligning with model architecture advancements. It provides theoretical and experimental insights.

Relevance: 10 Novelty: 9

2. TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs

ArXiv ID: 2501.19057

Authors: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao

Abstract: Zeroth-order optimization (ZO) has demonstrated remarkable promise in efficient fine-tuning tasks for Large Language Models (LLMs). In particular, recent advances incorporate the low-rankness of gradients, introducing low-rank ZO estimators to further reduce GPU memory consumption. However, most existing works focus solely on the low-rankness of each individual gradient, overlooking a broader property shared by all gradients throughout the training, i.e., all gradients approximately reside within a similar subspace. In this paper, we consider two factors together and propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimension. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix, significantly reducing the training cost. TeZO can also be easily extended to the Adam variant while consuming less memory than MeZO-SGD, and requiring about only 35% memory of MeZO-Adam. Both comprehensive theoretical analysis and extensive experimental research have validated its efficiency, achieving SOTA-comparable results with lower overhead of time and memory.

Comment: The paper explores a novel method for memory-efficient fine-tuning of LLMs by leveraging low-rankness across temporal dimensions and employing Canonical Polyadic Decomposition (CPD). This is closely tied to the 'Model Compression' criteria.

Relevance: 10 Novelty: 9

3. Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

ArXiv ID: 2501.19392

Authors: Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh

Abstract: Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.

Comment: This paper addresses KV cache compression in LLMs, explicitly aligning with the model compression criterion. The introduction of AQUA-KV for adaptive quantization is relevant and demonstrates novel efficiency improvements.

Relevance: 10 Novelty: 8

4. Brain-inspired sparse training enables Transformers and LLMs to perform as fully connected

ArXiv ID: 2501.19107

Authors: Yingtao Zhang, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci

Abstract: This study aims to enlarge our current knowledge on application of brain-inspired network science principles for training artificial neural networks (ANNs) with sparse connectivity. Dynamic sparse training (DST) can reduce the computational demands in ANNs, but faces difficulties to keep peak performance at high sparsity levels. The Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT leverages a gradient-free, topology-driven link regrowth, which has shown ultra-sparse (1% connectivity or lower) advantage across various tasks compared to fully connected networks. Yet, CHT suffers two main drawbacks: (i) its time complexity is O(Nd^3) - N node network size, d node degree - hence it can apply only to ultra-sparse networks. (ii) it selects top link prediction scores, which is inappropriate for the early training epochs, when the network presents unreliable connections. We propose a GPU-friendly approximation of the CH link predictor, which reduces the computational complexity to O(N^3), enabling a fast implementation of CHT in large-scale models. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a strategy for sampling connections in both link removal and regrowth, balancing the exploration and exploitation of network topology. To improve performance, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that, using 1% of connections, CHTs outperforms fully connected networks in MLP on visual classification tasks, compressing some networks to < 30% nodes. Using 5% of the connections, CHTss outperforms fully connected networks in two Transformer-based machine translation tasks. Using 30% of the connections, CHTss achieves superior performance compared to other dynamic sparse training methods in language modeling, and it surpasses the fully connected counterpart in zero-shot evaluations.

Comment: The use of brain-inspired sparse training and dynamic sparse connectivity in transformer-based models is directly relevant to sparsity and model compression, fitting well with foundational contributions in efficiency and architecture.

Relevance: 9 Novelty: 9

5. A theoretical framework for overfitting in energy-based modeling

ArXiv ID: 2501.19158

Authors: Giovanni Catania, Aur\'elien Decelle, Cyril Furtlehner, Beatriz Seoane

Abstract: We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Utilizing the Gaussian model as testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We see that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite data corrections can be accurately modeled through asymptotic random matrix theory calculations and provide the counterpart of generalized cross-validation in the energy based model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations. These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the management of overfitting in energy-based generative models.

Comment: Develops a theoretical framework for overfitting in energy-based generative models, exploring spectral learning dynamics. Matches foundational research in representation learning.

Relevance: 9 Novelty: 9

6. Norm-Bounded Low-Rank Adaptation

ArXiv ID: 2501.19050

Authors: Ruigang Wang, Krishnamurthy Dvijotham, Ian R. Manchester

Abstract: In this work, we propose norm-bounded low-rank adaptation (NB-LoRA) for parameter-efficient fine tuning. We introduce two parameterizations that allow explicit bounds on each singular value of the weight adaptation matrix, which can therefore satisfy any prescribed unitarily invariant norm bound, including the Schatten norms (e.g., nuclear, Frobenius, spectral norm). The proposed parameterizations are unconstrained and complete, i.e. they cover all matrices satisfying the prescribed rank and norm constraints. Experiments on vision fine-tuning benchmarks show that the proposed approach can achieve good adaptation performance while avoiding model catastrophic forgetting and also substantially improve robustness to a wide range of hyper-parameters, including adaptation rank, learning rate and number of training epochs. We also explore applications in privacy-preserving model merging and low-rank matrix completion.

Comment: Proposes NB-LoRA for parameter-efficient fine-tuning with norm bounds on adaptation matrices. This paper aligns closely with model compression, particularly low-rank adaptation techniques, making it highly relevant.

Relevance: 10 Novelty: 8

7. Low-Rank Adapting Models for Sparse Autoencoders

ArXiv ID: 2501.19406

Authors: Matthew Chen, Joshua Engels, Max Tegmark

Abstract: Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we improve on these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3$\times$ to 20$\times$ faster on Gemma-2-2B and 2$\times$ to 10$\times$ faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.

Comment: This paper attempts to improve sparse autoencoders by combining low-rank adaptation with interpretability-driven design, making it directly relevant to topics like sparsity, low-rank techniques, and sparse autoencoders.

Relevance: 10 Novelty: 8

8. An Invitation to Neuroalgebraic Geometry

ArXiv ID: 2501.18915

Authors: Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathl\'en Kohn

Abstract: In this expository work, we promote the study of function spaces parameterized by machine learning models through the lens of algebraic geometry. To this end, we focus on algebraic models, such as neural networks with polynomial activations, whose associated function spaces are semi-algebraic varieties. We outline a dictionary between algebro-geometric invariants of these varieties, such as dimension, degree, and singularities, and fundamental aspects of machine learning, such as sample complexity, expressivity, training dynamics, and implicit bias. Along the way, we review the literature and discuss ideas beyond the algebraic domain. This work lays the foundations of a research direction bridging algebraic geometry and deep learning, that we refer to as neuroalgebraic geometry.

Comment: The paper introduces a new theoretical framework connecting algebraic geometry and machine learning, specifically targeting neural networks. This aligns with 'Representation Learning' as it provides unique insights on the expressivity and training dynamics of neural networks.

Relevance: 9 Novelty: 9

9. A Metric for the Balance of Information in Graph Learning

ArXiv ID: 2501.19137

Authors: Alex O. Davies, Nirav S. Ajmeri, Telmo de Menezes e Silva Filho

Abstract: Graph learning on molecules makes use of information from both the molecular structure and the features attached to that structure. Much work has been conducted on biasing either towards structure or features, with the aim that bias bolsters performance. Identifying which information source a dataset favours, and therefore how to approach learning that dataset, is an open issue. Here we propose Noise-Noise Ratio Difference (NNRD), a quantitative metric for whether there is more useful information in structure or features. By employing iterative noising on features and structure independently, leaving the other intact, NNRD measures the degradation of information in each. We employ NNRD over a range of molecular tasks, and show that it corresponds well to a loss of information, with intuitive results that are more expressive than simple performance aggregates. Our future work will focus on expanding data domains, tasks and types, as well as refining our choice of baseline model.

Comment: Introduces a metric (NNRD) to balance structural and feature information in graph learning. Contains fundamental insights into managing representation biases in molecular graph data.