Personalized Daily Arxiv Papers 03/18/2025

[gpt-4o]	Prompt	Completion	Total
Token	57807	8309	66116
Cost	$0.14	$0.08	$0.22

Total arXiv papers: 897

Total scanned papers: 503

Total relevant papers: 47

Table of contents with paper titles:

Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs Authors: Nir Ailon, Akhiad Bercovich, Omri Weinstein
Counterfactual Realizability Authors: Arvind Raghavan, Elias Bareinboim
Finite Samples for Shallow Neural Networks Authors: Yu Xia, Zhiqiang Xu
Gradient Extrapolation for Debiased Representation Learning Authors: Ihab Asaad, Maha Shadaydeh, Joachim Denzler
Test-Time Training Provably Improves Transformers as In-context Learners Authors: Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak
Towards Learning High-Precision Least Squares Algorithms with Sequence Models Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré
Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis Authors: Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Training Diagonal Linear Networks with Stochastic Sharpness-Aware Minimization Authors: Gabriel Clara, Sophie Langer, Johannes Schmidt-Hieber
Computation Mechanism Behind LLM Position Generalization Authors: Chi Han, Heng Ji
ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning Authors: Baohao Liao, Christian Herold, Seyyed Hadi Hashemi, Stefan Vasilev, Shahram Khadivi, Christof Monz
SuperBPE: Space Travel for Language Models Authors: Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi
ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory Authors: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang
Beyond Propagation of Chaos: A Stochastic Algorithm for Mean Field Optimization Authors: Chandan Tankala, Dheeraj M. Nagaraj, Anant Raj
xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter
Edgeworth Expansion for Semi-hard Triplet Loss Authors: Masanari Kimura
A Survey on Transformer Context Extension: Approaches and Evaluation Authors: Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu
COSMOS: Continuous Simplicial Neural Networks Authors: Aref Einizade, Dorina Thanou, Fragkiskos D. Malliaros, Jhony H. Giraldo
Proof-Driven Clause Learning in Neural Network Verification Authors: Omri Isac, Idan Refaeli, Haoze Wu, Clark Barrett, Guy Katz
Quantum-Enhanced LLM Efficient Fine Tuning Authors: Xiaofei Kong, Lei Li, Menghan Dou, Zhaoyun Chen, Yuchun Wu, Guoping Guo
Verification Learning: Make Unsupervised Neuro-Symbolic System Feasible Authors: Lin-Han Jia, Wen-Chao Hu, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li
TNCSE: Tensor's Norm Constraints for Unsupervised Contrastive Learning of Sentence Embeddings Authors: Tianyu Zong, Bingkang Shi, Hongzhu Yi, Jungang Xu
FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization Authors: Hao Mark Chen, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan
Do you understand epistemic uncertainty? Think again! Rigorous frequentist epistemic uncertainty estimation in regression Authors: Enrico Foglia, Benjamin Bobbia, Nikita Durasov, Michael Bauerheim, Pascal Fua, Stephane Moreau, Thierry Jardin
GFSNetwork: Differentiable Feature Selection via Gumbel-Sigmoid Relaxation Authors: Witold Wydmański, Marek Śmieja
Fast filtering of non-Gaussian models using Amortized Optimal Transport Maps Authors: Mohammad Al-Jarrah, Bamdad Hosseini, Amirhossein Taghvaei
HyperKAN: Hypergraph Representation Learning with Kolmogorov-Arnold Networks Authors: Xiangfei Fang, Boying Wang, Chengying Huan, Shaonan Ma, Heng Zhang, Chen Zhao
Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach Authors: Sinan Fan, Liang Xie, Chen Shen, Ge Teng, Xiaosong Yuan, Xiaofeng Zhang, Chenxi Huang, Wenxiao Wang, Xiaofei He, Jieping Ye
Scale Efficient Training for Large Datasets Authors: Qing Zhou, Junyu Gao, Qi Wang
S2IL: Structurally Stable Incremental Learning Authors: S Balasubramanian, Yedu Krishna P, Talasu Sai Sriram, M Sai Subramaniam, Manepalli Pranav Phanindra Sai, Darshan Gera
ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM Authors: Wenqiang Wang, Yijia Zhang, Zikai Zhang, Guanting Huo, Hao Liang, Shijie Cao, Ningyi Xu
An Optimization Framework for Differentially Private Sparse Fine-Tuning Authors: Mehdi Makni, Kayhan Behdin, Gabriel Afriat, Zheng Xu, Sergei Vassilvitskii, Natalia Ponomareva, Hussein Hazimeh, Rahul Mazumder
Deep Belief Markov Models for POMDP Inference Authors: Giacomo Arcieri, Konstantinos G. Papakonstantinou, Daniel Straub, Eleni Chatzi
Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization Authors: Dmitry Kovalev
SparseLUT: Sparse Connectivity Optimization for Lookup Table-based Deep Neural Networks Authors: Binglei Lou, Ruilin Wu, Philip Leong
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs Authors: Abhishek Moitra, Arkapravo Ghosh, Shrey Agarwal, Aporva Amarnath, Karthik Swaminathan, Priyadarshini Panda
The Architecture and Evaluation of Bayesian Neural Networks Authors: Alisa Sheinkman, Sara Wade
Entropy-regularized Gradient Estimators for Approximate Bayesian Inference Authors: Jasmeet Kaur
Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning Authors: Xi Wang, Hideaki Shimazaki
MetaScale: Test-Time Scaling with Evolving Meta-Thoughts Authors: Qin Liu, Wenxuan Zhou, Nan Xu, James Y. Huang, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
PREAMBLE: Private and Efficient Aggregation of Block Sparse Vectors and Applications Authors: Hilal Asi, Vitaly Feldman, Hannah Keller, Guy N. Rothblum, Kunal Talwar
On Local Posterior Structure in Deep Ensembles Authors: Mikkel Jordahn, Jonas Vestergaard Jensen, Mikkel N. Schmidt, Michael Riis Andersen
Can LLMs Formally Reason as Abstract Interpreters for Program Analysis? Authors: Jacqueline L. Mitchell, Brian Hyeongseok Kim, Chenyu Zhou, Chao Wang
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules Authors: Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference Authors: Hao Yin, Guangzong Si, Zilei Wang
An Algebraic Approach to Moralisation and Triangulation of Probabilistic Graphical Models Authors: Antonio Lorenzin, Fabio Zanasi
Experiments with Optimal Model Trees Authors: Sabino Francesco Roselli, Eibe Frank
Permutation Learning with Only N Parameters: From SoftSort to Self-Organizing Gaussians Authors: Kai Uwe Barthel, Florian Barthel, Peter Eisert

1. Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs

ArXiv ID: 2503.12211

Authors: Nir Ailon, Akhiad Bercovich, Omri Weinstein

Abstract: We propose a cheaper alternative bilinear operator to matrix-multiplication in deep neural networks (DNNs). Unlike many stubborn attempts to accelerate MatMuls in DNN inference, this operator is supported by capabilities of existing GPU hardware, most notably NVIDIA TensorCores. To our knowledge, this is the first GPU-native acceleration technique which does not decrease (in fact, increases) the number of trainable parameters of the network, mitigating the accuracy-loss of compression-based techniques. Hence, this operator is at the same time more expressive than MatMul, yet requires substantially fewer FLOPs to evaluate. We term this new operator Strassen-Tile (STL). The main idea behind STL$(X,W)$ is a local change-of-basis (learnable encoder) on weights and activation tiles, after which we perform batched elementwise products between tiles, and a final decoding transformation (inspired by algebraic pipelines from fast matrix and polynomial multiplication). We compare STL against two benchmarks. The first one is SoTA T2T-ViT on ImageNet-1K. Here we show that replacing all linear layers with STL and training from scratch results in a 2.7x reduction in FLOPs with a 0.5 accuracy improvement. Our second speed-accuracy comparison benchmark for pretrained LLMs is the most practical GPU-acceleration technique, 2:4 structured sparsity. Finetuning TinyLlama [tinyllama24] with STL layers on the Slim Pajama dataset achieves similar accuracy to 2:4, with a 2.2x FLOP speedup compared to 1.7x for the latter. Finally, we discuss a group-theoretic approach for discovering universal encoders for STL, which could lead to fast black-box acceleration via approximate matrix-multiplication (AMM).

Comment: The paper introduces a GPU-efficient alternative to matrix multiplication in DNNs, which aligns with model compression and efficiency breakthroughs. The Strassen-Tile operator is a novel contribution.

Relevance: 9 Novelty: 9
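
Illustrative sketch (not from the paper): the encode, elementwise-multiply, decode pipeline described in the abstract above can be shown on 2x2 tiles. The abstract says the encoders are learnable; this sketch instead fixes them to the coefficients of Strassen's classical 2x2 algorithm, which is an assumption made only so the result can be checked against an ordinary matrix product. The names `stl_like`, `ENC_X`, `ENC_W`, and `DEC` are hypothetical.

```python
# Strassen's 2x2 coefficients, with each tile flattened row-major as
# [t11, t12, t21, t22]. In the paper the encoders and decoder are learnable;
# fixing them to these values is an assumption made for illustration.
ENC_X = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
         [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
ENC_W = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
         [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
DEC = [[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]]


def matvec(m, v):
    """Multiply a small dense matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def tile(mat, i, k):
    """Flatten the 2x2 tile at block position (i, k), row-major."""
    r, c = 2 * i, 2 * k
    return [mat[r][c], mat[r][c + 1], mat[r + 1][c], mat[r + 1][c + 1]]


def stl_like(x, w):
    """Tile-wise product of x and w (both with even dimensions)."""
    rows, inner, cols = len(x) // 2, len(w) // 2, len(w[0]) // 2
    # Encode every tile once: change of basis from 4 to 7 coordinates.
    enc_x = [[matvec(ENC_X, tile(x, i, k)) for k in range(inner)]
             for i in range(rows)]
    enc_w = [[matvec(ENC_W, tile(w, k, j)) for j in range(cols)]
             for k in range(inner)]
    out = [[0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            acc = [0] * 7
            for k in range(inner):
                # Elementwise product of encoded tiles, summed over k.
                acc = [a + p * q
                       for a, p, q in zip(acc, enc_x[i][k], enc_w[k][j])]
            c = matvec(DEC, acc)  # decode back to a 2x2 output tile
            out[2 * i][2 * j], out[2 * i][2 * j + 1] = c[0], c[1]
            out[2 * i + 1][2 * j], out[2 * i + 1][2 * j + 1] = c[2], c[3]
    return out


def matmul(x, w):
    """Reference matrix product used only for comparison."""
    return [[sum(x[i][k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]
```

With these fixed coefficients each pair of tiles costs 7 elementwise multiplications instead of the 8 scalar multiplications of a direct 2x2 product, and `stl_like(x, w)` agrees with `matmul(x, w)`. Learned encoders, as in the paper, would generally not reproduce the exact product.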

2. Counterfactual Realizability

ArXiv ID: 2503.11870

Authors: Arvind Raghavan, Elias Bareinboim

Abstract: It is commonly believed that, in a real-world environment, samples can only be drawn from observational and interventional distributions, corresponding to Layers 1 and 2 of the Pearl Causal Hierarchy. Layer 3, representing counterfactual distributions, is believed to be inaccessible by definition. However, Bareinboim, Forney, and Pearl (2015) introduced a procedure that allows an agent to sample directly from a counterfactual distribution, leaving open the question of what other counterfactual quantities can be estimated directly via physical experimentation. We resolve this by introducing a formal definition of realizability, the ability to draw samples from a distribution, and then developing a complete algorithm to determine whether an arbitrary counterfactual distribution is realizable given fundamental physical constraints, such as the inability to go back in time and subject the same unit to a different experimental condition. We illustrate the implications of this new framework for counterfactual data collection using motivating examples from causal fairness and causal reinforcement learning. While the baseline approach in these motivating settings typically follows an interventional or observational strategy, we show that a counterfactual strategy provably dominates both.

Comment: The paper explores counterfactual realizability, which is a cutting-edge theoretical contribution with potential implications for foundational research.

Relevance: 9 Novelty: 9

3. Finite Samples for Shallow Neural Networks

ArXiv ID: 2503.12744

Authors: Yu Xia, Zhiqiang Xu

Abstract: This paper investigates the ability of finite samples to identify two-layer irreducible shallow networks with various nonlinear activation functions, including rectified linear units (ReLU) and analytic functions such as the logistic sigmoid and hyperbolic tangent. An "irreducible" network is one whose function cannot be represented by another network with fewer neurons. For ReLU activation functions, we first establish necessary and sufficient conditions for determining the irreducibility of a network. Subsequently, we prove a negative result: finite samples are insufficient for definitive identification of any irreducible ReLU shallow network. Nevertheless, we demonstrate that for a given irreducible network, one can construct a finite set of sampling points that can distinguish it from other networks with the same neuron count. Conversely, for logistic sigmoid and hyperbolic tangent activation functions, we provide a positive result. We construct finite samples that enable the recovery of two-layer irreducible shallow analytic networks. To the best of our knowledge, this is the first study to investigate the exact identification of two-layer irreducible networks using finite sample function values. Our findings provide insights into the comparative performance of networks with different activation functions under limited sampling conditions.

Comment: The paper investigates the identifiability of shallow neural networks with finite samples, providing theoretical insights into network irreducibility and activation functions. This aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 8

4. Gradient Extrapolation for Debiased Representation Learning

ArXiv ID: 2503.13236

Authors: Ihab Asaad, Maha Shadaydeh, Joachim Denzler

Abstract: Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations to define the target gradient as the linear extrapolation of two gradients computed from each batch's loss. It is demonstrated that the extrapolated gradient, if directed toward the gradient of the batch with the smaller amount of spurious correlation, can guide the training process toward learning a debiased model. GERNE can serve as a general framework for debiasing, with methods such as ERM, reweighting, and resampling shown to be special cases. The theoretical upper and lower bounds of the extrapolation factor are derived to ensure convergence. By adjusting this factor, GERNE can be adapted to maximize the Group-Balanced Accuracy (GBA) or the Worst-Group Accuracy. The proposed approach is validated on five vision benchmarks and one NLP benchmark, demonstrating competitive and often superior performance compared to state-of-the-art baseline methods.

Comment: The paper proposes a novel gradient extrapolation method for debiased representation learning, which aligns with foundational research in representation learning and optimization.

Relevance: 9 Novelty: 8
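
Illustrative sketch (not from the paper): the abstract above describes a target gradient built as a linear extrapolation of two batch gradients, directed toward the less biased batch. One plausible way to write that is shown here. The exact parameterization, the name `extrapolated_gradient`, and the factor name `beta` are assumptions; the paper's own formula and its bounds on the factor may differ.

```python
def extrapolated_gradient(grad_biased, grad_less_biased, beta):
    """Linear extrapolation from the biased-batch gradient toward (and,
    for beta > 0, past) the less-biased-batch gradient.

    beta = -1 returns grad_biased, beta = 0 returns grad_less_biased.
    This parameterization is an assumption, not the paper's notation.
    """
    return [gl + beta * (gl - gb)
            for gb, gl in zip(grad_biased, grad_less_biased)]


def sgd_update(params, grad, lr):
    """Plain gradient step using the extrapolated direction."""
    return [p - lr * g for p, g in zip(params, grad)]
```

In this form, components where the two batches agree are left unchanged, and components where they disagree are pushed further in the direction favored by the less biased batch.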

5. Test-Time Training Provably Improves Transformers as In-context Learners

ArXiv ID: 2503.11842

Authors: Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak

Abstract: Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between pretraining distribution and target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer) unlocking substantial inference efficiency with a negligible training cost.

Comment: The paper provides theoretical insights into test-time training for transformers as in-context learners, which aligns with foundational research in training dynamics and large language models.

Relevance: 9 Novelty: 8
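
Illustrative sketch (not from the paper): the update rule analyzed in the abstract above is a single gradient step on the in-context demonstrations of the test prompt. The paper studies linear transformers; the sketch below substitutes a plain linear predictor as a stand-in, which is an assumption, and the function names are hypothetical.

```python
def predict(w, x):
    """Linear prediction <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, demos):
    """Mean squared error of w on (x, y) demonstrations."""
    return sum((predict(w, x) - y) ** 2 for x, y in demos) / len(demos)


def ttt_one_step(w, demos, lr):
    """One gradient step on the MSE over the in-context demonstrations."""
    n = len(demos)
    grad = [0.0] * len(w)
    for x, y in demos:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / n
    return [wi - lr * gi for wi, gi in zip(w, grad)]


# Usage: adapt pretrained weights to the demonstrations, then answer the query.
pretrained_w = [0.0, 0.0]
demos = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
adapted_w = ttt_one_step(pretrained_w, demos, lr=0.5)
query_prediction = predict(adapted_w, [1.0, 1.0])
```

The adapted weights are used only for the current prompt, so each test instance starts again from the pretrained weights.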

6. Towards Learning High-Precision Least Squares Algorithms with Sequence Models

ArXiv ID: 2503.12295

Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré

Abstract: This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.

Comment: The paper explores the limitations of Transformers in high-precision numerical tasks and introduces polynomial architectures for learning numerical algorithms, which aligns with foundational research in model architecture and training dynamics.

Relevance: 9 Novelty: 8
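
Illustrative sketch (not from the paper): the target computation in the abstract above is a gradient descent iterate on a least squares problem, applied repeatedly until the solution is accurate to near floating point error. The reference iterate that a learned model would have to imitate can be written as follows. The function names and the small test problem are hypothetical.

```python
def gd_step(w, A, b, lr):
    """One gradient descent iterate on 0.5 * ||A w - b||^2.

    The gradient is A^T (A w - b).
    """
    residual = [sum(a * wi for a, wi in zip(row, w)) - bi
                for row, bi in zip(A, b)]
    grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
            for j in range(len(w))]
    return [wj - lr * gj for wj, gj in zip(w, grad)]


def solve_least_squares(A, b, lr, steps):
    """Apply the iterate repeatedly, starting from zero."""
    w = [0.0] * len(A[0])
    for _ in range(steps):
        w = gd_step(w, A, b, lr)
    return w
```

Note that every operation in the iterate is a sum or a product, which is consistent with the abstract's remark that architectures composed entirely of polynomials can represent it efficiently.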

7. Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis

ArXiv ID: 2503.13401

Authors: Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths

Abstract: Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on Marr's three levels of analysis. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.

Comment: The paper proposes using cognitive science methods to understand LLMs, which aligns with theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8

8. Training Diagonal Linear Networks with Stochastic Sharpness-Aware Minimization

ArXiv ID: 2503.11891

Authors: Gabriel Clara, Sophie Langer, Johannes Schmidt-Hieber

Abstract: We analyze the landscape and training dynamics of diagonal linear networks in a linear regression task, with the network parameters being perturbed by small isotropic normal noise. The addition of such noise may be interpreted as a stochastic form of sharpness-aware minimization (SAM) and we prove several results that relate its action on the underlying landscape and training dynamics to the sharpness of the loss. In particular, the noise changes the expected gradient to force balancing of the weight matrices at a fast rate along the descent trajectory. In the diagonal linear model, we show that this equates to minimizing the average sharpness, as well as the trace of the Hessian matrix, among all possible factorizations of the same matrix. Further, the noise forces the gradient descent iterates towards a shrinkage-thresholding of the underlying true parameter, with the noise level explicitly regulating both the shrinkage factor and the threshold.

Comment: The paper provides theoretical insights into training dynamics and sharpness-aware minimization, which aligns with representation learning and training dynamics in neural networks.

Relevance: 9 Novelty: 8
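
Illustrative sketch (not from the paper): the abstract above describes gradient descent on a diagonal linear network whose parameters are perturbed by small isotropic normal noise. One reading of that setup is sketched here: the predictor is the elementwise product of `u` and `v`, the gradient is evaluated at noise-perturbed parameters, and the step is applied to the unperturbed ones. The single-sample squared loss, the update form, and all names are assumptions.

```python
import random


def noisy_step(u, v, x, y, lr, sigma, rng):
    """One step for f(x) = <u * v, x> with loss 0.5 * (f(x) - y)^2,
    where the gradient is taken at Gaussian-perturbed parameters."""
    # Perturb both weight vectors with isotropic normal noise.
    up = [ui + rng.gauss(0.0, sigma) for ui in u]
    vp = [vi + rng.gauss(0.0, sigma) for vi in v]
    err = sum(a * b * c for a, b, c in zip(up, vp, x)) - y
    # Gradients of the loss with respect to u and v at the perturbed point.
    grad_u = [err * vi * xi for vi, xi in zip(vp, x)]
    grad_v = [err * ui * xi for ui, xi in zip(up, x)]
    new_u = [ui - lr * g for ui, g in zip(u, grad_u)]
    new_v = [vi - lr * g for vi, g in zip(v, grad_v)]
    return new_u, new_v


# Usage with a seeded generator so runs are repeatable.
rng = random.Random(0)
u, v = noisy_step([1.0, 2.0], [1.0, 1.0], [1.0, 1.0], 1.0,
                  lr=0.1, sigma=0.01, rng=rng)
```

Setting `sigma` to zero recovers an ordinary gradient step, which makes the effect of the noise level easy to isolate in experiments.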

9. Computation Mechanism Behind LLM Position Generalization

ArXiv ID: 2503.13305

Authors: Chi Han, Heng Ji

Abstract: Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term position generalization. They can understand texts with position perturbations and generalize to longer texts than those encountered during training with the latest techniques. These phenomena suggest that LLMs handle positions tolerantly, but how LLMs computationally process positional relevance remains largely unexplored. This work connects the linguistic phenomenon with LLMs' computational mechanisms. We show how LLMs enforce certain computational mechanisms for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, this work reveals that LLMs learn a counterintuitive disentanglement of attention logits. Their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features, which we prove theoretically enables this effect. The pattern, which is different from how randomly initialized parameters would behave, suggests that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for LLMs' position flexibilities. This work takes a pioneering step in linking position generalization with modern LLMs' internal mechanisms.

Comment: The paper provides computational insights into LLM position generalization, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8

10. ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

ArXiv ID: 2503.13089

Authors: Baohao Liao, Christian Herold, Seyyed Hadi Hashemi, Stefan Vasilev, Shahram Khadivi, Christof Monz

Abstract: As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.

Comment: The paper proposes a novel compression paradigm, ClusComp, which aligns with model compression criteria by addressing quantization and efficient finetuning with theoretical contributions.

Relevance: 9 Novelty: 8
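
Illustrative sketch (not from the paper): the abstract above says ClusComp clusters weight matrices into codebooks. The general idea of codebook compression can be shown with a scalar k-means over the entries of one small matrix. What ClusComp actually clusters (scalars or vectors), how it initializes, and its block-by-block finetuning are not reproduced here; `build_codebook` and `reconstruct` are hypothetical names.

```python
def build_codebook(weights, k, iters=20):
    """Scalar k-means (Lloyd's algorithm) over all entries of a matrix.

    Returns (centers, codes): k centroid values and, for each entry in
    row-major order, the index of its nearest centroid.
    """
    flat = [v for row in weights for v in row]
    lo, hi = min(flat), max(flat)
    # Spread the initial centers evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centers[c]))

    for _ in range(iters):
        codes = [nearest(v) for v in flat]
        for c in range(k):
            members = [v for v, a in zip(flat, codes) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    codes = [nearest(v) for v in flat]
    return centers, codes


def reconstruct(centers, codes, n_cols):
    """Rebuild an approximate matrix from the codebook and the codes."""
    flat = [centers[c] for c in codes]
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]
```

Storage then consists of `k` centroid values plus one small integer index per weight, and only the centroids would need to be updated if the codebook were finetuned.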

11. SuperBPE: Space Travel for Language Models

ArXiv ID: 2503.13423

Authors: Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi

Abstract: The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying only the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.

Comment: The paper introduces SuperBPE, a novel tokenization method for LLMs, which aligns with foundational research in LLM architecture and pretraining improvements.

Relevance: 9 Novelty: 8

12. ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

ArXiv ID: 2503.12668

Authors: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang

Abstract: Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it's feasible to enhance both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO's double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as the OPT-175B with more than 175 billion parameters, on a mere 18GB GPU--achievements beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2's code has been open-sourced in https://github.com/liangyuwang/zo2.

Comment: The paper introduces ZO2, a zeroth-order fine-tuning framework for LLMs, which aligns with model compression and efficiency breakthroughs by enabling fine-tuning of extremely large models with limited GPU memory.

Relevance: 9 Novelty: 8
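
Illustrative sketch (not from the paper): the abstract above relies on zeroth-order updates that need only two forward passes and no stored activations. A common estimator of this kind perturbs the parameters along a random direction in both signs and scales that direction by the finite difference of the two losses. Whether ZO2 uses exactly this estimator is an assumption, the CPU/GPU offloading that is the paper's contribution is not modeled, and `zo_step` is a hypothetical name.

```python
import random


def zo_step(theta, loss_fn, lr, eps, rng):
    """One zeroth-order update using two forward evaluations of loss_fn."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]            # random direction
    loss_plus = loss_fn([t + eps * zi for t, zi in zip(theta, z)])
    loss_minus = loss_fn([t - eps * zi for t, zi in zip(theta, z)])
    # Finite-difference estimate of the directional derivative along z.
    scale = (loss_plus - loss_minus) / (2.0 * eps)
    return [t - lr * scale * zi for t, zi in zip(theta, z)]


def quadratic_loss(theta):
    """Toy objective with minimum at theta = (1, 1, 1)."""
    return sum((t - 1.0) ** 2 for t in theta)


# Usage: minimize the toy objective without ever computing a gradient.
rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
for _ in range(500):
    theta = zo_step(theta, quadratic_loss, lr=0.05, eps=1e-3, rng=rng)
```

Because the direction `z` can be regenerated from a random seed, an implementation does not need to keep it in memory between the two forward passes and the update.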

13. Beyond Propagation of Chaos: A Stochastic Algorithm for Mean Field Optimization

ArXiv ID: 2503.13115

Authors: Chandan Tankala, Dheeraj M. Nagaraj, Anant Raj

Abstract: Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with $n$ particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables. In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent. This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm's output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.

Comment: The paper explores a stochastic algorithm for mean field optimization, which aligns with foundational research in representation learning and training dynamics. It provides theoretical insights into optimization in Wasserstein space.

Relevance: 9 Novelty: 8

14. xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

ArXiv ID: 2503.13427

Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter

Abstract: Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.

Comment: The paper introduces xLSTM 7B, a recurrent LLM architecture optimized for efficient inference. This aligns with the core topic of model architecture innovations.

Relevance: 9 Novelty: 8

15. Edgeworth Expansion for Semi-hard Triplet Loss

ArXiv ID: 2503.12893

Authors: Masanari Kimura

Abstract: We develop a higher-order asymptotic analysis for the semi-hard triplet loss using the Edgeworth expansion. It is known that this loss function enforces that embeddings of similar samples are close while those of dissimilar samples are separated by a specified margin. By refining the classical central limit theorem, our approach quantifies the impact of the margin parameter and the skewness of the underlying data distribution on the loss behavior. In particular, we derive explicit Edgeworth expansions that reveal first-order corrections in terms of the third cumulant, thereby characterizing non-Gaussian effects present in the distribution of distance differences between anchor-positive and anchor-negative pairs. Our findings provide detailed insight into the sensitivity of the semi-hard triplet loss to its parameters and offer guidance for choosing the margin to ensure training stability.

Comment: This paper offers a higher-order asymptotic analysis of the semi-hard triplet loss, providing theoretical insights into its behavior. It aligns with foundational research in representation learning by analyzing the training dynamics and sensitivity of a loss function.

Relevance: 9 Novelty: 8
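
For readers unfamiliar with the loss analyzed in entry 15, the following is a minimal sketch of the commonly used triplet hinge loss and the semi-hard condition on negatives. It does not reproduce the paper's Edgeworth analysis. The function names, the choice of Euclidean distance, and the example margin of 0.5 are assumptions made for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def is_semi_hard(anchor, positive, negative, margin):
    """A negative is semi-hard when d(a, p) < d(a, n) < d(a, p) + margin."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return d_ap < d_an < d_ap + margin


if __name__ == "__main__":
    anchor, positive = [0.0, 0.0], [1.0, 0.0]
    # Three illustrative negatives: inside the margin band, far away, too close.
    for negative in ([1.2, 0.0], [3.0, 0.0], [0.5, 0.0]):
        print(negative,
              is_semi_hard(anchor, positive, negative, margin=0.5),
              triplet_loss(anchor, positive, negative, margin=0.5))
```

Only the first negative falls in the band between d(a, p) and d(a, p) + margin, which is the regime where the margin parameter directly shapes the loss value.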

16. A Survey on Transformer Context Extension: Approaches and Evaluation

ArXiv ID: 2503.13299

Authors: Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu

Abstract: Large language models (LLMs) based on the Transformer have been widely applied in the field of natural language processing (NLP), demonstrating strong performance, particularly on short-text tasks. However, in long-context scenarios, the performance of LLMs degrades due to several challenges. A number of recent works have been proposed to alleviate this degradation. In this survey, we first list the challenges of applying pre-trained LLMs to process long contexts. We then systematically review the approaches related to long context and propose a taxonomy categorizing them into four main types: positional encoding, context compression, retrieval augmented, and attention pattern. In addition to the approaches, we focus on the evaluation of long context, organizing relevant data, tasks, and metrics based on existing long context benchmarks. Finally, we summarize unresolved issues in the long context domain and put forward our views on future developments.

Comment: The survey focuses on extending Transformer context for long sequences, which aligns with foundational research in Transformer architecture and efficiency.

Relevance: 9 Novelty: 7

17. COSMOS: Continuous Simplicial Neural Networks

ArXiv ID: 2503.12919

Authors: Aref Einizade, Dorina Thanou, Fragkiskos D. Malliaros, Jhony H. Giraldo

Abstract: Simplicial complexes provide a powerful framework for modeling high-order interactions in structured data, making them particularly suitable for applications such as trajectory prediction and mesh processing. However, existing simplicial neural networks (SNNs), whether convolutional or attention-based, rely primarily on discrete filtering techniques, which can be restrictive. In contrast, partial differential equations (PDEs) on simplicial complexes offer a principled approach to capture continuous dynamics in such structures. In this work, we introduce COntinuous SiMplicial neural netwOrkS (COSMOS), a novel SNN architecture derived from PDEs on simplicial complexes. We provide theoretical and experimental justifications of COSMOS's stability under simplicial perturbations. Furthermore, we investigate the over-smoothing phenomenon, a common issue in geometric deep learning, demonstrating that COSMOS offers better control over this effect than discrete SNNs. Our experiments on real-world datasets of ocean trajectory prediction and regression on partial deformable shapes demonstrate that COSMOS achieves competitive performance compared to state-of-the-art SNNs in complex and noisy environments.

Comment: The paper introduces a novel architecture for simplicial neural networks derived from PDEs, which aligns with architectural innovations and addresses over-smoothing in geometric deep learning.

Relevance: 8 Novelty: 8

18. Proof-Driven Clause Learning in Neural Network Verification

ArXiv ID: 2503.12083

Authors: Omri Isac, Idan Refaeli, Haoze Wu, Clark Barrett, Guy Katz

Abstract: The widespread adoption of deep neural networks (DNNs) requires efficient techniques for safety verification. Existing methods struggle to scale to real-world DNNs, and tremendous efforts are being put into improving their scalability. In this work, we propose an approach for improving the scalability of DNN verifiers using Conflict-Driven Clause Learning (CDCL), an approach that has proven highly successful in SAT and SMT solving. We present a novel algorithm for deriving conflict clauses using UNSAT proofs, and propose several optimizations for expediting it. Our approach allows a modular integration of SAT solvers and DNN verifiers, and we implement it on top of an interface designed for this purpose. The evaluation of our implementation over several benchmarks suggests a 2X to 3X improvement over a similar approach, with specific cases outperforming the state of the art.

Comment: The paper proposes a novel conflict-driven clause learning approach for DNN verification, which aligns with foundational research in model efficiency and scalability.

Relevance: 8 Novelty: 8

19. Quantum-Enhanced LLM Efficient Fine Tuning

ArXiv ID: 2503.12790

Authors: Xiaofei Kong, Lei Li, Menghan Dou, Zhaoyun Chen, Yuchun Wu, Guoping Guo

Abstract: Low-Rank Adaptation (LoRA) enables efficient fine-tuning of pre-trained language models via low-rank matrix approximation, which is effective in many scenarios. However, its low-rank representation capacity is constrained in complex tasks or high-rank dependency settings, potentially limiting model adaptability. Addressing the expressive bottleneck of classical low-rank approximation in fine-tuning large language models, this paper proposes a parameter-efficient fine-tuning method based on a Quantum Weighted Tensor Hybrid Network (QWTHN), which leverages a Quantum Neural Network (QNN). The study investigates quantum-classical hybrid parameter-efficient fine-tuning in low-rank spaces. QWTHN decomposes pre-trained weights into quantum neural network and tensor network representations, utilizing quantum state superposition and other methods to break through classical rank limitations. Experiments show that the proposed quantum fine-tuning technique for large models approaches or even surpasses the parameter efficiency of LoRA. On the CPsyCounD and R1-Distill-SFT datasets, QWTHN, compared to classical LoRA, reduces training loss by up to 15% while using 76% fewer parameters, and achieves an 8.4% performance improvement on the CPsyCounD test set. This research not only realizes lightweight and efficient adaptation of quantum resources to billion-parameter models but also validates the practical path of quantum hardware driven by large model tasks, laying the first engineering-ready technical foundation for future quantum-enhanced AGI systems.

Comment: The paper proposes a quantum-enhanced fine-tuning method, which aligns with model compression and efficiency breakthroughs, particularly in low-rank approaches.

Relevance: 8 Novelty: 8

20. Verification Learning: Make Unsupervised Neuro-Symbolic System Feasible

ArXiv ID: 2503.12917

Authors: Lin-Han Jia, Wen-Chao Hu, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li

Abstract: The current Neuro-Symbolic (NeSy) Learning paradigm suffers from an over-reliance on labeled data. If we completely disregard labels, it leads to less symbol information, a larger solution space, and more shortcuts, issues that current NeSy systems cannot resolve. This paper introduces a novel learning paradigm, Verification Learning (VL), which addresses this challenge by transforming the label-based reasoning process in NeSy into a label-free verification process. VL achieves excellent learning results solely by relying on unlabeled data and a function that verifies whether the current predictions conform to the rules. We formalize this problem as a Constraint Optimization Problem (COP) and propose a Dynamic combinatorial Sorting (DCS) algorithm that accelerates the solution by reducing verification attempts, effectively lowering computational costs to the level of a Constraint Satisfaction Problem (CSP). To further enhance performance, we introduce a prior alignment method to address potential shortcuts. Our theoretical analysis points out which tasks in NeSy systems can be completed without labels and explains why rules can replace infinite labels, such as in addition, for some tasks, while for others, like Sudoku, the rules have no effect. We validate the proposed framework through several fully unsupervised tasks including addition, sort, match, and chess, each showing significant performance and efficiency improvements.

Comment: The paper introduces a novel verification learning paradigm for neuro-symbolic systems, which aligns with emerging trends in foundational AI research.

Relevance: 8 Novelty: 8

21. TNCSE: Tensor's Norm Constraints for Unsupervised Contrastive Learning of Sentence Embeddings

ArXiv ID: 2503.12739

Authors: Tianyu Zong, Bingkang Shi, Hongzhu Yi, Jungang Xu

Abstract: Unsupervised sentence embedding representation has become a hot research topic in natural language processing. As a tensor, sentence embedding has two critical properties: direction and norm. Existing works have been limited to constraining only the orientation of the samples' representations while ignoring the features of their module lengths. To address this issue, we propose a new training objective that optimizes the training of unsupervised contrastive learning by constraining the module length features between positive samples. We combine the training objective of Tensor's Norm Constraints with ensemble learning to propose a new Sentence Embedding representation framework, TNCSE. We evaluate seven semantic text similarity tasks, and the results show that TNCSE and derived models are the current state-of-the-art approach; in addition, we conduct extensive zero-shot evaluations, and the results show that TNCSE outperforms other baselines.

Comment: The paper proposes a novel unsupervised contrastive learning framework for sentence embeddings, which aligns with representation learning and introduces tensor norm constraints.

Relevance: 8 Novelty: 8
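
The distinction between direction and norm that entry 21 builds on can be illustrated with a small sketch. It shows only that cosine similarity ignores vector length, so two embeddings can be perfectly aligned while their norms differ. The `norm_gap` function is a hypothetical penalty introduced here for illustration; the abstract does not specify TNCSE's actual training objective.

```python
import math


def norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(u, v):
    """Direction-only similarity: invariant to rescaling u or v."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def norm_gap(u, v):
    """Hypothetical penalty (an assumption): absolute difference of norms."""
    return abs(norm(u) - norm(v))


if __name__ == "__main__":
    u = [3.0, 4.0]   # norm 5
    v = [6.0, 8.0]   # same direction, norm 10
    print("cosine similarity:", cosine_similarity(u, v))
    print("norm gap:", norm_gap(u, v))
```

A purely direction-based contrastive objective would treat `u` and `v` as an ideal positive pair, while a norm-aware term would still register a difference between them.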

22. FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

ArXiv ID: 2503.12649

Authors: Hao Mark Chen, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan

Abstract: Model merging has emerged as a promising approach for multi-task learning (MTL), offering a data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned foundation models, existing model merging methods face two key limitations: (i) They are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information, (ii) They struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model in the pool to minimize a linear approximation of the objective function and then executes a local merging similar to the Frank-Wolfe update. The objective function is designed to capture the desired behavior of the target-merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging serves as an orthogonal technique for existing merging methods, seamlessly integrating with them to further enhance accuracy. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, while maintaining constant memory overhead, unlike the linear overhead of data-informed merging methods. Compared with the state-of-the-art approaches, FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed AdaMerging by 8.39% when merging 20 ViT models.

Comment: The paper introduces FW-Merging, a novel optimization-based approach for model merging, which aligns with foundational research in model efficiency and scaling.

Relevance: 8 Novelty: 8
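
The selection-then-update loop described in entry 22 follows the shape of classic Frank-Wolfe optimization over the convex hull of a finite set of points. The sketch below shows that classic loop on toy parameter vectors, not the paper's method: the quadratic objective, the 2/(t+2) step size, and all names (`frank_wolfe_merge`, `grad_fn`, `candidates`) are assumptions made for illustration, and the paper's actual objective and local merging step are not given in the abstract.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def frank_wolfe_merge(candidates, grad_fn, init, steps=400):
    """Classic Frank-Wolfe over the convex hull of `candidates`.

    candidates: list of parameter vectors (stand-ins for fine-tuned models).
    grad_fn:    returns the gradient of the objective at a parameter vector.
    init:       starting parameter vector.
    """
    theta = list(init)
    for t in range(steps):
        g = grad_fn(theta)
        # Linear minimization step: pick the candidate that minimizes the
        # linear approximation <g, s> of the objective.
        best = min(candidates, key=lambda s: dot(g, s))
        # Standard diminishing step size (an assumption, not from the paper).
        gamma = 2.0 / (t + 2.0)
        # Convex-combination update: move theta toward the selected candidate.
        theta = [(1.0 - gamma) * th + gamma * b for th, b in zip(theta, best)]
    return theta


if __name__ == "__main__":
    # Toy objective (an assumption): 0.5 * ||theta - target||^2.
    target = [0.3, 0.3]
    grad = lambda theta: [th - tg for th, tg in zip(theta, target)]
    pool = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    print(frank_wolfe_merge(pool, grad, init=[1.0, 0.0]))
```

Each iteration keeps only the current iterate and one selected candidate in play, and the iterate always remains a convex combination of the candidates.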

23. Do you understand epistemic uncertainty? Think again! Rigorous frequentist epistemic uncertainty estimation in regression

ArXiv ID: 2503.13317

Authors: Enrico Foglia, Benjamin Bobbia, Nikita Durasov, Michael Bauerheim, Pascal Fua, Stephane Moreau, Thierry Jardin

Abstract: Quantifying model uncertainty is critical for understanding prediction reliability, yet distinguishing between aleatoric and epistemic uncertainty remains challenging. We extend recent work from classification to regression to provide a novel frequentist approach to epistemic and aleatoric uncertainty estimation. We train models to generate conditional predictions by feeding their initial output back as an additional input. This method allows for a rigorous measurement of model uncertainty by observing how prediction responses change when conditioned on the model's previous answer. We provide a complete theoretical framework to analyze epistemic uncertainty in regression in a frequentist way, and explain how it can be exploited in practice to gauge a model's uncertainty, with minimal changes to the original architecture.

Comment: The paper provides a theoretical framework for epistemic uncertainty estimation in regression, which aligns with foundational research in understanding model behavior and uncertainty quantification.