Personalized Daily Arxiv Papers 02/06/2025

	Prompt	Completion	Total
Token	81308	6734	88042
Cost	$2.03	$0.67	$2.71

Total scanned papers: 301

Total relevant papers: 24

Table of contents with paper titles:

Scaling Laws for Upcycling Mixture-of-Experts Language Models Authors: Seng Pei Liew, Takuya Kato, Sho Takase
ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model Authors: Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiaqi Wang, Mengkang Hu, Zhi Chen, Wanxiang Che, Ting Liu
RiemannGFM: Learning a Graph Foundation Model from Riemannian Geometry Authors: Li Sun, Zhenhao Huang, Suyang Zhou, Qiqi Wan, Hao Peng, Philip Yu
From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning Authors: Noa Rubin, Kirsten Fischer, Javed Lindner, David Dahmen, Inbar Seroussi, Zohar Ringel, Michael Krämer, Moritz Helias
On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation Authors: Nghiem T. Diep, Huy Nguyen, Chau Nguyen, Minh Le, Duy M. H. Nguyen, Daniel Sonntag, Mathias Niepert, Nhat Ho
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning Authors: DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, Qinqing Zheng
An Augmented Backward-Corrected Projector Splitting Integrator for Dynamical Low-Rank Training Authors: Jonas Kusch, Steffen Schotthöfer, Alexandra Walter
ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization Authors: Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra
ReGNet: Reciprocal Space-Aware Long-Range Modeling and Multi-Property Prediction for Crystals Authors: Jianan Nie, Peiyao Xiao, Kaiyi Ji, Peng Gao
Leveraging the true depth of LLMs Authors: Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
Theoretical Guarantees for Low-Rank Compression of Deep Neural Networks Authors: Shihao Zhang, Rayan Saab
Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning Authors: Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Jin Lu, Geng Yuan
PH-VAE: A Polynomial Hierarchical Variational Autoencoder Towards Disentangled Representation Learning Authors: Xi Chen, Shaofan Li
Networks with Finite VC Dimension: Pro and Contra Authors: Vera Kurkova, Marcello Sanguineti
Signature Reconstruction from Randomized Signatures Authors: Mie Glückstad, Nicola Muca Cirone, Josef Teichmann
Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting Authors: Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, Sujay Sanghavi
Rethinking Approximate Gaussian Inference in Classification Authors: Bálint Mucsányi, Nathaël Da Costa, Philipp Hennig
Maximizing the Position Embedding for Vision Transformers with Global Average Pooling Authors: Wonjun Lee, Bumsub Ham, Suhyun Kim
Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization Authors: Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim
Building Bridges between Regression, Clustering, and Classification Authors: Lawrence Stewart (DI-ENS, LIENS, Inria), Francis Bach (LIENS, SIERRA), Quentin Berthet
Taking a Big Step: Large Learning Rates in Denoising Score Matching Prevent Memorization Authors: Yu-Han Wu, Pierre Marion, Gérard Biau, Claire Boyer
Transformers and Their Roles as Time Series Foundation Models Authors: Dennis Wu, Yihan He, Yuan Cao, Jianqing Fan, Han Liu
Beyond Topological Self-Explainable GNNs: A Formal Explainability Perspective Authors: Steve Azzolin, Sagar Malhotra, Andrea Passerini, Stefano Teso
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation Authors: Jingyu Liu, Beidi Chen, Ce Zhang

1. Scaling Laws for Upcycling Mixture-of-Experts Language Models

ArXiv ID: 2502.03009

Authors: Seng Pei Liew, Takuya Kato, Sho Takase

Abstract: Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. In particular, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance for scaling upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.

Comment: Explores scaling laws for upcycling LLMs into MoE models, offering empirical insights into scaling efficiency. This aligns well with MoE-related architectural research and compression topics, particularly training efficiency.

Relevance: 10 Novelty: 8
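
The upcycling step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the experts are initialized, so this assumes the common "sparse upcycling" recipe: every expert starts as a copy of the dense feed-forward block and the router is initialized randomly. All names here (`dense_ffn`, `upcycle`, `moe_ffn`) and the toy sizes are hypothetical and not taken from the paper.

```python
import copy
import math
import random


def dense_ffn(x, w_in, w_out):
    """Two-layer ReLU feed-forward block: relu(x @ w_in) @ w_out."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*w_in)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_out)]


def upcycle(w_in, w_out, num_experts, rng):
    """Assumed recipe: copy the dense FFN into each expert, random router."""
    experts = [(copy.deepcopy(w_in), copy.deepcopy(w_out)) for _ in range(num_experts)]
    d_model = len(w_in)
    router = [[rng.gauss(0.0, 0.02) for _ in range(num_experts)] for _ in range(d_model)]
    return experts, router


def moe_ffn(x, experts, router, top_k):
    """Route x to the top_k experts and mix their outputs with softmax gates."""
    logits = [sum(xi * w for xi, w in zip(x, col)) for col in zip(*router)]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    out = [0.0] * len(experts[0][1][0])
    for i in chosen:
        expert_out = dense_ffn(x, *experts[i])
        gate = exps[i] / total  # gates over the chosen experts sum to 1
        out = [o + gate * y for o, y in zip(out, expert_out)]
    return out


rng = random.Random(0)
d_model, d_hidden = 4, 8
w_in = [[rng.gauss(0.0, 1.0) for _ in range(d_hidden)] for _ in range(d_model)]
w_out = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_hidden)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

experts, router = upcycle(w_in, w_out, num_experts=4, rng=rng)
dense_out = dense_ffn(x, w_in, w_out)
moe_out = moe_ffn(x, experts, router, top_k=2)
```

Under these assumptions the upcycled layer reproduces the dense layer's output at initialization, because the experts are identical and the gates sum to one. The experts only diverge once MoE training begins.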

2. ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model

ArXiv ID: 2502.03325

Authors: Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiaqi Wang, Mengkang Hu, Zhi Chen, Wanxiang Che, Ting Liu

Abstract: Recent advancements in large language models (LLMs) have led to significant successes across various applications, most noticeably a series of emerging capabilities, particularly in the areas of In-Context Learning (ICL) and Chain-of-Thought (CoT). To better understand and control model performance, many studies have begun investigating the underlying causes of these phenomena and their impact on task outcomes. However, existing explanatory frameworks predominantly focus on isolating and explaining ICL and CoT independently, leading to an incomplete understanding of their combined influence on model performance. To address this gap, we propose the Electronic Circuit Model (ECM), which provides a foundation for developing scalable, learnable policies and improving the management of AI-generated content. Specifically, ECM conceptualizes model behavior as an electronic circuit: ICL is represented as a semantic magnetic field that provides an additional voltage following Faraday's Law, while CoT is modeled as series resistors that constrain the model output performance following Ohm's Law. Experimental results demonstrate that the ECM effectively predicts and explains LLM performance across a variety of prompting strategies. Furthermore, we apply ECM to advanced reasoning strategy optimization on a series of tasks, such as the International Olympiad in Informatics (IOI) and the International Mathematical Olympiad (IMO), achieving competitive performance that surpasses nearly 80% of top human competitors.

Comment: Proposes a novel explanatory framework for LLM dynamics (ICL and CoT) and models them analogously to electronic circuits. This aligns closely with theoretical studies on LLMs and is quite innovative in its formulation.

Relevance: 9 Novelty: 9

3. RiemannGFM: Learning a Graph Foundation Model from Riemannian Geometry

ArXiv ID: 2502.03251

Authors: Li Sun, Zhenhao Huang, Suyang Zhou, Qiqi Wan, Hao Peng, Philip Yu

Abstract: The foundation model has heralded a new era in artificial intelligence, pretraining a single model to offer cross-domain transferability on different datasets. Graph neural networks excel at learning graph data, the omnipresent non-Euclidean structure, but often lack the generalization capacity. Hence, the graph foundation model is drawing increasing attention, and recent efforts have been made to leverage Large Language Models. On the one hand, existing studies primarily focus on text-attributed graphs, while a wider range of real graphs do not contain fruitful textual attributes. On the other hand, the sequential graph description tailored for the Large Language Model neglects the structural complexity, which is a predominant characteristic of the graph. Such limitations motivate an important question: Can we go beyond Large Language Models, and pretrain a universal model to learn the structural knowledge for any graph? The answer in the language or vision domain is a shared vocabulary. We observe that shared substructures also underlie the graph domain, which opens a new opportunity for a graph foundation model with a structural vocabulary. The key innovation is the discovery of a simple yet effective structural vocabulary of trees and cycles, and we explore its inherent connection to Riemannian geometry. Herein, we present a universal pretraining model, RiemannGFM. Concretely, we first construct a novel product bundle to incorporate the diverse geometries of the vocabulary. Then, on this constructed space, we stack Riemannian layers where the structural vocabulary, regardless of the specific graph, is learned on a Riemannian manifold, offering cross-domain transferability. Extensive experiments show the effectiveness of RiemannGFM on a diversity of real graphs.

Comment: Proposes a foundational graph model drawing from Riemannian geometry and structural vocabulary, aligning well with model architecture and generalization across domains. Very novel in approach.

Relevance: 9 Novelty: 9

4. From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning

ArXiv ID: 2502.03210

Authors: Noa Rubin, Kirsten Fischer, Javed Lindner, David Dahmen, Inbar Seroussi, Zohar Ringel, Michael Krämer, Moritz Helias

Abstract: Theoretically describing feature learning in neural networks is crucial for understanding their expressive power and inductive biases, motivating various approaches. Some approaches describe network behavior after training through a simple change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving complex directional changes to the kernel. While these approaches capture different facets of network behavior, their relationship and respective strengths across scaling regimes remain an open question. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these approaches. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network's probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output of a linear network. However, even in this case, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.

Comment: Presents a theoretical framework that bridges kernel-based and feature-adaptive learning, contributing to representation learning through a multi-scale theoretical approach. Highly relevant to model understanding and feature learning.

Relevance: 9 Novelty: 9

5. On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

ArXiv ID: 2502.03029

Authors: Nghiem T. Diep, Huy Nguyen, Chau Nguyen, Minh Le, Duy M. H. Nguyen, Daniel Sonntag, Mathias Niepert, Nhat Ho

Abstract: The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
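
For readers unfamiliar with zero-initialized attention, the following is a minimal single-query sketch of the idea as it is commonly described for LLaMA-Adapter: attention weights over the learnable prompt are computed with their own softmax and scaled by a gating factor that starts at zero. The function and variable names are hypothetical, the `tanh` squashing of the gate is an assumption, and the toy vectors are illustrative only; none of this is taken from the paper summarized here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, values):
    """Sum of value vectors, each scaled by its attention weight."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def zero_init_attention(query, token_keys, token_values,
                        prompt_keys, prompt_values, gate):
    """Attention over tokens plus a gated contribution from prompt vectors."""
    scale = 1.0 / math.sqrt(len(query))
    token_w = softmax([dot(query, k) * scale for k in token_keys])
    prompt_w = softmax([dot(query, k) * scale for k in prompt_keys])
    # Assumption: the gate is squashed with tanh; gate = 0 silences the prompt.
    prompt_w = [math.tanh(gate) * w for w in prompt_w]
    token_part = weighted_sum(token_w, token_values)
    prompt_part = weighted_sum(prompt_w, prompt_values)
    return [t + p for t, p in zip(token_part, prompt_part)]


query = [1.0, 0.0]
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 2.0], [3.0, 4.0]]
prompt_keys = [[0.5, 0.5]]
prompt_values = [[10.0, 10.0]]

scale = 1.0 / math.sqrt(len(query))
vanilla = weighted_sum(softmax([dot(query, k) * scale for k in token_keys]),
                       token_values)
at_init = zero_init_attention(query, token_keys, token_values,
                              prompt_keys, prompt_values, gate=0.0)
after_training = zero_init_attention(query, token_keys, token_values,
                                     prompt_keys, prompt_values, gate=1.0)
```

With the gate at zero the output matches vanilla attention over the tokens, which is the stabilizing property the abstract refers to. A nonzero gate lets the prompt contribute.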

Comment: Focuses on zero-initialized attention and its theoretical ties to mixture-of-experts (MoE) models, investigating optimal prompts and gating factors. Provides both theoretical insights and experiments, aligning with the architectural and representation-learning criteria.