Personalized Daily Arxiv Papers 03/07/2025

[gpt-4o]	Prompt	Completion	Total
Token	41894	5862	47756
Cost	$0.1	$0.06	$0.16

Total ArXiv papers: 523

Total scanned papers: 281

Total relevant papers: 30

Table of contents with paper titles:

A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers Authors: William Merrill, Ashish Sabharwal
L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling Authors: Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining Authors: Houyi Li, Wenzheng Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Yangshijie Xu, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Ability Authors: Lijia Yu, Yibo Miao, Yifan Zhu, Xiao-Shan Gao, Lijun Zhang
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions Authors: Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig
SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning Authors: Chen Li, Yinyi Luo, Anudeep Bolimera, Marios Savvides
Learning Causal Response Representations through Direct Effect Analysis Authors: Homer Durand, Gherardo Varando, Gustau Camps-Valls
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization Authors: Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
Causally Reliable Concept Bottleneck Models Authors: Giovanni De Felice, Arianna Casanova Flores, Francesco De Santis, Silvia Santini, Johannes Schneider, Pietro Barbiero, Alberto Termine
Transferable Foundation Models for Geometric Tasks on Point Cloud Representations: Geometric Neural Operators Authors: Blaine Quackenbush, Paul J. Atzberger
Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling Authors: Yan Li, Pengfei Zheng, Shuang Chen, Zewei Xu, Yunfei Du, Zhengang Wang
Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size Authors: Alireza Behtash, Marijan Fofonjka, Ethan Baird, Tyler Mauer, Hossein Moghimifam, David Stout, Joel Dennison
Activation Space Interventions Can Be Transferred Between Large Language Models Authors: Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah
How can representation dimension dominate structurally pruned LLMs? Authors: Mingxue Xu, Lisa Alazraki, Danilo P. Mandic
Enough Coin Flips Can Make LLMs Act Bayesian Authors: Ritwik Gupta, Rodolfo Corona, Jiaxin Ge, Eric Wang, Dan Klein, Trevor Darrell, David M. Chan
Provable Robust Overfitting Mitigation in Wasserstein Distributionally Robust Optimization Authors: Shuang Liu, Yihan Wang, Yifan Zhu, Yibo Miao, Xiao-Shan Gao
Generative Learning of Densities on Manifolds Authors: Dimitris G. Giovanis, Ellis Crabtree, Roger G. Ghanem, Ioannis G. Kevrekidis
An optimal Petrov-Galerkin framework for operator networks Authors: Philip Charles, Deep Ray, Yue Yu, Joost Prins, Hugo Melchers, Michael R. A. Abdelmalik, Jeffrey Cochran, Assad A. Oberai, Thomas J. R. Hughes, Mats G. Larson
Sample-Optimal Agnostic Boosting with Unlabeled Data Authors: Udaya Ghai, Karan Singh
Generalized Interpolating Discrete Diffusion Authors: Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
All-atom Diffusion Transformers: Unified generative modelling of molecules and materials Authors: Chaitanya K. Joshi, Xiang Fu, Yi-Lun Liao, Vahe Gharakhanyan, Benjamin Kurt Miller, Anuroop Sriram, Zachary W. Ulissi
Quantitative Flow Approximation Properties of Narrow Neural ODEs Authors: Karthik Elamvazhuthi
Boosting Offline Optimizers with Surrogate Sensitivity Authors: Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
LEWIS (LayEr WIse Sparsity) -- A Training Free Guided Model Merging Approach Authors: Hetarth Chopra, Vidhi Rambhia, Vikram Adve
An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding Authors: Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu
Simple Self Organizing Map with Visual Transformer Authors: Alan Luo, Kaiwen Yuan
Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows Authors: Xiangxin Zhou, Yi Xiao, Haowei Lin, Xinheng He, Jiaqi Guan, Yang Wang, Qiang Liu, Feng Zhou, Liang Wang, Jianzhu Ma
IDInit: A Universal and Stable Initialization Method for Neural Network Training Authors: Yu Pan, Chaozheng Wang, Zekai Wu, Qifan Wang, Min Zhang, Zenglin Xu
Bi-Lipschitz Ansatz for Anti-Symmetric Functions Authors: Nadav Dym, Jianfeng Lu, Matan Mizrachi
Continual Optimization with Symmetry Teleportation for Multi-Task Learning Authors: Zhipeng Zhou, Ziqiao Meng, Pengcheng Wu, Peilin Zhao, Chunyan Miao

1. A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

ArXiv ID: 2503.03961

Authors: William Merrill, Ashish Sabharwal

Abstract: Recent theoretical results show transformers cannot express sequential reasoning problems over long input lengths, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to what degree bounded depth may suffice for solving problems over short inputs, or how increasing the transformer's depth affects its expressive power. We address these questions by analyzing the expressive power of transformers whose depth can grow minimally with context length $n$. We show even highly uniform transformers with depth $\Theta(\log n)$ can express two important problems: recognizing regular languages, which captures state tracking abilities, and graph connectivity, which underlies multi-step reasoning. Notably, neither of these problems can be expressed by fixed-depth transformers under standard complexity conjectures, demonstrating the expressivity benefit of growing depth. Moreover, our theory quantitatively predicts how depth must grow with input length to express these problems, showing that depth scaling is more efficient than scaling width or chain-of-thought steps. Empirically, we find our theoretical depth requirements for regular language recognition match the practical depth requirements of transformers remarkably well. Thus, our results clarify precisely how depth affects transformers' reasoning capabilities, providing potential practical insights for designing models that are better at sequential reasoning.

Comment: The paper provides theoretical insights into the expressive power of log-depth transformers, directly addressing foundational questions about model architecture and depth scaling.

Relevance: 10 Novelty: 9
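
Illustration (not from the paper): one way to build intuition for why depth on the order of log n can suffice for state tracking is that automaton transitions compose associatively, so n per-symbol transitions can be merged pairwise in about log2(n) rounds. The sketch below is only an analogy under that assumption, not the authors' transformer construction; the parity automaton and the names `TRANSITIONS`, `compose`, `run_log_depth`, and `run_sequential` are hypothetical.

```python
# Hypothetical example automaton: parity of the number of 'a' symbols.
# States are 0 (even) and 1 (odd). Each symbol maps to a tuple t where
# t[s] is the next state when the automaton is in state s.
TRANSITIONS = {"a": (1, 0), "b": (0, 1)}


def compose(f, g):
    """Return the transition 'apply f, then g' as a tuple."""
    return tuple(g[s] for s in f)


def run_log_depth(word, start=0):
    """Merge per-symbol transitions pairwise; return (final_state, rounds)."""
    layer = [TRANSITIONS[ch] for ch in word]
    rounds = 0
    while len(layer) > 1:
        # One round: merge neighbouring pairs. The merges within a round
        # are independent of each other, so they could run in parallel.
        merged = [compose(layer[i], layer[i + 1])
                  for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            merged.append(layer[-1])  # odd element carries over unchanged
        layer = merged
        rounds += 1
    final = layer[0][start] if layer else start
    return final, rounds


def run_sequential(word, start=0):
    """Reference implementation: one step per symbol, n steps in total."""
    state = start
    for ch in word:
        state = TRANSITIONS[ch][state]
    return state
```

For a word of length 8, `run_log_depth` finishes in 3 merge rounds, whereas `run_sequential` takes 8 steps; both return the same final state.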

2. L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling

ArXiv ID: 2503.04725

Authors: Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić

Abstract: We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.

Comment: The paper establishes a mutual information scaling law for long-context language modeling, which provides theoretical insights into LLM behavior and aligns with the LLM criterion.

Relevance: 9 Novelty: 9

3. Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

ArXiv ID: 2503.04715

Authors: Houyi Li, Wenzheng Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Yangshijie Xu, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

Abstract: The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. Our analysis reveals a convex optimization landscape for hyperparameters under fixed model and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.07% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To the best of our knowledge, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, as well as establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository https://step-law.github.io/

Comment: The paper establishes scaling laws for hyperparameters in LLM pretraining, providing theoretical insights into model optimization and aligning with foundational research in LLM behavior.

Relevance: 9 Novelty: 9
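
Illustration (not from the paper): a power law such as "optimal batch size scales with data size" has the form B = c * D^gamma, which becomes a straight line in log-log space and can be fitted by ordinary least squares. The sketch below assumes that form; the function `fit_power_law`, the synthetic data, and the values c = 0.5 and gamma = 0.5 are made up for the example and are not the coefficients reported by the authors.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**gamma in log-log space.

    Returns (c, gamma). Assumes all xs and ys are strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    gamma = sxy / sxx               # slope of the log-log line
    c = math.exp(my - gamma * mx)   # intercept, mapped back from log space
    return c, gamma


# Synthetic, noise-free data from ASSUMED coefficients (illustrative only,
# not values from the paper).
tokens = [1e9, 1e10, 1e11, 1e12]
batch = [0.5 * d ** 0.5 for d in tokens]

c_hat, gamma_hat = fit_power_law(tokens, batch)
```

A learning-rate law that depends on both model parameters and data size would add a second regressor (log of parameter count) to the same log-linear fit.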

4. Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Ability

ArXiv ID: 2503.04111

Authors: Lijia Yu, Yibo Miao, Yifan Zhu, Xiao-Shan Gao, Lijun Zhang

Abstract: The primary objective of learning methods is generalization. Classic uniform generalization bounds, which rely on VC-dimension or Rademacher complexity, fail to explain the notable fact that over-parameterized models in deep learning generalize well. On the other hand, algorithm-dependent generalization bounds, like stability bounds, often rely on strict assumptions. To establish generalizability under less stringent assumptions, this paper investigates the generalizability of neural networks that minimize or approximately minimize empirical risk. We establish a lower bound for population accuracy based on the expressiveness of these networks, which indicates that with an adequately large number of training samples and adequately large network sizes, these networks, including over-parameterized ones, can generalize effectively. Additionally, we provide a necessary condition for generalization, demonstrating that, for certain data distributions, the quantity of training data required to ensure generalization exceeds the network size needed to represent the corresponding data distribution. Finally, we provide theoretical insights into several phenomena in deep learning, including robust generalization, importance of over-parameterization, and effect of loss function on generalization.

Comment: The paper provides theoretical insights into generalizability based on expressiveness, directly addressing foundational questions in representation learning and over-parameterization.