Personalized Daily Arxiv Papers 03/06/2025

[gpt-4o]	Prompt	Completion	Total
Token	31316	4414	35730
Cost	$0.08	$0.04	$0.12

Total ArXiv papers: 442

Total scanned papers: 236

Total relevant papers: 17

Table of contents with paper titles:

Convergence Rates for Softmax Gating Mixture of Experts Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo
Conformal Transformations for Symmetric Power Transformers Authors: Saurabh Kumar, Jacob Buckman, Carles Gelada, Sean Zhang
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs Authors: Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman
Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability Authors: Chenhui Xu, Dancheng Liu, Jiajie Li, Amir Nassereldine, Zhaohui Li, Jinjun Xiong
Effective LLM Knowledge Learning via Model Generalization Authors: Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention Authors: Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang
Towards Understanding Distilled Reasoning Models: A Representational Approach Authors: David D. Baek, Max Tegmark
Process-based Self-Rewarding Language Models Authors: Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong
Early-Stopped Mirror Descent for Linear Regression over Convex Bodies Authors: Tobias Wegel, Gil Kur, Patrick Rebeschini
AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model Authors: Wenlun Zhang, Shimpei Ando, Kentaro Yoshioka
Feature Matching Intervention: Leveraging Observational Data for Causal Representation Learning Authors: Haoze Li, Jun Xie
State-offset Tuning: State-based Parameter-Efficient Fine-Tuning for State Space Models Authors: Wonjun Kang, Kevin Galim, Yuchen Zeng, Minjae Lee, Hyung Il Koo, Nam Ik Cho
Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction Authors: Gustaw Opiełka, Hannes Rosenbusch, Claire E. Stevenson
Conceptualizing Uncertainty Authors: Isaac Roberts, Alexander Schulz, Sarah Schroeder, Fabian Hinder, Barbara Hammer
See What You Are Told: Visual Attention Sink in Large Multimodal Models Authors: Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang
Integrating Predictive and Generative Capabilities by Latent Space Design via the DKL-VAE Model Authors: Boris N. Slautin, Utkarsh Pratiush, Doru C. Lupascu, Maxim A. Ziatdinov, Sergei V. Kalinin
Partial Convolution Meets Visual Attention Authors: Haiduo Huang, Fuwei Yang, Dong Li, Ji Liu, Lu Tian, Jinzhang Peng, Pengju Ren, Emad Barsoum

1. Convergence Rates for Softmax Gating Mixture of Experts

ArXiv ID: 2503.03213

Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Abstract: Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. Despite its widespread use in practice, a comprehensive study on the effects of the softmax gating on the MoE has been lacking in the literature. To bridge this gap in this paper, we perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating, respectively. Furthermore, our theories also provide useful insights into the design of sample-efficient expert structures. In particular, we demonstrate that it requires polynomially many data points to estimate experts satisfying our proposed strong identifiability condition, namely a commonly used two-layer feed-forward network. In stark contrast, estimating linear experts, which violate the strong identifiability condition, necessitates exponentially many data points as a result of intrinsic parameter interactions expressed in the language of partial differential equations. All the theoretical results are substantiated with a rigorous guarantee.
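For readers unfamiliar with the setup, the softmax-gated MoE described in the abstract can be sketched as below. This is a minimal illustration, not the paper's code: it shows only the forward pass (gate weights times expert outputs), not the estimation analysis. All function names are hypothetical, outputs are scalar for brevity, and the tanh activation in the two-layer expert is an assumption, since the abstract does not name one.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(x: Vector, gate_w: Sequence[Vector], gate_b: Sequence[float]) -> List[float]:
    """Softmax gate: one affine score per expert, normalized to sum to 1."""
    return softmax([dot(w, x) + b for w, b in zip(gate_w, gate_b)])


def linear_expert(a: Vector, b: float) -> Callable[[Vector], float]:
    """Linear expert x -> a.x + b (the class the abstract says is hard to estimate)."""
    return lambda x: dot(a, x) + b


def ffn_expert(w1: Sequence[Vector], v: Sequence[float]) -> Callable[[Vector], float]:
    """Two-layer feed-forward expert; tanh is an assumed activation."""
    return lambda x: sum(vj * math.tanh(dot(wj, x)) for wj, vj in zip(w1, v))


def moe_predict(x: Vector, gate_w: Sequence[Vector], gate_b: Sequence[float],
                experts: Sequence[Callable[[Vector], float]]) -> float:
    """MoE output: gate-weighted sum of the expert outputs."""
    weights = gate_weights(x, gate_w, gate_b)
    return sum(g * f(x) for g, f in zip(weights, experts))


# Toy example: a zero gate gives equal weights, so the output is the plain
# average of the two expert outputs (1.0 and -2.0).
x = [1.0, -2.0]
experts = [linear_expert([1.0, 0.0], 0.0), linear_expert([0.0, 1.0], 0.0)]
y = moe_predict(x, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], experts)
```

Swapping linear_expert for ffn_expert changes only the expert class, which is the design choice the abstract ties to polynomial versus exponential sample requirements.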

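Comment: The paper provides a theoretical analysis of softmax gating in Mixture of Experts (MoE), which bears directly on architecture design and sample efficiency. Its convergence rates for parameter and expert estimation, and its contrast between two-layer feed-forward experts (polynomially many data points) and linear experts (exponentially many), are highly relevant.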

Relevance: 10 Novelty: 9

2. Conformal Transformations for Symmetric Power Transformers

ArXiv ID: 2503.03269

Authors: Saurabh Kumar, Jacob Buckman, Carles Gelada, Sean Zhang

Abstract: Transformers with linear attention offer significant computational advantages over softmax-based transformers but often suffer from degraded performance. The symmetric power (sympow) transformer, a particular type of linear transformer, addresses some of this performance gap by leveraging symmetric tensor embeddings, achieving comparable performance to softmax transformers. However, the finite capacity of the recurrent state in sympow transformers limits their ability to retain information, leading to performance degradation when scaling the training or evaluation context length. To address this issue, we propose the conformal-sympow transformer, which dynamically frees up capacity using data-dependent multiplicative gating and adaptively stores information using data-dependent rotary embeddings. Preliminary experiments on the LongCrawl64 dataset demonstrate that conformal-sympow overcomes the limitations of sympow transformers, achieving robust performance across scaled training and evaluation contexts.
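The recurrent-state mechanism the abstract refers to can be illustrated with a small sketch, under several simplifying assumptions: the feature map is a plain degree-2 tensor power (the full outer product, not the compressed symmetric embedding), the gates are passed in as numbers in [0, 1] instead of being computed from the data, and rotary embeddings are left out. The function names and the eps term are hypothetical, and this is not the authors' implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def power2_features(x: Vector) -> List[float]:
    """Flattened outer product of x with itself, so phi(q).phi(k) == (q.k)**2."""
    return [xi * xj for xi in x for xj in x]


def gated_power_attention(queries: Sequence[Vector], keys: Sequence[Vector],
                          values: Sequence[Vector], gates: Sequence[float],
                          eps: float = 1e-9) -> List[List[float]]:
    """Causal linear attention in recurrent form with a multiplicative gate.

    state_t = g_t * state_{t-1} + phi(k_t) v_t^T   (fixed-size memory)
    norm_t  = g_t * norm_{t-1}  + phi(k_t)         (normalizer)
    y_t     = phi(q_t)^T state_t / (phi(q_t)^T norm_t + eps)
    """
    d_phi = len(keys[0]) ** 2
    d_v = len(values[0])
    state = [[0.0] * d_v for _ in range(d_phi)]
    norm = [0.0] * d_phi
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        phi_k = power2_features(k)
        for i in range(d_phi):
            # A gate below 1 shrinks old content, freeing capacity for new tokens.
            norm[i] = g * norm[i] + phi_k[i]
            for j in range(d_v):
                state[i][j] = g * state[i][j] + phi_k[i] * v[j]
        phi_q = power2_features(q)
        denom = sum(a * b for a, b in zip(phi_q, norm)) + eps
        outputs.append([
            sum(phi_q[i] * state[i][j] for i in range(d_phi)) / denom
            for j in range(d_v)
        ])
    return outputs


# Toy example: with gates of 1 the second output averages both values;
# with a gate of 0 at step two, the earlier token is forgotten entirely.
qs = [[1.0, 0.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[1.0], [3.0]]
keep_all = gated_power_attention(qs, ks, vs, [1.0, 1.0])
forget = gated_power_attention(qs, ks, vs, [1.0, 0.0])
```

The state has a fixed size regardless of sequence length, which is the capacity limit the abstract describes; the gate is the only mechanism in this sketch that makes room for new information.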

Comment: The paper introduces a novel architectural improvement to linear transformers: it addresses the finite recurrent-state capacity of symmetric power transformers using conformal transformations, realized as data-dependent multiplicative gating and rotary embeddings. This matches the 'Model Architecture' criterion.