Personalized Daily ArXiv Papers 2025-05-05

[gpt-4o]	Prompt	Completion	Total
Token	32232	4357	36589
Cost	$0.08	$0.04	$0.12

Total arXiv papers: 336

Total scanned papers: 217

Total relevant papers: 11

Table of contents with paper titles:

1. Improving Routing in Sparse Mixture of Experts with Graph of Tokens
   Authors: Tam Nguyen, Ngoc N. Tran, Khai Nguyen, Richard G. Baraniuk
2. CoCoAFusE: Beyond Mixtures of Experts via Model Fusion
   Authors: Aurelio Raffa Ugolini, Mara Tanelli, Valentina Breschi
3. Multi-Step Consistency Models: Fast Generation with Theoretical Guarantees
   Authors: Nishant Jain, Xunpeng Huang, Yian Ma, Tong Zhang
4. How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
   Authors: Ruiquan Huang, Yingbin Liang, Jing Yang
5. Dual Filter: A Mathematical Framework for Inference using Transformer-like Architectures
   Authors: Heng-Sheng Chang, Prashant G. Mehta
6. ICQuant: Index Coding enables Low-bit LLM Quantization
   Authors: Xinlin Li, Osama Hanna, Christina Fragouli, Suhas Diggavi
7. StablePCA: Learning Shared Representations across Multiple Sources via Minimax Optimization
   Authors: Zhenyu Wang, Molei Liu, Jing Lei, Francis Bach, Zijian Guo
8. Compact Recurrent Transformer with Persistent Memory
   Authors: Edison Mucllari, Zachary Daniels, David Zhang, Qiang Ye
9. Learning and Transferring Physical Models through Derivatives
   Authors: Alessandro Trenta, Andrea Cossu, Davide Bacciu
10. Incorporating Inductive Biases to Energy-based Generative Models
   Authors: Yukun Li, Li-Ping Liu
11. Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders
   Authors: Rogelio A Mancisidor, Robert Jenssen, Shujian Yu, Michael Kampffmeyer

1. Improving Routing in Sparse Mixture of Experts with Graph of Tokens

ArXiv ID: 2505.00792

Authors: Tam Nguyen, Ngoc N. Tran, Khai Nguyen, Richard G. Baraniuk

Abstract: Sparse Mixture of Experts (SMoE) has emerged as a key to achieving unprecedented scalability in deep learning. By activating only a small subset of parameters per sample, SMoE achieves an exponential increase in parameter counts while maintaining a constant computational overhead. However, SMoE models are susceptible to routing fluctuations--changes in the routing of a given input to its target expert--at the late stage of model training, leading to model non-robustness. In this work, we unveil the limitation of SMoE through the perspective of the probabilistic graphical model (PGM). Through this PGM framework, we highlight the independence in the expert-selection of tokens, which exposes the model to routing fluctuation and non-robustness. Alleviating this independence, we propose the novel Similarity-Aware (S)MoE, which considers interactions between tokens during expert selection. We then derive a new PGM underlying an (S)MoE-Attention block, going beyond just a single (S)MoE layer. Leveraging the token similarities captured by the attention matrix, we propose the innovative Attention-Aware (S)MoE, which employs the attention matrix to guide the routing of tokens to appropriate experts in (S)MoE. We theoretically prove that Similarity/Attention-Aware routing helps reduce the entropy of expert selection, resulting in more stable token routing mechanisms. We empirically validate our models on various tasks and domains, showing significant improvements in reducing routing fluctuations, enhancing accuracy, and increasing model robustness over the baseline MoE-Transformer with token routing via softmax gating.
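Illustration (not from the paper): the contrast between independent softmax gating and routing that takes token similarity into account can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the toy `tokens`, the identity-like `router` weights, and in particular the mixing rule in `similarity_aware_probs`, which simply averages gate distributions under a softmax-normalized dot-product similarity. It does not reproduce the paper's PGM derivation, its exact routing rule, or its entropy result.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gate_probs(tokens, router):
    """Baseline softmax gating: each token is scored on its own."""
    return [softmax([dot(row, tok) for row in router]) for tok in tokens]

def similarity_matrix(tokens):
    """Row-stochastic token-token similarity (softmax of dot products)."""
    return [softmax([dot(t, u) for u in tokens]) for t in tokens]

def similarity_aware_probs(tokens, router):
    """Hypothetical variant: blend each token's gate distribution
    with those of the tokens it is similar to."""
    probs = gate_probs(tokens, router)
    sim = similarity_matrix(tokens)
    n_tokens, n_experts = len(tokens), len(router)
    return [[sum(sim[i][j] * probs[j][e] for j in range(n_tokens))
             for e in range(n_experts)]
            for i in range(n_tokens)]

def top1(probs):
    """Index of the highest-probability expert for each token."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

# Toy data (assumed): two similar tokens and one dissimilar token, two experts.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
router = [[1.0, 0.0], [0.0, 1.0]]  # one weight row per expert

baseline_routes = top1(gate_probs(tokens, router))
aware_probs = similarity_aware_probs(tokens, router)
aware_routes = top1(aware_probs)
print(baseline_routes, aware_routes)
```

In the baseline, a token's expert depends only on that token; in the hypothetical variant, the two similar tokens also influence each other's gate distribution, which is the kind of token interaction the abstract describes.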

Comment: This paper addresses routing stability in Sparse Mixture of Experts (SMoE) through novel probabilistic graphical modeling and attention-aware mechanisms, directly contributing to foundational research in model architecture and sparsity.

Relevance: 10 Novelty: 8

2. CoCoAFusE: Beyond Mixtures of Experts via Model Fusion

ArXiv ID: 2505.01105

Authors: Aurelio Raffa Ugolini, Mara Tanelli, Valentina Breschi

Abstract: Many learning problems involve multiple patterns and varying degrees of uncertainty dependent on the covariates. Advances in Deep Learning (DL) have addressed these issues by learning highly nonlinear input-output dependencies. However, model interpretability and Uncertainty Quantification (UQ) have often straggled behind. In this context, we introduce the Competitive/Collaborative Fusion of Experts (CoCoAFusE), a novel, Bayesian Covariates-Dependent Modeling technique. CoCoAFusE builds on the very philosophy behind Mixtures of Experts (MoEs), blending predictions from several simple sub-models (or "experts") to achieve high levels of expressiveness while retaining a substantial degree of local interpretability. Our formulation extends that of a classical Mixture of Experts by contemplating the fusion of the experts' distributions in addition to their more usual mixing (i.e., superimposition). Through this additional feature, CoCoAFusE better accommodates different scenarios for the intermediate behavior between generating mechanisms, resulting in tighter credible bounds on the response variable. Indeed, only resorting to mixing, as in classical MoEs, may lead to multimodality artifacts, especially over smooth transitions. Instead, CoCoAFusE can avoid these artifacts even under the same structure and priors for the experts, leading to greater expressiveness and flexibility in modeling. This new approach is showcased extensively on a suite of motivating numerical examples and a collection of real-data ones, demonstrating its efficacy in tackling complex regression problems where uncertainty is a key quantity of interest.
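Illustration (not from the paper): the multimodality artifact of pure mixing can be seen with two Gaussian experts. The sketch below compares a classical mixture density with a single "fused" Gaussian whose mean and standard deviation are the weighted averages of the experts' parameters. That parameter-averaging rule is an assumption chosen only to make the contrast visible; it is not claimed to be CoCoAFusE's fusion operator, and the expert parameters are made up.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, experts, weights):
    """Classical mixing: weighted sum (superimposition) of expert densities."""
    return sum(w * normal_pdf(x, m, s) for w, (m, s) in zip(weights, experts))

def fused_pdf(x, experts, weights):
    """Hypothetical fusion: one Gaussian with weight-averaged parameters."""
    mean = sum(w * m for w, (m, _) in zip(weights, experts))
    std = sum(w * s for w, (_, s) in zip(weights, experts))
    return normal_pdf(x, mean, std)

# Two assumed experts, as (mean, std), weighted equally,
# as might happen halfway through a smooth transition.
experts = [(-2.0, 0.5), (2.0, 0.5)]
weights = [0.5, 0.5]

mix_mid = mixture_pdf(0.0, experts, weights)    # density between the experts
mix_peak = mixture_pdf(2.0, experts, weights)   # density at one expert's mean
fused_mid = fused_pdf(0.0, experts, weights)
fused_peak = fused_pdf(2.0, experts, weights)
print(mix_mid, mix_peak, fused_mid, fused_peak)
```

The mixture puts almost no mass at the midpoint and two separate peaks at the experts' means, whereas the fused density has a single mode at the midpoint. This is the intermediate behavior that mixing alone cannot express.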

Comment: The paper introduces CoCoAFusE, which extends Mixture of Experts (MoE) with a novel fusion mechanism. This aligns closely with the model architecture criterion, particularly innovations in MoE frameworks.

Relevance: 10 Novelty: 8

3. Multi-Step Consistency Models: Fast Generation with Theoretical Guarantees

ArXiv ID: 2505.01049

Authors: Nishant Jain, Xunpeng Huang, Yian Ma, Tong Zhang

Abstract: Consistency models have recently emerged as a compelling alternative to traditional SDE-based diffusion models, offering a significant acceleration in generation by producing high-quality samples in very few steps. Despite their empirical success, a proper theoretical justification for their speed-up is still lacking. In this work, we provide the analysis which bridges this gap, showing that given a consistency model which can map the input at a given time to arbitrary timestamps along the reverse trajectory, one can achieve KL divergence of order $ O(\varepsilon^2) $ using only $ O\left(\log\left(\frac{d}{\varepsilon}\right)\right) $ iterations with constant step size, where $ d $ is the data dimension. Additionally, under minimal assumptions on the data distribution (an increasingly common setting in recent diffusion model analyses), we show that a similar KL convergence guarantee can be obtained, with the number of steps scaling as $ O\left(d \log\left(\frac{d}{\varepsilon}\right)\right) $. Going further, we also provide a theoretical analysis for estimation of such consistency models, concluding that accurate learning is feasible using small discretization steps, both in smooth and non-smooth settings. Notably, our results for the non-smooth case yield best-in-class convergence rates compared to existing SDE- or ODE-based analyses under minimal assumptions.
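Illustration (not from the paper): to get a feel for the gap between the two iteration bounds, the snippet below evaluates $\log(d/\varepsilon)$ and $d \log(d/\varepsilon)$ with all hidden constants set to 1. That choice is an assumption; big-O bounds do not fix the constants, so only the relative growth is meaningful. The function names and the sample values of `d` and `eps` are hypothetical.

```python
import math

def steps_first_result(d, eps):
    """log(d / eps), rounded up; hidden constant assumed to be 1."""
    return math.ceil(math.log(d / eps))

def steps_minimal_assumptions(d, eps):
    """d * log(d / eps), rounded up; hidden constant assumed to be 1."""
    return math.ceil(d * math.log(d / eps))

d, eps = 1000, 0.01  # assumed data dimension and target accuracy
first = steps_first_result(d, eps)
minimal = steps_minimal_assumptions(d, eps)
print(first, minimal)
```

Under these assumed constants the second bound is larger by roughly a factor of `d`, which is the extra cost the abstract attributes to dropping all but minimal assumptions on the data distribution.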

Comment: The paper provides theoretical guarantees for consistency models, which are an emerging trend in generative modeling. It offers foundational insights into the efficiency and theoretical underpinnings of these models, aligning with the 'Emerging Trends' criterion.

Relevance: 9 Novelty: 9

4. How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias

ArXiv ID: 2505.00926

Authors: Ruiquan Huang, Yingbin Liang, Jing Yang

Abstract: Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as "even pairs" and "parity check", the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well-trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and approaches in direction to a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate those theoretical results.
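Illustration (not from the paper): the two recognition tasks can be written as short reference functions. The sketch assumes a binary alphabet `{a, b}`, takes "even pairs" to mean that the combined number of adjacent `ab` and `ba` pairs is even, and takes "parity check" to mean that the number of `b` symbols is even. The symbol choice and the step-by-step trace in `parity_trace` are assumptions made for illustration, not the paper's construction.

```python
from itertools import product

def even_pairs(seq):
    """True if the number of adjacent 'ab' and 'ba' pairs is even.
    Over a binary alphabet, such a pair is any adjacent pair that differs."""
    changes = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    return changes % 2 == 0

def parity_check(seq, symbol="b"):
    """True if `symbol` occurs an even number of times."""
    return seq.count(symbol) % 2 == 0

def parity_trace(seq, symbol="b"):
    """Running parity after each token: a chain of intermediate
    answers, loosely analogous to chain-of-thought steps."""
    trace, even = [], True
    for tok in seq:
        if tok == symbol:
            even = not even
        trace.append(even)
    return trace

print(even_pairs("abba"), parity_check("abba"), parity_trace("abba"))

# Over {a, b}, even pairs reduces to comparing the first and last
# symbols; check this exhaustively on short strings.
shortcut_holds = all(
    even_pairs("".join(s)) == (s[0] == s[-1])
    for n in range(1, 7)
    for s in product("ab", repeat=n)
)
```

The exhaustive check shows why even pairs is the easier task: the answer depends only on the first and last tokens. Parity check depends on every token, which is consistent with the abstract's point that it calls for intermediate (CoT-style) steps.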

Comment: This paper provides theoretical insights into how transformers learn regular language recognition tasks, analyzing training dynamics and implicit bias. It aligns with the core topic of representation learning and offers foundational insights into transformer behavior.