Personalized Daily ArXiv Papers 2025-04-29

[gpt-4o]	Prompt	Completion	Total
Token	44635	6224	50859
Cost	$0.11	$0.06	$0.17
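
The cost row can be reproduced from the token counts. The sketch below assumes rates of $2.50 per million prompt tokens and $10.00 per million completion tokens; these rates are an assumption chosen for illustration and are not stated in this digest.

```python
# Assumed rates in USD per million tokens (not stated in the digest).
PROMPT_RATE = 2.50
COMPLETION_RATE = 10.00


def cost_usd(tokens: int, rate_per_million: float) -> float:
    """Cost of `tokens` tokens at `rate_per_million` USD per one million tokens."""
    return tokens * rate_per_million / 1_000_000


# Token counts copied from the table above.
prompt_tokens = 44_635
completion_tokens = 6_224
total_tokens = prompt_tokens + completion_tokens

prompt_cost = cost_usd(prompt_tokens, PROMPT_RATE)
completion_cost = cost_usd(completion_tokens, COMPLETION_RATE)
total_cost = prompt_cost + completion_cost

print(total_tokens)
print(f"${prompt_cost:.2f} ${completion_cost:.2f} ${total_cost:.2f}")
```

Under these assumed rates the rounded values agree with the table: 50859 total tokens and $0.11, $0.06, $0.17.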

Total arXiv papers: 691

Total scanned papers: 406

Total relevant papers: 27

Table of contents with paper titles:

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity Authors: Ruifeng Ren, Yong Liu
Accelerating Mixture-of-Experts Training with Adaptive Expert Replication Authors: Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, Christos Kozyrakis
Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation Authors: Yi Lu, Wanxu Zhao, Xin Zhou, Chenxin An, Chenglong Wang, Shuo Li, Yuming Yang, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Sparks: Multi-Agent Artificial Intelligence Model Discovers Protein Design Principles Authors: Alireza Ghafarollahi, Markus J. Buehler
Modular Machine Learning: An Indispensable Path towards New-Generation Large Language Models Authors: Xin Wang, Haoyang Li, Zeyang Zhang, Haibo Chen, Wenwu Zhu
Emergence and scaling laws in SGD learning of shallow neural networks Authors: Yunwei Ren, Eshaan Nichani, Denny Wu, Jason D. Lee
Quantifying Memory Utilization with Effective State-Size Authors: Rom N. Parnichkun, Neehal Tumma, Armin W. Thomas, Alessandro Moro, Qi An, Taiji Suzuki, Atsushi Yamashita, Michael Poli, Stefano Massaroli
TLoRA: Tri-Matrix Low-Rank Adaptation of Large Language Models Authors: Tanvir Islam
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference Authors: Zhenyu Zhang, Zechun Liu, Yuandong Tian, Harshit Khaitan, Zhangyang Wang, Steven Li
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism Authors: Aviv Bick, Eric Xing, Albert Gu
ZipR1: Reinforcing Token Sparsity in MLLMs Authors: Feng Chen, Yefei He, Lequan Lin, Jing Liu, Bohan Zhuang, Qi Wu
Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence Authors: Adwait Datar, Nihat Ay
Improving Reasoning Performance in Large Language Models via Representation Engineering Authors: Bertram Højer, Oliver Jarvis, Stefan Heinrich
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate Authors: Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts Authors: Qingyue Wang, Qi Pang, Xixun Lin, Shuai Wang, Daoyuan Wu
From Evidence to Belief: A Bayesian Epistemology Approach to Language Models Authors: Minsu Kim, Sangryul Kim, James Thorne
Sharp higher order convergence rates for the Adam optimizer Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert
SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning Authors: Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, Kwan-Yee K. Wong
semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage Authors: Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang
TeleSparse: Practical Privacy-Preserving Verification of Deep Neural Networks Authors: Mohammad M Maheri, Hamed Haddadi, Alex Davidson
Hierarchical Uncertainty-Aware Graph Neural Network Authors: Yoonhyuk Choi, Chong-Kwon Kim
GVPO: Group Variance Policy Optimization for Large Language Model Post-Training Authors: Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, Hui Xiong
Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search Authors: Fei Liu, Qingfu Zhang, Xialiang Tong, Mingxuan Yuan, Kun Mao
Graph Fourier Transformer with Structure-Frequency Information Authors: Yonghui Zhai, Yang Zhang, Minghao Shang, Lihua Pang, Yaxin Ren
Hierarchical Attention Generates Better Proofs Authors: Jianlong Chen, Chao Li, Yang Yuan, Andrew C Yao
Towards Faster and More Compact Foundation Models for Molecular Property Prediction Authors: Yasir Ghunaim, Andrés Villa, Gergo Ignacz, Gyorgy Szekely, Motasem Alfarra, Bernard Ghanem
VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning Authors: Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia

1. Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity

ArXiv ID: 2504.18929

Authors: Ruifeng Ren, Yong Liu

Abstract: Compression has been a critical lens to understand the success of Transformers. In the past, we have typically taken the target distribution as a criterion to evaluate a model's compression performance. Nevertheless, it often remains challenging to precisely assess how well the model achieves compression and to compare the information content of the learned distribution with that of the target distribution during compression, as the target distribution is typically unknown and entropy computation often incurs exponential cost. In this work, we explore these issues under a controlled experimental setup. We find that Transformers exhibit a unique inductive bias in data compression: beyond approaching the target distribution, they tend to favor learning lower-entropy distributions, with this tendency becoming more pronounced as the model size increases. This preference prevents Transformers from perfectly aligning with the target distribution, instead further compressing its information content. Furthermore, we show that the FFN module plays a critical role in driving this bias. In addition, while models remove informational redundancy from data during compression, they also exhibit redundancy within their parameters, which enables compression and can be characterized through dynamic sparsity. However, the dynamic sparsity patterns in Transformers, particularly in attention and FFN modules, demand further exploration. To this end, we show that larger Transformers show stronger preferences for bypassing attention computations via residual connections and have a lower proportion of active neurons. Interestingly, we also find that training instability in larger models strongly correlates with sudden increases in dead neurons. Our work contributes to a deeper understanding of Transformers through the lens of entropy and dynamic sparsity.

Comment: The paper explores Transformers through the lens of entropy and dynamic sparsity, directly addressing compression and efficiency breakthroughs.

Relevance: 10 Novelty: 8

2. Accelerating Mixture-of-Experts Training with Adaptive Expert Replication

ArXiv ID: 2504.19925

Authors: Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, Christos Kozyrakis

Abstract: Mixture-of-Experts (MoE) models have become a widely adopted solution to continue scaling model sizes without a corresponding linear increase in compute. During MoE model training, each input token is dynamically routed to a subset of experts -- sparsely-activated feed-forward networks -- within each transformer layer. The distribution of tokens assigned to each expert varies widely and rapidly over the course of training. To handle the wide load imbalance across experts, current systems are forced to either drop tokens assigned to popular experts, degrading convergence, or frequently rebalance resources allocated to each expert based on popularity, incurring high state migration overheads. To break this performance-accuracy tradeoff, we introduce SwiftMoE, an adaptive MoE training system. The key insight of SwiftMoE is to decouple the placement of expert parameters from their large optimizer state. SwiftMoE statically partitions the optimizer of each expert across all training nodes. Meanwhile, SwiftMoE dynamically adjusts the placement of expert parameters by repurposing existing weight updates, avoiding migration overheads. In doing so, SwiftMoE right-sizes the GPU resources allocated to each expert, on a per-iteration basis, with minimal overheads. Compared to state-of-the-art MoE training systems, DeepSpeed and FlexMoE, SwiftMoE is able to achieve a 30.5% and 25.9% faster time-to-convergence, respectively.
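
The load imbalance described in this abstract can be illustrated with a toy top-k router. This is a minimal sketch, not SwiftMoE's algorithm: the Gaussian router scores, the per-expert `bias` values, and the greedy `allocate_replicas` rule are all hypothetical and chosen only to show how a skewed token distribution might translate into per-expert replica counts.

```python
import random


def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


def expert_loads(num_tokens, k, bias, seed=0):
    """Count tokens per expert under a toy router (hypothetical score model)."""
    rng = random.Random(seed)
    loads = [0] * len(bias)
    for _ in range(num_tokens):
        # Each expert's score is Gaussian noise around its bias.
        scores = [rng.gauss(b, 1.0) for b in bias]
        for e in route_top_k(scores, k):
            loads[e] += 1
    return loads


def allocate_replicas(loads, num_slots):
    """Hypothetical rule: every expert gets one replica, then each remaining
    slot goes to the expert with the highest load per replica."""
    replicas = [1] * len(loads)
    for _ in range(num_slots - len(loads)):
        busiest = max(range(len(loads)), key=lambda i: loads[i] / replicas[i])
        replicas[busiest] += 1
    return replicas


# Two "popular" experts and six ordinary ones (illustrative values).
bias = [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
loads = expert_loads(num_tokens=2000, k=2, bias=bias)
replicas = allocate_replicas(loads, num_slots=16)
print("tokens per expert:  ", loads)
print("replicas per expert:", replicas)
```

Re-running the allocation every iteration would mirror the per-iteration right-sizing the abstract mentions, but how SwiftMoE actually places parameters and partitions optimizer state is described only in the paper.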

Comment: The paper introduces SwiftMoE, an adaptive training system for Mixture-of-Experts models, which directly aligns with the 'Model Architecture' criterion. The dynamic expert replication is a novel contribution.

Relevance: 10 Novelty: 8

3. Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation

ArXiv ID: 2504.18857

Authors: Yi Lu, Wanxu Zhao, Xin Zhou, Chenxin An, Chenglong Wang, Shuo Li, Yuming Yang, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Abstract: Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but require expensive overhead to train the large-scale models with longer context. In this work, we propose Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free framework to extrapolate the context window of LLMs by diving into RoPE's different hidden dimensions. Instead of manipulating all dimensions equally, DPE detects the effective length for every dimension and finds the key dimensions for context extension. We reuse the original position indices with their embeddings from the pre-trained model and manipulate the key dimensions' position indices to their most effective lengths. In this way, DPE adjusts the pre-trained models with minimal modifications while ensuring that each dimension reaches its optimal state for extrapolation. DPE significantly surpasses well-known baselines such as YaRN and Self-Extend. DPE enables Llama3-8k 8B to support context windows of 128k tokens without continual training and integrates seamlessly with Flash Attention 2. In addition to its impressive extrapolation capability, DPE also dramatically improves the models' performance within training length, such as Llama3.1 70B, by over 18 points on the popular long-context benchmark RULER. When compared with commercial models, Llama 3.1 70B with DPE even achieves better performance than GPT-4-128K.
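
To see why RoPE dimensions might behave differently under extrapolation, it helps to look at their rotation wavelengths. The sketch below uses the standard RoPE frequency formula, theta_i = base^(-2i/d); it is not DPE's detection procedure. The values of `HEAD_DIM`, `BASE`, and `TRAIN_LEN` are assumptions chosen for illustration and do not come from the abstract.

```python
import math


def rope_wavelengths(head_dim, base):
    """Wavelength in tokens of each RoPE frequency pair.

    Pair i rotates with angular frequency base**(-2i/head_dim) per token,
    so one full rotation takes 2*pi * base**(2i/head_dim) tokens.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 128      # assumed per-head dimension
BASE = 500_000.0    # assumed RoPE base
TRAIN_LEN = 8192    # assumed pre-training context length

wavelengths = rope_wavelengths(HEAD_DIM, BASE)

# Pairs whose wavelength exceeds the training length never complete a full
# rotation during pre-training, so unseen angles appear at longer contexts.
unseen = [i for i, w in enumerate(wavelengths) if w > TRAIN_LEN]

print(f"{len(wavelengths)} frequency pairs, {len(unseen)} with wavelength > {TRAIN_LEN}")
print(f"shortest wavelength: {wavelengths[0]:.2f} tokens")
print(f"longest wavelength:  {wavelengths[-1]:.0f} tokens")
```

Low-index pairs cycle many times within the training window while high-index pairs see only a fraction of a period, which is one plausible reason for treating dimensions separately rather than rescaling all of them equally.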

Comment: The paper proposes a training-free framework (DPE) for extending LLM context windows, which aligns with foundational research in LLM architecture and efficiency.

Relevance: 10 Novelty: 8

4. Sparks: Multi-Agent Artificial Intelligence Model Discovers Protein Design Principles

ArXiv ID: 2504.19017

Authors: Alireza Ghafarollahi, Markus J. Buehler

Abstract: Advances in artificial intelligence (AI) promise autonomous discovery, yet most systems still resurface knowledge latent in their training data. We present Sparks, a multi-modal multi-agent AI model that executes the entire discovery cycle that includes hypothesis generation, experiment design and iterative refinement to develop generalizable principles and a report without human intervention. Applied to protein science, Sparks uncovered two previously unknown phenomena: (i) a length-dependent mechanical crossover whereby beta-sheet-biased peptides surpass alpha-helical ones in unfolding force beyond ~80 residues, establishing a new design principle for peptide mechanics; and (ii) a chain-length/secondary-structure stability map revealing unexpectedly robust beta-sheet-rich architectures and a "frustration zone" of high variance in mixed alpha/beta folds. These findings emerged from fully self-directed reasoning cycles that combined generative sequence design, high-accuracy structure prediction and physics-aware property models, with paired generation-and-reflection agents enforcing self-correction and reproducibility. The key result is that Sparks can independently conduct rigorous scientific inquiry and identify previously unknown scientific principles.

Comment: The paper presents Sparks, a multi-agent AI model discovering protein design principles, which aligns with AI for Science and introduces novel generative paradigms for protein modeling.

Relevance: 9 Novelty: 9

5. Modular Machine Learning: An Indispensable Path towards New-Generation Large Language Models

ArXiv ID: 2504.20020

Authors: Xin Wang, Haoyang Li, Zeyang Zhang, Haibo Chen, Wenwu Zhu

Abstract: Large language models (LLMs) have dramatically advanced machine learning research including natural language processing, computer vision, data mining, etc., yet they still exhibit critical limitations in reasoning, factual consistency, and interpretability. In this paper, we introduce a novel learning paradigm -- Modular Machine Learning (MML) -- as an essential approach toward new-generation LLMs. MML decomposes the complex structure of LLMs into three interdependent components: modular representation, modular model, and modular reasoning, aiming to enhance LLMs' capability of counterfactual reasoning, mitigate hallucinations, and promote fairness, safety, and transparency. Specifically, the proposed MML paradigm can: i) clarify the internal working mechanism of LLMs through the disentanglement of semantic components; ii) allow for flexible and task-adaptive model design; iii) enable an interpretable and logic-driven decision-making process. We present a feasible implementation of MML-based LLMs via leveraging advanced techniques such as disentangled representation learning, neural architecture search and neuro-symbolic learning. We critically identify key challenges, such as the integration of continuous neural and discrete symbolic processes, joint optimization, and computational scalability, and present promising future research directions that deserve further exploration. Ultimately, the integration of the MML paradigm with LLMs has the potential to bridge the gap between statistical (deep) learning and formal (logical) reasoning, thereby paving the way for robust, adaptable, and trustworthy AI systems across a wide range of real-world applications.

Comment: The paper introduces Modular Machine Learning (MML) as a paradigm for improving LLMs, which aligns with the 'Large Language Models' criterion. The focus on disentangled representation and modular reasoning is novel and impactful.

Relevance: 9 Novelty: 9

6. Emergence and scaling laws in SGD learning of shallow neural networks

ArXiv ID: 2504.19983

Authors: Yunwei Ren, Eshaan Nichani, Denny Wu, Jason D. Lee

Abstract: We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p\cdot \sigma(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where the activation $\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent $k_*>2$ (defined as the lowest degree in the Hermite expansion), $\{\boldsymbol{v}_p^*\}_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal directions, and the non-negative second-layer coefficients satisfy $\sum_{p} a_p^2=1$. We focus on the challenging "extensive-width" regime $P\gg 1$ and permit diverging condition number in the second-layer, covering as a special case the power-law scaling $a_p\asymp p^{-\beta}$ where $\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and explicitly identify sharp transition times to recover each signal direction. In the power-law setting, we characterize scaling law exponents for the MSE loss with respect to the number of training samples and SGD steps, as well as the number of parameters in the student neural network. Our analysis entails that while the learning of individual teacher neurons exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning curves at different timescales leads to a smooth scaling law in the cumulative objective.
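
The target function in this abstract can be written down concretely. The following is a minimal sketch of the teacher network only, not of the paper's SGD analysis. It assumes the signal directions are the first $P$ standard basis vectors (one convenient orthonormal choice) and uses the degree-4 probabilists' Hermite polynomial as the activation, which is even and has lowest Hermite degree 4; both choices, and the values of `d`, `P`, and `beta`, are illustrative assumptions.

```python
import math
import random


def power_law_coeffs(P, beta):
    """Coefficients a_p proportional to p**(-beta), scaled so sum of a_p**2 is 1."""
    raw = [p ** (-beta) for p in range(1, P + 1)]
    norm = math.sqrt(sum(a * a for a in raw))
    return [a / norm for a in raw]


def he4(z):
    """Probabilists' Hermite polynomial He_4(z) = z^4 - 6 z^2 + 3 (an even function)."""
    return z ** 4 - 6 * z ** 2 + 3


def teacher(x, coeffs):
    """f(x) = sum_p a_p * sigma(<x, v_p>).

    Assumption: v_p is the p-th standard basis vector, so <x, v_p> = x[p].
    """
    return sum(a * he4(x[p]) for p, a in enumerate(coeffs))


d, P, beta = 64, 16, 1.0          # illustrative sizes, not from the paper
coeffs = power_law_coeffs(P, beta)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]   # isotropic Gaussian input
y = teacher(x, coeffs)

print("sum of squared coefficients:", sum(a * a for a in coeffs))
print("teacher output for one sample:", y)
```

With `beta = 1.0` the first few coefficients dominate, which gives a sense of why different directions could be recovered at very different times during training.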

Comment: The paper provides a theoretical analysis of SGD dynamics in learning two-layer neural networks, offering insights into training dynamics and scaling laws. This is highly relevant to representation learning.