Personalized Daily ArXiv Papers 2025-08-14

[gpt-4o]	Prompt	Completion	Total
Token	34630	4045	38675
Cost	$0.09	$0.04	$0.13

Total arXiv papers: 538

Total scanned papers: 320

Total relevant papers: 17

Table of contents with paper titles:

$\mu$-Parametrization for Mixture of Experts Authors: Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski
Provable In-Context Vector Arithmetic via Retrieving Task Concepts Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models Authors: Omar Bazarbachi, Zijun Sun, Yanning Shen
HKT: A Biologically Inspired Framework for Modular Hereditary Knowledge Transfer in Neural Networks Authors: Yanick Chistian Tchenko, Felix Mohr, Hicham Hadj Abdelkader, Hedi Tabia
Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization Authors: Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
DQT: Dynamic Quantization Training via Dequantization-Free Nested Integer Arithmetic Authors: Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Francesca Palermo, Diana Trojaniello, Manuel Roveri
CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge Authors: Muqing Li, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang
Global Convergence Analysis of Vanilla Gradient Descent for Asymmetric Matrix Completion Authors: Xu Zhang, Shuo Chen, Jinsheng Li, Xiangying Pang, Maoguo Gong
HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap Authors: Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation Authors: Devvrat Joshi, Islem Rekik
Pattern-based Knowledge Component Extraction from Student Code Using Representation Learning Authors: Muntasir Hoq, Griffin Pitts, Andrew Lan, Peter Brusilovsky, Bita Akram
Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks Authors: Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models Authors: Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning Authors: Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang
Structured Kernel Regression VAE: A Computationally Efficient Surrogate for GP-VAEs in ICA Authors: Yuan-Hao Wei, Fu-Hao Deng, Lin-Yong Cui, Yan-Jie Sun
Improving Diversity in Language Models: When Temperature Fails, Change the Loss Authors: Alexandre Verine, Florian Le Bronnec, Kunhao Zheng, Alexandre Allauzen, Yann Chevaleyre, Benjamin Negrevergne
Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation Authors: Sameer Ambekar, Daniel M. Lang, Julia A. Schnabel

1. $\mu$-Parametrization for Mixture of Experts

ArXiv ID: 2508.09752

Authors: Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski

Abstract: Recent years have seen growing interest in and adoption of LLMs, with $\mu$Transfer becoming a key technique for tuning hyperparameters in large-scale training. Meanwhile, Mixture-of-Experts (MoE) has emerged as a leading architecture in extremely large models. However, the intersection of these two advancements has remained unexplored. In this work, we derive a $\mu$-Parameterization ($\mu$P) for MoE, providing theoretical guarantees for feature learning across model widths in both the router and experts. We empirically validate our parameterization and further investigate how scaling the number of experts and granularity affects the optimal learning rate.

Comment: The paper provides a theoretical framework for MoE parameterization, aligning with model architecture insights and foundational research.

Relevance: 10 Novelty: 8
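
Illustration: the width-transfer idea behind $\mu$Transfer can be sketched in a few lines. This is a minimal sketch, assuming the commonly cited $\mu$P rule for Adam in which the learning rate of hidden weight matrices scales in proportion to 1/width. The function name, widths, and learning rates are hypothetical, and the sketch does not reproduce the router or expert rules derived in the paper.

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Transfer a tuned Adam learning rate for hidden matrices to a new width.

    Assumes the common muP convention for Adam: hidden-matrix learning
    rate is proportional to 1 / width. All values here are illustrative.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


# Hypothetical setup: tune on a small proxy model, then reuse on wider ones.
tuned_lr = 0.01      # assumed optimum found at the proxy width
proxy_width = 256    # assumed proxy model width

transferred = {w: mup_hidden_lr(tuned_lr, proxy_width, w)
               for w in (256, 1024, 4096)}

for width, lr in transferred.items():
    print(f"width={width:5d}  hidden-matrix lr={lr:.6f}")
```

Under this assumed rule, the hyperparameter search runs once at the proxy width and the result is rescaled, rather than re-tuned, for each wider model.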

2. Provable In-Context Vector Arithmetic via Retrieving Task Concepts

ArXiv ID: 2508.09820

Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki

Abstract: In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights.
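
Illustration: the Word2Vec-like vector arithmetic mentioned in the abstract can be pictured with a toy example. This is a minimal sketch using hand-made three-dimensional embeddings (hypothetical, not learned and not the paper's construction): a task vector is estimated as the average answer-minus-question offset over the demonstrations, added to the query embedding, and the nearest remaining word by cosine similarity is returned.

```python
import math

# Hypothetical hand-made embeddings: [is_france, is_japan, is_capital].
EMBEDDINGS = {
    "France": [1.0, 0.0, 0.0],
    "Paris": [1.0, 0.0, 1.0],
    "Japan": [0.0, 1.0, 0.0],
    "Tokyo": [0.0, 1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def task_vector(demos):
    """Average the (answer - question) offsets over the demonstrations."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    for question, answer_word in demos:
        for i in range(dim):
            total[i] += EMBEDDINGS[answer_word][i] - EMBEDDINGS[question][i]
    return [x / len(demos) for x in total]


def answer(query, demos):
    """Add the task vector to the query and return the nearest other word."""
    offset = task_vector(demos)
    target = [q + o for q, o in zip(EMBEDDINGS[query], offset)]
    candidates = [w for w in EMBEDDINGS if w != query]
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))


# One demonstration ("France" -> "Paris") defines the "capital of" task.
prediction = answer("Japan", [("France", "Paris")])
print(prediction)
```

In this toy setting the single demonstration yields the offset [0, 0, 1], so the query "Japan" is mapped to the embedding of "Tokyo".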

Comment: The paper provides a theoretical framework for in-context learning in LLMs, focusing on vector arithmetic and task concept retrieval, which is relevant to foundational research in LLMs.