Personalized Daily Arxiv Papers 03/10/2025

[gpt-4o]	Prompt	Completion	Total
Token	49089	6516	55605
Cost	$0.12	$0.07	$0.18

Total ArXiv papers: 521

Total scanned papers: 332

Total relevant papers: 31

Table of contents with paper titles:

Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts Authors: Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, Yu Cheng
Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning Authors: Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts Authors: Shwai He, Weilin Cai, Jiayi Huang, Ang Li
Disentangling Task Interference within Neurons: Model Merging in Alignment with Neuronal Mechanisms Authors: Zitao Fang, Guodong DU, Shuyang Yu, Yifei Guo, Yiwei Zhang, Jing Li, Ho-Kin Tang, Sim Kuan Goh
Quantum-PEFT: Ultra parameter-efficient fine-tuning Authors: Toshiaki Koike-Akino, Francesco Tonin, Yongtao Wu, Frank Zhengqing Wu, Leyla Naz Candogan, Volkan Cevher
Distilling Dataset into Neural Field Authors: Donghyeok Shin, HeeSun Bae, Gyuwon Sim, Wanmo Kang, Il-Chul Moon
TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation Authors: Lin Sun, Guangxiang Zhao, Xiaoqi Jian, Yuhan Wu, Weihong Lin, Yongfu Zhu, Change Jia, Linglin Zhang, Jinzhu Wu, Junfeng Ran, Sai-er Hu, Zihan Jiang, Junting Zhou, Wenrui Liu, Bin Cui, Tong Yang, Xiangzheng Zhang
Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models Authors: Benyamin Jamialahmadi, Parsa Kavehzadeh, Mehdi Rezagholizadeh, Parsa Farinneya, Hossein Rajabzadeh, Aref Jafari, Boxing Chen, Marzieh Tahaei
Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning Authors: Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti
Wanda++: Pruning Large Language Models via Regional Gradients Authors: Yifan Yang, Kai Zhen, Bhavana Ganesh, Aram Galstyan, Goeric Huybrechts, Markus M\"uller, Jonas M. K\"ubler, Rupak Vignesh Swaminathan, Athanasios Mouchtaris, Sravan Babu Bodapati, Nathan Susanj, Zheng Zhang, Jack FitzGerald, Abhishek Kumar
Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs Authors: Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, Fakang Wang, Gangshan Wang, Guangyao Zhai, Haitao Zhang, Huizhong Li, Jun Zhou, Jia Liu, Junpeng Fang, Junjie Ou, Jun Hu, Ji Luo, Ji Zhang, Jian Liu, Jian Sha, Jianxue Qian, Jiewei Wu, Junping Zhao, Jianguo Li, Jubao Feng, Jingchao Di, Junming Xu, Jinghua Yao, Kuan Xu, Kewei Du, Longfei Li, Lei Liang, Lu Yu, Li Tang, Lin Ju, Peng Xu, Qing Cui, Song Liu, Shicheng Li, Shun Song, Song Yan, Tengwei Cai, Tianyi Chen, Ting Guo, Ting Huang, Tao Feng, Tao Wu, Wei Wu, Xiaolu Zhang, Xueming Yang, Xin Zhao, Xiaobo Hu, Xin Lin, Yao Zhao, Yilong Wang, Yongzhen Guo, Yuanyuan Wang, Yue Yang, Yang Cao, Yuhao Fu, Yi Xiong, Yanzhe Li, Zhe Li, Zhiqiang Zhang, Ziqi Liu, Zhaoxin Huan, Zujie Wen, Zhenhang Sun, Zhuoxuan Du, Zhengyu He
Continual Pre-training of MoEs: How robust is your router? Authors: Benjamin Th\'erien, Charles-\'Etienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, Irina Rish
Strategy Coopetition Explains the Emergence and Transience of In-Context Learning Authors: Aaditya K. Singh, Ted Moskovitz, Sara Dragutinovic, Felix Hill, Stephanie C. Y. Chan, Andrew M. Saxe
Neuromorphic Quantum Neural Networks with Tunnel-Diode Activation Functions Authors: Jake McNaughton, A. H. Abbas, Ivan S. Maksymov
Global graph features unveiled by unsupervised geometric deep learning Authors: Mirja Granfors, Jes\'us Pineda, Blanca Zufiria Gerbol\'es, Joana B. Pereira, Carlo Manzo, Giovanni Volpe
Riemann$^2$: Learning Riemannian Submanifolds from Riemannian Data Authors: Leonel Rozo, Miguel Gonz\'alez-Duque, No\'emie Jaquier, S{\o}ren Hauberg
Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations Authors: Eren Erogullari, Sebastian Lapuschkin, Wojciech Samek, Frederik Pahde
VQEL: Enabling Self-Developed Symbolic Language in Agents through Vector Quantization in Emergent Language Games Authors: Mohammad Mahdi Samiei Paqaleh, Mahdieh Soleymani Baghshah
Towards Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients Authors: Niklas Penzel, Joachim Denzler
Extrapolation Merging: Keep Improving With Extrapolation and Merging Authors: Yiguan Lin, Bin Xu, Yinghao Li, Yang Gao
Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning Authors: Victor Sebastian Martinez Pozos, Ivan Vladimir Meza Ruiz
Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs? Authors: Qingyuan Liang, Zhao Zhang, Zeyu Sun, Zheng Lin, Qi Luo, Yueyi Xiao, Yizhou Chen, Yuqun Zhang, Haotian Zhang, Lu Zhang, Bin Chen, Yingfei Xiong
KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework Authors: Jiexiong Liu, Yixuan Chen, Yanqin Jia, Zhepeng Li
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models Authors: Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du
An Analytical Model for Overparameterized Learning Under Class Imbalance Authors: Eliav Mor, Yair Carmon
Statistical Deficiency for Task Inclusion Estimation Authors: Lo\"ic Fosse, Fr\'ed\'eric B\'echet, Beno\^it Favre, G\'eraldine Damnati, Gw\'enol\'e Lecorv\'e, Maxime Darrin, Philippe Formont, Pablo Piantanida
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
Optimizing Multi-Hop Document Retrieval Through Intermediate Representations Authors: Jiaen Lin, Jingyu Liu
A kinetic-based regularization method for data science applications Authors: Abhisek Ganguly, Alessandro Gabbana, Vybhav Rao, Sauro Succi, Santosh Ansumali
Robust Conformal Prediction with a Single Binary Certificate Authors: Soroush H. Zargarbashi, Aleksandar Bojchevski
A new local time-decoupled squared Wasserstein-2 method for training stochastic neural networks to reconstruct uncertain parameters in dynamical systems Authors: Mingtao Xia, Qijing Shen, Philip Maini, Eamonn Gaffney, Alex Mogilner

1. Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts

ArXiv ID: 2503.05447

Authors: Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, Yu Cheng

Abstract: Linear Sequence Modeling (LSM) like linear attention, state space models and linear RNNs, and Mixture-of-Experts (MoE) have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparsely activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) Modeling subsystem, which provides a unified framework supporting all instances of LSM. and 2) Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers with its Sequence Parallelism to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: https://github.com/OpenSparseLLMs/Linear-MoE.

Comment: The paper introduces Linear-MoE, combining linear sequence modeling with Mixture-of-Experts, which is highly relevant to architectural innovations and foundational research in MoE.

Relevance: 10 Novelty: 8

2. Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

ArXiv ID: 2503.05641

Authors: Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

Abstract: Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting experts at the task level is often too coarse-grained, as heterogeneous tasks may require different expertise for each instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE's instance-level expert selection improves performance by a large margin but -- when implemented naively -- can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch inference strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we demonstrate that Symbolic-MoE outperforms strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute average improvement of 8.15% over the best multi-agent baseline. Moreover, Symbolic-MoE removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.

Comment: The paper introduces a symbolic Mixture-of-Experts framework, which directly aligns with the MoE topic under model architecture. The instance-level expert selection and efficiency improvements are notable contributions.

Relevance: 10 Novelty: 8

3. Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

ArXiv ID: 2503.05066

Authors: Shwai He, Weilin Cai, Jiayi Huang, Ang Li

Abstract: The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation, optimizing the trade-off between performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where some experts are overloaded while others remain underutilized. This imbalance leads to poor resource utilization and increased latency, as the most burdened expert dictates the overall delay, a phenomenon we define as the \textbf{\textit{Straggler Effect}}. To mitigate this, we propose Capacity-Aware Inference, including two key techniques: (1) \textbf{\textit{Capacity-Aware Token Drop}}, which discards overloaded tokens to regulate the maximum latency of MoE, and (2) \textbf{\textit{Capacity-Aware Token Reroute}}, which reallocates overflowed tokens to underutilized experts, balancing the token distribution. These techniques collectively optimize both high-load and low-load expert utilization, leading to a more efficient MoE inference pipeline. Extensive experiments demonstrate the effectiveness of our methods, showing significant improvements in inference efficiency, e.g., 0.2\% average performance increase and a 1.94$\times$ inference speedup on Mixtral-8$\times$7B-Instruct.

Comment: The paper addresses the Straggler Effect in Mixture-of-Experts, which is directly relevant to model architecture and efficiency improvements. The proposed techniques are innovative.

Relevance: 10 Novelty: 8

4. Disentangling Task Interference within Neurons: Model Merging in Alignment with Neuronal Mechanisms

ArXiv ID: 2503.05320

Authors: Zitao Fang, Guodong DU, Shuyang Yu, Yifei Guo, Yiwei Zhang, Jing Li, Ho-Kin Tang, Sim Kuan Goh

Abstract: Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic at various levels: model, layer, or parameter, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlook the fundamental role of individual neurons and their connectivity, resulting in a lack of interpretability in both the merging process and the merged models. In this work, we present the first study on the impact of neuronal alignment in model merging. We decompose task-specific representations into two complementary neuronal subspaces that regulate neuron sensitivity and input adaptability. Leveraging this decomposition, we introduce NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrate that NeuroMerging achieves superior performance compared to existing methods on multi-task benchmarks across both vision and natural language domains. Our findings highlight the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion.

Comment: The paper introduces NeuroMerging, a novel framework for model merging that addresses task interference at the neuronal level, aligning with the representation learning and model architecture criteria.

Relevance: 9 Novelty: 9

5. Quantum-PEFT: Ultra parameter-efficient fine-tuning

ArXiv ID: 2503.05431

Authors: Toshiaki Koike-Akino, Francesco Tonin, Yongtao Wu, Frank Zhengqing Wu, Leyla Naz Candogan, Volkan Cevher

Abstract: This paper introduces Quantum-PEFT that leverages quantum computations for parameter-efficient fine-tuning (PEFT). Unlike other additive PEFT methods, such as low-rank adaptation (LoRA), Quantum-PEFT exploits an underlying full-rank yet surprisingly parameter efficient quantum unitary parameterization. With the use of Pauli parameterization, the number of trainable parameters grows only logarithmically with the ambient dimension, as opposed to linearly as in LoRA-based PEFT methods. Quantum-PEFT achieves vanishingly smaller number of trainable parameters than the lowest-rank LoRA as dimensions grow, enhancing parameter efficiency while maintaining a competitive performance. We apply Quantum-PEFT to several transfer learning benchmarks in language and vision, demonstrating significant advantages in parameter efficiency.

Comment: The paper proposes Quantum-PEFT, a novel parameter-efficient fine-tuning method leveraging quantum computations, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 9

6. Distilling Dataset into Neural Field

ArXiv ID: 2503.04835

Authors: Donghyeok Shin, HeeSun Bae, Gyuwon Sim, Wanmo Kang, Il-Chul Moon

Abstract: Utilizing a large-scale dataset is essential for training high-performance deep learning models, but it also comes with substantial computation and storage costs. To overcome these challenges, dataset distillation has emerged as a promising solution by compressing the large-scale dataset into a smaller synthetic dataset that retains the essential information needed for training. This paper proposes a novel parameterization framework for dataset distillation, coined Distilling Dataset into Neural Field (DDiF), which leverages the neural field to store the necessary information of the large-scale dataset. Due to the unique nature of the neural field, which takes coordinates as input and output quantity, DDiF effectively preserves the information and easily generates various shapes of data. We theoretically confirm that DDiF exhibits greater expressiveness than some previous literature when the utilized budget for a single synthetic instance is the same. Through extensive experiments, we demonstrate that DDiF achieves superior performance on several benchmark datasets, extending beyond the image domain to include video, audio, and 3D voxel. We release the code at https://github.com/aailab-kaist/DDiF.

Comment: The paper introduces a novel parameterization framework for dataset distillation using neural fields, which is highly relevant to foundational research in representation learning and efficiency.

Relevance: 9 Novelty: 8

7. TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation

ArXiv ID: 2503.04872

Authors: Lin Sun, Guangxiang Zhao, Xiaoqi Jian, Yuhan Wu, Weihong Lin, Yongfu Zhu, Change Jia, Linglin Zhang, Jinzhu Wu, Junfeng Ran, Sai-er Hu, Zihan Jiang, Junting Zhou, Wenrui Liu, Bin Cui, Tong Yang, Xiangzheng Zhang

Abstract: The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is \textit{selectively distilled} into specialized student models via domain-specific supervised fine-tuning (SFT); And (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points) and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.

Comment: The paper introduces a novel Branch-Merge distillation approach for model compression, which aligns with the model compression criterion, particularly in the context of LLMs.

Relevance: 9 Novelty: 8

8. Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models

ArXiv ID: 2503.05005

Authors: Benyamin Jamialahmadi, Parsa Kavehzadeh, Mehdi Rezagholizadeh, Parsa Farinneya, Hossein Rajabzadeh, Aref Jafari, Boxing Chen, Marzieh Tahaei

Abstract: Deploying large language models (LLMs) in real-world applications is often hindered by strict computational and latency constraints. While dynamic inference offers the flexibility to adjust model behavior based on varying resource budgets, existing methods are frequently limited by hardware inefficiencies or performance degradation. In this paper, we introduce Balcony, a simple yet highly effective framework for depth-based dynamic inference. By freezing the pretrained LLM and inserting additional transformer layers at selected exit points, Balcony maintains the full model's performance while enabling real-time adaptation to different computational budgets. These additional layers are trained using a straightforward self-distillation loss, aligning the sub-model outputs with those of the full model. This approach requires significantly fewer training tokens and tunable parameters, drastically reducing computational costs compared to prior methods. When applied to the LLaMA3-8B model, using only 0.2% of the original pretraining data, Balcony achieves minimal performance degradation while enabling significant speedups. Remarkably, we show that Balcony outperforms state-of-the-art methods such as Flextron and Layerskip as well as other leading compression techniques on multiple models and at various scales, across a variety of benchmarks.

Comment: The paper introduces Balcony, a framework for dynamic inference in LLMs, which aligns with the model compression and efficiency criterion through its innovative depth-based dynamic inference approach.

Relevance: 9 Novelty: 8

9. Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

ArXiv ID: 2503.04973

Authors: Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti

Abstract: Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside top ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.

Comment: The paper proposes task-aware KV cache compression, which aligns with model compression and efficiency improvements in LLMs. The task-aware approach is a novel contribution.

Relevance: 9 Novelty: 8

10. Wanda++: Pruning Large Language Models via Regional Gradients

ArXiv ID: 2503.04992

Authors: Yifan Yang, Kai Zhen, Bhavana Ganesh, Aram Galstyan, Goeric Huybrechts, Markus M\"uller, Jonas M. K\"ubler, Rupak Vignesh Swaminathan, Athanasios Mouchtaris, Sravan Babu Bodapati, Nathan Susanj, Zheng Zhang, Jack FitzGerald, Abhishek Kumar

Abstract: Large Language Models (LLMs) pruning seeks to remove unimportant weights for inference speedup with minimal performance impact. However, existing methods often suffer from performance loss without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level \textbf{regional} gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced output discrepancies between the dense and sparse decoder output. Notably, Wanda++ improves perplexity by up to 32\% over Wanda in the language modeling task and generalizes effectively to downstream tasks. Further experiments indicate our proposed method is orthogonal to sparsity-aware fine-tuning, where Wanda++ can be combined with LoRA fine-tuning to achieve a similar perplexity improvement as the Wanda method. The proposed method is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single NVIDIA H100 GPU.

Comment: The paper introduces Wanda++, a pruning framework for LLMs, which aligns with model compression and sparsity. The use of regional gradients is a novel approach.

Relevance: 9 Novelty: 8

11. Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs

ArXiv ID: 2503.05139

Authors: Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, Fakang Wang, Gangshan Wang, Guangyao Zhai, Haitao Zhang, Huizhong Li, Jun Zhou, Jia Liu, Junpeng Fang, Junjie Ou, Jun Hu, Ji Luo, Ji Zhang, Jian Liu, Jian Sha, Jianxue Qian, Jiewei Wu, Junping Zhao, Jianguo Li, Jubao Feng, Jingchao Di, Junming Xu, Jinghua Yao, Kuan Xu, Kewei Du, Longfei Li, Lei Liang, Lu Yu, Li Tang, Lin Ju, Peng Xu, Qing Cui, Song Liu, Shicheng Li, Shun Song, Song Yan, Tengwei Cai, Tianyi Chen, Ting Guo, Ting Huang, Tao Feng, Tao Wu, Wei Wu, Xiaolu Zhang, Xueming Yang, Xin Zhao, Xiaobo Hu, Xin Lin, Yao Zhao, Yilong Wang, Yongzhen Guo, Yuanyuan Wang, Yue Yang, Yang Cao, Yuhao Fu, Yi Xiong, Yanzhe Li, Zhe Li, Zhiqiang Zhang, Ziqi Liu, Zhaoxin Huan, Zujie Wen, Zhenhang Sun, Zhuoxuan Du, Zhengyu He

Abstract: In this technical report, we tackle the challenges of training large-scale Mixture of Experts (MoE) models, focusing on overcoming cost inefficiency and resource limitations prevalent in such systems. To address these issues, we present two differently sized MoE large language models (LLMs), namely Ling-Lite and Ling-Plus (referred to as "Bailing" in Chinese, spelled B\v{a}il\'ing in Pinyin). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters. Both models exhibit comparable performance to leading industry benchmarks. This report offers actionable insights to improve the efficiency and accessibility of AI development in resource-constrained settings, promoting more scalable and sustainable technologies. Specifically, to reduce training costs for large-scale MoE models, we propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency. Additionally, leveraging high-quality data generated from knowledge graphs, our models demonstrate superior capabilities in tool use compared to other models. Ultimately, our experimental findings demonstrate that a 300B MoE LLM can be effectively trained on lower-performance devices while achieving comparable performance to models of a similar scale, including dense and MoE models. Compared to high-performance devices, utilizing a lower-specification hardware system during the pre-training phase demonstrates significant cost savings, reducing computing costs by approximately 20%. The models can be accessed at https://huggingface.co/inclusionAI.

Comment: The paper discusses scaling Mixture-of-Experts (MoE) models efficiently, which directly aligns with foundational research in model architecture and efficiency.

Relevance: 9 Novelty: 8

12. Continual Pre-training of MoEs: How robust is your router?

ArXiv ID: 2503.05029

Authors: Benjamin Th\'erien, Charles-\'Etienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, Irina Rish

Abstract: Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopted an MoE architecture. Naturally, practitioners will want to extend the capabilities of these models with large amounts of newly collected data without completely re-training them. Prior work has shown that a simple combination of replay and learning rate re-warming and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) do the MoE transformer's routers exacerbate forgetting relative to a dense model?; 2) do the routers maintain a balanced load on previous distributions after CPT?; 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale (>2B parameter switch and DeepSeek MoE LLMs trained for 600B tokens) empirical study across four MoE transformers to answer these questions. Our results establish a surprising robustness to distribution shifts for both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.

Comment: The paper investigates continual pre-training of MoE models, providing insights into routing algorithms and robustness, which is highly relevant to foundational research in MoE architectures.

Relevance: 9 Novelty: 8

13. Strategy Coopetition Explains the Emergence and Transience of In-Context Learning

ArXiv ID: 2503.05631

Authors: Aaditya K. Singh, Ted Moskovitz, Sara Dragutinovic, Felix Hill, Stephanie C. Y. Chan, Andrew M. Saxe

Abstract: In-context learning (ICL) is a powerful ability that emerges in transformer models, enabling them to learn from context without weight updates. Recent work has established emergent ICL as a transient phenomenon that can sometimes disappear after long training times. In this work, we sought a mechanistic understanding of these transient dynamics. Firstly, we find that, after the disappearance of ICL, the asymptotic strategy is a remarkable hybrid between in-weights and in-context learning, which we term "context-constrained in-weights learning" (CIWL). CIWL is in competition with ICL, and eventually replaces it as the dominant strategy of the model (thus leading to ICL transience). However, we also find that the two competing strategies actually share sub-circuits, which gives rise to cooperative dynamics as well. For example, in our setup, ICL is unable to emerge quickly on its own, and can only be enabled through the simultaneous slow development of asymptotic CIWL. CIWL thus both cooperates and competes with ICL, a phenomenon we term "strategy coopetition." We propose a minimal mathematical model that reproduces these key dynamics and interactions. Informed by this model, we were able to identify a setup where ICL is truly emergent and persistent.

Comment: The paper provides a mechanistic understanding of in-context learning dynamics, which aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 8

14. Neuromorphic Quantum Neural Networks with Tunnel-Diode Activation Functions

ArXiv ID: 2503.04978

Authors: Jake McNaughton, A. H. Abbas, Ivan S. Maksymov

Abstract: The mathematical complexity and high dimensionality of neural networks hinder the training and deployment of machine learning (ML) systems while also requiring substantial computational resources. This fundamental limitation drives ML research, particularly in the exploration of alternative neural network architectures that integrate novel building blocks, such as advanced activation functions. Tunnel diodes are well-known electronic components that utilise the physical effect of quantum tunnelling (QT). Here, we propose using the current voltage characteristic of a tunnel diode as a novel, physics-based activation function for neural networks. We demonstrate that the tunnel-diode activation function (TDAF) outperforms traditional activation functions in terms of accuracy and loss during both training and evaluation. We also highlight its potential for implementation in electronic circuits suited to developing neuromorphic, quantum-inspired AI systems capable of operating in environments not suitable for qubit-based quantum computing hardware.

Comment: The use of tunnel-diode activation functions introduces a novel physics-based activation mechanism, which is relevant to architectural innovations in neural networks.

Relevance: 8 Novelty: 8

15. Global graph features unveiled by unsupervised geometric deep learning

ArXiv ID: 2503.05560

Authors: Mirja Granfors, Jes\'us Pineda, Blanca Zufiria Gerbol\'es, Joana B. Pereira, Carlo Manzo, Giovanni Volpe

Abstract: Graphs provide a powerful framework for modeling complex systems, but their structural variability makes analysis and classification challenging. To address this, we introduce GAUDI (Graph Autoencoder Uncovering Descriptive Information), a novel unsupervised geometric deep learning framework that captures both local details and global structure. GAUDI employs an innovative hourglass architecture with hierarchical pooling and upsampling layers, linked through skip connections to preserve essential connectivity information throughout the encoding-decoding process. By mapping different realizations of a system - generated from the same underlying parameters - into a continuous, structured latent space, GAUDI disentangles invariant process-level features from stochastic noise. We demonstrate its power across multiple applications, including modeling small-world networks, characterizing protein assemblies from super-resolution microscopy, analyzing collective motion in the Vicsek model, and capturing age-related changes in brain connectivity. This approach not only improves the analysis of complex graphs but also provides new insights into emergent phenomena across diverse scientific domains.

Comment: The paper introduces GAUDI, a novel unsupervised geometric deep learning framework for graph analysis, which aligns with representation learning through its innovative autoencoder architecture.

Relevance: 8 Novelty: 8

16. Riemann$^2$: Learning Riemannian Submanifolds from Riemannian Data

ArXiv ID: 2503.05540

Authors: Leonel Rozo, Miguel Gonz\'alez-Duque, No\'emie Jaquier, S{\o}ren Hauberg

Abstract: Latent variable models are powerful tools for learning low-dimensional manifolds from high-dimensional data. However, when dealing with constrained data such as unit-norm vectors or symmetric positive-definite matrices, existing approaches ignore the underlying geometric constraints or fail to provide meaningful metrics in the latent space. To address these limitations, we propose to learn Riemannian latent representations of such geometric data. To do so, we estimate the pullback metric induced by a Wrapped Gaussian Process Latent Variable Model, which explicitly accounts for the data geometry. This enables us to define geometry-aware notions of distance and shortest paths in the latent space, while ensuring that our model only assigns probability mass to the data manifold. This generalizes previous work and allows us to handle complex tasks in various domains, including robot motion synthesis and analysis of brain connectomes.

Comment: The paper introduces a novel approach to learning Riemannian latent representations, which aligns with representation learning and provides theoretical insights into constrained data geometry.

Relevance: 8 Novelty: 8

17. Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations

ArXiv ID: 2503.05522

Authors: Eren Erogullari, Sebastian Lapuschkin, Wojciech Samek, Frederik Pahde

Abstract: Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts as directions within the latent space of neural networks. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. However, this method often produces similar, non-orthogonal directions for correlated concepts, such as "beard" and "necktie" within the CelebA dataset, which frequently co-occur in images of men. This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications, such as activation steering. To address this issue, we introduce a post-hoc concept disentanglement method that employs a non-orthogonality loss, facilitating the identification of orthogonal concept directions while preserving directional correctness. We evaluate our approach with real-world and controlled correlated concepts in CelebA and a synthetic FunnyBirds dataset with VGG16 and ResNet18 architectures. We further demonstrate the superiority of orthogonalized concept representations in activation steering tasks, allowing (1) the insertion of isolated concepts into input images through generative models and (2) the removal of concepts for effective shortcut suppression with reduced impact on correlated concepts in comparison to baseline CAVs.

Comment: The paper introduces a method for disentangling concept representations in neural networks, which aligns with representation learning and provides insights into latent space interpretability.

Relevance: 8 Novelty: 8

18. VQEL: Enabling Self-Developed Symbolic Language in Agents through Vector Quantization in Emergent Language Games

ArXiv ID: 2503.04940

Authors: Mohammad Mahdi Samiei Paqaleh, Mahdieh Soleymani Baghshah

Abstract: In the field of emergent language, efforts have traditionally focused on developing communication protocols through interactions between agents in referential games. However, the aspect of internal language learning, where language serves not only as a communicative tool with others but also as a means for individual thinking, self-reflection, and problem-solving remains underexplored. Developing a language through self-play, without another agent's involvement, poses a unique challenge. It requires an agent to craft symbolic representations and train them using direct gradient methods. The challenge here is that if an agent attempts to learn symbolic representations through self-play using conventional modeling and techniques such as REINFORCE, the solution will offer no advantage over previous multi-agent approaches. We introduce VQEL, a novel method that incorporates Vector Quantization into the agents' architecture, enabling them to autonomously invent and develop discrete symbolic representations in a self-play referential game. Following the self-play phase, agents can enhance their language through reinforcement learning and interactions with other agents in the mutual-play phase. Our experiments across various datasets demonstrate that VQEL not only outperforms the traditional REINFORCE method but also benefits from improved control and reduced susceptibility to collapse, thanks to the incorporation of vector quantization.

Comment: The paper introduces a novel method for emergent language learning using vector quantization, which aligns with foundational research in representation learning and symbolic representation.

Relevance: 8 Novelty: 8

19. Towards Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients

ArXiv ID: 2503.05424

Authors: Niklas Penzel, Joachim Denzler

Abstract: Deep learning models achieve high predictive performance but lack intrinsic interpretability, hindering our understanding of the learned prediction behavior. Existing local explainability methods focus on associations, neglecting the causal drivers of model predictions. Other approaches adopt a causal perspective but primarily provide more general global explanations. However, for specific inputs, it's unclear whether globally identified factors apply locally. To address this limitation, we introduce a novel framework for local interventional explanations by leveraging recent advances in image-to-image editing models. Our approach performs gradual interventions on semantic properties to quantify the corresponding impact on a model's predictions using a novel score, the expected property gradient magnitude. We demonstrate the effectiveness of our approach through an extensive empirical evaluation on a wide range of architectures and tasks. First, we validate it in a synthetic scenario and demonstrate its ability to locally identify biases. Afterward, we apply our approach to analyze network training dynamics, investigate medical skin lesion classifiers, and study a pre-trained CLIP model with real-life interventional data. Our results highlight the potential of interventional explanations on the property level to reveal new insights into the behavior of deep models.

Comment: The paper proposes a novel framework for local interventional explanations, which aligns with representation learning and provides insights into model interpretability.

Relevance: 8 Novelty: 8

20. Extrapolation Merging: Keep Improving With Extrapolation and Merging

ArXiv ID: 2503.04834

Authors: Yiguan Lin, Bin Xu, Yinghao Li, Yang Gao

Abstract: Large Language Models (LLMs) require instruction fine-tuning to perform different downstream tasks. However, the instruction fine-tuning phase still demands significant computational resources and labeled data, lacking a paradigm that can improve model performance without additional computational power and data. Model merging aims to enhance performance by combining the parameters of different models, but the lack of a clear optimization direction during the merging process does not always guarantee improved performance. In this paper, we attempt to provide a clear optimization direction for model merging. We first validate the effectiveness of the model extrapolation method during the instruction fine-tuning phase. Then, we propose Extrapolation Merging, a paradigm that can continue improving model performance without requiring extra computational resources or data. Using the extrapolation method, we provide a clear direction for model merging, achieving local optimization search, and consequently enhancing the merged model's performance. We conduct experiments on seven different tasks, and the results show that our method can consistently improve the model's performance after fine-tuning.

Comment: The extrapolation merging paradigm for improving model performance without additional resources is relevant to foundational research in model optimization and efficiency.