ACL 2025

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin

Abstract

This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, the LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ is the frequency with which expert $i$ is selected, and $p_i$ is the average gating score of expert $i$. Existing MoE training frameworks usually employ parallel training strategies, so $f_i$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. A micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL operates almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating the LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
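The difference between the two ways of computing the LBL can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes top-1 routing, two experts, and equal-sized micro-batches, and it replaces the cross-device communication step with a plain NumPy mean. All function names are hypothetical.

```python
import numpy as np

def expert_frequency(gates):
    """f_i: fraction of tokens whose top-1 expert is i (top-1 routing is an assumption)."""
    num_tokens, num_experts = gates.shape
    counts = np.bincount(gates.argmax(axis=1), minlength=num_experts)
    return counts / num_tokens

def lbl(f, p):
    """LBL = N_E * sum_i f_i * p_i, as defined in the abstract."""
    return len(f) * float(np.dot(f, p))

def micro_batch_lbl(micro_batches):
    """Compute f_i and the LBL inside each micro-batch, then average the losses."""
    losses = [lbl(expert_frequency(g), g.mean(axis=0)) for g in micro_batches]
    return float(np.mean(losses))

def global_batch_lbl(micro_batches):
    """Synchronize f_i across micro-batches first, then compute the LBL.

    The mean over micro-batches stands in for the extra communication step;
    it assumes every micro-batch holds the same number of tokens.
    """
    f_global = np.mean([expert_frequency(g) for g in micro_batches], axis=0)
    losses = [lbl(f_global, g.mean(axis=0)) for g in micro_batches]
    return float(np.mean(losses))

# Two domain-specific "sequences" of 4 tokens each (rows are gating scores).
code_seq = np.tile([0.9, 0.1], (4, 1))  # every token prefers expert 0
text_seq = np.tile([0.1, 0.9], (4, 1))  # every token prefers expert 1
batches = [code_seq, text_seq]

micro = micro_batch_lbl(batches)
glob = global_batch_lbl(batches)
print(f"micro-batch LBL: {micro:.3f}, global-batch LBL: {glob:.3f}")
```

In this toy setting each sequence is fully specialized, so the micro-batch LBL is high (roughly 1.8), while the global-batch LBL sits at its balanced value of 1.0 because the experts are evenly loaded across the two sequences taken together. This matches the abstract's argument that the micro-batch variant penalizes within-sequence specialization even when the corpus-level load is balanced.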