ACL 2025
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
Cited by 42
Abstract
This paper revisits the implementation of the Load-balancing Loss (LBL) used when training Mixture-of-Experts (MoE) models. Specifically, the LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ is the frequency with which expert $i$ is selected, and $p_i$ is the average gating score of expert $i$. Existing MoE training frameworks usually employ a parallel training strategy in which $f_i$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. A micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL operates almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are routed uniformly to all experts, which inhibits expert specialization. In this work, we propose calculating the LBL using a global-batch to loosen this constraint. Because a global-batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to … total parameters and … tokens), we find, surprisingly, that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
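The difference between the two strategies can be illustrated with a small sketch in plain Python. It is not the authors' implementation: the function names (`expert_stats`, `lbl`, `micro_batch_lbl`, `global_batch_lbl`) are hypothetical, the selection frequency is assumed to be normalized by the number of tokens times `top_k`, micro-batches are assumed to be of equal size, and the cross-device communication step is replaced by a simple in-process average.

```python
def expert_stats(gates, top_k=1):
    """Return (f, p) for one batch of per-token gating scores.

    gates: list of per-token score lists, one score per expert.
    f[i]: fraction of routing slots assigned to expert i (assumed normalization).
    p[i]: mean gating score of expert i over the batch.
    """
    n_tokens, n_experts = len(gates), len(gates[0])
    counts = [0] * n_experts
    for scores in gates:
        # Indices of the top_k highest-scoring experts for this token.
        chosen = sorted(range(n_experts), key=lambda i: scores[i], reverse=True)[:top_k]
        for i in chosen:
            counts[i] += 1
    f = [c / (n_tokens * top_k) for c in counts]
    p = [sum(s[i] for s in gates) / n_tokens for i in range(n_experts)]
    return f, p


def lbl(f, p):
    """Load-balancing loss: (number of experts) * sum_i f_i * p_i."""
    return len(f) * sum(fi * pi for fi, pi in zip(f, p))


def micro_batch_lbl(micro_batches, top_k=1):
    """Compute the LBL inside each micro-batch, then average the losses."""
    losses = [lbl(*expert_stats(g, top_k)) for g in micro_batches]
    return sum(losses) / len(losses)


def global_batch_lbl(micro_batches, top_k=1):
    """Average f across micro-batches first, then compute the LBL with it."""
    stats = [expert_stats(g, top_k) for g in micro_batches]
    n_experts = len(stats[0][0])
    # Stand-in for the extra communication step (assumes equal-size micro-batches).
    f_global = [sum(f[i] for f, _ in stats) / len(stats) for i in range(n_experts)]
    losses = [lbl(f_global, p) for _, p in stats]
    return sum(losses) / len(losses)


# Toy data: two experts, top-1 routing, one domain per micro-batch.
code_mb = [[0.9, 0.1], [0.9, 0.1]]  # every "code" token prefers expert 0
math_mb = [[0.1, 0.9], [0.1, 0.9]]  # every "math" token prefers expert 1

print("micro-batch LBL: ", micro_batch_lbl([code_mb, math_mb]))
print("global-batch LBL:", global_batch_lbl([code_mb, math_mb]))
```

In this toy setup each expert specializes in one domain. The micro-batch LBL comes out at roughly 1.8, so the specialized routing is penalized. The global-batch LBL is roughly 1.0, the value obtained under perfectly uniform load, because the load is balanced once both micro-batches are considered together.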