ACL 2024

XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, Zenglin Xu

Abstract

Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency, since a significant number of parameters are unnecessarily involved in computation by multiplying values by zero or by low activation values. To address this issue, we present XMoE, a novel MoE design that enhances both the efficacy and the efficiency of sparse MoE models. XMoE leverages small experts and a threshold-based router to enable tokens to selectively engage only essential parameters. Our extensive experiments on language modeling and machine translation tasks demonstrate that XMoE can enhance model performance and can reduce the computation load at MoE layers by over 50% without sacrificing performance. Furthermore, we demonstrate the versatility of XMoE by applying it to dense models, enabling sparse computation during inference. We provide a comprehensive analysis and make our code available at https://anonymous.4open.science/r/XMoE.

Networks (FFNs) (Vaswani et al., 2017). Unlike traditional models that utilize all parameters for each input token, MoE models selectively activate a subset of experts. This approach effectively decouples computational costs from model size, paving the way for more efficient scaling.
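To make the idea of a threshold-based router concrete, the following is a minimal sketch in plain Python. The text above does not specify the routing rule, so the rule used here is an assumption: experts are ranked by softmax probability and added until their cumulative probability reaches a threshold. The function name `threshold_route`, the default `threshold` of 0.9, and the optional `max_experts` cap are all hypothetical and are not taken from the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_route(logits, threshold=0.9, max_experts=None):
    """Select experts for one token (illustrative assumption, not the paper's code).

    Experts are visited in descending order of routing probability and
    added until the accumulated probability reaches `threshold`, or until
    the optional `max_experts` cap is hit.
    Returns (selected expert indices, their routing probabilities).
    """
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    selected = []
    mass = 0.0
    for idx in order:
        selected.append(idx)
        mass += probs[idx]
        if mass >= threshold:
            break
        if max_experts is not None and len(selected) >= max_experts:
            break
    return selected, [probs[i] for i in selected]


if __name__ == "__main__":
    # A token with one dominant expert engages a single expert.
    print(threshold_route([10.0, 0.0, 0.0, 0.0]))
    # A token with no clear preference engages more experts.
    print(threshold_route([0.0, 0.0, 0.0, 0.0]))
```

Under this assumed rule, the number of experts a token engages adapts to how peaked its routing distribution is, so tokens with a confident routing decision use fewer parameters than tokens with a flat distribution.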