ACL 2023

Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation

Maha Elbayad, Anna Y. Sun, Shruti Bhosale


Abstract

Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation. However, for low-resource tasks, MoE models severely over-fit. In this work, we introduce effective regularization strategies, namely (1) dropout techniques for MoE layers in Expert Output Masking (EOM) and Final Output Masking (FOM), (2) Conditional MoE Routing (CMR), which learns which tokens require the extra capacity of MoE layers, and (3) Curriculum Learning methods that introduce low-resource pairs at later stages of training. All these methods prevent over-fitting and improve the performance of MoE models on low-resource tasks without adversely affecting high-resource tasks. On a massively multilingual machine translation benchmark, our strategies yield an improvement of about +1 chrF++ on very low-resource language pairs.
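
To make the idea of output masking concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the masking rule, so the sketch assumes that the combined MoE output is zeroed for a random fraction of tokens during training, leaving those tokens with only the residual path. The names `final_output_mask`, `moe_sublayer`, and `p_mask` are hypothetical and introduced here for illustration only.

```python
import random


def final_output_mask(moe_outputs, p_mask, rng, training=True):
    """Zero the combined MoE output for a random subset of tokens.

    moe_outputs: list of per-token output vectors (lists of floats).
    p_mask: per-token masking probability (hypothetical hyperparameter).
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Assumption: masking is applied only during training.
    if not training or p_mask <= 0.0:
        return [list(vec) for vec in moe_outputs]
    masked = []
    for vec in moe_outputs:
        if rng.random() < p_mask:
            # This token's MoE contribution is dropped entirely.
            masked.append([0.0] * len(vec))
        else:
            masked.append(list(vec))
    return masked


def moe_sublayer(inputs, moe_outputs, p_mask, rng, training=True):
    """Add the (possibly masked) MoE output to the residual input."""
    masked = final_output_mask(moe_outputs, p_mask, rng, training)
    return [[x + m for x, m in zip(xv, mv)] for xv, mv in zip(inputs, masked)]
```

Under these assumptions, a masked token passes through the sublayer unchanged, which is one way such masking could discourage the model from relying too heavily on expert capacity for low-resource inputs.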