ACL 2025

Masks Can be Learned as an Alternative to Experts

Peiyu Liu, Tianwen Wei, Bo Zhu, Xin Zhao, Shuicheng Yan

Cited by 1

Abstract

In this work, we investigate how to sparsify a pre-trained dense large language model into a mixture-of-experts (MoE) architecture for faster inference. Our approach applies a mask matrix to the activations for each expert, constrained by L0 regularization to minimize the number of activated parameters. To ensure minimal performance loss under this constraint, we initialize the model with all parameters active and progressively sparsify it during training. This approach proves more efficient than one-shot sparsification techniques, which typically require significant resources for performance recovery. Moreover, our approach automatically identifies shared, token-specific, and inactive experts, allowing for more efficient allocation of computational resources. Through extensive experiments, we achieve up to 97% performance retention on downstream tasks with only 50% of the feed-forward parameters activated in dense models. Beyond improving inference efficiency, this strategy of sharing computational units among experts provides a principled foundation for building more scalable and generalizable MoE architectures, paving the way for future expert-based model designs. Our code is available at https://github.com/lpyhdzx/Mixture-of-Masks.
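
The masking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the masks are parameterized, so this sketch assumes the hard-concrete gate of Louizos et al. (2018), a common way to make an L0 penalty differentiable. All function names, the linear penalty ramp, and the way units are labeled as shared, token-specific, or inactive are hypothetical and are not taken from the authors' repository.

```python
import math

# Hard-concrete constants from Louizos et al. (2018).
# ASSUMPTION: the paper's actual gate parameterization is not given in the abstract.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate(log_alpha):
    """Deterministic (inference-time) gate value, clipped to [0, 1]."""
    stretched = sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))


def expected_l0(log_alphas):
    """Differentiable surrogate for the number of non-zero gates."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(a - shift) for a in log_alphas)


def l0_weight(step, total_steps, max_weight):
    """Hypothetical linear ramp of the L0 penalty weight: training starts
    with no sparsity pressure and tightens it progressively."""
    return max_weight * min(1.0, step / total_steps)


def apply_mask(activations, log_alphas):
    """Scale each hidden activation by its gate (one mask row = one expert)."""
    return [h * gate(a) for h, a in zip(activations, log_alphas)]


def classify_units(log_alpha_per_expert):
    """Label each hidden unit by how many experts keep it active.
    Rows are experts, columns are hidden units (hypothetical layout)."""
    masks = [[gate(a) > 0.0 for a in row] for row in log_alpha_per_expert]
    labels = []
    for unit in zip(*masks):
        if all(unit):
            labels.append("shared")
        elif any(unit):
            labels.append("token-specific")
        else:
            labels.append("inactive")
    return labels
```

Initializing every `log_alpha` to a large positive value makes all gates equal to 1, which corresponds to starting from the fully active dense model; raising the penalty weight over training then pushes gates toward 0. Under the labeling assumed here, a unit kept by every expert's mask plays the role of a shared unit, and a unit kept by none can be skipped at inference.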