ACL 2025

A Self-Distillation Recipe for Neural Machine Translation

Hongfei Xu, Zhuofei Liang, Qiuhui Liu, Lingling Mu

Abstract

Self-distillation distills knowledge from the deeper sub-networks to the shallower sub-networks without an extra teacher model, and has proven effective across a range of computer vision tasks. In this paper, we study representation-based self-distillation methods for Neural Machine Translation (NMT), which avoid the efficiency issue of probability-distribution-based Knowledge Distillation (KD) with a large vocabulary. We present a rank-order augmented Pearson correlation loss and an iterative distillation method to prevent the discrepancy between the predictions of the student and those of a stronger teacher from disturbing training. To prevent the teacher from misleading the student, we use a warm-up strategy and present a gradient adaption method that scales down or zeroes the knowledge distillation gradients that are opposite to the translation gradients. Experiments on the low-resource IWSLT 14 German to English, middle-resource WMT 14 English to German, and high-resource WMT 15 Czech to English and WMT 14 English to French tasks show that our method leads to significant improvements over strong Transformer baselines, achieving performance comparable to previous machine translation knowledge distillation studies without pre-training a teacher. Experiments with shallower and deeper Transformers show that our method achieves comparable or better performance efficiently with fewer layers. Our method is also effective in the multilingual setting or with a recurrent decoder.
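
As a rough illustration of two ideas named in the abstract, the sketch below computes a plain Pearson correlation loss between a student and a teacher representation, and then zeroes or scales down distillation gradient components that point against the translation gradient. It is not the paper's implementation: the rank-order augmentation is omitted because the abstract does not define it, the element-wise sign comparison is an assumption, and the names `pearson_loss`, `adapt_kd_gradient`, and `scale` are hypothetical.

```python
import math


def pearson_loss(student, teacher):
    """Return 1 - Pearson correlation of two equal-length vectors.

    Plain Pearson loss only; the rank-order augmentation mentioned in the
    abstract is NOT reproduced here.
    """
    n = len(student)
    mean_s = sum(student) / n
    mean_t = sum(teacher) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(student, teacher))
    var_s = sum((s - mean_s) ** 2 for s in student)
    var_t = sum((t - mean_t) ** 2 for t in teacher)
    denom = math.sqrt(var_s * var_t)
    if denom == 0.0:
        # Correlation is undefined for a constant vector; treat as uncorrelated.
        return 1.0
    return 1.0 - cov / denom


def adapt_kd_gradient(kd_grad, mt_grad, scale=0.0):
    """Scale (or zero, when scale == 0) the KD gradient components whose sign
    opposes the translation gradient.

    Assumption: "opposite" is judged element-wise by sign; the abstract does
    not state the granularity used in the paper.
    """
    return [
        g_kd * scale if g_kd * g_mt < 0 else g_kd
        for g_kd, g_mt in zip(kd_grad, mt_grad)
    ]


if __name__ == "__main__":
    student_repr = [1.0, 2.0, 3.0]
    teacher_repr = [2.0, 4.0, 6.0]
    print("Pearson loss:", pearson_loss(student_repr, teacher_repr))

    kd = [0.5, -0.2, 0.3]
    mt = [1.0, 0.4, -0.1]
    print("Adapted KD gradient:", adapt_kd_gradient(kd, mt, scale=0.0))
```

Because the Pearson loss compares representations rather than output distributions, its cost does not depend on the vocabulary size, which is the efficiency argument the abstract makes for representation-based distillation.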