ACL2024

Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages

Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, Yang Liu

Abstract

While large language models (LLMs) have been pre-trained on multilingual corpora, their performance in most languages still lags behind that in a few resource-rich languages. One common approach to mitigate this issue is to translate training data from resource-rich languages into other languages and then continue training. However, relying solely on translated data while ignoring the original capabilities of LLMs across languages is not always effective, and we show that it limits the performance of cross-lingual knowledge transfer. In this work, we propose SDRRL, a method based on Self-Distillation from Resource-Rich Languages that effectively improves multilingual performance by leveraging the internal capabilities of LLMs on resource-rich languages. We evaluate on different LLMs (LLaMA-2 and SeaLLM) and source languages (English and French) across various comprehension and generation tasks. Experimental results demonstrate that SDRRL can significantly enhance multilingual capabilities while minimizing the impact on original performance in resource-rich languages.

... before continuing the training process, thereby offering more data in the target language and improving multilingual capabilities.

However, the translate-then-SFT method encounters several challenges. First, the multilingual enhancement gained from translated "question-answer" pairs is limited and may sometimes even degrade the capabilities in the original primary language (Zhu et al., 2024). Second, constrained by the accuracy of machine translation (especially for low-resource languages), the translated texts used for training can be highly noisy, containing numerous awkward sentences and incorrect content, which adversely affects the quality of the generated text and the multilingual abilities of the LLMs.
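As a rough illustration of the translate-then-SFT data construction discussed above, the following minimal sketch translates each resource-rich-language "question-answer" pair into every target language. The names `QAPair`, `build_translated_sft_set`, and `toy_translate`, as well as the language codes, are hypothetical placeholders and do not come from the paper. In practice `toy_translate` would be replaced by a real machine translation system, whose errors are the source of the noise described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class QAPair:
    """One instruction-tuning example (hypothetical container)."""
    question: str
    answer: str
    lang: str


def build_translated_sft_set(
    pairs: List[QAPair],
    translate: Callable[[str, str], str],
    target_langs: List[str],
) -> List[QAPair]:
    """Translate both sides of every pair into each target language."""
    translated = []
    for pair in pairs:
        for lang in target_langs:
            translated.append(
                QAPair(
                    question=translate(pair.question, lang),
                    answer=translate(pair.answer, lang),
                    lang=lang,
                )
            )
    return translated


def toy_translate(text: str, lang: str) -> str:
    """Stand-in for a machine translation system (assumption):
    it only tags the text with the target language code."""
    return f"[{lang}] {text}"


english_pairs = [QAPair("What is the capital of France?", "Paris.", "en")]
translated_set = build_translated_sft_set(english_pairs, toy_translate, ["th", "vi"])
```

The resulting `translated_set` would then be used for continued supervised fine-tuning. Because both the question and the ground-truth answer pass through the translator, any translation error is carried directly into the training targets.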
Therefore, we explore a new question along this trajectory: Besides translating the training pairs, can we enhance the abilities in other languages by leveraging the relatively strong original capabilities of LLMs in resource-rich languages?

In this paper, we introduce SDRRL, a method that uses Self-Distillation from Resource-Rich Languages to achieve the goal mentioned above. Specifically, as illustrated in Figure 1(c), SDRRL comprises two parts: (1) Self-Distillation: Instead of the ground-truth answers, responses from LLMs in resource-rich languages are collected to construct a transfer set. These are then translated into other languages using machine translation systems and code-switching tools, forming "question-answer" pairs that are semantically identical but linguistically varied, on which sentence-level knowledge self-distillation is conducted within the same batch. (2) Incorporating External Parallel Corpus: We further involve a small amount of machine translation data in the distillation, aiming to better align the linguistic representation spaces and mitigate the negative impact of noise in machine translation systems on the generative capabilities of LLMs.

Our experiments, based on LLaMA-2-7B (Touvron et al., 2023b) and SeaLLM-7B (Nguyen et al., 2023) with English as the resource-rich language, demonstrate that even with a smaller set of English instruction data as the transfer set, SDRRL can effectively distill English capabilities into 14 other languages, showing effectiveness in both multilingual comprehension and generation tasks. Further analysis indicates that SDRRL helps preserve the original capabilities in high-resource languages and improves the quality of generated responses.

2 Related Work

Multilingual Language Models.
Using multilingual data during pre-training is a common approach to enhance the multilingual capabilities of