ACL2024

Question Translation Training for Better Multilingual Reasoning

Wenhao Zhu, Shujian Huang, Fei Yuan, Shuaijie She, Jiajun Chen, Alexandra Birch

Abstract

Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English. This is unsurprising given that their training data largely consists of English text and instructions. A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training. This approach not only incurs high cost, but also results in poorly translated data due to the non-standard formatting of mathematical chain-of-thought. In this paper, we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data. In this way we perform targeted, in-domain language alignment which makes best use of English instruction data to unlock the LLMs' multilingual reasoning abilities. Experimental results on LLaMA2-13B show that question alignment leads to consistent improvements over the translate-training approach: an average improvement of 11.3% and 16.1% accuracy across ten languages on the MGSM and MSVAMP multilingual reasoning benchmarks.

1 Introduction

Large language models have recently shown a strong ability to reason in English, but performance in other languages, especially more distant languages, still trails far behind (Shi et al., 2022; Huang et al., 2023). It is unsurprising, considering that their training data is predominantly composed of English text and instructions (Blevins and Zettlemoyer, 2022; Touvron et al., 2023; Wang et al., 2023).
To elicit LLMs' multilingual performance, previous approaches typically follow the translate-training paradigm (Chen et al., 2023), which first translates English instruction data into non-English languages with a translation engine and then uses the multilingual data for instruction-tuning.

the behavior of LLMs with human expectations, Wei et al. (2022a) propose instruction-tuning, training LLMs to generate the desired response based on the given instruction. Subsequently, many efforts have been put into creating effective instruction data to further unlock LLMs' potential (Wang et al., 2022;