ACL2024

Language-Informed Beam Search Decoding for Multilingual Machine Translation

Yilin Yang, Stefan Lee, Prasad Tadepalli

Abstract

Beam search decoding is the de-facto method for decoding auto-regressive Neural Machine Translation (NMT) models, including multilingual NMT where the target language is specified as an input. However, decoding multilingual NMT models commonly produces "off-target" translations, that is, translation outputs not in the intended language. In this paper, we first conduct an error analysis of off-target translations for a strong multilingual NMT model and identify how these decodings are produced during beam search. We then propose Language-informed Beam Search (LiBS), a general decoding algorithm incorporating an off-the-shelf Language Identification (LiD) model into beam search decoding to reduce off-target translations. LiBS is an inference-time procedure that is NMT-model agnostic and does not require any additional parallel data. Results show that our proposed LiBS algorithm improves BLEU on average by +1.1 and +0.9 on the WMT and OPUS datasets, and reduces off-target rates from 22.9% to 7.7% and from 65.8% to 25.3%, respectively.¹

1 Motivation

With Neural Machine Translation (NMT) (Bahdanau et al., 2014; Vaswani et al., 2017) becoming the state-of-the-art approach in the bilingual Machine Translation literature, Multilingual Neural Machine Translation (MNMT) has attracted much attention (Johnson et al., 2017). MNMT has two main advantages: a) it enables one model to translate between multiple language pairs and thus reduces the model and deployment complexity from O(N^2) to O(1), and b) it enables transfer learning between high-resource and low-resource languages. One attractive feature of such transfer learning is zero-shot translation, where the multilingual model is able to translate between language pairs unseen during training.
For example, after training from

¹ Code to be released after publication.

translation performance usually drops significantly with increasing beam sizes. In our study, we also found this phenomenon prevailing in the multilingual system and highly related to the off-target translation error.

As an example, we demonstrate the beam search curse on WMT De→Fr and Cs→De translation, since both are between high-resource languages and have decent translation performance (between 15 and 20 BLEU).

Table 1 illustrates the results on WMT De→Fr and Cs→De. We can clearly observe that the off-target rate grows sub-linearly with the beam size, and as a result the BLEU score drops significantly with increasing beam sizes. This raises the question of why the off-target rate increases drastically with larger beam sizes, and whether the performance drop (i.e. BLEU decrease) is mainly due to the off-target errors.

3.2 Off-Target Error Analysis

As part of a detailed analysis, we study the off-target error types across six zero-shot pairs (i.e. 12 translation directions) from the WMT dataset. We categorize the off-target errors into three types:

that even though the off-target error is overwhelming across languages, it can easily be categorized into mostly two types: translating into English and "translating" into source. The "Others" error type comprises only a negligible 1.1% of cases, close to the 0.81% error margin of the FastText LiD model (Yang et al., 2021).

"→Source" errors. We hypothesize that this error is related to the previously studied "source copying" behavior (Ott et al., 2018) on the bilingual NMT model. We then sample three cases from this error type (shown in Table 3). The case study confirms that the "→Source" error type is the same as source copying behavior on bilingual models for these cases.
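The error categorization described above can be illustrated with a minimal sketch. It assumes each system output has already been assigned a language code by some LiD model (the text uses FastText; no LiD library is called here), and that the analyzed directions do not involve English as source or target. The function names `categorize` and `error_breakdown` and the toy predictions are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def categorize(pred_lang: str, src_lang: str, tgt_lang: str) -> str:
    """Map one LiD-predicted language code to an error type.

    Labels follow the error types named in the text; the decision
    order is an assumption made for this sketch.
    """
    if pred_lang == tgt_lang:
        return "on-target"
    if pred_lang == src_lang:
        return "→Source"
    if pred_lang == "en":
        return "→English"
    return "Others"


def error_breakdown(pred_langs, src_lang, tgt_lang):
    """Return (off-target rate, per-type counts) for one translation direction."""
    counts = Counter(categorize(p, src_lang, tgt_lang) for p in pred_langs)
    total = len(pred_langs)
    off_target = total - counts["on-target"]
    rate = off_target / total if total else 0.0
    return rate, counts


# Toy LiD predictions for a hypothetical Fr→De test set of eight sentences.
preds = ["de", "fr", "en", "de", "fr", "it", "de", "de"]
rate, counts = error_breakdown(preds, src_lang="fr", tgt_lang="de")
print(f"off-target rate: {rate:.1%}")
for error_type in ("→English", "→Source", "Others"):
    print(error_type, counts[error_type])
```

Because the LiD model itself has a small error margin, counts in the "Others" bucket of comparable size should be read with caution.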
To quantify the degree of source copying, we run Sentence BLEU evaluation⁸ between source and system translation on WMT Fr→De "→Source" errors. The sentence BLEU distribution is shown in Figure 1, with an average sentence BLEU of 85.3. It clearly demonstrates that the "→Source" error strongly displays a source copying behavior and is somehow promoted by larger beam sizes.

"→English" errors. Since none of our evaluated directions includes English as the target language,