ACL2024

Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering

ChaeHun Park, Koanho Lee, Hyesu Lim, Jaeseok Kim, Junmo Park, Yu-Jung Heo, Du-Seong Chang, Jaegul Choo

Abstract

Building a reliable visual question answering (VQA) system across different languages is a challenging problem, primarily due to the lack of abundant samples for training. To address this challenge, recent studies have employed machine translation systems for the cross-lingual VQA task. This involves translating the evaluation samples into a source language (usually English) and using monolingual models (i.e., translate-test). However, our analysis reveals that translated texts contain unique characteristics distinct from human-written ones, referred to as translation artifacts. We find that these artifacts can significantly affect the models, confirmed by extensive experiments across diverse models, languages, and translation processes. In light of this, we present straightforward data augmentation strategies that can alleviate the adverse impacts of translation artifacts.

translate-train, translates training samples into individual target languages and uses them to train models for target languages. This approach is advantageous as it does not perform translation during inference, but it requires training individual models for each target language. Furthermore, recent VL models (Singh et al., 2022; Liu et al., 2023b; Li et al., 2023a), which are mostly tailored in English, are not suitable for the translate-train approach. Another widely adopted approach, called translate-test, translates test samples written in target languages into the source language and uses VL models of the source language for the inference. These translation-based approaches have shown remarkable performance in cross-lingual tasks.

Despite the effectiveness of translation systems in cross-lingual VL tasks, using machine-translated texts as input inevitably introduces a mismatch between the training and inference phases.
In the translate-test approach, models are trained on human-written texts but evaluated on machine-translated texts. This distribution shift could hurt the generalization of models to different languages (Yu et al., 2022; Wang et al., 2022). For instance, as illustrated in Fig. 1, leveraging machine-translated texts might lead to undesirable model outcomes, even when both questions convey the same meaning. In this paper, we refer to artifacts in translations that cause such unwanted behaviors as translation artifacts. We argue that the translation artifacts have been overlooked in previous cross-lingual VQA studies despite their significance.

To explore the effect of mismatched data distribution on cross-lingual VQA, we alleviate this mismatch in the data origins by employing machine-translated texts in both training and inference. Our investigation focuses on the translate-test, which can take advantage of strong monolingual models and efficiently serve multiple target languages with a single VL model. Our experimental results reveal that models trained on machine-translated texts generally outperform those trained on human-written texts, increasing the averaged accuracy over languages and models from 51.82 to 53.14 points. This improvement, as confirmed by our qualitative analysis, is primarily attributed to the subtle nuances in translated texts (i.e., translation artifacts).

Our comprehensive study covers various components in cross-lingual VQA, including 14 models, 13 languages, 5 machine translation systems, and diverse translation settings. We also observe that recent VL models (Li et al., 2023a; Dai et al., 2023; Gao et al., 2023) integrated with large language models also suffer from translation artifacts.
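
The translate-test evaluation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Sample` layout, the `translate` and `model` callables, and the toy stubs are all hypothetical placeholders standing in for a real machine translation system and a real English VL model.

```python
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical placeholders for illustration only.
Sample = Tuple[str, str, str, str]       # (image_id, question, lang, gold_answer)
Translator = Callable[[str, str], str]   # (text, source_lang) -> English text
VQAModel = Callable[[str, str], str]     # (image_id, English question) -> answer


def translate_test_accuracy(
    samples: List[Sample], translate: Translator, model: VQAModel
) -> Dict[str, float]:
    """Per-language accuracy when each question is translated into
    English before being passed to a monolingual (English) VQA model."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for image_id, question, lang, gold in samples:
        english_question = translate(question, lang)    # translation at test time
        prediction = model(image_id, english_question)  # English-only inference
        total[lang] = total.get(lang, 0) + 1
        correct[lang] = correct.get(lang, 0) + int(prediction == gold)
    return {lang: correct[lang] / total[lang] for lang in total}


def average_over_languages(per_language: Dict[str, float]) -> float:
    """Unweighted mean of per-language accuracies."""
    if not per_language:
        raise ValueError("no languages to average over")
    return sum(per_language.values()) / len(per_language)


# Toy stand-ins (assumptions): a lookup-table "translator" and a brittle
# "model" that only answers the exact phrasing it was trained on.
def toy_translate(text: str, source_lang: str) -> str:
    lookup = {
        ("Was ist das?", "de"): "What is that?",
        ("C'est quoi ?", "fr"): "What is it?",
    }
    return lookup[(text, source_lang)]


def toy_model(image_id: str, question: str) -> str:
    return "cat" if question == "What is that?" else "unknown"


toy_samples: List[Sample] = [
    ("img1", "Was ist das?", "de", "cat"),
    ("img1", "C'est quoi ?", "fr", "cat"),
]
toy_scores = translate_test_accuracy(toy_samples, toy_translate, toy_model)
toy_average = average_over_languages(toy_scores)
```

The toy model is deliberately contrived: both translated questions ask the same thing, yet only one phrasing is answered correctly. This mirrors, in a highly simplified form, how the wording of a translation rather than its meaning can change a model's output.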
We also present simple data augmentation techniques, verifying their effectiveness in both human-written and machine-translated texts.

Our contribution can be summarized as follows:

1. This is, to our knowledge, the first study to investigate translation artifacts in cross-lingual visual question answering.
2. We provide extensive analyses across a variety of languages and models, providing a foundation for future research.
3. We present simple yet effective data augmentation strategies using translated texts.

2 Related Work

2.1 Cross-lingual VQA

The study of VQA has predominantly focused on English and other high-resource languages (Zhu