EMNLP 2024
Using Language Models to Disambiguate Lexical Choices in Translation
Josh Barua, Sanjay Subramanian, Kayo Yin, Alane Suhr
Abstract
In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English. We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving from 67 to 85% accuracy across languages. Finally, we use language models to generate English rules describing target-language concept variations. Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases reaching or outperforming GPT-4.

Concept: date (fruit)

Variations and Generated Rules
Khorma refers to the fruit of the date palm when it is fully ripe and dried. It is commonly consumed as a sweet, chewy snack or used in various dishes, particularly desserts.
Rotab refers to fresh, soft dates that are partially ripe. These dates are moister and sweeter than fully ripe, dried dates (khorma). Rotab is often eaten as a fresh fruit or used in cooking where a softer, sweeter texture is desired.
Kharak refers to dates that are unripe and are in a semi-dried state. They are less sweet compared to rotab and khorma and are often used in cooking or further processed into other forms.

Lexical Selection
Source Sentence: she brings me dried dates and tells me old stories
Correct Variation: khorma
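The rule-guided lexical selection setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_prompt`, `parse_choice`, `accuracy`), the prompt wording, and the answer-parsing heuristic are all assumptions made for illustration, and the rule texts are abbreviated from the example above. The sketch only builds the prompt a model would receive and scores predicted variations against gold labels; no model is called.

```python
from typing import Dict, List, Optional

# Abbreviated versions of the generated rules shown in the example above.
RULES: Dict[str, str] = {
    "khorma": "the fruit of the date palm when it is fully ripe and dried",
    "rotab": "fresh, soft dates that are partially ripe",
    "kharak": "dates that are unripe and in a semi-dried state",
}


def build_prompt(sentence: str, concept: str, rules: Dict[str, str]) -> str:
    """Assemble a lexical-selection prompt (hypothetical wording)."""
    lines = [f"Concept: {concept}", "Variations and rules:"]
    for variation, rule in rules.items():
        lines.append(f"- {variation}: {rule}")
    lines.append(f"Source sentence: {sentence}")
    lines.append("Answer with exactly one variation from: " + ", ".join(rules))
    return "\n".join(lines)


def parse_choice(response: str, rules: Dict[str, str]) -> Optional[str]:
    """Return the variation named in a response, or None if zero or several match."""
    text = response.lower()
    matches = [v for v in rules if v.lower() in text]
    return matches[0] if len(matches) == 1 else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of predictions that equal the gold variation."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


prompt = build_prompt(
    "she brings me dried dates and tells me old stories",
    "date (fruit)",
    RULES,
)
# A model's reply would be parsed like this (the reply here is made up).
choice = parse_choice("Khorma", RULES)
```

Treating a reply that names zero or several variations as unparsed (`None`) is a deliberately conservative choice in this sketch: such replies count as errors in `accuracy` rather than being resolved by guessing.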