ACL 2025

Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation

Yingfeng Luo, Tong Zheng, Yongyu Mu, Bei Li, Qinghong Zhang, Yongqi Gao, Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, JingBo Zhu

Cited 12 times


Abstract

The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on WMT and our datasets show that our method matches or surpasses a range of baselines in translation quality, while achieving 2.4× to 6.5× inference speedups and a 75% reduction in the memory footprint of the KV cache. The method also demonstrates strong generalization across a variety of translation-related tasks.

Repository: NiuTrans/LaMaTE

Introduction

Designing widely applicable models for machine translation (MT) has been an active area of NLP research for decades. Although MT models have evolved significantly over time, most of them still operate on an "analyze-then-generate" paradigm (Brown et al., 1993; Koehn et al., 2003; Bahdanau et al., 2015). For example, in statistical MT, the source-language sentence is parsed into either syntactic or non-syntactic forms, and the translation is generated by mapping these parsed forms to target-language constructions (Chiang, 2005). LLMs can broadly be categorized as following a similar design: they first compute the key-value cache for each input sequence, and then produce output tokens in a left-to-right manner based on this cache.
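The two-phase behavior described above (compute a key-value cache for the input, then generate left to right from that cache) can be illustrated with a small sketch. This is not the paper's model: it is a toy single-head attention loop in plain Python, and the embedding table `EMBED`, the identity key/value projections, and the helper names `attend`, `prefill`, and `decode` are all assumptions made for illustration.

```python
import math

# Hypothetical toy embeddings: token id -> 2-d vector (illustration only).
EMBED = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [1.0, 1.0],
}


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(dim)
    ]


def prefill(tokens):
    """'Analyze' phase: build the key-value cache for the whole input."""
    cache = {"keys": [], "values": []}
    for token in tokens:
        # Assumption: keys and values are the raw embeddings (identity projections).
        cache["keys"].append(EMBED[token])
        cache["values"].append(EMBED[token])
    return cache


def decode(cache, last_token, steps):
    """'Generate' phase: emit tokens left to right, reusing and extending the cache."""
    output = []
    for _ in range(steps):
        context = attend(EMBED[last_token], cache["keys"], cache["values"])
        # Greedy choice: the token whose embedding best matches the context vector.
        next_token = max(
            EMBED,
            key=lambda t: sum(c * e for c, e in zip(context, EMBED[t])),
        )
        # Only the new token's key/value is added; earlier entries are not recomputed.
        cache["keys"].append(EMBED[next_token])
        cache["values"].append(EMBED[next_token])
        output.append(next_token)
        last_token = next_token
    return output


if __name__ == "__main__":
    kv_cache = prefill([0, 1, 2])
    generated = decode(kv_cache, last_token=2, steps=4)
    print(generated, len(kv_cache["keys"]))
```

In this sketch the cache grows by one entry per generated token, so its size is proportional to the input length plus the output length. A real Transformer keeps such a cache for every layer and attention head.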
From a modeling perspective, therefore, decoupling the encoding and decoding processes is a natural design choice for both traditional MT and LLMs. In NLP, a large body of work has focused on pre-training and applying text encoders, such as the BERT series of models (Devlin et al., 2019), which are primarily designed to address language understanding problems. More recently, there have been attempts to use LLMs as text encoders, though these models are more commonly used for generating text. For example, BehnamGhader et al. (2024) and Muennighoff et al. (2024) fine-tuned LLMs for text encoding and showed that LLMs can produce high-quality representations of text in various embedding tasks.

However, it is rare to see studies on incorporating LLMs in the encoding for NMT or other language generation tasks. In fact, the focus of research in MT has shifted. Recent studies in this field are now more concerned with the adaptation of LLMs, rather than the improvement of model architectures (Li et al., 2024; Zheng et al., 2024). One strand of research aims to enable LLMs to translate via prompting techniques, including designing better prompts (