EMNLP 2024

Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair

Yusuke Sakai, Mana Makinae, Hidetaka Kamigaito, Taro Watanabe

Abstract

In Simultaneous Machine Translation (SiMT), training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency systems. However, constructing such a corpus is challenging due to high costs and limitations in annotator capabilities; as a result, existing SI corpora are limited. Therefore, we propose a method that uses Large Language Models (LLMs) to convert existing speech translation (ST) corpora into interpretation-style corpora, maintaining the original word order and preserving the entire source content. We call the resulting corpus the LLM-SI-Corpus. We demonstrate that fine-tuning SiMT models on the LLM-SI-Corpus reduces latency while achieving better quality than models fine-tuned on other corpora, in both speech-to-text and text-to-text settings. The LLM-SI-Corpus is available at https://github.com/yusuke1997/LLM-SI-Corpus.