ACL 2025

Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning

Yexing Du, Youcheng Pan, Ziyang Ma, Bo Yang, Yifan Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bing Qin

Abstract

Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy on the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance across 15 × 14 language pairs and requires fewer than 10 hours of speech data per language to achieve competitive results.

* Corresponding author.
The source code and models are released at https://github.com/yxduir/LLM-SRT.

[Figure: panels (a) and (c), showing the components ASR Model, MT Model, and MLLM, with the example sentence pair "The long-lived bridge still stands today." / "这座历史悠久的桥至今仍然屹立不倒。"]
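The "15 × 14 language pairs" in the abstract can be read as every ordered (source, target) combination of 15 languages with the source differing from the target, which gives 210 translation directions. The following is a minimal sketch of that count; the function name `translation_directions` and the placeholder language names are illustrative assumptions, not identifiers from the released code, and the actual language set is not listed in this section.

```python
from itertools import permutations


def translation_directions(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


# Placeholder names (assumption): the actual 15 languages are not given here.
languages = [f"lang_{i:02d}" for i in range(1, 16)]
directions = translation_directions(languages)

# Each of the 15 source languages pairs with the 14 remaining targets.
print(len(directions))  # 15 * 14 = 210
```

Because the pairs are ordered, translating from one language to another and the reverse direction count as two separate pairs, which is what distinguishes a many-to-many setup from an English-centric one.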