ACL 2024

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng


Abstract

Two-pass direct speech-to-speech translation (S2ST) models decompose S2ST into speech-to-text translation (S2TT) and text-to-speech (TTS), yet are still trained end-to-end by sharing the target text representation between the S2TT and TTS models. These models have shown promising results. However, training them still requires large-scale parallel speech data comprising <source speech, target text, target speech> triplets, which is extremely challenging to collect. On the other hand, a large amount of data and numerous pretrained models have accumulated for S2TT and TTS, and these can be used to reduce the reliance on parallel speech data. To this end, we propose a composite S2ST model named ComSpeech, which connects pretrained S2TT and TTS models by introducing a vocabulary adaptor based on connectionist temporal classification (CTC). The vocabulary adaptor adapts the output text sequence of S2TT to the input text sequence of TTS; the two differ because the models use different vocabularies. In this way, ComSpeech can still be trained end-to-end and needs only a small amount of parallel speech data for fine-tuning. We further propose a novel training method, ComSpeech-ZS, which eliminates the reliance on parallel speech data by aligning the text representation spaces of S2TT and TTS. Experimental results on the CVSS dataset show that when parallel speech data is available, ComSpeech surpasses previous two-pass models such as UnitY and Translatotron 2 in both translation quality and decoding speed. When no parallel speech data is available, ComSpeech-ZS lags behind ComSpeech by only 0.7 ASR-BLEU and outperforms the cascaded models.
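The abstract does not describe the internals of the vocabulary adaptor, so the following is only a minimal sketch of the generic CTC collapse rule that a CTC-based component could rely on to turn per-position predictions into a shorter sequence in a second vocabulary. It is not the authors' implementation; the function name, the blank index, and the toy TTS vocabulary are all hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumption of this sketch: index 0 is the CTC blank symbol


def ctc_greedy_collapse(frame_scores: Sequence[Sequence[float]],
                        blank: int = BLANK) -> List[int]:
    """Greedy CTC decoding: take the best label at each position,
    merge consecutive repeats, then drop blanks."""
    output: List[int] = []
    previous = None
    for scores in frame_scores:
        # Index of the highest-scoring label at this position.
        best = max(range(len(scores)), key=lambda i: scores[i])
        # Emit only when the label changes and is not the blank.
        if best != previous and best != blank:
            output.append(best)
        previous = best
    return output


# Hypothetical target-side (TTS input) vocabulary, for illustration only.
TTS_VOCAB = ["<blank>", "HH", "AH", "L", "OW"]

# Toy per-position scores over TTS_VOCAB (8 positions, 5 labels).
frame_scores = [
    [0.10, 0.80, 0.05, 0.03, 0.02],  # HH
    [0.10, 0.70, 0.10, 0.05, 0.05],  # HH (repeat, merged)
    [0.90, 0.02, 0.03, 0.03, 0.02],  # blank
    [0.05, 0.05, 0.80, 0.05, 0.05],  # AH
    [0.05, 0.05, 0.05, 0.80, 0.05],  # L
    [0.85, 0.05, 0.05, 0.03, 0.02],  # blank
    [0.05, 0.05, 0.05, 0.05, 0.80],  # OW
    [0.05, 0.05, 0.05, 0.10, 0.75],  # OW (repeat, merged)
]

decoded_ids = ctc_greedy_collapse(frame_scores)
decoded_tokens = [TTS_VOCAB[i] for i in decoded_ids]
```

In this toy case, eight positions collapse to the four tokens HH, AH, L, OW, which illustrates how a CTC output layer allows the source and target sequences to differ in both length and vocabulary.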