ACL 2024

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang

Cited by 5

Abstract

Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2x), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2x outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28× decoding speedup in offline generation.
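
To illustrate how blank and repeated tokens let a CTC-style decoder vary its latency, the following is a minimal sketch of the standard CTC collapse rule (merge consecutive repeats, then drop blanks), applied both offline and chunk by chunk. It is not the NAST-S2x implementation: the `BLANK` symbol, the function names, the toy token sequences, and the choice to carry the last raw token across chunk boundaries are all assumptions made for illustration.

```python
from itertools import groupby
from typing import Iterable, List, Optional, Tuple

BLANK = "<blank>"  # hypothetical blank symbol, chosen for this sketch


def ctc_collapse(tokens: Iterable[str]) -> List[str]:
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    return [tok for tok, _ in groupby(tokens) if tok != BLANK]


def ctc_collapse_chunk(
    chunk: Iterable[str], prev_last: Optional[str] = None
) -> Tuple[List[str], Optional[str]]:
    """Collapse the raw outputs for one speech chunk.

    `prev_last` is the last raw token of the previous chunk (an assumption
    of this sketch), so a repeat that straddles a chunk boundary is still
    merged. Returns the newly emitted tokens and the updated last raw token.
    """
    emitted: List[str] = []
    last = prev_last
    for tok in chunk:
        if tok != last and tok != BLANK:
            emitted.append(tok)
        last = tok
    return emitted, last


# Toy raw decoder outputs for four fixed-length chunks (made-up data).
chunks = [
    ["a", "a", BLANK],
    [BLANK, BLANK, BLANK],  # all blanks: nothing is emitted, the model waits
    [BLANK, "b", "b"],
    ["b", BLANK, "b"],      # leading "b" merges with the previous chunk's "b"
]

streamed: List[List[str]] = []
last: Optional[str] = None
for chunk in chunks:
    out, last = ctc_collapse_chunk(chunk, last)
    streamed.append(out)

offline = ctc_collapse(tok for chunk in chunks for tok in chunk)
```

In this toy run the second chunk yields no output, which is the sense in which emitting blanks postpones a token to a later chunk, while the concatenated streaming output matches the offline collapse of the full sequence.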