ACL2025

When Large Language Models Meet Speech: A Survey on Integration Approaches

Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, Chenhui Chu

Cited by 6

Abstract

Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. Many studies have explored integrating other modalities with LLMs, notably speech, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for future research.

1 One challenge that affects the scope of studies is the lack of a standard definition for LLMs. In this paper, we adopt the loose definition by Zhao et al. (2023), focusing on models with over 10 billion parameters, while also including notable