CVPR 2024

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hongsun Yang, Yooncheol Ju, Ilhwan Kim, Byeong-Yeol Kim, Joon Son Chung


Abstract

Figure 1. Our framework integrates Talking Face Generation (TFG) and Text-to-Speech (TTS) systems, generating synchronised natural speech and a talking face video from a single portrait and text input. Our model is capable of variational motion generation by conditioning the TFG model on the intermediate representations of the TTS model. The speech, in turn, is conditioned on the identity features extracted by the TFG model so that the voice aligns with the input identity.
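The two-way conditioning described in the caption can be illustrated with a toy data-flow sketch. This is not the authors' implementation: every function name, the list-of-floats data types, and the arithmetic inside each stage are hypothetical placeholders for learned networks. The sketch only shows which module feeds which: identity features from the TFG side condition the TTS side, and intermediate TTS features drive the face motion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JointOutput:
    speech: List[float]        # stand-in for a waveform or spectrogram
    frames: List[List[float]]  # stand-in for generated video frames


def extract_identity(portrait: List[float]) -> List[float]:
    """Hypothetical TFG identity encoder (here: a mean-centred copy)."""
    mean = sum(portrait) / len(portrait)
    return [p - mean for p in portrait]


def tts_encode(text: str, identity: List[float]) -> List[List[float]]:
    """Hypothetical TTS encoder: one intermediate feature per character,
    shifted by a speaker condition derived from the identity features."""
    speaker_bias = sum(abs(v) for v in identity) / len(identity)
    return [[ord(ch) / 255.0 + speaker_bias] for ch in text]


def tts_decode(hidden: List[List[float]]) -> List[float]:
    """Hypothetical TTS decoder: maps intermediate features to speech values."""
    return [h[0] for h in hidden]


def tfg_render(portrait: List[float], hidden: List[List[float]]) -> List[List[float]]:
    """Hypothetical TFG generator: the intermediate TTS features act as the
    motion condition. One frame per feature is an assumption of this sketch."""
    return [[p + h[0] for p in portrait] for h in hidden]


def synthesise(portrait: List[float], text: str) -> JointOutput:
    identity = extract_identity(portrait)  # TFG -> TTS (speaker condition)
    hidden = tts_encode(text, identity)    # shared intermediate representation
    return JointOutput(
        speech=tts_decode(hidden),
        frames=tfg_render(portrait, hidden),  # TTS -> TFG (motion condition)
    )
```

Because both outputs in this sketch are derived from the same intermediate sequence, the speech and the frames have matching lengths, and changing the portrait changes the speaker condition and therefore the speech values.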