CVPR 2024
Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hongsun Yang, Yooncheol Ju, Ilhwan Kim, Byeong-Yeol Kim, Joon Son Chung
Abstract
Figure 1. Our framework integrates Talking Face Generation (TFG) and Text-to-Speech (TTS) systems, generating synchronised natural speech and a talking face video from a single portrait and a text input. The TFG model is conditioned on the intermediate representations of the TTS model (motion condition), which makes the model capable of variational motion generation. The TTS model is conditioned on the identity features extracted in the TFG model (speaker condition), so that the generated speech aligns with the input identity.