CVPR 2025

Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi, Sung-Bin Kim, Suekyeong Nam, Tae-Hyun Oh

Abstract

Figure 1. What defines perceptually accurate lip movement for a speech signal? Panels: (a) Essential Criteria for Perceptually Accurate 3D Talking Head; (b) Conceptual Diagram of Desired Representation Space. In this work, we define three criteria to assess perceptual alignment between speech and lip movements of 3D talking heads: Temporal Synchronization, Lip Readability, and Expressiveness (a). Our motivating hypothesis is that there exists a desirable representation space that models and complies well with these three criteria across diverse speech characteristics and 3D facial movements, as illustrated in (b): representations of the same phoneme are clustered, are sensitive to temporal synchronization, and follow a certain pattern as speech intensity increases. Consequently, we build a rich speech-mesh synchronized representation space that exhibits these desirable properties.