ICLR 2025

TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation

Haiyang Liu, Xingchao Yang, Tomoya Akiyama, Yuantian Huang, Qiaoge Li, Shigeru Kuriyama, Takafumi Taketomi

Abstract

TANGO is a framework for generating co-speech body-gesture videos with a motion-graph-based retrieval approach. It first retrieves reference video clips that match the target speech audio, using an implicit hierarchical audio-motion embedding space; these retrieved clips supply most of the output frames. It then uses a diffusion-based interpolation network to generate the remaining transition frames and smooth the discontinuities at clip boundaries.
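The two stages described above can be illustrated with a toy sketch. It is not the authors' implementation: the embeddings are plain lists of floats, retrieval is a simple cosine-similarity lookup rather than a motion-graph search, and linear blending stands in for the diffusion-based interpolation network. The function names `retrieve_clip` and `linear_transition` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_clip(audio_emb, clip_embs):
    """Stage 1 (assumed form): return the index of the reference clip whose
    motion embedding is closest to the audio embedding in the shared space."""
    return max(range(len(clip_embs)), key=lambda i: cosine(audio_emb, clip_embs[i]))


def linear_transition(last_pose, first_pose, n_frames):
    """Stage 2 stand-in: linearly blend between the last pose of one clip and
    the first pose of the next. TANGO uses a diffusion-based network here;
    linear interpolation is only a placeholder to show where it fits."""
    frames = []
    for k in range(1, n_frames + 1):
        t = k / (n_frames + 1)  # strictly between 0 and 1, endpoints excluded
        frames.append([(1 - t) * p + t * q for p, q in zip(last_pose, first_pose)])
    return frames


# Toy data: one audio embedding and three candidate clip embeddings.
best = retrieve_clip([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
bridge = linear_transition([0.0], [4.0], 3)
print(best, bridge)
```

In this sketch the retrieved clips would be concatenated in order, with the frames returned by `linear_transition` inserted at each boundary; in the actual framework a learned interpolation model generates those boundary frames.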