ICLR 2025
TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation
Haiyang Liu, Xingchao Yang, Tomoya Akiyama, Yuantian Huang, Qiaoge Li, Shigeru Kuriyama, Takafumi Taketomi
Abstract
TANGO is a framework for generating co-speech body-gesture videos using motion-graph-based retrieval. It first retrieves reference video clips that match the target speech audio, which account for most of the output, using an implicit hierarchical audio-motion embedding space. It then uses a diffusion-based interpolation network to generate the remaining transition frames and smooth the discontinuities at clip boundaries.
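The two-stage idea (retrieve, then fill transitions) can be illustrated with a toy sketch. This is not the authors' implementation: the embeddings here are hand-written stand-ins for the learned audio-motion embedding, the motion graph is omitted, and plain linear blending stands in for the diffusion-based interpolation network. The names `retrieve_clips` and `linear_transition` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_clips(audio_embs, clip_embs):
    """For each audio segment, return the index of the most similar clip.

    Assumption: audio and clip embeddings live in one shared space,
    so nearest-neighbour search by cosine similarity is meaningful.
    """
    return [
        max(range(len(clip_embs)), key=lambda j: cosine(a, clip_embs[j]))
        for a in audio_embs
    ]

def linear_transition(pose_a, pose_b, n_frames):
    """Placeholder for the diffusion interpolator: linearly blend two
    boundary poses into n_frames in-between frames."""
    frames = []
    for k in range(1, n_frames + 1):
        t = k / (n_frames + 1)
        frames.append([(1 - t) * x + t * y for x, y in zip(pose_a, pose_b)])
    return frames

# Toy data (hypothetical): three reference clips, two audio segments.
clip_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_embs = [[0.9, 0.1], [0.2, 0.8]]

chosen = retrieve_clips(audio_embs, clip_embs)
# Fill the gap between the last pose of one clip and the first of the next.
transition = linear_transition([0.0, 0.0], [1.0, 2.0], 3)
```

In this sketch the retrieved clips supply the bulk of the frames, and only the short gaps between consecutive clips are synthesized, which mirrors the division of labour described in the abstract.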