CVPR 2025

AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Jiazhi Guan, Kaisiyuan Wang, Zhiliang Xu, Quanwei Yang, Yasheng Sun, Shengyi He, Borong Liang, Yukang Cao, Yingying Li, Haocheng Feng, Errui Ding, Jingdong Wang, Youjian Zhao, Hang Zhou, Ziwei Liu


Abstract

Figure 1. Zero-shot results by AudCast. Our method generates lifelike, realistically styled human videos at various resolutions, conditioned on any reference subject and driving audio. The synthesized videos exhibit natural, rhythmic motion and vivid expressions, with fine details in both the face and hands.