CVPR 2025
HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, Yanwei Fu
Abstract
Model roads that interconnect our world are dominated by…
Figure 1. HOP: We propose a topology-based heterogeneous multimodal model that integrates features from audio, text, and action, accounting for their inherent heterogeneity through cross-modality adaptation. The model achieves superior performance on both the TED-Expressive dataset (first row) and the TED dataset (second row), generating gestures that align with the semantics and rhythmic qualities of the speech, as well as the motion characteristics of the real speaker.