CVPR 2025

HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation

Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, Yanwei Fu

Abstract

Model roads that interconnect our world are dominated by…

Figure 1. HOP: We propose a topology-based heterogeneous multimodal model that integrates features from audio, text, and action, accounting for their inherent heterogeneity through cross-modality adaptation. The model achieves superior performance on both the TED-Expressive dataset (first row) and the TED dataset (second row), generating gestures that align with the semantics and rhythmic qualities of the speech, as well as the motion characteristics of the real speaker.