CVPR 2025

AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward

Haonan Han, Xiangzuo Wu, Huan Liao, Zunnan Xu, Zhongyuan Hu, Ronghui Li, Yachao Zhang, Xiu Li

Abstract

Figure 1. Showcases of motion samples for three scenarios. The two motion samples (Left and Right) for each scenario were generated from the prompt shown above them. We leverage GPT-4V to compare the two samples according to how well each aligns with the input prompt.
Integrity: "A person walks forward, bends at the waist, and picks up something" (Left is better)
Temporal: "A person waves his arm first, walks forward in a straight line, then turns left." (Left is better)
Frequency: "A person leaps forward for 3 times" (Right is better)