CVPR2024

AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond

Zixiang Zhou, Yu Wan, Baoyuan Wang

19 citations

Abstract

Large Language Models(LLMs) have shown remarkable emergent abilities in unifying almost all (if not every) NLP tasks. In the human motion-related realm, however, researchers still develop siloed models for each task. In-spired by InstuctGPT[16], and the generalist concept be-hind Gato [27], we introduce AvatarGPT, an All-in-One framework for motion understanding, planning, generations as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of in-struction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected with language as the univer-sal interface, constituting a closed-loop within the frame-work. To achieve this, human motion sequences are first encoded as discrete tokens, which serve as the extended vo-cabulary of LLM. Then, an unsupervised pipeline to gen-erate natural language descriptions of human action sequences from in-the-wild videos is developed. Finally, all tasks are jointly trained. Extensive experiments show that AvatarGPT achieves SOTA on low-level tasks, and promising results on high-level tasks, demonstrating the effectiveness of our proposed All-in-One framework. Moreover, for the first time, AvatarGPT enables a principled approach by iterative traversal of the tasks within the closed-loop for un-limited long-motion synthesis.