CVPR 2024

GenZI: Zero-Shot 3D Human-Scene Interaction Generation

Lei Li, Angela Dai

Abstract

"riding a motorcycle, sitting"
"hands knocking on the closed door, facing the door, standing"
"stretching legs and sitting on the saddle on a standing cow"
"picking up dumbbells on a shelf, standing, bending over"

Figure 1. Given an arbitrary 3D scene, GenZI can synthesize virtual humans interacting with the 3D environment at specified locations from a brief text description. Our approach does not require any 3D human-scene interaction training data or 3D learning. By distilling interaction priors from powerful 2D vision-language models, we optimize for 3D human-scene interaction synthesis in a flexible fashion, with simple language-based control and high generality to various types of scene environments.