CVPR 2025

Pose Priors from Language Models

Sanjay Subramanian, Evonne Ng, Lea Müller, Dan Klein, Shiry Ginosar, Trevor Darrell

Abstract

[Figure 1 panels: Initial, ProsePose, Final]

Figure 1. Optimizing human-to-human contacts in 3D pose. Our approach leverages the semantic priors of a Large Multimodal Model (LMM) to infer meaningful information about physical contact from images. Instead of relying on human annotations or motion capture data, we extract not only descriptive insights ("... engaged in a dance or embrace ...") but also structured constraints between body parts (underlined). By incorporating these LMM-derived constraints, we refine initial 3D human pose estimates, achieving realistic and semantically consistent reconstructions of contact. This scalable approach opens up new possibilities for contact-aware pose estimation without explicit contact annotations, making it a promising alternative to traditional methods.
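To make the idea of "structured constraints between body parts" more concrete, the sketch below shows one way such constraints could be scored against a pose estimate. It is an illustration only, not the authors' implementation: the constraint format (pairs of body-part names), the part names, the per-part point sets, and the functions `min_part_distance` and `contact_loss` are all assumptions introduced here. The loss is the sum, over constrained pairs, of the smallest distance between the two parts, so a refinement step that lowers it would pull the named parts into contact.

```python
import math

# Hypothetical constraint format: each pair names a body part on person A and
# a body part on person B that the LMM describes as touching.
constraints = [("left hand", "right shoulder"), ("chest", "chest")]


def min_part_distance(points_a, points_b):
    """Smallest Euclidean distance between any point of one part and any point of the other."""
    return min(math.dist(p, q) for p in points_a for q in points_b)


def contact_loss(parts_a, parts_b, pairs):
    """Sum of minimum part-to-part distances over all constrained pairs.

    parts_a / parts_b map a part name to a list of 3D points (assumed to be
    sampled from that part of the estimated body surface).
    """
    return sum(min_part_distance(parts_a[a], parts_b[b]) for a, b in pairs)


# Toy pose estimates with one 3D point per part, for illustration only.
parts_a = {
    "left hand": [(0.0, 0.0, 0.0)],
    "chest": [(0.0, 1.0, 0.0)],
}
parts_b = {
    "right shoulder": [(3.0, 4.0, 0.0)],
    "chest": [(0.0, 1.0, 2.0)],
}

loss = contact_loss(parts_a, parts_b, constraints)
print(loss)
```

In a refinement loop, a value like this would typically be combined with other terms (for example, staying close to the initial estimate) and minimized over the pose parameters; that optimization is outside the scope of this sketch.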