CVPR 2025

FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Kefan Chen, Chaerin Min, Linguang Zhang, Shreyas Hampali, Cem Keskin, Srinath Sridhar

Abstract

Figure 1. We present FoundHand, a domain-specific image generation model that synthesizes realistic single-hand and dual-hand images. FoundHand is trained on our large-scale FoundHand-10M dataset, which contains automatically extracted 2D keypoint and segmentation mask annotations (top left). FoundHand is formulated as a 2D pose-conditioned image-to-image diffusion model that enables precise control of hand pose and camera viewpoint (top right). Optionally, generation can be conditioned on a reference image to preserve its style (top right). Our model generalizes robustly to in-the-wild images across hand-centric applications and has core capabilities such as gesture transfer, domain transfer, and novel view synthesis (middle row). These capabilities enable zero-shot applications: fixing malformed hand images and synthesizing coherent hand and hand-object videos without explicit object cues (bottom row).