CVPR 2024

HanDiffuser: Text-to-Image Generation with Realistic Hand Appearances

Supreeth Narasimhaswamy, Uttaran Bhattacharya, Xiang Chen, Ishita Dasgupta, Saayan Mitra, Minh Hoai

Abstract

Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses and shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings into the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model that generates SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model that synthesizes images by conditioning on the prompts and on the hand parameters generated by the first component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations, and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands.

Project page: https://supreethn.github.io/research/handiffuser/index.html

* Work started when Supreeth was an intern at Adobe Research.

[...] fingers can bend to various degrees relatively independently. Hands can also occur in various shapes, sizes, and orientations and can be occluded by other human body parts. Further, hands often interact with objects and can have a wide range of grasps depending on the object's size, shape, and affordance. Therefore, capturing such a vast range of articulations and interactions directly from text inputs remains challenging. Despite having billions of parameters and millions of training images, T2I models struggle to generate realistic hands.
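The two-component design described in the abstract can be read as a composition of two samplers: text to hand parameters, then text plus hand parameters to image. The following is a minimal sketch of that data flow only, not the authors' implementation. The names `HandParams`, `text_to_hand_params`, `hand_params_to_image`, and `generate` are hypothetical, both stages are stubbed with seeded random draws in place of trained diffusion models, and the vector sizes assume the standard SMPL and MANO axis-angle parameterizations (72 pose and 10 shape values for SMPL, 48 pose and 10 shape values for MANO), which the abstract does not specify.

```python
import random
from dataclasses import dataclass
from typing import List

# Assumed sizes, following the standard SMPL / MANO conventions.
SMPL_POSE_DIM = 72   # 24 joints x 3 axis-angle values
SMPL_SHAPE_DIM = 10  # shape coefficients
MANO_POSE_DIM = 48   # 3 global rotation + 15 joints x 3 axis-angle values
MANO_SHAPE_DIM = 10  # shape coefficients


@dataclass
class HandParams:
    """Hypothetical container for the output of the first stage."""
    smpl_pose: List[float]
    smpl_shape: List[float]
    mano_pose: List[float]
    mano_shape: List[float]


def _sample(rng: random.Random, n: int) -> List[float]:
    # Stand-in for a diffusion sampler: n draws from a unit Gaussian.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def text_to_hand_params(prompt: str, seed: int = 0) -> HandParams:
    """Stage 1 stub: map a prompt to body and hand parameters."""
    rng = random.Random(f"{seed}:{prompt}")
    return HandParams(
        smpl_pose=_sample(rng, SMPL_POSE_DIM),
        smpl_shape=_sample(rng, SMPL_SHAPE_DIM),
        mano_pose=_sample(rng, MANO_POSE_DIM),
        mano_shape=_sample(rng, MANO_SHAPE_DIM),
    )


def hand_params_to_image(prompt: str, params: HandParams,
                         height: int = 4, width: int = 4,
                         seed: int = 0) -> List[List[float]]:
    """Stage 2 stub: produce a placeholder image conditioned on both inputs."""
    # Fold the parameters into the seed so the output depends on them.
    cond = (sum(params.smpl_pose) + sum(params.smpl_shape)
            + sum(params.mano_pose) + sum(params.mano_shape))
    rng = random.Random(f"{seed}:{prompt}:{cond:.6f}")
    return [[rng.random() for _ in range(width)] for _ in range(height)]


def generate(prompt: str, seed: int = 0) -> List[List[float]]:
    """Chain the two stages: prompt -> hand parameters -> image."""
    params = text_to_hand_params(prompt, seed=seed)
    return hand_params_to_image(prompt, params, seed=seed)
```

Calling `generate("a person waving", seed=1)` returns a small grid of floats in place of a real image. The point of the sketch is only that the second stage receives both the prompt and the intermediate hand parameters, rather than the prompt alone.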