CVPR 2023

Unite and Conquer: Plug & Play Multi-Modal Synthesis Using Diffusion Models

Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, Vishal M. Patel

Abstract

[Figure: qualitative comparisons. Samples for the labels Tibetan terrier, teddy bear, triceratops, tree frog, and otterhound. (b) (Face, hair) semantic labels, Text → Facial image. (c) Sketch, Text → Facial image. Methods shown: GLIDE [21], TediGAN [45], and ours. Example text prompts include "This person has brown hair and wears eyeglasses".]

[...] strategy. We also introduce a novel reliability parameter that allows different off-the-shelf diffusion models, trained on various datasets, to be used at sampling time alone to guide generation toward an outcome that satisfies multiple constraints. We perform experiments on various standard multimodal tasks to demonstrate the effectiveness of our approach. More details can be found at: https://nithin-gk.github.io/projectpages/Multidiff
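To make the idea of combining several pretrained models at sampling time concrete, the following is a minimal illustrative sketch, not the method defined in the paper. It assumes that each model produces a noise estimate of the same shape at a given sampling step and that the reliability parameter can be treated as a non-negative per-model weight. The function name `combine_noise_predictions` and the normalized weighted average are hypothetical choices made only for illustration.

```python
from typing import List, Sequence


def combine_noise_predictions(
    predictions: Sequence[Sequence[float]],
    reliabilities: Sequence[float],
) -> List[float]:
    """Blend per-model noise estimates using normalized reliability weights.

    ASSUMPTION: a reliability value acts as a non-negative weight on each
    model's estimate. This is an illustrative stand-in, not the paper's rule.
    """
    if len(predictions) == 0 or len(predictions) != len(reliabilities):
        raise ValueError("need one reliability value per model prediction")
    if any(r < 0 for r in reliabilities):
        raise ValueError("reliabilities must be non-negative")
    total = sum(reliabilities)
    if total == 0:
        raise ValueError("at least one reliability must be positive")

    size = len(predictions[0])
    if any(len(p) != size for p in predictions):
        raise ValueError("all predictions must have the same length")

    # Normalize so the weights sum to one, then average element-wise.
    weights = [r / total for r in reliabilities]
    return [
        sum(w * p[i] for w, p in zip(weights, predictions))
        for i in range(size)
    ]


# Hypothetical flattened noise estimates from two conditional models.
eps_text = [1.0, 1.0, 1.0]
eps_sketch = [0.0, 0.0, 0.0]
blended = combine_noise_predictions([eps_text, eps_sketch], [3.0, 1.0])
```

In this sketch, raising one model's reliability pulls the blended estimate toward that model's prediction, which is one plausible way a sampling-time weight could trade off constraints from different modalities without retraining any model.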