ICCV 2023

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

Xingqian Xu, Zhangyang Wang, Eric J. Zhang, Kai Wang, Humphrey Shi

265 citations

Abstract

Code: https://github.com/SHI-Labs/Versatile-Diffusion

[Figure: sample texts, e.g. "A dream of a village in China, by Caspar David Friedrich, matte painting trending on artstation-HQ.", "Grand nebula in the universe.", "Houses on the lake with the trees."; panel labels: Semantic, Style.]

Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E 2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapidly changing landscape, recent approaches focus on extensions and performance rather than capacity, and thus require separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable cross-modal generality beyond images and text. Through extensive experiments, we demonstrate that VD successfully achieves the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics, dual- and multi-context blending, etc.;
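
The idea of "sharable and swappable layer modules" can be illustrated with a toy sketch. This is not the authors' implementation: the class `MultiFlowModel`, the `FLOWS` table, and the split into shared, data-specific, and context-specific layers are all assumptions made for illustration, since the abstract only states that layers are sharable and swappable. Each "layer" here simply records its name, so the routing for a given flow is visible.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical flow table: flow name -> (output modality, context modality).
FLOWS: Dict[str, Tuple[str, str]] = {
    "text-to-image": ("image", "text"),
    "image-to-text": ("text", "image"),
    "image-variation": ("image", "image"),
}


class MultiFlowModel:
    """Toy stand-in for a multi-flow network (illustrative only)."""

    def __init__(self) -> None:
        self.trace: List[str] = []
        # One layer reused by every flow (the "sharable" part).
        self.shared = self._layer("shared")
        # Layers selected per flow (the "swappable" part).
        self.data_layers = {m: self._layer(f"data:{m}") for m in ("image", "text")}
        self.context_layers = {m: self._layer(f"context:{m}") for m in ("image", "text")}

    def _layer(self, name: str) -> Callable[[List[float]], List[float]]:
        def apply(x: List[float]) -> List[float]:
            self.trace.append(name)  # record which module ran
            return x                 # identity: no real computation
        return apply

    def step(self, x: List[float], output_modality: str, context_modality: str) -> List[float]:
        self.trace = []
        x = self.shared(x)
        x = self.data_layers[output_modality](x)
        x = self.context_layers[context_modality](x)
        return x


def run_flow(model: MultiFlowModel, flow: str, x: List[float]) -> List[float]:
    """Route one step through the modules chosen by the flow name."""
    output_modality, context_modality = FLOWS[flow]
    return model.step(x, output_modality, context_modality)
```

Calling `run_flow(model, "text-to-image", [0.0])` and then inspecting `model.trace` shows the shared layer followed by the image data layer and the text context layer; switching the flow name swaps the latter two while the shared layer stays the same.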