CVPR 2023

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel

Abstract

[Figure 1 panels, labeled "Input Real Image" and "Input Generated Image", each shown with translations for the following target prompts: "A photo of a pink horse on the beach"; "A photo of a robot horse"; "a photo of a bronze horse in a museum"; "A cartoon of a couple dancing"; "a photo of robots dancing"; "A wooden sculpture of a couple dancing"; "A polygonal illustration of fish in the ocean"; "A photo of sharks in the ocean"; "A polygonal illustration of a cat and a bunny"; "A photo of bear cubs in the snow".]

Figure 1. Given a single real-world image as input, our framework enables versatile text-guided translations of the original content. Our results exhibit high fidelity to the input structure and scene layout, while significantly changing the perceived semantic meaning of objects and their appearance. Our method does not require any training, but rather harnesses the power of a pre-trained text-to-image diffusion model through its internal representation. We present new insights about deep features encoded in such models, and an effective framework to control the generation process through simple modification of these features.