CVPR 2024

VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu

Abstract

Project page: https://vchitect.github.io/VideoBooth-project/

[Figure: videos generated from image prompts paired with text prompts. Text prompts shown: "Portrait of a dog, looks out the car window."; "Cat is looking at a laptop."; "A horse eating grass."; "Dog walking in the green farm 4k"; "Elephant walk in the yellow grass of savannah"; "Elephant drinking water in masai mara reserve, kenya"; "Close up of cat on top of a vintage chair"; "Horse grazes in snowy meadow".]

* This work was done when Shuai Yang was in S-Lab, NTU.

[...] image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and is then propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized, high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework in which a single model works for a wide range of image prompts using only feed-forward passes.
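The injection step described above can be illustrated with a minimal NumPy sketch. It is one plausible reading of the abstract, not the authors' implementation. The function names (`attention`, `cross_frame_attention_with_prompt`), the tensor shapes, and the use of identity query/key/value projections are all assumptions made for illustration. The first frame attends to its own tokens plus the image-prompt tokens as extra keys and values, and the remaining frames then attend to the refined first frame.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention. q: (Nq, d); k, v: (Nk, d)."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


def cross_frame_attention_with_prompt(frames, prompt_tokens):
    """frames: (T, N, d) per-frame tokens; prompt_tokens: (M, d).

    Assumption: learned Q/K/V projections are replaced by the identity
    to keep the sketch short.
    """
    first = frames[0]
    # Step 1: append image-prompt tokens as additional keys and values
    # when updating the first frame.
    kv = np.concatenate([first, prompt_tokens], axis=0)
    refined_first, _ = attention(first, kv, kv)
    # Step 2: propagate the refined first frame to the remaining frames
    # by using it as their keys and values.
    outputs = [refined_first]
    for t in range(1, frames.shape[0]):
        out_t, _ = attention(frames[t], refined_first, refined_first)
        outputs.append(out_t)
    return np.stack(outputs)


rng = np.random.default_rng(0)
frames = rng.normal(size=(3, 4, 8))   # 3 frames, 4 tokens each, dim 8
prompt = rng.normal(size=(5, 8))      # 5 image-prompt tokens
out = cross_frame_attention_with_prompt(frames, prompt)
print(out.shape)  # (3, 4, 8)
```

In this sketch the image prompt touches only the first frame directly. Every later frame sees it indirectly through the refined first frame, which is how a single injection point can keep the subject's appearance consistent over time.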