AAAI 2024

ConditionVideo: Training-Free Condition-Guided Video Generation

Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao

7 citations

Abstract

Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, which leverages off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D ControlNet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method outperforms the compared methods in terms of frame consistency, CLIP score, and conditional accuracy.

Project website: https://pengbo807.github.io/conditionvideo-website/

* Work done as an intern at Shanghai AI Lab.
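
The abstract names sparse bi-directional spatial-temporal attention (sBiST-Attn) without describing its sampling scheme. As a rough illustration only, the sketch below assumes that "sparse" means a fixed stride and that "bi-directional" means reference frames are taken both before and after the current frame. The function name sparse_bidirectional_frames and the default stride of 3 are hypothetical and are not taken from the abstract.

```python
def sparse_bidirectional_frames(i, num_frames, stride=3):
    """Hypothetical helper: frame indices that frame i would attend to.

    Assumption (not stated in the abstract): reference frames are sampled
    every `stride` frames in both temporal directions around frame i.
    """
    if not 0 <= i < num_frames:
        raise ValueError("frame index out of range")
    if stride < 1:
        raise ValueError("stride must be >= 1")
    # Earlier frames: i - stride, i - 2*stride, ... listed in ascending order.
    backward = list(range(i % stride, i, stride))
    # Later frames: i + stride, i + 2*stride, ... up to the end of the clip.
    forward = list(range(i + stride, num_frames, stride))
    return backward + [i] + forward


# Frame 4 of an 8-frame clip attends to one past and one future frame.
print(sparse_bidirectional_frames(4, 8))  # [1, 4, 7]
```

Under this assumption, the keys and values for frame i would be gathered from the returned frames, so the cost of attention grows with the number of sampled frames rather than with the full clip length.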