CVPR2025

Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis

Tongtong Su, Chengyu Wang, Bingyan Liu, Jun Huang, Dongming Lu

Abstract

VideoCrafter-2.0+Ours EVS "Iron Man flying in the sky" "A cat running happily" "A car turning a corner" * Co-corresponding authors. ing steps. Meanwhile, we employ T2V backbones to ensure consistent motion dynamics. By encapsulating the T2V temporal-only prior into the T2I generation process, EVS successfully leverages the strengths of both types of models, resulting in videos of improved imaging and motion quality. Experimental results validate the effectiveness of our approach compared to previous approaches. Our composition process also leads to a significant improvement of 1.6x-4.5x speedup in inference time. 1 This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.