CVPR 2025

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

Abstract

CPII under InnoHK

[Figure 1: comparison of Vanilla and ByTheWay outputs. Example prompts: "A dog, playing on the grass, soft lightening, high quality, ..." and "A panda plays in the snow covered forest, 4k resolution, film grain, ...". Panel titles: "Enhancing structural plausibility and temporal consistency" and "Enriching motion patterns and amplifying the motion magnitude".]

Figure 1. Unlock the potential of pretrained text-to-video (T2V) generation models in a training-free approach. (1) ByTheWay helps to enhance structural plausibility and temporal consistency in generated videos, significantly reducing artifacts and flickering. (2) ByTheWay contributes to enriching motion patterns and amplifying the motion magnitude in generated videos. Further, ByTheWay can be seamlessly integrated into various powerful T2V backbones (e.g., AnimateDiff [12] and VideoCrafter2 [6]) in a plug-and-play manner, serving as a highly extensible module without introducing additional parameters or sampling cost.