CVPR 2025

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Qiyao Xue, Xiangyu Yin, Boyuan Yang, Wei Gao

Abstract

Figure 1. Left: videos generated by the current text-to-video generation model (CogVideoX-5B [50]) cannot adhere to the real-world physical rules (described in brackets following the user prompt). Right: our method PhyT2V, when applied to the same model, better reflects the real-world physical knowledge.

Panel labels (left vs. right): Multiple apples, bouncing vs. Single apple, no bouncing; Drawing content disappears vs. Drawing with causality; No water splashing vs. Water splashing; Calm water vs. Flooding river; No tumbling rock vs. Rock tumbling; No tea filling and steam of hot tea vs. Tea is filling the cup with steam.