ACL 2025
TC-Bench: Benchmarking Temporal Compositionality in Conditional Video Generation
Weixi Feng, Jiachen Li, Michael Saxon, Tsu-Jui Fu, Wenhu Chen, William Yang Wang
Abstract
Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this work, we evaluate the emergence of new concepts and relation transitions as time progresses in a video, which we refer to as Temporal Compositionality. We propose TC-Bench, a benchmark of meticulously crafted text prompts, ground truth videos, and new evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development. In addition, by collecting corresponding ground-truth videos, the benchmark can be used for text-to-video and image-to-video generation. We develop new metrics to measure the completeness of component transitions, which demonstrate significantly higher correlations with human judgments than existing metrics. Our experiments reveal that contemporary video generators are still weak in prompt understanding and achieve less than 20% of the compositional changes, highlighting enormous improvement space. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.
Code & data: https://github.com/weixi-feng/tc-bench
[Figure: two videos drawn as volumes with axes height ℎ, width 𝑤, and time 𝑡, each flattened to ℎ×𝑤 against 𝑡. First panel, "A horse running on the beach.": no or ambiguous temporal compositionality; no vertical "edges", no compositional change across time. Second panel, "An orange chameleon turns pink, … from left to right": specific temporal compositionality; vertical "edges" mark the disappearance and emergence of concepts, and a shift of horizontal edges marks an object position change.]