ICLR 2026

PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

Cited by 14

Abstract

Figure 2: Success rates of video generation models on PhyWorldBench. Among open-source models, Wanx performs best; among proprietary models, Pika performs best, with a success rate of 0.262. Despite these advances, substantial progress is still needed before these models can accurately simulate the intricate dynamics of the real world.

[...] physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.

INTRODUCTION

The field of video generation has made remarkable progress, with models producing visually compelling and often photorealistic outputs. These advances have enabled transformative applications across industries such as entertainment, education, and scientific visualization. However, despite their visual fidelity, do video generation models truly understand the laws of physics in the real world?

To answer this question, we introduce PhyWorldBench, a rigorous benchmark designed to evaluate how well video generation models can simulate real-world physics. As illustrated in Figure 1, PhyWorldBench systematically tests models across multiple levels of physical phenomena, from fundamental concepts like object motion to complex dynamics, including rigid body interactions and human/animal motion.

Additionally, we propose a novel Anti-Physics category, where prompts deliberately violate real-world physics. On one hand, this design verifies whether models genuinely understand physical laws rather than merely reproducing patterns from real-world training data. On the other hand, anti-physics content itself holds practical value in creative applications, where imaginative or otherwise impossible scenarios are beneficial.

We meticulously designed and annotated 1,050 prompts, together with an individually annotated set of standards for each prompt, to cover a broad range of physical scenarios.
This substantial annotation work ensures that our benchmark is both comprehensive and precise, allowing for a more thorough assessment of video generation models' capabilities. Furthermore, we present a context-aware-prompt metric using an MLLM (OpenAI Team, 2024; Gemini Team, 2024), which directly assesses whether the video satisfies the physics standards. This evaluation not only provides an unbiased metric but also significantly reduces the evaluation cost. To examine the current status of video generation models and provide a detailed analysis, we selected five proprietary models-