CVPR 2024

On the Content Bias in Fréchet Video Distance

Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang

Abstract

Figure 1. FVD is biased more toward per-frame quality than toward temporal consistency. Panels: (a) Reference Videos; (b) Medium Spatial & No Temporal Corruption, FVD=317.10; (c) Small Spatial & Severe Temporal Corruption, FVD=310.52.

FVD [72], a commonly used video generation evaluation metric, should ideally capture both spatial and temporal aspects. However, our experiments reveal a strong bias toward individual frame quality. (b) First, we apply mild spatial distortions through local warping, which results in an FVD score of 317.10. (c) Next, we induce slightly weaker spatial corruption but severe temporal inconsistency by altering each frame differently. These changes create artifacts that are noticeable to humans and evident in the spatiotemporal x-t slice shown in the bottom row, yet they surprisingly lead to a lower FVD score of 310.52. This discrepancy highlights the metric's bias toward individual frame quality. We encourage readers to view the videos with Acrobat Reader or visit our website to observe the inconsistencies.
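The contrast between panels (b) and (c) can be illustrated with a toy sketch. It is not the paper's corruption pipeline: a cyclic horizontal shift stands in for local warping, and the helper names (`shift_frame`, `corrupt_consistent`, `corrupt_per_frame`, `temporal_flicker`) are hypothetical, introduced only for this example. The point is that the same distortion applied to every frame leaves frame-to-frame differences untouched, while a distortion that changes per frame introduces flicker even when each individual shift is smaller.

```python
def shift_frame(frame, dx):
    """Cyclically shift each row of a 2D frame by dx pixels.

    Assumption: a crude stand-in for the local warping described in the caption.
    """
    width = len(frame[0])
    return [[row[(x - dx) % width] for x in range(width)] for row in frame]


def corrupt_consistent(video, dx):
    """Apply the same shift to every frame: spatial corruption only, as in (b)."""
    return [shift_frame(frame, dx) for frame in video]


def corrupt_per_frame(video, shifts):
    """Apply a different shift to each frame: temporal inconsistency, as in (c)."""
    return [shift_frame(frame, dx) for frame, dx in zip(video, shifts)]


def temporal_flicker(video):
    """Mean absolute pixel difference between consecutive frames."""
    total, count = 0.0, 0
    for prev, curr in zip(video, video[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += abs(a - b)
                count += 1
    return total / count


# Toy static "video": 8 identical frames, each a 4x16 horizontal gradient.
frame = [[x for x in range(16)] for _ in range(4)]
video = [frame for _ in range(8)]

# (b)-like: a larger shift (3 px), identical across all frames.
same = corrupt_consistent(video, 3)

# (c)-like: a smaller shift (2 px) whose direction alternates every frame.
shifts = [2 if t % 2 == 0 else -2 for t in range(len(video))]
diff = corrupt_per_frame(video, shifts)

print("reference flicker: ", temporal_flicker(video))
print("consistent flicker:", temporal_flicker(same))
print("per-frame flicker: ", temporal_flicker(diff))
```

In this sketch the consistently shifted video stays temporally smooth, while the per-frame variant flickers despite its smaller per-frame distortion. A metric that weighs mostly per-frame appearance could therefore rank the second video as no worse than the first, which is the kind of behavior the figure reports for FVD.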