ICLR2025

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Distribution Quality

Ge Ya Luo, Gian Mario Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal

Abstract

The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D ConvNet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD's reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture, measured using Maximum Mean Discrepancy with a polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average. Project page: https://oooolga.github.io/JEDi.github.io/

Published as a conference paper at ICLR 2025

2. JEDi significantly reduces the number of samples needed to make an accurate estimate by using an MMD metric in a V-JEPA feature space, enabling reliable use on smaller datasets that do not meet FVD's sample-size requirement.

3. JEDi leverages the robust representations of a V-JEPA model, which are found to align more closely with human evaluations than FVD.

2 BACKGROUND AND NOTATIONS

2.1 VIDEO FEATURE REPRESENTATION

Inflated 3D ConvNet: The Inflated 3D ConvNet (I3D) (Carreira & Zisserman, 2018) is a convolutional neural network model based on the pre-trained Inception-v1. It extends the 2D convolutional filters to 3D by replicating them along the temporal dimension.
I3D, pre-trained on Kinetics, has demonstrated excellent classification performance on UCF-101 (Soomro et al., 2012), HMDB-51 (Kuehne et al., 2011), and Kinetics datasets (Kay et al., 2017), proving to be a valuable network for video recognition tasks. The original FVD work by Unterthiner et al. (2019) explores the use of I3D features trained on the Kinetics datasets. They analyze the features from the logits layer, as well as the features from the last pooling layer trained on the Kinetics-400 and Kinetics-600 datasets. Their experiments suggest that the features from the logits layer trained on the Kinetics-400 dataset are the most suitable for the FVD metric.
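
The filter inflation described above can be illustrated with a minimal NumPy sketch. It is not the I3D implementation: the function name `inflate_2d_filter` is hypothetical, the example uses a single filter with no channel dimensions, and the 1/t rescaling follows the original I3D paper rather than anything stated in this text. With that rescaling, the inflated filter gives the same response on a static clip (one frame repeated over time) as the 2D filter gives on that frame.

```python
import numpy as np


def inflate_2d_filter(w2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate a (kh, kw) filter to (t, kh, kw).

    The 2D weights are replicated t times along a new temporal axis and
    divided by t (assumed rescaling, as in the original I3D paper).
    """
    return np.repeat(w2d[None, :, :], t, axis=0) / t


rng = np.random.default_rng(0)
w2d = rng.standard_normal((3, 3))    # a pretrained 2D filter (illustrative values)
patch = rng.standard_normal((3, 3))  # one image patch under the filter

w3d = inflate_2d_filter(w2d, t=3)

# A "static" clip: the same patch repeated over 3 frames.
static_clip = np.repeat(patch[None, :, :], 3, axis=0)

# Filter responses at a single spatial (and temporal) location.
resp_2d = float(np.sum(w2d * patch))
resp_3d = float(np.sum(w3d * static_clip))

print(np.isclose(resp_2d, resp_3d))
```

Because the inflated weights are constant along time at initialization, any temporal sensitivity has to be learned during the subsequent video training.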