CVPR 2023

MAGVIT: Masked Generative Video Transformer

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

Abstract

[Figure 1 panels, values as plotted]
(a) Quality.
UCF-101 CG FVD↓: MAGVIT 76 (-77%), TATS (prior best) 332, CCVS 386.
UCF-101 CG IS↑: MAGVIT 89.3 (+13%), TATS (prior best) 79.3, Video Diffusion 57.0.
BAIR FP FVD↓: MAGVIT 62 (-26%), RaMViD (prior best) 84, NÜWA 87.
Kinetics-600 FP FVD↓: MAGVIT 9.9 (-39%), Video Diffusion (prior best) 16.2, RaMViD 16.5.
(b) Efficiency. Estimated relative inference runtime: 60× and 250×. Inference throughput at 128×128 native resolution: MAGVIT-B 37 fps on 1× V100 GPU; MAGVIT-L 65 fps on 1× TPU v4i.
(c) Flexibility. 10 tasks in one model: Class-conditional Generation (CG), Frame Prediction (FP), Frame Interpolation, Outpainting, Inpainting, and other tasks … (sample label shown: "Squeezing Something").

Figure 1. Overview of the video generation quality, efficiency, and flexibility of the proposed MAGVIT model. (a) MAGVIT achieves state-of-the-art FVD [61] and Inception Score (IS) [49] on two video generation tasks and three benchmarks, in comparison with the prior best diffusion models (RaMViD [35], Video Diffusion [33]) and autoregressive models (CCVS [41], TATS [21], NÜWA [70]). (b) It is two orders of magnitude faster than diffusion models and 60× faster than autoregressive models. (c) A single MAGVIT model accommodates different generation tasks, ranging from class-conditional generation to dynamic inpainting of a moving object.
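The percentage annotations in Figure 1(a) appear to be relative changes of the MAGVIT score against the prior best method on each benchmark. The following is a minimal sketch under that assumption; the helper name `relative_change` and the dictionary layout are illustrative and not from the paper, and the input values are the ones read off the figure. For FVD (lower is better) a negative change is an improvement; for IS (higher is better) a positive change is.

```python
def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of `new` versus `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (MAGVIT value, prior-best value) per benchmark, as read off Figure 1(a).
# Assumption: the figure's percentages are measured against the prior best.
results = {
    "UCF-101 CG FVD": (76.0, 332.0),      # prior best: TATS
    "UCF-101 CG IS": (89.3, 79.3),        # prior best: TATS
    "BAIR FP FVD": (62.0, 84.0),          # prior best: RaMViD
    "Kinetics-600 FP FVD": (9.9, 16.2),   # prior best: Video Diffusion
}

# Round to whole percentages, matching the precision shown in the figure.
changes = {
    name: round(relative_change(magvit, prior))
    for name, (magvit, prior) in results.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+d}%")
```

Under this reading, the rounded values agree with the figure's annotations (-77%, +13%, -26%, -39%).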