CVPR 2023

OmniMAE: Single Model Masked Pretraining on Images and Videos

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Abstract

Figure 1. OmniMAE is a single model for images and videos that is trained using masked autoencoding [40]. We use a plain Vision Transformer [24] architecture, but with spatio-temporal patches as input. During training, we 'patchify' the visual input (images or videos) and feed the encoder only a subset of the patches. The decoder then reconstructs the pixels of the missing patches from the encoder's output. The encoder-decoder model is trained with a pixel reconstruction loss. After training, our single plain Transformer encoder performs competitively with specialized architectures on downstream image and video recognition tasks.
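The data flow described in the caption (patchify, mask, reconstruct, score) can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the helper names `patchify`, `random_mask`, and `masked_mse`, the toy input sizes, the patch size, and the masking ratio of 0.75 are all hypothetical choices made for illustration. The Transformer encoder and decoder are replaced by a trivial stand-in predictor, and computing the loss only on masked patches is an assumption that follows the caption's wording.

```python
import random


def patchify(video, pt, ph, pw):
    """Split a video (nested list indexed [t][y][x]) into flattened patches.

    Each patch covers pt frames x ph rows x pw columns and is returned as a
    flat list of pixel values. T, H, W must be divisible by pt, ph, pw.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Return (visible_ids, masked_ids): a random split of patch indices."""
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_masked = int(num_patches * mask_ratio)
    return sorted(ids[num_masked:]), sorted(ids[:num_masked])


def masked_mse(pred_patches, target_patches, masked_ids):
    """Mean squared pixel error, averaged over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_ids:
        for p, q in zip(pred_patches[i], target_patches[i]):
            total += (p - q) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Toy 4-frame, 8x8 single-channel "video" with random pixel values.
video = [[[rng.random() for _ in range(8)] for _ in range(8)] for _ in range(4)]
# An image can go through the same code as a very short clip; repeating the
# frame to fill the temporal patch extent is an assumption of this sketch.
image = [[[rng.random() for _ in range(8)] for _ in range(8)]] * 2

patches = patchify(video, pt=2, ph=4, pw=4)  # 2 * 2 * 2 = 8 patches, 32 pixels each
visible_ids, masked_ids = random_mask(len(patches), mask_ratio=0.75, rng=rng)
encoder_input = [patches[i] for i in visible_ids]  # the encoder sees only these

# Stand-in for encoder + decoder: predict the mean visible pixel everywhere.
mean_visible = sum(map(sum, encoder_input)) / sum(map(len, encoder_input))
prediction = [[mean_visible] * len(p) for p in patches]
loss = masked_mse(prediction, patches, masked_ids)
```

Because images and videos are reduced to the same list of flattened patches, the same encoder input format serves both modalities, which is the property the caption attributes to using spatio-temporal patches with a plain Vision Transformer.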