ICCV2023

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang

67 citations

Abstract

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% 50% of the input embeddings. (ii) In order to construct multi-scale representations for object detection from single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained patchify stem, and its intermediate features can naturally serve as the higher resolution inputs of a feature pyramid network without further upsampling or other manipulations. While the pre-trained ViT is only regarded as the 3rd-stage of our detector’s backbone instead of the whole feature extractor. This naturally results in a ConvNet-ViT hybrid architecture. The proposed detector, named MimDet, enables a MIM pre-trained vanilla ViT to outperform leading hierarchical architectures such as Swin Transformer, MViTv2 and ConvNeXt on COCO object detection & instance segmentation, and achieves better results compared with the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8× faster. Code and pre-trained models are available at https://github.com/hustvl/MIMDet.