CVPR2023

Token Boosting for Robust Self-Supervised Visual Transformer Pre-training

Tianjiao Li, Lin Geng Foo, Ping Hu, Xindi Shang, Hossein Rahmani, Zehuan Yuan, Jun Liu

Abstract

To show qualitatively the improvements from using TBM, we visualize the image reconstruction quality during pre-training. In Fig. 1, we plot the original corrupted image and the images reconstructed by ViT-Huge with the decoder, with and without TBM. Without TBM, reconstructing the masked parts of the image is challenging, and the reconstruction looks blurred and inaccurate. This is because the corruptions in the unmasked parts of the image make it difficult to predict the masked parts. With TBM, the quality of the reconstructed images improves visibly: the image looks sharper and less blurred, and many of the corruptions have been smoothed out.

Next, in Fig. 2 we visualize the effects of corruptions on the extracted features. On the left, we show the features produced when the input image is clean. When the same input is perturbed with a corruption and fed to the baseline ViT-Huge, the features change significantly and observably, showing that they are not very robust to added corruptions. When TBM is applied, however, the features change only minimally for the same corrupted image and look similar to the features obtained in the clean setting. This shows that TBM helps make the output features of VTs more robust against input corruptions.

Investigation of the impact of the number of layers using TBM modules. First, we train a single model to deal with an individual type of corruption in a self-supervised manner, i.e., we train multiple models (with ViT-Huge as the encoder), one for each type of corruption, and report the average performance of the models over all corruption types in the top row of Fig. 3. We find that, after adding our TBM module to a single layer,

† Equal contribution. ‡ Corresponding author.

[Figure panel labels: Corrupted; ViT-Huge; ViT-Huge + TBM]
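The evaluation protocol described above (one model per corruption type, with performance averaged over all corruption types) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `average_over_corruptions`, the corruption names, and the scores are all hypothetical placeholders rather than values from the paper.

```python
from statistics import mean


def average_over_corruptions(per_corruption_score):
    """Average a metric over corruption types.

    `per_corruption_score` maps a corruption name to the score of the
    model trained and evaluated on that corruption (hypothetical layout).
    """
    if not per_corruption_score:
        raise ValueError("need at least one corruption type")
    return mean(per_corruption_score.values())


# Hypothetical placeholder scores, NOT results from the paper.
scores = {"gaussian_noise": 0.5, "motion_blur": 0.75, "fog": 0.25}
print(average_over_corruptions(scores))
```

Each entry stands for a separately trained model, so the average weights every corruption type equally regardless of how many images it contains.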