CVPR 2023

Fine-tuned CLIP Models are Efficient Video Learners

Hanoona Abdul Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman H. Khan, Fahad Shahbaz Khan

Abstract

HMDB51 zero-shot accuracy (∆ over XCLIP): CLIP text-side tuned 48.5 (∆=+3.9); CLIP image-side tuned 49.0 (∆=+4.4); ViFi-CLIP 51.3 (∆=+6.7); XCLIP (ECCV'22) 44.6.

UCF101 zero-shot accuracy (∆ over XCLIP): Vanilla CLIP (ICML'21) 63.2 (∆=-8.8); CLIP text-side tuned 69.8 (∆=-2.2); CLIP image-side tuned 72.9 (∆=+0.9); ViFi-CLIP 76.8 (∆=+4.8); XCLIP (ECCV'22) 72.0.

Figure 1. This work explores the capability of a simple baseline called ViFi-CLIP (Video Finetuned CLIP) for adapting image-pretrained CLIP [33] to the video domain. The figure compares the zero-shot performance of vanilla CLIP and several of its variants adapted for videos (trained on Kinetics-400, evaluated on UCF-101 and HMDB-51). The t-SNE visualizations of video embeddings obtained from ViFi-CLIP (4th col.) are compared with embeddings from vanilla CLIP [33] (1st col.), CLIP with only the text encoder tuned on videos (2nd col.), CLIP with only the image encoder tuned on videos (3rd col.), and the recent state-of-the-art work XCLIP [30] (last col.); ∆ represents the difference over XCLIP. The embeddings of ViFi-CLIP are more separable, indicating that simple fine-tuning of CLIP is sufficient to learn suitable video-specific inductive biases and can perform competitively with more complex approaches that have dedicated components designed to model temporal information in videos.
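As a quick sanity check on the numbers in Figure 1, the following minimal sketch recomputes each reported difference as a model's zero-shot accuracy minus XCLIP's accuracy on the same dataset. The accuracies are copied from the figure; the names `ACCURACY` and `delta_over_xclip` are hypothetical helpers introduced only for illustration, and the HMDB51 entry for vanilla CLIP is left out because its value does not appear in this text.

```python
# Zero-shot accuracies (%) as listed in Figure 1.
# Assumption: the dictionary layout and names below are illustrative only.
ACCURACY = {
    "HMDB51": {
        "CLIP text-side tuned": 48.5,
        "CLIP image-side tuned": 49.0,
        "ViFi-CLIP": 51.3,
        "XCLIP": 44.6,
    },
    "UCF101": {
        "Vanilla CLIP": 63.2,
        "CLIP text-side tuned": 69.8,
        "CLIP image-side tuned": 72.9,
        "ViFi-CLIP": 76.8,
        "XCLIP": 72.0,
    },
}


def delta_over_xclip(dataset: str) -> dict[str, float]:
    """Return each model's accuracy difference relative to XCLIP on `dataset`."""
    scores = ACCURACY[dataset]
    baseline = scores["XCLIP"]
    # Round to one decimal place to match the precision used in the figure.
    return {
        model: round(acc - baseline, 1)
        for model, acc in scores.items()
        if model != "XCLIP"
    }


if __name__ == "__main__":
    for dataset in ACCURACY:
        for model, delta in delta_over_xclip(dataset).items():
            print(f"{dataset:7s} {model:22s} delta = {delta:+.1f}")
```

The recomputed differences can be compared against the values printed next to each accuracy in the figure; a positive value means the variant outperforms XCLIP in the zero-shot setting.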