CVPR 2022

SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos

Salar Hosseini Khorasgani, Yuxuan Chen, Florian Shkurti


Abstract

Self-supervised methods have significantly closed the gap with end-to-end supervised learning for image classification [13], [24]. In the case of human action videos, however, where both appearance and motion are significant factors of variation, this gap remains significant [28], [58]. One of the key reasons for this is that sampling pairs of similar video clips, a required step for many self-supervised contrastive learning methods, is currently done conservatively to avoid false positives. A typical assumption is that similar clips only occur temporally close within a single video, leading to insufficient examples of motion similarity. To mitigate this, we propose SLIC, a clustering-based self-supervised contrastive learning method for human action videos. Our key contribution is that we improve upon the traditional intra-video positive sampling by using iterative clustering to group similar video instances. This enables our method to leverage pseudo-labels from the cluster assignments to sample harder positives and negatives. SLIC outperforms state-of-the-art video retrieval baselines by +15.4% on top-1 recall on UCF101 and by +5.7% when directly transferred to HMDB51. With end-to-end finetuning for action classification, SLIC achieves 83.2% top-1 accuracy (+0.8%) on UCF101 and 54.5% on HMDB51 (+1.6%). SLIC is also competitive with the state-of-the-art in action classification after self-supervised pretraining on Kinetics400.
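The idea of sampling positives and negatives from cluster pseudo-labels can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means is assumed here, and the function names `kmeans` and `sample_triplets` are hypothetical. The sketch shows a single clustering round on fixed embeddings; in an iterative scheme the embeddings would presumably be recomputed by the encoder between rounds.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means, used as an assumed stand-in for the clustering step."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (fancy indexing copies).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every point to every center: (n, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

def sample_triplets(labels, n, seed=0):
    """Draw up to n (anchor, positive, negative) index triplets.

    Positives share the anchor's pseudo-label (so they may come from a
    different video); negatives come from any other cluster.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(len(labels))
    triplets = []
    for _ in range(10 * n):  # bounded number of attempts
        if len(triplets) == n:
            break
        a = int(rng.integers(len(labels)))
        pos_pool = idx[(labels == labels[a]) & (idx != a)]
        neg_pool = idx[labels != labels[a]]
        if len(pos_pool) == 0 or len(neg_pool) == 0:
            continue  # anchor has no valid partner; try another
        triplets.append((a, int(rng.choice(pos_pool)), int(rng.choice(neg_pool))))
    return triplets
```

Compared with intra-video sampling, where a positive must be a temporally nearby clip of the same video, the positive pool here is every other instance in the anchor's cluster, which is what allows similar motions from different videos to be paired.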