CVPR 2024

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

Abstract

Figure 1. VideoCutLER is a simple method for unsupervised video instance segmentation (UnVIS). We show the first competitive unsupervised results on the challenging YouTubeVIS benchmark. Moreover, unlike most prior approaches, we demonstrate that UnVIS models can be learned without relying on natural videos or optical flow estimates. Row 1: We propose VideoCutLER, a simple cut-synthesis-and-learn pipeline with three main steps. First, we generate pseudo-masks for multiple objects in an image using MaskCut [35]. Then, we convert a random pair of images in the minibatch into a video with corresponding pseudo-mask trajectories using ImageCut2Video. Finally, we train an unsupervised video instance segmentation model on these mask trajectories. Row 2: Despite being trained only on unlabeled images, at inference time VideoCutLER can be applied directly to unseen videos, where it segments and tracks multiple instances across time (Fig. 1a), including small objects (Fig. 1b), objects that are absent in specific frames (Fig. 1c), and instances with high overlap (Fig. 1d). Column 2: Our method surpasses the previous SOTA method OCLR [37] by a factor of 10 in terms of class-agnostic $\text{AP}^{\text{video}}_{50}$.
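To make the synthesis step more concrete, the following is a minimal NumPy sketch of how a pair of images with pseudo-masks could be turned into a short video with per-instance mask trajectories. The caption only states that a random image pair is converted into a video with pseudo-mask trajectories. The mechanism used here (pasting the instances of one image onto the other with a random translation per frame) and all names (`shift2d`, `image_pair_to_video`, `max_shift`) are assumptions made for illustration, not the authors' ImageCut2Video implementation.

```python
import numpy as np


def shift2d(arr, dy, dx):
    """Translate an (H, W) or (H, W, C) array by (dy, dx), filling with zeros."""
    out = np.zeros_like(arr)
    h, w = arr.shape[:2]
    ys, ye = max(dy, 0), min(h + dy, h)
    xs, xe = max(dx, 0), min(w + dx, w)
    out[ys:ye, xs:xe] = arr[ys - dy:ye - dy, xs - dx:xe - dx]
    return out


def image_pair_to_video(target_img, target_masks, source_img, source_masks,
                        num_frames=2, max_shift=4, seed=0):
    """Hypothetical synthesis step (assumed mechanism, see lead-in).

    target_img, source_img: (H, W, 3) uint8 arrays of the same size.
    target_masks, source_masks: non-empty lists of (H, W) boolean pseudo-masks.
    Returns a video of shape (T, H, W, 3) and trajectories of shape (N, T, H, W),
    where N = len(target_masks) + len(source_masks).
    """
    rng = np.random.default_rng(seed)
    frames, per_frame_masks = [], []
    for _ in range(num_frames):
        # Assumption: one random translation per frame for all pasted instances.
        dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
        shifted_src = shift2d(source_img, dy, dx)
        pasted = [shift2d(m, dy, dx) for m in source_masks]

        frame = target_img.copy()
        occluded = np.zeros(target_img.shape[:2], dtype=bool)
        for m in pasted:
            frame[m] = shifted_src[m]  # paste source pixels onto the target
            occluded |= m

        # Target instances keep their position but lose any occluded pixels.
        visible_target = [m & ~occluded for m in target_masks]
        frames.append(frame)
        per_frame_masks.append(visible_target + pasted)

    video = np.stack(frames)
    trajectories = np.stack([np.stack(ms) for ms in per_frame_masks], axis=1)
    return video, trajectories
```

Each row of `trajectories` follows one instance across all frames, which is the kind of supervision a video instance segmentation model needs. A real pipeline would likely add further augmentations such as rescaling, which this sketch omits.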