CVPR 2024

Audio-Visual Segmentation via Unlabeled Frame Exploitation

Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang

Abstract

Figure 1. Comparison between previous methods and ours in how unlabeled frames are used. Panels: (a) Previous methods (w/ GTM); (b) Our proposed method (Ours); (c) Performance comparison. (a) Previous methods apply global temporal modeling (GTM) to all frames of a sequence, labeled and unlabeled alike, but do not exploit the unlabeled frames. (b) Our method uses two types of unlabeled frames: (i) neighboring frames (NFs), which provide motion cues for accurately segmenting the sounding object; (ii) distant frames (DFs), which contain semantic cues that enhance data diversity. (c) With TPAVI as the base method, and relative to a model trained only on labeled frames (w/o GTM), previous methods using global temporal modeling (w/ GTM) show only marginal performance gains, whereas our method achieves significant improvement with the unlabeled frames.
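To make the distinction between neighboring and distant frames concrete, the following is a minimal sketch of how the unlabeled frames of one clip could be partitioned around a labeled frame. It is an illustration under stated assumptions, not the authors' implementation: the function name `split_unlabeled_frames` and the `radius` parameter are hypothetical, and the caption does not say how "neighboring" is actually defined or how each group is used during training.

```python
from typing import List, Tuple


def split_unlabeled_frames(
    num_frames: int, labeled_idx: int, radius: int = 1
) -> Tuple[List[int], List[int]]:
    """Partition the unlabeled frames of a clip by temporal distance.

    Assumption (not from the paper): a frame counts as "neighboring" when
    it lies within `radius` frames of the labeled frame; every other
    unlabeled frame counts as "distant".
    """
    if not 0 <= labeled_idx < num_frames:
        raise ValueError("labeled_idx must index a frame of the clip")
    if radius < 0:
        raise ValueError("radius must be non-negative")

    neighboring: List[int] = []  # candidates for motion cues
    distant: List[int] = []      # candidates for semantic cues
    for t in range(num_frames):
        if t == labeled_idx:
            continue  # the labeled frame itself is supervised by ground truth
        if abs(t - labeled_idx) <= radius:
            neighboring.append(t)
        else:
            distant.append(t)
    return neighboring, distant


# Example: a 5-frame clip whose first frame carries the annotation.
nfs, dfs = split_unlabeled_frames(num_frames=5, labeled_idx=0, radius=1)
print(nfs, dfs)
```

With a larger `radius`, more frames would be treated as neighboring, so this single hypothetical parameter controls the split between the two groups.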