CVPR2024

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

Yuanhong Chen, Yuyuan Liu, Hu Wang, Fengbei Liu, Chong Wang, Helen Frazer, Gustavo Carneiro

Abstract

We show the distribution of visual classes in VPO-SS, VPO-MS and VPO-MSMI in Figure 1. Similar to AVSBench-Semantics [29], we also observe a data imbalance issue within our VPO dataset. We follow [25] to report an imbalance ratio ($N_{\max}/N_{\min}$) of 12.48% (female & zebra), 12.43% (female & zebra) and 12.62% (female & cow) on the three VPO subsets, and 59.57% (man & axe & missilerocket) on AVSBench-Semantics [29]. These class imbalance issues can affect model performance during testing, which will be discussed in Sec. 3.3. For a demonstration of training examples, please refer to the "video demo.mp4" file within the supplementary materials.

Creation Procedure

We show a graphical illustration of our Visual Postproduction (VPO) benchmark in Fig. 2. We divide the entire dataset generation process into three major steps:

• Data collection: We gather data from off-the-shelf segmentation datasets (e.g., COCO [15]) and audio datasets (e.g., VGGSound [2]), focusing on the overlapping classes listed in Tab. 1. We randomly match audio and video files to form new samples based on their semantic labels.

• Data processing: We prioritise the collection of images with multiple objects and incorporate spatial location information based on each selected instance mask.

• Subset creation: We organize subsets according to their keywords (e.g., single-source, multi-sources, multi-instances) and further partition each subset into training and testing sets.
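
The label-based matching in the data collection step, and the imbalance ratio $N_{\max}/N_{\min}$ reported above, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `match_by_label` and `imbalance_ratio`, the `(file_name, label)` tuple layout, the toy file names, and the simplification that each image carries a single class label are all assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def match_by_label(images, audios, seed=0):
    """Pair each image with a random audio clip that shares its class label.

    `images` and `audios` are lists of (file_name, label) tuples (a
    hypothetical layout). Only labels present in both lists, i.e. the
    overlapping classes, produce samples.
    """
    rng = random.Random(seed)

    # Index the audio clips by their semantic label.
    audio_by_label = defaultdict(list)
    for name, label in audios:
        audio_by_label[label].append(name)

    samples = []
    for name, label in images:
        candidates = audio_by_label.get(label)
        if candidates:  # skip classes that have no audio clip
            samples.append((name, rng.choice(candidates), label))
    return samples


def imbalance_ratio(labels):
    """Return N_max / N_min over the per-class sample counts."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy data (hypothetical file names); "zebra" has no audio and is dropped.
images = [
    ("img_0.jpg", "dog"),
    ("img_1.jpg", "dog"),
    ("img_2.jpg", "cat"),
    ("img_3.jpg", "zebra"),
]
audios = [
    ("dog_bark.wav", "dog"),
    ("cat_meow.wav", "cat"),
    ("cat_purr.wav", "cat"),
]

samples = match_by_label(images, audios)
ratio = imbalance_ratio([label for _, _, label in samples])
```

In this toy run, two "dog" samples and one "cat" sample are formed, so the ratio of the most frequent to the least frequent class is 2.0. Fixing the seed keeps the random pairing reproducible, which matters if the generated samples are later split into training and testing sets.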