CVPR 2025

EntitySAM: Segment Everything in Video

Mingqiao Ye, Seoung Wug Oh, Lei Ke, Joon-Young Lee

Abstract

Figure 1. Zero-shot video entity segmentation performance on the VIPEntitySeg dataset for models trained on COCO. We compare: 1) SAM 2 [39] using Mask2Former [7] mask prompts for the initial frame, 2) Mask2Former with DEVA [11] association, and 3) our proposed EntitySAM. EntitySAM enhances SAM 2 by automatically segmenting and tracking novel entities without requiring user-specified prompts, and it outperforms existing state-of-the-art zero-shot tracking methods.