CVPR 2025

SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation

Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta

Abstract

[Figure 1 image. Labels recovered from the figure: Offline; Require entire video; Clip-based; + Efficient processing; - No global context; Context propagation. Query: 'The cyclist in red that overtakes others on the left'. Annotations: TARGET: absent; MATCH: red shirt; LEARNABLE CORRECTION: detected object switch; MOTION REASONING: action that unfolds over multiple frames; TRACKING: occlusion handling.]

Figure 1. SAMWISE. Our approach infuses knowledge of natural language into the Segment-Anything 2 (SAM2) model and adds explicit temporal cues to its feature extraction, targeting streaming-based Referring Video Object Segmentation (RVOS). We use a learnable mechanism to mitigate the so-called tracking bias, i.e., SAM2's tendency to overlook the correct object once it becomes identifiable because it is already tracking a different one. Our design enables effective streaming processing for RVOS, exploiting the memory from previous frames to propagate past context. The figure shows an example in which the target object is not present in the first frame, leading SAM2 to start tracking the wrong object. Later, when the correct object appears, our learnable correction mechanism guides SAM2 to switch its tracking focus. Because its features encode the notion of temporal evolution, the model is able to recognize that the new object is better aligned with the provided textual query. Finally, we exploit SAM2's tracking skills and robustness to occlusions to keep following the object.
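The behaviour described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation and does not use the SAM2 API: the learnable correction is replaced here by a fixed, hypothetical `switch_margin`, the per-object text-alignment scores are made-up inputs, and every name (`StreamingTracker`, `run_stream`, the object identifiers) is an assumption introduced only for illustration. It shows clips processed one at a time, a tracked identity carried over between clips, and a switch when a newly visible object matches the query clearly better.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StreamingTracker:
    """Toy stand-in for a memory-based tracker (hypothetical, not SAM2)."""

    # Assumption: a fixed margin replaces the learnable correction module.
    switch_margin: float = 0.2
    tracked_id: Optional[str] = None  # identity carried across clips

    def step(self, scores: Dict[str, float]) -> Optional[str]:
        """Process one clip given text-alignment scores of visible objects."""
        if not scores:
            # Nothing visible (e.g. occlusion): keep the remembered identity.
            return self.tracked_id
        best_id = max(scores, key=scores.get)
        if self.tracked_id is None:
            # First clip with any candidate: track the best one available.
            self.tracked_id = best_id
        else:
            current = scores.get(self.tracked_id, 0.0)
            # Switch only if another object matches the query clearly better.
            if best_id != self.tracked_id and scores[best_id] - current > self.switch_margin:
                self.tracked_id = best_id
        return self.tracked_id


def run_stream(clips: List[Dict[str, float]]) -> List[Optional[str]]:
    """Feed clips sequentially; no access to future clips is needed."""
    tracker = StreamingTracker()
    return [tracker.step(clip) for clip in clips]


# Made-up scores mirroring the figure: the target is absent in the first clip.
clips = [
    {"cyclist_blue": 0.4},                      # target absent, wrong object tracked
    {"cyclist_blue": 0.4, "cyclist_red": 0.9},  # target appears, tracker switches
    {},                                         # occlusion, identity is kept
    {"cyclist_red": 0.9},                       # target visible again
]
print(run_stream(clips))
```

In this sketch the margin plays the role of a crude guard against spurious switches; in the approach described by the caption that decision is learned rather than hand-set.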