CVPR 2025

LiVOS: Light Video Object Segmentation with Gated Linear Matching

Qin Liu, Jianfeng Wang, Zhengyuan Yang, Linjie Li, Kevin Lin, Marc Niethammer, Lijuan Wang

Abstract

We introduced four primary baselines in the main paper. In this section, we present the remaining seven baselines. AOT [24] and DeAOT [22] are two successive approaches that improve the efficiency of multi-object VOS. Following Cutie [6], we use their model variants with a ResNet-50 backbone as baselines. CFBI [23] and CFBI+ [25] propose a collaborative VOS approach that integrates both foreground and background information into embedding learning. Because they use only two memory frames, we classify them as non-STM methods under a less strict criterion. Both models use ResNet-101 as the backbone, and we adopt them as our baselines. DEVA [5] decouples task-specific image-level segmentation from mask propagation for universal video segmentation. We use the model trained solely on YouTube-VOS [21] and DAVIS 2017 [16] as the baseline. SwiftNet [19] balances accuracy and speed by compressing spatiotemporal redundancy in matching-based VOS with a pixel-adaptive memory. We use the model variant with a ResNet-50 backbone as the baseline. MobileVOS [15] distills knowledge from a teacher model that uses a large backbone and infinite memory. We use the best-performing model variant with a ResNet-18 backbone as the baseline.
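For quick reference, the seven baselines described above can be collected into a small lookup table. The following Python sketch is illustrative only: the `Baseline` class, the `BASELINES` list, and the `by_backbone` helper are hypothetical names introduced here, not part of any released code. The backbone for DEVA is left as `None` because the text does not state it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Baseline:
    """Hypothetical record for one baseline; fields mirror the text above."""
    name: str
    citation: int            # bracketed reference number used in the text
    backbone: Optional[str]  # None where the text does not state a backbone
    note: str


# The seven additional baselines, as described in the paragraph above.
BASELINES: List[Baseline] = [
    Baseline("AOT", 24, "ResNet-50", "efficient multi-object VOS"),
    Baseline("DeAOT", 22, "ResNet-50", "efficient multi-object VOS"),
    Baseline("CFBI", 23, "ResNet-101", "foreground/background embedding; two memory frames"),
    Baseline("CFBI+", 25, "ResNet-101", "foreground/background embedding; two memory frames"),
    Baseline("DEVA", 5, None, "trained solely on YouTube-VOS and DAVIS 2017"),
    Baseline("SwiftNet", 19, "ResNet-50", "pixel-adaptive memory"),
    Baseline("MobileVOS", 15, "ResNet-18", "distilled from a larger teacher"),
]


def by_backbone(baselines: List[Baseline]) -> Dict[Optional[str], List[str]]:
    """Group baseline names by the backbone stated in the text."""
    groups: Dict[Optional[str], List[str]] = {}
    for b in baselines:
        groups.setdefault(b.backbone, []).append(b.name)
    return groups
```

Grouping by backbone with `by_backbone(BASELINES)` makes it easy to see which baselines share an encoder when comparing speed and accuracy.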