CVPR 2022

VRDFormer: End-to-End Video Visual Relation Detection with Transformers

Sipeng Zheng, Shizhe Chen, Qin Jin

16 citations

Abstract

Visual relation understanding plays an essential role in holistic video understanding. Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatio-temporal context across stages and is also inefficient. In this paper, we propose a transformer-based framework called VRDFormer to unify these decoupled stages. Our model uses a query-based approach to autoregressively generate relation instances. We specifically design static queries and recurrent queries to enable efficient object pair tracking with spatio-temporal context. The model is jointly trained on object pair detection and relation classification. Extensive experiments on two benchmark datasets, ImageNet-VidVRD and VidOR, demonstrate the effectiveness of the proposed VRDFormer, which achieves state-of-the-art performance on both relation detection and relation tagging tasks. The code is released at https://github.com/zhengsipeng/VRDFormer_VRD.
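As a rough illustration of the query-based, autoregressive idea described in the abstract, the toy sketch below walks through a video frame by frame. It is not the released VRDFormer code: the abstract does not spell out the mechanics, so the reading here (recurrent queries carry already-tracked pairs forward, static queries pick up new pairs) is an assumption. A dictionary lookup stands in for the transformer decoder, and every name (`PairTrack`, `run_video`, `num_static`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PairTrack:
    """One tracked subject-object pair with its per-frame predicates (hypothetical)."""
    pair_id: int
    subject: str
    obj: str
    frames: list = field(default_factory=list)
    predicates: list = field(default_factory=list)


def run_video(frames, num_static=2):
    """Toy autoregressive loop over frames.

    frames: list of dicts mapping (subject, obj) -> predicate for the pairs
            visible in that frame. This replaces the real decoder output.
    num_static: assumed budget of static query slots per frame.
    """
    active = {}    # pairs currently tracked; plays the role of recurrent queries
    finished = []  # tracks whose pair has left the video
    next_id = 0
    for t, frame in enumerate(frames):
        # Recurrent queries: carry each tracked pair into the current frame.
        for key in list(active):
            if key in frame:
                active[key].frames.append(t)
                active[key].predicates.append(frame[key])
            else:
                finished.append(active.pop(key))
        # Static queries: a fixed number of slots that can start new pair tracks.
        new_keys = [k for k in frame if k not in active][:num_static]
        for key in new_keys:
            active[key] = PairTrack(next_id, key[0], key[1], [t], [frame[key]])
            next_id += 1
    finished.extend(active.values())
    return sorted(finished, key=lambda track: track.pair_id)


if __name__ == "__main__":
    video = [
        {("dog", "frisbee"): "chase"},
        {("dog", "frisbee"): "bite", ("person", "dog"): "watch"},
        {("person", "dog"): "watch"},
    ]
    for track in run_video(video):
        print(track.pair_id, track.subject, track.obj, track.frames, track.predicates)
```

In this sketch, a pair is detected once by a static slot and then followed by its recurrent entry, so detection and tracking happen in the same per-frame pass rather than in separate stages. The actual model learns its queries and decodes them with a transformer.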