CVPR 2022

Semi-supervised Video Paragraph Grounding with Contrastive Encoder

Xun Jiang, Xing Xu, Jingran Zhang, Fumin Shen, Zuo Cao, Heng Tao Shen

42 citations

Abstract

Video event grounding aims to retrieve the most relevant moments from an untrimmed video given a natural language query. Most previous work focuses on Video Sentence Grounding (VSG), which localizes a moment with a sentence query. Recently, researchers extended this task to Video Paragraph Grounding (VPG), which retrieves multiple events with a paragraph. However, we find that existing VPG methods may not perform well on context modeling and rely heavily on video-paragraph annotations. To tackle this problem, we propose a novel VPG method termed Semi-supervised Video-Paragraph TRansformer (SVPTR), which exploits contextual information in paragraphs more effectively and significantly reduces the dependency on annotated data. Our SVPTR method consists of two key components: (1) a base model, VPTR, that learns video-paragraph alignment with contrastive encoders and addresses the lack of sentence-level contextual interactions, and (2) a semi-supervised learning framework with multimodal feature perturbations that reduces the amount of annotated training data required. We evaluate our model on three widely used video grounding datasets: ActivityNet-Caption, Charades-CD-OOD, and TACoS. The experimental results show that SVPTR establishes new state-of-the-art performance on all three datasets. Even with fewer annotations, it achieves competitive results compared with recent VPG methods.

Figure: (a) Video Sentence Grounding: a single sentence query (e.g., "The man with red shorts serves the ball.") is localized to one moment. (b) Video Paragraph Grounding: a paragraph of sentences is localized to multiple events in the video (multi-multi localization).
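To make the task output concrete, here is a minimal sketch of how a grounded moment can be represented and compared with an annotation. It assumes each sentence is localized to a (start, end) segment in seconds and scores a prediction by temporal intersection over union (IoU), a measure commonly used in video grounding. The abstract does not state which metric the paper reports, so the metric choice, the function names `temporal_iou` and `paragraph_mean_iou`, and all numbers in the example are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple

# Assumption: a grounded moment is a (start_seconds, end_seconds) pair.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    # Overlap length is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def paragraph_mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Mean IoU over a paragraph, with one segment per sentence (VPG setting)."""
    if len(preds) != len(gts):
        raise ValueError("need one predicted segment per annotated segment")
    if not gts:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Hypothetical values, chosen only to illustrate the arithmetic.
single = temporal_iou((15.0, 25.0), (10.0, 20.0))  # overlap 5 s, union 15 s
paragraph = paragraph_mean_iou(
    [(0.0, 10.0), (10.0, 30.0)],   # predicted segments for two sentences
    [(0.0, 10.0), (20.0, 30.0)],   # annotated segments for the same sentences
)
```

In the sentence-level setting only `temporal_iou` is needed; the paragraph-level helper averages over the sentences of one paragraph, which mirrors the one-to-one versus multi-multi distinction drawn in the figure.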