CVPR 2022
Semi-supervised Video Paragraph Grounding with Contrastive Encoder
Xun Jiang, Xing Xu, Jingran Zhang, Fumin Shen, Zuo Cao, Heng Tao Shen
Cited by 42
Abstract
Video event grounding aims to retrieve the most relevant moments from an untrimmed video given a natural language query. Most previous works focus on Video Sentence Grounding (VSG), which localizes a single moment with a sentence query. Recently, researchers extended this task to Video Paragraph Grounding (VPG), which retrieves multiple events with a paragraph. However, we find that existing VPG methods may not perform well on context modeling and rely heavily on video-paragraph annotations. To tackle this problem, we propose a novel VPG method termed Semi-supervised Video-Paragraph TRansformer (SVPTR), which can more effectively exploit contextual information in paragraphs and significantly reduce the dependency on annotated data. Our SVPTR method consists of two key components: (1) a base model, VPTR, that learns the video-paragraph alignment with contrastive encoders and tackles the lack of sentence-level contextual interactions, and (2) a semi-supervised learning framework with multimodal feature perturbations that reduces the requirements for annotated training data. We evaluate our model on three widely used video grounding datasets, i.e., ActivityNet-Caption, Charades-CD-OOD, and TACoS. The experimental results show that our SVPTR method establishes new state-of-the-art performance on all datasets. Even with fewer annotations, it achieves competitive results compared with recent VPG methods.

[Figure: (a) Video Sentence Grounding, one sentence query localized to one moment; (b) Video Paragraph Grounding, a paragraph query localized to multiple events (multi-multi localization).]
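The abstract mentions learning video-paragraph alignment with contrastive encoders but does not give the objective. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss between sentence embeddings and event embeddings. It is not the paper's formulation; the function name `info_nce`, the temperature value, and the toy 2-D embeddings are all assumptions made for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(sentence_embs, event_embs, temperature=0.1):
    """Mean InfoNCE loss, assuming sentence i is paired with event i.

    Generic contrastive objective used for illustration; the paper's
    actual loss may differ.
    """
    s = [normalize(v) for v in sentence_embs]
    e = [normalize(v) for v in event_embs]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of sentence i to every event, scaled by temperature.
        logits = [dot(s_i, e_j) / temperature for e_j in e]
        # Numerically stable log-sum-exp over all candidate events.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching event.
        total += log_sum - logits[i]
    return total / len(s)


# Hypothetical toy embeddings: two sentences and two candidate events.
sentences = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # event i resembles sentence i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # event i resembles the other sentence

loss_aligned = info_nce(sentences, aligned)
loss_swapped = info_nce(sentences, swapped)
```

In this toy setup the loss is lower when each sentence embedding lies close to its own event embedding than when the pairing is swapped, which is the behavior a contrastive alignment objective is meant to encourage.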