CVPR 2020

Spatio-Temporal Graph for Video Captioning With Knowledge Distillation

Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles

Abstract

Figure 1: How to understand and describe a scene from video input? We argue that a detailed understanding of spatio-temporal object interaction is crucial for this task. In this paper, we propose a spatio-temporal graph model to explicitly capture such information for video captioning. Yellow boxes represent object proposals from Faster R-CNN [12]. Red arrows denote directed temporal edges (for clarity, only the most relevant ones are shown), while blue lines indicate undirected spatial connections. Video sample from MSVD [3] with the caption "A cat jumps into a box." Best viewed in color.
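
To make the graph structure in Figure 1 concrete, the following is a minimal sketch of one way such a graph could be represented. It is not the implementation used in this paper. It assumes that every pair of object proposals within a frame is joined by an undirected spatial edge, and that every object in frame t has a directed temporal edge to every object in frame t+1. The function name `build_spatio_temporal_graph` and the `(frame, object)` node encoding are hypothetical. Edge weights and the selection of the most relevant edges are omitted.

```python
from itertools import combinations


def build_spatio_temporal_graph(objects_per_frame):
    """Build an unweighted spatio-temporal graph (illustrative assumption only).

    objects_per_frame: list of ints, the number of object proposals
    detected in each frame. Nodes are (frame_index, object_index) pairs.
    """
    nodes = [(t, i) for t, n in enumerate(objects_per_frame) for i in range(n)]

    # Undirected spatial edges: unordered pairs of objects in the same frame.
    # A frozenset has no order, so (a, b) and (b, a) are the same edge.
    spatial_edges = {
        frozenset(((t, i), (t, j)))
        for t, n in enumerate(objects_per_frame)
        for i, j in combinations(range(n), 2)
    }

    # Directed temporal edges: ordered pairs from frame t to frame t + 1.
    temporal_edges = {
        ((t, i), (t + 1, j))
        for t in range(len(objects_per_frame) - 1)
        for i in range(objects_per_frame[t])
        for j in range(objects_per_frame[t + 1])
    }
    return nodes, spatial_edges, temporal_edges


# Two frames: 2 proposals in the first, 3 in the second.
nodes, spatial, temporal = build_spatio_temporal_graph([2, 3])
```

Storing spatial edges as unordered sets and temporal edges as ordered tuples mirrors the distinction between the blue undirected lines and the red directed arrows in the figure.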