CVPR 2020

Syntax-Aware Action Targeting for Video Captioning

Qi Zheng, Chaoyue Wang, Dacheng Tao

Abstract

Video captioning aims to describe the objects in a video and their interactions using natural language. Existing methods have made great efforts to identify objects in videos, but few emphasize predicting the interactions among objects, which are usually indicated by the action (predicate) in the generated sentence. Unlike other components of a sentence, the predicate depends on both the static scene and the dynamic motions in a video. Because existing methods neglect this property, the actions they generate may depend heavily on the co-occurrence of objects; e.g., 'driving' is predicted with high confidence whenever both a man and a car are detected. In this paper, we propose a Syntax-Aware Action Targeting (SAAT) module that explicitly learns actions by simultaneously referring to the subject and the video dynamics. Specifically, we first identify the subject by drawing global dependence among multiple objects, and then decode the action from a common space that fuses the embedding of the subject and the temporal feature of the video. Validated on two public datasets, the proposed module increases action accuracy in generated descriptions, which show better semantic consistency with the dynamic content of the videos. Code is available at https://github.com/SydCaption/SAAT.
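The two steps described in the abstract (identify the subject from global dependence among objects, then decode the action from a space that fuses the subject embedding with a temporal feature) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `subject_embedding` and `action_distribution`, the similarity-based pooling, the fusion by concatenation, and all feature values are assumptions chosen only to make the idea concrete.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def subject_embedding(objects):
    """Pool object features into one subject vector (assumed scheme).

    Each object is scored by its mean scaled similarity to every object,
    a simple stand-in for "global dependence among multiple objects".
    """
    dim = len(objects[0])
    scale = math.sqrt(dim)
    scores = [
        sum(dot(obj, other) for other in objects) / (len(objects) * scale)
        for obj in objects
    ]
    weights = softmax(scores)
    pooled = [sum(w * obj[i] for w, obj in zip(weights, objects)) for i in range(dim)]
    return pooled, weights


def action_distribution(subject, motion, action_table):
    """Score actions against the fused [subject; motion] vector.

    Fusion by concatenation is an assumption of this sketch.
    """
    fused = subject + motion  # list concatenation
    names = list(action_table)
    probs = softmax([dot(fused, action_table[name]) for name in names])
    return dict(zip(names, probs))


# Hypothetical toy features: two detected objects ("man", "car").
objects = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical action embeddings over [subject (2 dims); motion (2 dims)].
actions = {
    "driving": [0.5, 0.5, 2.0, 0.0],
    "standing": [0.5, 0.5, 0.0, 2.0],
}

subject, weights = subject_embedding(objects)
for label, motion in [("moving", [1.0, 0.0]), ("static", [0.0, 1.0])]:
    probs = action_distribution(subject, motion, actions)
    print(label, max(probs, key=probs.get))
```

With the detected objects held fixed, the predicted action in this sketch changes only because the motion feature changes, which is the behavior the abstract contrasts with predictions driven purely by object co-occurrence.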