CVPR2023

Relational Space-Time Query in Long-Form Videos

Xitong Yang, Fu-Jen Chu, Matt Feiszli, Raghav Goyal, Lorenzo Torresani, Du Tran

Abstract

[Figure 1 panel text, example query template] Q: When did I do activity $a$ that involves interaction with object $o$? A: Temporal locations of the corresponding activity, $\{[s_i, e_i]\}_i$.

Figure 1. Illustration of the three types of queries in our Relational Space-Time Query (ReST) framework. Given a long video spanning up to 30 minutes, a set of queries is provided to assess a model's ability to understand activities, objects, and their interactions in the video. All queries and answers are generated from pre-defined templates (top-left) to avoid the ambiguity and bias introduced by language input/output. Note that ReST is a holistic framework that supports constructing queries with different levels of complexity beyond the three basic types described in this paper.