CVPR 2021

Look Before You Speak: Visually Contextualized Utterances

Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Abstract

[Figure 1 panels: Input Video, Transcript, Next utterance candidates, Prediction (the predicted candidate is marked ✔ in the figure)]

Transcript: "I'm going to go ahead and slip that into place and I'm going to make note of which way the arrow is going in relation to the arrow on our guard. They both need to be going the same direction next."

Next utterance candidates:
"It makes it feel healthier."
"Now slip that nut back on and screw it down."
"It's going to take about five minutes."
…

Figure 1: Visually Contextualised Future Utterance Prediction. Given an instructional video with paired text and video data, we predict the next utterance in the video using a Co-attentional Multimodal Video Transformer. Our model trained on this task also achieves state-of-the-art performance on downstream VideoQA benchmarks.