CVPR 2020

Speech2Action: Cross-Modal Supervision for Action Recognition

Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

Abstract

[Figure 1 panels: "Movie screenplays" (excerpt: "She knows he's right. Jane's cell RINGS. She lets it ring again, then answers it. JANE (into phone) Hello, it's me."; weak label: [answer] phone), "Speech2Action classifier", and "Unlabelled videos" (dialogue: "Hello, it's me", "Thanks for calling so soon", "Hello Dad, are you still there?"; predicted action for each: [answers] phone).]

Figure 1. Weakly Supervised Learning of Actions from Speech Alone: The co-occurrence of speech and scene descriptions in movie screenplays (text) is used to learn a Speech2Action model that predicts actions from transcribed speech alone. Weak labels for visual actions can then be obtained by applying this model to the speech in a large unlabelled set of movies.
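The two-stage pipeline described in Figure 1 can be illustrated with a toy sketch: fit a text classifier on (speech, action) pairs taken from screenplays, then apply it to transcribed speech from unlabelled clips and keep only the confident predictions as weak labels. This is not the authors' model. A small naive Bayes classifier with add-one smoothing stands in for whatever classifier the paper uses, and the function names (`train`, `predict`, `weak_labels`), the 0.8 confidence threshold, and the "drive" training lines are all hypothetical, introduced only for illustration. The "phone" lines are the dialogue shown in Figure 1.

```python
from collections import Counter, defaultdict
import math

# Toy (speech, action) pairs. The "phone" lines come from Figure 1;
# the "drive" lines are invented here purely for illustration.
SCREENPLAY_PAIRS = [
    ("hello it's me", "phone"),
    ("thanks for calling so soon", "phone"),
    ("hello dad are you still there", "phone"),
    ("step on it we are late", "drive"),
    ("pull over right here", "drive"),
    ("keep your eyes on the road", "drive"),
]


def tokenize(text):
    """Lowercase, drop commas and question marks, split on whitespace."""
    return text.lower().replace(",", " ").replace("?", " ").split()


def train(pairs):
    """Count words per action (stand-in for training Speech2Action)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for speech, action in pairs:
        tokens = tokenize(speech)
        word_counts[action].update(tokens)
        class_counts[action] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(model, speech):
    """Return (most likely action, normalized confidence) for one line."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for action, n in class_counts.items():
        score = math.log(n / total)  # class prior
        denom = sum(word_counts[action].values()) + len(vocab)
        for tok in tokenize(speech):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[action][tok] + 1) / denom)
        scores[action] = score
    # Softmax over log scores gives a confidence in [0, 1].
    top = max(scores.values())
    exp = {a: math.exp(s - top) for a, s in scores.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())


def weak_labels(model, transcripts, threshold=0.8):
    """Keep only clips whose speech yields a confident action prediction."""
    labels = []
    for clip_id, speech in transcripts:
        action, confidence = predict(model, speech)
        if confidence >= threshold:
            labels.append((clip_id, action))
    return labels
```

A call such as `weak_labels(train(SCREENPLAY_PAIRS), [("clip1", "Hello, it's me")])` would assign the clip the weak label "phone". Speech with no informative words falls below the threshold and is left unlabelled, which matches the idea that only some dialogue is indicative of a visual action.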