ICCV 2021

MAAS: Multi-modal Assignation for Active Speaker Detection

Juan León Alcázar, Fabian Caba Heilbron, Ali K. Thabet, Bernard Ghanem

66 citations

Abstract

Active speaker detection requires a mindful integration of multi-modal cues. Current methods focus on modeling and fusing short-term audiovisual features for individual speakers, often at the frame level. We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem and provides a straightforward strategy in which independent visual features (speakers) in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from local information can approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves new state-of-the-art performance on the AVA-ActiveSpeaker dataset, with a mAP of 88.8%. We have made our code available at https://github.com/fuankarion/MAAS.
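To make the assignment idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each face in a frame and the frame's audio clip have already been embedded into vectors of the same dimension, and it swaps in plain cosine similarity as the edge weight where the paper uses a learned graph model. The names `build_graph`, `assign_speaker`, and the `min_score` threshold are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(audio_feat, face_feats):
    """Build a small star graph for one frame.

    Node 0 is the audio node; nodes 1..k are the visible faces.
    Each audio-face edge is weighted by cosine similarity
    (an assumption made here; the paper learns this relation).
    """
    nodes = [audio_feat] + list(face_feats)
    edges = {
        (0, i): cosine(audio_feat, face)
        for i, face in enumerate(face_feats, start=1)
    }
    return nodes, edges


def assign_speaker(audio_feat, face_feats, speech_detected, min_score=0.5):
    """Assign a detected speech event to at most one face.

    Returns the index of the chosen face in face_feats, or None when
    no speech was detected or no face matches well enough.
    """
    if not speech_detected or not face_feats:
        return None
    _, edges = build_graph(audio_feat, face_feats)
    (_, best_node), best_score = max(edges.items(), key=lambda kv: kv[1])
    return best_node - 1 if best_score >= min_score else None


if __name__ == "__main__":
    audio = [1.0, 0.0]
    faces = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
    print(assign_speaker(audio, faces, speech_detected=True))
```

The sketch covers only the instantaneous, single-frame case. The temporal extension mentioned in the abstract would link such per-frame graphs across time, which is not attempted here.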