CVPR 2021

Looking Into Your Speech: Learning Cross-Modal Affinity for Audio-Visual Speech Separation

Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, Kwanghoon Sohn

Abstract

Figure 1: We wish to hear only the speech of a desired speaker, even when there is frame discontinuity in the audio-visual data. When the audio and video segments are taken from different points in time (solid box), separating each speaker's speech is intuitively more difficult than in the aligned case (dashed box). Best viewed in color. (Figure labels: Video; Synchronized; Video ahead of audio.)