CVPR 2021
Looking Into Your Speech: Learning Cross-Modal Affinity for Audio-Visual Speech Separation
Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, Kwanghoon Sohn
Abstract
[Figure 1 labels: Video; Synchronized; Video ahead of audio]
Figure 1: We wish to hear only the speech of a desired speaker, even when there is frame discontinuity in the audio-visual data. When the audio and video segments are taken from different points in time (solid box), it is intuitively harder to separate each speaker's speech than in the aligned case (dashed box). Best viewed in color.