CVPR 2021

Localizing Visual Sounds the Hard Way

Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Abstract

We localize sound sources in videos without manual annotation. Our key contribution is an automatic negative mining technique: a cross-modal correspondence score map is differentiably thresholded into a Tri-map, and background regions with low correlation to the given sound are used as 'hard negatives' in a contrastive learning framework.
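
To make the idea concrete, the following is a minimal sketch of differentiable thresholding into a Tri-map and of a contrastive loss that uses the low-correlation region as hard negatives. It is an illustration under stated assumptions, not the authors' implementation: the function names, the thresholds `eps_pos` and `eps_neg`, the temperature `tau`, and the way hard and easy negatives are summed are all hypothetical choices. Score maps are assumed to be audio-visual cosine similarities in [-1, 1], stored as nested lists.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_trimap(score_map, eps_pos=0.65, eps_neg=0.4, tau=0.03):
    """Softly split a 2-D score map into positive and negative masks.

    Scores above eps_pos get a positive weight near 1, scores below
    eps_neg get a negative weight near 1, and scores in between get a
    low weight in both masks (the 'ignore' band of the Tri-map).
    The threshold and temperature values are assumptions.
    """
    pos = [[sigmoid((s - eps_pos) / tau) for s in row] for row in score_map]
    neg = [[1.0 - sigmoid((s - eps_neg) / tau) for s in row] for row in score_map]
    return pos, neg


def masked_mean(mask, score_map):
    # Average of the scores, weighted by the soft mask.
    num = sum(m * s for mr, sr in zip(mask, score_map) for m, s in zip(mr, sr))
    den = sum(m for mr in mask for m in mr)
    return num / (den + 1e-8)


def mean(score_map):
    vals = [s for row in score_map for s in row]
    return sum(vals) / len(vals)


def contrastive_loss(own_map, other_maps, **trimap_kwargs):
    """Contrastive loss for one audio clip (illustrative form).

    own_map:    score map between the audio and its own frame.
    other_maps: score maps between the audio and other frames, treated
                here as easy negatives (an assumption of this sketch).
    """
    pos, neg = soft_trimap(own_map, **trimap_kwargs)
    p = masked_mean(pos, own_map)              # response on the likely source
    n_hard = masked_mean(neg, own_map)         # background of the same frame
    n_easy = sum(mean(m) for m in other_maps)  # mismatched frames
    n = n_hard + n_easy
    # Equivalent to -log(exp(p) / (exp(p) + exp(n))).
    return math.log(1.0 + math.exp(n - p))


# Toy 2x2 example: one confident location, one ambiguous, two background.
own = [[0.9, 0.5], [0.1, 0.0]]
others = [[[0.0, 0.1], [0.0, -0.1]]]
loss = contrastive_loss(own, others)
```

Because both masks are built from sigmoids rather than hard comparisons, gradients can flow through the thresholding step. In this sketch, raising the scores of background locations in the same frame increases the loss, which is the effect expected of hard negatives.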