CVPR 2021
Localizing Visual Sounds the Hard Way
Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Abstract
We localise sound sources in videos without manual annotation. Our key contribution is an automatic negative mining technique through differentiable thresholding of a cross-modal correspondence score map into a Tri-map. We use background regions with low correlation to the given sound as 'hard negatives' in a contrastive learning framework.
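The thresholding step described in the abstract can be illustrated with a small NumPy sketch. This is a sketch under stated assumptions, not the authors' implementation: the function names (`soft_trimap`, `hard_negative_loss`), the threshold and temperature values, and the pooled two-term loss are all hypothetical choices made for illustration. It assumes the score map holds one audio-visual similarity per spatial location, and it uses a sigmoid as the differentiable stand-in for a hard threshold.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_trimap(score_map, pos_thresh=0.65, neg_thresh=0.4, temperature=0.03):
    """Split a score map into soft positive / negative / ignore masks.

    The thresholds and temperature are illustrative values (assumptions),
    with pos_thresh > neg_thresh so the ignore mask stays non-negative.
    """
    # Soft version of (score > pos_thresh): close to 1 in confident regions.
    pos_mask = sigmoid((score_map - pos_thresh) / temperature)
    # Soft version of (score < neg_thresh): close to 1 in background regions.
    neg_mask = 1.0 - sigmoid((score_map - neg_thresh) / temperature)
    # Whatever falls between the two thresholds is left out of the loss.
    ignore_mask = 1.0 - pos_mask - neg_mask
    return pos_mask, neg_mask, ignore_mask

def hard_negative_loss(score_map, **trimap_kwargs):
    """Simplified contrastive term: positive region vs. hard-negative region.

    Assumption: scores are pooled by a mask-weighted mean, and the loss is
    -log(exp(P) / (exp(P) + exp(N))) = log(1 + exp(N - P)).
    """
    pos_mask, neg_mask, _ = soft_trimap(score_map, **trimap_kwargs)
    eps = 1e-8  # guards against an empty mask
    pos_score = (pos_mask * score_map).sum() / (pos_mask.sum() + eps)
    neg_score = (neg_mask * score_map).sum() / (neg_mask.sum() + eps)
    return np.log1p(np.exp(neg_score - pos_score))

# Toy 2x2 score map: two high-score cells, one ambiguous, one background.
scores = np.array([[0.9, 0.5],
                   [0.1, 0.8]])
pos, neg, ignore = soft_trimap(scores)
loss = hard_negative_loss(scores)
```

In this toy map the 0.9 and 0.8 cells land in the positive mask, the 0.1 cell becomes a hard negative, and the 0.5 cell falls into the ignore region. Because every step is a smooth function of the scores, gradients can flow through the masks during training, which is the point of thresholding differentiably rather than with a hard cut.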