CVPR 2024

Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Nikhil Singh, Chih-Wei Wu, Iroro Orife, Mahdi M. Kalayeh

Abstract

Figure 1. (Left) Audiovisual scenes can be perceptually similar even when the words spoken in them differ, which may pose a challenge for self-supervised audiovisual representation learning. (Right) We propose leveraging movie dubs during training and show that doing so improves the quality of learned representations on a wide range of tasks.