CVPR 2025

Self-Supervised Spatial Correspondence Across Modalities

Ayush Shrivastava, Andrew Owens

Abstract

https://ayshrv.com/cmrw

[Figure 1 panels: (a) Geometric Correspondence: RGB / Thermal, RGB / Depth. (b) Semantic Correspondence: Photo / Sketch, Photorealistic style / Anime style.]

Figure 1. Finding spatial correspondences across modalities. We present a method for cross-modal matching, trained entirely through self-supervision using a simple formulation based on contrastive random walks [14]. (a) Given two images taken by different visual modalities and at different positions and times, we predict the pairs of image patches that physically correspond to the same points. (b) We also apply our method to semantic matching tasks, using a visual encoder initialized with pretrained DINOv2 [27] weights and fine-tuned during training. These include tasks such as photo-sketch alignment [26] and style-based matching between images of different styles generated with a text-to-image model [19].
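To make the contrastive random walk idea concrete, the following is a minimal NumPy sketch of a cycle-consistency objective between two sets of patch features, one per modality. It is an illustration of the general formulation in [14], not the authors' implementation. The function names (`transition`, `cycle_walk_loss`), the temperature value, and the random stand-in features are all assumptions; in practice the features would come from a learned visual encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transition(src, dst, temperature=0.07):
    """Row-stochastic transition matrix from src patches to dst patches.

    The temperature is a hypothetical choice, not a value from the paper.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    dst = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    return softmax(src @ dst.T / temperature, axis=1)

def cycle_walk_loss(feats_a, feats_b, temperature=0.07):
    """Walk A -> B -> A and penalize walkers that do not return home."""
    a_to_b = transition(feats_a, feats_b, temperature)
    b_to_a = transition(feats_b, feats_a, temperature)
    round_trip = a_to_b @ b_to_a          # (N, N), each row sums to 1
    return_prob = np.diag(round_trip)     # probability of ending where we started
    loss = -np.log(return_prob + 1e-8).mean()
    return loss, a_to_b

# Stand-in features: modality B is a shuffled, slightly noisy copy of A.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(16, 32))
perm = rng.permutation(16)
feats_b = feats_a[perm] + 0.01 * rng.normal(size=(16, 32))

loss, a_to_b = cycle_walk_loss(feats_a, feats_b)
matches = a_to_b.argmax(axis=1)           # predicted patch in B for each patch in A

# Unrelated features give a much less cycle-consistent walk.
loss_random, _ = cycle_walk_loss(feats_a, rng.normal(size=(16, 32)))
```

In this toy setup, `matches` recovers the shuffle because corresponding patches have nearly identical features. In a self-supervised setting no such ground truth is available, and the cycle loss alone would drive the encoder toward features for which the round-trip walk returns to its starting patch.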