CVPR2024

Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song

Abstract

Figure 1. Sketch-based image retrieval (SBIR) frameworks [12, 102, 103] usually employ ImageNet pre-trained CNNs [3, 15, 103], JFT-trained vision transformers (ViT) [67, 78], or the visual encoders of vision-language models like CLIP [77] as backbone feature extractors. Rich knowledge from large-scale pre-training offers a good initialisation which, when further fine-tuned on sketch-photo datasets, performs substantially better than training from random initialisation [62]. While one can extract features by discarding the classification head of ImageNet pre-trained models, by discarding the auxiliary task head of self-supervised models, or by using CLIP's visual encoder, text-to-image diffusion models (e.g., Stable Diffusion) lack any specific feature embedding space. However, we find that their intermediate representations implicitly hold robust cross-modal features at multiple granularities. Whereas prior SBIR backbones are pre-trained on discriminative tasks, we propose to leverage denoising diffusion models pre-trained on text-to-image generative tasks to bridge the sketch-photo domain gap. Being a text-to-image generation model trained on a large corpus of text-image pairs (LAION [82, 83]), such a model holds both semantic and shape priors [89]. The PCA representation [89] (right) of intermediate UNet features (sketch/photo) from different upsampling blocks (details in Sec. 5) shows that they share significant semantic similarity (denoted by similar colours).
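The PCA visualisation described in the caption can be illustrated with a minimal sketch. This is not the authors' code. It assumes that a sketch feature map and a photo feature map of shape (C, H, W) have already been extracted from the same UNet upsampling block, and it uses random arrays as stand-ins for them. The function name `joint_pca_rgb` and all shapes are hypothetical. Fitting one PCA on both maps together is an assumption made here so that the resulting colours are comparable across the two modalities.

```python
import numpy as np


def joint_pca_rgb(feat_a, feat_b, n_components=3):
    """Project two (C, H, W) feature maps onto shared principal
    components and rescale each component to [0, 1] for display as RGB."""
    c, h_a, w_a = feat_a.shape
    c_b, h_b, w_b = feat_b.shape
    assert c == c_b, "both maps must have the same channel count"

    # Flatten each map to (num_pixels, C) so every pixel is one sample.
    flat_a = feat_a.reshape(c, -1).T
    flat_b = feat_b.reshape(c, -1).T
    stacked = np.concatenate([flat_a, flat_b], axis=0)

    # Centre the data, then take principal directions from the SVD.
    centred = stacked - stacked.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ vt[:n_components].T

    # Min-max normalise each component jointly over both maps.
    lo = proj.min(axis=0, keepdims=True)
    hi = proj.max(axis=0, keepdims=True)
    rgb = (proj - lo) / (hi - lo + 1e-8)

    n_a = flat_a.shape[0]
    rgb_a = rgb[:n_a].reshape(h_a, w_a, n_components)
    rgb_b = rgb[n_a:].reshape(h_b, w_b, n_components)
    return rgb_a, rgb_b


# Random stand-ins for real UNet features (assumption: 64 channels, 16x16).
rng = np.random.default_rng(0)
sketch_feat = rng.standard_normal((64, 16, 16))
photo_feat = rng.standard_normal((64, 16, 16))
sketch_rgb, photo_rgb = joint_pca_rgb(sketch_feat, photo_feat)
```

With real features, regions of the sketch and the photo that receive similar colours in `sketch_rgb` and `photo_rgb` would correspond to similar feature vectors, which is the kind of similarity the figure reports.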