CVPR 2025

SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models

Subhadeep Koley, Tapas Kumar Dutta, Aneeshan Sain, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Yi-Zhe Song

Abstract

Figure 1. (Left): Apart from high-resolution image generation, text-to-image diffusion models (e.g., Stable Diffusion (SD) [71]), with their innate object understanding capability [84, 107], have shown remarkable performance across a wide range of image-based vision tasks (e.g., segmentation [94], depth estimation [110], etc.). However, upon analysing the PCA representation of SD's intermediate UNet features, we observe that it struggles to achieve similar results when working with freehand abstract sketches (details in Sec. 4). Unlike pixel-perfect photos, highly abstract freehand sketches are sparse and lack detailed textures and colours [26], making it harder for the SD model to extract meaningful features. Furthermore, investigating the SD denoising process in the frequency domain (via the Fourier transform), we observe the predominance of high-frequency (HF) components rather than their low-frequency (LF) counterparts, which are crucial for capturing comprehensive semantic context. To mitigate this inherent bias within SD, we reinforce the diffusion process with another pretrained model (i.e., CLIP [67]) whose bias is complementary to that of SD (i.e., it focuses on LF). Consequently, the proposed extractor can extract semantically meaningful and accurate features from both sketches and photos that encapsulate a broader frequency spectrum (i.e., HF and LF). (Right): Testing the proposed method on different sketch-based discriminative and dense prediction tasks (requiring knowledge of both sketch and image), we find a marked improvement over a baseline SD+CLIP hybrid feature extractor.
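
As a rough illustration of the frequency-domain view mentioned in the caption, the sketch below measures how much of a 2D feature map's spectral energy lies in low versus high spatial frequencies. It is not the authors' analysis code: the function name `frequency_energy_split`, the `cutoff` parameter, and the synthetic inputs are assumptions made purely for illustration, and a real analysis would operate on SD or CLIP feature maps rather than toy arrays.

```python
import numpy as np


def frequency_energy_split(feature_map, cutoff=0.25):
    """Return (low_fraction, high_fraction) of spectral energy of a 2D array.

    `cutoff` is a hypothetical threshold, expressed as a fraction of the
    Nyquist frequency (0.5 cycles per sample); it is an assumption of this
    sketch, not a value taken from the paper.
    """
    feature_map = np.asarray(feature_map, dtype=float)
    h, w = feature_map.shape

    # 2D power spectrum, shifted so the zero frequency sits at the centre.
    power = np.abs(np.fft.fftshift(np.fft.fft2(feature_map))) ** 2

    # Radial frequency (cycles per sample) of every spectrum bin.
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

    total = power.sum()
    if total == 0:
        return 0.0, 0.0

    # Energy inside the low-frequency disc versus everything outside it.
    low = float(power[radius <= cutoff * 0.5].sum() / total)
    return low, 1.0 - low


# Toy inputs: a slowly varying map (LF-dominated) and a checkerboard
# (HF-dominated, alternating at the Nyquist frequency).
n = 32
yy, xx = np.mgrid[0:n, 0:n]
smooth = np.cos(2 * np.pi * xx / n)
checker = np.where((yy + xx) % 2 == 0, 1.0, -1.0)

low_smooth, high_smooth = frequency_energy_split(smooth)
low_checker, high_checker = frequency_energy_split(checker)
```

Under these assumptions, a feature map dominated by fine detail would report a large high-frequency fraction, which is the kind of bias the caption attributes to SD, whereas a map carrying mostly coarse structure would report a large low-frequency fraction.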