WWW2026

SketchMind: Understanding Abstract Sketches with MLLMs for Fine-Grained Sketch-Based Image Retrieval

Changxing Li, Donglin Zhang, Zhikai Hu, Xiao-Jun Wu, Josef Kittler

Abstract

Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to retrieve the images that correspond to a given hand-drawn sketch, which requires the model to interpret sparse and abstract visual cues. Existing methods typically rely on convolutional networks or metric learning to align sketch and image features, and often overlook the inherent abstraction and semantic ambiguity of sketches. As a result, they capture fine-grained visual details insufficiently. To address this challenge, we propose SketchMind, a novel method that leverages Multi-modal Large Language Models (MLLMs) to enhance abstract sketch understanding in FG-SBIR. Specifically, we use MLLMs to generate auxiliary textual descriptions of a given sketch via a Visual Question Answering (VQA) strategy. To incorporate these descriptions effectively, we construct a graph with the sketch as the central node and the generated texts as peripheral nodes. A graph attention scheme then performs uncertainty-aware feature fusion, enabling the model to suppress noisy or irrelevant textual information. Furthermore, to enhance both inter- and intra-modal fine-grained alignment, we design a Multi-scale Cross-modal Jigsaw Matching module combined with a self-supervised learning strategy, which captures local and global visual correspondences across modalities more effectively. Extensive experiments on three benchmark FG-SBIR datasets demonstrate that SketchMind outperforms existing state-of-the-art methods. Code is available at https://github.com/li1changxing/MLLM_FG_SBIR/.
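
To make the star-graph fusion idea concrete, the following is a minimal illustrative sketch, not the SketchMind implementation. It assumes plain scaled dot-product attention from the central sketch node to the peripheral text nodes and a residual sum for fusion; the function name `fuse_star_graph`, the toy feature vectors, and the scoring rule are all hypothetical, and the uncertainty-aware weighting used in the paper may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_star_graph(sketch, texts):
    """Fuse a central sketch feature with peripheral text features.

    Assumption: attention scores are scaled dot products between the
    sketch node and each text node, so texts that disagree with the
    sketch receive smaller weights (a stand-in for noise suppression).
    """
    d = len(sketch)
    scores = [
        sum(a * b for a, b in zip(sketch, t)) / math.sqrt(d) for t in texts
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the peripheral (text) node features.
    context = [sum(w * t[k] for w, t in zip(weights, texts)) for k in range(d)]
    # Residual fusion: the central node keeps its own feature.
    fused = [s + c for s, c in zip(sketch, context)]
    return fused, weights


# Toy 4-dimensional features (hypothetical values).
sketch = [1.0, 0.0, 0.0, 0.0]
texts = [
    [0.9, 0.1, 0.0, 0.0],   # description consistent with the sketch
    [0.0, 0.0, 1.0, 0.0],   # unrelated description
    [-0.8, 0.0, 0.2, 0.0],  # contradictory (noisy) description
]
fused, weights = fuse_star_graph(sketch, texts)
```

In this toy setting the consistent description receives the largest attention weight and the contradictory one the smallest, which is the qualitative behaviour the fusion step is meant to provide.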