CVPR 2024

Open Vocabulary Semantic Scene Sketch Understanding

Ahmed Bourouis, Judith Ellen Fan, Yulia Gryaditskaya

Abstract

We study the underexplored but fundamental problem of machine understanding of abstract freehand scene sketches. We introduce a sketch encoder that yields a semantically aware feature space, which we evaluate on a semantic sketch segmentation task. To train our model, we rely only on bitmap sketches accompanied by brief captions, avoiding the need for pixel-level annotations. To generalize to a large set of sketches and categories, we build upon a vision transformer encoder pretrained with the CLIP model. We freeze the text encoder and perform visual-prompt tuning of the visual encoder branch while introducing a set of critical modifications. First, we augment the classical key-query (k-q) self-attention blocks with value-value (v-v) self-attention blocks. Central to our model is a two-level hierarchical training that enables efficient semantic disentanglement: the first level ensures holistic scene sketch encoding, and the second level focuses on individual categories. In the second level of the hierarchy, we introduce cross-attention between the text and vision branches. Our method outperforms zero-shot CLIP segmentation results by 37 points, reaching a pixel accuracy of 85.5% on the FS-COCO sketch dataset. Finally, we conduct a user study that allows us to identify further improvements our method needs to reconcile machine and human understanding of freehand scene sketches.
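To make the distinction between key-query and value-value self-attention concrete, the following is a minimal NumPy sketch for a single head. It is not the authors' implementation. The abstract gives no formula, so the v-v form used here, softmax(V Vᵀ / √d) V, is an assumption based on one common formulation. The function names (`kq_attention`, `vv_attention`), the random projection weights, and the token and feature sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kq_attention(x, w_q, w_k, w_v):
    # Classical self-attention: weights come from query-key similarity.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

def vv_attention(x, w_v):
    # Assumed v-v form: weights come from value-value similarity,
    # so each token attends to tokens whose value features resemble its own.
    v = x @ w_v
    attn = softmax(v @ v.T / np.sqrt(v.shape[-1]))
    return attn @ v, attn

# Hypothetical sizes: 6 patch tokens with 8-dimensional features.
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out_kq = kq_attention(x, w_q, w_k, w_v)
out_vv, attn_vv = vv_attention(x, w_v)
```

Both functions return one output vector per token. In this sketch the only difference is which projections produce the attention weights. How the two kinds of block are combined inside the encoder is not specified in the abstract and is not modeled here.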