NeurIPS 2023

CoDet: Co-occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi

Cited by 80

Abstract

Deriving reliable region-word alignment from image-text pairs is critical to learning object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, when images that mention a shared concept in their captions are grouped together, objects corresponding to that concept should exhibit high co-occurrence across the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performance and compelling scalability in open-vocabulary detection; for example, by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{\text{novel}}$ and 44.7 $\text{AP}^m_{\text{all}}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{\text{novel}}$ and 9.8 $\text{AP}^m_{\text{all}}$. Code is available at https://github.com/CVMI-Lab/CoDet.

Recent studies typically rely on vision-language models (VLMs) to determine region-word alignments, for example, by estimating region-word similarity [59, 27, 15, 28]. Although this approach is simple, the quality of the generated pseudo region-text pairs is subject to the limitations of VLMs. As illustrated in Figure 1b, VLMs pre-trained with image-level supervision, such as CLIP [35], are largely unaware of the localization quality of pseudo labels [59]. […] 53] mitigate this issue to some extent, they are initially pre-trained with a limited number of detection or grounding concepts, […]

† This work was performed when Chuofan Ma worked as an intern at ByteDance.
* Equal contribution.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
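The co-occurrence intuition from the abstract can be illustrated with a toy sketch. This is not the CoDet implementation: the function names (`group_by_concept`, `cooccurrence_scores`, `align_concept`), the hand-written 3-dimensional region features, and the scoring rule (mean of each region's best cosine match in every other image of the group) are all assumptions chosen for illustration. The sketch groups images whose captions mention the same word, then picks, in each image, the region that is most consistently similar to some region in the other images of the group.

```python
import math
from collections import defaultdict


def group_by_concept(captions, vocabulary):
    """Map each concept word to the indices of images whose caption mentions it."""
    groups = defaultdict(list)
    for idx, caption in enumerate(captions):
        words = set(caption.lower().split())
        for concept in vocabulary:
            if concept in words:
                groups[concept].append(idx)
    return dict(groups)


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cooccurrence_scores(query_regions, support_images):
    """Score each query region by its best match in every support image, averaged."""
    scores = []
    for q in query_regions:
        best = [max(cosine(q, r) for r in regions) for regions in support_images]
        scores.append(sum(best) / len(best))
    return scores


def align_concept(concept, groups, region_features):
    """For every image in the concept group, return the index of the region
    that co-occurs most strongly across the rest of the group."""
    members = groups.get(concept, [])
    if len(members) < 2:  # co-occurrence needs at least one support image
        return {}
    alignment = {}
    for i in members:
        support = [region_features[j] for j in members if j != i]
        scores = cooccurrence_scores(region_features[i], support)
        alignment[i] = max(range(len(scores)), key=scores.__getitem__)
    return alignment


# Toy data (hypothetical): three captioned images, two region features each.
captions = ["a dog on the grass", "a dog near a car", "a dog with a ball"]
region_features = [
    [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]],  # image 0: dog-like, grass-like
    [[0.0, 0.0, 1.0], [1.0, 0.0, 0.1]],  # image 1: car-like, dog-like
    [[0.8, 0.0, 0.2], [0.0, 0.7, 0.7]],  # image 2: dog-like, ball-like
]

groups = group_by_concept(captions, {"dog", "car"})
alignment = align_concept("dog", groups, region_features)
print(alignment)  # image index -> region index aligned with "dog"
```

In this toy setting the dog-like region wins in every image because it is the only content that recurs across the group, while the grass, car, and ball regions find no consistent match. No vision-language similarity between regions and words is used at any point, which is the property the abstract emphasizes.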