NeurIPS 2024

WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Tankala Pavan Kalyan, Piyush Singh Pasi, Sahil Dharod, Azeem Motiwala, Preethi Jyothi, Aditi Chaudhary, Krishna Srinivasan

Abstract

Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs). Several state-of-the-art VLMs (e.g., CLIP, BLIP-2) have achieved near-perfect performance on widely used image-text retrieval benchmarks such as MSCOCO-Test-5K and Flickr30K-Test-1K. As a measure of out-of-distribution (OOD) generalization, prior works rely on zero-shot performance evaluated on one dataset (Flickr) using a VLM fine-tuned on another (MSCOCO). We argue that such comparisons are insufficient to assess the OOD generalization capability of models, because the evaluation and fine-tuning datasets are highly similar both visually and linguistically. To address this gap, we introduce WikiDO (drawn from Wikipedia Diversity Observatory), a new cross-modal retrieval benchmark to assess the OOD generalization capabilities of pretrained VLMs. The benchmark consists of 384K image-text pairs from Wikipedia with domain labels, along with carefully curated, human-verified in-distribution (ID) and OOD test sets of size 3K each. The image-text pairs span a diverse range of topics. We evaluate different VLMs of varying capacity on the WikiDO benchmark; BLIP-2 achieves zero-shot performance of R@1 ≈ 66% on the OOD test set, compared to ≈ 81% on MSCOCO and ≈ 95% on Flickr. When fine-tuned on WikiDO, the R@1 improvement is at most ≈ 5% on OOD instances compared to ≈ 12%