CVPR 2024

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu

Abstract

Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at a finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and a visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs for fine-grained understanding, achieving significant improvements on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC.