ICLR 2025

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan

Abstract

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DECAPBENCH along with a novel metric, DCSCORE, specifically designed for detailed captioning tasks. DCSCORE evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCSCORE aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DECAPBENCH exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FEEDQUILL, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detailed captioning performance while surpassing GPT-4o. We release the evaluation code and the model on GitHub (https://github.com/MAGAer13/DeCapBench).

Recent open-source VLMs have been significantly improved, narrowing their performance gap compared with GPT-4V on various benchmarks. However, this progress does not always translate into better image captioning abilities. The issue lies in the fact that while current VLMs can generate detailed captions with many fine-grained elements, existing metrics rely on coarse-grained ground-truth captions that overlook these details. Furthermore, traditional automatic evaluation metrics show lower correlation with human evaluations, raising questions about their effectiveness. To address these limitations, we propose DECAPBENCH, a new image captioning evaluation benchmark, along with a novel metric DCSCORE, as illustrated in Figure 1, that better captures the descriptive capabilities of VLMs. Our metric ensures that model rankings align more closely with results from the VLM arena, which is based on diverse, crowd-sourced user votes for image description tasks.

DCSCORE EVALUATION METRIC
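
The abstract describes DCSCORE as deconstructing a caption into primitive information units and assessing each one individually, so that both hallucination and comprehensiveness can be measured. The Python sketch below illustrates that idea only and should not be read as the authors' implementation. The function name `unit_level_scores`, the exact-string matching against reference units, and the harmonic-mean combination are assumptions made for illustration, and the per-unit verdicts are taken as given inputs rather than produced by a judge model.

```python
from typing import Iterable, List, Tuple


def unit_level_scores(
    judged_units: List[Tuple[str, bool]],
    reference_units: Iterable[str],
) -> Tuple[float, float, float]:
    """Toy unit-level scoring (hypothetical helper, not the paper's code).

    judged_units: (unit_text, is_correct) pairs for the units extracted
        from a generated caption; is_correct is an externally supplied verdict.
    reference_units: units extracted from a reference description.
    Returns (precision, recall, f1).
    """
    reference = set(reference_units)

    # Precision: share of generated units judged correct (fewer hallucinations
    # means a higher value).
    n_correct = sum(1 for _, ok in judged_units if ok)
    precision = n_correct / len(judged_units) if judged_units else 0.0

    # Recall: share of reference units covered by correct generated units.
    # Exact string matching is an assumption; a real system would need
    # semantic matching.
    correct_texts = {text for text, ok in judged_units if ok}
    covered = correct_texts & reference
    recall = len(covered) / len(reference) if reference else 0.0

    # Harmonic mean of the two, assumed here as one way to combine them.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    judged = [("a dog", True), ("red ball", True), ("a cat", False), ("grass", True)]
    reference = ["a dog", "red ball", "grass", "blue sky", "fence"]
    print(unit_level_scores(judged, reference))
```

In this toy example three of the four generated units are judged correct, and those three cover three of the five reference units, so the unit "a cat" lowers precision while the missing "blue sky" and "fence" lower recall.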