CVPR 2025

VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu

Abstract

Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline that combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe the limitations of VL-GenRMs. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, along with these experimental insights, will become a valuable resource for advancing VL-GenRMs. Project page: https://vl-rewardbench.github.io.

* Core contributors.

Query: How many sinks and mirrors are there in the public restroom?

Answer 1 (Chosen): There are three sinks and three mirrors in the public restroom.

Answer 2 (Rejected): In the image, there are two sinks and four mirrors equipped with sensor taps alongside a matching set of hand dryers hanging above.

Llama-3.2-90B: Answer 2 is better because it accurately describes the visual elements in the image, fully addresses all aspects of the question.

Claude-3.5-Sonnet: Answer 2 provides a more accurate, complete, and informative response to the question while maintaining clarity and relevance. It correctly identifies the number of sinks and mirrors and offers additional details about the restroom that enhance the overall understanding of the space.

Qwen2-VL-7B: Answer 2 is better, which accurately states that there are two sinks and four mirrors in the image.

Figure 1. An example from VL-RewardBench that asks about visual details in a restroom. The open-source VL-GenRMs (Qwen2-VL-7B [48] and Llama-3.2-90B [13]) and the commercial model (Claude-3.5-Sonnet [3]) all fail to provide accurate judgments.
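To make the quantities in the abstract concrete, the following is a minimal illustrative sketch of pairwise judgment accuracy, Best-of-N selection, and Pearson's r. It is not the authors' evaluation code: the names `PreferencePair`, `judge_accuracy`, `best_of_n`, and `pearson_r` are hypothetical, and the judge and scoring functions are stand-ins for a real VL-GenRM (the image input is omitted for brevity).

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Sequence


@dataclass
class PreferencePair:
    """One benchmark item (hypothetical schema; image omitted)."""
    query: str
    answer_1: str
    answer_2: str
    chosen: int  # 1 or 2: index of the human-verified better answer


def judge_accuracy(pairs: Sequence[PreferencePair],
                   judge: Callable[[PreferencePair], int]) -> float:
    """Fraction of pairs where the judge picks the chosen answer.

    A judge that guesses uniformly at random is expected to score 0.5.
    """
    correct = sum(1 for pair in pairs if judge(pair) == pair.chosen)
    return correct / len(pairs)


def best_of_n(candidates: Sequence[str],
              score: Callable[[str], float]) -> str:
    """Best-of-N selection: return the candidate the reward model scores highest."""
    return max(candidates, key=score)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# The Figure 1 example: every judge shown there prefers the rejected answer.
restroom = PreferencePair(
    query="How many sinks and mirrors are there in the public restroom?",
    answer_1="There are three sinks and three mirrors in the public restroom.",
    answer_2="In the image, there are two sinks and four mirrors ...",
    chosen=1,
)


def always_second(pair: PreferencePair) -> int:
    """Stand-in judge that always prefers Answer 2."""
    return 2
```

Under these assumptions, `judge_accuracy([restroom], always_second)` scores the Figure 1 failure case as incorrect. Given per-model benchmark accuracies and the matching Best-of-N downstream accuracies, `pearson_r` would yield the kind of correlation the abstract reports.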