CVPR 2025

Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs

Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

Abstract

With rising interest in Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension while neglecting the systematic exploration of video quality understanding. To address this oversight, we introduce Q-Bench-Video, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated content (AIGC), and computer graphics (CG). b) Building on the traditional multiple-choice question format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate video-pair quality comparison questions to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we expand the evaluation aspects to include AIGC distortions, addressing the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source and 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of perceptual video quality, their performance remains incomplete and imprecise, with a notable gap relative to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.

Figure: Example questions from Q-Bench-Video, organized by (a) Question Type (Yes-or-No, What-How, Open-ended), (b) Quality Concern (Technical, Aesthetic, Temporal, AIGC), and (c) Single Video vs. Video Pairs (Global, Referring, Joint, Compare). The example questions shown in the figure are:

Q: Does the fabric in this video exhibit good clarity and proper lighting? A. Yes B. No
Q: What quality issues do not exist in this video? A. None of the options B. Blurriness C. Overexposure D. Underexposure
Q: Why is it difficult for viewers to identify the dish the man is cooking in the video? Open-ended Response: The video has severe compression blur and block artifacts, significantly reducing the discernibility of objects in the video.
Q: What feelings does this video evoke? Open-ended Response: This video depicts a castle in an unnatural and twisted jungle, where the plants have bizarre, sharp structures and dull colors. The atmosphere is eerie and terrifying, creating a sense of horror.
Q: As the camera moves away in this video, is there a noticeable increase in the clarity of the person's face? A. No B. Yes
Q: Does this video have severe camera shake? A. No B. Yes
Q: What is the most impactful quality issue of this video? A. Noise B. Incorrect human structure C. Blurriness D. Overexposure
Q: How is the overall lighting level in this game video? A. Very poor B. Poor C. Average D. Good
Q: Is the main character in this game video rendered in high detail but with relatively low clarity? A. Yes B. No
Q: Is the exposure of the first video more balanced than the second video? A. No B. Yes
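
To make the question formats above concrete, the following is a minimal Python sketch of how one benchmark item might be represented and how multiple-choice responses could be scored. It is an illustration under stated assumptions, not the benchmark's actual data schema or evaluation code: the names `QAItem`, `format_prompt`, and `multiple_choice_accuracy` are hypothetical, the answer keys in the usage example are placeholders rather than ground truth, and scoring by matching the leading option letter is an assumption that the abstract does not specify. Open-ended items are skipped here because they need a separate grading procedure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class QAItem:
    """Hypothetical container for one question-answer pair."""
    question: str
    question_type: str              # "yes-or-no", "what-how", or "open-ended"
    options: Sequence[str] = ()     # empty for open-ended questions
    answer: Optional[str] = None    # correct option letter; None for open-ended


def format_prompt(item: QAItem) -> str:
    """Render the question followed by lettered options, one per line."""
    lines = [item.question]
    for letter, option in zip("ABCD", item.options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def multiple_choice_accuracy(items: Sequence[QAItem],
                             predictions: Sequence[str]) -> float:
    """Fraction of multiple-choice items whose prediction starts with the
    correct option letter (assumed scoring rule). Open-ended items are skipped."""
    scored = [(it, p) for it, p in zip(items, predictions) if it.answer is not None]
    if not scored:
        return 0.0
    correct = sum(p.strip().upper().startswith(it.answer) for it, p in scored)
    return correct / len(scored)


# Usage with questions quoted from the figure; the answer keys are placeholders.
items = [
    QAItem("Does this video have severe camera shake?",
           "yes-or-no", ("No", "Yes"), "B"),
    QAItem("How is the overall lighting level in this game video?",
           "what-how", ("Very poor", "Poor", "Average", "Good"), "C"),
    QAItem("What feelings does this video evoke?", "open-ended"),
]
predictions = ["B", "a. very poor", "The atmosphere is eerie."]

print(format_prompt(items[0]))
print(multiple_choice_accuracy(items, predictions))
```

Keeping the question type on each item lets results be broken down by category (Yes-or-No, What-How, Open-ended) with a simple filter before scoring.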