ACL 2024

Visual Hallucinations of Multi-modal Large Language Models

Wen Huang, Hongbin Liu, Minxin Guo, Neil Gong

Cited by 21

Abstract

Visual hallucination (VH) means that a multi-modal LLM (MLLM) imagines incorrect details about an image in visual question answering. Existing studies find VH instances only in existing image datasets, which results in a biased understanding of MLLMs' performance under VH due to the limited diversity of such VH instances. In this work, we propose a tool called VHTest to generate a diverse set of VH instances. Specifically, VHTest finds some initial VH instances in existing image datasets (e.g., COCO), generates a text description for each VH mode, and uses a text-to-image model (e.g., DALL·E-3) to generate VH images based on the text descriptions. We collect a benchmark dataset with 1,200 VH instances in 8 VH modes using VHTest. We find that existing MLLMs such as GPT-4, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of the instances in our benchmark. Moreover, we find that fine-tuning an MLLM using our benchmark dataset reduces its likelihood of hallucinating without sacrificing its performance on other benchmarks. Our benchmarks are available at an anonymous link: https://github.com/PrimaveralScientist/VHTest

[...] containing factually incorrect details about an image, known as visual hallucination (VH) (Li et al., 2023; Liu et al., 2024b). Figure 1 shows an example where the MLLM hallucinates two lamps, contradicting the three lamps in the image. VHs in MLLMs pose obstacles to developing safe and trustworthy AI, which is emphasized in a recent U.S. Executive Order calling for rigorous testing to address potential harms from advanced AI systems (The White House, 2023).

Prior works have tried to benchmark MLLMs' VHs related to object existence (Li et al., 2023; Liu et al., 2024a), optical character recognition (OCR), object counting, object position comparison (Fu et al., 2023), orientation, and viewpoint (Tong et al., 2024) (concurrent to ours).
However, they collect VH images only from existing image datasets like COCO (Lin et al., 2014). This limits the diversity of VH images, since only a limited number of them can be found this way. Moreover, existing image datasets may have been used to pre-train an MLLM, leading to data contamination (Jacovi et al., 2023; Sainz et al., 2023). As a result, such VH images lead to a biased understanding of an MLLM's performance, e.g., an MLLM is incorrectly concluded to perform well under VH.

Our Work. We propose VHTest, a tool that generates VH instances to test MLLMs. A VH instance is a triple (an image, a question, a reference [...]

5. OCR VH: An MLLM fails to accurately identify at least one character in an image.
6. Size VH: An MLLM fails to accurately compare the relative sizes of multiple objects in an image.
7. Position VH: An MLLM fails to accurately identify spatial relationships between objects in an image.
8. Counting VH: An MLLM fails to accurately count the objects in an image.

VH Instance. A VH instance is a triple $(x_i, x_t, y_r)$, where $x_i$ is an image, $x_t$ is a question, and $y_r$ is a reference answer. We say a VH instance succeeds for an MLLM if and only if the MLLM's text response for $x_i$ and $x_t$ is factually incorrect compared to the reference answer $y_r$. For instance, in the example shown in Figure 1, the reference answer is "three lamps", while the MLLM's text response indicates two lamps.
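
The triple and the success criterion defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the VHTest implementation: the names `VHInstance`, `normalized_match`, `vh_succeeds`, and `success_fraction` are hypothetical, the image is represented by a file path, and the judge that decides whether a response agrees with the reference answer is a naive substring check standing in for whatever evaluation procedure is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class VHInstance:
    """A VH instance (x_i, x_t, y_r). All field names are illustrative."""
    image_path: str  # x_i: the image, stored here as a file path (assumption)
    question: str    # x_t: the question asked about the image
    reference: str   # y_r: the reference answer


def normalized_match(response: str, reference: str) -> bool:
    """Hypothetical judge: the reference answer appears in the response,
    ignoring case and repeated whitespace. A real evaluation would need
    a more careful comparison."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(reference) in norm(response)


def vh_succeeds(
    instance: VHInstance,
    response: str,
    is_correct: Callable[[str, str], bool] = normalized_match,
) -> bool:
    """The instance succeeds iff the response is judged incorrect
    with respect to the reference answer."""
    return not is_correct(response, instance.reference)


def success_fraction(
    instances: Sequence[VHInstance],
    responses: Sequence[str],
    is_correct: Callable[[str, str], bool] = normalized_match,
) -> float:
    """Fraction of instances on which the model hallucinates."""
    if len(instances) != len(responses):
        raise ValueError("need exactly one response per instance")
    if not instances:
        return 0.0
    hits = sum(vh_succeeds(i, r, is_correct) for i, r in zip(instances, responses))
    return hits / len(instances)


# The lamp example from Figure 1, with a made-up file name and question.
lamps = VHInstance("lamps.png", "How many lamps are in the image?", "three lamps")
print(vh_succeeds(lamps, "There are two lamps in the image."))  # True
print(vh_succeeds(lamps, "I can see three lamps."))             # False
```

Passing the judge as a parameter keeps the success criterion separate from the question of how factual correctness is decided, which the substring check only approximates.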