ACL2024

Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA

Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Yang Zhao, Xinze Guan, Xin Wang

Abstract

Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, we introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images that specifically challenge models in comprehending multipanel images. Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Large Vision Language Models (LVLMs) tested, even though humans can attain approximately 99% accuracy on these questions. Distinctively, the MultipanelVQA benchmark features synthetically generated multipanel images specifically crafted to isolate and assess the impact of various factors, such as the layout, on LVLMs' multipanel image comprehension abilities. As a result, in addition to benchmarking the capabilities of LVLMs in understanding multipanel images, we analyze the potential causes for LVLMs' performance and offer insights for enhancement with the synthetic data. Code and data will be released.

[...]able proficiency in various tasks (e.g., image captioning and visual question answering) that require natural language understanding, visual-language grounding, visual reasoning, etc.

As LVLMs become more competent, there is a trend of establishing increasingly challenging benchmarks that are often arduous for average humans to achieve (Yue et al., 2023).
However, this raises a pertinent question: Have LVLMs advanced to the stage where elementary benchmarks easily handled by average humans pose little challenge to them? To answer this question, we target multipanel images, each involving a series of subfigures. These subfigures are presented together in certain layouts, such as web screenshots capturing multiple thumbnail images and posters utilizing multipanel formats to present a cohesive narrative or argument. We observe that while humans typically find inter[...]

[...] sequential numbers to subfigure captions in multipanel images, akin to the Set-of-Mark visual prompting method (Yang et al., 2023), improves LVLMs' understanding of these images. We test LVLMs on multipanel images with and without sequential number captions for each subfigure. As a result, we observe that only GPT-4V (OpenAI, 2023b) and MiniGPT-v2 (Chen et al., 2023) show a notable improvement when the sequential number is not only embedded in the image but also explicitly mentioned in the question. In conclusion, the contributions of this study are listed as follows:

• We propose the MultipanelVQA benchmark with real-world and synthetic data that focus on evaluating a model's ability to understand the content and layout of multipanel images.
• We benchmark several open-source and proprietary LVLMs with the MultipanelVQA benchmark and find that all models tested face a significant challenge in interpreting multipanel images despite their success on single-panel images.
• Benefiting from the synthetic data with even distributions of various multipanel image attributes in the MultipanelVQA benchmark, we conduct a thorough error analysis to uncover various factors that impact the model's performance, including subfigure content, layout, background, and visual hints in multipanel images.
• Finally, we investigate the potential of adding subfigure captions in multipanel images as visual prompts to enhance the performance of LVLMs on multipanel image understanding.

2 Related Work

Large Vision Language Models. The development of Large Vision Language Models (LVLMs) has been propelled by advances in large language models (LLMs) (Chung et al., 2022; Touvron et al., 2023a,b) and vision-and-language learn-