ACL2024

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Fuxiao Liu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang


Abstract

Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations.

1 Introduction

The recent emergence of Multimodal Large Language Models (MLLMs) such as GPT-4V (OpenAI, 2023b) and Gemini (Team, 2023) has shown strong visual-language understanding and generation capabilities in many areas, like image captioning and visual question answering.
Despite the notable performance of existing MLLMs, they often suffer from hallucination (a phenomenon where MLLMs produce inaccurate descriptions of the given images) due to insufficient reasoning capabilities, generating inaccurate responses in visual [...]

[...]cant deficiency in MLLMs' capability to deduce events from image sequences.

Furthermore, our research pinpoints three principal factors that lead to the reasoning failures of MLLMs: (1) the interconnectedness of object and behavioral hallucinations, (2) the impact of co-occurring behaviors, and (3) the cumulative effect of behavioral hallucinations. The objective of our proposed benchmark and analyses is to shed light on innovative approaches to augment the reasoning abilities of MLLMs and to reduce hallucinations in their subsequent advancements.

2 Mementos

In this section, we introduce Mementos, a novel and challenging benchmark designed to test the reasoning capability of Multimodal Large Language Models (MLLMs) under sequential image input. First, we detail the data gathering and annotation methodology for Mementos, alongside an overview of the data distribution. We then outline the procedure and the metric employed to evaluate the reasoning capabilities of MLLMs on Mementos.

2.1 Mementos Benchmark

2.1.1 Dataset Composition

Mementos comprises 4,761 image sequences of varying lengths, predominantly sourced from the Daily-life, Robotics, and Comics domains. Detailed statistics are provided in Table 1. This diverse collection is instrumental in evaluating the comprehensive time-varying reasoning abilities of MLLMs.
Specifically, the robotics data, closely associated with embodied AI or real-world contexts, and the comic-style storyboard data, rich in stylistic and episodic diversity in image sequences, significantly enhance the benchmark's relevance and robustness.

Table 1: The number of image sequences in different categories within Mementos.

             Total   Train set   Val set
Daily-life   3505    3055        450
Robotics     1101    902         199
Comics       155     105         50

Daily-life. The Daily-life image sequences in Mementos are derived from video clips in the Next-QA dataset, as cited in Xiao et al. (2021). These sequences represent a range of everyday life scenarios. We have selectively extracted videos from the Next-QA Training set, specifically those with frame counts ranging from 400 to 2,500. To balance the challenge of testing MLLMs' reasoning capabilities against the risk of losing critical information, our methodology involves retaining the first frame of each video. Subsequently, we sample one image every 100 frames. The collected images from this sampling process then form an image sequence that corresponds to the original video. This approach ensures a rigorous yet feasible evaluation of MLLMs' reasoning abilities in dynamically evolving everyday scenarios.

Robotics. For the Robotics data, we utilized videos from various sub-datasets within Open X-Embodiment (Collaboration et al., 2023). Open X-Embodiment aggregates video datasets from multiple university laborator