ACL2024

MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing

Jiaqi Li, Miaozeng Du, Chuanyi Zhang, Yongrui Chen, Nan Hu, Guilin Qi, Haiyun Jiang, Siyuan Cheng, Bozhong Tian


Abstract

Multimodal knowledge editing represents a critical advancement in enhancing the capabilities of Multimodal Large Language Models (MLLMs). Despite its potential, current benchmarks predominantly focus on coarse-grained knowledge, leaving the intricacies of fine-grained (FG) multimodal entity knowledge largely unexplored. This gap presents a notable challenge, as FG entity recognition is pivotal for the practical deployment and effectiveness of MLLMs in diverse real-world scenarios. To bridge this gap, we introduce MIKE, a comprehensive benchmark and dataset specifically designed for FG multimodal entity knowledge editing. MIKE encompasses a suite of tasks tailored to assess different perspectives, including Vanilla Name Answering, Entity-Level Caption, and Complex-Scenario Recognition. In addition, a new form of knowledge editing, Multi-Step Editing, is introduced to evaluate editing efficiency. Through extensive evaluations, we demonstrate that current state-of-the-art methods face significant challenges in tackling our proposed benchmark, underscoring the complexity of FG knowledge editing in MLLMs. Our findings spotlight the urgent need for novel approaches in this domain, setting a clear agenda for future research and development efforts within the community.

2023b; Khan et al., 2023) and Image Caption (Li et al., 2023b; Ramos et al., 2023) tasks, MMEdit offers a platform to test the editability of MLLMs. However, a critical issue remains in its primary focus on coarse-grained knowledge, which often falls short of accurately representing real-world fine-grained (FG) entities and scenarios.

To underscore the limitations of a coarse-grained focus, consider a real-life example in political image captioning as shown in Figure 1. An ideal MLLM output would be a fine-grained and specific caption like "President Joe Biden arrives at the White House".
However, a coarse-grained approach might yield a nondescript caption such as "A white hair old man arrives at a building". This lack of specificity fails to capture the critical details and convey key information to the users of MLLMs, illustrating how FG entity recognition is essential for delivering accurate information.

While the necessity for more detailed, entity-specific information is clear, editing FG knowledge into MLLMs is a complex and challenging endeavor. Traditional FG image classification tasks