EMNLP 2024

MAR: Matching-Augmented Reasoning for Enhancing Visual-based Entity Question Answering

Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang

Abstract

A multimodal large language model (MLLM) may struggle to answer visual-based (personal) entity questions (VEQA), such as "who is A?" or "who is A that B is talking to?", for various reasons, e.g., the name of A is absent from the caption, or the MLLM cannot recognize A, particularly when A is a less common entity. Furthermore, even if the MLLM can identify A, it may refrain from answering due to privacy concerns. In this paper, we introduce a novel methodology called Matching-Augmented Reasoning (MAR) to enhance VEQA. Given a collection of visual objects with captions, MAR preprocesses each object individually, identifying faces, names, and their alignments within the object. It encodes this information and stores the resulting vector representations in vector databases. When handling a VEQA query, MAR retrieves matching faces and names and organizes these entities into a matching graph, where nodes represent entities and edges indicate their similarities. MAR then derives the answer to the query by reasoning over this matching graph. Extensive experiments show that MAR significantly improves VEQA compared with state-of-the-art methods using MLLMs.

[...] LLaVA (Liu et al., 2023) have significantly improved visual question answering (VQA) by integrating text and images. However, they still face challenges in visual-based entity question answering (VEQA), a crucial subset of VQA that focuses on extracting information about specific entities, especially personal entities.

MLLMs for VEQA: Advantages and Limitations. [...]

As illustrated in Figure 1(c), if we can successfully match the face in image V2 with the face in image V1, and if we know that the face in V1 is "Yi Wang", we can easily answer Q2.

Contributions. Our notable contributions are summarized as follows.

• We study VEQA, an important and commonly used subset of VQA that remains underexplored. (Section 3)
• We propose matching graphs that capture the relationships of the same entities across multiple captioned visual objects. Based on a matching graph, we propose matching-augmented reasoning (MAR) to effectively answer VEQA queries. (Section 4)
• Given that VEQA is a relatively new problem, existing benchmarks are not suitable. Therefore, we construct a new benchmark, NewsPersonQA, comprising 235k images and 6k QA pairs. (Section 5)
• We conduct extensive experiments to show that MAR > MLLMs + RAG > MLLMs, where RAG feeds the retrieved matching graph to MLLMs. (Section 6)

2 Related Work

VQA. VQA aims at reasoning over visual and textual content and cues to generate answers (Lu