ACL 2024

Modality-Aware Integration with Large Language Models for Knowledge-Based Visual Question Answering

Junnan Dong, Qinggang Zhang, Huachi Zhou, Daochen Zha, Pai Zheng, Xiao Huang

Cited by 11

Abstract

Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been made to leverage large language models (LLMs) as an implicit knowledge source, it remains challenging since LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g., images, KGs and LLMs, cannot be readily aligned for complex scenarios. To tackle these challenges, we present a novel modality-aware integration with LLMs for KVQA (MAIL). It carefully leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; and (iii) we design a tailored pseudo-siamese graph medium fusion for sufficient multimodal fusion. We utilize the shared mentioned entities in the two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion within mediums. Extensive experiments show the superiority of MAIL.

Recently, several studies have explored using large language models (LLMs) as supplementary knowledge bases and reasoning tools for KVQA (Yang et al., 2022; Gui et al., 2022; Lin et al., 2022). According to how they fuse the knowledge, they can be broadly categorized into direct prompting and modality-agnostic approaches, shown in Figure 1 (a) and (b), respectively. The former directly prompts the question and the corresponding image caption to LLMs for answers (Yang et al., 2022).
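As a rough illustration of the direct prompting setup just described, the sketch below assembles a text-only prompt from an image caption and a question. The function name `build_direct_prompt`, the prompt wording, and the example caption and question are all assumptions made for illustration; they are not taken from Yang et al. (2022) or from MAIL.

```python
def build_direct_prompt(caption: str, question: str) -> str:
    """Compose a text-only prompt from an image caption and a question.

    Hypothetical template: the LLM never sees the image itself, only the
    caption, which is why fine-grained visual detail can be lost.
    """
    return (
        "Answer the question using the image description.\n"
        f"Image description: {caption.strip()}\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Example inputs (invented for illustration).
prompt = build_direct_prompt(
    "A man riding a wave on top of a surfboard.",
    "What material is the board usually made of?",
)
print(prompt)
```

The resulting string would then be sent to whichever LLM is in use; that call is omitted here because it depends on the provider.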
The latter leverages LLMs to generate candidate answers with supporting evidence and simply combines the question with external knowledge embeddings, e.g., from Wikidata (Shengyuan et al., 2024), for reasoning at the final stage (Gui et al., 2022; Lin et al., 2022). While the above methods have employed LLMs in various ways for KVQA, we argue that they have not fully leveraged the knowledge from LLMs and lack cross-modal reasoning ability, potentially resulting in sub-optimal performance in complex VQA scenarios. (i) LLMs could incorrectly answer questions or provide unreliable evidence for reasoning. On the one hand, directly prompting LLMs may fail to identify the right answer for many complex or domain-specific questions, due to the lack of domain knowledge (Amaro et al., 2023; Chen et al., 2024). On the other hand, LLMs