AAAI2023

HybridPrompt: Bridging Language Models and Human Priors in Prompt Tuning for Visual Question Answering

Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Guohui Li

8 citations

Abstract

Visual Question Answering (VQA) aims to answer the natural language question about a given image by understanding multimodal content. However, the answering quality of most existing visual-language pre-training (VLP) methods is still limited, mainly due to: (1) Incompatibility. Upstream pretraining tasks are generally incompatible with downstream question answering tasks, which makes the knowledge from the language model not well transferable to downstream tasks, and greatly limits their performance in few-shot scenarios; (2) Under-fitting. They generally do not integrate human priors to compensate for universal knowledge from language models, so as to fit the challenging VQA problem and generate reliable answers. To address these issues, we propose HybridPrompt, a cloze-and verify-style hybrid prompt framework with bridging language models and human priors in prompt tuning for VQA. Specifically, we first modify the input questions into the cloze-style prompts to narrow the gap between upstream pre-training tasks and downstream VQA task, which ensures that the universal knowledge in the language model can be better transferred to subsequent human prior-guided prompt tuning. Then, we imitate the cognitive process of human brain to introduce topic and sample related priors to construct a dynamically learnable prompt template for human prior-guided prompt learning. Finally, we add fixedlength learnable free-parameters to further enhance the generalizability and scalability of prompt learning in the VQA model. Experimental results verify the effectiveness of Hy-bridPrompt, showing that it achieves competitive performance against previous methods on widely-used VQAv2 dataset and obtains new state-of-the-art results. Our code is released at: https://github.com/zhizhi111/hybrid .