WWW2024

CapAlign: Improving Cross Modal Alignment via Informative Captioning for Harmful Meme Detection

Junhui Ji, Xuanrui Lin, Usman Naseem


Abstract

Harmful meme detection is challenging because of the semantic gap between modalities. Previous studies mainly focus on feature extraction and fusion to learn discriminative information from memes. However, they ignore the misalignment caused by the modality gap and suffer from data scarcity, resulting in insufficient learning by fusion-based models. More recent work transforms images into textual captions and uses language models for prediction, but the resulting image captions are non-informative. To address these gaps, this paper proposes CapAlign, an instruction-based abstracting approach that operates in a zero-shot visual question-answering setting. Specifically, we prompt a large language model (LLM) to ask informative questions to a pre-trained vision-language model and use the resulting dialogue to generate a high-quality image caption. Further, to align the generated caption with the textual content of a meme, we use an LLM with instructions to generate an informative caption of the meme, and then pass it, together with the attributes of the meme's visual content, to a prompt-based LM for prediction. Experimental findings on two benchmark datasets show that our approach produces informative captions and outperforms state-of-the-art methods for detecting harmful memes.
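
The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`build_dialogue`, `summarize_caption`, `build_prompt`), the prompt wording, the number of question rounds, and the `[MASK]` template are all assumptions made for illustration, and the LLM and vision-language model are replaced by toy stand-ins so the control flow can be run without any model.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# an LLM maps a text prompt to text; a VQA model maps (image, question) to an answer.
AskLLM = Callable[[str], str]
AnswerVQA = Callable[[str, str], str]


def build_dialogue(image_id: str, ask_llm: AskLLM, answer_vqa: AnswerVQA,
                   rounds: int = 3) -> List[Tuple[str, str]]:
    """Let the LLM question the vision-language model for a fixed number of rounds."""
    dialogue: List[Tuple[str, str]] = []
    for _ in range(rounds):
        # Show the LLM the dialogue so far so it can ask something new.
        history = " ".join(f"Q: {q} A: {a}" for q, a in dialogue)
        prompt = f"Ask one informative question about the image. {history}".strip()
        question = ask_llm(prompt)
        dialogue.append((question, answer_vqa(image_id, question)))
    return dialogue


def summarize_caption(dialogue: List[Tuple[str, str]], ask_llm: AskLLM) -> str:
    """Ask the LLM to condense the question-answer dialogue into one caption."""
    transcript = " ".join(f"Q: {q} A: {a}" for q, a in dialogue)
    return ask_llm(f"Summarize this dialogue into one image caption. {transcript}")


def build_prompt(caption: str, meme_text: str, attributes: List[str]) -> str:
    """Assemble the input for a prompt-based LM (the template is an assumption)."""
    attrs = ", ".join(attributes)
    return (f"Attributes: {attrs}. Caption: {caption} "
            f"Meme text: {meme_text} This meme is [MASK].")


# Toy stand-ins so the sketch runs without any real model.
def toy_llm(prompt: str) -> str:
    if prompt.startswith("Summarize"):
        return "A toy caption."
    return f"Question {prompt.count('Q:') + 1}?"


def toy_vqa(image_id: str, question: str) -> str:
    return "toy answer"


if __name__ == "__main__":
    turns = build_dialogue("meme.png", toy_llm, toy_vqa, rounds=2)
    caption = summarize_caption(turns, toy_llm)
    print(build_prompt(caption, "overlaid meme text", ["person", "outdoor"]))
```

In a real setting, `toy_llm` and `toy_vqa` would be replaced by calls to an actual LLM and a pre-trained vision-language model, and the final prompt would be scored by a prompt-based LM; those choices are left open here because the abstract does not specify them.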