ACL 2024

Exploring Chain-of-Thought for Multi-modal Metaphor Detection

Yanzhi Xu, Yueying Hua, Shichen Li, Zhongqing Wang

Abstract

Metaphors are commonly found in advertising and internet memes. However, the free form of internet memes often leads to a lack of high-quality textual data. Metaphor identification demands a deep interpretation of both textual and visual elements and requires extensive common-sense knowledge, which poses a challenge to language models. To address these challenges, we propose a compact framework that enhances a small model by distilling knowledge from Multi-modal Large Language Models (MLLMs). Specifically, our approach uses a three-step process inspired by Chain-of-Thought (CoT) that extracts knowledge from larger models and integrates it into smaller ones. We also develop a modality fusion architecture that transforms knowledge from large models into metaphor features, supplemented by auxiliary tasks to improve model performance. Experimental results on the MET-MEME dataset demonstrate that our method not only enhances the metaphor identification capabilities of small models but also outperforms existing models. To our knowledge, this is the first systematic study leveraging MLLMs in metaphor identification tasks.

[...] modality identification, multi-modal metaphor identification not only spots metaphors in sentences but also categorizes them as image-dominated, text-dominated, or complementary. The second major challenge arises from the poor quality of textual content, which is mainly sourced from advertisements and memes on social media. The text embedded in an image gives the image additional metaphorical features. Recent efforts use OCR (Optical Character Recognition) to extract the text in the image. However, relying only on OCR to convert it into parallel text loses the text's positional information. Figure 1 presents a representative example, in which 'PUBG' (a video game) acts like a trap preventing 'me' from achieving my 'life goals'.

To overcome these challenges, we hope to gain [...]

[...] Zhang et al., 2023a).
Unlike the aforementioned approaches that extract information from different modalities and directly merge them, we leverage LLMs employing the CoT method to analyze features between modalities, aiding downstream models in cross-modal fusion.

3 Method

We propose a novel framework based on knowledge distillation from MLLMs to enhance metaphor identification. In this section we first introduce the task definition (3.1) and the complete model architecture (3.2). After that, we elaborate on knowledge acquisition from MLLMs using the CoT method (3.3) and the implementation of the downstream fusion module (3.4). Finally, we provide a brief exposition of the training methodology (3.5).

3.1 Task Definition

Formally, the task of multi-modal metaphor identification falls under the typical category of multi-modal classification problems. Given a set of cross-modal sample pairs, the task aims to determine whether metaphorical features are present and provide a classification result. Our work focuses on the identification of metaphors in image-text pairs, thus the task is represented as:

$$\hat{Y} = F(x_I, x_T) \tag{1}$$

where $x_I$ and $x_T$ respectively denote the features of the image and text modalities. Our objective is to utilize a more effective method $F$ to ensure that the classification result $\hat{Y}$ more closely aligns with the true value $y$.

[...] due to their strong performance in both Chinese and English corpora. We fine-tuned both models separately using LoRA.
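For readers unfamiliar with LoRA, the following is a minimal sketch of the general idea: a frozen weight matrix W is augmented with a trainable low-rank update, so that the layer computes W x + (alpha / r) B A x, with B initialized to zero. This illustrates the standard LoRA formulation only, not our training code. The class name `LoRALinear`, the helper `matvec`, and the values chosen for the rank and alpha are assumptions made for illustration, since this excerpt does not state the hyperparameters used.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, never updated
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A (rank x d_in) starts with small random values.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B (d_out x rank) starts at zero, so the update is initially a no-op.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        """Only A and B would be trained; W stays fixed."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
    layer = LoRALinear(W, rank=1, alpha=2.0)
    print(layer.forward([1.0, 1.0, 1.0]))     # equals W x before any training
    print(layer.trainable_parameter_count())  # r * (d_in + d_out)
```

Because only A and B are trained, the number of trainable parameters is r (d_in + d_out) rather than d_in d_out, which is what makes fine-tuning each large model separately affordable.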