USENIX Security 2025

From Meme to Threat: On the Hateful Meme Understanding and Induced Hateful Content Generation in Open-Source Vision Language Models

Yihan Ma, Xinyue Shen, Yiting Qu, Ning Yu, Michael Backes, Savvas Zannettou, Yang Zhang

Abstract

Open-source Vision Language Models (VLMs) have advanced rapidly, combining natural language with visual modalities and achieving remarkable performance on tasks such as image captioning and visual question answering. However, their effectiveness in real-world scenarios remains uncertain, as real-world images, particularly hateful memes, often convey complex semantics, cultural references, and emotional signals far beyond those in experimental datasets. In this paper, we present an in-depth evaluation of VLMs' ability to interpret hateful memes by curating a dataset of 39 hateful memes and 12,775 responses from seven representative VLMs using carefully designed prompts. Our manual annotations of the responses' informativeness and soundness reveal that VLMs can identify visual concepts and understand cultural and emotional backgrounds, especially for well-known hateful memes. However, we find that the VLMs lack robust safeguards to effectively detect and reject hateful content, making them vulnerable to misuse for generating harmful outputs such as hate speech and offensive slogans. Our findings show that 40% of VLM-generated hate speech and over 10% of hateful jokes and slogans were flagged as harmful, emphasizing the urgent need for stronger safety measures and ethical guidelines to mitigate misuse. We hope our study serves as a foundation for improving VLM safety and ethical standards in handling hateful content.

Disclaimer. This paper includes examples of hateful content, including antisemitic symbols and other forms of highly offensive material. Reader discretion is advised when reviewing this content.