EMNLP 2025

Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu

Cited by 1

Abstract

Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MEMESAFETYBENCH, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MEMESAFETYBENCH is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.

Warning: This paper includes examples of harmful language and images that may be sensitive or uncomfortable. Reader discretion is recommended.

[Figure: pipeline overview. Recoverable labels: Meme (I_i); Category (c_i) Classification, with example category "False or Misleading Information"; GPT-4o; ① and ② step markers; Harmful/Harmless Instruction Generation & Verification; Task (t, indexed by i and j).]