ACL2025

Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

Jiahui Geng, Thy Thy Tran, Preslav Nakov, Iryna Gurevych

摘要

Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, here we exploit the capabilities of MLLMs to interpret non-textual instructions-specifically adversarial images or audio-generated by our novel method, Con Instruction. We optimize the adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental aspects of sophisticated understanding in MLLMs. Unlike previous work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLMs safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new attack response categorization (ARC) that considers both response quality and relevance to the malicious instructions to evaluate attack success. The results show that Con Instruction effectively bypasses the safety mechanisms in various visual and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, across two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various methods against our attacks and find a substantial gap among existing techniques. Our implementation is made available. 1 Warning: This paper contains examples that may be offensive to some readers.