ACL 2025

SQL Injection Jailbreak: A Structural Disaster of Large Language Models

Jiawei Zhao, Kejiang Chen, Weiming Zhang, Nenghai Yu

Cited by 9

Abstract

Large Language Models (LLMs) are susceptible to jailbreak attacks that can induce them to generate harmful content. Previous jailbreak methods primarily exploited the internal properties or capabilities of LLMs, such as optimization-based jailbreak methods and methods that leveraged the model's context-learning abilities. In this paper, we introduce a novel jailbreak method, SQL Injection Jailbreak (SIJ), which targets the external properties of LLMs, specifically the way LLMs construct input prompts. By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content. For open-source models, SIJ achieves attack success rates of nearly 100% on five well-known LLMs on the AdvBench and HEx-PHI benchmarks, while incurring lower time costs than previous methods. For closed-source models, SIJ achieves an average attack success rate of over 85% across five models in the GPT and Doubao series. Additionally, SIJ exposes a new vulnerability in LLMs that urgently requires mitigation. To address this, we propose a simple adaptive defense method called Self-Reminder-Key to counter SIJ and demonstrate its effectiveness through experimental results. Our code is available at https://github.com/weiyezhimeng/SQL-Injection-Jailbreak.

Warning: This paper contains examples of harmful results generated by LLMs.

Despite efforts towards safety alignment (Ji et al., 2024; Yi et al., 2024) to ensure secure outputs from LLMs, they remain susceptible to jailbreak attacks. When exposed to crafted prompts, LLMs may output harmful content, such as violence, sexual content, and discrimination (Zhang et al., 2024c), which poses significant challenges to the secure and trustworthy development of LLMs. Previous jailbreak attack methods primarily exploit the internal properties or capabilities of LLMs.
Among these, one category of attacks leverages the model's implicit properties, such as various optimization-based attack methods (Zou et al., 2023; Liu et al., 2024; Chao et al., 2023; Guo et al., 2024), which do not explicitly explain why they succeed. For instance, the GCG method (Zou et al., 2023) maximizes the likelihood of the model generating affirmative prefixes, such as "Sure, here is," by optimizing a suffix appended to harmful prompts. However, it fails to explain why the model is sensitive to such suffixes. Another category of attacks exploits the model's explicit capabilities, such as code comprehension (