USENIX Security 2025
Exploiting Task-Level Vulnerabilities: An Automatic Jailbreak Attack and Defense Benchmarking for LLMs
Lan Zhang, Xinben Gao, Liuyi Yao, Jinke Song, Yaliang Li
Abstract
Recent advancements in large language models (LLMs) have notably improved their proficiency in executing complex tasks. However, these advancements are accompanied by an increased risk of generating toxic content and leaking private information. "Jailbreaking" is an emerging practice that amplifies this vulnerability by carefully crafting prompts, such as "DAN", to circumvent the LLMs' defenses. Yet existing jailbreaks typically focus on specific prompts or tokens, which leaves them susceptible to countermeasures such as realignment. In contrast to these prompt-level or token-level jailbreaks, we present a novel task-level jailbreak based on "knowledge decomposition", which does not rely on any specific prompts or tokens. Our attack is significantly more resistant to realignment than previous jailbreak techniques. Furthermore, it not only achieves about 10% higher success rates than SOTA attacks but also generates responses that are richer in detail and information. We attribute this to aggregating responses from multiple well-designed queries rather than relying on a single query as in previous attacks, which signifies an elevated threat. In addition, "knowledge decomposition" provides a method for generating a large number of tasks with varying risk levels, thereby establishing a novel benchmark for assessing the defensive effectiveness of LLMs. Warning: this paper contains content that can be offensive in nature.