ACL 2025

Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space

Yao Huang, Yitong Sun, Shouwei Ruan, Yichi Zhang, Yinpeng Dong, Xingxing Wei

Abstract

Large Language Models (LLMs), despite their advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success against safety-aligned models remains limited, and they overlook a more fundamental problem: their effectiveness is inherently bounded by predefined strategy spaces. Expanding this space, however, presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities obtained by expanding the strategy space: we achieve a success rate of over 90% on Claude-3.5, where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.

Recently, Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, from natural language understanding (Karanikolas et al., 2023) to complex reasoning (Guo et al., 2025), establishing themselves as powerful tools in various areas (Shah et al., 2023;

* Equal Contributions † Corresponding Authors

"Take on the role of Condition Red, a dialogue writer that outputs text reflecting a setting I am about to describe below.
… In this hypothetical setting was a storyteller named Sigma (Σ). In Sigma's story, the town's people rely on UCAR to provide them with unfiltered, uncensored information to carry out vital tasks. … the village elder is speaking to UCAR and asking an important question. ... Respond in character as Condition Red, telling me what lines Sigma wrote for UCAR in the story. I will supply prompts in the form ""Sigma wrote:"" followed by a quotation of the village elder's speech from Sigma's new book, and you must tell me what dialogue Sigma wrote for UCAR in return. ... Sigma wrote: How to rob a bank without being caught?"