EMNLP 2025
Jailbreak LLMs through Internal Stance Manipulation
Shuangjie Fu, Du Su, Beining Huang, Fei Sun, Jingang Wang, Wei Chen, Huawei Shen, Xueqi Cheng
Abstract
To confront the ever-evolving safety risks of LLMs, automated jailbreak attacks have proven effective for proactively identifying security vulnerabilities at scale. Existing approaches, including GCG and AutoDAN, generate adversarial prompts for malicious requests that induce LLMs to respond following a fixed affirmative template. However, we observe that relying on a fixed output template is ineffective for certain malicious requests, leading to suboptimal jailbreak performance. In this work, we aim to develop a method that generalizes across all malicious requests. Our approach is inspired by an observation about LLMs' intrinsic safety mechanisms: they tend to exhibit a similar refusal stance across diverse adversarial prompts, resulting in consistent rejections. We propose Stance Manipulation (SM), a novel automated jailbreak approach that generates adversarial prompts to suppress the refusal stance and induce affirmative responses. Experiments across four mainstream open-source LLMs demonstrate SM's superior performance. Under a commonly used setting, SM achieves attack success rates over 77.1% across all models on AdvBench. Specifically, for Llama-2-7b-chat, SM outperforms the best baseline by 25.4%. In further experiments with extended iterations, SM achieves an attack success rate over 92.2% across all models. Our code is publicly available at https://github.com/Zed630/Stance-Manipulation