ACL 2025
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
Lang Gao, Jiahui Geng, Xiangliang Zhang, Preslav Nakov, Xiuying Chen
Abstract
Jailbreaking in large language models (LLMs) poses major security risks by tricking models into generating harmful text. However, how jailbreaking operates remains poorly understood, which makes it difficult to develop effective defenses. In this work, we conduct a large-scale analysis of seven jailbreak methods and find that inconsistencies in previous studies arise from insufficient observation samples. Our analysis reveals that jailbreaks shift harmful activations beyond a defined safety boundary, outside of which LLMs become less sensitive to harmful information. We also find that the low and middle layers are critical in driving these shifts, while deeper layers play a lesser role. Leveraging these insights, we propose a novel defense mechanism called Activation Boundary Defense (ABD), which adaptively constrains activations within the safety boundary. To further optimize performance, we use Bayesian optimization to select the most effective layers for applying ABD, and confirm that the low and middle layers have the greatest impact, consistent with our earlier observations. Experiments across multiple benchmarks demonstrate that ABD achieves an average defense success rate (DSR) of over 98% against various jailbreak attacks, with less than 2% impact on the model's overall capabilities.
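To make the idea of constraining activations within a safety boundary concrete, the following is a minimal sketch, not the ABD method itself. The abstract does not say how the boundary is defined or how activations are constrained, so the sketch assumes the simplest possible form: a hypersphere fitted around a set of reference activations, with out-of-boundary activations projected back onto its surface. The names `fit_boundary` and `constrain` are hypothetical, and activations are represented as plain lists of floats rather than model tensors.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_boundary(reference_activations):
    """Fit an assumed spherical boundary: the centroid of the reference
    activations and the largest distance from that centroid."""
    n = len(reference_activations)
    dim = len(reference_activations[0])
    center = [sum(p[i] for p in reference_activations) / n for i in range(dim)]
    radius = max(distance(p, center) for p in reference_activations)
    return center, radius


def constrain(activation, center, radius):
    """Leave in-boundary activations unchanged; project out-of-boundary
    activations back onto the sphere along the line to the center."""
    d = distance(activation, center)
    if d <= radius:
        return list(activation)
    scale = radius / d
    return [c + (a - c) * scale for a, c in zip(activation, center)]


# Toy 2-D example (illustrative values only, not data from the paper).
reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
center, radius = fit_boundary(reference)

inside = constrain([1.0, 1.5], center, radius)   # already within the boundary
shifted = constrain([4.0, 1.0], center, radius)  # pulled back to the boundary
```

In a real model, a step like `constrain` would presumably be applied to hidden states at the selected layers during the forward pass. The paper's actual boundary definition and adaptive constraint may differ substantially from this spherical projection.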