USENIX Security 2026

JailbreakScope: Interpreting Jailbreak Mechanism through Representation and Circuit Analyses

Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Wenhui Zhang, Qinglong Wang, Rui Zheng

Abstract

Large Language Models (LLMs) exhibit impressive performance but remain vulnerable to jailbreak attacks, in which adversarial prompts are crafted to bypass safety alignment and elicit unexpected responses. Despite the prevalence of such attacks, the mechanisms that enable them remain poorly understood. Recent studies primarily focus on static representation shifts or on identifying components associated with generation safety. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation that connects circuit failures to representation changes, leaving significant gaps in our understanding of the jailbreak mechanism. In this paper, we propose JailbreakScope, an interpretation framework that analyzes jailbreak mechanisms from two perspectives: representation (how jailbreaks distort an LLM's perception of harmfulness) and circuit (how jailbreaks affect the circuits that are important for generation safety), tracking the evolution of both throughout the entire generation process. We conduct in-depth evaluations on 5 mainstream LLMs under 7 jailbreak strategies. Our evaluation reveals a general pattern: jailbreaks amplify components that reinforce affirmative responses while suppressing those that produce refusals. This shifts representations toward safe regions, leading LLMs to provide responses instead of refusals. Moreover, we find a strong and consistent correlation between representation deception and circuit activation shift across diverse jailbreaks and multiple LLMs.