NDSS 2026
Bleeding Pathways: Vanishing Discriminability in LLM Hidden States Fuels Jailbreak Attacks
Yingjie Zhang, Tong Liu, Zhe Zhao, Guozhu Meng, Kai Chen
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreak attacks that exploit adversarial prompts to circumvent safety measures. Current safety fine-tuning approaches face two critical limitations. First, they often fail to strike a balance between security and utility, where stronger safety measures tend to over-reject harmless user requests. Second, they frequently miss malicious intent concealed within seemingly benign tasks, leaving models exposed to exploitation. Our work identifies a fundamental cause of these issues: during response generation, an LLM's capacity to differentiate harmful from safe outputs deteriorates. Experimental evidence confirms this, revealing that the separability between hidden states for safe and harmful responses diminishes as generation progresses. This weakening discrimination forces models to make compliance judgments earlier in the generation process, restricting their ability to recognize developing harmful intent and contributing to both aforementioned failures. To mitigate this vulnerability, we introduce DEEPALIGN, an inherent defense framework that enhances the safety of LLMs. By applying contrastive hidden-state steering at the midpoint of response generation, DEEPALIGN amplifies the separation between harmful and benign hidden states, enabling continuous intrinsic toxicity detection and intervention throughout the generation process. Moreover, it facilitates contextually appropriate safe responses to harmful queries, thereby expanding the feasible space of safe responses. Evaluations demonstrate DEEPALIGN's efficacy. Across diverse LLMs spanning varying architectures and scales, it reduced the attack success rates of nine distinct jailbreak attacks to near-zero or minimal levels. Crucially, it preserved model capability while reducing over-refusal. Models equipped with DEEPALIGN exhibited up to 3.5% lower error rates in rejecting challenging benign queries and maintained standard task performance with less than 1% decline.
This marks a substantial advance in the safety-utility Pareto frontier.
Content warning: This paper contains unfiltered content generated by LLMs that may be offensive to readers.
* Corresponding authors
and fuzz testing [36], [56]. However, the increasing prevalence of LLMs has brought critical security concerns to the forefront. A notable issue is their susceptibility to jailbreak attacks [64], [43], [29], [48], where crafted prompts bypass safeguards to generate harmful outputs. Such attacks pose significant risks, from societal harm to cybersecurity threats like remote code execution (RCE) [28]. Recent cases show real-world dangers, such as malware using public LLM APIs to dynamically create attack payloads that bypass detection [7].
Current defenses primarily adopt two paradigms: external security guardrails and endogenous safeguards. While API providers offer guardrails, they face accuracy-recall tradeoffs. This challenge is particularly acute for smaller organizations and individual developers relying on platforms like Hugging Face, who lack access to enterprise-grade guardrails. This highlights the need for stronger endogenous safeguards.
Endogenous safeguards are inherently limited because safety alignment captures human preferences, something pretraining alone cannot capture. Current approaches address this by fine-tuning models on malicious queries and refusal responses, teaching them to reject harmful inputs [49]. However, these safeguards remain operationally brittle. Adversaries exploit semantic ambiguities and generation dynamics to craft inputs that bypass alignment efforts, inducing models to propagate harmful content. This reveals two interconnected challenges:
• Intent Disambiguation in Adversarial Contexts. Malicious intent is often artfully embedded within semantically complex prompts or benign tasks, which confounds its reliable identification.
• The Security-Utility Pareto Trade-off. Security enhancements frequently increase refusal rates on benign prompts, degrading utility and user experience, a fundamental constraint on robust LLM deployment.
Dynamic Degradation in Discriminative Capacity: A Core Vulnerability Underpinning Safety-Utility Trade-offs. Our research reveals a critical, previously overlooked vulnerability: during response generation, the model's inherent capability to distinguish between benign and harmful token sequences progressively degrades. This is reflected in a measurable phenomenon: as the model generates more harmful response