WWW2026

PADD: Prefix-based Attention Divergence Detector for LLM Jailbreaks

Ziqun Bao, Jiaqiang Niu, Yuchen Shao, Chengcheng Wan

Abstract

Large Language Models (LLMs) have achieved remarkable progress in understanding and generation tasks, yet they remain highly susceptible to adversarial prompt attacks that bypass safety mechanisms and induce the generation of harmful content. Existing pre-generation and post-generation defenses typically rely on intent recognition, surface-level features, or external classifiers, which leaves them vulnerable to evasion via subtle prompt perturbations and incurs substantial computational overhead. In this paper, we propose PADD (Prefix-based Attention Divergence Detector), a lightweight pre-generation defense mechanism that leverages internal model signals for robust attack detection. At its core, PADD prepends a lightweight safety prefix to the input prompt and compares the attention distributions of the original and prefixed prompts. By transforming cross-prompt comparisons into self-comparisons via composite signals of attention divergence and attention plasticity, PADD achieves strong separability between adversarial and benign prompts with low-latency detection, without requiring modification or fine-tuning of the base model. Extensive experiments across four mainstream open-source LLMs and multiple public benchmarks demonstrate that PADD significantly reduces attack success rates (0.4–3.0%) while maintaining low false rejection rates (0–5.2%). These results position PADD as a scalable, efficient, and practical safeguard for LLM safety.
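To make the self-comparison idea concrete, the following is a minimal sketch of how a divergence between the attention a model assigns to the same prompt tokens, with and without a prepended prefix, might be computed. It is an illustration under stated assumptions, not PADD's actual formulation: the abstract does not define its divergence measure, so Jensen-Shannon divergence is swapped in here, and the function names, the renormalization over prompt positions, and the toy attention vectors are all hypothetical. The attention plasticity signal is not sketched because the abstract does not describe it.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def attention_divergence(orig_attn, prefixed_attn, prefix_len):
    """Compare attention over the prompt tokens with and without a prefix.

    ASSUMPTION (not from the paper): the prefixed distribution is restricted
    to the prompt positions and renormalized before the comparison.
    """
    prompt_part = prefixed_attn[prefix_len:]
    if len(prompt_part) != len(orig_attn):
        raise ValueError("prompt lengths differ between the two runs")
    return js_divergence(normalize(orig_attn), normalize(prompt_part))


# Toy example: 3 prompt tokens, 2 hypothetical prefix tokens.
orig = [0.1, 0.2, 0.7]
prefixed = [0.3, 0.2, 0.1, 0.1, 0.3]
score = attention_divergence(orig, prefixed, prefix_len=2)
print(f"attention divergence: {score:.4f}")
```

In a real detector, the attention vectors would come from the base model's forward passes, and a score such as this would be combined with other signals and compared against a calibrated threshold; both steps are outside what the abstract specifies.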