WWW2026
The Asymmetric Vulnerability: Bypassing LLM Defenses via Guardrail-Model Mismatch
Junyi Wang, Zhibin Zhu, Chuanyi Liu
Abstract
The rapid proliferation of large language models (LLMs) in web-scale applications has heightened the need for robust security mechanisms. To meet this need, various guardrail strategies have been introduced, including input filters, output moderation, and model alignment. Among them, input-side guardrails are commonly adopted as the first line of defense against adversarial prompts. However, we identify a critical architectural weakness: a representation asymmetry between input-side guardrails and the core LLMs, which creates a security gap that adversaries can exploit. Through systematic analysis, we show that guardrails are fragile under minor character perturbations, while LLMs remain semantically resilient and can still reconstruct malicious intent from noisy inputs. Exploiting this asymmetry, we propose RepMism, a hybrid adversarial framework that combines character injection with chain-of-thought hijacking, coordinated through hierarchical scheduling and safety continuation. Extensive empirical evaluation across four commercial LLMs and five state-of-the-art guardrails shows that RepMism achieves high attack success rates. Importantly, we identify a "success interval" in which perturbations bypass guardrail detection yet remain interpretable to target models. These findings expose key flaws in current multi-layer security architectures and offer actionable guidance for building more resilient defenses through perturbation-aware training and cross-layer representation alignment.