ICLR 2025

R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Mintong Kang, Bo Li

Abstract

As large language models (LLMs) become increasingly prevalent across various applications, it is critical to establish safety guardrails that moderate the inputs and outputs of LLMs and ensure compliance with safety policies. Existing guardrail models, such as OpenAI Mod and LlamaGuard, treat various safety categories (e.g., "self-harm", "self-harm/instructions") independently and fail to explicitly capture the intercorrelations among them. This leads to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreak attacks, and inflexibility regarding new safety categories. To address these limitations, we propose R2-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, R2-Guard comprises two parts: a data-driven category-specific learning component and a reasoning component. The learning component provides the unsafety probabilities of an input for the different safety categories. We then encode safety knowledge among the categories as first-order logical rules and embed them into a probabilistic graphical model (PGM), which serves as the reasoning component. The unsafety probabilities of the different categories from the data-driven models are passed to the reasoning component for final inference. We employ two types of PGMs, Markov logic networks (MLNs) and probabilistic circuits (PCs), and optimize PCs to achieve a precision-efficiency balance via an improved graph structure. We also propose different methods to optimize the weights of the knowledge rules. To further perform stress tests, we employ a pairwise construction method to build a new safety benchmark, TwinSafety, which features principled categories and presents new challenges for guardrail models. We show that R2-Guard is effective even given unrepresentative categories or challenging jailbreak prompts.
We compare R2-Guard with eight strong guardrail models on six safety benchmarks, and demonstrate the robustness of R2-Guard against four SOTA jailbreak attacks. R2-Guard significantly surpasses LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreak attacks. We further reveal that R2-Guard can effectively adapt to unseen safety categories by simply editing the reasoning graph.
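The two-stage pipeline described above (category-level unsafety probabilities feeding a rule-based probabilistic reasoner) can be illustrated with a toy Markov-logic-style computation. The following is a minimal sketch under our own assumptions, not the R2-Guard implementation: the function name `mln_unsafety`, the rule format `(premise, conclusion, weight)`, the category names, and the weights are all hypothetical, and inference is done by brute-force enumeration of possible worlds, which is only feasible for a handful of categories.

```python
import math
from itertools import product


def mln_unsafety(probs, rules, target="unsafe", eps=1e-6):
    """Toy MLN-style inference (illustrative only).

    probs : dict mapping a category name to its unsafety probability,
            as a category-specific model might produce it.
    rules : list of (premise, conclusion, weight) tuples, each read as the
            implication "premise => conclusion" with a non-negative weight.
    Returns the probability that `target` is true, computed by enumerating
    every truth assignment (possible world) over the categories and target.
    """
    variables = list(probs) + [target]
    numerator = 0.0    # total factor mass of worlds where target is true
    denominator = 0.0  # total factor mass of all worlds (partition function)
    for world in product([0, 1], repeat=len(variables)):
        assign = dict(zip(variables, world))
        log_factor = 0.0
        # Data-driven part: likelihood of each category's truth value.
        for category, p in probs.items():
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            log_factor += math.log(p if assign[category] else 1.0 - p)
        # Knowledge part: add a rule's weight when the implication holds.
        for premise, conclusion, weight in rules:
            if (not assign[premise]) or assign[conclusion]:
                log_factor += weight
        factor = math.exp(log_factor)
        denominator += factor
        if assign[target]:
            numerator += factor
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical category scores and rule weights, chosen for illustration.
    scores = {"self_harm": 0.9, "self_harm_instructions": 0.2}
    knowledge = [
        ("self_harm", "unsafe", 5.0),
        ("self_harm_instructions", "unsafe", 5.0),
        ("self_harm_instructions", "self_harm", 3.0),
    ]
    print(mln_unsafety(scores, knowledge))
```

In this sketch, supporting a new safety category amounts to adding one entry to `scores` and one or more tuples to `knowledge`, which loosely mirrors the idea of adapting to unseen categories by editing the reasoning graph. Exact enumeration grows exponentially with the number of categories, which is one motivation for more efficient structures such as probabilistic circuits.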