EMNLP2025

Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety

Yuyi Huang, Runzhe Zhan, Lidia S. Chao, Ailin Tao, Derek F. Wong

被引用 3 次

摘要

As large reasoning models are increasingly deployed for complex reasoning tasks, Chainof-Thought prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in LRMs can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) firstperson commitments that induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; and (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, selfrole priming, and condition chain hijacking. Each stage independently reduces refusal rates, while their combination further compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection. Our findings highlight the need for trajectory-level alignment oversight in longform reasoning beyond token-level alignment. Warning: This paper contains jailbreak contents that can be offensive in nature.