ACL 2025

Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

Chengda Lu, Xiaoyu Fan, Yu Huang, Rongwu Xu, Jijie Li, Wei Xu

Cited by 11

Abstract

Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Based on the theoretical insights, we propose a novel jailbreak method, FICDETAIL, whose practical performance validates our theoretical findings. Warning: This work contains potentially offensive LLM-generated content.

* Equal contribution.

[Figure: Key factors affecting harmfulness with CoT.]