ACL 2024

Preemptive Answer "Attacks" on Chain-of-Thought Reasoning

Rongwu Xu, Zehan Qi, Wei Xu

Abstract

Large language models (LLMs) showcase impressive reasoning capabilities when coupled with Chain-of-Thought (CoT) prompting. However, the robustness of this approach warrants further investigation. In this paper, we introduce a novel scenario termed preemptive answers, where the LLM obtains an answer before engaging in reasoning. This situation can arise inadvertently or be induced by malicious users through prompt injection attacks. Experiments reveal that preemptive answers significantly impair the model's reasoning capability across various CoT methods and a broad spectrum of datasets. To bolster the robustness of reasoning, we propose two measures aimed at mitigating this issue to some extent.

Inspired by these studies, we introduce the scenario of preemptive answers, wherein the answer is obtained by the LLM before it engages in reasoning, as illustrated in Figure 1. From automated customer service (Rajat, 2024) to educational aids (Kung et al., 2023), the potential for preemptive answers to inadvertently or maliciously skew the outcome of LLM reasoning is significant. Our work distinguishes itself from prior literature in two key aspects. Firstly, unlike prior studies that predominantly concentrate on either robustness analysis or safety concerns separately, the preemptive answer scenario can arise unintentionally from user input or can be launched by adversaries as a form of prompt-injection attack (Greshake et al., 2023). Secondly, unlike similar efforts such as (Wang [...]

We propose two strategies to mitigate preemptive answer effects: problem restatement and self-reflection. The former prevents distraction, while the latter addresses misdirection in reasoning.

Problem restatement. Restating the problem aims to recalibrate the model's focus back to the original question, thereby mitigating the influence of the preemptive answer.
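
As a rough illustration of the preemptive answer scenario and of problem restatement, the sketch below assembles a zero-shot CoT prompt as plain text. It is a minimal sketch under stated assumptions, not the prompt templates used in the experiments: the function name `build_cot_prompt`, its parameters, and all template wording other than the standard Zero-Shot CoT trigger phrase are hypothetical.

```python
from typing import Optional

# Standard Zero-Shot CoT trigger phrase (Kojima et al., 2022).
COT_TRIGGER = "Let's think step by step."


def build_cot_prompt(
    question: str,
    preemptive_answer: Optional[str] = None,
    restate_problem: bool = False,
) -> str:
    """Assemble a zero-shot CoT prompt as plain text.

    The template wording is hypothetical and only illustrates the idea;
    it is not taken from the paper.
    """
    parts = [f"Question: {question}"]
    if preemptive_answer is not None:
        # The preemptive answer appears before any reasoning has happened.
        parts.append(f"The answer is {preemptive_answer}.")
    if restate_problem:
        # Mitigation: repeat the original question so that it is the most
        # recent task content in the prompt before reasoning begins.
        parts.append(f"To restate the problem: {question}")
    parts.append(COT_TRIGGER)
    return "\n".join(parts)


# Example: an incorrect preemptive answer, with restatement enabled.
prompt = build_cot_prompt(
    "What is 17 + 25?", preemptive_answer="40", restate_problem=True
)
print(prompt)
```

Keeping the restatement as a separate, optional step makes it easy to compare the same question with and without the mitigation.
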
By reintroducing the problem statement, the model's attention mechanism is directed toward the question itself. Furthermore, restating the problem does not negatively affect the reasoning process; instead, it reinforces the model's engagement with the pertinent aspects of the task.

Self-reflection. Introduced by Shinn et al. (2023), self-reflection is a technique initially designed to assist LLMs in addressing hallucinations and optimizing planning. It involves prompting the model to self-assess its outputs and identify potential fallacies. Employing a similar approach, self-reflection enables the model to more effectively integrate information across the rationales, allowing for the identification and rectification of inconsistencies that may arise due to the preemptive answer.

[...] strategies on the GSM8K and HotpotQA datasets using ChatGPT for the malicious preemptive answer attack. For additional results on other datasets, please see § B.5. Overall, the two introduced mitigation strategies partially offset the negative impact of preemptive answers on reasoning performance. While we observe that these mitigations consistently lower the ASR and improve the ACC across all setups, they fall short of fully negating the effects. This highlights the challenging threat of preemptive answers, underscoring the need for further investigation into more robust CoT methods and defenses against such attacks.

4 Related Work

4.1 Chain-of-Thought Reasoning

To leverage LLMs on reasoning tasks, Wei et al. (2022) introduce the concept of CoT by extending in-context learning (ICL) with step-by-step reasoning demonstrations, dubbed Few-Shot CoT. Meanwhile, Kojima et al. (2022) observe that simply instructing the LLM can elicit CoT without relying on demonstrations, dubbed Zero-Shot CoT. Subsequently, numerous approaches have been developed to enhance