ACL2024
Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance
Saurabh Srivastava, Chengyue Huang, Weiguo Fan, Ziyu Yao
Abstract
Large language models (LLMs) have revolutionized zero-shot task performance, mitigating the need for task-specific annotations while enhancing task generalizability. Despite these advancements, current methods using trigger phrases such as "Let's think step by step" remain limited. This study introduces PROMPTED, an approach that optimizes the zero-shot prompts for individual task instances in an "LLMs in the loop" manner. Our comprehensive evaluation across 13 datasets and 10 task types based on GPT-4 reveals that PROMPTED significantly outperforms both the naive zero-shot approaches and a strong baseline (i.e., "Output Refinement") which refines the task output instead of the input prompt. Our experimental results also confirm that this advantage generalizes to the relatively weaker GPT-3.5. Even more intriguingly, we found that leveraging GPT-3.5 to rewrite prompts for the stronger GPT-4 not only matches but occasionally exceeds the efficacy of using GPT-4 as the prompt rewriter. Our research thus offers substantial value in not only enhancing zero-shot LLM performance but also potentially enabling the supervision of LLMs by their weaker counterparts, a capability that has recently attracted much interest.

[...] coming the more versatile paradigm (e.g., how ordinary users send ad-hoc queries to ChatGPT (Liu et al., 2023)), owing to the better task generalizability they bring by eschewing the need for task-specific annotations.

However, LLMs' performance in zero-shot prompting, especially for complex tasks such as mathematical reasoning and information extraction, still lags behind that achieved with few-shot prompting (Wei et al., 2022a). It has also been shown to be sensitive to the design of the prompt instruction (Lu et al., 2021; Pryzant et al., 2023). To improve zero-shot prompting, Kojima et al. (2022) proposed the use of the instruction "Let's think step by step" to elicit reasoning from LLMs.
This is followed by [...] forms the strong baseline of "Output Refinement", demonstrating the advantage of rewriting the input prompt over refining the LLM output. Our further analysis revealed that PROMPTED aids the task LLM in recalling relevant facts for knowledge-intensive tasks, including domain-specific ones (e.g., medical question answering). It also results in more ethical responses by including proper instructions in the rewritten prompt.

Particularly notable is PROMPTED's ability to maintain high accuracy levels when applied to the relatively weaker GPT-3.5. An exciting observation is that, when using GPT-3.5 as the meta LLM to rewrite prompts for GPT-4 as the task LLM, PROMPTED brings on-par or even better performance than using GPT-4 as the meta LLM. This result indicates the promise of supervising a stronger LLM using a weaker one, and we thus expect our work to pave the way for future research towards enhancing AI for tasks that are beyond human capabilities (Burns et al., 2023).
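
To make the roles of the meta LLM and the task LLM concrete, the following is a minimal sketch of per-instance prompt rewriting with the task LLM's output fed back to the meta LLM. It is an illustration under stated assumptions, not the procedure used in our experiments: the function `rewrite_prompt_in_loop`, the wording of the feedback request, the `max_rounds` cap, and the "prompt unchanged" stopping rule are all hypothetical, and the two toy functions stand in for real models such as GPT-3.5 or GPT-4.

```python
from typing import Callable, Tuple

# Hypothetical interface: an LLM is any function from a prompt string to a completion string.
LLM = Callable[[str], str]


def rewrite_prompt_in_loop(
    task_input: str, meta_llm: LLM, task_llm: LLM, max_rounds: int = 3
) -> Tuple[str, str]:
    """Rewrite the prompt for a single instance; return (final prompt, final output)."""
    prompt = task_input
    output = task_llm(prompt)
    for _ in range(max_rounds):
        # Assumed feedback format: show the meta LLM the current prompt and the task output.
        request = (
            f"Current prompt: {prompt}\n"
            f"Task LLM output: {output}\n"
            "Return a better prompt for this instance."
        )
        candidate = meta_llm(request)
        if candidate.strip() == prompt.strip():
            break  # assumed stopping rule: the meta LLM proposes no further change
        prompt = candidate
        output = task_llm(prompt)
    return prompt, output


# Toy stand-ins for real models, used only to make the sketch runnable.
def toy_meta_llm(request: str) -> str:
    return "What is 2 + 3? Think step by step."


def toy_task_llm(prompt: str) -> str:
    return "5" if "step by step" in prompt else "unsure"


if __name__ == "__main__":
    final_prompt, final_output = rewrite_prompt_in_loop(
        "What is 2 + 3?", toy_meta_llm, toy_task_llm
    )
    print(final_prompt)
    print(final_output)
```

In this sketch the meta LLM edits the input prompt, whereas an "Output Refinement" style baseline would instead ask the meta LLM to revise `output` directly. Passing a weaker model as `meta_llm` and a stronger one as `task_llm` corresponds to the weak-to-strong setting discussed above.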