ICLR 2023

Ask Me Anything: A simple strategy for prompting language models

Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel J. Orr, Neel Guha, Kush Bhatia, Ines Chami, Christopher Ré


Abstract

Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task, with no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly perfect prompt for a task. To mitigate the high degree of effort involved in prompting, we instead ask whether collecting multiple effective, yet imperfect, prompts and aggregating them can lead to a high-quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING PROMPTING (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?"), tend to outperform those that restrict the model outputs ("John went to the park. Output True or False"). Our approach recursively uses the LLM to transform task inputs to the effective QA format. We apply these prompts to collect several noisy votes for the input's true label. We find that these prompts can have very different accuracies and complex dependencies, and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions. We evaluate AMA across open-source model families (EleutherAI, BLOOM, OPT, and T0) and sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT3-175B. We release our code for reproducing the results here: https://github.com/HazyResearch/ama_prompting.
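To make the aggregation step concrete, the following is a minimal sketch of combining noisy votes from several prompts into one prediction. It is not the weak supervision procedure used by AMA, which estimates prompt accuracies and dependencies without labels. Instead it contrasts a plain majority vote with a naive accuracy-weighted vote, on the assumption that per-prompt accuracies are already known. The function names `majority_vote` and `weighted_vote`, the example votes, and the accuracy values are all hypothetical.

```python
import math
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by the most prompts.

    Ties are broken in favor of the label that appears first in `votes`.
    """
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(votes, accuracies):
    """Return the label with the largest summed log-odds weight.

    Simplified stand-in for weak supervision: each prompt's vote is
    weighted by log(acc / (1 - acc)), assuming prompts err independently
    and that their accuracies (strictly between 0 and 1) are given.
    """
    scores = {}
    for label, acc in zip(votes, accuracies):
        weight = math.log(acc / (1.0 - acc))
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)


# Hypothetical votes from three prompts for a single input.
votes = ["yes", "no", "no"]
# Hypothetical accuracies: the first prompt is far more reliable.
accuracies = [0.9, 0.6, 0.6]

print(majority_vote(votes))              # two weak prompts outvote the strong one
print(weighted_vote(votes, accuracies))  # the reliable prompt dominates
```

In this toy case the two aggregators disagree, which illustrates why prompts with very different accuracies call for something more careful than counting votes.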
Recent work has evaluated LLM prompting performance on a broad set of tasks and finds the process to be brittle: small changes to the prompt result in large performance variations [Zhao et al., 2021, Holtzman et al., 2021]. The performance further varies depending on the chosen LLM family [Ouyang et al., 2022, Sanh et al., 2022, inter alia] and model size [Wei et al., 2022a, Lampinen et al., 2022]. To improve reliability, significant effort is dedicated towards designing a painstakingly perfect prompt. For instance, Mishra et al. [2021] and Wu et al. [2022] recommend that users manually explore large search-spaces of strategies to tune their prompts on a task-by-task basis.