ACL2024

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry W. Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc V. Le, Thang Luong

Abstract

Since most large language models (LLMs) are trained once and never updated, they struggle to dynamically adapt to our ever-changing world. In this work, we present FRESHQA, a dynamic QA benchmark that tests a model's ability to answer questions that may require reasoning over up-to-date world knowledge. We develop a two-mode human evaluation procedure to measure both correctness and hallucination, which we use to benchmark both closed and open-source LLMs by collecting >50K human judgments. We observe that all LLMs struggle to answer questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. In response, we develop FRESHPROMPT, a few-shot prompting method that curates and organizes relevant information from a search engine into an LLM's prompt. Our experiments show that FRESHPROMPT outperforms both competing search engine-augmented prompting methods such as SELF-ASK (Press et al., 2022) and commercial systems such as PERPLEXITY.AI. To facilitate future work, we additionally develop FRESHEVAL, a reliable autorater for quick evaluation and comparison on FRESHQA. Our latest results with FRESHEVAL suggest that open-source LLMs such as MIXTRAL (Jiang et al., 2024), when combined with FRESHPROMPT, are competitive with closed-source and commercial systems on search-augmented QA.

[...] issue, it is not easily scalable for real-time knowledge updates (e.g., stock prices). In-context learning (Brown et al., 2020) is an appealing alternative by which real-time knowledge can be injected into an LLM's prompt. While recent work has begun to explore augmenting LLM prompts with web search results (Lazaridou et al., 2022; Press et al., 2022), it is unclear how to take full advantage of search engine outputs to increase LLM factuality.

In this work, we collect FRESHQA, a novel benchmark to evaluate the factuality of LLM generations.
FRESHQA consists of 600 natural questions that are broadly divided into the four main categories shown in Figure 1. FRESHQA's questions span a diverse set of topics with diverse difficulty levels (requiring single-hop and multi-hop reasoning), and require a model to "understand" up-to-date world knowledge to be able to answer correctly. Additionally, FRESHQA is dynamic in nature: some of the ground-truth answers may change over time, and a question classified under a specific category may undergo reclassification at some later point in time (e.g., the current false-premise question "How long has Elon Musk been married to his current spouse?" will fall into the fast-changing category if Elon Musk gets married again in the future).

We benchmark a diverse range of both closed and open-source LLMs under a two-mode evaluation procedure: RELAXED, which measures only whether the main answer is correct; and STRICT, which measures whether all of the claims in the response are factual and up-to-date (i.e., no hallucination). Through an extensive human evaluation (>50K judgments), we shed light on limitations of these models and demonstrate significant room for improvement: for example, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises.

Motivated by these findings, we develop FRESHPROMPT, a few-shot prompting strategy that takes [...]

[...] needed for the answer (e.g., "Who is the CEO of Tesla"); and multi-hop, where the question requires additional reasoning steps to find the answer (e.g., "What country does the Wimbledon women's champion play for?"). Annotators were encouraged to write questions that involve fresh knowledge and appear natural as search engine queries.
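
The two evaluation modes described above, RELAXED and STRICT, can be illustrated with a minimal sketch. The `Judgment` record, its field names, and the toy data are hypothetical and not taken from the paper. The sketch also assumes that a response with an incorrect main answer cannot receive credit under STRICT; this is one reading of the description, not a rule the text states.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Judgment:
    """One human judgment of a model response (hypothetical schema)."""
    main_answer_correct: bool  # is the primary answer right?
    all_claims_factual: bool   # are all claims factual and up-to-date?


def relaxed_credit(j: Judgment) -> bool:
    # RELAXED: only the main answer matters.
    return j.main_answer_correct


def strict_credit(j: Judgment) -> bool:
    # STRICT: every claim must be factual and up-to-date.
    # Assumption: a wrong main answer also fails STRICT.
    return j.main_answer_correct and j.all_claims_factual


def accuracy(judgments: Sequence[Judgment],
             credit_fn: Callable[[Judgment], bool]) -> float:
    """Fraction of responses that receive credit under the given mode."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(credit_fn(j) for j in judgments) / len(judgments)


# Toy data for illustration only; these are not results from the paper.
judgments = [
    Judgment(main_answer_correct=True, all_claims_factual=True),
    Judgment(main_answer_correct=True, all_claims_factual=False),
    Judgment(main_answer_correct=False, all_claims_factual=False),
    Judgment(main_answer_correct=True, all_claims_factual=True),
]

relaxed = accuracy(judgments, relaxed_credit)
strict = accuracy(judgments, strict_credit)
print(f"RELAXED accuracy: {relaxed:.2f}")
print(f"STRICT accuracy:  {strict:.2f}")
```

Under these assumptions STRICT accuracy can never exceed RELAXED accuracy, since every response credited under STRICT is also credited under RELAXED.
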
For false-premise questions, we requested a brief explanation elucidating why the question is flawed.

Quality control: Upon obtaining the initial dataset, we performed rigorous data cleaning and quality checks. This included manual review for well-formed questions, removal of duplicates and invalid questions (e.g., too easy or controversial), and verification of answers and supporting URLs.

[...] et al., 2020) and Chain-of-Thought (CoT) prompting (Wei et al., 2022); FLAN-T5 and FLAN-