ACL2024

Blinded by Generated Contexts: How Language Models Merge Generated and Retrieved Contexts When Knowledge Conflicts?

Hexiang Tan, Fei Sun, Wanli Yang, Yuanzhuo Wang, Qi Cao, Xueqi Cheng

Abstract

While auxiliary information has become a key to enhancing Large Language Models (LLMs), relatively little is known about how LLMs merge these contexts, specifically contexts generated by LLMs and those retrieved from external sources. To investigate this, we formulate a systematic framework to identify whether LLMs' responses, derived from the integration of generated and retrieved contexts, are attributed to either generated or retrieved contexts. To easily trace the origin of the response, we construct datasets with conflicting contexts, i.e., each question is paired with both generated and retrieved contexts, yet only one of them contains the correct answer. Our experiments reveal a significant bias in several LLMs (GPT-4/3.5 and Llama2) to favor generated contexts, even when they provide incorrect information. We further identify two key factors contributing to this bias: i) contexts generated by LLMs typically show greater similarity to the questions, increasing their likelihood of being selected; ii) the segmentation process used in retrieved contexts disrupts their completeness, thereby hindering their full utilization in LLMs. Our analysis enhances the understanding of how LLMs merge diverse contexts, offering valuable insights for advancing current augmentation methods for LLMs.

[...] et al., 2022; Sun et al., 2023), e.g., GenRead (Yu et al., 2022), instruct LLMs to initially generate a background context tailored to the given question, which is then employed as the basis for producing the final answer. In contrast, retrieval-augmented approaches (Lewis et al., 2020; Ram et al., 2023) adopt an alternative strategy by incorporating relevant passages from external corpora, e.g., Wikipedia, as context, thereby notably enhancing LLMs' capability to address challenges like knowledge updates (Jang et al., 2022) and long-tail knowledge (Kandpal et al., 2023).
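The conflicting-context setup described in the abstract (each question paired with one generated and one retrieved context, only one of which contains the correct answer) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Example` class, the helper names, the normalized substring matching, and the toy question are not taken from the paper, and the authors' actual framework may decide correctness and attribution differently.

```python
import re
import string
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def contains_answer(text: str, answers: list) -> bool:
    """True if any gold answer appears in the text after normalization."""
    norm = normalize(text)
    return any(normalize(a) in norm for a in answers)


@dataclass
class Example:  # hypothetical container, not the paper's data format
    question: str
    gold_answers: list
    generated_ctx: str
    retrieved_ctx: str


def is_conflicting(ex: Example) -> bool:
    """Keep an example only if exactly one context contains a gold answer."""
    gen_ok = contains_answer(ex.generated_ctx, ex.gold_answers)
    ret_ok = contains_answer(ex.retrieved_ctx, ex.gold_answers)
    return gen_ok != ret_ok


def attribute(ex: Example, response: str) -> str:
    """Guess which context a response came from (assumes is_conflicting(ex))."""
    gen_ok = contains_answer(ex.generated_ctx, ex.gold_answers)
    if contains_answer(response, ex.gold_answers):
        # A correct response is traced to the context holding the answer.
        return "generated" if gen_ok else "retrieved"
    # An incorrect response is traced to the wrong context only if it
    # literally appears there; otherwise its origin is left open.
    wrong_ctx = ex.retrieved_ctx if gen_ok else ex.generated_ctx
    resp = normalize(response)
    if resp and resp in normalize(wrong_ctx):
        return "retrieved" if gen_ok else "generated"
    return "other"


# Toy example; the generated context is deliberately incorrect.
ex = Example(
    question="What is the capital of Australia?",
    gold_answers=["Canberra"],
    generated_ctx="The capital of Australia is Sydney.",
    retrieved_ctx="Canberra is the capital city of Australia.",
)
print(is_conflicting(ex), attribute(ex, "Sydney"))
```

Because only one context in a conflicting pair supports the correct answer, a response's correctness alone is usually enough to trace it back to a source, which is what makes this kind of dataset convenient for measuring a preference between context types.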
Building on the foundations laid by generation-augmented and retrieval-augmented methods, recent hybrid approaches have attempted to integrate them to further improve performance in tasks like Question Answering (QA) (Yu et al., 2022; Mallen et al., 2023). These hybrid approaches face a significant challenge: conflicts between diverse sources can impede the effectiveness of information integration (Zhang et al., 2023). While recent works have investigated conflicts within contexts from a single source, either only retrieved (Chen et al., 2022) or generated (Xie et al., 2023), it remains unclear how LLMs resolve conflicts between generated and retrieved contexts. This study, therefore, aims to investigate the underlying mechanisms by which [...]

[...] with semantic integrity. The segmentation process used in retrieved contexts may disrupt their completeness, thereby hindering their full utilization in LLMs.

This work preliminarily explores the growing challenge of LLMs utilizing contexts from diverse sources, especially in light of the increasing prevalence of LLM-generated content on the internet, which may contain potential misinformation (Pan et al., 2023). Furthermore, our findings offer valuable guidance for enhancing existing retrieval-augmented methods, such as optimizing passage segmentation in retrieval systems. Our main contributions can be summarized as:
• We uncover a critical bias in existing LLMs, where they heavily rely on generated contexts regardless of correctness, indicating an insufficient use of diverse information sources.
• To facilitate controlled experiments, we develop a specialized framework for constructing tailored datasets and excluding confounding factors, e.g., input order and context length.
• Our extensive analyses have identified two key factors, i.e., text similarity and semantic completeness, in the context utilization of LLMs. Moreover, we reveal that the confirmation bias