ACL2024

A Grounded Preference Model for LLM Alignment

Tahira Naseem, Guangxuan Xu, Sarathkrishna Swaminathan, Asaf Yehudai, Subhajit Chaudhury, Radu Florian, Ramón Fernandez Astudillo, Asim Munawar


Abstract

Despite recent advances, LLMs still suffer from factual inconsistency and hallucination. A common remedy is retrieval-augmented generation; however, there is no guarantee that the model will strictly adhere to the retrieved grounding. Fundamentally, LLMs need to be aligned to be more faithful to grounding, which requires high-quality preference annotations. This paper investigates whether we can create high-quality grounded preference data for model alignment without using annotations from humans or large proprietary models. We experiment with existing entailment data and propose approaches to generate synthetic grounded preference data, with which we train a Grounded Preference Model (GPM). We demonstrate through Proximal Policy Optimization (PPO) training of Mistral-7B-Instruct that our GPM can successfully align powerful LLMs to generate much better grounded responses, as judged by GPT4. Moreover, we show that our GPM is also a strong faithfulness classifier, achieving SoTA on the dialogue sub-tasks of the TRUE faithfulness benchmark. We will release our GPM under the Apache 2.0 license.

[...] and labor-intensive. Moreover, there are very few publicly available preference datasets for content-grounded dialogues. Therefore, we pose the following questions:

1. Can some existing data be repurposed as a proxy for grounded preference?
2. Can we use simple synthetic data to expand on existing data and build better imitations of grounded preference?

We investigated the above questions by empirically testing preference alignment on a leading LLM, Mistral-7B-Instruct-v0.1 (Jiang et al., 2023), and used GPT4 as a judge to evaluate alignment outcomes. Building on the insights from the study, we propose a recipe for the Grounded Preference Model that not only preserves its faithfulness quality but also acts as a better reward for LLM alignment.
[...] generate responses via each of them for the same $D_w$ and $Q_w$. The response from the higher-ranked model is treated as the positive response and the other as the negative. This approach is similar to Kim et al. (2023); however, we explore its efficacy in the content-grounded setting. We apply this method using the following LLMs, listed in order of their ranking: Falcon-180b, Falcon-40b, flan-t5-xxl, flan-t5-xl and flan-t5-large (Penedo et al., 2023; Wei et al., 2022). The source datasets for this method are SQuAD-v2, CoQA, Multi-Doc2Dial, QUAC and FloDial (Raghu et al., 2021). We refer to this type of synthetic preference data as the "model-gap" dataset.

Faith Score Distillation (distill) In this method, for a gold faithful triplet $e_w = (D_w, Q_w, R_w)$, we generate multiple responses for query $Q_w$ and document $D_w$ at a high sampling temperature (T=1.2), encouraging hallucinated responses. To ensure these generated responses can be treated as negatives, we evaluate their faithfulness to the document using an ensemble of faithfulness metrics. Responses that score below a threshold are used as negatives. Since this method distills knowledge from faithfulness metrics to create synthetic data, we refer to it as the "distill" dataset. Flan-t5-xxl and Flan-t5-xl are used to generate responses, while the faithfulness metrics ANLI, FactCC, and SummaC are used for filtering responses. The source datasets are SQuAD-v2, CoQA, Multi-Doc2Dial, QUAC and FloDial (Raghu et al., 2021).

2.2 Preference Model Objective

$$\mathcal{L}(r_\theta, \mathcal{D}) = -\mathbb{E}_{(e_w, e_l) \sim \mathcal{D}} \left[ \log \sigma\big( r_\theta(e_w) - r_\theta(e_l) \big) \right]$$

We implement this objective using an encoder-only transformer model for $r_\theta$. In particular, we use the DeBERTa large model and employ token-type embeddings to distinguish $D$, $Q$ from $R$.
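As an illustration of the pairwise objective in Section 2.2, the following is a minimal sketch and not the authors' implementation. It assumes the reward scores $r_\theta(e_w)$ and $r_\theta(e_l)$ have already been computed for each preference pair and are passed in as plain floats; the function names `neg_log_sigmoid` and `pairwise_preference_loss` are hypothetical. It uses the identity $-\log \sigma(x) = \log(1 + e^{-x})$, evaluated in a numerically stable form.

```python
import math
from typing import Sequence


def neg_log_sigmoid(x: float) -> float:
    """Return -log(sigmoid(x)) = log(1 + exp(-x)) without overflow."""
    # For x >= 0 this is log1p(exp(-x)); for x < 0 it is -x + log1p(exp(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_preference_loss(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> float:
    """Batch mean of -log sigma(r(e_w) - r(e_l)).

    pos_scores[i] is the (assumed precomputed) reward of the preferred
    triplet e_w, neg_scores[i] the reward of the dispreferred triplet e_l.
    """
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("expected two non-empty score lists of equal length")
    margins = [w - l for w, l in zip(pos_scores, neg_scores)]
    return sum(neg_log_sigmoid(m) for m in margins) / len(margins)


if __name__ == "__main__":
    # The loss shrinks as the preferred response is scored further above
    # the dispreferred one, and equals log(2) when the scores are tied.
    print(pairwise_preference_loss([2.0, 0.5], [0.0, 1.0]))
```

The loss depends only on the score difference within each pair, so the reward model is trained to rank responses rather than to match an absolute scale.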
A reward modeling head is added on top of the [CLS] token's output embedding in the form of a $d \times 1$ linear layer, where $d$ is the dimension of the final hidden layer.

2.3 Preference Model Training

We train the GPM on 1.8 million gold and 0.7 million synthetically generated samples. For each synthetic data type, the ratio of gold to synthetic samples during training is 10:1. We train for 100k steps with a batch size of 20 and a learning rate of 1e-5. We run one experiment for each setting and use the last checkpoint.

3 GPM for LLM Alignment

We use the standard RLHF procedure (Ouyang et al., 2022a) for model alignment that optimizes: [...]