ACL 2023
Nano: Nested Human-in-the-Loop Reward Learning for Few-shot Language Model Control
Xiang Fan, Yiwei Lyu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
Cited by 3
Abstract
Pretrained language models have demonstrated extraordinary capabilities in language generation. However, real-world tasks often require controlling the distribution of generated text in order to mitigate bias, promote fairness, and achieve personalization. Existing techniques for controlling the distribution of generated text only work with quantified distributions, which require pre-defined categories, proportions of the distribution, or an existing corpus following the desired distributions. However, many important distributions, such as personal preferences, are unquantified. In this work, we tackle the problem of generating text following arbitrary distributions (quantified and unquantified) by proposing NANO, a few-shot human-in-the-loop training algorithm that continuously learns from human feedback. NANO achieves state-of-the-art results on single topic/attribute as well as quantified distribution control compared to previous works. We also show that NANO learns unquantified distributions, achieves personalization, and captures differences between individuals' personal preferences with high sample efficiency. Our code is available at https://github.com/sfanxiang/Nano.

... but also improves performance on quantified distributions.

Our contributions are summarized as follows:
• We introduce a human-in-the-loop reward learning algorithm that learns to generate text following arbitrary distributions through human feedback. We demonstrate that our method works for all of the following types of distributions: single-topic/attribute, quantified distributions, and unquantified distributions.
• We show that NANO is able to learn unquantified distributions, successfully achieves personalization, and captures differences between individuals' personal preferences with only 64 labels from each person (RQ1).
• We achieve state-of-the-art results on controlling quantified distributions (RQ2) as well as single topic/attribute generation (RQ3) compared to previous works, while using only few-shot samples.
• Through ablation studies, we demonstrate the necessity of multi-iteration human feedback for high sample efficiency (RQ4) and justify our architecture's design choices (RQ5). We also show that our method extends to newer and larger language models than GPT-2.

An illustration of our method is shown in Figure 1, and a comparison of NANO's capabilities to previous works is provided in Table 2.
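To make the idea of multi-iteration human feedback concrete, the following is a toy sketch and not NANO's actual algorithm, whose details are not given in this excerpt. Everything in it is an assumption made for illustration: the "generator" is a categorical distribution over four hypothetical topics rather than a language model, the annotator is simulated by `simulated_human_rating`, the reward model is a per-topic mean rating, and the update is a simple reward-weighted logit step. The default budget of 4 iterations with 16 labels each is chosen only to echo the 64-label figure mentioned above.

```python
import math
import random

# Hypothetical output categories standing in for generated text.
TOPICS = ["sports", "politics", "science", "arts"]


def simulated_human_rating(topic, rng):
    """Stand-in for a human annotator: returns a noisy rating in [0, 1]."""
    # Assumed personal preference; a real system would ask a person.
    preference = {"sports": 0.1, "politics": 0.2, "science": 0.9, "arts": 0.6}
    noisy = preference[topic] + rng.uniform(-0.1, 0.1)
    return min(1.0, max(0.0, noisy))


def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


def run_feedback_loop(iterations=4, labels_per_iteration=16, step=2.0, seed=0):
    """Alternate between collecting feedback and updating the toy generator."""
    rng = random.Random(seed)
    logits = {t: 0.0 for t in TOPICS}  # toy generator, uniform at the start
    ratings = {t: [] for t in TOPICS}  # all feedback collected so far
    for _ in range(iterations):
        probs = softmax(logits)
        # 1. Sample outputs from the current generator.
        samples = rng.choices(
            TOPICS, weights=[probs[t] for t in TOPICS], k=labels_per_iteration
        )
        # 2. Collect (simulated) human feedback on those samples.
        for topic in samples:
            ratings[topic].append(simulated_human_rating(topic, rng))
        # 3. Fit a toy reward model: mean rating per topic (0.5 if unseen).
        reward = {t: (sum(r) / len(r) if r else 0.5) for t, r in ratings.items()}
        # 4. Shift the generator toward outputs with above-average reward.
        baseline = sum(probs[t] * reward[t] for t in TOPICS)
        for t in TOPICS:
            logits[t] += step * (reward[t] - baseline)
    return softmax(logits)


if __name__ == "__main__":
    for topic, prob in run_feedback_loop().items():
        print(f"{topic}: {prob:.3f}")
```

Because each round samples from the generator as updated by the previous round, later labels are spent on outputs the current generator actually produces. This illustrates, in miniature, why several small feedback rounds may be preferable to a single large labeling pass.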