EMNLP 2024
Preference-Guided Reflective Sampling for Aligning Language Models
Hai Ye, Hwee Tou Ng
Abstract
Iterative data generation and model re-training can effectively align large language models (LLMs) to human preferences. The process of data sampling is crucial, as it significantly influences the success of policy improvement. Repeated random sampling is a widely used method that independently queries the model multiple times to generate outputs. In this work, we propose a more effective sampling method, named Preference-Guided Reflective Sampling (PRS). Unlike random sampling, PRS employs a tree-based generation framework to enable more efficient sampling. It leverages adaptive self-refinement techniques to better explore the sampling space. By specifying user preferences in natural language, PRS can further optimize response generation according to these preferences. As a result, PRS can align models to diverse user preferences. Our experiments demonstrate that PRS generates higher-quality responses with significantly higher rewards. On AlpacaEval and Arena-Hard, PRS substantially outperforms repeated random sampling in best-of-N sampling. Moreover, PRS shows strong performance when applied in iterative offline RL training.

[Figure: (a) Reflective Refinement, with steps 1. Add explicit preference, 2. Sample response, 3. Provide feedback, 4. Revise response, illustrated by an example Response, Feedback, and Refined Response; (b) Tree-based Generation.]
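The four steps named in the figure labels (add explicit preference, sample response, provide feedback, revise response) can be illustrated with a minimal Python sketch. This is one plausible reading of the abstract, not the authors' implementation: the function name `reflective_sample`, the even split of the sampling budget between initial and revised responses, the prompt template, and all `toy_*` stand-ins for the language model and reward model are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def reflective_sample(
    prompt: str,
    preference: str,
    generate: Callable[[str], str],
    give_feedback: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
    reward: Callable[[str, str], float],
    n_total: int = 8,
) -> Tuple[str, List[str]]:
    """Hypothetical two-layer sampling loop: sample, get feedback, revise."""
    # Step 1: add the explicit preference to the prompt (template is assumed).
    conditioned = f"{prompt}\n\nPreference: {preference}"

    # Step 2: sample a first layer of responses (assumed: half the budget).
    n_first = n_total // 2
    first = [generate(conditioned) for _ in range(n_first)]
    best_first = max(first, key=lambda y: reward(conditioned, y))

    # Step 3: obtain feedback on the best first-layer response.
    feedback = give_feedback(conditioned, best_first)

    # Step 4: sample revised responses conditioned on response and feedback.
    second = [
        revise(conditioned, best_first, feedback)
        for _ in range(n_total - n_first)
    ]

    # Pick the highest-reward candidate across both layers.
    candidates = first + second
    best = max(candidates, key=lambda y: reward(conditioned, y))
    return best, candidates


# Toy stand-ins (hypothetical); a real setup would call an LLM and a reward model.
def toy_generate(x: str) -> str:
    return "A timeline chart shows events in order."


def toy_feedback(x: str, y: str) -> str:
    return "Provide references to support each claim."


def toy_revise(x: str, y: str, fb: str) -> str:
    return y + " References: [1]"


def toy_reward(x: str, y: str) -> float:
    # Rewards responses that follow the stated preference for references.
    return float("References" in y)


best, candidates = reflective_sample(
    "Which charts show events over time?",
    "Cite sources for each claim.",
    toy_generate, toy_feedback, toy_revise, toy_reward,
    n_total=4,
)
print(best)
```

With these toy functions, the revised responses satisfy the stated preference and therefore score higher than the initial samples, which is the behavior the abstract attributes to preference-guided refinement. Repeated random sampling would correspond to spending the whole budget on step 2 alone.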