EMNLP 2024

BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment

Wenda Xu, Jiachen Li, William Yang Wang, Lei Li

Cited by 1

Abstract

[Figure 1 plot. Panel: Anthropic Helpfulness. Legend: BPO (DPO), Offline DPO, On-Policy DPO.]

Figure 1: Given the same annotation budget, our BPO (when F = 2) significantly outperforms offline DPO (F = 1) on both TL;DR and Anthropic Helpfulness by introducing only one additional preference annotation phase. Its performance (when F = 2) even matches, if not exceeds, that of on-policy DPO (F = T), which collects new annotations at every step.
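To make the role of F in the caption concrete, the following minimal sketch lists the training steps at which new preference annotations would be collected for a given number of annotation phases F over T training steps. The function name `annotation_steps` is hypothetical, and the even spacing of phases is an assumption made for illustration; the caption only fixes the number of phases (F = 1 for offline DPO, F = 2 for BPO, F = T for on-policy DPO), not their placement.

```python
def annotation_steps(total_steps: int, num_phases: int) -> list[int]:
    """Return the 0-indexed training steps at which new preference
    annotations are collected.

    Assumption (not stated in the caption): the F phases are spread
    evenly over the T training steps, starting at step 0.
    """
    if not 1 <= num_phases <= total_steps:
        raise ValueError("num_phases must be between 1 and total_steps")
    # Phase k starts at floor(k * T / F).
    return [(k * total_steps) // num_phases for k in range(num_phases)]


T = 8  # illustrative number of training steps

print(annotation_steps(T, 1))  # offline DPO (F = 1): a single phase at the start
print(annotation_steps(T, 2))  # BPO setting from the caption (F = 2): one extra phase
print(annotation_steps(T, T))  # on-policy DPO (F = T): a phase at every step
```

Under this assumed schedule, F = 2 adds exactly one annotation phase beyond the offline setting, which is the comparison the caption draws.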