ICLR 2025
Post-hoc Reward Calibration: A Case Study on Length Bias
Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M. Ponti, Ivan Titov
Abstract
Research Question: Can we correct or mitigate biases in reward signals without extra training or data? This paper attempts to answer this question by framing it as Post-hoc Reward Calibration. Specifically, our method uses only a batch of scored prompt-response examples to mitigate the reward model's (RM's) bias, without intervening in preference data collection, RM training, or the RLHF phase.
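To make the idea of calibrating from a batch of scored examples concrete, here is a minimal sketch for the length-bias case. It assumes the bias can be approximated by a linear trend of reward against response length, which is an illustrative simplification and not necessarily the estimator used in the paper. The function name `calibrate_rewards` and its arguments are hypothetical.

```python
from statistics import mean


def calibrate_rewards(rewards, lengths):
    """Remove a linear length trend from a batch of reward scores.

    Assumption (not from the paper): the length bias is approximated by an
    ordinary least-squares line fitted on this batch only.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same size")
    if not rewards:
        return []

    mean_len = mean(lengths)
    mean_rew = mean(rewards)

    # Sum of squared deviations of length; zero means no length variation.
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    if var_len == 0:
        return list(rewards)

    # Least-squares slope of reward against length.
    cov = sum((l - mean_len) * (r - mean_rew) for l, r in zip(lengths, rewards))
    slope = cov / var_len

    # Subtract only the length-dependent part, keeping the batch mean intact.
    return [r - slope * (l - mean_len) for r, l in zip(rewards, lengths)]


# Rewards that grow purely with length collapse to the same calibrated value.
calibrated = calibrate_rewards([1.0, 2.0, 3.0], [10, 20, 30])
```

No reward model retraining or new preference data is involved: the correction is computed from the scored batch alone, which is the property the paragraph above emphasizes. A non-linear bias would need a more flexible fit than this sketch provides.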