ICLR 2025

Post-hoc Reward Calibration: A Case Study on Length Bias

Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M. Ponti, Ivan Titov

Abstract

We ask: can we correct or mitigate biases in reward signals without extra training or additional data? This paper attempts to answer this question by framing it as Post-hoc Reward Calibration. Specifically, our method uses only a batch of scored prompt-response examples to mitigate the reward model's (RM's) bias, without intervening in preference data collection, RM training, or the RLHF phase.
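
To make the idea concrete, the following is a minimal sketch of what a post-hoc correction for length bias could look like. It is an illustration under stated assumptions, not the estimator used in the paper: it assumes the bias is roughly linear in response length, and the function name `calibrate_rewards` and its arguments are hypothetical. It uses only a batch of already-scored responses and leaves the RM untouched.

```python
import numpy as np


def calibrate_rewards(rewards, lengths):
    """Remove a linear length trend from a batch of RM scores.

    Assumption (not from the paper): the length bias is approximated by a
    straight line fitted on the scored batch itself.
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Fit reward ~ slope * length + intercept on the batch.
    slope, _intercept = np.polyfit(lengths, rewards, deg=1)

    # Subtract only the length-dependent part, centred on the mean length
    # so that the average reward of the batch is unchanged.
    return rewards - slope * (lengths - lengths.mean())


# Hypothetical batch: RM scores and response lengths (in tokens).
lengths = [10, 50, 100, 200, 400]
rewards = [0.2, 0.9, 1.1, 2.5, 4.0]
calibrated = calibrate_rewards(rewards, lengths)
```

After this step the calibrated scores are linearly uncorrelated with length within the batch, while their mean is preserved. A nonlinear bias would call for a more flexible estimator than the straight-line fit assumed here.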