ICLR 2025
Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
Paria Rashidinejad, Yuandong Tian
Abstract
Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiaşu's weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To mitigate Type II Reward Hacking, we analyze the learning dynamics of preference optimization and develop a novel technique that dynamically updates preference labels toward certain "stationary labels", resulting in diminishing gradients for untrustworthy samples. Empirically, POWER with dynamic labels (POWER-DL) consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard over DPO, while also improving or maintaining performance on downstream tasks such as mathematical reasoning. Strong theoretical guarantees and empirical results demonstrate the promise of POWER-DL in mitigating reward hacking. 
However, these alignment techniques are observed to suffer from the notorious reward hacking problem (Amodei et al., 2016; Tien et al., 2022; Gao et al., 2023; Casper et al., 2023), where optimizing an imperfect learned reward leads to poor performance under the true reward, assuming an underlying true reward exists (Skalse et al., 2022). One primary cause of the discrepancy between the learned and true rewards is that preference data do not encompass all conceivable choices, which leaves the learned reward model vulnerable to significant statistical fluctuations in areas with sparse data. Consequently, the AI system might be swayed toward choices that appear favorable under the learned reward but are in reality subpar, or it might be deterred from truly desirable choices that do not seem favorable according to the learned reward.
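To make the role of sparse data concrete, the sketch below computes how often each kind of misjudgment occurs in a deliberately simplified setting that is an assumption of this example, not the paper's model. A choice is compared against a fixed baseline in n independent comparisons, and it "appears favorable" when it wins a strict majority of them. The function names and the win probabilities 0.3 (subpar) and 0.7 (decent) are hypothetical.

```python
from math import comb


def prob_appears_favorable(p_true: float, n: int) -> float:
    """Exact probability that a choice wins a strict majority of n
    independent comparisons, given its true win probability p_true."""
    return sum(
        comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
        for k in range(n // 2 + 1, n + 1)  # k wins with k > n / 2
    )


def prob_appears_unfavorable(p_true: float, n: int) -> float:
    """Exact probability that a choice loses a strict majority of n
    independent comparisons, given its true win probability p_true."""
    return sum(
        comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
        for k in range(0, (n + 1) // 2)  # k wins with k < n / 2
    )


if __name__ == "__main__":
    # Assumed illustrative values: a subpar choice (0.3) and a decent one (0.7).
    for n in (1, 3, 11, 101):
        subpar_looks_good = prob_appears_favorable(0.3, n)   # Type I flavor
        decent_looks_bad = prob_appears_unfavorable(0.7, n)  # Type II flavor
        print(
            f"n={n:>3}  subpar looks good: {subpar_looks_good:.4f}  "
            f"decent looks bad: {decent_looks_bad:.4f}"
        )
```

Under these assumptions both error probabilities shrink as n grows, which matches the observation that the fluctuations matter most where the preference data are sparse.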