AAAI 2026
On the Exponential Convergence for Offline RLHF with Pairwise Comparisons
Zhirui Chen, Vincent Y. F. Tan
Abstract
We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu, Jordan, and Jiao (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to identify the optimal action for each state, with the ultimate goal of minimizing the simple regret. We propose an algorithm, RL with Locally Optimal Weights (RL-LOW), whose simple regret decays exponentially as exp(-Ω(n/H)), where n is the number of data samples and H is an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound for offline RLHF with pairwise comparisons. The lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of RL-LOW. In view of privacy considerations in practical applications, we also extend RL-LOW to the setting of (ε, δ)-differential privacy and show, somewhat surprisingly, that the hardness parameter H is unchanged in the asymptotic regime as n tends to infinity; this underscores the inherent efficiency of RL-LOW in terms of preserving the privacy of the observed rewards. By establishing instance-dependent bounds with exponential convergence, our work fills a gap left by existing studies, which concentrate on worst-case regret bounds with inverse polynomial convergence (e.g., O(1/√n)) for offline RLHF with pairwise comparisons.
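To make the contrast between the two convergence regimes concrete, the following sketch compares the number of samples each type of bound would require to guarantee a simple regret of at most a target level. The constants c and C and the target \epsilon are illustrative assumptions introduced here, not quantities stated in the abstract.

```latex
% Illustrative only: c > 0, C > 0 and the target \epsilon \in (0,1) are assumed symbols.
% Instance-dependent exponential rate, of the form \exp(-\Omega(n/H)):
\[
  \text{regret}(n) \le \exp\!\left(-\frac{c\,n}{H}\right) \le \epsilon
  \quad\text{whenever}\quad
  n \ge \frac{H}{c}\,\log\frac{1}{\epsilon}.
\]
% Worst-case inverse polynomial rate, of the form O(1/\sqrt{n}):
\[
  \text{regret}(n) \le \frac{C}{\sqrt{n}} \le \epsilon
  \quad\text{whenever}\quad
  n \ge \frac{C^{2}}{\epsilon^{2}}.
\]
```

Under these assumed forms, the sample requirement grows only logarithmically in 1/\epsilon for the exponential rate, versus quadratically for the inverse polynomial rate, with the instance difficulty entering through H.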