ICLR 2025

Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

Calarina Muslimani, Matthew E. Taylor

Abstract

To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., to require less human interaction), this paper introduces Sub-optimal Data Pre-training (SDP), an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model without requiring human labeling or preferences. This pre-training phase gives the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Through extensive experiments with both simulated and human teachers, we find that SDP can at least meet, but often significantly improve, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.

Published as a conference paper at ICLR 2025

Can we leverage sub-optimal, unlabeled data to improve learning in human-in-the-loop RL methods? To that end, we present Sub-optimal Data Pre-training (SDP), a tool for human-in-the-loop RL algorithms to increase human feedback efficiency. SDP leverages sub-optimal trajectories by pseudo-labeling all transitions with the minimum environment reward. The now pseudo-labeled sub-optimal data serves two purposes. First, we pre-train a regression-based reward model by applying standard supervised learning to minimize the mean squared error.
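The pseudo-labeling and pre-training step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the reward model here is a linear regressor fit by gradient descent (reward models in this setting are typically neural networks), the transitions are random stand-ins for sub-optimal data, and the minimum reward `r_min = -1.0` as well as the names `pseudo_label`, `pretrain_reward_model`, and `mse` are hypothetical.

```python
import numpy as np

def pseudo_label(states, actions, r_min):
    """Pair every (state, action) with the minimum environment reward."""
    inputs = np.concatenate([states, actions], axis=1)
    targets = np.full(len(inputs), r_min, dtype=np.float64)
    return inputs, targets

def mse(inputs, targets, w, b):
    """Mean squared error of the linear reward model r_hat = x @ w + b."""
    return float(np.mean((inputs @ w + b - targets) ** 2))

def pretrain_reward_model(inputs, targets, lr=0.05, epochs=500, seed=0):
    """Fit the linear reward model by gradient descent on the MSE."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=inputs.shape[1])
    b = 0.0
    for _ in range(epochs):
        err = inputs @ w + b - targets          # prediction error per transition
        w -= lr * 2.0 * inputs.T @ err / len(err)
        b -= lr * 2.0 * err.mean()
    return w, b

# Stand-in for reward-free, sub-optimal transitions (assumed shapes).
data_rng = np.random.default_rng(1)
states = data_rng.uniform(-1.0, 1.0, size=(256, 4))
actions = data_rng.uniform(-1.0, 1.0, size=(256, 2))

r_min = -1.0  # assumed minimum environment reward, for illustration only
inputs, targets = pseudo_label(states, actions, r_min)
w, b = pretrain_reward_model(inputs, targets)
final_loss = mse(inputs, targets, w, b)
print(f"pre-training MSE: {final_loss:.2e}")
```

After this step the model predicts a reward close to `r_min` for the sub-optimal transitions, which is the prior that later human feedback refines.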
Intuitively, this pre-training provides the reward model a head start, biasing it towards assigning lower reward values to these low-quality transitions. Second, we initialize the RL agent's replay buffer with the sub-optimal data and make learning updates to the RL agent. This process changes the RL agent's policy and provides different behaviors for the human to provide feedback on (relative to learning with no initial sub-optimal data). This ensures that when the human teacher provides feedback, their time is used efficiently, avoiding redundant feedback on the existing sub-optimal data. Afterward, we follow the standard preference- or scalar-based RL protocol.

This paper's core contribution is showing that we can harness the availability of low-quality, reward-free data for human-in-the-loop RL approaches by pseudo-labeling it with minimum rewards and treating it as a prior for learning reward models. We first validate the utility of SDP in extensive simulated teacher experiments, combining it with four scalar- and preference-based RL algorithms. These experiments show that SDP significantly improves the feedback efficiency in complex tasks from both the DeepMind Control (DMControl) (Tassa et al., 2018) and Meta-World (Yu et al., 2020) suites. Crucially, we further highlight the real-world applicability of SDP by demonstrating its success with human teachers in a 16-person user study. Overall, this work takes an important step toward considering how human-in-the-loop RL approaches can take advantage of readily available sub-optimal data.

RELATED WORK

Human-in-the-Loop RL. Several approaches in human-in-the-loop RL allow agents to leverage human feedback to adapt or learn new behavior. Learning from demonstration is one such methodology that allows a human to provide examples of desired agent behavior (Argall et al., 2009).
Human demonstration data has been used to shape the environment's reward function (Brys et al., 2015), develop a reward function from scratch (Abbeel & Ng, 2004), or bias the agent's policy towards certain actions (Taylor et al., 2011). Although demonstrations can be a rich source of feedback, they are often expensive to obtain and may require domain experts (Dragan & Srinivasa, 2012). Another approach is learning from preference-based feedback, where a teacher provides preferences between two or more sets of agent behavior (Christiano et al., 2017). Preference learning has been popularized in recent years as it can require less effort and expertise compared to providing demonstrations. To further reduce the amount of human interaction required, several strategies have been introduced. This has included combining preferences with demonstrations (