ICLR 2023

Behavior Proximal Policy Optimization

Zifeng Zhuang, Kun Lei, Jinxin Liu, Donglin Wang, Yilang Guo

Cited by 8

Abstract

Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to the overestimation of out-of-distribution state-action pairs. Thus, various additional augmentations are proposed to keep the learned policy close to the offline dataset (or the behavior policy). In this work, starting from the analysis of offline monotonic policy improvement, we arrive at a surprising finding: some online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what an offline RL method needs to overcome the overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without introducing any extra constraint or regularization compared to PPO. Extensive experiments on the D4RL benchmark indicate that this extremely succinct method outperforms state-of-the-art offline RL algorithms. Our implementation is available at https://github.com/Dragon-Zhuang/BPPO.

INTRODUCTION

Typically, reinforcement learning (RL) is thought of as a paradigm for online learning, where the agent interacts with the environment to collect experience and then uses that experience to improve itself (Sutton et al., 1998). This online process poses the biggest obstacle to real-world RL applications, because data collection is expensive or even risky in some fields, such as navigation (Mirowski et al., 2018) and healthcare (Yu et al., 2021a). As an alternative, offline RL eliminates the online interaction and learns from a fixed dataset collected by some arbitrary and possibly unknown process (Lange et al., 2012; Fu et al., 2020). The prospect of this data-driven mode (Levine et al., 2020) is encouraging, and it carries great expectations for solving real-world RL applications. Unfortunately, the major advantage of offline RL, the lack of online interaction, also raises another challenge.
The classical off-policy iterative algorithms should be applicable to the offline setting, since it is sound to regard offline RL as a more severe off-policy case. But all of them tend to underperform due to the overestimation of out-of-distribution (abbreviated as OOD) actions. In policy evaluation, the Q-function poorly estimates the value of OOD state-action pairs. This in turn affects policy improvement, where the agent tends to take OOD actions with erroneously high estimated values, resulting in low performance (Fujimoto et al., 2019). Thus, some solutions keep the learned policy close to the behavior policy to overcome the overestimation (Fujimoto et al., 2019; Wu et al., 2019).

Most offline RL algorithms rely on online interactions to select hyperparameters. This is because offline hyperparameter selection, which selects hyperparameters without online interactions, remains an open problem lacking satisfactory solutions (Paine et al., 2020; Zhang & Jiang, 2021). Deploying the policy learned by offline RL is potentially risky in certain areas (Mirowski et al., 2018; Yu et al., 2021a), since its performance is unknown. However, if the deployed policy can be guaranteed to perform better than the behavior policy, the risk during online interactions will be greatly reduced. This inspires us to consider how to use an offline dataset to improve the behavior policy with a monotonic performance guarantee. We formulate this problem as offline monotonic policy improvement.
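To make the overestimation issue discussed above concrete, the following toy simulation may help. It is only a sketch under assumed numbers and is not part of BPPO: the function name `greedy_on_noisy_q`, the split into 2 in-dataset and 8 OOD actions, the true values, and the noise levels are all hypothetical. In-dataset actions are given accurate value estimates, OOD actions are given noisy ones, and a greedy policy then acts on the estimates.

```python
import random


def greedy_on_noisy_q(trials=5000, seed=0):
    """Act greedily on noisy Q estimates in a single-state toy problem.

    Hypothetical setup (assumed for illustration only):
      * actions 0-1 are covered by the dataset: true value 1.0, small error
      * actions 2-9 are OOD: true value 0.0, large estimation error
    Returns (fraction of OOD picks, mean estimated value, mean true value).
    """
    rng = random.Random(seed)
    true_q = [1.0] * 2 + [0.0] * 8
    noise_std = [0.1] * 2 + [1.0] * 8

    ood_picks = 0
    est_sum = 0.0
    true_sum = 0.0
    for _ in range(trials):
        # Noisy estimate = true value + Gaussian estimation error.
        q_hat = [q + rng.gauss(0.0, s) for q, s in zip(true_q, noise_std)]
        # Policy improvement step: pick the action with the highest estimate.
        action = max(range(len(q_hat)), key=q_hat.__getitem__)
        if action >= 2:
            ood_picks += 1
        est_sum += q_hat[action]
        true_sum += true_q[action]
    return ood_picks / trials, est_sum / trials, true_sum / trials


ood_frac, est_value, true_value = greedy_on_noisy_q()
print(f"OOD actions chosen: {ood_frac:.0%}")
print(f"estimated value of chosen action: {est_value:.2f}")
print(f"true value of chosen action:      {true_value:.2f}")
```

In this assumed setup the greedy policy selects OOD actions in most trials, and the estimated value of the chosen action exceeds its true value, even though every OOD action is worse than the in-dataset ones. This mirrors the failure mode that motivates keeping the learned policy close to the behavior policy.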