ICML 2024

Planning, Fast and Slow: Online Reinforcement Learning with Action-Free Offline Data via Multiscale Planners

Chengjie Wu, Hao Hu, Yiqin Yang, Ning Zhang, Chongjie Zhang

Cited 3 times

Abstract

Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm has demonstrated state-of-the-art results in both settings. In this work, we describe the Reanalyse algorithm, which uses model-based policy and value improvement operators to compute new improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude. We further show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions, as in the case of offline reinforcement learning (offline RL). Combining Reanalyse with the MuZero algorithm, we introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL. In contrast to previous work, our algorithm does not require any special adaptations for the off-policy or offline RL settings. MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting.

* Equal contribution. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

So far, these developments have been relatively independent, with no unified algorithm that could achieve state-of-the-art results in both the online and offline settings. In this paper, we describe the Reanalyse algorithm, a simple yet effective technique for policy and value improvement at any data budget, including the fully offline case. A preliminary version of Reanalyse was briefly introduced in the context of MuZero (Schrittwieser et al., 2020), but was limited to data efficiency improvements in the discrete action case. Here, we delve deeper into the algorithm and push its capabilities much further, ultimately to the point where most or all of the data is reanalysed. Starting with the possible uses of Reanalyse, we show how it can be used for data-efficient learning and offline RL, leading to MuZero Unplugged. We demonstrate its effectiveness for the online case through results on Atari and for the offline case through results on the RL Unplugged benchmark for Atari and DM Control.
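
The core idea described above, recomputing improved training targets on data that has already been collected, can be illustrated with a minimal Python sketch. This is not the authors' implementation: MuZero uses a learned latent model with tree search, whereas the sketch swaps in a one-step lookahead followed by a softmax as a stand-in improvement operator. The names `Sample`, `one_step_improvement`, `reanalyse`, and the `fraction` parameter are hypothetical and chosen only for illustration.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   model(state, action) -> (predicted reward, predicted next state)
#   value_fn(state)      -> predicted value of that state
Model = Callable[[int, int], Tuple[float, int]]
ValueFn = Callable[[int], float]


@dataclass
class Sample:
    """One stored data point together with its current training targets."""
    state: int
    policy_target: List[float]
    value_target: float


def one_step_improvement(
    state: int,
    model: Model,
    value_fn: ValueFn,
    num_actions: int,
    discount: float = 0.99,
    temperature: float = 1.0,
) -> Tuple[List[float], float]:
    """Stand-in improvement operator: one-step lookahead, then softmax.

    This replaces the tree search used by MuZero with a much simpler
    operator so that the example stays short and self-contained.
    """
    q_values = []
    for action in range(num_actions):
        reward, next_state = model(state, action)
        q_values.append(reward + discount * value_fn(next_state))
    # Numerically stable softmax over the lookahead values.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    policy = [w / total for w in weights]
    # Value target: expected lookahead value under the improved policy.
    value = sum(p * q for p, q in zip(policy, q_values))
    return policy, value


def reanalyse(
    buffer: List[Sample],
    model: Model,
    value_fn: ValueFn,
    num_actions: int,
    fraction: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Refresh targets in place for a fraction of the stored samples.

    With fraction=1.0 every stored sample is refreshed, which loosely
    mirrors the fully offline setting where no new data arrives.
    Returns the number of samples whose targets were recomputed.
    """
    rng = rng or random.Random(0)
    count = round(fraction * len(buffer))
    for index in rng.sample(range(len(buffer)), count):
        sample = buffer[index]
        sample.policy_target, sample.value_target = one_step_improvement(
            sample.state, model, value_fn, num_actions
        )
    return count
```

In this sketch the stored states never change; only their targets are refreshed using the latest model and value estimates, so the same fixed dataset can keep supplying fresh learning signal as the model improves. A training loop would call `reanalyse` periodically and then fit the policy and value networks to the updated targets.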