NeurIPS 2021

Generalized Proximal Policy Optimization with Sample Reuse

James Queeney, Yannis Paschalidis, Christos G. Cassandras

Cited by 63

Abstract

In real-world decision-making tasks, it is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while off-policy methods make more efficient use of data through sample reuse. In this work, we combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization. This motivates an off-policy version of the popular algorithm that we call Generalized Proximal Policy Optimization with Sample Reuse. We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the competing goals of stability and sample efficiency.

On-policy reinforcement learning methods such as Proximal Policy Optimization (PPO) [19] deliver stable performance throughout training due to their connection to theoretical policy improvement guarantees. These methods are motivated by a lower bound on the expected performance loss at every update, which can be approximated using samples generated by the current policy. The theoretically supported stability of these methods is very attractive, but the need for on-policy data and the high-variance nature of reinforcement learning often require significant data to be collected between every update, resulting in high sample complexity and slow learning.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
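For readers unfamiliar with the clipping mechanism of PPO referred to above, the following is a minimal sketch of the standard on-policy clipped surrogate objective, estimated from samples collected by the current policy. It is not the generalized off-policy objective developed in this paper. The function name `ppo_clip_objective`, its argument names, and the default `eps=0.2` are illustrative assumptions rather than anything taken from the source.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Sample estimate of the standard PPO clipped surrogate (to be maximized).

    logp_new:   log-probabilities of the sampled actions under the candidate policy
    logp_old:   log-probabilities of the same actions under the data-collecting policy
    advantages: advantage estimates for the sampled state-action pairs
    eps:        clipping parameter (hypothetical default, chosen for illustration)
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio between the candidate policy and the sampling policy.
    ratio = np.exp(logp_new - logp_old)

    # Unclipped and clipped surrogate terms for each sample.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # The elementwise minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] in the direction that raises the objective.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the candidate policy equals the sampling policy, every ratio is 1 and the estimate reduces to the mean advantage. Once a ratio moves beyond the clipping range in the favorable direction, that sample stops contributing additional gain, which is how the update is kept close to the data-collecting policy.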