NeurIPS2022

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji

被引用 14 次

摘要

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches. The multi-batch reinforcement learning framework, where the agent is required to provide a time schedule to update policy before everything, which is particularly suitable for the scenarios where the agent suffers extensively from changing the policy adaptively. Given a finite-horizon MDP with S states, A actions and planning horizon H, we design a computational efficient algorithm to achieve near-optimal regret of Õp a SAH 3 K lnp1δqq 1 in K episodes using O pH log 2 log 2 pKqq batches with confidence parameter δ. To our best of knowledge, it is the first Õp ? SAH 3 Kq regret bound with OpH log 2 log 2 pKqq batch complexity. Meanwhile, we show that to achieve ÕppolypS, A, Hq ? Kq regret, the number of batches is at least Ω pH log A pKq log 2 log 2 pKqq, which matches our upper bound up to logarithmic terms. Our technical contribution are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) an computational efficient algorithm to explore certain directions with an approximated transition model. On the other hand, we show a lower bound of batch complexity as below. Theorem 2. For any algorithm with OppolypS, A, Hq ? Kq regret bound, the batch complexity is at least ΩpH log A pKq log 2 log 2 pKqq. Compared to the lower bound of Ωplog 2 log 2 pKqq in [Gao et al., 2019] for multi-armed bandit problem, additional ΩpH log A pKqq batches are required to explore the structure of the MDP. Due to space limitation, we defer the full proofs of Theorem 1 and Theorem 2 to Appendix D and Appendix B respectively. Our contribution. We propose the framework of multi-batch RL, and first achieve OpH log 2 log 2 pKqq sample complexity bound with the near-optimal Õp ? SAH 3 Kιq regret bound with an efficient algorithm. We also prove that for any algorithm with OppolypS, A, Hq ? Kq regret, the