ICML2021

Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Zihan Zhang, Yuan Zhou, Xiangyang Ji

被引用 39 次

摘要

In this paper we consider the problem of learning an ϵ\epsilon-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with SS states, AA actions, the discount factor γ(0,1)\gamma \in (0,1), and an approximation threshold ϵ>0\epsilon > 0, we provide a model-free algorithm to learn an ϵ\epsilon-optimal policy with sample complexity O~(SAln(1/p)ϵ2(1γ)5.5)\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}}) (where the notation O~()\tilde{O}(\cdot) hides poly-logarithmic factors of S,A,1/(1γ)S,A,1/(1-\gamma), and 1/ϵ1/\epsilon) and success probability (1p)(1-p). For small enough ϵ\epsilon, we show an improved algorithm with sample complexity O~(SAln(1/p)ϵ2(1γ)3)\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}}). While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on SS, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.