NeurIPS2020

Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

Tengyu Xu, Zhe Wang, Yingbin Liang

104 citations

Abstract

The actor-critic (AC) algorithm is a popular method to find an optimal policy in reinforcement learning. In the infinite horizon scenario, the finite-sample convergence rate for the AC and natural actorcritic (NAC) algorithms has been established recently, but under independent and identically distributed (i.i.d.) sampling and single-sample update at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data for each iteration, and with actor having general policy class approximation. We show that the overall sample complexity for a mini-batch AC to attain an ǫ-accurate stationary point improves the best known sample complexity of AC by an order of O(ǫ -1 log(1/ǫ)), and the overall sample complexity for a mini-batch NAC to attain an ǫ-accurate globally optimal point improves the existing sample complexity of NAC by an order of O(ǫ -1 / log(1/ǫ)). Moreover, the sample complexity of AC and NAC characterized in this work outperforms that of policy gradient (PG) and natural policy gradient (NPG) by a factor of O((1 -γ) -3 ) and O((1 -γ) -4 ǫ -1 / log(1/ǫ)), respectively. This is the first theoretical study establishing that AC and NAC attain orderwise performance improvement over PG and NPG under infinite horizon due to the incorporation of critic.