NeurIPS2020

Provably Efficient Neural GTD for Off-Policy Learning

Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, Mingyi Hong

7 citations

Abstract

This paper studies a gradient temporal difference (GTD) algorithm using neural network (NN) function approximators to minimize the mean squared Bellman error (MSBE). For off-policy learning, we show that the minimum MSBE problem can be recast into a min-max optimization involving a pair of over-parameterized primal-dual NNs. The resultant formulation can then be tackled using a neural GTD algorithm. We analyze the convergence of the proposed algorithm with a 2-layer ReLU NN architecture using m neurons and prove that it computes an approximate optimal solution to the minimum MSBE problem as m ! 1. • We focus on the 2-layer ReLU NN architecture. We show that when the width of the NN employed goes to infinity and the TD error function lies in the NN function class, the proposed neural GTD algorithm is guaranteed to find a global minimizer of the MSBE problem. Importantly, the convergence rates measured with the functional distance are independent of NN's width. Related Works Recent works have studied the global convergence to optimal solutions of different RL algorithms. Examples are [Cai et al., 2019 , Wang et al., 2019 , Xu and Gu, 2019] which studied neural TD learning, neural policy gradient, and neural Q-learning, respectively. These works rely on that in the limit when the number of neurons approaches infinity, the nonlinear NN function is locally approximated by a linear function. The present paper follows a similar philosophy, i.e., by reducing the neural GTD algorithm to a primal-dual method for solving a convex-concave problem. We develop new analysis to handle biases in gradients and off-policy learning settings. Our work is also related to recent works on finite time guarantees for GTD learning algorithms using linear function approximation. For instance, Dalal et al. [2018 Dalal et al. [ , 2019]] , Liu et al. [2015] provide guarantees when the algorithms acquire i.i.d. samples; Du et al. [2017] apply variance reduction technique; Doan [2019], Gupta et al. [2019], Wang et al. [2017], Xu et al. [2019] consider when Markov samples are used. In comparison, we extend these analysis to using NN function approximation and offer finite-time guarantees in the overparameterization limit.