NeurIPS2020

Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis

Shaocong Ma, Yi Zhou, Shaofeng Zou

18 citations

Abstract

Variance reduction techniques have been successfully applied to temporaldifference (TD) learning and help to improve the sample complexity in policy evaluation. However, the existing work applied variance reduction to either the less popular one time-scale TD algorithm or the two time-scale GTD algorithm but with a finite number of i.i.d. samples, and both algorithms apply to only the on-policy setting. In this work, we develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting and analyze its non-asymptotic convergence rate over both i.i.d. and Markovian samples. In the i.i.d. setting, our algorithm matches the best-known lower bound Õ( -1 ). In the Markovian setting, our algorithm achieves the state-of-the-art sample complexity O( -1 log -1 ) that is near-optimal. Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller asymptotic convergence error than both the conventional TDC and the variance-reduced TD. Assumption 2.1 (Problem solvability). The matrix A and C are non-singular. Assumption 2.2 (Boundedness). For all states s, s ∈ S and all actions a ∈ A, 1. The feature function is uniformly bounded as φ(s) ≤ 1; 2. The reward is uniformly bounded as r(s, a, s ) ≤ r max ; 3. The importance sampling ratio is uniformly bounded as ρ(s, a) ≤ ρ max . Assumption 2.3 (Geometric ergodicity). There exists κ > 0 and ρ ∈ (0, 1) such that for all t ≥ 0, sup where d TV (P, Q) denotes the total-variation distance between the probability measures P and Q. Under Assumption 2.1, the optimal parameter θ * can be written as θ * = -A -1 b. 3 Variance-Reduced TDC for I.I.D. Samples In this section, we propose a variance reduction scheme for the off-policy TDC over i.i.d. samples and analyze its non-asymptotic convergence rate.