NeurIPS2020

Agnostic $Q$ -learning with Function Approximation in Deterministic Systems: Near-Optimal Bounds on Approximation Error and Sample Complexity

Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang

被引用 19 次

摘要

The current paper studies the problem of agnostic Q-learning with function approximation in deterministic systems where the optimal Q-function is approximable by a function in the class F with approximation error δ ≥ 0. We propose a novel recursion-based algorithm and show that if δ = O ρ/ √ dim E , then one can find the optimal policy using O(dim E ) trajectories, where ρ is the gap between the optimal Q-value of the best actions and that of the second-best actions and dim E is the Eluder dimension of F. Our result has two implications: 1. In conjunction with the lower bound in [Du et al., 2020] , our upper bound suggests that the condition δ = Θ ρ/ √ dim E is necessary and sufficient for algorithms with polynomial sample complexity. 2. In conjunction with the obvious lower bound in the tabular case, our upper bound suggests that the sample complexity Θ (dim E ) is tight in the agnostic setting. Therefore, we help address the open problem on agnostic Q-learning proposed in [Wen and Van Roy, 2013] . We further extend our algorithm to the stochastic reward setting and obtain similar results.