NeurIPS2023
On the Convergence and Sample Complexity Analysis of Deep Q-Networks with ε-Greedy Exploration
Shuai Zhang, Hongkang Li, Meng Wang, Miao Liu, Pin-Yu Chen, Songtao Lu, Sijia Liu, Keerthiram Murugesan, Subhajit Chaudhury
50 citations
Abstract
This paper provides a theoretical understanding of Deep Q-Network (DQN) with the ε-greedy exploration in deep reinforcement learning. Despite the tremendous empirical achievement of the DQN, its theoretical characterization remains underexplored. First, the exploration strategy is either impractical or ignored in the existing analysis. Second, in contrast to conventional Q-learning algorithms, the DQN employs the target network and experience replay to acquire an unbiased estimation of the mean-square Bellman error (MSBE) utilized in training the Qnetwork. However, the existing theoretical analysis of DQNs lacks convergence analysis or bypasses the technical challenges by deploying a significantly overparameterized neural network, which is not computationally efficient. This paper provides the first theoretical convergence and sample complexity analysis of the practical setting of DQNs with ε-greedy policy. We prove an iterative procedure with decaying ε converges to the optimal Q-value function geometrically. Moreover, a higher level of ε values enlarges the region of convergence but slows down the convergence, while the opposite holds for a lower level of ε values. Experiments justify our established theoretical insights on DQNs. 2 Related Works. Q-learning with linear function approximation. In the setting of linear function approximation, the Q-function is assumed to be a linear function of either the feature mapping [83, 32, 92] or a mixture of some basis kernels [91, 52] . Early works mainly focus on the algorithm design [6, 48, 2] and convergence analysis [38, 47, 67, 14, 69] but lacks theoretical guarantees with polynomial sample complexity. Assuming the underlying Q-function can be exactly represented as a linear function of the feature mapping with some unknown parameters, several sample-efficient algorithms are proposed to find the ground-truth mapping with finite-sample guarantee [11, 80] , and the sample complexity depends linearly on the feature dimension [80] .