ICML2022

On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs

Yuanzhou Chen, Jiafan He, Quanquan Gu

被引用 8 次

摘要

We study reinforcement learning for infinitehorizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. Existing UCLK (Zhou et al., 2021b) algorithm for this setting only has a regret guarantee, which cannot lead to a tight sample complexity bound. In this paper, we extend uniform-PAC sample complexity from the episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an O d 2 /((1 -γ) 4 ϵ 2 ) + 1/((1 -γ) 6 ϵ 2 ) uniform-PAC sample complexity, where d is the dimension of the feature mapping, γ ∈ (0, 1) is the discount factor of the MDP and ϵ is the accuracy parameter. To the best of our knowledge, this is the first O(1/ϵ 2 ) sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation (without access to the generative model).