NeurIPS2024

Truncated Variance Reduced Value Iteration

Yujia Jin, Ishani Karmarkar, Aaron Sidford, Jiayi Wang

摘要

We provide faster randomized algorithms for computing an ϵ\epsilon-optimal policy in a discounted Markov decision process with AtotA_{\text{tot}}-state-action pairs, bounded rewards, and discount factor γ\gamma. We provide an O~(Atot[(1γ)3ϵ2+(1γ)2])\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-2}])-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model which can be queried in O~(1)\tilde{O}(1)-time, and an O~(s+(1γ)2)\tilde{O}(s + (1-\gamma)^{-2})-time algorithm in the offline setting where the probability transition matrix is known and ss-sparse. These results improve upon the prior state-of-the-art which either ran in O~(Atot[(1γ)3ϵ2+(1γ)3])\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-3}]) time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, O~(s+Atot(1γ)3)\tilde{O}(s + A_{\text{tot}} (1-\gamma)^{-3}) time [Sidford, Wang, Wu, Yang, Ye 2018] in the offline setting, or time at least quadratic in the number of states using interior point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [Sidford, Wang, Wu, Yang, Ye 2018]. We provide a variant that carefully truncates the progress of its iterates to improve the variance of new variance-reduced sampling procedures that we introduce to implement the steps. Our method is essentially model-free and can be implemented in O~(Atot)\tilde{O}(A_{\text{tot}})-space when given generative model access. Consequently, our results take a step in closing the sample-complexity gap between model-free and model-based methods.