NeurIPS2021

Nearly Horizon-Free Offline Reinforcement Learning

Tongzheng Ren, Jialian Li, Bo Dai, Simon S. Du, Sujay Sanghavi

52 citations

Abstract

We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDP). For tabular MDP with S states and A actions, or linear MDP with anchor points and feature dimension d, given the collected K episodes data with minimum visiting probability of (anchor) stateaction pairs d m , we obtain nearly horizon H-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by 1. Specifically: • For offline policy evaluation, we obtain an Õ 1 Kdm error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and does not have additional dependency on poly (H, S, A, d) in higher-order term. • For offline policy optimization, we obtain an Õ 1 Kdm + min(S,d) Kdm sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a high-order term, improving upon the best known result by Cui and Yang [2020] that has additional poly (H, S, d) factors in the main term. To the best of our knowledge, these are the first set of nearly horizon-free bounds for episodic timehomogeneous offline tabular MDP and linear MDP with anchor points. Central to our analysis is a simple yet effective recursion based method to bound a "total variance" term in the offline scenarios, which could be of individual interest. Offline Policy Evaluation. For OPE in infinite horizon tabular MDP, [Li et al., 2020] showed that plugin estimator can achieve the error of O 1 dm(1-γ) under Assumption 1, which matches the lower bound in [Pananjady and Wainwright, 2020] up to logarithmic factors. For OPE in finite horizon time-inhomogeneous tabular MDP, [Yin and Wang, 2020, Yin et al., 2020] provided an error bound of O 1 Kdm + √ SA Kdm under the uniform reward assumption, which matches the lower bound [Jiang and Li, 2016] up to logarithmic fac-