NeurIPS 2023
Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage
Jose H. Blanchet, Miao Lu, Tong Zhang, Han Zhong
Cited by 52
Abstract
In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to find, purely from an offline dataset, an optimal policy that performs well in perturbed environments. Specifically, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization (P$^2$MPO), which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial to overcome the distributional shifts incurred by (i) the mismatch between the behavior policy and the family of target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that P$^2$MPO is sample-efficient with robust partial coverage data, which only requires the offline data to have good coverage of the distributions induced by the optimal robust policy and the perturbed models around the nominal model. Our assumption on the data is relatively mild compared with previous full-coverage-style assumptions, which require a uniformly lower bounded data distribution. Our algorithm and theory can be applied to a vast body of robust Markov decision processes (RMDPs) in the regime of large state spaces. By tailoring specific model estimation subroutines for concrete examples of RMDPs, including tabular RMDPs, factored RMDPs, and kernel and neural RMDPs, we prove that for all these examples P$^2$MPO enjoys an $O(n^{-1/2})$ convergence rate, where $n$ is the number of trajectories in the data. We highlight that all these RMDP examples, except tabular RMDPs, are first identified and proven tractable by this work. Furthermore, as an extension to multi-agent decision-making, we continue our study of robust offline RL in multi-player robust Markov games (RMGs).
By extending the double pessimism principle identified for single-agent RMDPs, we propose another doubly-pessimistic-type algorithm framework that can efficiently find the robust Nash equilibria among players using only robust unilateral (partial) coverage data. To the best of our knowledge, this work proposes the first general learning principle, double pessimism, for robust offline RL and shows that it is provably efficient in the context of general function approximation.
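
As a rough illustration of the double pessimism principle described in the abstract, the policy optimization step can be sketched as a nested max-inf-inf problem. The notation below is assumed for illustration and does not appear in the abstract: $\Pi$ is the class of target policies, $\widehat{\mathcal{P}}$ is a confidence set of nominal models returned by the model estimation subroutine, $\Phi(P)$ is the set of perturbed models around a nominal model $P$, and $V^{\pi}_{\tilde{P}}(s_1)$ is the value of policy $\pi$ under model $\tilde{P}$ from the initial state $s_1$.

```latex
\[
  \hat{\pi} \in \arg\max_{\pi \in \Pi}
  \; \inf_{P \in \widehat{\mathcal{P}}}
  \; \inf_{\tilde{P} \in \Phi(P)}
  V^{\pi}_{\tilde{P}}(s_1)
\]
```

In this sketch, the outer infimum over $\widehat{\mathcal{P}}$ guards against estimation error in the nominal model, which corresponds to shift (i), and the inner infimum over $\Phi(P)$ guards against perturbation of the nominal model, which corresponds to shift (ii).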