NeurIPS 2020

A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Nevena Lazic, Dong Yin, Mehrdad Farajtabar, Nir Levine, Dilan Görür, Chris Harris, Dale Schuurmans

Cited by 13

Abstract

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e. where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and for arbitrary rewards, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.

Introduction

Recently, there have been considerable advances in reinforcement learning (RL), with algorithms achieving impressive performance on game playing and simple robotic tasks. Successful approaches typically learn through direct (online) interaction with the environment. However, in many real applications, access to the environment is limited to a fixed dataset, due to considerations of cost, safety, or time. One key challenge in this setting is off-policy evaluation (OPE): the task of evaluating the performance of a target policy given samples collected by a behavior policy.

The focus of our work is OPE in infinite-horizon undiscounted MDPs, which capture long-horizon tasks such as game playing, routing, and the control of physical systems. Most recent state-of-the-art OPE methods for this setting estimate the ratios of stationary distributions of the target and behavior policy [