NeurIPS2020

A Unifying View of Optimism in Episodic Reinforcement Learning

Gergely Neu, Ciara Pike-Burke

76 citations

Abstract

The principle of "optimism in the face of uncertainty" underpins many theoretically successful reinforcement learning algorithms. In this paper we provide a general framework for designing, analyzing and implementing such algorithms in the episodic reinforcement learning problem. This framework is built upon Lagrangian duality, and demonstrates that every model-optimistic algorithm that constructs an optimistic MDP has an equivalent representation as a value-optimistic dynamic programming algorithm. Typically, it was thought that these two classes of algorithms were distinct, with model-optimistic algorithms benefiting from a cleaner probabilistic analysis while value-optimistic algorithms are easier to implement and thus more practical. With the framework developed in this paper, we show that it is possible to get the best of both worlds by providing a class of algorithms which have a computationally efficient dynamic-programming implementation and also a simple probabilistic analysis. Besides being able to capture many existing algorithms in the tabular setting, our framework can also address largescale problems under realizable function approximation, where it enables a simple model-based analysis of some recently proposed methods. * This work was done while CPB was at Universitat Pompeu Fabra and Barcelona Graduate School of Economics. π * which satisfies the Bellman optimality equations. For proofs and further details of this formulation, see Puterman [38] . Linear function approximation in MDPs. In most practical problems, the state space is too large to use the above results and it is common to work with parameterized estimates of the quantities of interest. We focus on the classic idea of linear function approximation to represent the action-value functions as linear functions of some fixed d-dimensional feature map ϕ : S → R d , so Q θ h (x, a) = θ h,a , ϕ(x) for some θ h,a ∈ R d for each action a and stage h. To avoid technicalities, we assume that the state space S is still finite, although potentially very large. This allows us to define the S × d feature matrix Φ with its x th row being ϕ T (x), and represent the action-value function as Q h,a = Φθ h,a . We make the following assumption: Assumption 1 (Factored linear MDP [49, 37, 25] ). For each action a and stage h, there exists a d × S matrix M h,a and a vector ρ a such that the transition matrix can be written as P h,a = ΦM h,a , and the reward function as r a = Φρ a . Furthermore, the rows of M h,a , m h,a (x), satisfy m h,a (x) 1 ≤ C P for all (x, a, h), ρ satisfies ρ a 2 ≤ C r , and ϕ(x) 2 ≤ R for some positive constants C P , C r , R. As shown by Jin et al. [25], this assumption implies that for every policy π, there exists a θ π such that Q π h (x, a) = θ π h,a , ϕ(x) . We now show that factored linear MDPs also enjoy a strong dual realizability property. Let W h,a be an arbitrary symmetric S × S weight matrix for each action a such that Φ T W h,a Φ is full rank, and notice that, due to the realizability of the action-value functions, the optimal value functions can be written as the solution to the following LP: minimize