ICML2023

Reward-Mixing MDPs with Few Latent Contexts are Learnable

Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

摘要

We consider episodic reinforcement learning in reward-mixing Markov decision processes (RM-MDPs): at the beginning of every episode nature randomly picks a latent reward model among M candidates and an agent interacts with the MDP throughout the episode for H time steps. Our goal is to learn a near-optimal policy that nearly maximizes the H time-step cumulative rewards in such a model. Prior work (Kwon et al., 2021a) established an upper bound for RMMDPs with M = 2. In this work, we resolve several open questions for the general RMMDP setting. We consider an arbitrary M ≥ 2 and provide a sample-efficient algorithm-EM 2 -that outputs an ϵ-optimal policy using O ϵ -2 • S d A d • poly(H, Z) d episodes, where S, A are the number of states and actions respectively, H is the time-horizon, Z is the support size of reward distributions and d = O(min(M, H)). We also provide a (SA) Ω( √ M ) /ϵ 2 lower bound, supporting that super-polynomial sample complexity in M is necessary.