ICML2020

Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

被引用 117 次

摘要

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves Õ(L|X| |A|T ) regret with high probability, where L is the horizon, |X| the number of states, |A| the number of actions, and T the number of episodes. To our knowledge, our algorithm is the first to ensure Õ( √ T ) regret in this challenging setting; in fact it achieves the same regret as (Rosenberg & Mansour, 2019a) who consider the easier setting with full-information. Our key contributions are two-fold: a tighter confidence set for the transition function; and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.