ICML2022

From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, Pierre Ménard

被引用 24 次

DOI arXiv 出版方

摘要

We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order O( √ H 3 SAT ) where H is the length of one episode, S is the number of states, A the number of actions, T the number of episodes, that matches the lower-bound of Ω( √ H 3 SAT ) up to poly-log terms in H, S, A, T for a large enough T . To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon H (and S) without the need of an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anticoncentration bound for a weighted Dirichlet sum that can be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and Bayesian bootstrap (Rubin, 1981). 1 We translate all the bounds to the stage-dependent setting by multiplying by √ H the regret bounds in the stage-independent setting. 2 In the O(•) notation we ignore terms poly-log in H, S, A, T . 3 Or simple linearly parameterized settings.