NeurIPS2022

Reinforcement Learning with Logarithmic Regret and Policy Switches

Grigoris Velegkas, Zhuoran Yang, Amin Karbasi

5 citations

Abstract

In this paper, we study the problem of regret minimization for episodic Reinforcement Learning (RL) both in the model-free and the model-based setting. We focus on learning with general function classes and general model classes, and we derive results that scale with the eluder dimension of these classes. In contrast to the existing body of work that mainly establishes instance-independent regret guarantees, we focus on the instance-dependent setting and show that the regret scales logarithmically with the horizon T , provided that there is a gap between the best and the second best action in every state. In addition, we show that such a logarithmic regret bound is realizable by algorithms with O(log T ) switching cost (also known as adaptivity complexity). In other words, these algorithms rarely switch their policy during the course of their execution. Finally, we complement our results with lower bounds which show that even in the tabular setting, we cannot hope for regret guarantees lower than o(log T ). In the model-free setting, we study the algorithm proposed in Kong et al. (2021) , which is a variant of the leastsquares value iteration (LSVI) with upper confidence bound (UCB) bonuses that guide exploration. Here, the bonus functions are given by the width of a data-dependent confidence region for LSVI. For the model-based setting, we develop a similar algorithm which combines value-targeted regression with UCB bonuses. On top of the logarithmic regret guarantees they enjoy, our algorithms feature lazy policy updates, in the sense that policy is updated rarely and only when certain conditions are met. For both settings, we establish O(poly(log T ) • poly(H) • poly(d F ) • 1/gap min ) regret guarantees, where T is the number of interactions with the environment, H is the planning horizon, d F is a term that captures the complexity of the function class F which is used to approximate either the value function or the transition model. In particular, d F involves both the eluder dimension (Russo & Van Roy 2013) and the log-covering numbers of the function classes. That is, for benign MDPs where gap min > 0 these RL algorithms achieve logarithmic regret, which is exponentially better than the worst-case O( √ T )-regret. Moreover, we show that the adaptivity complexity, meaning the number of different policies our algorithms use, is also logarithmic in T . To the best of our knowledge, this is the first work that establishes a logarithmic instance-dependent regret guarantee for RL with general function approximation. Related Work Logarithmic regret bounds for bandits. There is a long line of work that establishes logarithmic regret guarantees in bandit problems. Essentially, bandits are a special case of RL where the transition to the next state does not depend on the action that was taken by the agent. An extensive list of such algorithms can be found in Bubeck & Cesa-Bianchi (2012), Slivkins ( 2019 ), Lattimore & Szepesvári (2020). Logarithmic regret bounds for RL. A series of works are devoted to proving instance-dependent logarithmic regret bounds in tabular RL. Ok et al. (2018), Simchowitz & Jamieson (2019) prove lower bounds that show that a logarithmic dependence on T is unavoidable. Considering the upper bounds, Auer & Ortner (2007), Tewari & Bartlett (2007) establish logarithmic regret guarantees in the average reward setting. Both of these guarantees are asymptotic as they require the number of interactions T with the MDP to be large enough. Regarding non-asymptotic bounds, Jaksch et al. (2010) provide such an algorithm that achieves O(D 2 |S| 2 |A| log(T )/gap min ) regret for the average-reward MDP, where D is the diameter of the MDP. For episodic MDPs, logarithmic regret upper bounds are established in Simchowitz & Jamieson (2019), Yang et al. (2021). The work that is probably the most closely related to ours is He et al. ( 2021 ). It provides instance-dependent logarithmic regret guarantees both in the model-free and model-based setting with linear function approximation. Moreover, the proposed algorithms update the policy in every episode. Our work generalizes these results since the linear regime is a special case of the setting we are studying. Furthermore, we achieve an exponential improvement on the adaptivity complexity over their algorithms. Bandits with limited adaptivity complexity. There is a lot of interest in obtaining bandit algorithms that update their policies rarely (Abbasi-Yadkori et al.