ICML2023

Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR

Kaiwen Wang, Nathan Kallus, Wen Sun

36 citations

Abstract

In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance τ . Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is Ω( √ τ -1 AK), where A is the number of actions and K is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of Ω( √ τ -1 SAK) (with normalized cumulative rewards), where S is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of O( √ τ -1 SAK) under a continuity assumption and in general attains a near-optimal regret of O(τ -1 √ SAK), which is minimax-optimal for constant τ . This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient. * Correspondence to https://kaiwenw.github.io/ . CVaR RL without regret guarantees: Keramati et al. ( 2020 ) proposed a distributional RL approach (Bellemare et al., 2017) for RL with the CVaR objective. A key difference is that Keramati et al. (2020) focuses on the easier task of identifying a policy with high CVaR. On the other hand, Bastani et al. ( 2022 ) and our work focuses on algorithms with low CVaR regret, which guarantees safe exploration. Note that low-regret methods can be converted into probably approximately correct (PAC) CVaR RL, by taking the uniform mixture of policies from the low-regret algorithm. Tamar et al. (2015) derived the policy gradient for the CVaR RL objective and showed asymptotic convergence to a local optimum. Chow & Ghavamzadeh (2014) developed actor-critic algorithms for the mean-CVaR objective, i.e., maximizing expected returns subject to a CVaR constraint. Another motivating perspective for CVaR RL is its close ties to robust MDPs (Wiesemann et al., 2013) . Specifically, Chow et al. (2015, Proposition 1) showed that the CVaR of returns is equivalent to the expected returns under the worst-case perturbation of the transition kernel in some uncertainty set. While the uncertainty set is not rectangular, Chow et al. ( 2015 ) derived tractable robust Bellman equations and proved convergence to a globally optimal CVaR policy. However, these methods for CVaR RL do not lead to low-regret algorithms, which is our focus. Risk-sensitive RL with different risk measures: Prior works have also proved risk-sensitive RL regret bounds in the context of other risk measures that are not directly comparable to the CVaR RL setting we consider. Fei et al. (2020, 2021); Liang & Luo (2022) showed Bellman equations and regret guarantees with the entropic risk measure based on an exponential utility function. Du et al. (2022); Lam et al. (2023) studied the more conservative Iterated CVaR objective, which considers the risk of the reward-to-go at every step along the trajectory. In contrast, our setup aims to holistically maximize the CVaR of the total returns. Risk-sensitive regret lower bounds: Fei et al. (2020); Liang & Luo ( 2022 ) showed regret lower bounds for risk-sensitive RL with the entropic risk measure. We show tight lower bounds for risk-sensitive MAB and RL with the CVaR objective, which to the best of our knowledge are the first lower bounds for this problem. Safety in offline RL: While our focus is online RL, risk-aversion has also been studied in offline RL. Some past works include offline learning with risk measures (Urpí et al., 2021) and distributional robustness