ICML 2025

Online Learning in Risk Sensitive constrained MDP

Arnob Ghosh, Mehrdad Moharrami

Abstract

We consider a setting in which the agent aims to maximize the expected cumulative reward, subject to a constraint that the entropic risk of the total utility exceeds a given threshold. Unlike the risk-neutral case, standard primal-dual approaches fail to directly yield regret and violation bounds, as value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of $\tilde{O}\big(V_{g,\max} K^{3/4} + H^4 S^2 A \log(1/\delta) K^{3/4}\big)$ and constraint violation of $\tilde{O}\big(V_{g,\max} H^3 S^2 A \log(1/\delta) K^{3/4}\big)$ with probability at least $1-\delta$, where $S$ and $A$ denote the cardinalities of the state and action spaces, respectively, $H$ is the episode length, $K$ is the number of episodes, $\alpha < 0$ is the risk-aversion parameter, and $V_{g,\max} = \frac{1}{|\alpha|}(\exp(|\alpha| H) - 1)$. To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem.
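As a rough illustration of the quantities named above, the following Python sketch evaluates the entropic risk of a discrete total utility and checks it numerically against an OCE-style maximization over a scalar budget. The abstract does not state the exact OCE form, so the sketch assumes the standard identity with exponential utility $u(t) = (e^{\alpha t} - 1)/\alpha$; the function names, the toy distribution, and the grid search are hypothetical choices for illustration and are not taken from the paper's algorithm.

```python
import math


def entropic_risk(outcomes, probs, alpha):
    """Entropic risk (1/alpha) * log E[exp(alpha * X)] of a discrete X."""
    moment = sum(p * math.exp(alpha * x) for x, p in zip(outcomes, probs))
    return math.log(moment) / alpha


def oce_objective(budget, outcomes, probs, alpha):
    """Assumed OCE objective: budget + E[u(X - budget)], u(t) = (exp(alpha*t) - 1) / alpha."""
    expected_utility = sum(
        p * (math.exp(alpha * (x - budget)) - 1.0) / alpha
        for x, p in zip(outcomes, probs)
    )
    return budget + expected_utility


def v_g_max(alpha, horizon):
    """V_{g,max} = (exp(|alpha| * H) - 1) / |alpha|, as defined in the abstract."""
    return (math.exp(abs(alpha) * horizon) - 1.0) / abs(alpha)


# Toy example (assumed values): total utility over H = 3 steps, alpha < 0.
H = 3
alpha = -0.5
outcomes = [0.0, 1.0, 2.0, 3.0]
probs = [0.1, 0.2, 0.3, 0.4]

risk = entropic_risk(outcomes, probs, alpha)
mean = sum(p * x for x, p in zip(outcomes, probs))

# Maximize the OCE objective over a grid of budgets in [0, H].
grid = [i * H / 1000 for i in range(1001)]
best_oce = max(oce_objective(b, outcomes, probs, alpha) for b in grid)

print("entropic risk:", risk)
print("grid OCE maximum:", best_oce)
print("mean utility:", mean)
print("V_g_max:", v_g_max(alpha, H))
```

Under this assumed form, the maximizing budget coincides with the entropic risk itself, which suggests why a continuous budget variable is a natural addition to the state, and the utility term has magnitude at most $V_{g,\max}$ whenever the argument of $u$ lies in $[-H, 0]$.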