ICLR 2023
Risk-Aware Reinforcement Learning with Coherent Risk Measures and Non-linear Function Approximation
Thanh Lam, Arun Verma, Bryan Kian Hsiang Low, Patrick Jaillet
Abstract
Reinforcement Learning (RL) is a branch of machine learning that focuses on training agents to make sequential decisions. By interacting with the environment, an RL agent learns optimal policies that guide its actions. While traditional RL algorithms focus primarily on maximizing expected rewards, they often overlook the risks associated with uncertain or adverse outcomes. This limitation is particularly problematic in high-stakes applications such as autonomous driving, healthcare, and finance, where the consequences of poor decision-making can be significant. To address this gap, the field of risk-sensitive reinforcement learning has emerged, enhancing the safety and robustness of RL agents in uncertain environments. This thesis explores advancements in risk-sensitive RL by developing novel algorithms, frameworks, and analysis techniques to address uncertainty and robustness in sequential decision-making. One of the primary focuses is the application of Entropic Value-at-Risk (EVaR), a recently introduced risk measure, to RL. Unlike the conventional Conditional Value-at-Risk (CVaR), EVaR characterizes distributional uncertainty using Kullback-Leibler (KL) divergence, which better aligns with common practices in machine learning. This alignment enables a broader application in risk-sensitive RL problems where robustness to uncertainty is crucial. To achieve this, we propose value iteration and policy gradient algorithms that incorporate EVaR optimization within the Markov Decision Process (MDP) framework. The proposed algorithms are shown to converge and perform effectively through numerical experiments, demonstrating the practicality and relevance of EVaR for robust decision-making in RL.
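To make the contrast between CVaR and EVaR concrete, the following is a minimal sketch that estimates both measures from a sample of losses. It is an illustration only, not the value iteration or policy gradient algorithms proposed in the thesis. It assumes that larger values are worse (losses rather than rewards), that `alpha` is the tail risk level, and that EVaR is taken in its standard Chernoff-bound form, the infimum over z > 0 of (1/z) log(E[exp(zX)] / alpha). The function names `empirical_cvar` and `empirical_evar` and the search grid for z are hypothetical choices made for this example.

```python
import numpy as np


def empirical_cvar(losses, alpha):
    """CVaR at risk level alpha: the mean of the worst alpha-fraction of losses.

    Assumption: larger values are worse (losses, not rewards).
    """
    x = np.sort(np.asarray(losses, dtype=float))
    n = x.size
    # Empirical (1 - alpha)-quantile, used as the threshold t in the
    # Rockafellar-Uryasev form  t + E[(X - t)^+] / alpha.
    k = int(np.ceil((1.0 - alpha) * n)) - 1
    t = x[min(max(k, 0), n - 1)]
    return t + np.mean(np.maximum(x - t, 0.0)) / alpha


def empirical_evar(losses, alpha, z_grid=None):
    """EVaR at risk level alpha, approximated by a grid search over z > 0.

    Because the infimum is only searched on a finite grid, the result is an
    upper approximation of the empirical EVaR.
    """
    x = np.asarray(losses, dtype=float)
    if z_grid is None:
        z_grid = np.logspace(-3, 2, 400)  # hypothetical search range for z
    shift = x.max()  # subtract the maximum so that exp() cannot overflow
    best = np.inf
    for z in z_grid:
        # log E[exp(z X)] under the empirical distribution, computed stably.
        log_mgf = z * shift + np.log(np.mean(np.exp(z * (x - shift))))
        best = min(best, (log_mgf - np.log(alpha)) / z)
    return best


if __name__ == "__main__":
    losses = np.arange(1, 11)
    print(empirical_cvar(losses, 0.2), empirical_evar(losses, 0.2))
```

For the ten losses 1, ..., 10 and `alpha = 0.2`, the CVaR is 9.5 (the mean of the two worst losses, 9 and 10), and the EVaR estimate is at least as large, reflecting that EVaR upper-bounds CVaR at the same risk level. A one-dimensional convex solver could replace the grid search; the grid is used here only to keep the sketch dependency-free.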
Building upon this exploration of risk measures, we introduce the φ-Divergence-Risk (PhiD-R), a general class of coherent risk measures that includes existing risk measures such as CVaR and EVaR as special cases and extends the potential for RL applications by covering a broader range of risk preferences. The PhiD-R class allows the study of risk-sensitive RL using various φ-divergences, thus creating a flexible framework adaptable to multiple types of uncertainty. For this class of risk measures, we develop a trajectory-based policy gradient method, providing both theoretical convergence guarantees and practical validation through extensive simulation experiments. This work not only enhances our understanding of risk-sensitive learning but also contributes algorithms that are robust and versatile across a range of RL environments.
In addition to exploring risk measures, this dissertation examines the robustness of risk-sensitive RL under Robust MDPs (RMDPs). RMDPs provide a framework for decision-making under worst-case scenarios by optimizing over ambiguity sets, which define possible variations in the transition dynamics. While previous research on RMDPs has largely focused on risk-neutral approaches, we extend this work to risk-sensitive contexts. Leveraging the coherence properties of CVaR, we establish a connection between robustness and risk sensitivity, thereby enabling risk-sensitive RL techniques to solve robust decision-making problems. We further introduce a novel risk measure, NCVaR, specifically designed to handle state-action-dependent uncertainties, a common feature in real-world applications. Through value iteration algorithms and simulations, we validate that NCVaR optimization improves robustness in complex and uncertain RL environments.
The thesis also addresses a critical challenge in RL: exploration. In traditional reward-free RL, the agent explores without a specific reward function, which enables adaptability across various reward settings. However, efficient exploration strategies in risk-sensitive RL are still underdeveloped. To fill this gap, we propose a risk-sensitive reward-free RL framework based on CVaR, aiming to balance efficient exploration with risk constraints. We develop the CVaR-RF-UCRL algorithm, designed to perform effective CVaR-based exploration under risk-sensitive criteria, and establish its performance guarantees by proving that it is probably approximately correct (PAC) with an upper bound on its sample complexity. We also establish a lower bound on the sample complexity of any CVaR-RF algorithm. We further introduce two planning algorithms, CVaR-VI and CVaR-VI-DISC, and validate the approach with empirical experiments, demonstrating its utility in safe and efficient exploration.