ICLR 2026

Predictive CVaR Q-learning

Ju-Hyun Kim, Seungki Min

Cited by 2

Abstract

We propose a sample-efficient Q-learning algorithm for reinforcement learning with the Conditional Value-at-Risk (CVaR) objective. Our method introduces two key innovations. First, we propose the predictive tail value function, a novel formulation of the risk-sensitive action value that admits a recursive structure, as in the conventional risk-neutral Bellman equation. This formulation addresses the problem of noisy policy evaluation that originates from the non-decomposable objective. Second, we introduce a two-way exploration strategy that explores the agent's risk-sensitivity level in addition to its actions. This technique mitigates the "blindness to success" phenomenon by preventing premature convergence to overly conservative policies. We establish a rigorous theoretical foundation for this framework, including a new Bellman optimality equation and a policy improvement theorem. Empirical results demonstrate that our algorithm significantly improves both CVaR performance and learning stability.
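For readers unfamiliar with the CVaR objective named above, the following is a minimal sketch of how CVaR can be estimated from a batch of sampled episode returns. It is a generic illustration, not the paper's algorithm: the function names, the sample data, and the convention that CVaR at level `alpha` is the mean of the worst `alpha` fraction of returns are all assumptions made here. The second function uses the standard Rockafellar-Uryasev variational form, which on this sample gives the same value as the direct tail average.

```python
import numpy as np


def empirical_cvar(returns, alpha):
    """Lower-tail CVaR: mean of the worst ceil(alpha * n) sampled returns.

    Assumed convention: higher returns are better and alpha in (0, 1].
    """
    x = np.sort(np.asarray(returns, dtype=float))  # ascending, worst first
    k = max(1, int(np.ceil(alpha * len(x))))       # size of the tail
    return float(x[:k].mean())


def cvar_variational(returns, alpha):
    """Rockafellar-Uryasev form: max over c of c - E[(c - X)^+] / alpha.

    The objective is concave and piecewise linear in c, so for an
    empirical sample a maximizer can be found among the sample points.
    """
    x = np.asarray(returns, dtype=float)
    values = [c - np.maximum(c - x, 0.0).mean() / alpha for c in x]
    return float(max(values))


# Hypothetical returns from ten episodes (illustrative data only).
returns = [10.0, -5.0, 3.0, 8.0, -20.0, 1.0, 6.0, 4.0, 2.0, 7.0]

tail_average = empirical_cvar(returns, alpha=0.2)    # mean of -20 and -5
variational = cvar_variational(returns, alpha=0.2)   # same value via max over c
risk_neutral = empirical_cvar(returns, alpha=1.0)    # plain mean of all returns
```

With `alpha = 1.0` the estimate reduces to the ordinary mean, which is the risk-neutral objective; smaller values of `alpha` weight only the worst outcomes.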