ICLR2023

Multi-Objective Reinforcement Learning: Convexity, Stationarity and Pareto Optimality

Haoye Lu, Daniel Herman, Yaoliang Yu

摘要

In recent years, single-objective reinforcement learning (SORL) algorithms have received a significant amount of attention and seen some strong results. However, it is generally recognized that many practical problems have intrinsic multi-objective properties that cannot be easily handled by SORL algorithms. Although there have been many multi-objective reinforcement learning (MORL) algorithms proposed, there has been little recent exploration of the fundamental properties of the spaces we are learning in. In this paper, we perform a rigorous analysis of policy induced value functions and use the insights to distinguish three views of Pareto optimality. The results imply the convexity of the induced value function's range for stationary policies and suggest that any point of its Pareto front can be achieved by training a policy using linear scalarization (LS). We show the problem that leads to the suboptimal performance of LS can be solved by adding strongly concave terms to the immediate rewards, which motivates us to propose a new vector reward-based Q-learning algorithm, CAPQL. Combined with an actor-critic formulation, our algorithm achieves state-of-the-art performance on multiple MuJoCo tasks in the preference agnostic setting. Furthermore, we empirically show that, in contrast to other LS-based algorithms, our approach is significantly more stable, achieving similar results across various random seeds. Published as a conference paper at ICLR 2023 of policy alterations on the induced value function. We find that single-state policy alterations are insufficient to optimize the induced value function in a MORL setting, but show how it can be done by a multi-state update. We also prove that improving in all states is generally not possible. From here, we show that the range of value functions is convex, which suggests that linear scalarization (LS) is not the bottleneck in finding PE policies. We discuss the deficiencies of existing LS-based algorithms as suggested by our theory and fix them by augmenting the reward function using a strongly convex term. These insights motivate us to propose a new MORL algorithm (CAPQL) which achieves state-of-the-art performance on multiple MoJoCo environments in the preference agnostic setting. An ablation study is performed to understand how augmentation affects the algorithm's performance. MULTI-OBJECTIVE RL PROBLEM To begin, we will do a quick review of MORL problems and the notation we will be using, as well as introduce our definitions of Pareto optimality. Like SORL problems, we consider an agent interacting with an environment. At each step, the agent performs an action based on the current state and the environment returns a reward and the next state. Our setting assumes a vector reward in R d and is reduced to a SORL problem if d = 1. We model the interaction as a Markov Decision Process (MDP) (S, A, R, P, γ). As usual, S and A are the sets of states and actions, and γ ∈ (0, 1) is the discount factor. Our discussion considers finite A and S.When the agent takes action a ∈ A in state s ∈ S, the environment gives reward R(a, s) ∈ R d and moves to the next state following the transition probability P (a, s) ∈ ∆ |S| . In this paper, we consider an infinite-horizon MORL problem and assume bounded rewards. Let R(s) = [R(a, s)|a ∈ A] ∈ R d×|A| and P(s) = [P (a, s)|a ∈ A] ∈ R |S|×|A| , Π the set of all stationary policies, where π ∈ Π maps a state to a distribution over actions. Following the work of Roijers et al. ( 2013 ), given π ∈ Π, the induced value function V π (s) ∈ R d returns the expected sum of discounted reward over the interaction trajectory with the initial state s, 2 V π (s) := E ∞ t=0 γ t R(a t , s t ) with s t ∼P (a t-1 , s t-1 ), a t-1 ∼π(s t-1 ), s 0 = s. (1) Let µ : S → [0, 1] be the probability distribution over initial states. The expected value function is: γ t R(a t , s t ) with s t ∼ P (a t-1 , s t-1 ), a t-1 ∼ π(s t-1 ) and s 0 ∼ µ . (2)