AAAI 2023

Preference-Controlled Multi-Objective Reinforcement Learning for Conditional Text Generation

Wenqing Chen, Jidong Tian, Caoyun Fan, Yitian Li, Hao He, Yaohui Jin


Abstract

Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be hard to design for balancing conflicting goals and may lead to oversimplification. Preferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further prove that optimizing this reward model is equivalent to training the Pareto optimal policy. Extensive experiments on benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-line highway show that our method performs competitively, surpassing the oracle method, which uses the ground truth reward function. This highlights its potential for practical applications in complex real-world systems.

Note to Practitioners: Decision-making problems with multiple conflicting objectives are common in real-world applications, e.g., energy management must balance system lifespan, charge-discharge cycles, and energy procurement costs; autonomous driving vehicles must balance safety, speed, and passenger comfort. While multi-objective reinforcement learning (MORL) is an effective framework for these problems, its dependence on pre-defined reward functions can limit its application in complex situations, as a hand-designed reward function often fails to capture the full complexity of the task.
This paper introduces preference-based MORL (Pb-MORL), which uses user preference data to optimize policies, thereby removing the need for manual reward design. Specifically, we construct a multi-objective reward model that aligns with user preferences and demonstrate that optimizing this model can derive Pareto optimal solutions. Pb-MORL is effective and easy to deploy, and we expect it to be applicable to complex systems, e.g., multi-energy management through preference feedback and adaptive autonomous driving policies for diverse situations.
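
The abstract does not specify how the reward model is fitted to preferences. As a rough illustration only, the sketch below assumes a Bradley-Terry preference model over weighted-sum scalarized returns, a common choice in preference-based reinforcement learning; it is not claimed to be the authors' formulation. The names `scalarize`, `preference_probability`, and `preference_loss` are hypothetical, and the per-step reward vectors stand in for the outputs of a learned multi-objective reward model.

```python
import math


def scalarize(reward_vec, weights):
    """Weighted sum of a per-objective reward vector (assumed scalarization)."""
    return sum(w * r for w, r in zip(weights, reward_vec))


def preference_probability(segment_a, segment_b, weights):
    """Bradley-Terry probability that segment_a is preferred over segment_b.

    Each segment is a list of per-step reward vectors, standing in for the
    outputs of a hypothetical learned multi-objective reward model.
    """
    return_a = sum(scalarize(r, weights) for r in segment_a)
    return_b = sum(scalarize(r, weights) for r in segment_b)
    # Logistic function of the scalarized return difference.
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def preference_loss(segment_a, segment_b, weights, label):
    """Cross-entropy loss; label is 1.0 if segment_a was preferred, else 0.0."""
    p = preference_probability(segment_a, segment_b, weights)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# Toy data: segment A scores on objective 0, segment B on objective 1.
seg_a = [[1.0, 0.0], [1.0, 0.0]]
seg_b = [[0.0, 1.0], [0.0, 1.0]]

# A weight vector that favors objective 0 should make A the likelier choice.
p_favor_first = preference_probability(seg_a, seg_b, [0.8, 0.2])
# Equal weights give both segments the same scalarized return.
p_balanced = preference_probability(seg_a, seg_b, [0.5, 0.5])
```

Minimizing `preference_loss` over a dataset of labeled segment pairs would pull the reward model toward agreement with the stated preferences. Conditioning on the weight vector is what lets one model represent different trade-offs between objectives in this sketch.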