NeurIPS 2022
Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation
Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao
Abstract
Curriculum Reinforcement Learning (CRL) aims to create a sequence of tasks, starting from easy ones and gradually progressing towards difficult ones. In this work, we focus on the idea of framing CRL as interpolations between a source (auxiliary) and a target task distribution. Although existing studies have shown the great potential of this idea, it remains unclear how to formally quantify and generate the movement between task distributions. Inspired by insights from gradual domain adaptation in semi-supervised learning, we create a natural curriculum by breaking down the potentially large task distributional shift in CRL into smaller shifts. We propose GRADIENT, which formulates CRL as an optimal transport problem with a tailored distance metric between tasks. Specifically, we generate a sequence of task distributions as a geodesic interpolation (i.e., Wasserstein barycenter) between the source and target distributions. Unlike many existing methods, our algorithm considers a task-dependent contextual distance metric and is capable of handling nonparametric distributions in both continuous and discrete context settings. In addition, we theoretically show that GRADIENT enables smooth transfer between subsequent stages in the curriculum under certain conditions. We conduct extensive experiments in locomotion and manipulation tasks and show that GRADIENT achieves higher performance than baselines in terms of learning efficiency and asymptotic performance.

However, most existing methods that interpret the curriculum as shifting distributions use the Kullback-Leibler (KL) divergence to measure the distance between distributions. This setting imposes several restrictions. First, due to either the problem formulation or computational feasibility, existing methods often require the distribution to be parameterized, e.g., as a Gaussian [8, 9, 10, 13], which limits their use in practice.
Second, most existing algorithms that use KL divergence implicitly assume an $\ell_2$ Euclidean space, which ignores the manifold structure that arises when parameterizing RL environments [14].

In light of these issues with existing CRL methods, we propose GRADIENT, an algorithm that uses Optimal Transport (OT) to create a sequence of task distributions gradually morphing from the source to the target distribution. GRADIENT approaches CRL from a gradual domain adaptation (GDA) perspective, breaking the potentially large domain shift between the source and the target into smaller shifts to enable efficient and smooth policy transfer. In this work, we first define a distance metric between individual tasks. We then find a series of task distributions that interpolate between the easy and the difficult task distributions by computing Wasserstein barycenters. GRADIENT can handle both discrete and continuous environment parameter spaces, as well as nonparametric distributions (represented either by explicit categorical distributions or by implicit empirical distributions of particles). Under some conditions [15], GRADIENT provably ensures a smooth adaptation from one stage to the next.

We summarize our main contributions as follows:
1. We propose GRADIENT, a novel CRL framework based on optimal transport that generates gradually morphing intermediate task distributions. As a result, GRADIENT requires little effort to transfer between subsequent stages and therefore improves the learning efficiency towards difficult tasks.
2. We develop the π-contextual-distance to measure task similarity and compute Wasserstein barycenters as intermediate task distributions. Our proposed method can handle both continuous and discrete context spaces as well as nonparametric distributions. We also prove a theoretical bound on policy transfer performance, which leads to practical insights.
3. We demonstrate empirically that GRADIENT achieves higher learning efficiency and asymptotic performance than state-of-the-art CRL baselines in a wide range of locomotion and manipulation tasks.

Related Work

Curriculum reinforcement learning. Curriculum reinforcement learning (CRL) [6, 16] focuses on the generation of training environments for RL agents. CRL can pursue several objectives: improving learning efficiency on difficult tasks (time-to-threshold), maximizing return (asymptotic performance), or transferring policies to unseen tasks (generalization). From a domain randomization perspective, Active Domain Randomization [5, 17] uses curricula to diversify the physical parameters of the simulator to facilitate generalization in sim-to-real transfer. From a game-theoretical perspective, adversarial training has also been developed to improve the robustness of RL agents in unseen environments [18, 19, 20, 21]. From an intrinsic motivation perspective, methods have been proposed to create curricula even in the absence of a target task to be accomplished [22, 13, 23].

CRL as an interpolation of distributions.
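As a rough illustration of a curriculum built by interpolating between a source and a target task distribution, the sketch below computes a Wasserstein geodesic in a deliberately simplified setting. It is not the GRADIENT implementation. It assumes one-dimensional contexts, equally many equally weighted particles on both sides, and a squared Euclidean ground cost in place of the paper's π-contextual-distance. Under those assumptions the optimal coupling simply matches sorted particles, and the barycenter of the two distributions with weights (1 - t, t) coincides with the displacement interpolant. The function names `wasserstein_geodesic_1d`, `w2_1d`, and `curriculum_stages` are hypothetical.

```python
import math
from typing import List, Sequence


def wasserstein_geodesic_1d(
    source: Sequence[float], target: Sequence[float], t: float
) -> List[float]:
    """Particles of the W2 displacement interpolant at t in [0, 1].

    Assumption (not from the paper): 1-D contexts, equal particle counts,
    uniform weights, squared Euclidean ground cost.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes equally many particles")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # In 1-D with a convex cost, the optimal coupling pairs sorted particles.
    src, tgt = sorted(source), sorted(target)
    return [(1.0 - t) * s + t * g for s, g in zip(src, tgt)]


def w2_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """2-Wasserstein distance between two equal-size 1-D particle sets."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally many particles")
    pairs = zip(sorted(a), sorted(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(a))


def curriculum_stages(
    source: Sequence[float], target: Sequence[float], num_stages: int
) -> List[List[float]]:
    """Evenly spaced task distributions from source (t=0) to target (t=1)."""
    return [
        wasserstein_geodesic_1d(source, target, k / num_stages)
        for k in range(num_stages + 1)
    ]


# Hypothetical context particles, e.g. an easy and a hard range of one
# environment parameter.
source = [0.0, 0.2, 0.1]
target = [1.0, 0.8, 0.9]
stages = curriculum_stages(source, target, num_stages=4)
step_sizes = [w2_1d(stages[k], stages[k + 1]) for k in range(4)]
```

In this toy setting every entry of `step_sizes` is the same fraction of the total source-to-target distance, which mirrors the idea of splitting one large distributional shift into equal smaller ones. With a task-dependent ground metric, or with multi-dimensional or discrete contexts, the sorted matching no longer applies and a general optimal transport solver would be needed.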