NeurIPS 2022

DASCO: Dual-Generator Adversarial Support Constrained Offline Reinforcement Learning

Quan Vuong, Aviral Kumar, Sergey Levine, Yevgen Chebotar

Abstract

In offline RL, constraining the learned policy to remain close to the data is essential to prevent the policy from outputting out-of-distribution (OOD) actions with erroneously overestimated values. In principle, generative adversarial networks (GANs) can provide an elegant solution, with the discriminator directly providing a probability that quantifies distributional shift. However, in practice, GAN-based offline RL methods have not outperformed alternative approaches, perhaps because the generator is trained both to fool the discriminator and to maximize return, two objectives that are often at odds with each other. In this paper, we show that the issue of conflicting objectives can be resolved by training two generators: one that maximizes return, and another that captures the “remainder” of the data distribution in the offline dataset, such that the mixture of the two is close to the behavior policy. We show that having two generators not only enables an effective GAN-based offline RL method, but also approximates a support constraint, where the policy does not need to match the entire data distribution, but only the slice of the data that leads to high long-term performance. We name our method DASCO, for Dual-Generator Adversarial Support Constrained Offline RL. On benchmark tasks that require learning from sub-optimal data, DASCO significantly outperforms prior methods that enforce a distribution constraint.
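
The two-generator idea can be illustrated with a toy calculation. The sketch below is not the paper's implementation: it assumes a single state with four discrete actions, an equal mixture weight of 0.5, and a closed-form solve in place of adversarial training. The helper name remainder_generator is hypothetical. It shows that a policy concentrated on one in-support action can be completed by a "remainder" distribution so that the mixture equals the behavior policy, while a policy that picks an unseen action cannot.

```python
import numpy as np


def remainder_generator(behavior, policy, weight=0.5):
    """Solve weight * policy + (1 - weight) * aux = behavior for aux.

    Hypothetical helper for illustration only. Raises ValueError when no
    valid probability distribution `aux` exists, i.e. when the policy puts
    more mass on some action than the data can account for.
    """
    behavior = np.asarray(behavior, dtype=float)
    policy = np.asarray(policy, dtype=float)
    aux = (behavior - weight * policy) / (1.0 - weight)
    if np.any(aux < -1e-12):
        raise ValueError("policy places mass the data cannot account for")
    return np.clip(aux, 0.0, None)


# Assumed toy behavior policy: action 3 never appears in the dataset.
behavior = np.array([0.5, 0.3, 0.2, 0.0])

# A policy that commits entirely to one in-support action.
policy = np.array([1.0, 0.0, 0.0, 0.0])

aux = remainder_generator(behavior, policy)
mixture = 0.5 * policy + 0.5 * aux  # matches `behavior` up to float error

# A policy that selects the out-of-distribution action cannot be completed.
ood_policy = np.array([0.0, 0.0, 0.0, 1.0])
try:
    remainder_generator(behavior, ood_policy)
    ood_is_feasible = True
except ValueError:
    ood_is_feasible = False
```

In this toy setting the policy matches only a slice of the data distribution, yet the mixture still matches the behavior policy, which is the sense in which the constraint acts on support rather than on the full distribution. With an equal mixture weight, a fully deterministic policy is feasible here only because the chosen action has behavior probability of at least 0.5.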