NeurIPS 2021

Co-Adaptation of Algorithmic and Implementational Innovations in Inference-based Deep Reinforcement Learning

Hiroki Furuta, Tadashi Kozuno, Tatsuya Matsushima, Yutaka Matsuo, Shixiang Shane Gu

Abstract

Recently, many algorithms have been devised for reinforcement learning (RL) with function approximation. While they have clear algorithmic distinctions, they also have many implementation differences that are algorithm-independent and sometimes under-emphasized. Such mixing of algorithmic novelty and implementation craftsmanship makes rigorous analyses of the sources of performance improvements across algorithms difficult. In this work, we focus on a series of off-policy inference-based actor-critic algorithms (MPO, AWR, and SAC) to decouple their algorithmic innovations and implementation decisions. We present unified derivations through a single control-as-inference objective, where we can categorize each algorithm as based on either Expectation-Maximization (EM) or direct Kullback-Leibler (KL) divergence minimization and treat the rest of the specifications as implementation details. We performed extensive ablation studies and identified substantial performance drops whenever implementation details are mismatched for algorithmic choices. These results show which implementation or code details are co-adapted and co-evolved with algorithms, and which are transferable across algorithms: for example, we identified that the tanh Gaussian policy and network sizes are highly adapted to algorithmic types, while layer normalization and ELU are critical for MPO's performance but also transfer to noticeable gains in SAC. We hope our work can inspire future work to further demystify sources of performance improvements across multiple algorithms and allow researchers to build on one another's algorithmic and implementational innovations.

Recently, there has been a series of off-policy algorithms derived from this perspective for learning policies with function approximation [2, 22, 48]. Notably, Soft Actor-Critic (SAC) [22, 23], based on a maximum entropy objective and soft Q-function, significantly outperforms on-policy [52, 54] and off-policy [36, 18, 13] methods.
Maximum a posteriori Policy Optimisation (MPO) [2],

Related Work

Inference-based RL algorithms. RL as probabilistic inference has been studied in several prior contexts [60, 61, 34, 11, 57, 45], but many recently proposed algorithms [2, 22, 48] are derived separately, and their exact relationships are difficult to discern directly because of the mixing of algorithmic and implementational details, inconsistent implementation choices, environment-specific tuning, and benchmark differences. Our work organizes them as a unified policy iteration method to clarify their exact mathematical algorithmic connections and to tease out subtle but important implementation differences. We center our analyses around MPO, AWR, and SAC because they are representative algorithms that span both EM-based [50, 43, 44, 1, 56, 42] and KL-control-based [51, 12, 29, 21, 33, 32] RL and achieve some of the most competitive performance on popular benchmarks [6, 58]. REPS [50], an EM approach, inspired MPO, AWR, and our unified objective in Eq. 1, while Soft Q-learning [21], a practical extension of KL control to continuous action spaces through Liu and Wang [38], directly led to the development of SAC.

Meta analyses of RL algorithms. While many papers propose novel algorithms, some recent works have focused on meta analyses of popular algorithms, which have attracted significant attention because of these algorithms' high-variance evaluation performance, reproducibility difficulties [9, 28, 65], and frequent code-level optimizations [25, 63, 10, 3]. Henderson et al. [25] empirically showed that these RL algorithms have inconsistent results across different official implementations and high variance even across runs with the same hyper-parameters, and recommended a concrete action item for the community: use more random seeds. Tucker et al. [63] show that the high performance of action-dependent baselines [19, 17, 37, 20, 67] was more directly due to different subtle implementation choices. Engstrom et al.
[10] focus solely on PPO and TRPO, two on-policy algorithms, and discuss