NeurIPS 2022

The Phenomenon of Policy Churn

Tom Schaul, André Barreto, John Quan, Georg Ostrovski

34 citations

Abstract

We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs, the most likely one being deep learning with high-variance updates. Finally, we hypothesise that policy churn is a potentially beneficial but overlooked form of implicit exploration, which casts ε-greedy exploration in a fresh light, namely that ε-noise plays a much smaller role than expected.

The Phenomenon

Reinforcement learning (RL) involves agents that incrementally update their policy. This process is driven by the objective of maximising reward, and based on experience that the agent generates via exploration. The sequence of policies π_0, ..., π_k, ..., π_T usually starts from a randomly initialised policy π_0 and aims to end at a near-optimal policy π_T ≈ π*. Ideally, steps in that sequence (π_k → π_{k+1}) are policy improvements that increase expected reward.

This paper studies the amount of policy change that goes along with such a policy update process (for a definition, see Section 1.1). In particular, it makes the core observation that policy change in practice (as illustrated in some typical deep RL settings) is orders of magnitude larger than could have been expected, and stands in contrast to various reference algorithms (Sections 1.2 and 3.3).

Key observation 1: The greedy policy changes much more rapidly than you probably think.
As a coarse magnitude for the impatient reader: in a typical run of DQN on Atari, the greedy policy changes in ≈ 10% of all states after a single gradient update (Figure 1 and Section 1.2). We dub this phenomenon "policy churn" to highlight that most of this policy change may be unnecessary. We study the phenomenon in depth, determining the range of deep RL scenarios it appears in, fleshing out its properties, and in the process narrowing the space of potential causes and mechanisms involved using a set of ablations (Section 3).

Our second key message relates the phenomenon of churn to exploration, specifically in the context of ε-greedy exploration (Section 2), with some more speculative ramifications in Section 4.

Key observation 2: Policy churn is a significant driver of exploration. This holds both in the sense that reducing churn can reduce performance, and in the sense that explicitly adding noise becomes unnecessary in the presence of churn (i.e., ε = 0 is viable).

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
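
To make the quantity behind Key observation 1 concrete, the following is a minimal sketch of one plausible way to measure greedy policy change: the fraction of states whose greedy (argmax) action differs between the action values before and after an update. It is an illustration under stated assumptions, not the paper's formal definition (which is given in Section 1.1). The function names and the toy Q-values are hypothetical, and a small tabular set of states stands in for the sampled states a deep RL agent would use.

```python
from typing import Sequence


def greedy_action(q_values: Sequence[float]) -> int:
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def greedy_policy_change(
    q_before: Sequence[Sequence[float]],
    q_after: Sequence[Sequence[float]],
) -> float:
    """Fraction of states whose greedy action differs between two Q-tables.

    Assumption: both tables list Q-values for the same states in the same order.
    """
    if len(q_before) != len(q_after):
        raise ValueError("both Q-tables must cover the same states")
    if not q_before:
        raise ValueError("at least one state is required")
    changed = sum(
        greedy_action(before) != greedy_action(after)
        for before, after in zip(q_before, q_after)
    )
    return changed / len(q_before)


# Hypothetical toy data: 4 states, 2 actions, Q-values before and after one update.
q_before = [[1.0, 0.5], [0.2, 0.3], [0.0, 1.0], [0.7, 0.1]]
q_after = [[1.0, 0.6], [0.4, 0.3], [0.0, 1.0], [0.7, 0.1]]

churn = greedy_policy_change(q_before, q_after)
print(f"greedy action changed in {churn:.0%} of states")
```

In this toy example only the second state switches its greedy action, so the measured change is one state out of four. Note that a small shift in Q-values is enough to flip the argmax when two actions have similar values, which is one reason the greedy policy can change in many states even after a single update.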