NeurIPS 2021

Optimal Policies Tend To Seek Power

Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli

111 citations

Abstract

Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of the objectives we specify for them. Other researchers point out that RL agents need not have human-like power-seeking instincts. To clarify this discussion, we develop the first formal theory of the statistical tendencies of optimal policies. In the context of Markov decision processes (MDPs), we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that in these environments, most reward functions make it optimal to seek power by keeping a range of options available and, when maximizing average reward, by navigating towards larger sets of potential terminal states.

Some actions have a greater probability of being optimal

We claim that optimal policies "tend" to take certain actions in certain situations. We first consider the probability that certain actions are optimal.

Reconsider the reward function $\mathbf{e}_{r}$, optimized at $\gamma = \frac{1}{2}$. Starting from the initial state, the optimal trajectory goes right to $r$ to $r$, where the agent remains. The right action is optimal at the initial state under these incentives. Optimal policy sets capture the behavior incentivized by a reward function and a discount rate.

Definition 4.1 (Optimal policy set function). $\Pi^*(R, \gamma)$ is the optimal policy set for reward function $R$ at $\gamma \in (0, 1)$. All $R$ have at least one optimal policy $\pi \in \Pi$ [Puterman, 2014]. $\Pi^*(R, 0) := \lim_{\gamma \to 0} \Pi^*(R, \gamma)$ and $\Pi^*(R, 1) := \lim_{\gamma \to 1} \Pi^*(R, \gamma)$ exist by lemma E.35 (taking the limits with respect to the discrete topology over policy sets).

We may be unsure which reward function an agent will optimize. We may expect to deploy a system in a known environment, without knowing the exact form of e.g. the reward shaping [Ng et al., 1999] or intrinsic motivation [Pathak et al., 2017].
Alternatively, one might attempt to reason about future RL agents, whose details are unknown. Our power-seeking results do not hinge on such uncertainty, as they also apply to degenerate distributions (i.e. we know what reward function will be optimized).

Definition 4.2 (Reward function distributions). Different results make different distributional assumptions. Results with $\mathcal{D}_{\text{any}} \in \mathfrak{D}_{\text{any}} := \Delta(\mathbb{R}^{|\mathcal{S}|})$ hold for any probability distribution over $\mathbb{R}^{|\mathcal{S}|}$. $\mathfrak{D}_{\text{bound}}$ is the set of bounded-support probability distributions $\mathcal{D}_{\text{bound}}$. For any distribution $X$ over $\mathbb{R}$, $\mathcal{D}_{X\text{-IID}} := X^{|\mathcal{S}|}$. For example, when $X_u := \text{unif}(0, 1)$, $\mathcal{D}_{X_u\text{-IID}}$ is the maximum-entropy distribution. $\mathcal{D}_{s}$ is the degenerate distribution on the state indicator reward function $\mathbf{e}_s$, which assigns reward 1 to $s$ and 0 elsewhere.

With $\mathcal{D}_{\text{any}}$ representing our prior beliefs about the agent's reward function, what behavior should we expect from its optimal policies? Perhaps we want to reason about the probability that it's optimal to go from the initial state to $\varnothing$, or to go to $r$ and then stay at $r$. In this case, we quantify the optimality probability of $F := \left\{ \mathbf{e} + \frac{\gamma}{1-\gamma} \mathbf{e}_{\varnothing},\ \mathbf{e} + \gamma \mathbf{e}_{r} + \frac{\gamma^2}{1-\gamma} \mathbf{e}_{r} \right\}$.

Definition 4.3 (Visit distribution optimality probability). Let $F \subseteq \mathcal{F}(s)$, $\gamma \in [0, 1]$. $\mathbb{P}_{\mathcal{D}_{\text{any}}}(F, \gamma) := \mathbb{P}_{R \sim \mathcal{D}_{\text{any}}}\left( \exists \mathbf{f}^{\pi} \in F : \pi \in \Pi^*(R, \gamma) \right)$.

Alternatively, perhaps we're interested in the probability that right is optimal at the initial state.

Definition 4.4 (Action optimality probability). At discount rate $\gamma$ and at state $s$, the optimality probability of action $a$ is $\mathbb{P}_{\mathcal{D}_{\text{any}}}(s, a, \gamma) := \mathbb{P}_{R \sim \mathcal{D}_{\text{any}}}\left( \exists \pi^* \in \Pi^*(R, \gamma) : \pi^*(s) = a \right)$.

Proposition 6.9 (Keeping options open tends to be POWER-seeking and tends to be optimal). Suppose $F_a := \mathcal{F}(s \mid \pi(s) = a)$ contains a copy of $F_{a'} := \mathcal{F}(s \mid \pi(s) = a')$ via $\phi$.

1. If $s \in \text{REACH}(s, a')$, then $\forall \gamma \in [0, 1] : \mathbb{E}_{s_a \sim T(s,a)}\left[ \text{POWER}_{\mathcal{D}_{\text{bound}}}(s_a, \gamma) \right] \geq_{\text{most: } \mathfrak{D}_{\text{bound}}} \mathbb{E}_{s_{a'} \sim T(s,a')}\left[ \text{POWER}_{\mathcal{D}_{\text{bound}}}(s_{a'}, \gamma) \right]$.
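
The action optimality probability of Definition 4.4 can be made concrete with a small simulation. The following is a minimal sketch, not the paper's method or environment: it assumes a hypothetical five-state deterministic MDP, a state-based reward drawn IID from unif(0, 1), and a discount rate strictly between 0 and 1. The names `NEXT_STATE`, `optimal_values`, and `action_optimality_probability` are illustrative. It samples reward functions, solves each by value iteration, and counts how often the given action attains the optimal value at the given state.

```python
import random

# Hypothetical deterministic toy MDP (an assumption, not the paper's example):
# state 0 is the start; "left" leads to absorbing state 1, while "right"
# leads to state 2, from which absorbing states 3 and 4 are both reachable.
NEXT_STATE = {
    0: {"left": 1, "right": 2},
    1: {"stay": 1},
    2: {"up": 3, "down": 4},
    3: {"stay": 3},
    4: {"stay": 4},
}


def optimal_values(reward, gamma, tol=1e-10):
    """Value iteration for a state-based reward; requires 0 < gamma < 1."""
    values = [0.0] * len(NEXT_STATE)
    while True:
        updated = [
            reward[s] + gamma * max(values[t] for t in NEXT_STATE[s].values())
            for s in range(len(NEXT_STATE))
        ]
        if max(abs(u - v) for u, v in zip(updated, values)) < tol:
            return updated
        values = updated


def action_optimality_probability(state, action, gamma, n_samples=2000, seed=0):
    """Monte Carlo estimate of the probability that `action` is optimal at `state`
    when each state's reward is drawn IID from unif(0, 1)."""
    rng = random.Random(seed)
    n_states = len(NEXT_STATE)
    hits = 0
    for _ in range(n_samples):
        reward = [rng.random() for _ in range(n_states)]
        values = optimal_values(reward, gamma)
        # Reward depends only on the current state, so an action is optimal
        # exactly when its successor state has the highest optimal value.
        best = max(values[t] for t in NEXT_STATE[state].values())
        if values[NEXT_STATE[state][action]] >= best - 1e-8:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    for act in ("left", "right"):
        print(act, action_optimality_probability(0, act, gamma=0.5))
```

In this toy environment, going right keeps two absorbing states reachable while going left commits to one, so the estimate for right should exceed the estimate for left. A direct calculation for this particular example at $\gamma = \frac{1}{2}$ gives $\frac{7}{12}$ for right, which the sampled estimate should approximate.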