ICLR2022
Actor-critic is implicitly biased towards high entropy optimal policies
Yuzheng Hu, Ziwei Ji, Matus Telgarsky
12 citations
Abstract
We show that the simplest actor-critic method -a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or explorationdoes not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like ǫ-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls of policy space, and provides a projection-free TD analysis with its own implicit bias which can be run from an unmixed starting distribution. Algorithm 1 Single-trajectory linear actor-critic. Inputs: actor iterations t and step size θ; critic iterations N and step size η. Initialize: actor weights W 0 = 0 ∈ R d×k , pre-softmax mapping p 0 (s, a) := s T W 0 a, policy π 0 := φ(p 0 ) (cf. eq. (1.1)); sample initial state/action/reward triple (s 0,0 , a 0,0 , r 0,0 ). for i = 0, 1, 2, . . . , t -1 do Critic update: use π i to interact with the MDP, obtaining state/action/reward triples (s i,j , a i,j , r i,j ) j≤N by continuing the existing trajectory, and form TD estimates with initial condition U i,0 = 0. Set U i := 1 N j<N U i,j and Q i (s, a) := s T U i a, and also (s i+1,0 , a i+1,0 , r i+1,0 ) := (s i,N , a i,N , r i,N ) to continue the existing trajectory in next iteration. Actor update: set W i+1 := W i + θ U i and p i+1 := p i + θ Q i and π i+1 := φ(p i+1 ). end for as will be discussed further in Section 1.2. The choice of finite state space is to remove measuretheoretic concerns and to allow a simple characterization of the maximum entropy optimal policy. Lemma 1.2 (simplification of Lemma A.1). Under Assumption 1.1, there exists a unique maximum entropy policy π, which satisfies π(s, •) = Uniform(A s ) for every state s, and moreover has a stationary distribution s π . To round out this introductory presentation of the actor, the last component is the update: , where Q i is the TD estimate of the Q function, to be discussed shortly. This update is explicitly a mirror descent or dual averaging update of the policy, where we use a mirror mapping φ to obtain the policy π i+1 from pre-softmax values p i+1 . As mentioned before, this update appears in prior work in the tabular setting with natural policy gradient and actor-critic (Agarwal et al., 2021b; Khodadadian et al., 2021) , and will be related to other methods in Section 1.2. We will further motivate this update in Section 2. The final assumption and description of the critic are as follows. As will be discussed in Section 2, the policy becomes optimal if Q i is an accurate estimate of the true Q function. We employ a standard TD update with no projections or constraints. To guarantee that this linear model of Q i is accurate, we make a standard linear MDP assumption (