ICLR 2021
Learning Value Functions in Deep Policy Gradients using Residual Variance
Yannis Flet-Berliac, Reda Ouhamma, Odalric-Ambrym Maillard, Philippe Preux
Abstract
We consider a stochastic policy $\pi_\theta(a|s)$ with parameter $\theta$. An agent in state $s_t$ interacts with an environment by sampling an action $a_t \sim \pi_\theta(\cdot|s_t)$, receives a reward $r_t$, and transitions to a new state $s_{t+1}$.
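The interaction loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the tabular softmax parameterization of the policy, the hypothetical `toy_step` chain environment, and the names `policy_probs` and `rollout`.

```python
import math
import random


def policy_probs(theta, s):
    """Return pi_theta(. | s) as a softmax over the logits theta[s] (assumed form)."""
    logits = theta[s]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def toy_step(s, a, n_states):
    """Hypothetical dynamics: action 1 moves right, action 0 moves left (clipped).

    The reward is 1.0 when the next state is the last state, else 0.0.
    """
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next


def rollout(theta, s0, horizon, rng):
    """Collect (s_t, a_t, r_t, s_{t+1}) tuples by following pi_theta."""
    n_states = len(theta)
    s, trajectory = s0, []
    for _ in range(horizon):
        probs = policy_probs(theta, s)
        # Sample a_t ~ pi_theta(. | s_t)
        a = rng.choices(range(len(probs)), weights=probs)[0]
        # Receive r_t and transition to s_{t+1}
        r, s_next = toy_step(s, a, n_states)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory


# Usage: 5 states, 2 actions, logits that slightly favor moving right.
theta = [[0.0, 1.0] for _ in range(5)]
trajectory = rollout(theta, s0=0, horizon=10, rng=random.Random(0))
```

Each tuple in `trajectory` corresponds to one step of the loop: the state $s_t$, the sampled action $a_t$, the reward $r_t$, and the next state $s_{t+1}$.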