ICLR 2021

Learning Value Functions in Deep Policy Gradients using Residual Variance

Yannis Flet-Berliac, Reda Ouhamma, Odalric-Ambrym Maillard, Philippe Preux

Cited by 16

Abstract

Consider a stochastic policy $\pi_\theta(a \mid s)$ with parameter $\theta$. An agent in state $s_t$ interacts with an environment by sampling an action $a_t \sim \pi_\theta(\cdot \mid s_t)$, receives a reward $r_t$, and transitions to a new state $s_{t+1}$.
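
The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tabular softmax parameterization, the ring-shaped toy environment, and the names `softmax_policy`, `toy_step`, and `rollout` are all assumptions introduced here for the example.

```python
import numpy as np


def softmax_policy(theta, state):
    """Return pi_theta(. | state) for a tabular softmax policy (assumed form)."""
    logits = theta[state]
    z = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return z / z.sum()


def toy_step(state, action, n_states):
    """Hypothetical environment: walk on a ring, reward 1 on reaching state 0."""
    next_state = (state + (1 if action == 1 else -1)) % n_states
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward


def rollout(theta, n_steps, seed=0):
    """Collect (s_t, a_t, r_t, s_{t+1}) tuples by following the policy."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = theta.shape
    state = 0
    trajectory = []
    for _ in range(n_steps):
        probs = softmax_policy(theta, state)
        action = int(rng.choice(n_actions, p=probs))  # a_t ~ pi_theta(. | s_t)
        next_state, reward = toy_step(state, action, n_states)  # r_t, s_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


theta = np.zeros((4, 2))  # 4 states, 2 actions; zeros give a uniform policy
trajectory = rollout(theta, n_steps=5)
```

Each tuple in `trajectory` holds one transition $(s_t, a_t, r_t, s_{t+1})$, the kind of data a policy gradient method would typically consume to estimate returns and value functions.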