NeurIPS2023

Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms

Yashaswini Murthy, Mehrdad Moharrami, R. Srikant

被引用 9 次

摘要

Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor being close to 1, which is equivalent to making the expected horizon very large. However, the corresponding theoretical bounds for error performance scale with the square of the horizon. Thus, even after dividing the total reward by the length of the horizon, the corresponding performance bounds for average reward problems go to infinity. Therefore, an open problem has been to obtain meaningful performance bounds for approximate PI and RL algorithms for the average-reward setting. In this paper, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and show that the asymptotic error goes to zero in the limit as policy evaluation and policy improvement errors go to zero. Preprint. Under review. goal in this paper is to derive similar results for average-reward problems. Average-reward problems [ABB01, WNS21, ZWSW21, Mah96, YB09, Gos04, Sin94] are, in some sense, harder to study than their discounted-reward counterparts. We now discuss why this is so by recalling the error bounds for discounted-reward problems and examining them in an appropriate limit to study their applicability to average-reward problems. The fundamental result on approximate policy iteration for discounted reward MDPs which allows the connection to policy-based methods is the following [BT96]: where α is the discount factor, H α = 1/(1 -α), J α * is the optimal value function, J µ k is the value function associated with the policy obtained after k iterations of approximate policy iteration, ǫ is the policy improvement error, and δ is the policy evaluation error. Thus, as ǫ, δ → 0, we recover the result that standard policy iteration converges. On the other hand, to understand whether the above bound is useful to study average-reward problems, we write Equation (1) as Under mild conditions [Ber11, Ros14], it is well known that each element of the value function vector approaches the average-reward for each fixed policy and for the optimal policy, in the limit α → 1. Hence, the left-hand side of the above equation is an approximation to the error in approximate policy iteration for average-reward MDPs; the right-hand side which gives an upper bound on this error blows up to infinity, i.e., when α → 1. Thus, unlike in the discounted-reward case, the above bound fails to even recover the well-known convergence of standard policy iteration in average-reward case in the limit ǫ, δ → 0 (since we have to let α → 1 before letting ǫ, δ → 0 ). However, as mentioned in [BT96], it is also well known that approximate policy iteration performs much better than the bound suggests. The main goal of our paper is to resolve this discrepancy between theory and practice.