NeurIPS2023

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

Jiayi Huang, Han Zhong, Liwei Wang, Lin Yang

被引用 15 次

摘要

While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample or time-efficient algorithms for RL with large state-action space exist when the rewards are heavy-tailed, i.e., with only finite (1+ϵ)(1+\epsilon)-th moments for some ϵ(0,1]\epsilon\in(0,1]. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, Heavy-OFUL, for heavy-tailed linear bandits, achieving an instance-dependent TT-round regret of O~(dT1ϵ2(1+ϵ)t=1Tνt2+dT1ϵ2(1+ϵ))\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big), the first of this kind. Here, dd is the feature dimension, and νt1+ϵ\nu_t^{1+\epsilon} is the (1+ϵ)(1+\epsilon)-th central moment of the reward at the tt-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL settings with linear function approximation. Our algorithm, termed as Heavy-LSVI-UCB, achieves the first computationally efficient instance-dependent KK-episode regret of O~(dHUK11+ϵ+dHVK)\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d \sqrt{H \mathcal{V}^* K}). Here, HH is length of the episode, and U,V\mathcal{U}^*, \mathcal{V}^* are instance-dependent quantities scaling with the central moment of reward and value functions, respectively. We also provide a matching minimax lower bound Ω(dHK11+ϵ+dH3K)\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}) to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.