ICML2021

Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

Andrea Zanette

75 citations

Abstract

Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near optimal policy or to 2) estimate the value of a target policy. For both tasks we derive exponential information-theoretic lower bounds in discounted infinite horizon MDPs with a linear function representation for the action value function even if 1) realizability holds, 2) the batch algorithm observes the exact reward and transition functions, and 3) the batch algorithm is given the best a priori data distribution for the problem class. Furthermore, if the dataset does not come from policy rollouts then the lower bounds hold even if the action-value function of every policy admits a linear representation. If the objective is to find a near-optimal policy, we discover that these hard instances are easily solved by an online algorithm, showing that there exist RL problems where batch RL is exponentially harder than online RL even under the most favorable batch data distribution. In other words, online exploration is critical to enable sample efficient RL with function approximation. A second corollary is the exponential separation between finite and infinite horizon batch problems under our assumptions. On a technical level, this work introduces a new `oracle + batch algorithm' framework to prove lower bounds that hold for every distribution, and automatically recovers traditional fixed distribution lower bounds as a special case. Finally this work helps formalize the issue known as deadly triad and explains that the bootstrapping problem is potentially more severe than the extrapolation issue for RL because unlike the latter, bootstrapping cannot be mitigated by adding more samples.