NeurIPS2023

Policy Finetuning in Reinforcement Learning via Design of Experiments using Offline Data

Ruiqi Zhang, Andrea Zanette

11 citations

Abstract

In some applications of reinforcement learning, a dataset of pre-collected experience is already available but it is also possible to acquire some additional online data to help improve the quality of the policy. However, it may be preferable to gather additional data with a single, non-reactive exploration policy and avoid the engineering costs associated with switching policies. In this paper we propose an algorithm with provable guarantees that can leverage an offline dataset to design a single non-reactive policy for exploration. We theoretically analyze the algorithm and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected. Related Work In this section we discuss some related literature. Our work is related to low-switching algorithms, but unlike those, we focus on the limit case where no-switiches are allowed. For more related work about low-switching algorithms, offline RL, task-agnostic RL, and reward-free RL we refer to Appendix F. Low -switching RL In reinforcement learning, [Bai et al., 2019] first proposed Q-learning with UCB2 exploration, proving an O(H 3 |S| |A| log K) switching cost. This was later improved by a factor of H by the UCBadvantage algorithm in [Zhang et al., 2020b]. Recently, [Qiao et al., 2022] generalized the policy elimination algorithm from [Cesa-Bianchi et al., 2013] and introduced APEVE, which attains an optimal O(H |S| |A| log log K) switching cost. The reward-free version of their algorithm (which is not regret minimizing) has an O(H |S| |A|) switching cost. Similar ideas were soon applied in RL with linear function approximation [Gao et al., 2021, Wang et al., 2021, Qiao and Wang, 2022] and general function approximation [Qiao et al., 2023]. Additionally, numerous research efforts have focused on low-adaptivity in other learning domains, such as batched dueling bandits [Agarwal et al., 2022], batched convex optimization [Duchi et al., 2018], linear contextual bandits [Ruan et al., 2021], and deployment-efficient RL [Huang et al., 2022]. Our work was inspired by the problem of non-reactive policy design in linear contextual bandits. Given access to an offline dataset, [Zanette et al., 2021a] proposed an algorithm to output a single exploratory policy, which generates a dataset from which a near-optimal policy can be extracted. However, there are a number of additional challenges which arise in reinforcement learning, including the fact that the state space is only partially explored in the offline dataset. In fact, in reinforcement learning, [Xiao et al., 2022] established an exponential lower bound for any non-adaptive policy learning algorithm starting from tabula rasa. Setup Throughout this paper, we let [n] = 1, 2, ..., n. We adopt the big-O notation, where O(•) suppresses poly-log factors of the input parameters. We indicate the cardinality of a set X with |X |. Let us define the comparator policy π † * used for the comparison in eq. ( 5 .2) to be the (deterministic) policy with the highest value function on the sparsified MDP: π † * := arg max π∈Π V 1 (s 1 ; P † , r † , π). We can bound the suboptimality using the triangle inequality as