ICML2024

A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs

Kihyuk Hong, Ambuj Tewari

被引用 5 次

摘要

We study offline reinforcement learning (RL) with linear MDPs under the infinite-horizon discounted setting which aims to learn a policy that maximizes the expected discounted cumulative reward using a pre-collected dataset. Existing algorithms for this setting either require a uniform data coverage assumptions or are computationally inefficient for finding an ǫ-optimal policy with O(ǫ -2 ) sample complexity. In this paper, we propose a primal dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting. Our algorithm is the first computationally efficient algorithm in this setting that achieves sample complexity of O(ǫ -2 ) with partial data coverage assumption. Our work is an improvement upon a recent work that requires O(ǫ -4 ) samples. Moreover, we extend our algorithm to work in the offline constrained RL setting that enforces constraints on additional reward signals. Computationally efficient Support constraints N General FQI [22] No Yes No ǫ -2 General CBPL [17] No Yes Yes ǫ -2 General Minimax [31] Yes No No ǫ -2 General CPPO [25] Yes No No ǫ -2 General Minimax [33] No No No ǫ -2 Linear PSPI [31] Yes Yes No ǫ -5 Linear MDPs Primal-Dual [11] Yes Yes No ǫ -4 Linear MDPs Primal-Dual (Ours) Yes Yes Yes ǫ -2 computational efficiency of algorithms for the general function approximation setting is judged based on the efficiency when applied to linear function class. As the table shows, our algorithm is the first computationally efficient algorithm with sample complexity O(ǫ -2 ) for finding ǫ-optimal policy under partial data coverage assumption. Moreover, our algorithm supports constraints on additional reward signals. Offline RL with General Function Approximation Offline RL with general function approximation is widely studied in the discounted infinite-horizon setting. When casting the linear function approximation setting to the general function approximation setting, we get the realizability and Bellman completeness for free when using linear function class since the value function under linear function approximation is linear. In Table 1, we only compared works on general function approximation that assumes realizability, Bellman completeness and data coverage. There are other works that relax Bellman completeness assumption at the cost of introducing another assumption. For example, Hong et al. [13], Xie et al. [32], Zhan et al. [35], and Zhu et al. [36] relax Bellman completeness assumption and introduce marginalized importance weight assumption.