ICLR2026
Towards a Sharp Analysis of Offline Policy Learning for -Divergence-Regularized Contextual Bandits
Qingyue Zhao, Kaixuan Ji, Heyang Zhao, Tong Zhang, Quanquan Gu
9 citations
Abstract
Many offline reinforcement learning algorithms are underpinned by -divergence regularization, but their sample complexity defined with respect to regularized objectives still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the sample complexity for offline -divergence-regularized contextual bandits. For reverse Kullback–Leibler (KL) divergence, arguably the most commonly used one, we achieve an sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing existing bound under all-policy concentrability and bound under single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL. Moreover, for -divergences with strongly convex , to which reverse KL does not belong, we show that the sharp sample complexity is achievable even without pessimistic estimation or single-policy concentrability. We further corroborate our theoretical insights with numerical experiments and extend our analysis to contextual dueling bandits. We believe these results take a significant step towards a comprehensive understanding of objectives with -divergence regularization.