ICLR 2026
Sharp asymptotic theory for Q-learning with LD2Z learning rate and its generalization
Soham Bonnerjee, Zhipeng Lou, Wei Biao Wu
Abstract
Despite the sustained popularity of Q-learning as a practical tool for policy determination, most of the relevant theoretical literature deals with either constant ($\eta_t \equiv \eta$) or polynomially decaying ($\eta_t = \eta t^{-\alpha}$) learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (LD2Z: $\eta_t = \eta(1 - t/n)$) schedule has shown strong empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay to zero schedules (PD2Z-$\nu$: $\eta_t = \eta(1 - t/n)^{\nu}$). Proceeding step by step, we present a sharp non-asymptotic error bound for Q-learning with the PD2Z-$\nu$ schedule, which is then used to derive a central limit theory for a new tail Polyak-Ruppert averaging estimator. Finally, we provide a novel time-uniform Gaussian approximation (also known as a strong invariance principle) for the partial sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, our results establish that LD2Z, and PD2Z-$\nu$ in general, achieve a best-of-both-worlds property: they inherit the rapid decay from initialization (characteristic of constant step sizes) while retaining the asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of LD2Z, and our results provide practical guidelines for inference.
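To make the three families of schedules concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual forms: a constant rate `eta`, a polynomial decay `eta * t**(-alpha)`, and a decay-to-zero rate `eta * (1 - t/n)**nu` over a fixed horizon of `n` steps, with `nu = 1` giving LD2Z. The function names, the default values, and the parameter names `eta`, `alpha`, `nu`, and `n` are illustrative assumptions.

```python
# Illustrative step-size schedules; all names and defaults are assumptions.
# t is the current step (1 <= t <= n) and n is the total number of steps.

def constant_rate(t, n, eta=0.5):
    """Constant schedule: the same step size at every step."""
    return eta


def poly_decay_rate(t, n, eta=0.5, alpha=0.75):
    """Polynomially decaying schedule: eta * t^(-alpha)."""
    return eta * t ** (-alpha)


def pd2z_rate(t, n, eta=0.5, nu=1.0):
    """Power-law decay to zero: eta * (1 - t/n)^nu, reaching 0 at t = n."""
    return eta * (1.0 - t / n) ** nu


def ld2z_rate(t, n, eta=0.5):
    """Linear decay to zero: the nu = 1 special case of pd2z_rate."""
    return pd2z_rate(t, n, eta=eta, nu=1.0)


if __name__ == "__main__":
    n = 100
    for t in (1, 25, 50, 100):
        print(t, constant_rate(t, n), poly_decay_rate(t, n),
              ld2z_rate(t, n), pd2z_rate(t, n, nu=2.0))
```

Unlike the polynomial schedule, the decay-to-zero schedules depend on the horizon `n`: early in the run they stay close to the constant rate `eta`, and they shrink to zero only as `t` approaches `n`.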