NeurIPS2025

Non-Asymptotic Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

Zaiwei Chen

Abstract

This work presents the first finite-time analysis of average-reward Q-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes that act as local clocks for each state-action pair. We show that the mean-square error of this Q-learning algorithm, measured in the span seminorm, converges at a rate of Õ(1/k). To establish this result, we demonstrate that adaptive stepsizes are necessary: without them, the algorithm fails to converge to the correct target. Moreover, adaptive stepsizes can be viewed as a form of implicit importance sampling that counteracts the effect of asynchronous updates. Technically, the use of adaptive stepsizes causes each Q-learning update to depend on the full sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates. Introduction Reinforcement Learning (RL) has become as a powerful framework for solving sequential decisionmaking problems, as demonstrated by its growing impact across a range of real-world applications, including autonomous robotics [64], game-playing AI [63] , and the development of large language models [15] . Given the promising potential of RL, establishing strong theoretical foundations to guide its practical implementation is of significant importance. An RL problem is typically modeled as a Markov decision process (MDP) [67], but its objective can vary by application. Finite-horizon RL maximizes cumulative rewards over a fixed time, while infinite-horizon discounted RL introduces a discount factor γ ∈ (0, 1) to prioritize immediate rewards over future rewards. However, selecting an appropriate discount factor can be challenging. For instance, it is often unclear how far into the future decisions should account for rewards. Moreover, in high-frequency decision-making tasks such as robotic control, queuing systems, and financial trading, introducing a discount factor may not be appropriate. The infinite-horizon average-reward setting addresses these challenges by optimizing the long-run average reward without requiring a discount factor. However, the absence of discounting introduces unique challenges for both algorithm design and theoretical analysis. For example, the associated Bellman operator is no longer a norm-contractive mapping [58] , sample-based updates add additional complexities to the structure of the algorithm [1], and optimality depends on value differences rather than absolute values. These challenges are particularly evident in Q-learning [75], one of the most classical and practically impactful RL algorithms. Due to its popularity and its role as a major milestone in RL [53], substantial efforts have been dedicated to providing theoretical guarantees, especially in terms of convergence rates, for Q-learning. In the discounted setting, the first finite-time analysis of Q-learning was 39th Conference on Neural Information Processing Systems (NeurIPS 2025).