ICML2025

Near-Optimal Sample Complexity for MDPs via Anchoring

Jongmin Lee, Mario Bravo, Roberto Cominetti

摘要

We study a new model-free algorithm to compute ε-optimal policies for average reward Markov decision processes, in the weakly communicating setting. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an ε-optimal policy with sample and time complexity O(|S||A|∥h * ∥ 2 sp /ε 2 ) both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor ∥h * ∥ sp . Although the complexity bound involves the span seminorm ∥h * ∥ sp of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze how these techniques can be adapted for discounted MDPs.