NeurIPS2025

Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning

Na Li, Zewu Zheng, Wei Ni, Hangguan Shan, Wenjie Zhang, Xinyu Li

Abstract

Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. We focus on robust two-player zero-sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model-based algorithm (RTZ-VI-LCB) for offline RTZMGs, which is optimistic robust value iteration combined with a data-driven Bernstein-style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near-optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to confirm the tightness of our algorithm's sample complexity, which is optimal regarding both state and action spaces. To the best of our knowledge, RTZ-VI-LCB is the first to attain this optimality, sets a new benchmark for offline RTZMGs, and is validated experimentally. Lower bound equilibria not only between the two players but also with their adversarial strategies, considering worst-case environments selected from predefined uncertainty sets for each player. Despite recent efforts [21, 5, 48, 29] , there remains a fundamental gap in learning effectively in offline RTZMGs, primarily due to high sample complexity. For a tabular RTZMG (formal definition in Section 2) with horizon length H, states S, actions A, B, and uncertainty levels σ + , σ - for the two players, the best sample complexity for ε-optimal robust Nash equilibrium (NE) in the offline setting to date is O CrH 5 S 2 AB ε 2 achieved by P 2 M 2 PO [5], which demonstrates near-optimal sample complexity in H, S, and A, B, but overlooks the influence of uncertainty levels and faces the curse of multiagency [40] . Hence, the key research question addressed in this paper is Can we design an efficient algorithm for offline RTZMGs with partial state-action coverage while ensuring robustness to uncertainties? (3) Throughout this paper, ϱ n denotes the initial distribution related to a historical dataset. We use the short-hand notation for the occupancy distribution with respect to (w.r.t.) the behavior policy (µ n , ν n ) as: which are simplified to d n,P 0 h (s) = d µ n ,ν n ,P 0 h (s) and d n,P 0 h (s, a, b) = d µ n ,ν n ,P 0 h (s, a, b). Similarly, for any product policy (µ, ν), we define: ∀(h, s, a, b) ∈ [H] × S × A × B d µ,ν,P h (s) := P(s h = s | s 1 ∼ ϱ, µ, ν, P ); d µ,ν,P h (s, a, b) := d µ,ν,P h (s)µ h (a | s) ν h (b | s). (5) Robust value functions. In RTZMGs, players seek to optimize their worst-case performance across all possible transition kernels within their respective uncertainty sets U σ + ρ P 0 and U σ - ρ P 0 . For any product policy (µ × ν) ∈ ∆(A × B), the max-player's worst-case performance at time step h is measured with the robust value function V µ,ν,σ + h and the robust Q-function Q µ,ν,σ + h , ∀(h, s, a, b) ∈ [H] × S × A × B, as given by V µ,ν,σ + h (s) := inf P ∈U σ + ρ (P 0 ) V µ,ν,P h (s), Q µ,ν,σ + h (s, a, b) := inf P ∈U σ + ρ (P 0 ) Q µ,ν,P h , (6a) V µ,ν,σh (s) := sup P ∈U σ - ρ (P 0 ) V µ,ν,P h (s), Q µ,ν,σh (s, a, b) := sup P ∈U σ - ρ (P 0 ) Q µ,ν,P h , (6b) where V µ,ν,P h