ICML2024

Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

Dan Qiao, Yu-Xiang Wang

被引用 5 次

摘要

We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zerosum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of O( √ H 3 S 2 ABK), while the batch complexity is only O(H + log log K). In the above, S denotes the number of states, A, B are the number of actions for the two players respectively, H is the horizon and K is the number of episodes. Furthermore, we prove a batch complexity lower bound Ω( H log A K + log log K) for all algorithms with O( √ K) regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity. Algorithms for Markov games Single-agent (B=1)? Regret Batch complexity VI-ULCB [Bai and Jin, 2020] No O( √ H 3 S 2 ABT ) K Nash VI [Liu et al., 2021] No [Bai and Jin, 2020] No Ω( H 2 S(A + B)T ) No constraints. Lower bound (Theorem 4.2) No if O( √ T ) ("Optimal regret") Ω( H log A K + log log K) Algorithms for bandit games Single-agent (B=1)? Regret Batch complexity BaSE [Gao et al., 2019] † Yes O( √ AK) O(log log K) Our Algorithm 6 (Theorem 5.1) No O( √ ABK) O(log log K) Algorithms for reward-free exploration Single-agent (B=1)? Sample (episode) complexity Batch complexity VI-Explore [Bai and Jin, 2020] No