NeurIPS 2020

Robust Multi-Agent Reinforcement Learning with Model Uncertainty

Kaiqing Zhang, Tao Sun, Yunzhe Tao, Sahika Genc, Sunil Mallya, Tamer Basar

Cited by 104

Abstract

In this work, we study the problem of multi-agent reinforcement learning (MARL) with model uncertainty, which is referred to as robust MARL. This is naturally motivated by some multi-agent applications where each agent may not have perfectly accurate knowledge of the model, e.g., all the reward functions of other agents. Little prior work on MARL has accounted for such uncertainties, either in problem formulation or in algorithm design. In contrast, we model the problem as a robust Markov game, where the goal of all agents is to find policies such that no agent has the incentive to deviate, i.e., reach some equilibrium point, which is also robust to the possible uncertainty of the MARL model. We first introduce the solution concept of robust Nash equilibrium in our setting, and develop a Q-learning algorithm to find such equilibrium policies, with convergence guarantees under certain conditions. In order to handle possibly enormous state-action spaces in practice, we then derive the policy gradients for robust MARL, and develop an actor-critic algorithm with function approximation. Our experiments demonstrate that the proposed algorithm outperforms several baseline MARL methods that do not account for the model uncertainty, in several standard but uncertain cooperative and competitive MARL environments.

Equal contribution. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

[...] $\{\pi^{0,i}_{*,t}\}_{i\in\mathcal{N}}$ and $\pi_{*,t} = \prod_{j=1}^{N} \pi^{j}_{*,t}$ denote the equilibrium policies of the nature and the equilibrium joint policies of all agents, respectively, computed from $\{Q^{i}_{t}\}_{i\in\mathcal{N}}$ at time $t$. The term $\pi^{0,i}_{*,t}(s)[a]$ denotes the $a$-th element of the policy output $\pi^{0,i}_{*,t}(s)$, a real vector that lies in $\mathcal{R}^{i}_{s} \subseteq \mathbb{R}^{|\mathcal{A}|}$.

Convergence.
Note that convergence of the update (3.3) is in general hard to establish, as the Bellman operator induced by solving a general-sum game in (3.2) does not always satisfy the conditions for the convergence of Q-learning in MDPs and generalized MDPs [42]. As recognized in [25, 26, 27], convergence of Q-learning in general-sum Markov games indeed requires additional conditions. We establish the convergence of (3.3) under certain conditions, mostly motivated by [25]. Due to space limitations, we defer the results to Supplementary §A.2. Although these results do not apply to all robust Markov games, they provide proof-of-concept justification and a sanity check for the convergence of the value-based (Q-learning) update. Indeed, developing provably convergent [...]
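
As a concrete reading of the notation introduced before the convergence discussion, the following is a minimal sketch of two ideas: the joint policy as a product of the agents' individual policies, and nature's output for agent i as a real vector with one entry per joint action. It assumes a single state, two agents with two actions each, and a finite list of candidate reward vectors standing in for the uncertainty set (the set in the paper need not be finite). The function names `joint_policy` and `worst_case_reward` are hypothetical. The worst-case selection for a fixed joint policy is a simplification for illustration only. It is not the equilibrium computation in (3.2) or the update (3.3).

```python
import numpy as np

def joint_policy(policies):
    """Product joint policy pi(a^1,...,a^N) = prod_j pi^j(a^j).

    Returns a flat vector over joint actions; the last agent's action
    index varies fastest.
    """
    joint = np.array([1.0])
    for p in policies:
        joint = np.outer(joint, np.asarray(p, dtype=float)).ravel()
    return joint

def worst_case_reward(joint, candidates):
    """Pick the candidate reward vector with the lowest expected reward.

    Assumption: `candidates` is a finite stand-in for the reward
    uncertainty set; each row has one entry per joint action.
    """
    candidates = np.asarray(candidates, dtype=float)
    expected = candidates @ joint          # expected reward per candidate
    k = int(np.argmin(expected))           # adversarial (worst-case) choice
    return candidates[k], float(expected[k])

# Illustrative numbers only (not taken from the paper).
pi_1 = [0.5, 0.5]                          # agent 1's policy at the state
pi_2 = [0.25, 0.75]                        # agent 2's policy at the state
joint = joint_policy([pi_1, pi_2])         # vector over 4 joint actions

candidates = [[1.0, 0.0, 0.0, 1.0],
              [2.0, 0.0, 0.0, 0.0]]
reward, value = worst_case_reward(joint, candidates)
print(joint, reward, value)
```

Indexing the reward vector by joint action mirrors the statement that nature's output lies in a subset of the real vectors of dimension |A|, so selecting entry a gives the reward for joint action a.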