AAAI 2024

Decoding Global Preferences: Temporal and Cooperative Dependency Modeling in Multi-Agent Preference-Based Reinforcement Learning

Tianchen Zhu, Yue Qiu, Haoyi Zhou, Jianxin Li

Cited by 9

Abstract

Multi-agent Preference-based Reinforcement Learning (MAPbRL) is promising for offline policy learning because it leverages human preferences in place of complex manual reward design. Current MAPbRL methods use complicated structures to achieve better reward modeling and then obtain the joint policy from the learned reward with off-the-shelf MARL algorithms. However, they face a severe preference-behavior mismatch problem, stemming from the instability of RL training and from datasets with global-local preference inconsistency in offline MARL, which can lead to convergence to suboptimal policies. To address this problem, we propose Agent-aware Multi-Agent Direct Preference Optimization (AMADPO), which uses a multi-agent preference predictor to guide agent-aware direct optimization from imbalanced preference labels and can learn coordination policies from both positive and negative segments. Experimental results in the SMAC environment show substantial improvements on datasets with global-local preference inconsistency, demonstrating the effectiveness of AMADPO in solving the preference-behavior mismatch problem.
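The abstract does not state the AMADPO objective itself. As a rough illustration of what direct optimization from a preferred/rejected segment pair generally looks like, the sketch below implements the generic single-pair DPO-style loss, not the paper's agent-aware variant. The function names, the `beta` value, the frozen reference policy, and the assumption that the joint log-probability factorizes into a sum over agents are all illustrative assumptions, not details taken from the paper.

```python
import math


def joint_logp(per_agent_logps):
    """Sum per-agent, per-step action log-probs of one segment.

    Assumption (not from the paper): the joint policy factorizes across
    agents, so the joint log-probability is a plain sum.
    """
    return sum(sum(agent_steps) for agent_steps in per_agent_logps)


def dpo_pairwise_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """Generic DPO-style loss for one (preferred, rejected) segment pair.

    logp_*     : segment log-probability under the policy being trained
    ref_logp_* : segment log-probability under a frozen reference policy
    beta       : hypothetical temperature; 0.1 is an arbitrary default
    """
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy usage: two agents, two steps each, made-up log-probabilities.
preferred = joint_logp([[-0.2, -0.3], [-0.1, -0.4]])
rejected = joint_logp([[-1.0, -1.2], [-0.9, -1.1]])
loss = dpo_pairwise_loss(preferred, rejected, ref_logp_pref=-2.0, ref_logp_rej=-2.0)
```

The loss shrinks as the trained policy assigns relatively more probability to the preferred segment than the reference policy does, which is the sense in which a policy can be learned from both positive and negative segments without fitting an explicit reward model first.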