ICML 2025

Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport

Mingyang Sun, Pengxiang Ding, Weinan Zhang, Donglin Wang

Abstract

Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they lack robustness under distribution shift. This paper explores improving diffusion-based imitation learning (IL) models through online interaction with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with reinforcement learning (RL) using optimal transport theory. OTPR uses the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport, which guides state-action matching using expert keypoints, and a compatibility-based resampling strategy that enhances training stability. Experiments on three simulation tasks demonstrate OTPR's superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In summary, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning.
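
To make the transport view concrete, the following toy sketch couples a handful of discrete states with candidate actions, using the negated Q-value as the transport cost and a binary mask to forbid some pairs. It is an illustration under stated assumptions, not OTPR itself: the Sinkhorn solver for entropic optimal transport is swapped in for simplicity, the diffusion policy is not modeled, and the names `sinkhorn_plan`, `greedy_actions`, `q_values`, and `mask`, as well as all numbers, are hypothetical.

```python
import math

def sinkhorn_plan(cost, mask=None, eps=0.5, n_iters=500):
    """Entropic OT plan between uniform marginals over rows (states)
    and columns (actions). mask[i][j] = 0 forbids the pair (i, j).
    Assumption: the mask still admits a feasible coupling."""
    n, m = len(cost), len(cost[0])
    if mask is None:
        mask = [[1] * m for _ in range(n)]
    # Gibbs kernel; masked-out pairs receive zero weight.
    K = [[mask[i][j] * math.exp(-cost[i][j] / eps) for j in range(m)]
         for i in range(n)]
    a = [1.0 / n] * n  # uniform mass over states
    b = [1.0 / m] * m  # uniform mass over actions
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def greedy_actions(plan):
    """Read a deterministic map off the plan: for each state, pick the
    action that receives the most transported mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]

# Hypothetical Q-values: rows are 3 states, columns are 3 candidate actions.
q_values = [[1.0, 0.2, 0.1],
            [0.3, 0.9, 0.2],
            [0.1, 0.4, 0.8]]

# High Q-value means low transport cost.
cost = [[-q for q in row] for row in q_values]

plan = sinkhorn_plan(cost)

# Hypothetical mask that forbids pairing state 0 with action 0.
mask = [[0, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]
masked_plan = sinkhorn_plan(cost, mask=mask)

print(greedy_actions(plan))
print(greedy_actions(masked_plan))
```

In this sketch the unmasked plan concentrates its mass on the highest-Q action for each state, while the masked plan assigns exactly zero mass to the forbidden pair and redistributes the rest so that both marginals are still respected. How OTPR actually constructs the mask from expert keypoints and how it resamples by compatibility are not specified in the abstract and are not represented here.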