NeurIPS 2025

Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods

Oussama Zekri, Nicolas Boullé

Abstract

Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.

Introduction

Diffusion models have become efficient generative modeling tools in various tasks, including image and video generation (Song et al., 2021; Ho et al., 2020). Although most applications of diffusion models depend on a continuous state space (such as images), recent works have extended these models to discrete settings, enabling their use in language modeling and other discrete generative tasks (