ACL 2025

DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization

Amitava Das, Suranjana Trivedy, Danush Khanna, Yaswanth Narsupalli, Basab Ghosh, Rajarshi Roy, Gurpreet Singh, Vinija Jain, Vasu Sharma, Aishwarya Naresh Reganti, Aman Chadha

Abstract

The rapid advancement of large language models (LLMs) has revolutionized numerous applications, but it presents significant challenges in aligning these models with diverse human values, ethical standards, and specific user preferences. Direct Preference Optimization (DPO) has become a cornerstone of preference alignment, but it is constrained by its reliance on fixed divergence measures and limited feature transformations. We introduce DPO-Kernels, an enhancement of DPO that integrates kernel methods to overcome these constraints through four key contributions: (i) Kernelized Representations: polynomial, RBF, Mahalanobis, and spectral kernels provide richer, more expressive feature transformations and lay the groundwork for enhanced divergence measures. We also introduce a hybrid loss that combines an embedding-based loss with a probability-based loss, extending the optimization beyond traditional DPO; (ii) Divergence Alternatives: Jensen-Shannon, Hellinger, Rényi, Bhattacharyya, Wasserstein, and f-divergences are incorporated to improve stability and robustness; (iii) Data-Driven Selection: choosing the optimal kernel-divergence pair among 28 combinations (4 kernels × 7 divergences) is challenging, so we introduce automatic metrics that analyze the data to select the best pair, eliminating the need for manual tuning; (iv) Hierarchical Mixture of Kernels (HMK): local and global kernels are combined for precise and large-scale semantic modeling. This approach automatically selects the optimal kernel mixture during training, enhancing modeling flexibility. Evaluations on 12 datasets demonstrate that DPO-Kernels achieve state-of-the-art generalization in factuality, safety, reasoning, and instruction following.
Although alignment generally carries a risk of overfitting, we show, grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, that DPO-Kernels maintain robust generalization bounds in LLMs. Comprehensive resources are available to facilitate further research and application of DPO-Kernels.

* Work done outside of role at Meta. † Work done outside of role at Amazon.
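
The abstract names its kernels and divergences without giving formulas. As a rough illustration only, the sketch below implements the standard textbook definitions of three of the kernel families (polynomial, RBF, and a Mahalanobis-style kernel) and three of the divergences (Jensen-Shannon, Hellinger, Bhattacharyya) for small vectors and discrete distributions. All function names and parameters (`degree`, `coef0`, `sigma`, `inv_var`) are hypothetical, the Mahalanobis kernel assumes a diagonal inverse covariance for simplicity, and nothing here should be read as the authors' implementation or as the way DPO-Kernels combines these pieces into a loss. The spectral kernel and the Rényi, Wasserstein, and general f-divergences are omitted.

```python
import math


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (<x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def rbf_kernel(x, y, sigma=1.0):
    """RBF (Gaussian) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mahalanobis_kernel(x, y, inv_var):
    """Gaussian kernel on a Mahalanobis distance.

    Assumption: the inverse covariance is diagonal and is passed as
    a list of per-dimension weights `inv_var`.
    """
    sq_dist = sum(w * (a - b) ** 2 for a, b, w in zip(x, y, inv_var))
    return math.exp(-0.5 * sq_dist)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, with 0 * log(0) taken as 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p, q):
    """Hellinger distance, bounded in [0, 1]."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(0.5 * s)


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance: -ln of the Bhattacharyya coefficient."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return math.inf if bc == 0.0 else -math.log(bc)


if __name__ == "__main__":
    x, y = [1.0, 2.0], [3.0, 4.0]
    p, q = [0.7, 0.3], [0.4, 0.6]
    print("polynomial:", polynomial_kernel(x, y))
    print("rbf:", rbf_kernel(x, y))
    print("mahalanobis:", mahalanobis_kernel(x, y, [1.0, 1.0]))
    print("JS:", js_divergence(p, q))
    print("Hellinger:", hellinger_distance(p, q))
    print("Bhattacharyya:", bhattacharyya_distance(p, q))
```

Unlike plain KL, the Jensen-Shannon and Hellinger measures are symmetric and bounded, which is one common reason they are considered when stability matters. Whether that is the motivation in this paper is not stated in the abstract.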