CCS2025

PreferCare: Preference Dataset Copyright Protection in LLM Alignment by Watermark Injection and Verification

Jian Lou, Chenyang Zhang, Xiaoyu Zhang, Kai Wu

Abstract

As the need to make LLM applications safer grows, so does interest in alignment training algorithms, which keep large language models (LLMs) consistent with human values. These algorithms rely heavily on preference datasets, which are essential for fine-tuning LLMs to follow human preferences. However, generating and annotating such datasets is costly and labor-intensive, making it critical to protect their copyright against unauthorized use. In this paper, we propose PreferCare, the first framework tailor-made for preference dataset copyright protection via watermark injection and verification. PreferCare comprises two consecutive stages. In the injection stage, a style transfer-based watermark signal and a bi-level watermark optimization process embed the watermark into the preference dataset. In the verification stage, statistical tests determine whether a suspect LLM has used the watermarked preference dataset without authorization. Extensive experiments on multiple popular LLMs demonstrate that PreferCare achieves effectiveness, harmlessness, transferability, and robustness across diverse settings, and can verify the watermark within 20 queries.
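
To make the verification stage concrete, the following is a minimal sketch of how a statistical test over a small query budget could work. The abstract does not state which test PreferCare uses, so this sketch assumes a one-sided exact binomial test. The function names, the `base_rate` of watermark-styled responses expected from a clean model, and the significance level `alpha` are all hypothetical choices for illustration, not values from the paper.

```python
from math import comb


def binomial_tail_p_value(hits: int, n: int, base_rate: float) -> float:
    """Return P(X >= hits) for X ~ Binomial(n, base_rate)."""
    return sum(
        comb(n, k) * base_rate**k * (1.0 - base_rate) ** (n - k)
        for k in range(hits, n + 1)
    )


def verify_watermark(style_flags, base_rate=0.1, alpha=0.01):
    """Test whether watermark-styled responses occur more often than chance.

    style_flags: one boolean per query, True if the suspect model's response
        shows the watermark style (how this is decided is outside this sketch).
    base_rate:   assumed rate of such responses from a clean model (hypothetical).
    alpha:       significance level (hypothetical).

    Returns (p_value, detected).
    """
    n = len(style_flags)
    hits = sum(1 for flag in style_flags if flag)
    p_value = binomial_tail_p_value(hits, n, base_rate)
    return p_value, p_value < alpha


# Illustrative inputs only: 20 queries, matching the query budget in the abstract.
suspect_flags = [True] * 12 + [False] * 8   # 12 of 20 responses carry the style
clean_flags = [True] * 2 + [False] * 18     # 2 of 20, close to the assumed base rate

p_suspect, detected_suspect = verify_watermark(suspect_flags)
p_clean, detected_clean = verify_watermark(clean_flags)
```

Under these assumptions, the null hypothesis is that the suspect LLM never saw the watermarked dataset, so styled responses appear only at the base rate. A small p-value rejects that hypothesis, and the exact binomial form needs no large-sample approximation, which suits a budget of about 20 queries.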