EMNLP 2025
Detoxifying Large Language Models via the Diversity of Toxic Samples
Ying Zhao, Yuanzhao Guo, Xuemeng Weng, Yuan Tian, Wei Wang, Yi Chang
Abstract
Warning: This work contains content that may be offensive or upsetting. Eliminating toxicity from large language models (LLMs) is critical to ensuring user safety. However, current methods are limited in how they analyze and utilize toxic samples, and thus fail to fully harness their potential. Through a comparative analysis of toxic and safe samples, we find that (i) toxic samples exhibit diversity and (ii) there is specificity within this diversity. These findings suggest that leveraging these characteristics of toxic samples could improve the performance of LLM detoxification algorithms. We therefore propose a novel diverse detoxification framework, DivDetox, which comprises two innovative components: a Multi-Category-Induced Personalized Sample Generation (MPSG) strategy and a Scaled Contrastive Direct Preference Optimization (SC-DPO) approach. The former is designed to elicit a variety of personalized toxic responses from LLMs, while the latter is designed to utilize these toxic responses precisely and fully. Experiments on benchmark datasets across different model scales and various detoxification tasks confirm the effectiveness of our framework. Our code is available at https://github.com/zy1998-c/DivDetox.