KDD 2025
LLM-Eraser: Optimizing Large Language Model Unlearning through Selective Pruning
Shengming Zhang, Le Zhang, Jingbo Zhou, Zhi Zheng, Hui Xiong
Cited by 2
Abstract
We study how to unlearn unwanted knowledge in autoregressive large language models (LLMs) through pruning. The goal is to selectively remove undesirable information (e.g., harmful responses, privacy-sensitive data) while preserving desirable knowledge (e.g., positive responses and objective facts). Previous approaches apply gradient ascent (GA) on the undesired knowledge to inversely optimize the LLM, which also compromises the model's performance on desired knowledge. To address this limitation, we introduce LLM-Eraser, a novel two-stage approach that selectively identifies and edits the parameters specifically associated with undesirable knowledge. The two stages are localization and unlearning. In the localization stage, we use neuron scores and trainable soft masks to identify the parameters crucial to the undesired knowledge. In the unlearning stage, we prune the identified parameters and apply a selective post-training process to enhance the model's selectiveness. Experiments across five task datasets demonstrate that LLM-Eraser effectively unlearns undesirable knowledge, as evidenced by the model's near-random performance on multiple-choice questions related to the erased knowledge, while maintaining high proficiency in desirable knowledge, with an average performance deficit of only 2.5%.
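The localize-then-prune idea can be illustrated with a toy sketch in plain Python. The abstract does not give the scoring formula, so the score used here (mean absolute activation on the data to forget minus the same quantity on the data to retain) is an assumption made only for illustration. The function names `neuron_scores`, `select_neurons`, and `prune_rows` are hypothetical, and the trainable soft masks and the selective post-training step are left out entirely.

```python
from typing import List

Matrix = List[List[float]]


def neuron_scores(forget_acts: Matrix, retain_acts: Matrix) -> List[float]:
    """Assumed score: mean |activation| on forget data minus mean |activation| on retain data.

    Each input has one row per example and one column per neuron.
    """
    def mean_abs(acts: Matrix) -> List[float]:
        n_examples = len(acts)
        n_neurons = len(acts[0])
        return [sum(abs(row[j]) for row in acts) / n_examples for j in range(n_neurons)]

    forget_mean = mean_abs(forget_acts)
    retain_mean = mean_abs(retain_acts)
    return [f - r for f, r in zip(forget_mean, retain_mean)]


def select_neurons(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring neurons, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def prune_rows(weight: Matrix, neurons: List[int]) -> Matrix:
    """Zero the weight rows of the selected neurons (one row per neuron); copy the rest."""
    pruned = set(neurons)
    return [[0.0] * len(row) if i in pruned else list(row) for i, row in enumerate(weight)]


# Toy data: neuron 1 fires strongly on the forget examples only.
forget_acts = [[0.1, 2.0, 0.3], [0.2, 1.8, 0.1]]
retain_acts = [[0.1, 0.2, 0.9], [0.3, 0.1, 1.1]]
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

selected = select_neurons(neuron_scores(forget_acts, retain_acts), k=1)
print(selected)                      # → [1]
print(prune_rows(weight, selected))  # → [[1.0, 2.0], [0.0, 0.0], [5.0, 6.0]]
```

Subtracting the retain-side statistic is what makes the selection selective in this sketch: a neuron that is equally active on both kinds of data scores near zero and is left alone.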