ACL 2024

Mitigating Privacy Seesaw in Large Language Models: Augmented Privacy Neuron Editing via Activation Patching

Xinwei Wu, Weilong Dong, Shaoyang Xu, Deyi Xiong

Abstract

Protecting against privacy leakage in large language models (LLMs) remains a paramount challenge. In this paper, we reveal the Privacy Seesaw in LLM privacy protection via neuron editing, a phenomenon where measures to secure specific private information inadvertently heighten exposure risks for other private information. Through comprehensive analysis, we identify the amount of targeted privacy data and the volume of edited privacy neurons as the two central triggers of this issue. To mitigate the privacy seesaw, we propose Augmented Privacy Neuron Editing via Activation Patching (APNEAP), a novel framework designed to balance model performance with privacy protection. APNEAP augments collected private data by automatically synthesizing new private data, which deactivates the first trigger of the privacy seesaw issue. Additionally, it adapts activation patching to privacy neuron editing to switch off the second trigger. Experimental results show that APNEAP alleviates the privacy seesaw phenomenon and offers a more stable and reliable approach to privacy protection in LLMs than previous methods.

2023; Li et al., 2023a). Second, LLMs tend to memorize training data, including unique instances (Carlini et al., 2019, 2021). Previous studies have shown that private information can be successfully extracted from LLMs such as ChatGPT with meticulously crafted prompts, underscoring the urgency of privacy protection for LLMs (Li et al., 2023a).

To protect the privacy of LLMs, machine unlearning and neuron-based methods have been proposed. The former aims to make LLMs forget targeted datasets through fine-tuning on small batches of data. Ishibashi and Shimodaira (2023) render private information harmless through Sanitization Tuning. Jang et al. (2022) reduce privacy leakage risks by applying gradient ascent, rather than descent, to private data.
The latter seeks to reduce the likelihood of eliciting private information by editing neurons directly. Wu et al. (2023) propose DEPN, an efficient approach to locating and editing privacy neurons in language models. However, we have encountered unexpected results with neuron-based protection methods.

Privacy Protection in NLP

Privacy protection in language models is categorized into three stages: data processing, training & fine-tuning, and post-processing (Guo et al., 2022; Sousa and Kern, 2023). In data processing, methods like redirection and anonymization aim to remove sensitive