ISSTA 2025

Patch the Leak: Strengthening CodeLLMs Against Privacy Extraction Threats

Yongjian Guo, Wanlun Ma, Xi Xiao, Sheng Wen, Peng Di, Xiaogang Zhu

Abstract

CodeLLMs tend to memorize their training data and can reconstruct personal information (PI) when given specific prompts. Although privacy anonymization methods are applied to remove PI from foundational LLMs, prior experiments with state-of-the-art PI extraction attacks such as CODEBREAKER and CodexLeaks on multiple open-source and commercial CodeLLMs demonstrate that such information cannot be fully eliminated. Furthermore, we found that commercial models exhibit significantly lower leakage rates (approximately 20% lower) than open-source models, and we hypothesize that this is related to their stronger model alignment. To address the lack of effective defenses against PI extraction, we treat PI leakage as a form of misalignment and propose PI-ALIGN, a novel framework inspired by adversarial learning. PI-ALIGN pairs a CodeLLM with the CODEBREAKER attack framework as an adversarial dual model and uses an optimized GRPO (Group Relative Policy Optimization) process to realign the model during fine-tuning. By adversarially training the model against CODEBREAKER, this approach is expected to enhance its robustness against PI extraction attacks. We also outline an experimental evaluation framework to systematically validate PI-ALIGN's effectiveness, aiming to provide insights into countering PI extraction attacks on CodeLLMs.
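
To make the GRPO step concrete, the following is a minimal sketch of the group-relative advantage computation that GRPO is generally built on: several completions are sampled for one attack prompt, each is scored, and each score is standardized against its own group. It is not PI-ALIGN's implementation. The reward function `leak_reward`, the email-only regular expression, and the use of the population standard deviation are all assumptions made for illustration; the abstract does not specify the reward or the nature of the "optimized" GRPO variant.

```python
import re
from statistics import mean, pstdev

# Hypothetical PI detector: a toy regex that only matches email-like strings.
# A real detector would cover many more PI types; this is an assumption.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def leak_reward(completion: str) -> float:
    """Assumed reward: 1.0 if no email-like string appears, else 0.0."""
    return 0.0 if EMAIL_RE.search(completion) else 1.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so
    completions that leak less than their siblings get positive values.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of completions sampled for a single (hypothetical) attack prompt.
completions = [
    "contact: alice@example.com",
    "def add(a, b): return a + b",
    "x = 1",
    "mail bob@corp.org",
]
rewards = [leak_reward(c) for c in completions]
advantages = group_relative_advantages(rewards)
```

In this sketch the two completions containing an email-like string receive negative advantages and the two clean completions receive positive ones, which is the signal a policy-gradient update would then use to push the model away from leaking outputs.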