EMNLP 2023
MoPe: Model Perturbation based Privacy Attacks on Language Models
Marvin Li, Jason Wang, Jeffrey G. Wang, Seth Neel
8 citations
Abstract
Recent work has shown that Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data. In this paper, we present MoPe_θ (Model Perturbations), a new method to identify with high confidence whether a given text is in the training data of a pre-trained language model, given white-box access to the model's parameters. MoPe_θ adds noise to the model in parameter space and measures the drop in log-likelihood at a given point x, a statistic we show approximates the trace of the Hessian matrix with respect to model parameters. Across language models ranging from 70M to 12B parameters, we show that MoPe_θ is more effective than existing loss-based attacks and recently proposed perturbation-based methods. We also examine the role of training point order and model size in attack success, and empirically demonstrate that MoPe_θ accurately approximates the trace of the Hessian in practice. Our results show that the loss of a point alone is insufficient to determine extractability: there are training points we can recover using our method that have average loss. This casts some doubt on prior works that use the loss of a point as evidence of memorization or "unlearning."
* Alphabetical order; equal contribution.
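The perturbation statistic described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes isotropic Gaussian noise with standard deviation `sigma`, replaces the language model with a hypothetical quadratic loss whose Hessian `H` is known exactly, and uses the illustrative names `mope_statistic` and `quadratic_loss`. Under a second-order Taylor expansion, the expected increase in loss is (sigma^2 / 2) times the trace of the Hessian, so rescaling the statistic should roughly recover the trace.

```python
import numpy as np


def mope_statistic(loss_fn, theta, sigma, n_perturbations, rng):
    """Mean increase in loss at theta under isotropic Gaussian parameter noise.

    Illustrative helper (hypothetical name), not the paper's code.
    """
    base = loss_fn(theta)
    perturbed = [
        loss_fn(theta + sigma * rng.standard_normal(theta.shape))
        for _ in range(n_perturbations)
    ]
    return float(np.mean(perturbed) - base)


# Toy stand-in for a model: quadratic loss with a known symmetric PSD Hessian H.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T
theta = 0.1 * rng.standard_normal(5)  # assumed to lie near the minimum at 0


def quadratic_loss(t):
    # Loss (negative log-likelihood analogue) whose Hessian is exactly H.
    return 0.5 * t @ H @ t


sigma = 1.0
stat = mope_statistic(quadratic_loss, theta, sigma, 20000, rng)

# Second-order expansion: E[loss(theta + eps)] - loss(theta) = (sigma^2 / 2) * tr(H),
# so the rescaled statistic is a Monte Carlo estimate of the Hessian trace.
trace_estimate = 2.0 * stat / sigma**2
true_trace = float(np.trace(H))
print(f"estimated trace: {trace_estimate:.3f}, true trace: {true_trace:.3f}")
```

For a quadratic loss the expansion is exact, so the only error here is Monte Carlo noise. For a real language model the loss is not quadratic, and the noise scale and number of perturbations would need to be tuned.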