ICLR 2025

Scalable Extraction of Training Data from Aligned, Production Language Models

Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tramèr, Katherine Lee

Abstract

Large language models are prone to memorizing some of their training data. Memorized (and possibly sensitive) samples can then be extracted at generation time by adversarial or benign users. There is hope that model alignment, a standard training process that tunes a model to harmlessly follow user instructions, would mitigate the risk of extraction. However, we develop two novel attacks that undo a language model's alignment and recover thousands of training examples from popular proprietary aligned models such as OpenAI's ChatGPT. Our work highlights the limitations of existing safeguards to prevent training data leakage in production language models.

* Equal contribution

1 While limited information is available about proprietary production models, some aligned models like GPT-4 have been trained to "refuse to answer certain types of requests," including those related to training data extraction (OpenAI, 2023).

Published as a conference paper at ICLR 2025

abusing a production system's finetuning interface (Peng et al., 2023), which allows users to further train a model on provided data (Qi et al., 2023). These prior attacks have been successful at making aligned models output harmful content, but not training data. In this paper, we develop new attack techniques for this purpose.

Note that we deliberately refrain from giving a formal definition of alignment. The primary reason is that different organizations have different definitions for what it means for a model to be "aligned with human preferences." Further, even if we have a loose or informal sense of what a particular organization views as alignment, for example through high-level technical reports (OpenAI, 2023), this does not directly translate to understanding exactly how these organizations employ concrete training techniques to align their proprietary models. We instead characterize aligned models in terms of specific behaviors they should not exhibit: in our case, exact regurgitation of training data.
EXPERIMENTAL SETUP

Validating memorization. Typically, we would validate training data extraction by searching for the extracted text in the training dataset. However, proprietary language models like ChatGPT and Gemini do not have public training datasets. Since it is widely known that a large fraction of these models' training data is scraped from the public web, prior works have resorted to manual Google searches to check for the presence of model generations online (Carlini et al., 2021). This is time-consuming and does not scale.

We propose a more scalable approach. First, we approximate the web-based training data of production models by building a large corpus of text from the internet: we merge (and deduplicate) four of the largest published language model training datasets, namely The Pile (Gao et al., 2020), RefinedWeb (Penedo et al., 2023), RedPajama (Together, 2023a), and Dolma (Soldaini, 2023) (Appendix A.4). This corpus, which we call AUXDATASET, is the largest public index of LLM training data to date (9 terabytes). We then approximate an internet-wide search by performing a local search over this corpus, implemented with a suffix array for efficiency (see Appendix A.5 and Lee et al. (2022) for details). We thus call a 50-token subsequence of a generation memorized if (as in Definition 1) it appears exactly in AUXDATASET.

This validation method provides only a loose lower bound on memorization: it undercounts successful training data extraction because AUXDATASET does not include the full training dataset of any proprietary model. Moreover, we do not count sequences that are approximately memorized, i.e., generations for which a near-exact match (e.g., a paraphrase) appears in AUXDATASET.
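
The validation check described above can be illustrated with a minimal sketch. It assumes the corpus and the generation are already tokenized into integer IDs and that the corpus fits in memory. The function names (`build_suffix_array`, `contains`, `memorized_spans`) are hypothetical and do not come from the paper, and the naive suffix array construction only stands in for the scalable implementation referenced in Appendix A.5 and Lee et al. (2022).

```python
from typing import List

WINDOW = 50  # window length in tokens, matching the 50-token criterion in the text


def build_suffix_array(tokens: List[int]) -> List[int]:
    """Return suffix start offsets sorted lexicographically by suffix.

    Naive construction for illustration only; it would not scale to a
    multi-terabyte corpus such as AUXDATASET.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def contains(tokens: List[int], sa: List[int], query: List[int]) -> bool:
    """Binary-search the suffix array for an exact occurrence of `query`."""
    query = list(query)
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare the first m tokens of the suffix with the query.
        if tokens[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and tokens[sa[lo]:sa[lo] + m] == query


def memorized_spans(generation: List[int], corpus: List[int],
                    sa: List[int], window: int = WINDOW) -> List[int]:
    """Return start offsets of windows in `generation` found verbatim in `corpus`."""
    return [s for s in range(len(generation) - window + 1)
            if contains(corpus, sa, generation[s:s + window])]


# Toy data (an assumption for illustration): the corpus is the IDs 0..199 and
# the generation copies IDs 100..159 between tokens that never occur in it.
corpus = list(range(200))
sa = build_suffix_array(corpus)
generation = [999, 998] + list(range(100, 160)) + [997]
hits = memorized_spans(generation, corpus, sa)
print(hits)  # start offsets 2 through 12
```

Because only exact matches against the indexed corpus count, a check of this kind can miss memorized text that is absent from the index or that differs by a single token, which is why the resulting count is a lower bound.
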
Nevertheless, this validation methodology satisfies our goals: we aim to provide a lower bound on the amount of extracted memorized text, and to demonstrate that we are able to extract exact subsequences of training data from aligned language models.

Models. We study training data extraction in two production language model families that have been trained with alignment: ChatGPT (from OpenAI) and Gemini (from Google). For ChatGPT, we consider the two latest aligned conversational models at the time of writing (gpt-3.5-turbo and gpt-4); for Gemini, we consider the latest publicly available version, Gemini 1.5 Pro, a state-of-the-art model for long-context generation understanding. We compare these production, closed-weight models (i.e., models embedded in systems, which we interact with via developer APIs) with several open-weight (i.e., with publicly accessible weights), unaligned language models including GPT-2 (1.5B) (Radford et al., 2019), GPT-Neo (6B) (Black et al., 2021), Pythia (1.4B and 6.9B) (Biderman et al., 2023), OPT (1.3B and 6.7B) (Zhang et al., 2022), LLaMA (7B and 65B) (Touvron et al., 2023a), RedPajama-INCITE base (3B and 7B) (Together, 2023b), Mistral (7B)