NeurIPS2021

An Uncertainty Principle is a Price of Privacy-Preserving Microdata

John M. Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson L. Garfinkel, Daniel Kifer, Philip Leclerc, William Sexton, Ashley Simpson, Christine Task, Pavel Zhuravlev

被引用 17 次

DOI arXiv 出版方

摘要

Privacy-protected microdata are often the desired output of a differentially private algorithm since microdata is familiar and convenient for downstream users. However, there is a statistical price for this kind of convenience. We show that an uncertainty principle governs the trade-off between accuracy for a population of interest ("sum query") vs. accuracy for its component sub-populations ("point queries"). Compared to differentially private query answering systems that are not required to produce microdata, accuracy can degrade by a logarithmic factor. For example, in the case of pure differential privacy, without the microdata requirement, one can provide noisy answers to the sum query and all point queries while guaranteeing that each answer has squared error O(1/ǫ 2 ). With the microdata requirement, one must choose between allowing an additional log 2 (d) factor (d is the number of point queries) for some point queries or allowing an extra O(d 2 ) factor for the sum query. We present lower bounds for pure, approximate, and concentrated differential privacy. We propose mitigation strategies and create a collection of benchmark datasets that can be used for public study of this problem. We present such lower bound results for pure differential privacy [16] , approximate differential privacy [15] , and concentrated differential privacy [8], with nearly matching upper bounds. We note that this uncertainty principle affects some, but not all, possible datasets. That is, there are datasets for which the error penalties do not exist. Thus, the goal in practical privacy-protected microdata generation should be to minimize the occurrence of this uncertainty principle (since eliminating it entirely is impossible). To this end, we propose a benchmark suite of real and synthetic datasets that can be used by the wider community for further study of this problem. We also propose some algorithms, inspired by our lower and upper bound proofs, for mitigating the effects of this uncertainty principle. Limitations: empirically, these algorithms perform well on the benchmarks but we do not have theoretical proofs of performance. Preliminaries Let D denote a dataset, M a differentially private algorithm, and let D be a privacy-preserving dataset (e.g., M (D) = D). A counting query q is associated with a predicate ψ, and the query answer q(D) is the number of records in D that satisfy ψ. We let q 1 , . . . , q d represent a set of d counting queries whose corresponding predicates ψ 1 , . . . , ψ d are disjoint (no record can satisfy more than one of the predicates). We also let q * denote their sum: q * (D) = d i=1 q i (D). Differential Privacy Differential privacy is currently considered the gold standard in privacy protections. It relies on the concept of neighboring datasets, defined as follows. Definition 1 (Neighbors). Two datasets D 1 and D 2 are neighbors, denoted by D 1 ∼ D 2 , if D 1 can be obtained from D 2 by adding or removing one record. Using this concept of neighbors, differential privacy ensures that adding or removing one record from a dataset has little effect on the probabilistic outcomes of an algorithm: Definition 2 (Differential Privacy [16] ). Given privacy parameters ǫ > 0 and δ ≥ 0, a randomized algorithm M satisfies (ǫ, δ)-DP if for all pairs of datasets D 1 , D 2 that are neighbors of each other, and for all S ⊆ range(M ), the following equation holds: where the probability is only over the randomness in M (not the randomness in the data). When δ = 0, we say that M satisfies pure differential privacy (also known as ǫ-differential privacy or ǫ-DP) and when δ > 0 we say that M satisfies approximate differential privacy. Another important version of differential privacy, is ρ-zCDP (concentrated differential privacy): Definition 3 (zCDP [8] ). Given a privacy parameter ρ, a randomized algorithm M satisfies ρ-zCDP if for all pairs of datasets D 1 , D 2 that are neighbors of each other and all numbers α > 1, is the Renyi divergence of order α between probability distributions P and Q. Although zCDP is difficult to interpret, there are useful results that help provide intuition. First, any M that satisfies ǫ-differential privacy also satisfies ρ-zCDP with ρ = ǫ 2 2 [8]. In general a ρ-zCDP algorithm does not satisfy pure differential privacy but does satisfy (ǫ, δ)-DP for infinitely many pairs of ǫ and δ that lie along a curve (see [9] and [2] for conversions between ρ-zCDP and (ǫ, δ)-DP). Algorithm Design with Differential Privacy A few basic principles underlie the construction of many algorithms for differential privacy. The first is sensitivity, which measures the maximum impact that one record can have on a set of queries (regardless of input data):