VLDB2023
DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms
Shweta Patwa, Danyu Sun, Amir Gilad, Ashwin Machanavajjhala, Sudeepa Roy
2 citations
Abstract
Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental operations in data analysis include analyzing aggregated statistics, e.g., count, sum, or median, on a subset of data satisfying some conditions. When synthetic data is generated, users may be interested in knowing if their aggregated queries generating such statistics can be reliably answered on the synthetic data, for instance, to decide if the synthetic data is suitable for specific tasks. However, the standard data generation systems do not provide "per-query" quality guarantees on the synthetic data, and the users have no way of knowing how much the aggregated statistics on the synthetic data can be trusted. To address this problem, we present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other while guaranteeing differential privacy. We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally. Differential Privacy We use Differential Privacy (DP) [11] as the measure of privacy. We say that two databases ๐ท and ๐ท โฒ are neighbors if they differ by a single tuple. This is denoted by ๐ท โ ๐ท โฒ . Definition 2.2 (Differential Privacy [15] ). A randomized mechanism M is said to satisfy ๐-DP if โ๐ โ ๐ ๐๐๐๐ (M) and โ๐ท, ๐ท โฒ pair of neighboring databases, i.e., ๐ท โ ๐ท โฒ , Smaller ๐ gives stronger privacy guarantee. Definition 2.3 (Global Sensitivity). For a scalar query ๐, its global sensitivity is given by ฮ๐ = max ๐ทโ๐ท โฒ |๐(๐ท) -๐(๐ท โฒ )|. Definition 2.4 (Downward Local Sensitivity). For a scalar query ๐, its downward local sensitivity on database ๐ท is given by Example 2.5. Consider the private database ๐ท and sum query ๐ 2 from Example 2.1.Suppose ๐๐๐(๐๐๐๐๐ก๐๐-๐๐๐๐) is 0, 1, . . . , 99999, so ฮ๐ 2 = 99999 since the maximum change in the sum over all pairs of neighboring databases is the maximum value in the domain. On the other hand, given a database ๐ท, ๐ท๐ ๐,๐ท equals the largest value in ๐๐๐๐๐ก๐๐-๐๐๐๐ from tuples in ๐ท with ๐๐๐ข๐๐๐ก๐๐๐ equal to 12-th. Properties like composition [11] and post-processing [13] give a modular way to build complex DP mechanisms: Proposition 2.6. [11, 13] give the following: (1) ) (Parallel composition) If each M ๐ accesses disjoint sets of tuples, then they together satisfy max ๐ ๐ ๐ -DP. (3) (Post-processing) Any function applied to the output of an ๐-DP mechanism M also satisfies ๐-DP. Laplace mechanism (LM). The Laplace mechanism [14] is a common building block in DP mechanisms and is used to get a noisy estimate for scalar queries with numeric answers. The noise injected is calibrated to the global sensitivity of the query.