VLDB2023

DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms

Shweta Patwa, Danyu Sun, Amir Gilad, Ashwin Machanavajjhala, Sudeepa Roy

2 citations

Abstract

Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental operations in data analysis include analyzing aggregated statistics, e.g., count, sum, or median, on a subset of data satisfying some conditions. When synthetic data is generated, users may be interested in knowing if their aggregated queries generating such statistics can be reliably answered on the synthetic data, for instance, to decide if the synthetic data is suitable for specific tasks. However, the standard data generation systems do not provide "per-query" quality guarantees on the synthetic data, and the users have no way of knowing how much the aggregated statistics on the synthetic data can be trusted. To address this problem, we present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other while guaranteeing differential privacy. We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally. Differential Privacy We use Differential Privacy (DP) [11] as the measure of privacy. We say that two databases ๐ท and ๐ท โ€ฒ are neighbors if they differ by a single tuple. This is denoted by ๐ท โ‰ˆ ๐ท โ€ฒ . Definition 2.2 (Differential Privacy [15] ). A randomized mechanism M is said to satisfy ๐œ–-DP if โˆ€๐‘† โІ ๐‘…๐‘Ž๐‘›๐‘”๐‘’ (M) and โˆ€๐ท, ๐ท โ€ฒ pair of neighboring databases, i.e., ๐ท โ‰ˆ ๐ท โ€ฒ , Smaller ๐œ– gives stronger privacy guarantee. Definition 2.3 (Global Sensitivity). For a scalar query ๐‘ž, its global sensitivity is given by ฮ”๐‘ž = max ๐ทโ‰ˆ๐ท โ€ฒ |๐‘ž(๐ท) -๐‘ž(๐ท โ€ฒ )|. Definition 2.4 (Downward Local Sensitivity). For a scalar query ๐‘ž, its downward local sensitivity on database ๐ท is given by Example 2.5. Consider the private database ๐ท and sum query ๐‘ž 2 from Example 2.1.Suppose ๐‘‘๐‘œ๐‘š(๐‘๐‘Ž๐‘๐‘–๐‘ก๐‘Ž๐‘™-๐‘”๐‘Ž๐‘–๐‘›) is 0, 1, . . . , 99999, so ฮ”๐‘ž 2 = 99999 since the maximum change in the sum over all pairs of neighboring databases is the maximum value in the domain. On the other hand, given a database ๐ท, ๐ท๐‘† ๐‘ž,๐ท equals the largest value in ๐‘๐‘Ž๐‘๐‘–๐‘ก๐‘Ž๐‘™-๐‘”๐‘Ž๐‘–๐‘› from tuples in ๐ท with ๐‘’๐‘‘๐‘ข๐‘๐‘Ž๐‘ก๐‘–๐‘œ๐‘› equal to 12-th. Properties like composition [11] and post-processing [13] give a modular way to build complex DP mechanisms: Proposition 2.6. [11, 13] give the following: (1) ) (Parallel composition) If each M ๐‘– accesses disjoint sets of tuples, then they together satisfy max ๐‘– ๐œ– ๐‘– -DP. (3) (Post-processing) Any function applied to the output of an ๐œ–-DP mechanism M also satisfies ๐œ–-DP. Laplace mechanism (LM). The Laplace mechanism [14] is a common building block in DP mechanisms and is used to get a noisy estimate for scalar queries with numeric answers. The noise injected is calibrated to the global sensitivity of the query.