SIGMOD2025

Private Synthetic Data Generation in Bounded Memory

Rayne Holland, Seyit Camtepe, Chandra Thapa, Minhui Xue

摘要

Protecting sensitive information on data streams is a pivotal challenge for modern systems. Current approaches to providing privacy in data streams can be broadly categorized into two strategies. The first strategy involves transforming the stream into a private sequence of values, enabling the subsequent use of non-private methods of analysis. While effective, this approach incurs high memory costs, often proportional to the size of the database. Alternatively, a compact data structure can be used to provide a private summary of the stream. However, these data structures are limited to predefined queries, restricting their flexibility. To overcome these limitations, we propose a lightweight synthetic data generator, PrivHP, that provides differential privacy guarantees. PrivHP is based on a novel method for the private hierarchical decomposition of the input domain in bounded memory. As the decomposition approximates the cumulative distribution function of the input, it serves as a lightweight structure for synthetic data generation. PrivHP is the first method to provide a principled trade-off between accuracy and space for private hierarchical decompositions. It achieves this by balancing hierarchy depth, noise addition, and selective pruning of low-frequency subdomains while preserving high-frequency ones, all identified in a privacy-preserving manner. To ensure memory efficiency, we employ private sketches to estimate subdomain frequencies without accessing the entire dataset. Central to our approach is the introduction of a pruning parameter k , which enables an almost smooth interpolation between space usage and utility, and a measure of skew tail k , which is a vector of subdomain frequencies containing all but the largest k coordinates. PrivHP processes a dataset X using M = O (k log 2 | X |)) space and, on input domain Ω = [0,1] d , while maintaining ε-differential privacy, produces a synthetic data generator that is at distance O ( M (1-1/d) /ε n + ||tail k ( X )|| 1 /M 1/d n ) from the empirical distribution in the expected Wasserstein metric. Compared to the state-of-the-art, PMM, which achieves accuracy O ((ε n) -1/d ) with memory O (ε n), our method introduces an additional approximation error term of O (||tail k ( X )|| 1 /(M 1/d n)), but operates in significantly reduced space. Additionally, we provide interpretable utility bounds that account for all error sources, including those introduced by the fixed hierarchy depth, privacy noise, hierarchy pruning, and frequency approximations.