SIGMOD2025

Differentially Private Substring and Document Counting

Giulia Bernardini, Philip Bille, Inge Li Gørtz, Teresa Anna Steiner

被引用 1 次

摘要

Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For databases consisting of many documents, one of the most fundamental problems is that of pattern matching and computing (i) how often a pattern appears as a substring in the database ( substring counting ) and (ii) how many documents in the collection contain the pattern as a substring ( document counting ). In this paper, we initiate the theoretical study of substring and document counting under differential privacy. We give an ε-differentially private data structure solving this problem for all patterns simultaneously with a maximum additive error of O(𝓁 • polylog(n𝓁|Σ|)), where 𝓁 is the maximum length of a document in the database, n is the number of documents, and |Σ| is the size of the alphabet. We show that this is optimal up to a O(polylog(𝓃𝓁)) factor. Further, we show that for (ε,δ)-differential privacy, the bound for document counting can be improved to O(√𝓁 • polylog(𝓃𝓁|Σ|)). Additionally, our data structures are efficient. In particular, our data structures use O(𝓃𝓁 2 ) space, O(𝓃 2 𝓁 4 ) preprocessing time, and O(|P|) query time where P is the query pattern. Along the way, we develop a new technique for differentially privately computing a general class of counting functions on trees of independent interest. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and q -grams. For q -grams, we further improve the preprocessing time of the data structure.