SIGMOD2025

A Theoretical Framework for Distribution-Aware Dataset Search

Aryan Esmailpour, Sainyam Galhotra, Rahul Raychaudhury, Stavros Sintos

1 citation

Abstract

Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristic methods, which often result in biased outcomes. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings. More specifically, let P =P 1 ,..., P N be a repository of N datasets, such that each P i ⊂ ℝ d , where d is a constant. We study the percentile-aware indexing (Ptile) problem and the preference-aware indexing (Pref) problem under the centralized and the federated setting. In the centralized setting, we assume direct access to the datasets in P . In the federated setting we are given a synopsis S P i which is a compressed representation of P i that captures the structure of P i , for every i ∈ [N]. For the Ptile problem, the goal is to construct a data structure such that given a predicate (query rectangle R and an interval θ) report all indexes J such that j ∈ J if and only if |P j ∩ R|/|P j | ∈ [N]. For the Ptile problem, the goal is to construct a data structure such that given a predicate (query vector v → and an interval θ) report all indexes J such that j ∈ J if and only if ω k (P j ,v → )∈ θ, where ω k (p j ,v → ) is the score (inner-product) of the k -th largest projection of P j on v → . We first show lower bounds for the Ptile and Pref problems in the centralized setting, showing that we cannot hope for near-linear data structures with polylogarithmic query time. Then we focus on approximate data structures for both problems in both settings. We show Ø(N) space data structures with Ø(N) preprocessing time, that can answer Ptile and Pref queries in Ø(1+OUT) time, where OUT is the output size. The data structures return a set of indexes J such that: i) for every P i that satisfies the predicate, i ∈ J and ii) if j ∈ J then P j satisfies the predicate up to an additive error of ε+2δ, where ε is an arbitrarily small constant and δ is the error of the synopses.