SIGMOD2023
Query-Guided Resolution in Uncertain Databases
Osnat Drien, Matanya Freiman, Antoine Amarilli, Yael Amsterdamer
被引用 3 次
摘要
We present a novel framework for uncertain data management. We start with a database whose tuple correctness is uncertain and an oracle that can resolve the uncertainty, i.e., decide if a tuple is correct or not. Such an oracle may correspond, e.g., to a data expert or to a crowdsourcing platform. We wish to use the oracle to clean the database with the goal of ensuring the correct answer for specific mission-critical queries. To avoid the prohibitive cost of cleaning the entire database and to minimize the expected number of calls to the oracle, we must carefully select tuples whose resolution would suffice to resolve the uncertainty in query results. In other words, we need a query-guided process for the resolution of uncertain data. We develop an end-to-end solution to this problem, based on the derivation of query answers and on correctness probabilities for the uncertain data. At a high level, we first track Boolean provenance to identify which input tuples contribute to the derivation of each output tuple, and in what ways. We then design an active learning solution for iteratively choosing tuples to resolve, based on the provenance structure and on an evolving estimation of tuple correctness probabilities. We conduct an extensive experimental study to validate our framework in different use cases. P R E P R I N T exhaustive cleaning of the entire database, which can be prohibitively costly, previous work suggested approaches for semi-automated, interactive or crowdpowered cleaning processes (e.g., [8, 11, 19, 20, 70, 22, 32, 60, 64, 81, 93, 96] ). Typically, such processes clean only a part of the data and/or only specific types of data errors, such as constraint violations. In the present work, we study the problem of query-guided uncertainty resolution. We start from a database whose data correctness is uncertain (i.e., which may contain incorrect tuples) and a query (or a set of queries) that capture the data relevant for analysis. Our goal is to identify the precise set of correct query results, by resolving the uncertainty of input tuples. Of course, we can do this by naively verifying every input tuple, thereby achieving a certain and correct input database. However, as mentioned above, verifying all input tuples may be too costly. We thus aim at verifying a subset of the input tuples that suffices to determine the correct results of the given query. As we will show, there exist such subsets that are typically significantly smaller than the full database. Resolving the uncertainty of a tuple is abstractly modeled as a probe to an oracle: in practice, the oracle may be data experts, crowd workers, high-quality external sources, etc. Therefore, our challenge is identifying which tuples to probe in order to minimize the number of oracle calls. Our novel solution addresses this challenge by accounting, in a fine-grained manner, for how the data is derived and for the probabilities of probe answers.