SIGMOD 2025
Guardrail: Automated Integrity Constraint Synthesis From Noisy Data
Pingchuan Ma, Zhaoyu Wang, Zhenlan Ji, Zongjie Li, Ao Sun, Shuai Wang
Abstract
Data quality issues have been a long-standing challenge in the database community. Erroneous data can lead to incorrect query results, which in turn undermine the credibility of data-driven decisions. To circumvent this issue, a common practice is to discover integrity constraints and enforce them on the data to ensure its quality. For instance, one can use constraints entailed by functional dependencies (FDs) to detect violations in the data. However, existing approaches fail to discover such constraints effectively from noisy data. In this paper, we present a novel form of integrity constraints, expressed as a program in a domain-specific language (DSL), that can be used to detect and rectify errors in the data. On top of this DSL, we propose an efficient synthesis algorithm that leverages the statistical structural properties of the data to generate a sketch of the program, which significantly reduces the search space and speeds up the synthesis process. To demonstrate the usefulness of our approach, we evaluate it on 12 real-world datasets for error detection. We then show that the synthesized integrity constraints can be used to solidify ML-integrated SQL queries: across 48 queries, they reduce error rates by 87% on average. Our open-source artifact, including the Guardrail framework and the datasets, is available for the community to use [2].
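As a rough illustration of the FD-based violation detection mentioned in the abstract, the sketch below checks a single functional dependency over a list of records. It is not the Guardrail DSL or its synthesis algorithm; the function name `fd_violations`, the `zip` and `city` attributes, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the LHS keys whose rows disagree on the RHS attribute.

    A functional dependency lhs -> rhs holds when every combination of
    values for the lhs attributes maps to exactly one rhs value.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        seen[key].add(row[rhs])
    # Keep only the keys that map to more than one distinct rhs value.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical sample data: the third row conflicts with the first two.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "94105", "city": "San Francisco"},
]

violations = fd_violations(rows, ["zip"], "city")
print(violations)
```

Note that a check like this only flags the conflicting group; it cannot say which row is wrong, and it presumes the dependency `zip -> city` is already known. Discovering such constraints when the data itself is noisy is the problem the abstract describes.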