SIGMOD 2025

Minimum Change ≠ Best Cleaning: Parallel and Incremental Error Detection under Integrity Constraints

Jiahui Chen, Yu Sun, Shaoxu Song, Haiwei Zhang, Xiaojie Yuan

Cited by 1

Abstract

Erroneous data frequently arise in practice for a variety of reasons, severely degrading data quality and impeding downstream applications. A widely adopted strategy for error detection is to detect conflicts based on integrity constraints and identify the minimum number of errors, so that the remaining cells satisfy the constraints. However, the minimum change principle may not hold in practical scenarios, since errors can occur simultaneously or irregularly. This study therefore employs Bayesian statistics to identify erroneous attribute values among conflicting cells that violate inter-attribute dependencies, rather than relying solely on the minimum change principle. This ensures that our approach neither misses multiple erroneous attribute values that conflict with each other nor mistakenly flags outliers that are not errors. Furthermore, to address the efficiency issues commonly encountered in constraint-based data cleaning methods, we design 1) parallel conflict detection and error determination methods with guaranteed parallel scalability, and 2) efficient incremental error detection strategies that can also be executed in parallel with such guarantees. Experiments on various datasets demonstrate the superiority of our error detection methods in terms of both effectiveness and efficiency.
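To make the baseline concrete, the following is a minimal sketch of conflict detection under a single functional dependency, followed by a minimum-change style decision that flags the minority value in each conflicting group. It illustrates the strategy the abstract argues against, not the paper's Bayesian method. The function names, the `zip -> city` dependency, and the sample rows are all hypothetical, and the baseline is assumed to change only right-hand-side values.

```python
from collections import Counter, defaultdict


def fd_conflicts(rows, lhs, rhs):
    """Group row indices by the `lhs` value and keep only the groups
    whose rows disagree on `rhs` (violations of the FD lhs -> rhs)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)
    return {
        key: idxs
        for key, idxs in groups.items()
        if len({rows[i][rhs] for i in idxs}) > 1
    }


def min_change_errors(rows, conflicts, rhs):
    """Minimum-change style baseline (an assumption for illustration):
    in each conflicting group, keep the most frequent `rhs` value and
    flag every row that carries a different one."""
    flagged = []
    for idxs in conflicts.values():
        counts = Counter(rows[i][rhs] for i in idxs)
        majority, _ = counts.most_common(1)[0]
        flagged.extend(i for i in idxs if rows[i][rhs] != majority)
    return sorted(flagged)


# Hypothetical sample data for the dependency zip -> city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Boston"},
    {"zip": "02108", "city": "Boston"},
]

conflicts = fd_conflicts(rows, "zip", "city")
flagged = min_change_errors(rows, conflicts, "city")
```

On this sample, the baseline flags only row 2, because it is the minority within its group. If the two majority rows were the erroneous ones, the same heuristic would flag the single correct row instead, which is the kind of failure the abstract attributes to errors that occur simultaneously.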