VLDB2025

Cleaning both Data Errors and Inaccurate Constraints on Numerical Sequential Data

Xiaoou Ding, Muyun Zhou, Yida Liu, Chen Wang, Hongzhi Wang, Jianmin Wang

Abstract

Numerical sequence data from intelligent devices often suffer from quality issues. Whereas existing data cleaning methods focus on repairing the data alone, we address the problem of repairing both data errors and inaccurate constraints. We propose two operations for modifying inaccurate constraints: expanding and compressing their value domains. Our solution comprises constraint modification functions and algorithms that prevent under- and over-fitting in data cleaning. Theoretical analysis demonstrates the reliability and effectiveness of the proposed solution, which produces a repair whose distance from the optimal repair is no greater than $|\Sigma'_l| \cdot \epsilon_e + |\Sigma'_r| \cdot \epsilon_s$. Experiments on real-life and synthetic datasets show that our bNDCRepair method improves the F1-score by 17.6% compared with using the original constraints and achieves the best MNAD. The combination of bNDCRepair with the state-of-the-art CVtRepair and Clean4TSDB also delivers strong performance on sequential data tasks.
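To make the two constraint operations concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a constraint can be modeled as a closed interval on the change between consecutive values; the names `RangeConstraint`, `expand`, `compress`, and `violations`, as well as the sample sequence and the step sizes, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    """Hypothetical constraint: each change x[i+1] - x[i] must lie in [low, high]."""
    low: float
    high: float


def expand(c: RangeConstraint, eps: float) -> RangeConstraint:
    """Widen the value domain by eps on both sides (constraint was too strict)."""
    return RangeConstraint(c.low - eps, c.high + eps)


def compress(c: RangeConstraint, eps: float) -> RangeConstraint:
    """Narrow the value domain by eps on both sides (constraint was too loose)."""
    if c.low + eps > c.high - eps:
        raise ValueError("eps is too large: the value domain would become empty")
    return RangeConstraint(c.low + eps, c.high - eps)


def violations(seq: list[float], c: RangeConstraint) -> list[int]:
    """Return indices i (i >= 1) whose change from seq[i - 1] breaks the constraint."""
    return [
        i + 1
        for i in range(len(seq) - 1)
        if not (c.low <= seq[i + 1] - seq[i] <= c.high)
    ]


# Assumed toy data: one spike (30.0) and one legitimate but fast change (11.0 -> 13.5).
seq = [10.0, 11.0, 13.5, 14.0, 30.0, 15.0]
original = RangeConstraint(-2.0, 2.0)

flagged_original = violations(seq, original)                  # [2, 4, 5]
flagged_expanded = violations(seq, expand(original, 1.0))     # [4, 5]
flagged_compressed = violations(seq, compress(original, 1.5)) # [1, 2, 4, 5]
```

In this toy setting, the original constraint flags the fast but plausible change at index 2 as an error; expanding the domain removes that false alarm while still flagging the spike, whereas compressing it too far flags additional normal points. This mirrors, in a simplified form, why a cleaning method has to guard against both under- and over-fitting when it adjusts constraints.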