VLDB 2025

DemandClean: A Multi-Objective Learning Framework for Balancing Model Tolerance to Data Authenticity and Diversity

Zekai Qian, Xiaoou Ding, Chen Wang, Hongzhi Wang

Abstract

Real-world datasets often suffer from multiple quality issues, which hinder downstream model performance and increase cleaning costs. To address this, we propose DemandClean, a reinforcement learning-based adaptive data cleaning framework that dynamically balances cleaning effectiveness against operational cost. DemandClean explicitly considers data authenticity (alignment with real-world facts), diversity (richness of feature values), and the noise tolerance of downstream models. We categorize data errors as missing (reducing both authenticity and diversity), semantic (affecting only authenticity), and syntactic (reducing authenticity but potentially increasing diversity). Given these error types, DemandClean selects one of three actions (Repair, Delete, or No Action), guided by error rates and model robustness. For interpretability, the framework visually distinguishes authenticity, diversity, and tolerance. Extensive experiments confirm that DemandClean achieves near-optimal accuracy at substantially reduced preprocessing cost. Specifically, it reduces repair actions by 80.0% and deletions by 80.7% compared to "Repair All" strategies while matching or exceeding their predictive performance, offering an interpretable, cost-effective, and scalable solution for practical applications.
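To make the error taxonomy and the three-way action space concrete, the following is a minimal Python sketch. Only the three error types, their stated effects on authenticity and diversity, and the three actions come from the abstract. Everything else is an assumption made for illustration: the names `Effect`, `ACTION_COST`, `scalarized_reward`, and `choose_action`, the numeric costs, the `cost_weight` parameter, and the tolerance threshold rule. In particular, the hand-written rule stands in for the learned reinforcement learning policy and should not be read as the authors' method.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    MISSING = "missing"
    SEMANTIC = "semantic"
    SYNTACTIC = "syntactic"


class Action(Enum):
    REPAIR = "repair"
    DELETE = "delete"
    NO_ACTION = "no_action"


@dataclass(frozen=True)
class Effect:
    authenticity: int  # -1 = reduced, 0 = no stated effect
    diversity: int     # -1 = reduced, 0 = no stated effect, +1 = may increase


# Taxonomy as stated in the abstract.
ERROR_EFFECTS = {
    ErrorType.MISSING: Effect(authenticity=-1, diversity=-1),
    ErrorType.SEMANTIC: Effect(authenticity=-1, diversity=0),
    ErrorType.SYNTACTIC: Effect(authenticity=-1, diversity=+1),
}

# ASSUMPTION: illustrative relative costs, not values from the paper.
ACTION_COST = {Action.REPAIR: 1.0, Action.DELETE: 0.3, Action.NO_ACTION: 0.0}


def scalarized_reward(accuracy_gain, action, cost_weight=0.1):
    """ASSUMPTION: a simple linear trade-off between accuracy gain and cost."""
    return accuracy_gain - cost_weight * ACTION_COST[action]


def choose_action(estimated_gain, error_rate, tolerance, cost_weight=0.1):
    """Toy stand-in for a learned policy (hypothetical rule).

    If the error rate is within the model's noise tolerance, do nothing.
    Otherwise pick the action with the best cost-adjusted estimated gain.
    """
    if error_rate <= tolerance:
        return Action.NO_ACTION
    return max(
        Action,
        key=lambda a: scalarized_reward(estimated_gain[a], a, cost_weight),
    )


if __name__ == "__main__":
    # Made-up gain estimates, for demonstration only.
    gains = {Action.REPAIR: 0.30, Action.DELETE: 0.04, Action.NO_ACTION: 0.0}
    print(choose_action(gains, error_rate=0.02, tolerance=0.05))
    print(choose_action(gains, error_rate=0.20, tolerance=0.05))
```

The sketch illustrates one design point from the abstract: when a downstream model tolerates the observed noise level, skipping the cleaning step costs nothing, which is how a demand-driven strategy can use far fewer repairs and deletions than a "Repair All" baseline.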