WWW2023

Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency

Shaochen Yu, Lei Han, Marta Indulska, Shazia Sadiq, Gianluca Demartini

3 citations

Abstract

Format inconsistency is one of the most frequently appearing data quality issues encountered during data cleaning. Existing automated approaches commonly lack applicability and generalisability, while approaches with human inputs typically require specialized skills such as writing regular expressions. This paper proposes a novel hybrid human-machine system, namely “Data-Scanner-4C”, which leverages crowdsourcing to address syntactic format inconsistencies in a single column effectively. We first ask crowd workers to create examples from single-column data through “data selection” and “result validation” tasks. Then, we propose and use a novel rule-based learning algorithm to infer the regular expressions that propagate formats from created examples to the entire column. Our system integrates crowdsourcing and algorithmic format extraction techniques in a single workflow. Having human experts write regular expressions is no longer required, thereby reducing both the time as well as the opportunity for error. We conducted experiments through both synthetic and real-world datasets, and our results show how the proposed approach is applicable and effective across data types and formats.