VLDB 2025

FDepHunter: Harnessing Negative Examples to Expose Fakes and Reveal Ghosts

Pavel Koupil, Jáchym Bártík, Stefan Klessinger, André Conrad, Stefanie Scherzinger

Abstract

Functional dependency (FD) discovery is a fundamental task in data profiling. Inevitably, existing approaches can return fake FDs, which hold in the given data only by coincidence. Moreover, these approaches fail to identify ghost FDs, which would be observable in a clean dataset but remain undetected because of outliers in the data. We introduce an interactive method for dependency discovery that augments an Armstrong relation with additional tuples. Artificially generated negative examples that emulate real-world tuples help expose fake FDs, while domain experts confirm that positive examples indeed reflect the characteristics of the original dataset. Our prototype tool, FDepHunter, thus provides a novel human-in-the-loop workflow in which the set of discovered FDs is refined iteratively.
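
To make the notions of fake and ghost FDs concrete, the following minimal sketch checks whether an FD holds on a small relation. It is an illustration only and not FDepHunter's implementation: the helper `fd_holds`, the toy relation `employees`, and all attribute names and values are hypothetical. It shows how a single realistic negative example can expose a coincidental FD, and how an outlier can hide an FD that holds on the clean data.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def fd_holds(rows: List[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Return True if lhs -> rhs holds: rows equal on lhs are equal on rhs."""
    seen: Dict[Tuple[str, ...], str] = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # two rows agree on lhs but differ on rhs
        seen.setdefault(key, row[rhs])
    return True


# Hypothetical toy relation (assumption, not data from the paper).
employees: List[Row] = [
    {"name": "Ann", "city": "Passau", "zip": "94032"},
    {"name": "Bob", "city": "Prague", "zip": "11000"},
]

# name -> city holds here, but only because no two rows share a name.
candidate_holds = fd_holds(employees, ["name"], "city")

# A negative example: a plausible tuple that agrees on name but not on city.
# If an expert accepts it as realistic, name -> city is exposed as fake.
negative_example: Row = {"name": "Ann", "city": "Prague", "zip": "11000"}
exposed_as_fake = not fd_holds(employees + [negative_example], ["name"], "city")

# A ghost FD: zip -> city is hidden by one outlier (a variant city spelling).
outlier: Row = {"name": "Cid", "city": "Praha", "zip": "11000"}
hidden_by_outlier = not fd_holds(employees + [outlier], ["zip"], "city")
holds_on_clean_data = fd_holds(employees, ["zip"], "city")
```

In this sketch, the decision whether `negative_example` is a realistic tuple is exactly the kind of judgment that is left to the domain expert; the code only checks the consequence of that decision for the candidate FD.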