VLDB 2025

Data Imputation with Limited Data Redundancy Using Data Lakes

Chenyu Yang, Yuyu Luo, Chuanxuan Cui, Ju Fan, Chengliang Chai, Nan Tang

Abstract

Data imputation is essential for many data science applications. Existing methods rely heavily on sufficient data redundancy among within-table values. However, many real-world datasets lack such redundancy, necessitating external data sources. In this paper, we introduce a retrieval-augmented imputation framework, LakeFill, which combines large language models (LLMs) and data lakes to address this challenge. Unlike existing "table-level" retrieval methods designed for question answering, which retrieve data at the granularity of tables, LakeFill performs fine-grained "tuple-level" retrieval optimized specifically for data imputation. It encodes (possibly incomplete) tuples to capture nuanced similarities and differences, enabling effective identification of candidate tuples. A novel reranking method that integrates checklist-based training data annotation with stratified training group construction further refines the retrieved tuples. Finally, a reasoner with a novel two-stage confidence-aware imputation ensures reliable imputation results. Extensive experiments show that LakeFill significantly outperforms state-of-the-art methods for data imputation when data redundancy is limited.
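
To make the retrieve-then-impute idea concrete, the following is a minimal illustrative sketch and not LakeFill's implementation. It assumes a toy in-memory "data lake" of dictionaries, uses token-level Jaccard overlap as a stand-in for the learned tuple encoder, omits the reranking step entirely, and replaces the LLM reasoner and its two-stage confidence check with a simple majority vote and a single threshold. All names (`serialize`, `similarity`, `retrieve`, `impute`, `min_confidence`) are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing cell


def serialize(tup):
    """Flatten a tuple (a dict) into a set of lowercase tokens, skipping missing cells."""
    tokens = set()
    for value in tup.values():
        if value is MISSING:
            continue
        tokens.update(str(value).lower().split())
    return tokens


def similarity(a, b):
    """Jaccard overlap of token sets; a stand-in for a learned tuple encoder."""
    ta, tb = serialize(a), serialize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query, lake, k=3):
    """Return up to k lake tuples most similar to the (possibly incomplete) query."""
    scored = [(similarity(query, t), t) for t in lake]
    scored = [(s, t) for s, t in scored if s > 0]  # drop unrelated tuples
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


def impute(query, target_attr, lake, k=3, min_confidence=0.6):
    """Majority-vote the target attribute over retrieved tuples.

    Returns (value, confidence); value is None when the vote is too weak,
    so the caller can abstain instead of filling in a guess.
    """
    candidates = [t for t in retrieve(query, lake, k)
                  if t.get(target_attr) is not MISSING]
    if not candidates:
        return None, 0.0
    value, count = Counter(t[target_attr] for t in candidates).most_common(1)[0]
    confidence = count / len(candidates)
    if confidence < min_confidence:
        return None, confidence
    return value, confidence


# Toy data lake: tuples come from different tables with different schemas.
lake = [
    {"name": "Lionel Messi", "club": "Inter Miami", "country": "Argentina"},
    {"name": "Lionel Messi", "sport": "football", "country": "Argentina"},
    {"name": "Cristiano Ronaldo", "club": "Al Nassr", "country": "Portugal"},
    {"name": "Angela Merkel", "role": "chancellor", "country": "Germany"},
]
query = {"name": "Lionel Messi", "club": "Inter Miami", "country": MISSING}
result = impute(query, "country", lake, k=2)
```

In this toy setting the query tuple has no within-table redundancy to draw on, so the missing `country` value is recovered from externally retrieved tuples. Returning `None` below the threshold reflects the general idea of abstaining when the evidence is weak, under the assumptions stated above.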