ICLR2023
Improving the imputation of missing data with Markov Blanket discovery
Yang Liu, Anthony C. Constantinou
Abstract
Imputation of missing data typically relies on generative and regression models. These approaches often operate on the unrealistic assumption that all data features are directly related to one another, and use all available features to impute missing values. In this paper, we propose a novel Markov Blanket discovery approach that determines the optimal feature set for a given variable by considering both the observed variables and the missingness of partially observed variables, so as to account for systematic missingness. We then incorporate this method into the learning process of the state-of-the-art MissForest imputation algorithm, so that it informs MissForest which features to consider when imputing missing values, depending on the variable to which the missing value belongs. Experiments across different case studies and multiple imputation algorithms show that the proposed solution improves imputation accuracy under both random and systematic missingness.
Recently, causal information has also been applied to feature selection for missing data imputation. Kyono et al. (2021) proposed imputing the missing values of a variable given its causal parents, which are derived from the weights of the input layer of a neural network. Similarly, Yu et al. (2022) proposed the MimMB framework, which learns Markov Blankets (MBs) for feature selection in imputation through an iterative process that learns MBs from the imputed data and updates the learned MBs after each iteration. While MimMB is related to our work, since we also use MB construction for feature selection, an important distinction between the two is that MimMB combines MBs with imputed data whereas, as we later describe in Section 3, the MB learning phase that we propose is separated from imputation, accounts for partially observed variables, and improves computational efficiency.
In this paper, we use the graphical expression of missingness proposed by Mohan et al.
(2013), known as the m-graph, which is a graph that captures observed variables in conjunction with the possible