SIGMOD2025

Two Birds with One Stone: Efficient Deep Learning over Mislabeled Data through Subset Selection

Yuhao Deng, Chengliang Chai, Kaisen Jin, Linan Zheng, Lei Cao, Ye Yuan, Guoren Wang

2 citations

Abstract

Using a large training dataset to train a big and powerful model -- a typical practice in modern deep learning, often suffers from two major problems: the expensive and slow training process and the error-prone labels. The existing approaches, targeting either speeding up the training by selecting a subset of representative training instances (subset selection) or eliminating the negative effect of mislabels during training (mislabel detection), do not perform well in this scenario due to overlooking one of these two problems. To fill this gap, we propose Deem, a novel data-efficient framework that selects a subset of representative training instances under label uncertainty. The key idea is to leverage the metadata produced during deep learning training, e.g., training losses and gradients, to estimate the label uncertainty and select the representative instances. In particular, we model the problem of subset selection under uncertainty as a problem of finding a subset that closely approximates the gradient of the whole training data set derived on soft labels. We show that it is an NP-hard problem with submodular property and propose a low complexity algorithm to solve this problem with an approximate ratio. Training on this small subset thus improves the training efficiency while guaranteeing the model's accuracy. Moreover, we propose an efficient strategy to dynamically refine this subset during the iterative training process. Extensive experiments on 6 datasets and 10 baselines demonstrate that Deem accelerates the training process up to 10X without sacrificing the model accuracy.