ICLR 2023
Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks
Zhen Lin, Shubhendu Trivedi, Jimeng Sun
Abstract
Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. In high-risk applications like healthcare, practitioners require fully calibrated probability predictions for decision-making. That is, conditioned on the prediction vector, every class's probability should be close to its predicted value. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs, reduce classification accuracy in the process, or only calibrate the predicted class. This paper proposes a new Kernel-based calibration method called KCal. Unlike existing calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, KCal learns a metric space on the penultimate-layer latent embedding and generates predictions using kernel density estimates on a calibration set. We first analyze KCal theoretically, showing that it enjoys a provable full calibration guarantee. Then, through extensive experiments across a variety of datasets, we show that KCal consistently outperforms baselines as measured by the calibration error and by proper scoring rules like the Brier Score.

Recent research has started to focus on full calibration; see, for example, Vaicenavicius et al. (2019); Kull et al. (2019); Widmann et al. (2019); Karandikar et al. (2021); Mukhoti et al. (2020); Patel et al. (2021). We approach this problem by leveraging the latent neural network embedding in a nonparametric manner. Nonparametric methods such as histogram binning (HB) (Zadrozny & Elkan, 2001) and isotonic regression (IR) (Zadrozny & Elkan, 2002) are natural for calibration and have become popular. Gupta & Ramdas (2021) recently showed a calibration guarantee for HB. However, HB usually leads to noticeable drops in accuracy (Patel et al., 2021), and IR is prone to overfitting (Niculescu-Mizil & Caruana, 2005).
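To make the idea of predicting from kernel density estimates on a calibration set more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Gaussian kernel with a fixed bandwidth and takes already-computed embeddings as input; the function name `kde_class_probs` and its parameters are hypothetical, and the learned metric space described in the text is replaced here by plain Euclidean distance.

```python
import numpy as np


def kde_class_probs(query, calib_emb, calib_labels, num_classes, bandwidth=1.0):
    """Kernel-weighted class probabilities for each row of `query`.

    Assumptions (not from the source): Gaussian kernel, Euclidean distance,
    fixed bandwidth. `calib_emb` holds calibration-set embeddings and
    `calib_labels` their integer class labels.
    """
    # Pairwise squared distances, shape (n_query, n_calib), via broadcasting.
    diff = query[:, None, :] - calib_emb[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Log kernel weights; subtracting the row-wise max avoids underflow and
    # cancels out in the normalisation below.
    log_w = -sq_dist / (2.0 * bandwidth ** 2)
    log_w -= log_w.max(axis=1, keepdims=True)
    weights = np.exp(log_w)

    # Sum the kernel weights per class, then normalise each row to sum to 1.
    one_hot = np.eye(num_classes)[calib_labels]   # (n_calib, num_classes)
    class_mass = weights @ one_hot                # (n_query, num_classes)
    return class_mass / class_mass.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic, well-separated classes standing in for real embeddings.
    calib = np.concatenate([rng.normal(-3, 0.5, (20, 2)),
                            rng.normal(3, 0.5, (20, 2))])
    labels = np.array([0] * 20 + [1] * 20)
    print(kde_class_probs(np.array([[-3.0, -3.0], [0.0, 0.0]]),
                          calib, labels, num_classes=2))
```

Each output row is a full probability vector over all classes rather than a single confidence score, which is the object that full calibration constrains. In practice the bandwidth and the embedding space would need to be chosen or learned; this sketch fixes both for brevity.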
Unlike existing methods, we take one step back and train a new low-dimensional metric space on the penultimate-layer embeddings of DNNs. Then, we use a kernel density estimation-based classifier to predict the class probabilities directly. We refer to our Kernel-based Calibration method as KCal. Unlike most calibration methods, KCal provides high probability error bounds for full calibration under standard assumptions. Empirically, we show that with little overhead, KCal outperforms all existing calibration methods in terms of calibration quality, across multiple tasks and DNN architectures, while maintaining and sometimes improving the classification accuracy.

Summary of Contributions:
• We propose KCal, a principled method that calibrates DNNs using kernel density estimation on the latent embeddings.
• We present an efficient pipeline to train KCal, including a dimension-reducing projection and a stratified sampling method to facilitate efficient training.
• We provide finite sample bounds for the calibration error of KCal-calibrated output under standard assumptions. To the best of our knowledge, this is the first method with a full calibration guarantee.
• In extensive experiments on multiple datasets and state-of-the-art models, we found that KCal outperforms existing calibration methods in commonly used evaluation metrics. We also show that KCal provides more reliable predictions for important classes in the healthcare datasets.

The code to replicate all our experimental results is submitted along with supplementary materials.

RELATED WORK
Research on calibration originated in the context of meteorology and weather forecasting (see Murphy & Winkler (1984) for an overview) and has a long history, much older than the field of machine