NeurIPS 2021

Revisiting the Calibration of Modern Neural Networks

Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic

Cited 510 times

Abstract

Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.

Contributions. To address this need, we provide a systematic comparison of recent image classification models, relating their accuracy, calibration, and design features. We find that:

1. The best current models, including the non-convolutional MLP-Mixer (Tolstikhin et al., 2021) and Vision Transformers (Dosovitskiy et al., 2021), are well calibrated compared to past models and their performance is more robust to distribution shift.
2. In-distribution calibration slightly deteriorates with increasing model size, but this is outweighed by a simultaneous improvement in accuracy.
3. Under distribution shift, calibration improves with model size, reversing the trend seen in-distribution.
4. Accuracy and calibration are correlated under distribution shift, such that optimizing for accuracy may also benefit calibration.
5. Model size, pretraining duration, and pretraining dataset size cannot fully explain differences in calibration properties between model families.

Our results suggest that further improvements in model accuracy will continue to benefit calibration.
They also hint at architecture as an important determinant of model calibration. We provide code and a large dataset of calibration measurements, comprising 180 distinct models from 16 families, each evaluated on 79 ImageNet-scale datasets and 28 metric variants.

1 Related Work

Measures of model calibration. The losses that are commonly used to train classification models, such as cross-entropy and squared error, are proper scoring rules (Gneiting et al., 2007) and are therefore guaranteed to yield perfectly calibrated models at their minimum in the infinite-data limit. However, in practice, due to model mismatch and overfitting, even losses based on proper scoring rules may result in poor model calibration. Miscalibration is commonly quantified in terms of Expected Calibration Error (ECE; Naeini et al., 2015), which measures the absolute difference between predictive confidence and accuracy. We focus on ECE because it is a widely used and accepted calibration metric. Nevertheless, it is well understood that estimating ECE accurately is difficult because estimators can be strongly biased and many estimator variants exist (