NeurIPS 2022

Hyperparameter Sensitivity in Deep Outlier Detection: Analysis and a Scalable Hyper-Ensemble Solution

Xueying Ding, Lingxiao Zhao, Leman Akoglu

Abstract

Outlier detection (OD) literature exhibits numerous algorithms as it applies to diverse domains. However, given a new detection task, it is unclear how to choose an algorithm to use, or how to set its hyperparameter(s) (HPs) in unsupervised settings. HP tuning is an ever-growing problem with the arrival of many new detectors based on deep learning, which usually come with a long list of HPs. Surprisingly, the issue of model selection in the outlier mining literature has been "the elephant in the room": a significant factor in unlocking the utmost potential of deep methods, yet little has been said or done to systematically tackle the issue. In the first part of this paper, we conduct the first large-scale analysis on the HP sensitivity of deep OD methods, and through more than 35,000 trained models, quantitatively demonstrate that model selection is inevitable. Next, we design an HP-robust and scalable deep hyper-ensemble model called ROBOD that assembles models with varying HP configurations, bypassing the choice paralysis. Importantly, we introduce novel strategies to speed up ensemble training, such as parameter sharing, batch/simultaneous training, and data subsampling, that allow us to train fewer models with fewer parameters. Extensive experiments on both image and tabular datasets show that ROBOD achieves and retains robust, state-of-the-art detection performance as compared to its modern counterparts, while taking only 2-10% of the time required by the naïve hyper-ensemble with independent training.

HPs are limited to one-class models [14, 40, 41]. More recently, general-purpose internal (i.e., unsupervised) model evaluation heuristics have been proposed [15, 29, 30], which rely solely on the input data (without labels) and the output (i.e., outlier scores). MetaOD [47] employs meta-learning to transfer information from similar historical tasks to a new task for model selection, but it has only been tested on traditional OD models.
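To make the naïve hyper-ensemble baseline mentioned in the abstract concrete, the following is a minimal sketch under stated assumptions, not the ROBOD implementation. Each ensemble member is trained independently with a different HP configuration, and the members' outlier scores are combined. The detector here is a hypothetical stand-in (PCA reconstruction error, with the number of components as its single HP) rather than a deep autoencoder, and averaging z-scored outlier scores is an assumed aggregation rule chosen for illustration. The function names and the synthetic data are likewise illustrative.

```python
import numpy as np


def pca_reconstruction_scores(X, n_components):
    """Stand-in detector (assumption): outlier score is the squared
    reconstruction error after projecting onto the top principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    recon = Xc @ V @ V.T
    return ((Xc - recon) ** 2).sum(axis=1)


def standardize(scores):
    """Z-score the scores so members with different scales are comparable."""
    std = scores.std()
    return (scores - scores.mean()) / (std if std > 0 else 1.0)


def naive_hyper_ensemble(X, hp_grid):
    """Fit one independent member per HP configuration and average the
    standardized outlier scores (assumed aggregation rule)."""
    member_scores = [
        standardize(pca_reconstruction_scores(X, k)) for k in hp_grid
    ]
    return np.mean(member_scores, axis=0)


# Synthetic data: inliers lie near a 2-D subspace of R^5, outliers do not.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 5))
inliers = rng.normal(size=(200, 2)) @ basis + 0.05 * rng.normal(size=(200, 5))
outliers = rng.normal(scale=3.0, size=(10, 5))
X = np.vstack([inliers, outliers])

# No single HP value is selected; every configuration contributes a member.
scores = naive_hyper_ensemble(X, hp_grid=[1, 2, 3, 4])
```

The cost of this baseline grows linearly with the number of HP configurations because every member is fit from scratch. That is the overhead the speed-up strategies listed in the abstract (parameter sharing, batch/simultaneous training, data subsampling) are meant to reduce.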
Different from those that aim to select a single model, ensemble models have also been employed for OD [3], including those that combine models from the same family [26] as well as heterogeneous detectors from different families [33].

Regarding deep OD methods, we have surveyed a large collection of recent papers and their experimental testbeds and HP settings, a summary of which is given in Appx. A.1 Table 8. To our surprise, we found hardly any discussion on model selection, with only a few works presenting sensitivity analysis with respect to some, but not all, model-specific HPs. While some works reserve labeled validation/hold-out data to tune a subset of the HPs [6, 18, 23], the majority fix the HP values and call them "recommended"/default settings [5, 11, 37, 38, 46, 49]. Moreover, a non-negligible number of existing works choose some critical HPs empirically on test data (!) to yield optimum results [4, 35, 48] (see Table 8, last column). Some works that build on previous models (e.g., deep SVDD-based methods [35] vs. multi-sphere extension [13], transformation-based methods [16] for images vs. their extension to vector data [5], AnoGAN [37] and the follow-up EGBAD [46]) use the same architecture and HP settings as the prior work for consistent/"fair" comparison. However, it is unlikely that the same HP values would work comparably for different models. Admittedly, it is challenging to tune (a long list of) HPs in the absence of labels; yet the opacity in the deep OD literature warrants careful investigation of the stability of model performance under varying HP settings, and ultimately of the fair comparison between these and traditional OD methods.

Deep Model Ensembles. Recently, deep NN predictions have been found to be often poorly calibrated [20].
As Bayesian learning does not offer straightforward training, deep ensemble models have been proposed as a simple alternative [25] to improve predictive uncertainty, along with efficient ways of training deep NN ensembles [19, 42]. In this work, we leverage ensemble modeling toward a different goal: to improve the stability and robustness of unsupervised OD models to HP settings, combining predictions from models with different HPs into an OD hyper-ensemble. The closest to our work is Wenzel et al.'s deep hyper-ensemble [43], which, unlike ours, considers supervised problems, with the goal of further fostering diversity in the ensemble and thereby achieving better uncertainty estimation.

3 Hyperparameter-Sensitivity Analysis of Deep OD

3.1 Testbed Setup

Models. We study the HP sensitivity of five deep OD methods of four different types: a basic deep autoencoder VanillaAE trained with reconstruction loss, the robust deep autoencoder RDA [48], the one-class classification based DeepSVDD [35], the adversarial training based GANomaly [4], and an (AE) ensemble model RandNet [11]. These exhibit 4 to 8 HPs, as listed in Table 1. (See Appx. A.2 for descriptions.) (Note that RandNet is not a hyper-ensemble: members use the same HP configs except for NN spar