NeurIPS 2021
Label-Imbalanced and Group-Sensitive Classification under Overparameterization
Ganesh Ramachandra Kini, Orestis Paraskevas, Samet Oymak, Christos Thrampoulidis
118 citations
Abstract
The goal in label-imbalanced and group-sensitive classification is to optimize relevant metrics such as balanced error and equal opportunity. Classical methods, such as weighted cross-entropy (CE), fail when training deep nets to the terminal phase of training (TPT), that is, training beyond zero training error. This observation has motivated a recent flurry of activity in developing heuristic alternatives that follow the intuitive mechanism of promoting larger margins for minorities. In contrast to previous heuristics, we follow a principled analysis explaining how different loss adjustments affect margins. First, we prove that for all linear classifiers trained in TPT, it is necessary to introduce multiplicative, rather than additive, logit adjustments so that the interclass margins change appropriately. To show this, we discover a connection between the multiplicative CE modification and cost-sensitive support-vector machines. Perhaps counterintuitively, we also find that, at the start of training, the same multiplicative weights can actually harm the minority classes. Thus, while additive adjustments are ineffective in the TPT, we show that they can speed up convergence by countering the initial negative effect of the multiplicative weights. Motivated by these findings, we formulate the vector-scaling (VS) loss, which captures existing techniques as special cases. Moreover, we introduce a natural extension of the VS-loss to group-sensitive classification, thus treating the two common types of imbalance (label/group) in a unifying way. Importantly, our experiments on state-of-the-art datasets are fully consistent with our theoretical insights and confirm the superior performance of our algorithms. Finally, for imbalanced Gaussian-mixture data, we perform a generalization analysis, revealing tradeoffs between balanced/standard error and equal opportunity.
Connections to related literature
CE adjustments.
The use of weighted CE (wCE) for imbalanced data is rather old [XM89], but it becomes ineffective under overparameterization, e.g., [BL19]. This deficiency has led to the idea of additive label-based parameters ι_y on the logits [KHB+18, CWG+19, TWL+20, MJR+20, WCLL18]. Specifically, [MJR+20] proved that setting ι_y = log(π_y), where π_y denotes the prior of class y, leads to a Fisher-consistent loss, termed LA-loss, which outperformed other heuristics (e.g., focal loss [LGG+18]) on SOTA datasets. However, Fisher consistency is only relevant in the large-sample-size limit. Instead, we focus on overparameterized models. In a recent work, [YCZC20] proposed the CDT-loss, which instead uses multiplicative label-based parameters ∆_y on the logits. The authors arrive at the CDT-loss as a heuristic means of compensating for the empirically observed phenomenon that last-layer minority features deviate between training and test instances [KK20]. Instead, we arrive at the CDT-loss from a different viewpoint: we show that the multiplicative weights are necessary to move decision boundaries towards majorities when training overparameterized linear models in TPT. Moreover, we argue that while additive weights are not so effective in the TPT, they can help in the initial phase of training. Our analysis sheds light on the individual roles of the two different modifications proposed in the literature and naturally motivates the VS-loss in (2). Compared to the above works, we also demonstrate the successful use of the VS-loss in the group-imbalanced setting and show its competitive performance over alternatives in [SKHL19, HNSS18, OSHL19]. Beyond CE adjustments, there is active research on alternative methods to improve fairness metrics, e.g., [KXR+20, ZCWC20, LMZ+19, OWZY16]. These are orthogonal to CE adjustments and can potentially be used in conjunction with them.
Relation to vector-scaling calibration.
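To make the roles of the additive and multiplicative adjustments concrete, the following is a minimal sketch, assuming the adjusted loss is cross-entropy applied to logits rescaled per class as ∆_y v_y + ι_y, with an optional per-class weight as in wCE. The function names `vs_loss` and `la_params`, the argument `omega`, and all numerical values are hypothetical and chosen only for illustration; the exact form used by the authors is their equation (2), which is not reproduced here.

```python
import math


def vs_loss(logits, label, delta, iota, omega=None):
    """Cross-entropy on adjusted logits delta[c] * logits[c] + iota[c].

    logits, delta, iota: sequences of length C (one entry per class).
    label: integer class index in [0, C).
    omega: optional per-class weights (wCE-style); defaults to weight 1.
    NOTE: assumed form for illustration, not taken verbatim from the paper.
    """
    adjusted = [d * v + i for d, v, i in zip(delta, logits, iota)]
    # Numerically stable log-sum-exp over the adjusted logits.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    weight = 1.0 if omega is None else omega[label]
    return weight * (log_norm - adjusted[label])


def la_params(priors):
    """Additive-only special case: delta_y = 1 and iota_y = log(pi_y)."""
    return [1.0] * len(priors), [math.log(p) for p in priors]


# Toy two-class example: class 1 is the minority (hypothetical numbers).
priors = [0.9, 0.1]
logits = [2.0, 1.0]

# Plain CE: no multiplicative or additive adjustment.
plain = vs_loss(logits, 1, [1.0, 1.0], [0.0, 0.0])

# Additive adjustment with iota_y = log(pi_y).
delta, iota = la_params(priors)
la = vs_loss(logits, 1, delta, iota)

# Multiplicative-only adjustment with hypothetical weights delta = [1.0, 0.5].
cdt_like = vs_loss(logits, 1, [1.0, 0.5], [0.0, 0.0])

print(plain, la, cdt_like)
```

With ∆ set to all ones and ι set to zero, the function reduces to plain CE. In this toy example the additive log-prior shift raises the loss on the minority-class sample relative to plain CE, which pushes training to enlarge that sample's logit gap.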
Our naming of the VS-loss is inspired by the vector scaling (VS) calibration [GPSW17], a post-hoc procedure that modifies the logits v after training via v → ∆ ⊙ v + ι, where ⊙ is the Hadamard product. [ZCO20] shows that VS can improve calibration for imbalanced classes, but, in contrast to VS calibration, the multiplicative/additive scalings in our VS-loss are part of the loss and directly affect training.
Blessings/curses of overparameterization. Overparameterization acts as a catalyst for deep neural networks [NKB+19]. In terms of optimization, [SHN+18, OS19, JT18, AH18] show that gradient-based algorithms are implicitly biased towards favorable min-norm solutions. Such solutions are then analyzed in terms of generalization, showing that they can in fact lead to benign overfitting, e.g., [BLLT20, HMRT19]. While implicit bias is key to benign overfitting, it may come with certain downsides. As a matter of fact, we show here that certain hyper-parameters (e.g., additive ones) can be ineffective in promoting fairness in the interpolating regime. Our argument essentially builds on characterizing the implicit bias of wCE/LA/CDT-losses. Related to th