NeurIPS2023
H-Consistency Bounds: Characterization and Extensions
Anqi Mao, Mehryar Mohri, Yutao Zhong
28 citations
Abstract
A series of recent publications by Awasthi, Mao, Mohri, and Zhong [2022b] have introduced the key notion of H-consistency bounds for surrogate loss functions. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are both non-asymptotic and hypothesis set-specific and thus stronger and more informative than Bayes-consistency. However, determining if they hold and deriving these bounds have required a specific proof and analysis for each surrogate loss. Can we derive more general tools and characterizations? This paper provides both a general characterization and an extension of H-consistency bounds for multi-class classification. We present new and tight H-consistency bounds for both the family of constrained losses and that of comp-sum losses, which covers the familiar crossentropy, or logistic loss applied to the outputs of a neural network. We further extend our analysis beyond the completeness assumptions adopted in previous studies and cover more realistic bounded hypothesis sets. Our characterizations are based on error transformations, which are explicitly defined for each formulation. We illustrate the application of our general results through several special examples. A by-product of our analysis is the observation that a recently derived multi-class H-consistency bound for cross-entropy reduces to an excess bound and is not significant. Instead, we prove a much stronger and more significant guarantee. Preliminaries We denote by X the input space, by Y the output space, and by D a distribution over X×Y. We consider the standard scenario of multi-class classification, where Y = 1, . . . , n. Given a hypothesis set H of functions mapping X × Y to R, the multi-class classification problem consists of finding a hypothesis h ∈ H with small generalization error R 0-1 (h), defined by R 0-1 (h) = E (x,y)∼D [ 0-1 (h, x, y)], where 0-1 (h, x, y) = 1 h(x)≠y is the multi-class zero-one loss with h(x) = argmax y∈Y h(x, y) the prediction of h for the input point x. We also denote by H(x) the set of all predictions associated to input x generated by functions in H, that is, H(x) = h(x)∶ h ∈ H. We will analyze the guarantees of surrogate multi-class losses in terms of the zero-one loss. We denote by a surrogate loss and by R (h) its generalization error, R (h) = E (x,y)∼D [ (h, x, y)]. For a loss function , we define the best-in-class generalization error within a hypothesis set H as R * (H) = inf h∈H R (h), and refer to R (h) -R * (H) as the estimation error. We will study the key notion of H-consistency bounds [Awasthi et al., 2022a,b], which are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error, for some real-valued function f that is non-decreasing: ). These bounds imply that the zero-one estimation error is at most f ( ) whenever the surrogate loss estimation error is bounded by . Thus, the learning guarantees provided by H-consistency bounds are both non-asymptotic and hypothesis set-specific. The function f appearing in these bounds is expressed in terms of a minimizability gap, which is a quantity measuring the difference of bestin-class error R * (H) and the expected best-in-class conditional error E [ (h, x, y)] and C * (H, x) = inf h∈H C (h, x) are the conditional error and best-in-class conditional error respectively. We further write ∆C ,H = C (h, x) -C * (H, x) to denote the conditional regret. Note that that the minimizability gap is an inherent quantity depending on a hypothesis set H and the loss function .