NeurIPS2022
A Theory of PAC Learnability under Transformation Invariances
Han Shao, Omar Montasser, Avrim Blum
25 citations
Abstract
Transformation invariances are present in many real-world problems. For example, image classification is usually invariant to rotation and color transformation: a rotated car in a different color is still identified as a car. Data augmentation, which adds the transformed data into the training set and trains a model on the augmented data, is one commonly used technique to build these invariances into the learning process. However, it is unclear how data augmentation performs theoretically and what the optimal algorithm is in presence of transformation invariances. In this paper, we study PAC learnability under transformation invariances in three settings according to different levels of realizability: (i) A hypothesis fits the augmented data; (ii) A hypothesis fits only the original data and the transformed data lying in the support of the data distribution; (iii) Agnostic case. One interesting observation is that distinguishing between the original data and the transformed data is necessary to achieve optimal accuracy in setting (ii) and (iii), which implies that any algorithm not differentiating between the original and transformed data (including data augmentation) is not optimal. Furthermore, this type of algorithms can even "harm" the accuracy. In setting (i), although it is unnecessary to distinguish between the two data sets, data augmentation still does not perform optimally. Due to such a difference, we propose two combinatorial measures characterizing the optimal sample complexity in setting (i) and (ii)(iii) and provide the optimal algorithms. How does data augmentation perform theoretically? What is the optimal algorithm in terms of sample complexity under transformation invariances? We formalize the problem of binary classification under transformation invariances in the PAC model. Given instance space X , label space Y = 0, 1, and hypothesis class H, we consider the following three settings according to different levels of realizability. (i) Invariantly realizable setting: There exists a hypothesis h * ∈ H such that h * can correctly classify not only the natural data (drawn from the data distribution) but also the transformed data. For example, considering the transformation of rotating images where all natural images are upright, the hypothesis h * can correctly classify every upright image (natural data) and their rotations (transformed data). (ii) Relaxed realizable setting: There exists a hypothesis h * ∈ H such h * has zero error over the support of the data distribution (and therefore will correctly classify the transformed data that lies in the support of the data distribution), but h * may not correctly classify transformed data that lies outside the support of the natural data distribution. For example, there exists an h * classifying all small rotations that lie in the support of the distribution correctly, but misclassifying upside-down cars. (iii) Agnostic setting: Every hypothesis in H might not fit the natural data. In most of this work, we consider the case where the set of transformations forms a group (e.g., all rotations and all color translations), which is a classic setting studied in literature (e.g., Cohen and Welling, 2016; Bloem-Reddy and Teh, 2020; Chen et al., 2020) . Some algorithms and analyses in this work also apply to non-group transformations (e.g., croppings). Main contributions First, we show that DA outperforms vanilla ERM but is sub-optimal in setting (i) above. We then introduce a complexity measure (see Definition 4) that characterizes the optimal sample complexity of learning in setting (i), and we give an optimal (up to log-factors) algorithm in this setting based on 1-inclusion-graph predictors. Second, we characterize the complexity of learning in setting (ii) when the learner only receives the augmented data (without specifying which are natural). Such a characterization provides us with a sufficient condition under which DA "hurts". Third, we introduce a complexity measure (see Definition 5) that characterizes the optimal sample complexity of learning in settings (ii) and (iii) above, and we give optimal algorithms for these settings. Finally, we also provide adaptive learning algorithms that interpolate between settings (i) and (ii), i.e., when h * is partially invariant. We want to emphasize that our complexity measures take into account the complexity of both the hypothesis class H and the set of transformations being considered. The results are formally summarized in Section 3. Related work Theoretical guarantees of DA has received a lot of attention recently. Chen et al. (2020); Lyle et al. ( 2020 ) study theoretical guarantees of DA under the assumption of "equality" in distribution, i.e., for any transformation in the transformation group, the data distribution of the transformed data is approximately the same as that of the natural data (e.g., the upside-down variations of images happen at the same probability as the original upright image