NeurIPS2020

The Implications of Local Correlation on Learning Some Deep Functions

Eran Malach, Shai Shalev-Shwartz

12 citations

Abstract

It is known that learning deep neural-networks is computationally hard in the worst-case. In fact, the proofs of such hardness results show that even weakly learning deep networks is hard. In other words, no efficient algorithm can find a predictor that is slightly better than a random guess. However, we observe that on natural distributions of images, small patches of the input image are correlated to the target label, which implies that on such natural data, efficient weak learning is trivial. While in the distribution-free setting, the celebrated boosting results show that weak learning implies strong learning, in the distribution-specific setting this is not necessarily the case. We introduce a property of distributions, denoted "local correlation", which requires that small patches of the input image and of intermediate layers of the target function are correlated to the target label. We empirically demonstrate that this property holds for the CIFAR and ImageNet data sets. The main technical results of the paper is proving that, for some classes of deep functions, weak learning implies efficient strong learning under the "local correlation" assumption. To start off, let us take a closer look into computational hardness results on learning neural-networks. Over the years, the theoretical machine learning community has established many such hardness results, drawing from different hardness assumptions [26, 29, 9, 33] . While these results differ in their technical details, they all share one thing in common: they all show that weakly learning neural-networks is computationally hard. That is, all these works analyze cases where no efficient algorithm can achieve test performance that is even slightly better than a random guess, although there exists a neural-network that perfectly fits the data. While these results have great theoretical implications, we claim that they have nothing to do with understanding learnability of neural-networks on natural data. Indeed, in natural problems, even ones that are considered very challenging, achieving better-than-random performance is usually trivial. To demonstrate this, we perform the following simple experiment: we train a linear classifier on a single patch taken from images in some image classification task. We observe that even on a complex task such as ImageNet, a linear predictor that gets only a 3 ⇥ 3 patch as an input, achieves 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.