NeurIPS2022

On the Spectral Bias of Convolutional Neural Tangent and Gaussian Process Kernels

Amnon Geifman, Meirav Galun, David Jacobs, Ronen Basri

被引用 20 次

摘要

We study the properties of various over-parametrized convolutional neural architectures through their respective Gaussian process and neural tangent kernels. We prove that, with normalized multi-channel input and ReLU activation, the eigenfunctions of these kernels with the uniform measure are formed by products of spherical harmonics, defined over the channels of the different pixels. We next use hierarchical facotorizable kernels to bound their respective eigenvalues. We show that the eigenvalues decay polynomially, quantify the rate of decay, and derive measures that reflect the composition of hierarchical features in these networks. Our results provide concrete quantitative characterization of over-parameterized convolutional network architectures. All models we consider use ReLU activation. While we do not explicitly account for intermediate pooling, our work can readily be extended to handle such layers as well. We assume our networks receive multi-channel input signals, with the channels for each pixel normalized to unit norm. Our results establish that: 1. The eigenfunctions of the three kernels include either products of spherical harmonics (SHs) or their shift invariant sums, with each harmonic term defined over the channels of one pixel. 2. The corresponding eigenvalues decay polynomially with the frequency of the eigenfunctions, at a rate that depends on the number of input channels. 3. The eigenvalues include a multiplicative factor that reflects the hierarchical structure of the features in the corresponding network. With the equivariant architecture, this factor is large for pixels at the center of the receptive field and small in the periphery. For the other two kernels this multiplicative factor is large for pixels close to each other and becomes very small for pixels far from each other. Our results show that CNNs, like FC-networks, are biased toward learning low frequency functions. However, point (2) tells us that CNNs can learn high frequency functions, when these are localized in a subset of the pixels, much more rapidly than FC networks. It is important to keep in mind that we are referring to the frequency of the function the network has learned, which is a function over the space of all images; this does not refer to the power spectrum of individual images. Put differently, high frequency reflects high variability of the target function for similar input images. Interestingly, the rate of decay does not depend directly on image size or the size of the convolution filter. Point (3) tells us that CNNs learn spatially localized functions more rapidly than functions with global dependence on an entire image. This shows that CNNs can change their output significantly based on relatively small, spatially localized features; FC networks will take much longer to learn these spatially localized features. In both cases we quantify this bias. Preliminaries and notations We consider a multi-channel 1-D input signal x of length d with ζ channels, represented by a ζ × d matrix. We further set D = ζd and refer to the columns of x as pixels, denoted We note that our results can readily be applied also to 2-D, multi-channel signals. We assume further that the entries of each pixel are normalized to unit norm, i.e., x = 1. The input space, therefore, is a Cartesian product of spheres, which we call multisphere, i.e., x We denote by s i x = (x i+1 , ...x d , x 1 , ...x i ) the cyclic shift of x to the left by i pixels. We use multi-index notations, i.e., n = (n 1 , ..., n d ), k = (k 1 , ..., k d ) ∈ N d to denote vectors of polynomial orders or frequencies. N denotes the natural numbers including zero, and b n , λ k ∈ R are scalars that depend on vectors of indices n or k. We denote monomials by , and allow also for a scalar exponent, i.e., Therefore, the power series n≥0 b n t n should read n1≥0,n2≥0,... b n1,n2,... t n1 1 t n2 2 ... We denote the uniform distribution in a domain Ω by Unif(Ω). We write f (x) ∼ g(x) when lim x→∞ f (x)/g(x) = 1. Throughout the paper we assume all kernels are differentiable at zero infinitely many times and their power series converge in the hypercube [-1, 1] d . Our theorems and lemmas are proved in the appendix. The network model We consider convolutional neural network architectures (Figure 1 ) of the following form. Given a multichannel 1-D input signal x ∈ MS(ζ, d) arranged in a ζ × d matrix, we use a shift equivariant backbone and three heads to produce scalar features. The network, defined in Table 1 , begins with a 1 × 1 convolution layer, followed by L -1 ≥ 1 stride-1 convolutional layers with filters of size q, producing at each layer the same number of feature channels m in each of the d locations. The f Eq head produces one scalar entry for the shift equivariant network (i.e., the tuple (f Eq (x, •), ..., f Eq (s d-1 x, •)) produces the shift-equivariant output); f Tr corresponds to applying a fully connected layer at the top layer,