ICLR 2023
A Kernel Perspective of Skip Connections in Convolutional Networks
Daniel Barzilai, Amnon Geifman, Meirav Galun, Ronen Basri
Cited by 3
Abstract
Over-parameterized residual networks are among the most successful convolutional neural architectures for image processing. Here we study their properties through their Gaussian Process and Neural Tangent kernels. We derive explicit formulas for these kernels, analyze their spectra, and provide bounds on their implied condition numbers. Our results indicate that (1) with ReLU activation, the eigenvalues of these residual kernels decay polynomially at a rate similar to that of the same kernels when skip connections are not used, thus maintaining a similar frequency bias; (2) however, residual kernels are more locally biased. Our analysis further shows that the matrices obtained by these residual kernels yield more favorable condition numbers at finite depths than those obtained without skip connections, therefore enabling faster convergence of training with gradient descent.

[...] work of Lee et al. (2019), Xiao et al. (2020), and Chen et al. (2021), who related the condition number of the NTK to the trainability of the corresponding finite-width networks.

3 PRELIMINARIES

We consider multi-channel 1-D input signals $x \in \mathbb{R}^{C_0 \times d}$ of length $d$ with $C_0$ channels. We use 1-D input signals to simplify notation and note that all our results can naturally be extended to 2-D signals. Let $\mathrm{MS}(C_0, d) = \underbrace{S^{C_0-1} \times \ldots \times S^{C_0-1}}_{d \text{ times}} \subseteq \sqrt{d}\, S^{dC_0-1}$ be the multi-sphere, so $x = (x_1, \ldots, x_d) \in \mathrm{MS}(C_0, d)$ iff $\forall i \in [d]$, $\|x_i\| = 1$. For our analysis, we assume that the input signals are distributed uniformly on the multi-sphere.

The discrete convolution of a filter $w \in \mathbb{R}^q$ with a vector $v \in \mathbb{R}^d$ is defined as [...], where $1 \le i \le d$. We use circular padding, so indices $[v]_j$ with $j \le 0$ and $j > d$ are well defined.

We use multi-index notation denoted by bold letters, i.e., $\mathbf{n}, \mathbf{k} \in \mathbb{N}^d$, where $\mathbb{N}$ is the set of natural numbers including zero. $b_{\mathbf{n}}, \lambda_{\mathbf{k}} \in \mathbb{R}$ are scalars that depend on $\mathbf{n}, \mathbf{k}$, and for $t \in \mathbb{R}^d$ we let $t^{\mathbf{n}} = t_1^{n_1} \cdot \ldots \cdot t_d^{n_d}$.
As is convention, we say that $\mathbf{n} \ge \mathbf{k}$ iff $n_i \ge k_i$ for all $i \in [d]$. Thus, the power series $\sum_{\mathbf{n} \ge 0} b_{\mathbf{n}} t^{\mathbf{n}}$ should read $\sum_{n_1 \ge 0, n_2 \ge 0, \ldots} b_{n_1, n_2, \ldots}\, t_1^{n_1} t_2^{n_2} \cdots$

We further use the following notation to denote sub-vectors and sub-matrices. $\forall i \in \mathbb{N}$, let $D$ [...] that for a matrix $M$ we can write: [...]. We use $(s_i v)_j = v_{j+i}$ to denote the cyclic shift of $v$ to the left by $i$ pixels. Finally, for every kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ we define the normalized kernel to be $\overline{K}(x, z) = \frac{K(x, z)}{\sqrt{K(x, x)\, K(z, z)}}$.
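Three of the definitions above (the multi-sphere, the cyclic shift, and the normalized kernel) can be illustrated with a minimal Python sketch. The function names are hypothetical, indices are 0-based rather than 1-based, and `toy_kernel` is a stand-in chosen only for illustration; it is not one of the residual kernels studied in the paper. Sampling normalized Gaussian vectors per pixel is a standard way to draw uniformly from each sphere.

```python
import math
import random


def sample_multisphere(c0, d, rng):
    """Draw x = (x_1, ..., x_d) uniformly from MS(c0, d).

    Each pixel x_i is a normalized Gaussian vector in R^c0, which is
    uniformly distributed on the unit sphere S^(c0-1).
    """
    pixels = []
    for _ in range(d):
        g = [rng.gauss(0.0, 1.0) for _ in range(c0)]
        norm = math.sqrt(sum(a * a for a in g))
        pixels.append([a / norm for a in g])
    return pixels


def cyclic_shift(v, i):
    """(s_i v)_j = v_{j+i} with circular indexing: shift v left by i positions."""
    d = len(v)
    return [v[(j + i) % d] for j in range(d)]


def normalize_kernel(kernel):
    """Return the normalized kernel K(x, z) / sqrt(K(x, x) * K(z, z))."""
    def k_bar(x, z):
        return kernel(x, z) / math.sqrt(kernel(x, x) * kernel(z, z))
    return k_bar


def toy_kernel(x, z):
    """Hypothetical stand-in kernel (not from the paper): (1 + <x, z>)^2."""
    dot = sum(a * b for xi, zi in zip(x, z) for a, b in zip(xi, zi))
    return (1.0 + dot) ** 2


rng = random.Random(0)
x = sample_multisphere(c0=3, d=5, rng=rng)
z = sample_multisphere(c0=3, d=5, rng=rng)
k_bar = normalize_kernel(toy_kernel)
shifted = cyclic_shift([1, 2, 3, 4], 1)  # left shift by one pixel
```

Because every pixel has unit norm, the flattened signal has squared norm $d$, which matches the inclusion $\mathrm{MS}(C_0, d) \subseteq \sqrt{d}\, S^{dC_0-1}$. The normalized kernel satisfies $\overline{K}(x, x) = 1$ whenever $K(x, x) > 0$.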