NeurIPS2020

Implicit Regularization in Deep Learning May Not Be Explainable by Norms

Noam Razin, Nadav Cohen

被引用 169 次

摘要

Mathematically characterizing the implicit regularization induced by gradientbased optimization is a longstanding pursuit in the theory of deep learning. A widespread hope is that a characterization based on minimization of norms may apply, and a standard test-bed for studying this prospect is matrix factorization (matrix completion via linear neural networks). It is an open question whether norms can explain the implicit regularization in matrix factorization. The current paper resolves this open question in the negative, by proving that there exist natural matrix factorization problems on which the implicit regularization drives all norms (and quasi-norms) towards infinity. Our results suggest that, rather than perceiving the implicit regularization via norms, a potentially more useful interpretation is minimization of rank. We demonstrate empirically that this interpretation extends to a certain class of non-linear neural networks, and hypothesize that it may be key to explaining generalization in deep learning. 1 The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 presents the deep matrix factorization model. Section 4 delivers our analysis, showing that its implicit regularization can drive all norms to infinity. Experiments, with both the analyzed setting and tensor factorization, are given in Section 5. Finally, Section 6 summarizes. Related work Theoretical analysis of implicit regularization in deep learning is a highly active area of research. Our work extends the bulk of literature concerning mathematical characterization of the implicit regularization induced by gradient-based optimization. 6 Existing characterizations focus on different aspects of learning, for example: dynamics of optimization ([2, 29, 53, 8, 32, 48, 30]); curvature ("flatness") of obtained minima ([60]); frequency spectrum of learned input-output mappings ([71]); invariant quantities throughout training ([27]); and statistical properties imported from data ([12]). A ubiquitous approach, arguably more prevalent than the aforementioned, is to demonstrate that learned input-output mappings minimize some notion of norm, or analogously, maximize some notion of margin. Works along this line have treated various models, 7 including: linear (single-layer) predictors ([79, 35, 46, 3]); normalized linear models ([85]); certain polynomially parameterized linear models ([84]); homogeneous (and sum of homogeneous) models ([61, 57, 47]); ultra-wide neural networks ([43, 59, 17, 67]); linear neural networks with a single output ([62, 36, 45]); and matrix factorization -the subject of our inquiry. Matrix factorization is perhaps the most extensively studied model in the context of implicit regularization induced by non-convex gradient-based optimization. It corresponds to linear neural networks with multiple inputs and outputs, typically trained to recover low-rank linear mappings. The literature on matrix factorization for low-rank matrix recovery is far too broad to cover here -we refer to [16] for a recent survey, while mentioning that the technique is often attributed to [13] . Notable works proving successful recovery of a low-rank matrix via matrix factorization trained by gradient descent with no explicit regularization are [82, 58, 56] . Of these, [56] can be viewed as affirming Conjecture 1 (from [34] ) for a certain special case. 8 [11] has affirmed Conjecture 1 under different assumptions, but nonetheless argued empirically that it does not hold true in general, resonating with Conjecture 2 (from [8]). To the best of our knowledge, no theoretical support for the latter was provided prior to its proof in this paper. We note that the proof relies on technical results derived in [6] and [8] (restated in Subappendix D.2.1 for completeness). Extending the research on matrix factorization, the use of tensor factorization for recovering low-rank tensors is a frequent topic of investigation (cf. [14, 86, 89, 87, 49, 63, 44, 1, 4]). Nevertheless, the experiments reported in this paper provide the first evidence we are aware of for such use to be successful under gradient-based optimization with no explicit regularization (in particular without imposing low-rank on the tensor factorization).