NeurIPS 2022

Annihilation of Spurious Minima in Two-Layer ReLU Networks

Yossi Arjevani, Michael Field

Cited by 11

Abstract

We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss, where labels are generated by a target network. Use is made of the rich symmetry structure to develop a novel set of tools for studying the mechanism by which over-parameterization annihilates spurious minima. Sharp analytic estimates are obtained for the loss and the Hessian spectrum at different minima, and it is proved that adding neurons can turn symmetric spurious minima into saddles; minima of lesser symmetry require more neurons. Using Cauchy's interlacing theorem, we prove the existence of descent directions in certain subspaces arising from the symmetry structure of the loss function. This analytic approach uses techniques, new to the field, from algebraic geometry, representation theory and symmetry breaking, and confirms rigorously the effectiveness of over-parameterization in making the associated loss landscape accessible to gradient-based methods. For a fixed number of neurons and inputs, the spectral results remain true under symmetry-breaking perturbation of the target.

This example highlights the special role that the standard representation plays in the annihilation of spurious minima (see Section 5 and the concluding remarks). The sharp estimates of the Hessian spectrum further demonstrate how symmetry breaking enables a complete characterization of the dynamics of gradient-based methods, locally, in the vicinity of symmetric critical points. The dependence of such methods on the stability of critical points therefore indicates that attempts at a global theory should be preceded by a good description of the mechanism by which spurious minima transform into saddles, which is the aim of this work. Next, we relate our results to the existing literature.

Annihilation of spurious minima on account of over-parameterization.
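The role of Cauchy's interlacing theorem in the abstract can be illustrated numerically. The sketch below is not the authors' construction: it uses an arbitrary symmetric matrix H as a stand-in for a Hessian and an arbitrary orthonormal basis Q as a stand-in for a symmetry-adapted subspace, and the function names are hypothetical. It shows only the general principle that the eigenvalues of the restricted matrix Q^T H Q interlace those of H, so a negative eigenvalue found in the subspace certifies a descent direction for the full matrix.

```python
import numpy as np

def restricted_spectrum(H, Q):
    """Ascending eigenvalues of Q^T H Q, the restriction of symmetric H
    to the subspace spanned by the orthonormal columns of Q."""
    return np.linalg.eigvalsh(Q.T @ H @ Q)

def interlaces(full, sub, tol=1e-9):
    """Check Cauchy interlacing: full[i] <= sub[i] <= full[n - k + i]
    for ascending spectra of sizes n (full) and k (sub)."""
    n, k = len(full), len(sub)
    return all(full[i] - tol <= sub[i] <= full[n - k + i] + tol
               for i in range(k))

def descent_direction(H, Q, tol=1e-12):
    """Return a unit vector v in span(Q) with v^T H v < 0, or None if the
    restricted matrix has no negative eigenvalue."""
    w, U = np.linalg.eigh(Q.T @ H @ Q)
    if w[0] >= -tol:
        return None
    return Q @ U[:, 0]  # lift the subspace eigenvector back to R^n

# Illustrative data only: a random symmetric matrix and a random subspace.
rng = np.random.default_rng(0)
n, k = 8, 3
A = rng.standard_normal((n, n))
H_demo = (A + A.T) / 2
Q_demo, _ = np.linalg.qr(rng.standard_normal((n, k)))  # n x k, orthonormal columns

full_eigs = np.linalg.eigvalsh(H_demo)
sub_eigs = restricted_spectrum(H_demo, Q_demo)
assert interlaces(full_eigs, sub_eigs)
# In particular, the smallest eigenvalue of H is bounded above by the
# smallest eigenvalue seen inside the subspace.
assert full_eigs[0] <= sub_eigs[0] + 1e-9
```

If `descent_direction` returns a vector, the critical point cannot be a local minimum, which is the sense in which a low-dimensional computation inside a subspace can rule out a spurious minimum of the full problem.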
Existing methods for the analysis of optimization problem (2) include mean-field [4], optimal transport [2], NTK [20, 21, 22], and the thermodynamic limit [5, 16, 23, 24]. These methods operate by passing to limiting regimes where the number of inputs or neurons is taken to infinity. A growing number of works point to the limited explanatory power of such approaches [25, 26]. Approaches for addressing the loss landscapes in finite parameter regimes