NeurIPS2022

On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias

Itay Safran, Gal Vardi, Jason D. Lee

24 citations

Abstract

We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with r neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most O(r) linear regions, implying a generalization bound. Unlike many other results in the literature, under an additional assumption on the distribution of the data, our result holds even for mild over-parameterization, where the width is Õ(r) and independent of the sample size. * Equal contribution We now turn to discuss some of the related work in the literature which is most relevant to ours in more detail. Related work Implicit bias in neural networks. The literature on the implicit bias in neural networks has rapidly expanded in recent years, and cannot be reasonably surveyed here (see Vardi [2022] for a survey). In what follows, we discuss only results which apply to depth-2 ReLU networks. By Lyu and Li [2019] , Ji and Telgarsky [2020] homogeneous neural networks (and specifically depth-2 ReLU networks) trained with exponentially-tailed classification losses converge in direction to a KKT point of the maximum-margin problem. Our analysis of the implicit bias relies on this result. We note that the aforementioned KKT point may not be a global optimum (see a discussion in Section 4). For depth-2 ReLU networks trained with the square loss there are no known guarantees on the implicit bias (cf. Vardi and Shamir [2021], Timor et al. [2022] ). Several works in recent years studied the implication of minimizing the ℓ 2 norm of the weights on the function space in depth-2 univariate ReLU networks. In Savarese et al. [2019] and Ergen and Pilanci [2021a] it is shown that a minimal-norm fit for a sample is given by the linear spline interpolation (i.e., a "connect-the-dots" function). In such linear spline interpolation the number of linear regions is small. The former work considered only regression, while the latter considered both regression and classification. Note that margin-maximization is equivalent to norm-minimization with margin at least 1. Thus, our result can also be viewed as an analysis of the implication of the bias towards norm-minimization on the learned function. We emphasize three important differences between the results from Savarese et al. [2019], Ergen and Pilanci [2021a] and ours: