NeurIPS2024

Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

Jason D. Lee, Kazusato Oko, Taiji Suzuki, Denny Wu

Abstract

We study the problem of gradient descent learning of a single-index target function f(x)=σ(x,θ)f_*(\boldsymbol{x}) = \textstyle\sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right) under isotropic Gaussian data in Rd\mathbb{R}^d, where the unknown link function σ:RR\sigma_*:\mathbb{R}\to\mathbb{R} has information exponent pp (defined as the lowest degree in the Hermite expansion). Prior works showed that gradient-based training of neural networks can learn this target with ndΘ(p)n\gtrsim d^{\Theta(p)} samples, and such complexity is predicted to be necessary by the correlational statistical query lower bound. Surprisingly, we prove that a two-layer neural network optimized by an SGD-based algorithm (on the squared loss) learns ff_* with a complexity that is not governed by the information exponent. Specifically, for arbitrary polynomial single-index models, we establish a sample and runtime complexity of nT=Θ(d ⁣ ⁣polylogd)n \simeq T = \Theta(d\!\cdot\! \mathrm{polylog} d), where Θ()\Theta(\cdot) hides a constant only depending on the degree of σ\sigma_*; this dimension dependence matches the information theoretic limit up to polylogarithmic factors. More generally, we show that nd(p1)1n\gtrsim d^{(p_*-1)\vee 1} samples are sufficient to achieve low generalization error, where ppp_* \le p is the generative exponent of the link function. Core to our analysis is the reuse of minibatch in the gradient computation, which gives rise to higher-order information beyond correlational queries.