NeurIPS 2025

Quantitative convergence of trained neural networks to Gaussian processes

Andrea Agazzi, Eloy Mósig García, Dario Trevisan

Abstract

In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit. While previous work has established qualitative convergence in broad settings, precise finite-width estimates remain limited, particularly during training. We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time t ≥ 0, demonstrating polynomial decay with network width. Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.

However, the existing quantitative results were largely confined to the initialization regime. Extensions to the full training trajectory have so far remained limited, with few works addressing how approximation errors evolve over time or depend on architectural features such as width and depth. The present work addresses this gap by extending the quantitative convergence results discussed above to trained networks, providing explicit bounds on the Wasserstein distance between the network output and the associated Gaussian process for any positive training time.

From a spectral perspective, the conditioning of the neural tangent kernel (NTK) plays a central role in understanding convergence rates and generalization. Lower bounds on the smallest eigenvalue of the empirical NTK have been derived under various conditions. For instance, Karhadkar et al. [2024] and Bombari et al. [2022] provide sharp bounds in the context of ReLU and smooth activation functions, respectively. Additionally, Carvalho et al. [2025] showed that, under very mild assumptions on the non-linearity and the non-proportionality of the training data, the analytic NTK is non-degenerate. These results are essential for establishing the stability of the gradient flow and, hence, for deriving quantitative convergence guarantees.
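To make the spectral quantity above concrete, the following minimal sketch computes the empirical NTK Gram matrix of a shallow network on a small data set and reports its smallest eigenvalue for several widths. It is an illustration under stated assumptions, not the construction used in the paper or in the cited works: the ReLU activation, the 1/sqrt(n) output scaling, the standard Gaussian initialization, the unit-norm random inputs, and the helper name `empirical_ntk` are all choices made here for the example.

```python
import numpy as np

def empirical_ntk(X, W, a):
    """Empirical NTK Gram matrix of f(x) = a . relu(W x) / sqrt(n).

    Assumed (hypothetical) setup: X has shape (m, d), hidden weights W have
    shape (n, d), output weights a have shape (n,). Both W and a are treated
    as trainable, so the kernel is the Gram matrix of the full gradient.
    """
    n = W.shape[0]
    pre = X @ W.T                        # (m, n) pre-activations w_j . x_i
    act = np.maximum(pre, 0.0)           # relu(w_j . x_i)
    dact = (pre > 0).astype(X.dtype)     # relu'(w_j . x_i)
    # Gradient w.r.t. a_j is relu(w_j . x) / sqrt(n).
    K_a = act @ act.T / n
    # Gradient w.r.t. w_j is a_j * relu'(w_j . x) * x / sqrt(n).
    K_w = (X @ X.T) * ((dact * a**2) @ dact.T) / n
    return K_a + K_w

rng = np.random.default_rng(0)
m, d = 8, 5                              # number of training points, input dimension
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs (assumption)

for n in (64, 1024, 4096):               # network widths
    W = rng.standard_normal((n, d))
    a = rng.standard_normal(n)
    K = empirical_ntk(X, W, a)
    lam_min = np.linalg.eigvalsh(K)[0]   # eigvalsh returns ascending eigenvalues
    print(f"width n={n:5d}  smallest NTK eigenvalue = {lam_min:.4f}")
```

Since the matrix is a Gram matrix of gradients, it is positive semi-definite by construction; the lower bounds cited above concern how far its smallest eigenvalue stays from zero.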