ICLR 2026
Variational Deep Learning via Implicit Regularization
Jonathan Wenger, Beau Coker, Juraj Marusic, John Patrick Cunningham
Abstract
Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Instead, current theory credits implicit regularization imposed by the choice of architecture, hyperparameters, and optimization procedure. However, deep neural networks can be surprisingly non-robust, resulting in overconfident predictions and poor out-of-distribution generalization. Bayesian deep learning addresses this via model averaging, but typically requires significant computational resources as well as carefully elicited priors to avoid overriding the benefits of implicit regularization. In this work, we instead propose to regularize variational neural networks solely by relying on the implicit bias of (stochastic) gradient descent. We theoretically characterize this inductive bias in overparametrized linear models as generalized variational inference and demonstrate the importance of the choice of parametrization. Empirically, our approach demonstrates strong in- and out-of-distribution performance without additional hyperparameter tuning and with minimal computational overhead.

Bayesian Deep Learning
Approximate Bayesian techniques like the Laplace approximation [24, 25, 26], stochastic weight averaging [27, 28], deep ensembles [29], and variational approaches [30, 31, 32, 33] attempt to address the aforementioned shortcomings of deep learning by learning a distribution over functions as opposed to merely a point estimate. The idea is that a weighted combination of models, all of which achieve low training error, generalizes more robustly while also providing uncertainty quantification.

Variational Inference
In Bayesian inference, this weighted combination is defined by the posterior distribution p(w | X, y) ∝ p(y | X, w) p(w) over weights, induced by a likelihood p(y | w) and