ICML 2025

Global curvature for second-order optimization of neural networks

Alberto Bernacchia

Abstract

Neural networks are an important class of highly flexible and powerful models inspired by the structure of the brain. They consist of a sequence of interconnected layers, each composed of basic computational units similar to the gates of a classical circuit. Like circuits, they can perform simple computational procedures, such as those that might underlie the process generating the dataset they are trained on. The most popular and successful approach to training neural networks is to optimize their parameters with respect to some objective function using standard methods for nonlinear optimization. Because basic methods like stochastic gradient descent (SGD) can often be very slow for deeply layered neural networks, or for ones with recurrent connections, it is worthwhile to consider more advanced methods. In this thesis we review and analyze various such methods that have been proposed over the past few decades, with a particular focus on approximate-Newton (second-order) methods, and develop two of our own: Hessian-free optimization (HF) and Kronecker-factored Approximate Curvature (K-FAC). Our experiments show that K-FAC can be much faster in practice at optimizing deep neural networks than well-tuned SGD with momentum.

This thesis would have been much weaker without the help of my close collaborators Ilya Sutskever and Roger Grosse, to whom I am truly indebted. I would like to thank my external thesis examiner, Jorge Nocedal, for his wisdom and enthusiastic engagement with the ML community. I would also like to acknowledge the great DCS students, postdocs and visitors who have made my time at Toronto much more enjoyable and interesting. These include Jake,