NeurIPS2023

A Dynamical System View of Langevin-Based Non-Convex Sampling

Mohammad Reza Karimi Jaghargh, Ya-Ping Hsieh, Andreas Krause

4 citations

Abstract

Non-convex sampling is a key challenge in machine learning, central to nonconvex optimization in deep learning as well as to approximate probabilistic inference. Despite its significance, theoretically there remain some important challenges: Existing guarantees suffer from the drawback of lacking guarantees for the last-iterates, and little is known beyond the elementary schemes of stochastic gradient Langevin dynamics. To address these issues, we develop a novel framework that lifts the above issues by harnessing several tools from the theory of dynamical systems. Our key result is that, for a large class of state-of-the-art sampling schemes, their last-iterate convergence in Wasserstein distances can be reduced to the study of their continuous-time counterparts, which is much better understood. Coupled with standard assumptions of MCMC sampling, our theory immediately yields the last-iterate Wasserstein convergence of many advanced sampling schemes such as mirror Langevin, proximal, randomized mid-point, and Runge-Kutta methods. * Equal contribution. • An additional notable drawback of the current theory is its predominant focus on the basic Euler-Maruyama discretization of (LD) (see, e.g., [4, 20, 54] ). As a result, the convergence analysis of more advanced sampling schemes remains largely unexplored in the fully non-convex regime [1, 24, 25, 34, 53, 61] . § Contributions and Approaches. To overcome the aforementioned challenges, our main contribution, from a high level, can be succinctly summarized as: Under mild assumptions, we prove that the iterates of a broad range of Langevin-based sampling schemes converge to the continuous-time (LD) in Wasserstein distance. (★) Combining (★) with classical results on Langevin diffusion [45] immediately yields the last-iterate convergence in Wasserstein distances for a wide spectrum of sampling schemes, thus resolving all the challenges mentioned above. To illustrate this point, we state a simple version of our main result. Theorem (Informal). Suppose we discretize (LD) as +1 = -+1 (∇ ( ) + noise + bias) + 2 +1 +1 with step-sizes ∈ℕ and i.i.d. standard Gaussians ∈ℕ . Then, under an easy-to-verify condition on the bias (see (5) in Assumption 3), ∈ℕ converges in Wasserstein distance to . In addition, these conditions are satisfied by many advanced sampling schemes. This result is achieved via a new dynamical perspective to study Langevin-based sampling. More specifically,