ICML 2023

Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps

Marco Cuturi, Michal Klein, Pierre Ablin

Cited by 19

Abstract

Optimal transport (OT) theory focuses, among all maps $T : \mathbb{R}^d \to \mathbb{R}^d$ that can morph a probability measure onto another, on those that are the "thriftiest", i.e. such that the averaged cost $c(x, T(x))$ between $x$ and its image $T(x)$ be as small as possible. Many computational approaches have been proposed to estimate such Monge maps when $c$ is the $\ell_2^2$ distance, e.g., using entropic maps (Pooladian and Niles-Weed, 2021), or neural networks (Makkuva et al., 2020; Korotin et al., 2020). We propose a new model for transport maps, built on a family of translation invariant costs $c(x, y) := h(x - y)$, where $h := \frac{1}{2}\|\cdot\|_2^2 + \tau$ and $\tau$ is a regularizer. We propose a generalization of the entropic map suitable for $h$, and highlight a surprising link tying it with the Bregman centroids of the divergence $D_h$ generated by $h$, and the proximal operator of $\tau$. We show that choosing a sparsity-inducing norm for $\tau$ results in maps that apply Occam's razor to transport, in the sense that the displacement vectors $\Delta(x) := T(x) - x$ they induce are sparse, with a sparsity pattern that varies depending on $x$. We showcase the ability of our method to estimate meaningful OT maps for high-dimensional single-cell transcription data, in the 34000-d space of gene counts for cells, without using dimensionality reduction, thus retaining the ability to interpret all displacements at the gene level.

2018), and realign datasets in natural sciences (Janati et al., 2019; Schiebinger et al., 2019).

High-dimensional Transport. OT finds its most straightforward and intuitive use-cases in low-dimensional geometric domains (grids and meshes, graphs, etc.). This work focuses on the more challenging problem of using it on distributions in $\mathbb{R}^d$, with $d \gg 1$. In $\mathbb{R}^d$, the ground cost $c(x, y)$ between observations $x, y$ is often the $\ell_2$ metric or its square $\ell_2^2$. However, when used on large-$d$ data samples, that choice is rarely meaningful.
This is due to the curse-of-dimensionality associated with OT estimation (Dudley et al., 1966; Weed and Bach, 2019) and the fact that the Euclidean distance loses its discriminative power as dimension grows. To mitigate this, practitioners rely on dimensionality reduction, either in two steps, before running OT solvers, using, e.g.,