NeurIPS 2023

Markovian Sliced Wasserstein Distances: Beyond Independent Projections

Khai Nguyen, Tongzheng Ren, Nhat Ho

Cited by 10

Abstract

Sliced Wasserstein (SW) distance suffers from redundant projections because its projecting directions are sampled independently and uniformly at random. To partially overcome the issue, the max-K sliced Wasserstein (Max-K-SW) distance (K ≥ 1) seeks the most discriminative orthogonal projecting directions. Although it reduces the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice because the optimization may not reach the optimum. Moreover, the orthogonality constraint is computationally expensive and might not be effective. To address these problems, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We discuss various members of the MSW family by specifying the Markov structure, including the prior distribution, the transition distribution, and the burning and thinning technique. Moreover, we investigate the theoretical properties of MSW, including topological properties (metricity, weak convergence, and connection to other distances), statistical properties (sample complexity and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications such as gradient flows, color transfer, and deep generative modeling to demonstrate the favorable performance of the MSW.

Due to its scalability, the SW has been applied to almost all applications where the Wasserstein distance is used. Examples include generative modeling [60, 15, 27, 42], domain adaptation [30], clustering [28], approximate Bayesian computation [39], gradient flows [37, 5], and variational inference [61]. Moreover, there have been many attempts to improve the SW. The generalized sliced Wasserstein (GSW) distance, which uses non-linear projections, is proposed in [26].
Distributional sliced Wasserstein distance is proposed in [44, 45] by replacing the uniform distribution on the projecting directions in SW with an estimated distribution that puts high probability on discriminative directions. Spherical sliced Wasserstein, which is defined between distributions supported on the hypersphere, is introduced in [4]. A sliced Wasserstein variant between probability measures over images, based on convolution, is defined in [43].

Despite these many improvements, one common property of previous SW variants is that they use independent projecting directions sampled from a distribution over a space of projecting directions, e.g., the unit hypersphere. Those projecting directions are then used to project the two measures of interest to corresponding pairs of one-dimensional measures. Due to the independence, practitioners have reported that many projections lack the power to discriminate between the two input probability measures [26, 15]. Moreover, using a large number of projections leads to redundancy and wastes computation on uninformative pairs of projected measures. This problem is known as the projection complexity limitation of the SW.

To partially address the issue, the max sliced Wasserstein (Max-SW) distance is introduced in [14]. Max-SW seeks the single projecting direction that maximizes the projected Wasserstein distance. Since Max-SW involves a constrained optimization problem, the projected subgradient ascent algorithm is used to solve it. Because the algorithm is only guaranteed to reach a local maximum [46], the empirical estimate of Max-SW is unstable in practice [42]: the metricity of Max-SW holds only at the global optimum. Another approach is to enforce orthogonality between projecting directions. In particular, the K-sliced Wasserstein (K-SW) distance [50] uses K > 1 orthogonal projecting directions.
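The two estimators discussed above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes two empirical measures with the same number of equally weighted samples, so each one-dimensional Wasserstein distance reduces to comparing sorted projections. The function names, the fixed step size, and the choice p = 2 for Max-SW are all hypothetical choices made for this sketch.

```python
import numpy as np


def sample_directions(num_projections, dim, rng):
    """Draw independent directions uniformly from the unit hypersphere."""
    theta = rng.standard_normal((num_projections, dim))
    return theta / np.linalg.norm(theta, axis=1, keepdims=True)


def projected_wasserstein(x, y, theta, p=2):
    """Return W_p^p between the 1D projections of x and y, one value per direction.

    Assumes x and y hold the same number of equally weighted samples, so the
    optimal 1D coupling simply matches sorted projections.
    """
    x_proj = np.sort(x @ theta.T, axis=0)  # shape (n, num_projections)
    y_proj = np.sort(y @ theta.T, axis=0)
    return np.mean(np.abs(x_proj - y_proj) ** p, axis=0)


def sliced_wasserstein(x, y, num_projections=100, p=2, seed=0):
    """Monte Carlo SW estimate using independent uniform directions."""
    rng = np.random.default_rng(seed)
    theta = sample_directions(num_projections, x.shape[1], rng)
    return np.mean(projected_wasserstein(x, y, theta, p)) ** (1.0 / p)


def max_sliced_wasserstein(x, y, num_steps=100, step_size=0.1, seed=0):
    """Max-SW sketch (p = 2) via projected subgradient ascent on the sphere.

    Only a local maximum is guaranteed, so the result depends on the seed.
    """
    rng = np.random.default_rng(seed)
    theta = sample_directions(1, x.shape[1], rng)[0]
    for _ in range(num_steps):
        # Sorting the projections gives the optimal 1D matching for this theta.
        ix = np.argsort(x @ theta)
        iy = np.argsort(y @ theta)
        diff = x[ix] - y[iy]
        # Subgradient of mean(<theta, diff_i>^2) with the matching held fixed.
        grad = 2.0 * np.mean((diff @ theta)[:, None] * diff, axis=0)
        theta = theta + step_size * grad
        theta = theta / np.linalg.norm(theta)  # project back onto the sphere
    value = projected_wasserstein(x, y, theta[None, :], p=2)[0] ** 0.5
    return value, theta
```

A simple sanity check is a pure translation y = x + s: every projected distance then equals |⟨θ, s⟩|, so the ascent should align θ with s, while the independent directions used by `sliced_wasserstein` mostly point elsewhere and contribute smaller values.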
Moreover, to generalize the Max-SW and the K-SW, the max-K sliced Wasserstein (Max-K-SW) distance (K > 1) is introduced in [12] to find the best K mutually orthogonal projecting directions via the projected subgradient ascent algorithm. Nevertheless, the orthogonality constraint is computationally expensive and might not reflect the discrepancy between general measures well. Moreover, Max-K-SW also suffers from the non-optimality problem, which leads to losing the metricity property in practice.

To avoid the independence and to create informative projecting directions efficiently, we propose to impose a sequential structure on projecting directions. Namely, we choose a new projecting direction based on the previously chosen directions. For computational efficiency, we consider a first-order Markovian structure in this paper, which means that a projecting direction is sampled using only the previous direction. The first projecting direction can follow any type of distribution on the unit hypersphere that has been used in the literature, e.g., unifo