ICLR2025
Tree-Wasserstein Distance for High Dimensional Data with a Latent Feature Hierarchy
Ya-Wei Eileen Lin, Ronald R. Coifman, Gal Mishne, Ronen Talmon
Abstract
Finding meaningful distances between high-dimensional data samples is an important scientific task. To this end, we propose a new tree-Wasserstein distance (TWD) for high-dimensional data with two key aspects. First, our TWD is specifically designed for data with a latent feature hierarchy, i.e., the features lie in a hierarchical space, in contrast to the usual focus on embedding samples in hyperbolic space. Second, while the conventional use of TWD is to speed up the computation of the Wasserstein distance, we use its inherent tree as a means to learn the latent feature hierarchy. The key idea of our method is to embed the features into a multi-scale hyperbolic space using diffusion geometry and then present a new tree decoding method by establishing analogies between the hyperbolic embedding and trees. We show that our TWD computed based on data observations provably recovers the TWD defined with the latent feature hierarchy and that its computation is efficient and scalable. We showcase the usefulness of the proposed TWD in applications to word-document and single-cell RNA-sequencing datasets, demonstrating its advantages over existing TWDs and methods based on pre-trained models. Recently, hyperbolic geometry (Ratcliffe et al., 1994) has gained prominence in hierarchical representation learning (Chamberlain et al., 2017; Nickel & Kiela, 2017) because the lengths of geodesic paths in hyperbolic spaces grow exponentially with the radius (Sarkar, 2011), a property that naturally mirrors the exponential growth of the number of nodes in hierarchical structures as the depth increases. Methods using hyperbolic geometry typically focus on finding a hyperbolic embedding of the samples, relying on a (partially) known graph, whose nodes represent the samples (Sala et al., 2018) . However, considering such a known hierarchical structure of the samples is fundamentally different than the problem we consider here, where we aim to find meaningful distances between data samples that incorporate the latent hierarchical structure of the features. In this paper, we introduce a new tree-Wasserstein distance (TWD) (Indyk & Thaper, 2003) for this purpose, where we model samples as distributions supported on a latent hierarchical structure. We propose a two-step approach. In the first step, we embed features into continuous hyperbolic spaces (Bowditch, 2007) utilizing diffusion geometry (Coifman & Lafon, 2006) to approximate the hidden