STOC2023
Weighted Edit Distance Computation: Strings, Trees, and Dyck
Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, Barna Saha
4 citations
Abstract
Given two strings of length n over alphabet Σ, and an upper bound k on their edit distance, the algorithm of Myers (Algorithmica'86) and Landau and Vishkin (JCSS'88) from almost forty years back computes the unweighted string edit distance in O(n + k 2 ) time. Till date, it remains the fastest algorithm for exact edit distance computation, and it is optimal under the Strong Exponential Hypothesis (STOC'15). Over the years, this result has inspired many developments, including fast approximation algorithms for string edit distance as well as similar Õ(n+poly(k))time algorithms for generalizations to tree and Dyck edit distances. Surprisingly, all these results hold only for unweighted instances. While unweighted edit distance is theoretically fundamental, almost all real-world applications require weighted edit distance, where different weights are assigned to different edit operations (insertions, deletions, and substitutions), and the weights may vary with the characters being edited. Given a weight function w : Σ ∪ ε × Σ ∪ ε → R ≥0 (such that w(a, a) = 0 and w(a, b) ≥ 1 for all a, b ∈ Σ ∪ ε with a = b), the goal is to find an alignment that minimizes the total weight of edits. Except for the vanilla O(n 2 )-time dynamic-programming algorithm and its almost trivial O(nk)-time implementation (k being an upper bound on the sought total weight), none of the aforementioned developments on the unweighted edit distance applies to the weighted variant. In this paper, we propose the first O(n + poly(k))-time algorithm that computes weighted string edit distance exactly, thus bridging a fundamental decades-old gap between our understanding of unweighted and weighted edit distance. We then generalize this result to weighted tree and Dyck edit distances, bringing in several new techniques, which lead to a deterministic algorithm that improves upon the previous work even for unweighted tree edit distance. Given how fundamental weighted edit distance is, we believe our O(n + poly(k)) algorithm for weighted edit distance will be instrumental for further significant developments in the area. Myers [Mye86] and Landau and Vishkin [LV88] achieves this task in O(n + k 2 ) time by combining suffix trees with an elegant greedy approach. As long as k = O( √ n), the running time of the above algorithm in linear in n. For larger values of k, approximation algorithms for edit distance have been studied extensively [LMS98, Ind01, BYJKK04, BES06, AO09, AKO10], especially recently [HRS19, CDG + 20, GRS20, KS20b, BR20, BEG + 21]. This culminated with the currently best bound by Andoni and Nosatzki [AN20], who obtained a constant-factor approximation algorithm with running time O(n 1+ǫ ) time for any constant ǫ > 0. All of these works require monotonicity and assume that an optimal solution can be extended easily if matching suffixes are added to both strings, none of which may hold in weighted edit distance instances. As a result, the state-of-theart approximation algorithm for weighted edit distance, by Kuszmaul [Kus19], offers much worse trade-off, with an O(n τ )-factor approximation in Õ(n 2-τ ) time for any 0 ≤ τ ≤ 1. Tree Edit Distance: The tree edit distance problem, first introduced by Selkow [Sel77], is a generalization of edit distance in which the task is to compute a measure of dissimilarity between two rooted ordered trees with node labels. In the unweighted version of tree edit distance, every node insertion, deletion, or relabeling operation has unit cost. The problem has numerous applications in compiler optimization [DMRW10], structured data analysis [Cha99, BGK03, FLMM09], image analysis [BS98], and computational biology [HT84, SZ90, Gus97, LMS98, Bil05]. The current best bound on running time of an algorithm for finding exact tree edit distance is due to Dürr [Dür22] who obtained an O(n 2.9149 )-time algorithm for the problem, after a long series of improvements from O(n 6 ) [Tai79] to O(n 4 ) [ZS89], to O(n 3 log n) [Kle98], to O(n 3 ) [DMRW10], and to O(n 2.9546 ) [Mao21]. Moreover, there is a (1 + ǫ)-approximation algorithm for tree edit distance with running time Õ(n 2 ) time due to Boroujeni, Ghodsi, Hajiaghayi, and Seddighin [BGHS19]. Recently, Seddighin and Seddighin [SS22] gave an O(n 1.99 )-time (3 + ǫ)-approximation algorithm for tree edit distance (building on a previous Õ(n)-time O( √ n)-factor approximation algorithm of [BGHS19]). Furthermore, Das, Gilbert, Hajiaghayi, Kociumaka, Saha, and Saleh [DGH + 22] obtained an Õ(n + k 15 )-time algorithm for exact tree edit distance with an upper bound k on the distance (see also an Õ(nk 2 )-time algorithm of Akmal and Jin [AJ21], which improves upon a previous algorithm with running time O(nk 3 ) for the bounded tree edit distance problem [Tou05]). As far as the weighted tree edit distance is concerned, the fastest algorithm, by Demaine, Mozes, Rossman, and Weimann [DMRW10], takes O(n 3 ) time, which matches the conditional lower-bound of Bringmann, Gawr