STOC2020

Approximating text-to-pattern Hamming distances

Timothy M. Chan, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat

被引用 2 次

摘要

We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size σ, compute the Hamming distance (i.e., the number of mismatches) between the pattern and the text at every location. Several randomized (1 + ε)-approximation algorithms have been proposed in the literature (e.g., by Karloff (Inf. Proc. Lett., 1993), Indyk (FOCS 1998), and Kopelowitz and Porat (SOSA 2018)), with running time of the form O(ε -O(1) n log n log m), all using fast Fourier transform (FFT). We describe a simple randomized (1 + ε)-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results (all Monte-Carlo randomized) in different settings: 1. We obtain the first linear-time approximation algorithm; the running time is O(ε -2 n). In fact, the time bound can be made slightly sublinear in n if the alphabet size σ is small (by using bit packing tricks). 2. We apply our approximation algorithms to obtain a faster exact algorithm computing all Hamming distances up to a given threshold k; its running time is O(n+min( nk , nk 2 m )), which improves previous results by logarithmic factors and is linear if k ≤ √ m. 3. We alternatively obtain approximation algorithms with better ε-dependence, by using rectangular matrix multiplication. In fact, the time bound is O(n polylog n) when the pattern is sufficiently long, i.e., m ≥ ε -c for a specific constant c. Previous algorithms with the best ε-dependence require O(ε -1 n polylog n) time. 4. When k is not too small, we obtain a truly sublinear-time algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than k, in O((n/k Ω(1) + occ)n o(1) ) time, where occ is the output size. The algorithm leads to a property tester for pattern matching, with high probability returning true if an exact match exists and false if the Hamming distance is more than δm at every location, running in O((δ -1/3 n 2/3 + δ -1 n m ) polylog n) time. 5. We obtain a streaming algorithm to report all locations with Hamming distance approximately less than k, using O(ε -2 √ k polylog n) space. Previously, streaming algorithms were known for the exact problem with O(k polylog n) space (which is tight up to the polylog n factor) or for the approximate problem with O(ε -O(1) √ m polylog n) space. For the special case of k = m, we improve the space usage to O(ε -1.5 √ m polylog n).