VLDB2025
Extensible and Robust Evaluation of Similarity Queries
Daniel Ulrich Schmitt, Thomas Hütter, Nikolaus Augsten
摘要
We study the similarity join problem from a systems perspective. A similarity join retrieves all similar record pairs from two collections based on a given distance function. Existing solutions are often optimized for a single distance function and domain. Such monolithic solutions are limited in both their extensibility to new distance functions and their robustness against changing data characteristics. To address these challenges, we introduce Fast, a similarity join algorithm designed for extensible and robust query evaluation. It leverages a novel abstraction called reductions, which transform similarity join problems from complex domains into simpler ones. A reduction graph is constructed to systematically enumerate query plans. Since cost models for similarity queries are typically unavailable, Fast employs runtime partitioning and a sampling-based strategy to select a near-optimal query plan with performance guarantees. It can utilize prebuilt indexes or build them on-the-fly, incorporating caching techniques to accelerate index construction and probing. Extensive experiments across diverse datasets, domains, and distance functions show that Fast consistently performs close to the optimal plan. Finally, two case studies highlight its strength as a baseline and its utility for prototyping future similarity join algorithms.