KDD2023
Efficient Distributed Approximate k-Nearest Neighbor Graph Construction by Multiway Random Division Forest
Sang-Hong Kim, Ha-Myung Park
4 citations
Abstract
k-nearest neighbor graphs, shortly k-NN graphs, are widely used in many data mining applications like recommendation, information retrieval, and similarity search. Approximate k-NN graph construction has been getting a lot of attention, and most researches focus on developing algorithms that operate efficiently and quickly on a single machine. A few pioneering studies propose distributed algorithms to increase the size of data that can be processed to billions. However, we notice that the distributed algorithms don't perform well enough due to the problems of graph fragmentation and massive data exchange. In this paper, we propose MRDF (Multiway Random Division Forest), a scalable distributed algorithm that constructs highly accurate k-NN graph from numerous high-dimensional vectors quickly. MRDF resolves the problems that the existing distributed algorithms suffer from, through coarse-grained partitioning based on tree path annotation. Experimental results on real-world datasets show that MRDF outperforms the state-of-the-art distributed algorithms with up to 7.6 times faster speed and up to 56%p better accuracy than the second best results.