VLDB2022

Density-optimized Intersection-free Mapping and Matrix Multiplication for Join-Project Operations

Zichun Huang, Shimin Chen

9 citations

Abstract

A Join-Project operation is a join operation followed by a duplicate eliminating projection operation. It is used in a large variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that exploits the classical solution (i.e., join and deduplication), and MM (matrix multiplication) to process the sparse and the dense portions of the input data, respectively. However, we observe three problems in the state-of-the-art solution: 1) The outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) Its tableto-matrix transformation makes an over-simplified assumption of the attribute values; and 3) There is a mismatch between the employed MM in BLAS packages and the characteristics of the Join-Project operation. In this paper, we propose DIM 3 , an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method to completely remove the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose DenseEC and SparseBMM algorithms to exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM 3 to consider partial result caching and support Join-๐‘œ๐‘ queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results using both real-world and synthetic data sets show that DIM 3 outperforms previous Join-Project solutions by a factor of 2.3ร—-18ร—. Compared to RDBMSs, DIM 3 achieves orders of magnitude speedups. * Shimin Chen is the corresponding author. Our codes are available at https://github.com/schencoding/JoinProject-DIM3 . and deduplicates using SELECT DISTINCT. The results can be stored by the application for quick user-specific recommendations. The Join-Project operation is used in a large variety of applications [10], including entity matching, set analytics, and graph analytics. The above is an example of entity matching. Similar examples include finding users who have seen the same movies in the MovieLens data set [15] , and discovering co-authors in the DBLP data set [43] . Moreover, if tuple (๐‘ฅ,๐‘ฆ) represents that set ๐‘ฅ contains element ๐‘ฆ, then the Join-Project operation using ๐‘ฆ as the join key obtains all the pairs of sets that intersect with each other. Furthermore, if we interpret tuple (๐‘ฅ,๐‘ฆ) as an edge between two vertices ๐‘ฅ and ๐‘ฆ in a graph, then the Join-Project operation can be used to compute all pairs of vertices that are indirectly connected. |๐‘Œ | ๐‘˜=1 R ๐‘ฅ ๐‘– ,๐‘ฆ ๐‘˜ S ๐‘ฆ ๐‘˜ ,๐‘ง ๐‘— . A non-zero element C ๐‘ฅ ๐‘– ,๐‘ง ๐‘— > 0 in the matrix corresponds to a tuple (๐‘ฅ ๐‘– ,๐‘ง ๐‘— ) in the final output of the Join-Project operation. Compared to the classical solution, MM performs the join and the deduplication together. There are efficient MM implementations in BLAS (Basic Linear Algebra Subprograms) packages with advanced techniques [9, 26, 37]. Moreover, there are sub-cubic MM algorithms in theory. The best known is the Coppersmith-Winograd algorithm with O (๐‘› 2.373 ) complexity [12]. Hybrid Solution. Recent studies [1, 10] combine the classical solution and the MM solution based on the observation that the classical