VLDB 2025

Efficiently Joining Large Relations on Multi-GPU Systems

Tobias Maltenberger, Ilin Tolovski, Tilmann Rabl

Cited by 3

Abstract

Growing data volumes present a mounting challenge to relational joins. GPUs have gained widespread adoption as database accelerators for operators such as joins due to their high instruction throughput and memory bandwidth. Most published GPU-accelerated joins are single-GPU algorithms that do not leverage modern multi-GPU platforms effectively. The few proposed multi-GPU algorithms either fail to exploit the high-speed peer-to-peer (P2P) interconnects between the GPUs or cannot handle large out-of-core data natively. In this paper, we present a heterogeneous multi-GPU sort-merge join that overcomes both limitations. It is composed of a P2P-enabled multi-GPU sort phase based on either merging or radix partitioning, a parallel CPU-based multiway merge phase, and a hybrid join phase that combines a CPU merge path partition with a binary search-based multi-GPU join strategy. We evaluate our novel multi-GPU join on two platforms with fast NVLink- and NVSwitch-based P2P interconnects. We show that our join outperforms state-of-the-art CPU and GPU baselines regardless of the workload. It outperforms parallel CPU sort-merge and radix-hash joins by up to 15.2× and 5.5×, respectively. Compared to non-P2P-enabled multi-GPU joins, it achieves speedups of 8.7× (sort-merge) and 2.5× (hybrid-radix). The hybrid join phase, which overlaps copy and compute operations, contributes as little as 22% to the join's end-to-end runtime. If the input relations are pre-sorted, our join is up to 14.4× faster than the hybrid-radix join. Our join scales well with the number of GPUs and benefits from data skew, with join durations up to 12% shorter.
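To make the hybrid join phase more concrete, the following is a minimal single-threaded Python sketch of its two ideas: a merge path partition that cuts two sorted relations into similarly sized pieces of work, and a binary search-based join of each piece. It is an illustration under stated assumptions, not the authors' implementation: the function names (`merge_path_split`, `join_chunk`, `partitioned_join`) are hypothetical, relations are reduced to lists of join keys, and the loop over partitions stands in for the work that the paper distributes across GPUs. As a simplification, only the R-side split points of the merge path are used, and the matching S range is recomputed by binary search so that duplicate keys crossing a partition boundary are still matched.

```python
from bisect import bisect_left, bisect_right


def merge_path_split(r, s, diag):
    """Return (i, j) with i + j == diag such that r[:i] and s[:j] are the
    first `diag` elements of the merged order (ties take from r first)."""
    lo = max(0, diag - len(s))
    hi = min(diag, len(r))
    while lo < hi:
        i = (lo + hi) // 2
        j = diag - i
        if r[i] <= s[j - 1]:
            lo = i + 1  # r[i] still belongs to the left of the diagonal
        else:
            hi = i
    return lo, diag - lo


def join_chunk(r_chunk, s):
    """Binary search-based equi-join of one sorted R chunk against sorted S."""
    out = []
    if not r_chunk:
        return out
    # Restrict S to the key range covered by this chunk.
    lo = bisect_left(s, r_chunk[0])
    hi = bisect_right(s, r_chunk[-1])
    for key in r_chunk:
        first = bisect_left(s, key, lo, hi)
        last = bisect_right(s, key, first, hi)
        out.extend((key, s[k]) for k in range(first, last))
        lo = first  # R is sorted, so later matches cannot start earlier
    return out


def partitioned_join(r_keys, s_keys, parts):
    """Sort both inputs, partition along the merge path, join each partition."""
    r, s = sorted(r_keys), sorted(s_keys)
    total = len(r) + len(s)
    # Equally spaced diagonals give each partition a similar share of the merge.
    cuts = [merge_path_split(r, s, total * p // parts)[0]
            for p in range(parts + 1)]
    result = []
    for begin, end in zip(cuts, cuts[1:]):  # stand-in for per-GPU work
        result.extend(join_chunk(r[begin:end], s))
    return result
```

Calling `partitioned_join(r_keys, s_keys, parts)` with any positive `parts` should return the same set of matching key pairs as a nested-loop equi-join; the number of partitions only changes how the work is divided.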