ICML2020

Why Are Learned Indexes So Effective?

Paolo Ferragina, Fabrizio Lillo, Giorgio Vinciguerra

被引用 63 次

摘要

Learned indexes have attracted significant research interest due to their ability to offer better space-time trade-offs compared to traditional B+-tree variants. Among various learned indexes, the PGM-Index based on error-bounded piecewise linear approximation is an elegant data structure that has demonstrated provably superior performance over conventional B+-tree indexes. In this paper, we explore two interesting research questions regarding the PGM-Index: ❶ Why are PGM-Indexes theoretically effective? and ❷ Why do PGM-Indexes underperform in practice? For question ❶, we first prove that, for a set of 𝑁 sorted keys, the PGM-Index can, with high probability, achieve a lookup time of 𝑂 (log log 𝑁 ) while using 𝑂 (𝑁 ) space. To the best of our knowledge, this is the tightest bound for learned indexes to date. For question ❷, we identify that querying PGM-Indexes is highly memory-bound, where the internal error-bounded search operations often become the bottleneck. To fill the performance gap, we propose PGM++, a simple yet effective extension to the original PGM-Index that employs a mixture of different search strategies, with hyper-parameters automatically tuned through a calibrated cost model. Extensive experiments on real workloads demonstrate that PGM++ establishes a new Pareto frontier. At comparable space costs, PGM++ speeds up index lookup queries by up to 2.31× and 1.56× when compared to the original PGM-Index and state-of-the-art learned indexes.