VLDB 2025

BigVectorBench: Heterogeneous Data Embedding and Compound Queries are Essential in Evaluating Vector Databases

Guoxin Kang, Zhongxin Ge, Jingpei Hu, Xueya Zhang, Lei Wang, Jianfeng Zhan


Abstract

Vector databases are designed to effectively store, organize, and retrieve high-dimensional vectors, enabling faster and more accurate querying and analysis. This study shows that the performance of state-of-the-art vector databases hinges on how well they manage heterogeneous data embedding and handle compound queries. The former converts varied data types into a unified vector format, while the latter processes multimodal or single-modal queries under precise constraints. We argue that both tasks should be evaluated within a single integrated benchmark framework. However, existing vector database benchmarks overlook heterogeneous data embedding and compound queries, leaving a gap in the evaluation of vector database performance. To address this gap, we introduce BigVectorBench, a benchmark suite for evaluating vector database performance. BigVectorBench defines and evaluates the embedding performance of heterogeneous data. It also abstracts compound queries, which are increasingly replacing unimodal vector searches in real-world applications. Our rigorous evaluations validate these two design decisions and identify performance bottlenecks in mainstream vector databases. The source code and user manual are available at https://github.com/BenchCouncil/BigVectorBench.
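To make the notion of a compound query concrete, the following is a minimal sketch of one possible reading of "a single-modal query with precise constraints": a nearest-neighbor search restricted to vectors whose scalar attribute matches a given value. It uses brute-force NumPy rather than any vector database, and the function name `filtered_top_k`, the integer labels, and the toy data are all hypothetical illustrations, not part of BigVectorBench or its API.

```python
import numpy as np

def filtered_top_k(vectors, labels, query, wanted_label, k):
    """Return indices of the k nearest vectors (Euclidean) whose label matches.

    Hypothetical helper for illustration only; not BigVectorBench code.
    """
    # Apply the scalar constraint first (pre-filtering).
    candidates = np.flatnonzero(labels == wanted_label)
    if candidates.size == 0:
        return np.empty(0, dtype=int)
    # Rank only the surviving candidates by distance to the query.
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return candidates[order]

# Toy data: four 2-D vectors, each with an integer attribute.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.1, 0.1], [5.0, 5.0]])
labels = np.array([0, 1, 1, 1])
query = np.array([0.0, 0.0])

# Vector 0 is closest to the query but fails the constraint, so it is excluded.
result = filtered_top_k(vectors, labels, query, wanted_label=1, k=2)
print(result.tolist())
```

The sketch filters before ranking; a real system might instead filter after an approximate search or interleave the two, and that choice is one of the things a benchmark with compound queries can expose.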