SIGMOD 2025
Integrating Vector Databases across Embedding Models
Beining Yang, Yang Cao, Yang Ren
Abstract
Vector databases are widely used to implement similarity search over unstructured objects, e.g., documents and images. Each vector database is produced by an embedding model that encodes objects so that more similar objects map to closer vectors, allowing top-k vector search to serve as an implementation of top-k object similarity search. In practice, different vector databases commonly use distinct embedding models, so the same object may be encoded by different embedding vectors across databases. As a result, one cannot share and integrate vector databases to expand similarity search across datasets, a capability we take for granted in relational databases. In this work, we attempt to break the barrier between different vector databases by developing an approach to integrating vector databases generated by different embedding models, without access to the encoded data objects or knowledge of the embedding models. Our approach is rooted in the local isometry hypothesis, a finding made via extensive experiments on real-life embedding vectors, and is backed by theoretical analysis that bounds the quality of the integrated vector database. Experimental results show that we can integrate vector databases produced by various popular embedding models, e.g., NV-embed-V2, OpenAI Ada, GloVe, Mistral and FastText, while offering high recall of top-k similarity search over the integrated datasets.
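To make the barrier concrete, the following is a minimal sketch in Python with NumPy. It is an illustration, not the paper's method. It assumes two hypothetical embedding "models" that encode the same objects in differently rotated coordinate systems; the names `top_k`, `recall_at_k`, `db_a`, and `db_b` are introduced here for illustration only. Search within each database agrees, but a query embedded by one model cannot be searched directly in the other database.

```python
import numpy as np

def top_k(query, vectors, k):
    """Return indices of the k vectors closest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists, kind="stable")[:k]

def recall_at_k(found, truth):
    """Fraction of the true top-k neighbours that were retrieved."""
    return len(set(found.tolist()) & set(truth.tolist())) / len(truth)

rng = np.random.default_rng(0)
n, d, k = 200, 16, 10

# Hypothetical "model A": synthetic embeddings of n objects.
db_a = rng.normal(size=(n, d))

# Hypothetical "model B": the same objects in a rotated coordinate system.
# This is an assumption made for illustration; real models differ in more
# complicated ways.
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
db_b = db_a @ rotation

# A query near object 0, embedded by each model.
query_a = db_a[0] + 0.01 * rng.normal(size=d)
query_b = query_a @ rotation

truth = top_k(query_a, db_a, k)        # reference result inside database A
within_b = top_k(query_b, db_b, k)     # B query against database B
cross = top_k(query_a, db_b, k)        # A query sent directly to database B

recall_within = recall_at_k(within_b, truth)
recall_cross = recall_at_k(cross, truth)
print(f"recall within B: {recall_within:.2f}, recall across A->B: {recall_cross:.2f}")
```

In this toy setting the two spaces preserve all pairwise distances, so each database answers its own queries consistently, while the cross-database query retrieves largely unrelated objects. The abstract's setting is harder: neither the objects nor the models are available to compute an alignment directly.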