KDD 2025

Robust Tree-based Learned Vector Index with Query-aware Repartitioning

Wenqing Wei, Defu Lian, Qingshuai Feng, Yongji Wu

Abstract

Approximate Vector Retrieval (AVR), which aims to efficiently retrieve the most similar items from a large dataset, is a fundamental task in a variety of applications such as information retrieval, recommender systems, and large language models. Advances in representation learning and multimodal neural models have enabled diverse data types (e.g., text, images, audio) to be embedded into a shared vector space, facilitating similarity-based retrieval in AVR. While single-modal AVR assumes that query and database embeddings follow the same distribution (in-distribution, ID), cross-modal AVR introduces a distribution shift, where query vectors (e.g., text) are out-of-distribution (OOD) relative to the database (e.g., images). This mismatch complicates retrieval and degrades accuracy, making it a key challenge in AVR. Existing methods typically focus on either ID or OOD queries but struggle to handle both within a unified framework.