KDD2023

Learning Discrete Document Representations in Web Search

Rong Huang, Danfeng Zhang, Weixue Lu, Han Li, Meng Wang, Daiting Shi, Jun Fan, Zhicong Cheng, Simiu Gu, Dawei Yin

被引用 3 次

摘要

Product quantization (PQ) has been usually applied to dense retrieval (DR) of documents thanks to its competitive time, memory efficiency and compatibility with other approximate nearest search (ANN) methods. Originally, PQ was learned to minimize the reconstruction loss, i.e., the distortions between the original dense embeddings and the reconstructed embeddings after quantization. Unfortunately, such an objective is inconsistent with the goal of selecting ground-truth documents for the input query, which may cause a severe loss of retrieval quality. Recent research has primarily concentrated on jointly training the biencoders and PQ to ensure consistency for improved performance. However, it is still difficult to design an approach that can cope with challenges like discrete representation collapse, mining informative negatives, and deploying effective embedding-based retrieval (EBR) systems in a real search engine.