ICML 2024

Bottleneck-Minimal Indexing for Generative Document Retrieval

Xin Du, Lixin Xiu, Kumiko Tanaka-Ishii


Abstract

We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can be considered to involve information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon's rate-distortion theory, the optimality of indexing can be analyzed in terms of the mutual information, and the design of the indexes $T$ can then be regarded as a bottleneck in GDR. After reformulating GDR from this perspective, we empirically quantify the bottleneck underlying GDR. Finally, using the NQ320K and MARCO datasets, we evaluate our proposed bottleneck-minimal indexing method in comparison with various previous indexing methods, and we show that it outperforms those methods.
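
To make the two mutual-information quantities concrete, the following is a minimal sketch, not the authors' method. It assumes a toy joint distribution over documents and queries (invented for illustration, not drawn from NQ320K or MARCO) and a deterministic indexing that maps each document to one index. The names `mutual_information`, `bottleneck_terms`, `p_xq`, `fine`, and `coarse` are hypothetical. It computes the rate $I(X;T)$ that the index spends on documents and the query-relevant information $I(T;Q)$ that the index retains.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """Return I(A;B) in bits for a joint distribution {(a, b): probability}."""
    pa = defaultdict(float)
    pb = defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def bottleneck_terms(p_xq, index_of):
    """Return (I(X;T), I(T;Q)) for a deterministic indexing x -> index_of[x]."""
    p_xt = defaultdict(float)
    p_tq = defaultdict(float)
    for (x, q), p in p_xq.items():
        t = index_of[x]
        p_xt[(x, t)] += p  # joint of documents and their indexes
        p_tq[(t, q)] += p  # joint of indexes and queries
    return mutual_information(p_xt), mutual_information(p_tq)


# Toy assumption: four equally likely documents, two query types.
p_xq = {
    ("d1", "qa"): 0.25,
    ("d2", "qa"): 0.25,
    ("d3", "qb"): 0.25,
    ("d4", "qb"): 0.25,
}

# Two hypothetical indexings: one index per document vs. one index per query type.
fine = {"d1": "0", "d2": "1", "d3": "2", "d4": "3"}
coarse = {"d1": "0", "d2": "0", "d3": "1", "d4": "1"}

for name, indexing in [("fine", fine), ("coarse", coarse)]:
    rate, relevance = bottleneck_terms(p_xq, indexing)
    print(f"{name}: I(X;T) = {rate:.2f} bits, I(T;Q) = {relevance:.2f} bits")
```

In this toy setting both indexings retain 1 bit about the query, but the coarse one spends 1 bit on the documents where the fine one spends 2. This is the kind of trade-off a rate-distortion analysis of indexing looks at. A practical index must also still identify individual documents, which this sketch ignores.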