NeurIPS 2024

xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token

Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, Dongyan Zhao

Abstract

This paper introduces xRAG, a novel context compression method designed specifically for retrieval-augmented generation. xRAG redefines the use of document embeddings in dense retrieval, traditionally limited to retrieval purposes, by integrating them as features from the retrieval modality. Through a modality fusion approach, xRAG merges these embeddings into the language model's representation space, eliminating the need for their textual counterparts and achieving an extreme compression rate. In xRAG, the modality bridge is the only trainable component, while the retriever and language model remain frozen. This design choice allows for the reuse of offline-constructed document embeddings and preserves the plug-and-play nature of retrieval augmentation. Experimental results demonstrate that xRAG achieves an average improvement of over 10% across six knowledge-intensive tasks and is compatible with various language model backbones, ranging from a dense 7B model to an 8x7B Mixture of Experts configuration. xRAG not only significantly outperforms previous context compression methods but also matches the performance of uncompressed models on several benchmarks, while reducing overall FLOPs by a factor of 3.53. This work pioneers new avenues in retrieval-augmented generation through multimodal fusion, potentially laying the groundwork for future developments in efficient and scalable retrieval systems.

How might we mitigate the costs associated with extended context while maintaining the benefits of retrieval augmentation? Recent research interest has converged on a promising direction: context compression. This concept is pursued through two primary strategies: soft-prompting methods, such as Gist [58], AutoCompressor [14], and ICAE [19], which compress the context into dense memory slots, and hard-prompting methods, such as LLMLingua [28] and RECOMP [79], where