ICML2024

Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation

Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia

10 citations

Abstract

This paper introduces RaLMSpec, a framework that accelerates iterative retrieval-augmented language model (RaLM) with speculative retrieval and batched verification. RaLMSpec further introduces several important systems optimizations, including prefetching, optimal speculation stride scheduler, and asynchronous verification. The combination of these techniques allows RaLM-SPec to significantly outperform existing systems. For document-level iterative RaLM serving, evaluation over three LLMs on four QA datasets shows that RaLMSpec improves over existing approaches by 1.75-2.39×, 1.04-1.39×, and 1.31-1.77× when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever respectively. For token-level iterative RaLM (KNN-LM) serving, RaLMSpec is up to 7.59× and 2.45× faster than existing methods for exact dense and approximate dense retrievers, respectively.