NeurIPS 2024

Sequoia: Scalable and Robust Speculative Decoding

Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

Abstract

As the usage of large language models (LLMs) grows, it becomes increasingly important to serve them quickly and efficiently. While speculative decoding has recently emerged as a promising direction for accelerating LLM serving, existing methods are limited in their ability to scale to larger speculation budgets and adapt to different hyperparameters. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. To improve scalability, Sequoia introduces a dynamic programming algorithm to find an optimal tree structure for the speculated tokens. To achieve robust speculative decoding, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU by up to 4.04×, 3.73×, and 2.27×, respectively. To serve Llama3-70B-Instruct on a single L40 GPU through offloading, Sequoia reduces the per-token decoding latency to 0.60 s/token, 9.5× faster than DeepSpeed-Zero-Inference. The code is available at https://github.com/Infini-AI-Lab/Sequoia.
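As background for the verification step mentioned above, the following minimal sketch shows the standard single-token speculative sampling rule from earlier speculative decoding work: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized residual max(p - q, 0). This is not Sequoia's tree-based sampling and verification method, which the abstract does not specify. The function name `verify_draft_token` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def verify_draft_token(token, target_probs, draft_probs, rng=random):
    """Standard single-token speculative sampling check (illustrative only).

    token        -- index of the token proposed by the draft model
    target_probs -- target model distribution over the vocabulary (list of floats)
    draft_probs  -- draft model distribution over the vocabulary (list of floats)
    Returns (accepted, output_token).
    """
    p = target_probs[token]
    q = draft_probs[token]

    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, token

    # On rejection, resample from the residual distribution,
    # proportional to max(p - q, 0) for each vocabulary entry.
    residual = [max(pt - qd, 0.0) for pt, qd in zip(target_probs, draft_probs)]
    vocab = range(len(target_probs))
    if sum(residual) == 0.0:
        # The two distributions coincide; fall back to the target distribution.
        return False, rng.choices(vocab, weights=target_probs)[0]
    return False, rng.choices(vocab, weights=residual)[0]
```

With this rule the output token is distributed according to the target model regardless of the draft distribution, which is why a cheap draft model can be used without changing the output distribution.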