ACL 2024

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao

Abstract

This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose Smart Parallel Auto-Correct dEcoding (SPACE), an innovative approach designed for achieving lossless acceleration of LLMs. By integrating semi-autoregressive inference and speculative decoding capabilities, SPACE uniquely enables autoregressive LLMs to parallelize token generation and verification. This is realized through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to simultaneously predict multiple tokens. Additionally, an auto-correct decoding algorithm facilitates the simultaneous generation and verification of token sequences within a single model invocation. Through extensive experiments on a range of LLMs, SPACE has demonstrated inference speedup ranging from 2.7x to 4.0x on HumanEval-X while maintaining output quality.

1 Introduction

The majority of current large language models (LLMs), including prominent examples like ChatGPT (Brown et al., 2020) and LLaMA (Touvron et al., 2023), are autoregressive (AR) in nature.