ACL2024

RelayAttention for Efficient Large Language Model Serving with Long System Prompts

Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau

Abstract

Practical large language model (LLM) services may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across numerous requests. However, the long system prompt causes throughput/latency bottlenecks as the cost of generating the next token grows w.r.t. the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires heavily redundant memory accesses in existing causal attention computation algorithms. Specifically, for batched requests, the cached hidden states (i.e., key-value pairs) of system prompts are transferred from off-chip DRAM to on-chip SRAM multiple times, each corresponding to an individual request. To eliminate such a redundancy, we propose RelayAttention, an attention algorithm that allows reading these hidden states from DRAM exactly once for a batch of input tokens. RelayAttention is a free lunch: it maintains the generation quality while requiring no model retraining, as it is based on a mathematical reformulation of causal attention.

[Figure: chart with segments labeled Cache Op, Context Attention, Kernel launch overhead, Relay Fusion, and System Attention]

[...] the hardware utilization so that LLMs can have a higher throughput within a fixed hardware budget.

LLM services commonly use an application-specific system prompt (OpenAI, 2023a) to specify the task's instructions. The system prompt is concatenated with the user prompt as the full input to the LLM for response generation and is shared by all requests to a service. The system prompt becomes long if the service provider wants to provide detailed guidelines and examples for better response quality or apply more restrictions/policies for ethical safety.
As the sequence length that LLMs can process grows (Anthropic, 2023; Chen et al., 2023b; DeepSeek-AI et al., 2024), some emerging professional applications, such as legal analysis (Cui et al., 2023; Nay et al., 2023), healthcare applications (Steinberg et al., 2021; Rasmy et al., 2021), and the shopping assistant example shown in Fig. 2, may include one or more knowledge documents to provide domain-specific knowledge, resulting in even longer system prompts. Although long system prompts are beneficial to improving the generation quality or enabling new applications, they also pose a challenge to the LLM service: the throughput and latency of the service can be heavily degraded, thus increasing the per-request cost. This is inherently caused by causal attention, in which each new token is generated by "looking at" all preceding ones.

In this paper, we propose a novel approach to mitigate the efficiency problem of using long system prompts in LLM services. Our key observation is that the system prompt incurs not only a redundant memory footprint (Kwon et al., 2023) and redundant computations (Gim et al., 2023), but also unnecessary memory accesses during causal attention computation. Specifically, while the system prompt is shared by all requests, its hidden states (i.e., key-value pairs) are read from DRAM multiple times by existing attention algorithms such as PagedAttention (Kwon et al., 2023) and FlashAttention (Dao et al., 2022; Dao, 2023), once for each individual request in the batch. This severely slows down LLM inference, which is known to be memory-bound (Section 3.2). To eliminate such redundant memory access, we propose RelayAttention, an exact algorithm to compute causal attention based on a mathematical reformulation of it.
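To make the redundancy concrete, the following is a minimal back-of-the-envelope sketch, not the paper's analysis (which appears in Section 3.4). It counts only key-value cache entries fetched from DRAM in one decoding step of one attention layer, and ignores queries, outputs, weights, and compute. The function names and the cost model itself are assumptions introduced here for illustration.

```python
def kv_entries_read(batch_size, sys_len, ctx_len, shared_system_read):
    """Count cached key-value entries read from DRAM in one decoding step.

    Hypothetical cost model (an assumption, not taken from the paper):
    every request reads its own ctx_len request-specific entries, and the
    sys_len system-prompt entries are read either once per request or
    once for the whole batch.
    """
    if shared_system_read:
        system_reads = sys_len               # read once for the batch
    else:
        system_reads = batch_size * sys_len  # re-read for every request
    context_reads = batch_size * ctx_len     # never shared between requests
    return system_reads + context_reads


def traffic_ratio(batch_size, sys_len, ctx_len):
    """Ratio of per-request reads to shared reads under the model above."""
    per_request = kv_entries_read(batch_size, sys_len, ctx_len, False)
    shared = kv_entries_read(batch_size, sys_len, ctx_len, True)
    return per_request / shared
```

With illustrative values chosen here (a batch of 32 requests, a 2048-token system prompt, and 128 request-specific tokens each), this model gives 69632 entries read per step when the system prompt is re-read per request versus 6144 when it is read once, a ratio of about 11.3. With a batch of one, the ratio is 1, since there is nothing to share.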
The key idea of RelayAttention is to group the matrix-vector multiplications corresponding to the system prompt into matrix-matrix multiplications, which allow loading the hidden states of the system prompt from DRAM exactly once for all request tokens in a batch (Section 3.3). We provide an in-depth analysis of the theoretical speedup via redundancy reduction with IO-awareness (Section 3.4). Our empirical results [...]
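The claim that attention can be computed exactly while handling the system prompt separately can be illustrated with a standard identity: softmax attention over a concatenated key-value sequence equals a convex combination of the attention over each segment, weighted by each segment's softmax normalizer. The sketch below assumes this identity is the kind of reformulation meant; the paper's own formulation is in Section 3.3. It covers a single decoding step, where each request contributes one query that attends to all cached entries, so no causal mask is needed. All function names are hypothetical, and plain Python lists stand in for GPU tensors.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over one key-value segment.

    Returns the output vector and the log-sum-exp of the scaled scores,
    which is what is needed to merge segments exactly later on.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total
           for j in range(dim)]
    return out, m + math.log(total)


def fuse(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention outputs into the exact joint output."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    alpha = w_a / (w_a + w_b)  # share of softmax mass in segment a
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(out_a, out_b)]


def split_attention(queries, sys_k, sys_v, ctx_ks, ctx_vs):
    """Attention with the shared system segment handled separately.

    sys_k and sys_v are identical for every query, so on real hardware the
    system part can be one matrix-matrix product over the whole batch.
    """
    results = []
    for q, ctx_k, ctx_v in zip(queries, ctx_ks, ctx_vs):
        o_sys, l_sys = partial_attention(q, sys_k, sys_v)
        o_ctx, l_ctx = partial_attention(q, ctx_k, ctx_v)
        results.append(fuse(o_sys, l_sys, o_ctx, l_ctx))
    return results


def reference_attention(queries, sys_k, sys_v, ctx_ks, ctx_vs):
    """Baseline: each request attends over its full concatenated cache."""
    return [partial_attention(q, sys_k + ctx_k, sys_v + ctx_v)[0]
            for q, ctx_k, ctx_v in zip(queries, ctx_ks, ctx_vs)]
```

Calling `split_attention` and `reference_attention` on the same inputs should agree up to floating-point rounding, which is the sense in which such a reformulation is exact rather than approximate. The Python loop does not itself save any memory traffic; it only shows that the system-prompt segment can be processed independently of the request-specific segments.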