SIGMOD 2025

HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems

Yuhang Li, Rong Gu, Chengying Huan, Zhibin Wang, Renjie Yao, Chen Tian, Guihai Chen

Abstract

Prompt engineering techniques are widely used to enhance the generation quality of large language models (LLMs). However, long prompts significantly increase inference latency and reduce inference throughput. Since many prompts share common prefixes, prefix sharing has been proposed to reuse the KV caches of shared prefixes during inference. Nevertheless, the large number of prefix KV caches and the limited capacity of GPU memory make it impractical to keep all prefix KV caches in GPU memory. This limitation necessitates external memory storage, which under traditional approaches often suffers from high I/O overhead and frequent cache misses. To address these challenges, this paper proposes HotPrefix, a hotness-aware KV cache scheduling framework designed for efficient prefix sharing in LLM inference systems. HotPrefix introduces three core innovations: (1) Dynamic Hotness Tracking, which monitors and updates the hotness of prefix tree nodes over time; (2) Selective KV Cache Admission, which evaluates KV caches evicted from GPU memory and retains only high-hotness caches in CPU memory, extending the effective capacity of GPU memory while reducing KV cache transfer overhead; and (3) Hotness Promotion, which periodically promotes the KV caches of high-hotness prefix tree nodes from CPU memory to GPU memory. Hotness Promotion is combined with an efficient pipeline strategy that overlaps I/O with computation, ensuring that GPU memory is allocated to the most critical prefixes while masking the I/O overhead of KV cache transmission. Together, these mechanisms significantly improve cache hit rates, reduce inference latency, and enhance throughput. Implemented in the SGLang framework, HotPrefix reduces inference latency and increases throughput by up to 2.25× compared with vLLM with prefix sharing enabled. Compared with SGLang, it reduces latency and increases throughput by up to 2×.
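
To make the three mechanisms concrete, the following is a minimal Python sketch of how they could fit together. It is not the paper's implementation: the abstract does not specify the hotness formula, the admission rule, or the promotion schedule, so the exponential decay, the fixed `admit_threshold`, the swap-based `promote`, and all class and method names here are illustrative assumptions. KV tensors and the I/O pipeline are omitted, and only tier bookkeeping is modeled.

```python
from dataclasses import dataclass


@dataclass
class PrefixNode:
    """Bookkeeping for one prefix-tree node (hypothetical; KV tensors omitted)."""
    key: str
    tier: str = "gpu"      # one of "gpu", "cpu", "dropped"
    score: float = 0.0     # hotness as of last_tick
    last_tick: int = 0


class HotnessScheduler:
    """Toy model of hotness tracking, selective admission, and promotion."""

    def __init__(self, decay=0.5, admit_threshold=0.5):
        self.decay = decay                      # assumed per-tick decay factor
        self.admit_threshold = admit_threshold  # assumed CPU admission cutoff
        self.nodes = {}
        self.tick = 0

    def current_hotness(self, node):
        # Assumption: hotness decays exponentially with ticks since last access.
        return node.score * self.decay ** (self.tick - node.last_tick)

    def touch(self, key):
        """Dynamic hotness tracking: each access decays, then bumps, the score."""
        self.tick += 1
        node = self.nodes.setdefault(key, PrefixNode(key))
        node.score = self.current_hotness(node) + 1.0
        node.last_tick = self.tick
        if node.tier == "dropped":
            node.tier = "gpu"  # a dropped cache has to be recomputed on the GPU
        return node

    def evict(self, key):
        """Selective admission: a GPU eviction goes to CPU only if hot enough."""
        node = self.nodes[key]
        hot_enough = self.current_hotness(node) >= self.admit_threshold
        node.tier = "cpu" if hot_enough else "dropped"
        return node.tier

    def promote(self, budget=1):
        """Hotness promotion: swap hot CPU nodes with colder GPU nodes."""
        promoted = []
        for _ in range(budget):
            cpu = [n for n in self.nodes.values() if n.tier == "cpu"]
            gpu = [n for n in self.nodes.values() if n.tier == "gpu"]
            if not cpu or not gpu:
                break
            hot = max(cpu, key=self.current_hotness)
            cold = min(gpu, key=self.current_hotness)
            if self.current_hotness(hot) <= self.current_hotness(cold):
                break  # GPU already holds the hotter prefixes
            hot.tier = "gpu"
            self.evict(cold.key)  # the displaced node passes through admission
            promoted.append(hot.key)
        return promoted
```

In this sketch, a serving loop would call `touch` on every prefix hit, `evict` when GPU memory runs out, and `promote` on a timer. A real system would also have to move the KV tensors and overlap those transfers with computation, which is the role of the pipeline strategy described above.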