NDSS 2026

Cache Me, Catch You: Cache Related Security Threats in LLM Serving Frameworks

XiangFan Wu, Lingyun Ying, Guoqiang Chen, Yacong Gu, Haipeng Qu

Abstract

Among these optimization strategies, cache is particularly effective, offering significant performance improvements by storing intermediate results to eliminate repetitive computations [7]. Middleware caching solutions, such as GPTCache [8] and ModelCache [9], further extend these efficiency gains. By mechanism, caching in LLMs can be classified into three categories: prefix cache, multimodal cache, and semantic cache.

Prefix cache stores computational states for previously processed tokens, enabling efficient reuse for subsequent queries sharing identical input prefixes (see Figure 1). Mainstream inference engines such as vLLM and SGLang provide built-in prefix cache support by default. Commercial LLM APIs, including OpenAI and Gemini, also enable prefix cache by default [10], [11], illustrating its practical application and cost advantage.

Multimodal cache stores the preprocessed form of multimodal inputs (e.g., images or audio) so that identical inputs are not recomputed. This approach is already integrated into vLLM for vision models and appears in production pipelines such as Google's Gemini [11].

Semantic cache, by contrast, works at a higher abstraction level: it indexes responses by semantic embeddings and retrieves a stored response based on query similarity instead of performing full inference. This approach is adopted by middleware solutions like GPTCache and by vector databases integrated within frameworks like LangChain [12], and it is particularly effective for applications with repetitive, standardized, or template-based queries.

Although cache can greatly reduce response time and improve efficiency, a defective implementation can introduce security vulnerabilities. Caching mechanisms typically work in the Key-Value (KV) mode and involve three stages: object serialization, key generation, and cached value retrieval.
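To make the three stages concrete for a prefix cache, the following is a minimal sketch, not the implementation of any particular framework. The block size, the chained SHA-256 key scheme, and all names (`serialize_block`, `block_key`, `PrefixCache`) are assumptions chosen for illustration, and the cached "state" is a placeholder string rather than real attention key/value tensors.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # hypothetical block size; real engines use their own values


def serialize_block(tokens: List[int]) -> bytes:
    """Stage 1, object serialization: fixed-width encoding of token IDs."""
    return b"".join(t.to_bytes(4, "big") for t in tokens)


def block_key(parent_key: bytes, tokens: List[int]) -> bytes:
    """Stage 2, key generation: hash the parent key plus this block.

    Chaining the parent key makes a block's key depend on the whole prefix,
    not only on the tokens inside the block.
    """
    return hashlib.sha256(parent_key + serialize_block(tokens)).digest()


class PrefixCache:
    """Toy prefix cache that maps chained block keys to placeholder states."""

    def __init__(self) -> None:
        self._store: Dict[bytes, str] = {}

    def _keys(self, tokens: List[int]) -> List[bytes]:
        keys: List[bytes] = []
        parent = b""
        full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a partial tail
        for start in range(0, full, BLOCK_SIZE):
            parent = block_key(parent, tokens[start:start + BLOCK_SIZE])
            keys.append(parent)
        return keys

    def insert(self, tokens: List[int]) -> None:
        for index, key in enumerate(self._keys(tokens)):
            self._store.setdefault(key, f"state-for-block-{index}")

    def match_prefix(self, tokens: List[int]) -> int:
        """Stage 3, retrieval: count leading tokens whose state is cached."""
        matched = 0
        for key in self._keys(tokens):
            if key not in self._store:
                break
            matched += BLOCK_SIZE
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
shared = cache.match_prefix([1, 2, 3, 4, 5, 6, 7, 8, 99, 100])
diverged = cache.match_prefix([1, 2, 3, 4, 9, 9, 9, 9])
```

In this sketch, `shared` covers the two full blocks common to both requests and `diverged` covers only the first block. A weakness at any stage (an ambiguous serialization, a collision-prone key function, or a retrieval step that trusts the key alone) would let two different inputs resolve to the same cached state.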
Flawed design, deficient implementation, and incorrect usage can all lead to security vulnerabilities, which can be exploited to carry out malicious activities. Our investigation identifies vulnerabilities at each of these stages, posing critical security threats. For example, improper object serialization may erroneously map distinct inputs (e.g., images) to identical cached representations. Moreover, Non-Cryptographic Hash Functions (NCHFs) [13] are frequently

Abstract: Large Language Models (LLMs) are rapidly reshaping digital interactions. Their performance and efficiency are critically dependent on advanced caching mechanisms, such as prefix caching and semantic caching. However, these mechanisms introduce a new attack surface. Unlike prior work focused on LLM poisoning attacks during the training phase, this paper presents the first comprehensive investigation into cache-related security risks that arise at LLM inference time. We conducted a systematic study of the cache implementations in mainstream LLM serving frameworks and identified six novel attack vectors in two categories: (1) User-oriented Fraud Attacks, which manipulate cache entries to deliver malicious content to users via prefix cache collisions and semantic fuzzy poisoning; and (2) System Integrity Attacks, which exploit cache vulnerabilities to bypass security checks, such as using blockwise or multimodal collisions to evade content moderation. Our experiments on leading open-source frameworks validated these attack vectors and evaluated their impact and cost. Furthermore, we proposed five multilayer defense strategies and assessed their effectiveness. We responsibly disclosed our findings to affected vendors, including vLLM, SGLang, GPTCache, AIBrix, rtp-llm, and LMDeploy. All of them have acknowledged the vulnerabilities, and notably, vLLM, GPTCache, and AIBrix have adopted our proposed mitigation methods and fixed their vulnerabilities.
Our findings underscore the importance of securing the caching infrastructure in the rapidly expanding LLM ecosystem.
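The serialization risk noted earlier, where distinct inputs such as images map to identical cached representations, can be illustrated with a small sketch. The `Image` type and both serializers are hypothetical and do not reproduce the code of any framework named here. The flawed serializer is assumed to encode only image metadata, so a strong hash applied afterwards cannot tell the two inputs apart.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for a decoded image object."""
    width: int
    height: int
    mode: str
    pixels: bytes


def lossy_serialize(img: Image) -> bytes:
    # Flawed on purpose: only metadata is encoded, pixel content is dropped.
    return f"{img.mode}:{img.width}x{img.height}".encode()


def full_serialize(img: Image) -> bytes:
    # Includes the pixel bytes, so different content yields different bytes.
    header = f"{img.mode}:{img.width}x{img.height}:".encode()
    return header + img.pixels


def cache_key(serialized: bytes) -> str:
    # A cryptographic hash cannot repair information lost during serialization.
    return hashlib.sha256(serialized).hexdigest()


benign = Image(width=2, height=2, mode="L", pixels=bytes([0, 0, 0, 0]))
other = Image(width=2, height=2, mode="L", pixels=bytes([255, 255, 255, 255]))

lossy_collision = cache_key(lossy_serialize(benign)) == cache_key(lossy_serialize(other))
full_collision = cache_key(full_serialize(benign)) == cache_key(full_serialize(other))
```

Under these assumptions, `lossy_collision` is true and `full_collision` is false: the two images differ in every pixel, yet the metadata-only serializer gives them one cache key, so a result cached for one image would be served for the other.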