SIGMOD 2025

DepCache: A KV Cache Management Framework for GraphRAG with Dependency Attention

Hao Yuan, Xin Ai, Qiange Wang, Peizheng Li, Jiayang Yu, Chaoyi Chen, Xinbo Yang, Yanfeng Zhang, Zhenbo Fu, Yingyou Wen, Ge Yu

Cited by 2

Abstract

Graph-based Retrieval-Augmented Generation (GraphRAG) has emerged as a promising paradigm for enhancing LLM reliability by enabling multi-hop reasoning over graph-structured knowledge. However, existing LLMs struggle to process graph-structured inputs efficiently: traditional attention mechanisms are sequence-based, so serializing a graph into a prompt sequence introduces significant redundancy and leads to excessive computation and memory overhead. To address this, we introduce dependency attention, a novel graph-aware attention mechanism that restricts attention computation to token pairs with structural dependencies in the retrieved subgraph. Unlike standard self-attention, which computes fully connected interactions, dependency attention prunes irrelevant token pairs and reuses computations along shared relational paths, substantially reducing inference overhead. Building on this idea, we develop DepCache, a KV cache management framework tailored for dependency attention. DepCache combines (i) a graph-based KV cache reuse strategy that aligns KV caches across varying prompt contexts, enabling efficient cross-request reuse in GraphRAG, and (ii) a locality-aware replacement policy that leverages spatial and temporal access patterns to improve the KV cache hit rate. Evaluations across diverse models and datasets show that DepCache improves LLM inference throughput by 1.5× to 5.0× and reduces time-to-first-token latency by up to 3.2×, without compromising generation accuracy.
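
The idea of restricting attention to structurally dependent token pairs can be illustrated with a small mask-construction sketch. This is not the authors' implementation; it is one plausible reading of the abstract, under the following assumptions: each subgraph node contributes a contiguous span of tokens, an edge (src, dst) means dst depends on src, dependencies are transitive, and attention stays causal within the serialized order. The function name `dependency_mask` and its input format are hypothetical.

```python
from collections import defaultdict


def dependency_mask(node_tokens, edges):
    """Build a boolean attention mask from a retrieved subgraph.

    node_tokens: dict mapping node id -> token count, in serialization order
                 (hypothetical input format, assumed for illustration).
    edges: list of (src, dst) pairs meaning dst depends on src (assumed).
    """
    # Assign each token position to the node that owns it.
    owner = []
    for node, n_tok in node_tokens.items():
        owner.extend([node] * n_tok)

    # Record the direct dependencies of every node.
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    def ancestors(node, seen=None):
        # Collect all nodes reachable by following dependencies backwards.
        seen = set() if seen is None else seen
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    # A node sees itself plus everything it transitively depends on.
    visible = {node: ancestors(node) | {node} for node in node_tokens}

    n = len(owner)
    # mask[i][j] is True when token i may attend to token j:
    # j must not come after i, and j's node must be visible to i's node.
    return [
        [j <= i and owner[j] in visible[owner[i]] for j in range(n)]
        for i in range(n)
    ]


# Toy subgraph: B and C both depend on A, but not on each other.
mask = dependency_mask({"A": 2, "B": 2, "C": 2}, [("A", "B"), ("A", "C")])
kept = sum(map(sum, mask))          # token pairs that are still computed
n = len(mask)
causal = n * (n + 1) // 2           # pairs under ordinary causal attention
print(f"{kept} of {causal} causal pairs kept")  # prints "17 of 21 causal pairs kept"
```

In this toy case the tokens of C never attend to the tokens of B, so the key/value entries for C depend only on A and C. Under these assumptions, that independence is what would let the entries for a shared node such as A be computed once and reused across prompts that serialize different neighbors around it.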