WWW2026

Probe-and-Fetch: Dynamic KV Cache Pruning for Accelerated Long-Context Inference in Web-Scale AI Search

Yuchen Li, Rui Kong, Xinran Chen, Chengzhe Zhang, Jiamin Chen, Cheng Deng, Xinyu Ma, Haojie Zhang, Tianhao Peng, Hengyi Cai, Shuaiqiang Wang, Jiashu Zhao, Yongqi Zhang, Haoyi Xiong, Jimmy Xiangji Huang, Lei Chen, Jun Wang, Dawei Yin

Abstract

Generative inference with Large Language Models (LLMs) is the cornerstone of web-scale AI search, where queries are answered using vast, heterogeneous documents retrieved via Retrieval-Augmented Generation (RAG). This paradigm is critically bottlenecked by the cost of the self-attention mechanism over long contexts. The sheer diversity of retrieved web content (multi-sourced, multi-lingual, multi-faceted) makes simple Key-Value (KV) cache optimizations that rely on pre-fixed subsets ineffective, demanding a dynamic, content-aware approach. This requirement, however, introduces a classic chicken-and-egg problem: the model cannot foresee which KV entries attention will need without first running inference on the content, yet doing so on the full context is prohibitively expensive. This paper introduces P&F, a unified framework that resolves this dilemma through a core "probe-and-fetch" mechanism integrated with speculative decoding, an acceleration approach already adopted in web-scale AI search. The probe step repurposes the speculative draft model: while generating candidate tokens, it simultaneously probes the context to predict the most salient KV entries the large model will need for attention. The fetch step immediately acts on this prediction, asynchronously fetching these sparse entries from memory. Because probing piggybacks on the drafting process, the latency of gathering the sparse KV cache can be fully hidden behind drafting. Crucially, this co-design breaks the sequential dependency bottleneck that cripples naive integrations of speculative decoding and prefetching due to synchronization issues.
Extensive offline evaluations across various settings and datasets show that P&F delivers higher throughput and better scalability than state-of-the-art baselines while maintaining model quality across diverse models and scales, offering a practical, drop-in solution. In online settings, P&F delivers substantial throughput gains while preserving response quality, making it well suited for large-scale industrial deployment in real-time AI search services.
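
The probe and fetch steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the draft model scores KV entries, so the sketch assumes a simple top-k selection by dot product between a draft query and per-position draft keys. A background thread stands in for the asynchronous memory transfer, and all function names (`probe_top_k`, `fetch_kv`, `sparse_attention`, `probe_and_fetch_step`) are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def probe_top_k(draft_query, draft_keys, k):
    """Probe (assumed scoring rule): rank context positions by the draft
    model's query-key dot product and keep the k highest, in position order."""
    scores = [dot(draft_query, key) for key in draft_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def fetch_kv(kv_store, indices):
    """Fetch: gather only the selected (key, value) pairs from the full cache."""
    return [kv_store[i] for i in indices]


def sparse_attention(query, kv_pairs):
    """Softmax attention restricted to the fetched subset of KV entries."""
    scale = math.sqrt(len(query))
    logits = [dot(query, key) / scale for key, _ in kv_pairs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(kv_pairs[0][1])
    return [
        sum(w * value[d] for w, (_, value) in zip(weights, kv_pairs)) / total
        for d in range(dim)
    ]


def probe_and_fetch_step(draft_query, draft_keys, target_query, kv_store, k, draft_fn):
    """One step: probe, start the fetch in the background, keep drafting,
    then run the large model's attention on the fetched subset."""
    indices = probe_top_k(draft_query, draft_keys, k)
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_kv, kv_store, indices)  # overlaps with drafting
        draft_tokens = draft_fn()  # stand-in for the draft model generating candidates
        kv_subset = pending.result()
    return indices, draft_tokens, sparse_attention(target_query, kv_subset)


# Toy context of four positions, each a (key, value) pair of 2-d vectors.
kv_store = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0], [2.0, 2.0]),
    ([-1.0, 0.0], [9.0, 9.0]),
]
draft_keys = [key for key, _ in kv_store]  # assumption: draft keys mirror target keys
indices, tokens, out = probe_and_fetch_step(
    [1.0, 0.5], draft_keys, [1.0, 0.5], kv_store, k=2,
    draft_fn=lambda: ["tok_a", "tok_b"],
)
```

In this toy run the probe keeps positions 0 and 2, so the attention output is a weighted average of only those two values. The fetch is submitted before `draft_fn` is called, which is the ordering that would let a real system overlap the memory transfer with drafting.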