ICLR 2025
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou
Abstract
Large language models (LLMs) encounter computational challenges during long-sequence inference, especially in the attention pre-filling phase, where the complexity grows quadratically with the prompt length.