ICLR 2025

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou

Abstract

Large language models (LLMs) encounter computational challenges during long-sequence inference, especially in the attention pre-filling phase, where the complexity grows quadratically with the prompt length.