ICML2024

Algorithm and Hardness for Dynamic Attention Maintenance in Large Language Models

Jan van den Brand, Zhao Song, Tianyi Zhou

被引用 35 次

摘要

Large language models (LLMs) have made fundamental changes in human life. The attention scheme is one of the key components over all the LLMs, such as BERT, GPT-1, Transformers, GPT-2, 3, 3.5 and 4. Inspired by previous theoretical study of static version of the attention multiplication problem [Zandieh, Han, Daliri, and Karbasi arXiv 2023, Alman and Song arXiv 2023]. In this work, we formally define a dynamic version of attention matrix multiplication problem. There are matrices Q,K,VRn×dQ,K, V \in \mathbb{R}^{n \times d}, they represent query, key and value in LLMs. In each iteration we update one entry in KK or VV. In the query stage, we receive (i,j)[n]×[d](i,j) \in [n] \times [d] as input, and want to answer (D1AV)i,j(D^{-1} A V)_{i,j}, where A:=exp(QK)Rn×nA:=\exp(QK^\top) \in \mathbb{R}^{n \times n} is a square matrix and D:=diag(A1n)Rn×nD := \mathrm{diag}(A {\bf 1}_n) \in \mathbb{R}^{n \times n} is a diagonal matrix. Here 1n{\bf 1}_n denote a length-nn vector that all the entries are ones. We provide two results: an algorithm and a conditional lower bound. \bullet On one hand, inspired by the lazy update idea from [Demetrescu and Italiano FOCS 2000, Sankowski FOCS 2004, Cohen, Lee and Song STOC 2019, Brand SODA 2020], we provide a data-structure that uses O(nω(1,1,τ)τ)O(n^{\omega(1,1,\tau)-\tau}) amortized update time, and O(n1+τ)O(n^{1+\tau}) worst-case query time. \bullet On the other hand, show that unless the hinted matrix vector multiplication conjecture [Brand, Nanongkai and Saranurak FOCS 2019] is false, there is no algorithm that can use both O(nω(1,1,τ)τΩ(1))O(n^{\omega(1,1,\tau) - \tau- \Omega(1)}) amortized update time, and O(n1+τΩ(1))O(n^{1+\tau-\Omega(1)}) worst query time. In conclusion, our algorithmic result is conditionally optimal unless hinted matrix vector multiplication conjecture is false.