ICLR2026

Sublinear Time Quantum Algorithm for Attention Approximation

Zhao Song, Jianfei Xue, Jiahao Zhang, Lichen Zhang

1 citation

Abstract

Given the query, key and value matrices $Q, K, V\in \mathbb{R}^{n\times d}$ , the attention matrix is defined as $\mathrm{Att}(Q, K, V)=D^{-1}AV$ where $A=\exp(QK^\top/\sqrt{d})$ with $\exp(\cdot)$ applied entrywise, $D=\mathrm{diag}(A{\bf 1}_n)$ . The attention matrix is the backbone of modern transformers and large language models, but explicitly forming the softmax matrix $D^{-1}A$ incurs $\Omega(n^2)$ , motivating numerous approximation schemes that reduce runtime to $\widetilde O(nd)$ via sparsity or low-rank factorization.

We propose a quantum data structure that approximates any row of $\mathrm{Att}(Q, K, V)$ using only row queries to $Q, K, V$ . Our algorithm preprocesses these matrices in $\widetilde{O}\left( \epsilon^{-1} n^{0.5} \left( s_\lambda^{2.5} + s_\lambda^{1.5} d + \alpha^{0.5} d \right) \right)$ time, where $\epsilon$ is the target accuracy, $s_\lambda$ is the $\lambda$ -statistical dimension of the exponential kernel defined by $Q$ and $K$ , and $\alpha$ measures the row distortion of $V$ that is at most $d/{\rm srank}(V)$ , the stable rank of $V$ . Each row query can be answered in $\widetilde{O}(s_\lambda^2 + s_\lambda d)$ time.

To our knowledge, this is the first quantum data structure that approximates rows of the attention matrix in sublinear time with respect to $n$ . Our approach relies on a quantum Nyström approximation of the exponential kernel, quantum multivariate mean estimation for computing $D$ , and quantum leverage score sampling for the multiplication with $V$ .