ICLR 2026

Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi

Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size and $\beta$ is the Hölder smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix, provided that $rd \le (M/\log M)^{\frac{1}{2\beta+1}}$. These results highlight a fundamental statistical efficiency of attention-style models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of attention mechanisms and guidance on training.
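
To make the stated rate and its side condition concrete, the following minimal Python sketch evaluates both for given values of the sample size, smoothness, rank, and embedding dimension. The function names `minimax_rate` and `dimension_condition_holds` are hypothetical and introduced here only for illustration. The sketch assumes the natural logarithm and ignores multiplicative constants, neither of which the abstract specifies, so the outputs indicate scaling only and are not exact error bounds.

```python
import math


def minimax_rate(M: int, beta: float) -> float:
    """Return M^(-2*beta / (2*beta + 1)), the rate stated in the abstract.

    Constants are omitted (assumption): this is the scaling in M only.
    """
    return M ** (-2.0 * beta / (2.0 * beta + 1.0))


def dimension_condition_holds(r: int, d: int, M: int, beta: float) -> bool:
    """Check the side condition r*d <= (M / log M)^(1 / (2*beta + 1)).

    The natural logarithm is an assumption; the abstract does not fix the base.
    """
    threshold = (M / math.log(M)) ** (1.0 / (2.0 * beta + 1.0))
    return r * d <= threshold


if __name__ == "__main__":
    M, beta = 10**6, 1.0
    print("rate scaling:", minimax_rate(M, beta))
    print("r=2, d=16 satisfies condition:", dimension_condition_holds(2, 16, M, beta))
    print("r=4, d=16 satisfies condition:", dimension_condition_holds(4, 16, M, beta))
```

Under these assumptions, a smoother activation (larger $\beta$) pushes the exponent toward $-1$, while the admissible product $rd$ shrinks, since the threshold exponent $\frac{1}{2\beta+1}$ decreases.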