ICLR 2026

Critical attention scaling in long-context transformers

Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet


Abstract

As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While attention scaling effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking.

We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
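The two extremes of this phase transition can be seen numerically in a toy setting. The sketch below is only an illustration, not the model analyzed in the paper: it assumes tokens are random unit vectors, uses the tokens themselves as queries and keys, and computes softmax attention over scores $\beta \langle x_i, x_j \rangle$. The function names, the dimension, and the constants multiplying $\log n$ are hypothetical choices made for this example.

```python
import math
import random


def random_unit_vectors(n, d, seed=0):
    """Draw n random points on the unit sphere in R^d (toy tokens)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def attention_row(tokens, i, beta):
    """Softmax attention weights of token i over all tokens, with scores beta * <x_i, x_j>."""
    scores = [beta * sum(a * b for a, b in zip(tokens[i], t)) for t in tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_self_weight(tokens, beta):
    """Average weight a token places on itself: 1/n if uniform, 1 if identity."""
    n = len(tokens)
    return sum(attention_row(tokens, i, beta)[i] for i in range(n)) / n


def mean_row_entropy(tokens, beta):
    """Average entropy of the attention rows: log(n) if uniform, 0 if identity."""
    n = len(tokens)
    total = 0.0
    for i in range(n):
        row = attention_row(tokens, i, beta)
        total -= sum(p * math.log(p) for p in row if p > 0.0)
    return total / n


if __name__ == "__main__":
    n, d = 64, 8  # hypothetical sizes for the demonstration
    tokens = random_unit_vectors(n, d)
    # Compare no scaling, a few multiples of log(n), and a very large scaling.
    betas = [0.0, math.log(n), 4 * math.log(n), 1e4]
    print(f"log(n) = {math.log(n):.3f}")
    for beta in betas:
        w = mean_self_weight(tokens, beta)
        h = mean_row_entropy(tokens, beta)
        print(f"beta = {beta:10.3f}  self-weight = {w:.4f}  entropy = {h:.4f}")
```

With $\beta = 0$ every row is exactly uniform, so the self-weight is $1/n$ and the entropy is $\log n$. With very large $\beta$ each token attends almost only to itself. Values of $\beta$ proportional to $\log n$ fall between these extremes in this toy setup, where rows are neither uniform nor one-hot.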