ICML 2025

Learning In-context n-grams with Transformers: Sub-n-grams Are Near-Stationary Points

Aditya Varre, Gizem Yüce, Nicolas Flammarion

Abstract

Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context n-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent k-gram estimators (for k ⩽ n), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-n-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of n-grams, characterized by discrete transitions between near-stationary solutions.
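
To make the notion of an in-context k-gram estimator concrete, the following is a minimal Python sketch of the counting rule such an estimator computes: it predicts the next token from the empirical frequencies of tokens that followed earlier occurrences of the last k-1 tokens in the same sequence. This is an illustration only, not the transformer construction analyzed in the paper; the function name `in_context_kgram`, the integer token encoding, and the uniform fallback when the context has not been seen before are assumptions made for the example.

```python
from collections import Counter


def in_context_kgram(tokens, k, vocab_size):
    """Estimate the next-token distribution from in-context k-gram counts.

    tokens     : list of ints in [0, vocab_size), the sequence seen so far
    k          : order of the estimator (k=1 unigram, k=2 bigram, ...)
    vocab_size : number of distinct tokens

    Hypothetical helper for illustration; not taken from the paper.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    ctx_len = k - 1
    # The conditioning context: the last k-1 tokens (empty for a unigram).
    suffix = tuple(tokens[-ctx_len:]) if ctx_len else ()

    counts = Counter()
    # Scan every position whose preceding k-1 tokens match the suffix
    # and tally the token that followed.
    for i in range(ctx_len, len(tokens)):
        if tuple(tokens[i - ctx_len:i]) == suffix:
            counts[tokens[i]] += 1

    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution for unseen contexts.
        return [1.0 / vocab_size] * vocab_size
    return [counts[v] / total for v in range(vocab_size)]


if __name__ == "__main__":
    seq = [0, 1, 0, 1, 0, 2, 0]
    for order in (1, 2, 3):
        print(order, in_context_kgram(seq, order, vocab_size=3))
```

For a fixed sequence, estimators of different orders k generally give different predictions; in the abstract's terminology, those with k smaller than n are the sub-n-gram solutions.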