ICML 2025

Benign Overfitting in Token Selection of Attention Mechanism

Keitaro Sakamoto, Issei Sato

Abstract

We analyze "benign overfitting" in the token selection of the attention mechanism under a label noise setting, and ask two questions:

1. How do the training dynamics of token selection in attention evolve under label noise?
2. Does the obtained solution generalize well?

Figure: The model can select one token for each input. Training set: the model overfits to label noise (memorizes the training labels). Test set: the model still generalizes well. Legend: clean data; noisy data, classified as the flipped label.

Benign overfitting: an over-parameterized model achieves high generalization while perfectly fitting the training data. That is, the model overfits the training data but, surprisingly, without hurting generalization.

Suppose that the norm of the linear head scales as . Under some parameter assumptions (*, see our paper for details), we have