ICML2024

Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot

Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee

被引用 22 次

摘要

The transformer architecture has prevailed in various deep learning settings due to its exceptional capabilities to select and compose structural information. Motivated by these capabilities, Sanford et al. [48] proposed the sparse token selection task, in which transformers excel while fully-connected networks (FCNs) fail in the worst case. Building upon that, we strengthen the FCN lower bound to an average-case setting and establish an algorithmic separation of transformers over FCNs. Specifically, a one-layer transformer trained with gradient descent provably learns the sparse token selection task and, surprisingly, exhibits strong out-ofdistribution length generalization. We provide empirical simulations to justify our theoretical findings.