ICLR2025

Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

Jerry Yao-Chieh Hu, Wei-Po Wang, Ammar Gilani, Chenyang Li, Zhao Song, Han Liu

摘要

We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are prompt tuning on single-head transformers with only a single self-attention layer: (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers are universal approximators for sequenceto-sequence Lipschitz functions. In addition, we provide an exponential-in-dL and -in-(1/ϵ) lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the soft-prompt-induced keys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners. * Equal contribution. Full version and future updates are on arXiv. Throughout this work, universality or universal approximation refers to sequence-to-sequence (seq2seq) universal approximation. Published as a conference paper at ICLR 2025 Here, for a matrix M ∈ R a×b , we write ∥M ∥ max := max i,j |M i,j |. In this work, we aim to investigate the computational limits of all possible efficient algorithms for APTI(d, L, L p , B, δ F ) under realistic setting δ F = 1/poly(L). Contributions. We study the fundamental limits of prompt tuning. Our contributions are threefold: • Universality. We prove that prompt tuning transformers with the simplest configurationssingle-head, single-layer attention -are universal approximators for Lipschitz sequence-tosequence functions. Additionally, we reduce the required number of FFN layers in the prompt tuning transformer to 2. These results improve upon (Wang et al., 2023a), which requires deep transformers with O((L p + L)(1/ϵ) d ) attention layers and O((1/ϵ) d(Lp+L) ) FFN layers. • Memorization. We show that prompt tuning such simple transformers (1-head, 1-layer attention and 2 FNN layers) is capable of complete memorization of datasets without any assumption on the data. Moreover, we establish an exponential-in-dL and -in-(1/ϵ) lower bound on the required softprompt tokens for any dataset, where d, L are the data dimension and sequence length, respectively, and ϵ is the approximation error. Our results improve upon those of (Wang et al., 2023a), which consider datasets with only two-token sequences and focus solely on memorizing the final token. • Efficiency. We address Question 2 by identifying a phase transition behavior in efficiency based on the norm of soft-prompt-induced queries and keys (Theorem 3.1). This establishes an efficiency criterion for prompt tuning inference, enabling efficient (sub-quadratic) algorithms when the criterion is met. Additionally, we address Question 3 by pushing the limits of efficiency in prompt tuning toward nearly-linear time under this criterion (Theorem 3.2). Organization. Section 2 presents a statistical analysis on prompt tuning's universality and memory capacity. Section 3 explore the computational limits of inference with prompt tuning. The appendix includes the related works (Appendix A.1) and the detailed proofs of the main text. Notations. We use lower case letters to denote vectors and upper case letters to denote matrices. The index set 1, ..., I is denoted by [I], where I ∈ N + . We write ℓ α -norm as ∥•∥ α . Throughout this paper, we denote input, label sequences as X, Y ∈ R d×L and prompt sequences as P ∈ R d×Lp . STATISTICAL LIMITS OF PROMPT TUNING: UNIVERSALITY AND CAPACITY To better understand the expressive power of prompt tuning, we explore its universality (Sections 2.3 and 2.4) and memory capacity (Section 2.5) on a transformer of simplest configurations. Overview of Our Results. Let T h,s,r denote transformers with h heads, s hidden size, and r MLP neurons, and let ϵ represent the approximation error tolerance. Let X ∈ R d×L and P ∈ R d×Lp be the input and soft-prompt defined in Definition 1.1, respectively. We answer Question 1 affirmatively, and present three results for transformer models with 1-head, 1-layer attention layers: Lemma 2.1 (1-Head, 1-Layer Attention with Any-Rank Weight Matrices Is Contextual Mapping, Informal Version of Lemma 2.2). A 1-head, 1-layer attention mechanism with weight matrices W K , W Q , W V of any rank is able to associate each input sequence with a unique label sequence. Theorem 2.1 (Universality of Prompt Tuning T 1,1,4 Transformers with O(ϵ -d(Lp+L) ) FFN Layers, Informal Version of Theorem 2.3). Prompt tuning transformers with 1 head, a h