ICML2025
Non-Asymptotic Length Generalization
Thomas Chen, Tengyu Ma, Zhiyuan Li
摘要
Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for the minimum input length that guarantees length generalization, as a function of the complexity of ground-truth function under some given complexity measure. We refer to this minimum input length to length generalize as length complexity. We show the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is 2n -2 where n is the number of states of the ground-truth automaton. Our main results are upper bounds of length complexity for a subset of a transformer-related function class called C-RASP (Yang & Chiang, 2024) . We show that the length complexity of 1-layer C-RASP functions is O(T 2 ) when the ground-truth function has precision T , and that the length complexity of 2-layer C-RASP functions is O(T O(K) ) when the ground-truth function has precision T and K heads. recently shown to have a good prediction power on the length generalization performance of transformers (Zhou et al., 2023; Huang et al., 2024) . We study two subclasses of C-RASP, C-RASP 1 (Theorem 5.5) and C-RASP 2 (Theorem 5.6), which are of depth 1 and 2 respectively. Function Class Complexity of Ground-Truth Function Length of Training Data Sufficient to Generalize DFAs number of states, c 2c -2 (Proposition 3.10) CFGs description length, c no computable bound in c exists (Proposition 4.6) C-RASP 1 precision, T O(T 2 ) (Theorem 5.5) C-RASP 2 precision, T , and number of heads, K O(T O(K) ) (Theorem 5.6) Table 1: Summary of results: upper bounds on minimum length of binary strings in training data which suffices for length generalization. C-RASP is a class of functions which is a superset of functions expressible by finite-precision transformers. For L ∈ 1, 2, C-RASP L is a subclass of depth-L C-RASP functions. 2 Related Work Solomonoff (1964) proposed Bayesian-inference-based algorithms which when given a sequence of symbols, predict the next symbol according to the posterior distribution computed from the Solomonoff-Levin prior distribution. Gold (1967) introduced Identification-in-the-Limit as an asymptotic notion of learnability and proved that many classes of functions can be learned in this sense. Zhou et al. (2023) propose the RASP-L Conjecture, which says that whether a transformer length generalizes on a particular ground-truth function f * is well predicted by whether f * has a short RASP-L description. Huang et al. (2024) formulate the problem of length generalization-and the RASP-L Conjectureformally as a version of Identification-in-the-Limit and prove asymptotic results for identifying languages expressible by Limit Transformers. None of the works above provide non-asymptotic guarantees. We distinguish this work from previous work by providing non-asymptotic bounds on the length of the training data required in order to guarantee that the learner outputs a single hypothesis which exhibits perfect length generalization. Weiss et al. (2021) Zhou et al. (2023), Yang & Chiang (2024), and Shaw et al. (2024) study programming languages which capture the set of functions which transformers can express (like RASP). Regarding theoretical works for length generalization not related to transformers, Marsden et al. (2024) prove length generalization for learning linear dynamical systems with SGD. Abbe et al. (2024) prove out-of-domain generalization for boolean functions of a fixed input size. Non-Asymptotic Length Generalization Notation. For simplicity, we will fix the alphabet Σ = 0, 1 throughout the paper. We define computable (recursive) functions as functions which can be computed by a Turing Machine, which halts on all inputs. We say a function f : N → N is computably bounded (recursively bounded) if there exists a computable (recursive) function g : the binary string encoding of M . Let [N ] := 1, 2, 3, . . . , N -1, N . Denote 0, 1 n as the set of binary strings of length n, 0, 1 ≤n := n j=0 0, 1 j , and 0, 1 * := j≥0 0, 1 j . Note that ϵ, the empty string of length 0, is included in 0, 1 * . Denote 1[•] as the indicator function and cl(A) as the closure of set A ⊂ R d . Representation of Functions. We are interested in learning subsets of computable functions mapping 0, 1 * to 0, 1, denoted by F. Because a learning algorithm can