ICML 2025
Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
Nayoung Lee, Ziyang Cai, Avi Schwarzschild, Kangwook Lee, Dimitris Papailiopoulos
Abstract
Large language models often struggle with length generalization and with solving complex problem instances beyond their training distribution. We present a self-improvement approach in which models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulation, and maze solving, self-improvement enables models to solve problems far beyond their initial training distribution, for instance generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that in some cases filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. Additionally, starting from pretrained models significantly accelerates this self-improvement process for several tasks. Our results demonstrate how controlled weak-to-strong curricula can systematically teach a model logical extrapolation without any changes to the positional embeddings or the model architecture. Our findings provide evidence that self-improvement is a general-purpose and scalable solution for length and easy-to-hard generalization.
Our contributions can be summarized as:
1. We apply an iterative self-training framework to train transformers on arithmetic, maze, and string manipulation tasks, and successfully tackle easy-to-hard generalization to extreme out-of-distribution test data.
2. We motivate the importance of a carefully crafted self-improvement schedule and of label filtering based on length and majority voting, both of which are central to consistent self-improvement.
3. We show that the rate of self-improvement can be exponential and that pretrained models can achieve faster acceleration in easy-to-hard generalization.
4. We investigate some key failure modes of self-correction, in which label noise leads to an error avalanche, and discuss how they can be overcome through weak verification.
RELATED WORKS
Length and Easy-to-Hard Generalization. Length generalization is concerned with extrapolating to longer sequence lengths than those seen during training (Anil et al., 2022). Previous approaches to improving length generalization include architectural modifications, such as specialized positional embeddings (