ICML2025
The Surprising Effectiveness of Test-Time Training for Few-Shot Learning
Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, Jacob Andreas
Abstract
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of incontext task examples. We investigate the effectiveness of test-time training (TTT)-temporarily updating model parameters during inference using a loss derived from in-context examplesas a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to 6× higher accuracy compared to fine-tuned baselines-reaching 53.0% on the public validation set with an 8B-parameter LM and 61.9% when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the 10-shot setting by 7.3 percentage points (50.5% to 57.8%). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.