KDD2023

Training Large-scale Foundation Models on Emerging AI Chips

Aashiq Muhamed, Christian Bock, Rahul Solanki, Youngsuk Park, Yida Wang, Jun Huan

被引用 5 次

摘要

Foundation models such as ChatGPT and GPT-4 have garnered significant interest from both academia and industry due to their emergent capabilities, such as few-shot prompting, multi-step reasoning, instruction following, and model calibration. Such capabilities were previously only attainable with specially designed models, such as those using knowledge graphs, but can now be achieved on a much larger scale with foundation models. As the capabilities of foundation models have increased, so too have their sizes at a rate much faster than Moore's law. For example, the BERT large model was initially released as a 334M model in 2018, and by 2023, the largest GPT-4 models are estimated to range between 200-300B, representing an increase of three orders of magnitude in just five years. The training of foundation models requires massive computing power. For instance, training a BERT model on a single state-of-the-art GPU machine with multi-A100 chips can take several days, while training GPT-3 models on a large multi-instance GPU cluster can take several months to complete the estimated 3 X 1023 flops.