OSDI 2025

BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching

Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, Haibo Chen

Abstract

Model autoscaling is the key mechanism for serverless model-as-a-service, but faces a fundamental trade-off between scaling speed and storage/memory usage for caching parameters, and cannot meet frequent multi-host scaling demands. The root cause is a slow, blocking data plane: scaled instances stop while parameters load.

In this paper, we first show that the data plane (loading model checkpoints to accelerators) can be made fast with no caching or only O(1) caching, by loading parameters through the inter-GPU compute network: (1) its speed is comparable to that of a host cache, yet it is underutilized, and (2) scaling multiple instances needs no caching or only O(1) caching via network-optimized multicast. Second, autoscaling can be made live by shifting the scaling abstraction from coarse-grained instance-level to fine-grained layer-level, allowing us to offload layer computation from overloaded instances to scaled ones before parameters fully load.
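The layer-level abstraction can be illustrated with a toy sketch. It is not BlitzScale's implementation: the names `ScalingInstance` and `split_layers` are hypothetical, and it assumes, purely for illustration, that layers arrive in order and that the scaled instance executes the prefix of layers it already holds while the overloaded instance keeps the rest.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ScalingInstance:
    """Hypothetical instance that is still receiving parameters layer by layer."""
    num_layers: int
    loaded_layers: int = 0  # layers 0 .. loaded_layers-1 are resident (assumption: in-order arrival)

    def receive_layer(self) -> None:
        # Record the arrival of one more layer; never exceed the model depth.
        if self.loaded_layers < self.num_layers:
            self.loaded_layers += 1


def split_layers(inst: ScalingInstance) -> Tuple[range, range]:
    """Return (layers the scaled instance can run now, layers left on the overloaded instance)."""
    k = inst.loaded_layers
    return range(0, k), range(k, inst.num_layers)
```

In this simplified model, the scaled instance starts taking over work as soon as its first layer arrives, and the share left on the overloaded instance shrinks to nothing once loading completes.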

Under real-world workloads, BlitzScale achieves up to 94% lower tail latency than the state-of-the-art autoscaling system (ServerlessLLM), and cuts serving GPU time by 49% versus non-autoscaling systems like DistServe and vLLM at the same SLA. To ease adoption in ecosystems like vLLM and SGLang, we further build BlitzLoad, a lightweight checkpoint engine that brings BlitzScale's data plane to existing serving engines with only a few lines of code changes.