OSDI 2025

BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching

Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, Haibo Chen

Cited by 29

Abstract

Model autoscaling is the key mechanism for serverless model-as-a-service, but faces a fundamental trade-off between scaling speed and storage/memory usage for caching parameters, and cannot meet frequent multi-host scaling demands. The root cause is a slow, blocking data plane: scaled instances stop while parameters load.

In this paper, we first show that the data plane (loading model checkpoints to accelerators) can be made fast with no or O(1) caching by loading parameters through the inter-GPU compute network: (1) its speed is comparable to host cache yet underutilized, and (2) scaling multiple instances needs no or O(1) caching via network-optimized multicast. Second, autoscaling can be made live by shifting the scaling abstraction from coarse-grained instance-level to fine-grained layer-level, allowing us to offload layer computation from overloaded instances to scaled ones before parameters fully load.
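
To give a rough intuition for why one cached copy can feed several new instances, and why layer granularity matters, the sketch below models a simple pipelined chain in which a source forwards layers to the first target, which forwards them to the next, and so on. This is an illustrative assumption, not BlitzScale's multicast planner: the chain topology, the uniform per-layer transfer cost, and all function names are hypothetical.

```python
def chain_multicast_steps(num_layers: int, num_targets: int) -> int:
    """Steps until every target holds every layer in a pipelined chain.

    Assumed model: source -> target 0 -> target 1 -> ...
    One step moves one layer across one link, and all links work in parallel.
    """
    if num_layers <= 0 or num_targets <= 0:
        return 0
    return num_layers + num_targets - 1


def sequential_steps(num_layers: int, num_targets: int) -> int:
    """Baseline: the source sends the full model to each target, one after another."""
    if num_layers <= 0 or num_targets <= 0:
        return 0
    return num_layers * num_targets


def layers_ready(step: int, hop: int, num_layers: int) -> int:
    """Layers fully received after `step` steps by the target at chain position `hop` (0-based)."""
    return max(0, min(num_layers, step - hop))


if __name__ == "__main__":
    layers, targets = 32, 4
    print("pipelined chain:", chain_multicast_steps(layers, targets), "steps")
    print("one-by-one     :", sequential_steps(layers, targets), "steps")
    # Under a layer-level abstraction, a target could start serving the
    # layers it already holds instead of waiting for the full model.
    for hop in range(targets):
        print(f"target {hop} after 10 steps holds {layers_ready(10, hop, layers)} layers")
```

In this toy model the total time grows with the number of layers plus the number of targets rather than their product, while only the source needs a resident copy. The `layers_ready` helper hints at the live-scaling idea: a partially loaded instance already holds a usable prefix of layers.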

Under real-world workloads, BlitzScale achieves up to 94% lower tail latency than the state-of-the-art autoscaling system (ServerlessLLM), and cuts serving GPU time by 49% versus non-autoscaling systems like DistServe and vLLM at the same SLA. To ease adoption in ecosystems like vLLM and SGLang, we further build BlitzLoad, a lightweight checkpoint engine that brings BlitzScale's data plane to existing serving engines with only a few lines of code changes.