SOSP 2023
Paella: Low-latency Model Serving with Software-defined GPU Scheduling
Kelvin K. W. Ng, Henri Maxime Demoulin, Vincent Liu
Cited by 30
Abstract
Model serving systems play a critical role in multiplexing machine learning inference jobs across shared GPU infrastructure. These systems have traditionally sat at a high level of abstraction, receiving jobs from clients through a narrow API and relying on black-box GPU scheduling mechanisms when dispatching them. Fundamental limitations in the built-in GPU hardware scheduler, in particular, can lead to inefficiency when executing concurrent jobs. The current abstraction level also incurs system overheads that are similarly most significant when the GPU is heavily shared.