SOSP 2023

Paella: Low-latency Model Serving with Software-defined GPU Scheduling

Kelvin K. W. Ng, Henri Maxime Demoulin, Vincent Liu

Cited 30 times

Abstract

Model serving systems play a critical role in multiplexing machine learning inference jobs across shared GPU infrastructure. These systems have traditionally sat at a high level of abstraction, receiving jobs from clients through a narrow API and relying on black-box GPU scheduling mechanisms when dispatching them. Fundamental limitations in the built-in GPU hardware scheduler, in particular, can lead to inefficiency when executing concurrent jobs. The current abstraction level also incurs system overheads that are likewise most significant when the GPU is heavily shared.