ICML 2025

Auto-reconfiguration for Latency Minimization in CPU-based DNN Serving

Ankit Bhardwaj, Amar Phanishayee, Deepak Narayanan, Ryan Stutsman

Abstract

In this paper, we investigate how to push the performance limits of serving Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe that while intra-operator parallelism across multiple threads effectively reduces inference latency, it provides diminishing returns. Our primary insight is that instead of running a single instance of a model with all available threads on a server, running multiple instances, each with smaller batch sizes and fewer threads for intra-op parallelism, can provide lower inference latency. However, the right configuration is difficult to determine manually since it is workload-dependent (DNN model and batch size used by the serving system) and deployment-dependent (number of CPU cores on a server). We present Packrat, a new serving system for online inference that, given a model and batch size (B), algorithmically picks the optimal number of instances (i), the number of threads each should be allocated (t), and the batch sizes each should operate on (b) to minimize latency. Packrat is built as an extension to TorchServe and supports online reconfigurations to avoid serving downtime. Averaged across a range of batch sizes, Packrat improves inference latency by 1.43× to 1.83× on a range of commonly used DNNs.

[...] instances, each with smaller batch sizes and fewer threads for intra-op parallelism, can provide lower inference latency. In the general case, determining the optimal configuration of ⟨instances, threads, batch⟩ (or ⟨i, t, b⟩ for short) is challenging because it is workload- and deployment-specific. The optimal configuration depends on the specific model being served, input dimensions like the batch size (which is itself dependent on the request arrival rate), and the hardware (e.g., number of cores and memory bandwidth).
Furthermore, even if there were a hypothetical oracle that could provide the optimal ⟨i, t, b⟩ configuration, a user would still have to manually recognize when to change configurations and then reconfigure existing serving systems while specifying thread-core affinities appropriately.

Packrat uses a novel algorithm to dynamically determine the optimal ⟨i, t, b⟩ configuration for models on individual servers, given a batch of inputs for the model. It does this automatically using a small amount of targeted profiling. From this limited profiling information, it formulates ⟨i, t, b⟩ configurations that are expected to optimize average batch latency for different batch sizes by solving a 2-dimensional knapsack problem using dynamic programming. This lets Packrat quickly find configurations that balance intra-op latency with multi-instance execution, without the need for user input and without impractically profiling all possible configuration combinations. Combined with its mechanism for transitioning between configurations, this lets Packrat dynamically reconfigure the model instances and threads used for inference, entirely online, so as to optimize inference latency as workloads change.

We evaluate Packrat on a single server running TorchServe. Over several models, we show that Packrat improves inference latency and throughput by 1.43× to 1.83×, averaged over a range of batch sizes, relative to the baseline approach that maximizes intra-op parallelism. Packrat's code is open source and can be accessed at https://github.com/msr-fiddle/packrat.
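To make the 2-dimensional knapsack formulation concrete, the following is a minimal Python sketch of a dynamic program over the two budgets (threads and batch items). It is an illustration under stated assumptions, not Packrat's actual implementation: the function name `best_config`, the synthetic latency model `toy_latency`, and the objective (a batch finishes when its slowest instance finishes, so the program minimizes the maximum per-instance latency) are all assumptions introduced here. A real deployment would presumably fill `profile` from measured latencies rather than from a formula.

```python
def toy_latency(threads, batch):
    # Hypothetical latency model (not measured data): a fixed overhead plus
    # per-item work whose parallel part shrinks with more threads, so extra
    # threads give diminishing returns.
    return 1.0 + batch * (0.2 + 2.0 / threads)


def best_config(profile, total_threads, batch_size):
    """Pick instances [(threads, batch), ...] that use at most total_threads
    threads, cover exactly batch_size items, and minimize the latency of the
    slowest instance (assumed objective).

    profile maps (threads, batch) -> latency of a single instance.
    """
    inf = float("inf")
    # opt[t][b] = (latency, config) for at most t threads and exactly b items.
    opt = [[(inf, [])] * (batch_size + 1) for _ in range(total_threads + 1)]
    for t in range(total_threads + 1):
        opt[t][0] = (0.0, [])  # no items left: nothing more to run

    for t in range(1, total_threads + 1):
        for b in range(1, batch_size + 1):
            best = (inf, [])
            for (ti, bi), lat in profile.items():
                if ti > t or bi > b:
                    continue  # this instance does not fit in the budgets
                rest_lat, rest_cfg = opt[t - ti][b - bi]
                cand = max(lat, rest_lat)  # instances run in parallel
                if cand < best[0]:
                    best = (cand, rest_cfg + [(ti, bi)])
            opt[t][b] = best
    return opt[total_threads][batch_size]


# Example: 16 threads, batch size 8, with the toy model standing in for profiling.
profile = {(t, b): toy_latency(t, b) for t in range(1, 17) for b in range(1, 9)}
latency, config = best_config(profile, total_threads=16, batch_size=8)
single = profile[(16, 8)]  # baseline: one instance with all threads
print(f"multi-instance: {latency:.2f}, single instance: {single:.2f}")
print("instances (threads, batch):", config)
```

Under this toy model the search prefers several small instances over one instance that uses all threads, which mirrors the paper's insight. The sketch enumerates every `(threads, batch)` pair for simplicity, whereas the text describes limiting profiling to a small targeted set.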