SIGMOD 2025

CARINA: An Efficient CXL-Oriented Embedding Serving System for Recommendation Models

Peiqi Yin, Qihui Zhou, Xiao Yan, Chao Wang, Eric Lo, Changji Li, Lan Lu, Hua Fan, Wenchao Zhou, Ming-Chang Yang, James Cheng

Cited by 2

Abstract

Embedding-based recommendation models (ERMs) require large memory to host huge embedding tables and involve massive data traffic to read the embeddings. As a new interconnect, CXL suits ERMs because it can scale up single-machine memory with performant remote memory devices. However, directly running DRAM-based ERM serving systems on CXL yields poor performance because the bandwidth of CXL is notably lower than that of DRAM and is easily saturated, making CXL memory the bottleneck. The non-uniform memory access (NUMA) architecture of modern CXL servers further degrades system performance. In this paper, we design Carina for ERM serving on heterogeneous memory with CXL by taking this bandwidth asymmetry into account. In particular, Carina balances memory accesses across memory devices by storing hot embeddings, those with high access frequencies, in DRAM and by specifying the placement of embedding tables on the NUMA nodes. Moreover, Carina adopts bandwidth-aware task execution, which decomposes each batch of ERM requests into fine-grained tasks and schedules these tasks to control the real-time utilization of CXL bandwidth and avoid instantaneous saturation. We evaluate Carina on real CXL devices and find that it outperforms a CXL-oblivious baseline by an average of 5.38x in system throughput and 4.04x in request latency.
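The two ideas in the abstract, frequency-based placement of hot embeddings in DRAM and bandwidth-aware scheduling of fine-grained tasks, can be illustrated with a minimal sketch. This is not Carina's actual algorithm: the function names (`place_hot_embeddings`, `split_into_tasks`, `schedule_tasks`), the fixed DRAM capacity counted in embeddings, and the per-round budget counted in CXL reads as a stand-in for bandwidth are all assumptions made for illustration. NUMA-aware table placement is not modeled.

```python
from collections import Counter


def place_hot_embeddings(access_trace, dram_capacity):
    """Split embedding ids into (dram_ids, cxl_ids) by access frequency.

    Assumption: dram_capacity is the number of embeddings that fit in DRAM.
    """
    counts = Counter(access_trace)
    # Rank by descending frequency; break ties by id for determinism.
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return set(ranked[:dram_capacity]), set(ranked[dram_capacity:])


def split_into_tasks(batch_ids, task_size):
    """Decompose one batch of lookups into fixed-size fine-grained tasks."""
    return [batch_ids[i:i + task_size]
            for i in range(0, len(batch_ids), task_size)]


def schedule_tasks(tasks, dram_ids, cxl_budget):
    """Greedily pack tasks into rounds with at most cxl_budget CXL reads.

    Assumption: one CXL read per lookup that misses DRAM is a proxy for
    CXL bandwidth use. A single task whose cost exceeds the budget still
    runs, alone in its own round.
    """
    rounds, current, used = [], [], 0
    for task in tasks:
        cost = sum(1 for i in task if i not in dram_ids)
        if current and used + cost > cxl_budget:
            rounds.append(current)
            current, used = [], 0
        current.append(task)
        used += cost
    if current:
        rounds.append(current)
    return rounds


if __name__ == "__main__":
    trace = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]  # hypothetical access history
    dram, cxl = place_hot_embeddings(trace, dram_capacity=2)
    tasks = split_into_tasks([1, 2, 3, 4, 2, 3], task_size=2)
    for n, r in enumerate(schedule_tasks(tasks, dram, cxl_budget=2)):
        print(f"round {n}: {r}")
```

In this toy setting, tasks that mostly hit DRAM are grouped into the same round, while a task dominated by CXL reads is deferred to the next round so that no round exceeds the assumed CXL budget. A real system would measure bandwidth in bytes per unit time rather than in lookup counts.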