SIGMOD 2025
P4KVS: A Role-Replica Separation Offloading Method to Achieve In-Network Consistency for KV Stores Based on P4 Switches
Haojuan Li, Zongpu Zhang, Chenzhen Ye, Ruohan Tang, Jian Li, Haibing Guan, Qiaoling Wang, Pengpeng Zhou
Abstract
Strong consistency, particularly linearizability, is essential for distributed DBMSs deployed in correctness-critical domains such as finance and defense. In general, an ideal linearizable DBMS should satisfy two key principles: (1) matching single-node (no consistency cost) read/write performance under strong consistency, and (2) practical deployability via general database compatibility. Unfortunately, existing solutions fall short on both fronts. Achieving strong consistency often comes with steep performance penalties. For example, etcd, a widely used Raft-based system, achieves only 5.9% of the throughput of LevelDB, a single-node store without consistency overhead. To achieve higher performance, software approaches adopt weaker consistency models (e.g., ZAB), rely on narrow network assumptions (e.g., NOPaxos), or expose protocol internals to clients (e.g., CURP), yet still fail to close the performance gap. Recent programmable networking hardware offers promising advances, yet current hardware solutions face practical limitations, including minimal storage and incompatibility with general-purpose databases. We propose P4KVS, the first practical Raft-based in-network consensus offloading solution leveraging programmable switches (P4) for distributed key-value stores. P4KVS offloads only the Leader role to the switch while retaining Followers on servers. Under linearizability, it achieves 74% of single-node LevelDB's throughput for write-heavy workloads, and up to 222.4% for read-heavy workloads by distributing reads across three replicas. This demonstrates that, even under strong consistency, P4KVS can match or exceed the performance of a single-node system. Compared to etcd (which also uses Raft), P4KVS delivers 37.5× higher read throughput and 3520× lower write latency.
These results validate our hardware role-replica separation design in eliminating software Raft bottlenecks, while preserving compatibility via standard database interfaces (e.g., LevelDB, etcd) and scaling beyond typical switch memory constraints.
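As a quick sanity check on the throughput figures quoted above, the following sketch divides the aggregate read-heavy figure by the number of replicas. It assumes, purely for illustration, that reads are spread evenly across the three replicas; the constant and function names are hypothetical and do not come from the P4KVS implementation.

```python
# Figures quoted in the abstract, as percentages of single-node LevelDB throughput.
WRITE_HEAVY_PCT = 74.0   # write-heavy workload, under linearizability
READ_HEAVY_PCT = 222.4   # read-heavy workload, reads served by three replicas
NUM_REPLICAS = 3


def per_replica_share(aggregate_pct: float, replicas: int) -> float:
    """Average throughput share per replica.

    Assumes reads are distributed evenly across replicas (an illustrative
    assumption, not a measured property of the system).
    """
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    return aggregate_pct / replicas


share = per_replica_share(READ_HEAVY_PCT, NUM_REPLICAS)
print(f"read-heavy, per replica: {share:.1f}% of single-node LevelDB")
print(f"write-heavy, aggregate:  {WRITE_HEAVY_PCT:.1f}% of single-node LevelDB")
```

Under that even-split assumption, each replica serves reads at roughly 74.1% of single-node LevelDB throughput, which is close to the 74% reported for write-heavy workloads.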