WWW2025

GraphCom: Communication Hierarchy-aware Graph Engine for Distributed Model Training

Xinbiao Gan, Tiejun Li, Liang Wu, Qiang Zhang, Lingyun Song, Bo Yang, Jie Liu, Kai Lu

1 citation

Abstract

Efficient processing of large-scale graphs with billions to trillions of edges is essential for training graph-based large language models (LLMs) in web-scale systems. The increasing complexity and size of these models create significant communication challenges due to the extensive message exchanges required across distributed nodes. Current graph engines struggle to effectively scale across hundreds of computing nodes because they often overlook variations in communication costs within the interconnection hierarchy. This paper presents GraphCom, a communication-efficient message graph engine for graph processing on supercomputers. Our key idea is to leverage the network topology information to perform communication hierarchy-aware message aggregation, where messages are (i) gathered to the responsible nodes (referred to as monitors) in the source domains, (ii) transferred between monitors, and (iii) scattered to the target nodes in the target domains. GraphCom's aggregation is more aggressive in that each source domain (instead of the source node). We have implemented GraphCom on top of MPI. We demonstrate GraphCom's effectiveness with synthetic benchmarks and real-world graphs, utilizing up to 79,024 nodes and over 1.2 million processor cores, demonstrating that GraphCom surpasses leading graph- parallel systems and state-of-the-art counterparts in both throughput and scalability. Moreover, we have deployed GraphCom on a production supercomputer, where it consistently outperforms the top solutions on the Graph500 list. These results highlight the potential GraphCom has to significantly improve the efficiency of distributed large-scale graph-based LLM training by optimizing communication between distributed systems, making it an invaluable graph engine for distributed training tasks on web-scale graphs.