SIGMOD 2025
SWASH: A Flexible Communication Framework with Sliding Window-Based Cache Sharing for Scalable DGNN Training
Zhen Song, Yu Gu, Tianyi Li, Yushuai Li, Qing Sun, Yanfeng Zhang, Christian S. Jensen, Ge Yu
2 citations
Abstract
Dynamic Graph Neural Networks (DGNNs) are effective at capturing multidimensional data and enable many important applications. As model training is computationally intensive, distributed DGNN training is employed to accommodate large data. Also, when training DGNNs, so-called sliding window training is used predominantly, as it enhances both accuracy and efficiency. However, current distributed frameworks, such as snapshot partitioning, chunk-based partitioning, and L-hop cache-based communication-free vertex partitioning, are inherently incompatible with sliding window training. While communication-based vertex partitioning supports sliding window training, its design for static graphs limits its effectiveness in distributed DGNN training. Specifically, existing partitioning strategies fail to optimize communication across snapshots, while existing cache reuse and communication scheduling strategies ignore opportunities for optimization between sliding windows. To support distributed sliding window training, we present SWASH, a scalable and flexible communication framework that utilizes a Sliding Window-based cAche SHaring technique. Specifically, we propose a flexible communication framework that supports ratio adjustment and timing selection, as well as hyperparameter settings and adaptive scheduling. We also propose a lightweight partitioning strategy tailored to sliding window-based DGNN training to reduce both partitioning and communication overheads. Finally, to alleviate decreases in accuracy due to reduced communication, we propose a cache-sharing technique based on sliding windows for sharing boundary vertex embeddings. Comprehensive experiments show that SWASH achieves an average training speedup of 9.44× over state-of-the-art frameworks while maintaining the accuracy of fully communicating, non-caching training frameworks.
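To illustrate the sliding window training that the abstract refers to, the following minimal Python sketch enumerates windows over a sequence of graph snapshots and lists the snapshots shared by consecutive windows. This overlap is the kind of opportunity that optimization between sliding windows could exploit. The function names, the window size, and the stride are assumptions chosen for illustration; they are not taken from SWASH.

```python
from typing import List, Sequence, Tuple


def sliding_windows(num_snapshots: int, window: int, stride: int = 1) -> List[Tuple[int, ...]]:
    """Return the snapshot indices of each window (hypothetical helper, not SWASH's API)."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [
        tuple(range(start, start + window))
        for start in range(0, num_snapshots - window + 1, stride)
    ]


def shared_snapshots(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Return the snapshots that appear in both windows, in the order of the first."""
    in_b = set(b)
    return [s for s in a if s in in_b]


# Assumed toy setting: 6 snapshots, window of 3, stride of 1.
windows = sliding_windows(num_snapshots=6, window=3, stride=1)
overlaps = [shared_snapshots(w1, w2) for w1, w2 in zip(windows, windows[1:])]

for w, o in zip(windows[1:], overlaps):
    print(f"window {w} shares snapshots {o} with the previous window")
```

With a stride of 1, each window shares all but one snapshot with its predecessor, so state computed for the shared snapshots is a natural candidate for reuse rather than recomputation or re-communication.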