ASE 2025

Kair: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training

Yitang Yang, Junhong Liu, Jiapeng Chen, Xiaoyang Sun, Tianyu Wo, Chunming Hu, Chengru Song, Jin Ouyang, Renyu Yang

Abstract

Distributed deep learning training on large-scale clusters underpins modern artificial intelligence. However, its inherent characteristics make it particularly sensitive to stragglers, that is, slow workers that can significantly slow down the entire job. Observability tools are essential for identifying stragglers, but prevailing profiling tools either target single-node analysis and lack visibility across workers, or recognize stragglers but report only high-level symptoms, giving engineers insufficient insight into the underlying causes. We design Kair, a robust, production-grade observability tool. Kair takes a hierarchical approach that moves from statistical anomaly detection to causal inference: it uses Kolmogorov-Smirnov statistics to identify statistically anomalous workers, then applies a causal path tracing algorithm to determine the specific operations, such as computation or communication, that are responsible for the delay. Evaluated on a production cluster of 2,048 NVIDIA A800 GPUs, Kair proved highly effective at detecting latent framework-level stragglers that conventional tools often overlook. It offers precise suggestions that markedly reduce processing inefficiencies and engineering workload.
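
To make the statistical stage concrete, the following is a minimal sketch of flagging a slow worker with a two-sample Kolmogorov-Smirnov statistic. It is not Kair's implementation: the abstract does not say how the statistic is applied, so the input format (per-worker lists of step durations), the comparison against the pooled remaining workers, the `threshold` value, and the function names `ks_statistic` and `flag_stragglers` are all assumptions made for illustration. Because the KS statistic is symmetric and does not indicate which distribution is slower, the sketch adds a median comparison, which is also an assumption.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def flag_stragglers(latencies, threshold=0.5):
    """Return (worker, statistic) pairs for workers that look anomalously slow.

    latencies: dict mapping a worker id to its per-step durations
               (hypothetical input format).
    threshold: hypothetical cut-off on the KS statistic; a real system
               would more likely calibrate this or use a p-value.
    """
    flagged = []
    for worker, samples in latencies.items():
        # Pool the samples of every other worker as the reference distribution.
        rest = [x for w, xs in latencies.items() if w != worker for x in xs]
        if not samples or not rest:
            continue
        d = ks_statistic(samples, rest)
        # Assumption: only report workers that differ AND are slower.
        if d > threshold and median(samples) > median(rest):
            flagged.append((worker, d))
    # Most anomalous worker first.
    return sorted(flagged, key=lambda item: -item[1])


if __name__ == "__main__":
    step_times = {
        "w0": [1.00, 1.10, 0.90, 1.05],
        "w1": [1.02, 0.98, 1.10, 0.95],
        "w2": [1.00, 1.03, 0.97, 1.08],
        "w3": [1.60, 1.70, 1.65, 1.80],  # consistently slower worker
    }
    print(flag_stragglers(step_times))
```

A sketch like this only covers detection. Attributing the delay to a specific computation or communication operation, as the causal path tracing stage does, would need per-operation timing data that is outside the scope of this example.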