ASE2025
KAIOPS: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale
Zeying Wang, Junhong Liu, Penghao Zhang, Xiaoyang Sun, Xu Wang, Tianyu Wo, Chunming Hu, Chengru Song, Jin Ouyang, Renyu Yang
Abstract
The resilience of large-scale AI training platforms is fundamental to enabling contemporary AI innovation and business development. However, with the rapid increase in the scale and complexity of AI model training tasks, anomalies have become the norm rather than the exception. Failing to handle them properly may lead to enormous resource waste and prolonged development cycles. Traditional anomaly detection methods struggle with the complex temporal characteristics and extreme class imbalance inherent in training tasks, and they fall short of automating root cause analysis and follow-up remediation. This paper presents KAIOPS, an end-to-end automated platform solution for handling anomalies, together with engineering experience from the daily operational maintenance of large-scale AI training clusters at Kuaishou. KAIOPS employs a Temporal Context Encoding mechanism to capture and encode long-term trends and critical temporal context information within fault evolution. The detection model adopts a dynamic class-weighted loss function to improve detection performance. To deliver a complete end-to-end intelligent processing pipeline, KAIOPS further leverages a knowledge graph and LLMs for automated root cause analysis and actionable solution generation. Extensive experiments on data collected from Kuaishou’s production-grade training clusters show the superior performance of the proposed approach. KAIOPS has been deployed at Kuaishou in both testbed and production-grade environments comprising over 10,000 GPUs, and it accelerates reliability assurance for industry-scale model training and serving.