FSE2025

Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning

Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazit, Joanna Kisaakye, Jesse Nyyssölä

Abstract

Microservice-based systems (MSS) can fail with a variety of fault types because of their complex and dynamic nature. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), practitioners must still perform further root cause analysis by hand to diagnose the specific fault type and failure reason behind each detected abnormal trace, particularly when an abnormal trace does not stem directly from a specific service. In this paper, we propose a novel AIOps framework, TraFaultDia, to automatically classify abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for an MSS. TraFaultDia is trained with a meta-learning approach on several abnormal trace classification tasks, each with a few labeled instances from an MSS. After training, TraFaultDia can quickly adapt to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia’s use cases are scalable, depending on how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two representative MSS, TrainTicket and OnlineBoutique, with open datasets. In these datasets, each fault category is tied to the faulty system component(s) (service/pod) with a root cause. TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. Our results show that, within the MSS it is trained on, TraFaultDia achieves an average accuracy of 93.26% and 85.20% across 50 new, unseen abnormal trace classification tasks for TrainTicket and OnlineBoutique, respectively, when provided with 10 labeled instances for each fault category per task in each system.
In the cross-system context, when TraFaultDia is applied to an MSS different from the one it is trained on, it achieves an average accuracy of 92.19% and 84.77% on the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, again with 10 labeled instances provided for each fault category per task in each system.
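
To make the notion of an "abnormal trace classification task" concrete, the following minimal sketch shows one common way to sample a few-shot task: pick some fault categories, then draw a small labeled support set and a disjoint query set from each. This is an illustration under stated assumptions, not the TraFaultDia implementation. The function name `sample_task`, the parameters `n_way` and `n_query`, and the representation of traces as plain string IDs are all hypothetical; only the figure of 10 labeled instances per fault category comes from the abstract.

```python
import random
from typing import Dict, List, Tuple

# A labeled example is (trace_id, fault_category); both are plain strings here.
Example = Tuple[str, str]


def sample_task(
    traces_by_category: Dict[str, List[str]],
    n_way: int,      # assumed: number of fault categories per task
    k_shot: int,     # labeled instances per category (10 in the abstract)
    n_query: int,    # assumed: held-out instances per category for evaluation
    rng: random.Random,
) -> Tuple[List[Example], List[Example]]:
    """Sample one few-shot classification task as (support, query) sets."""
    # Choose which fault categories this task must discriminate between.
    categories = rng.sample(sorted(traces_by_category), n_way)
    support: List[Example] = []
    query: List[Example] = []
    for category in categories:
        # Draw without replacement so support and query never share a trace.
        picked = rng.sample(traces_by_category[category], k_shot + n_query)
        support += [(trace, category) for trace in picked[:k_shot]]
        query += [(trace, category) for trace in picked[k_shot:]]
    return support, query
```

In a meta-learning setup of this kind, a model would typically adapt on the support set of each sampled task and be scored on the query set, with accuracy averaged over many such tasks (50 per system in the evaluation described above).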