ISSTA2025

Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing

Yanzhou Mu, Juan Zhai, Chunrong Fang, Xiang Chen, Zhixiang Cao, Peiran Yang, Kexin Zhao, An Guo, Zhenyu Chen

1 citation

Abstract

Deep learning (DL) frameworks are essential to DL-based software systems, and framework bugs may lead to substantial disasters, thus requiring effective testing. Researchers adopt DL models or single interfaces as test inputs and analyze their execution results to detect bugs. However, floating-point errors, inherent randomness, and the complexity of test inputs make it challenging to analyze execution results effectively, leading to existing methods suffering from a lack of suitable test oracles. Some researchers utilize metamorphic testing to tackle this challenge. They design Metamorphic Relations (MRs) based on input data and parameter settings of a single framework interface to generate equivalent test inputs, ensuring consistent execution results between original and generated test inputs. Despite their promising effectiveness, they still face certain limitations. (1) Existing MRs overlook structural complexity, limiting test input diversity. (2) Existing MRs focus on limited interfaces, which limits generalization and necessitates additional adaptations. (3) Their detected bugs are related to the result consistency of single interfaces and far from those exposed in multi-interface combinations and runtime metrics (e.g., resource usage). To address these limitations, we propose ModelMeta, a model-level metamorphic testing method for DL frameworks with four MRs focused on the structure characteristics of DL models. ModelMeta augments seed models with diverse interface combinations to generate test inputs with consistent outputs, guided by the QR-DQN strategy. It then detects bugs through fine-grained analysis of training loss/gradients, memory/GPU usage, and execution time. We evaluate the effectiveness of ModelMeta on three popular DL frameworks (i.e., MindSpore, PyTorch, and ONNX) with 17 DL models from ten real-world tasks ranging from image classification to object detection. Results demonstrate that ModelMeta outperforms state-of-the-art baselines from the perspective of test coverage and diversity of generated test inputs. Regarding bug detection, ModelMeta has identified 31 new bugs, of which 27 have been confirmed, and 11 have been fixed. Among them, seven bugs existing methods cannot detect, i.e., five wrong resource usage bugs and two low-efficiency bugs. These results demonstrate the practicality of our method.