ASE 2025
When AllClose Fails: Round-Off Error Estimation for Deep Learning Programs
Qi Zhan, Xing Hu, Yuanyi Lin, Tongtong Xu, Xin Xia, Shanping Li
Abstract
Deep learning programs are continually optimized for performance through kernel-level optimizations, parallel training, and low-precision arithmetic. These optimizations yield different implementations that are mathematically equivalent. Because of round-off error in floating-point computation, the outputs of such implementations can differ even when mathematical equivalence holds. When the difference between the outputs of a customized implementation and a reference implementation exceeds the tolerance thresholds, it is difficult for developers to distinguish acceptable round-off error from an implementation bug. This paper proposes an approach called Render that classifies the numerical discrepancies between two implementations by estimating the maximum round-off error. Render combines dynamic interval arithmetic with round-off error analysis to compute output bounds that are both scalable and tight. We demonstrate the effectiveness of our method on real-world issues by comparing it with the state-of-the-art tool SATIRE and with a High-Precision Re-execution baseline. Experimental results show that our approach identifies at least 25% more errors than SATIRE and achieves an average speedup of 19× over it, enabling developers to debug and optimize implementations more efficiently.
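The distinction between acceptable round-off error and a bug can be illustrated with a small sketch. This is not Render's algorithm, which the abstract only describes at a high level. It swaps in the textbook a priori bound for floating-point summation, |fl(S) - S| <= gamma_k * sum|x_i| with gamma_k = k*u / (1 - k*u), where k is the longest chain of additions and u = 2^-24 is the unit round-off of binary32. All function names (`f32`, `sum_sequential`, `sum_pairwise`, `roundoff_bound`, `classify`, `within_tolerance`) and the input data are hypothetical and chosen for illustration only.

```python
import math
import struct

U32 = 2.0 ** -24  # unit round-off of IEEE 754 binary32 (assumed precision)


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_sequential(xs):
    """Reference implementation: left-to-right summation in binary32."""
    acc = 0.0
    for x in xs:
        # Adding in binary64 and rounding once to binary32 gives the
        # correctly rounded binary32 sum of two binary32 operands.
        acc = f32(acc + x)
    return acc


def sum_pairwise(xs):
    """Customized implementation: tree (pairwise) summation in binary32."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return f32(sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:]))


def roundoff_bound(xs, depth):
    """Textbook worst-case bound gamma_depth * sum|x_i| for a summation
    whose longest chain of additions has length `depth`."""
    gamma = depth * U32 / (1.0 - depth * U32)
    return gamma * sum(abs(x) for x in xs)


def within_tolerance(a, b, rtol=1e-7, atol=0.0):
    """Scalar fixed-tolerance check of the form |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


def classify(ref, custom, bound_ref, bound_custom):
    """Two outputs that each lie within their bound of the exact result can
    differ by at most the sum of the two bounds."""
    if abs(ref - custom) <= bound_ref + bound_custom:
        return "round-off"
    return "suspected bug"


# Hypothetical inputs, rounded to binary32 so both implementations see
# exactly the same values.
xs = [f32(0.1 * (i % 7 + 1)) for i in range(1000)]

ref = sum_sequential(xs)
custom = sum_pairwise(xs)
buggy = sum_pairwise(xs[:-1])  # deliberately drops the last element

b_ref = roundoff_bound(xs, len(xs) - 1)                      # n - 1 rounded additions
b_custom = roundoff_bound(xs, math.ceil(math.log2(len(xs))))  # tree depth

print("fixed tolerance passes:", within_tolerance(custom, ref))
print("custom:", classify(ref, custom, b_ref, b_custom))
print("buggy: ", classify(ref, buggy, b_ref, b_custom))
```

A fixed tolerance has to be chosen without knowledge of the computation, whereas the bound here scales with the input magnitudes and the length of the addition chain. The worst-case bound is loose for long chains, which is one reason a tighter analysis, such as the combination of dynamic interval arithmetic and round-off analysis that the abstract describes, is desirable.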