CCS 2025

Deep Learning from Imperfectly Labeled Malware Data

Fahad Alotaibi, Euan Goodbrand, Sergio Maffeis

Abstract

Deep learning approaches have achieved remarkable performance in malware classification and detection. However, their success depends on large, accurately labeled datasets, a requirement that is both critical and difficult to meet in the malware domain. In practice, most malware datasets are labeled automatically from the outputs of antivirus engines, a process that often introduces significant label noise. Such noise can severely degrade the performance and generalizability of deep learning models.