NDSS 2026

Indicator of Benignity: An Industry View of False Positive in Malicious Domain Detection and its Mitigation

Daiping Liu, Danyu Sun, Zhenhua Chen, Shu Wang, Zhou Li

Abstract

lack an in-depth understanding of the magnitude and impact of false positives (FPs) in large-scale longitudinal real-world deployments. Considering that fear of FPs is one of the main obstacles to adopting these systems in production [63], more effort should be put into understanding the characteristics of their FPs and investigating how to further reduce them.

Measurement Study. In this work, we conduct the first large-scale measurement study of FPs of malicious domain detectors. The primary challenge of such a study lies in collecting data and determining ground truth. To address this challenge, our study relies on FPs reported by users of a security vendor, SV. SV has built and deployed tens of malicious domain detectors that detect 1.6M new malicious domains from ∼7B DNS queries on average per day. These malicious domains are used by the firewalls of more than 65K organizations around the world. Whenever users notice potential FPs on their firewalls, they can report them to SV, whose security researchers manually investigate and decide whether an FP report is correct (i.e., Accepted FP) or not (i.e., Rejected FP). An additional unique advantage of user-reported FPs is that users often provide a justification of benignity, which gives us deeper insight into why FPs are generated. Considering that users of SV are mostly Security Operations Center analysts of enterprises, their justification, along with manual verification by SV researchers, makes these FP reports a dataset with reasonable ground truth. From 2019 to 2024, users reported 123,491 FPs: 121,073 reports for 118,093 unique fully qualified domain names (FQDNs) were accepted, and 2,418 reports for 2,022 unique FQDNs were rejected. Our analysis of this dataset reveals several interesting findings. First, half of the FPs are reported within 120 days after the domains are detected as malicious. Thus, to evaluate the FP rate, it is more reasonable to deploy detectors in production for more than 4 months.
Second, FPs in production have a long-tail distribution, with 97.7% of FQDNs in FPs being reported only once by one user. Most of these FQDNs are under unique root domains. As a result, FP mitigation approaches need to be generic to cover a diverse set of benign FQDNs that could become FPs. We further find that current popularity-based top lists that are commonly adopted by detectors, such as Tranco [62], cannot effectively mitigate FPs. In particular, they can only cover at most ∼38% of FPs. Therefore, more effort is required to identify better generic ways to mitigate FPs in production. Finally, for ∼55% FPs, benign indicators of the

Abstract. Malicious domain detection serves as a critical technique to keep users safe against cyber attacks. Although these systems have demonstrated remarkable detection capabilities, the magnitude of their false positives (FPs) in the real world remains unknown and is often overlooked. To shed light on this essential aspect, we conduct the first measurement study using six years of FP reports collected from one of the largest global cybersecurity vendors. Our findings reveal that the popularity-based top domain lists that are commonly adopted by current detection systems are insufficient to avoid FPs. In fact, there are still a non-trivial number of FPs in production. We posit that one of the main reasons is that efforts in this area have predominantly focused on detecting malicious indicators, i.e., Indicator of Compromise (IOC), and have made light of the benign ones, i.e., Indicator of Benignity (IOB). In this paper, we make the first effort focusing on IOB detection. Our work is built upon our key finding that for many FPs in production, their IOBs can be found on the Internet. However, due to the openness of the Internet and unstructured Web content, we face two main challenges in identifying these IOBs: understanding what an IOB is and assessing the trustworthiness of an IOB.
To address these challenges, we propose a transitive trust model for IOB and implement it in a system called IOBHunter. IOBHunter leverages LLMs and chain-of-thought (CoT), which have demonstrated promising capabilities in addressing several other security threats. Our evaluation using a dataset that contains verified FPs shows that IOBHunter can achieve 99.22% precision and 68.6% recall. IOBHunter is further evaluated in a two-month real-world deployment, in which it identified 4,338 confirmed FPs and 2,051 compromised domains.
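To put the reported precision and recall in perspective, the sketch below derives a few figures from them. The F1 score is not reported in the text; it is computed here only as an illustration, assuming the usual definitions of precision and recall with "detector FP" as the positive class. The function and variable names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract for the verified-FP dataset.
precision = 0.9922
recall = 0.686

# Derived figures (assumption: standard definitions, not from the paper).
f1 = f1_score(precision, recall)
missed_share = 1 - recall     # share of verified FPs that were not identified
wrong_share = 1 - precision   # share of identified domains that were not FPs

print(f"F1 = {f1:.4f}")
print(f"missed FPs = {missed_share:.1%}, incorrect identifications = {wrong_share:.2%}")
```

Read this way, the operating point favors precision: under these assumptions, fewer than 1 in 100 domains flagged as benign would be wrong, at the cost of leaving roughly a third of verified FPs unidentified.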