CCS2023

Read Between the Lines: Detecting Tracking JavaScript with Bytecode Classification

Mohammad Ghasemisharif, Jason Polakis

3 citations

Abstract

Browsers and extensions that aim to block online ads and tracking scripts predominantly rely on rules from filter lists for determining which resource requests must be blocked. These filter lists are often manually curated by a community of online users. However, due to the arms race between blockers and ad-supported websites, these rules must continuously get updated so as to adapt to novel bypassing techniques and modified requests, thus rendering the detection and rule-generation process cumbersome and reactive (which can result in major delays between propagation and detection). In this paper, we address the detection problem by proposing an automated pipeline that detects tracking and advertisement JavaScript resources with high accuracy, designed to incur minimal false positives and overhead. Our method models script detection as a text classification problem, where JavaScript resources are documents containing bytecode sequences. Since bytecode is directly obtained from the JavaScript interpreter, our technique is resilient against commonly used bypassing methods, such as URL randomization or code obfuscation. We experiment with both deep learning and traditional ML-based approaches for bytecode classification and show that our approach identifies ad/tracking scripts with 97.08% accuracy, significantly outperforming cutting-edge systems in terms of both precision and the level of required features. Our experimental analysis further highlights our system's capabilities, by demonstrating how it can augment filter lists by uncovering ad/tracking scripts that are currently unknown, as well as proactively detecting scripts that have been erroneously added by list curators. CCS CONCEPTS • Security and privacy → Privacy protections.