KDD2022

Large-Scale Information Extraction under Privacy-Aware Constraints

Rajeev Gupta, Ranganath Kondapally

Abstract

In this digital age, people spend a significant portion of their lives online, and their activities have led to an explosion of personal data. Typically, this data is private, and nobody except the user is allowed to look at it. To provide a better experience and assist users in their activities, it is critical to mine certain information from this data. This poses an interesting and complex challenge from the point of view of scalable information extraction: models must be built with little data to learn from because of privacy constraints, yet they need to be highly accurate when run on a large amount of diverse data across different users. Anonymization is typically used to convert private data into publicly accessible data. But this may not always be feasible, and it may require complex differential privacy guarantees to be safe from potential negative consequences. Further, the anonymization process needs to ensure that the data retains sufficient information for modeling purposes. Other techniques involve building extraction models using a small amount of seen (eyes-on) data with no privacy restrictions (which can therefore be labeled) and a large amount of unseen (eyes-off) data that only a machine or a program can access. In this tutorial, we use emails as the canonical example of private data to explain in detail the challenges and solutions for scalable information extraction (IE) under privacy-aware constraints.