ACL2025

Re-identification of De-identified Documents with Autoregressive Infilling

Lucas Georges Gabriel Charpentier, Pierre Lison

Abstract

Documents revealing sensitive information about human individuals must often be de-identified prior to being released. This de-identification is typically done by masking all mentions of personal identifiers, thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a de-identified text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on two datasets based on Wikipedia biographies and court cases. Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.

whether the de-identification has adequately concealed the identity of the person(s) mentioned in the original document. Many evaluation techniques assess the performance of de-identification methods by comparing their outputs with those of human experts (Lison et al., 2021; Pilán et al., 2022). However, those evaluation techniques depend on the availability of human annotations and may be prone to human errors and inconsistencies.

An alternative approach to evaluating the de-identification performance is through an automated adversary that attempts to infer the original context of each text span that had been masked (Mozes and