ACL 2021

Claim Matching Beyond English to Scale Global Fact-Checking

Ashkan Kazemi, Kiran Garimella, Devin Gaffney, Scott Hale

Abstract

Manual fact-checking does not scale well to serve the needs of the internet. This issue is further compounded in non-English contexts. In this paper, we discuss claim matching as a possible solution to scale fact-checking. We define claim matching as the task of identifying pairs of textual messages containing claims that can be served with one fact-check. We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims. These items are first annotated for containing "claim-like statements", then matched with potentially similar items and annotated for claim matching. Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages. We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low- and high-resource languages in our dataset. We evaluate the performance of our solution and compare it with baselines and existing state-of-the-art multilingual embedding models, namely LASER and LaBSE. We demonstrate that our performance exceeds LASER and LaBSE in all settings. We release our annotated datasets, codebooks, and trained embedding model to allow for further research.

Table 1: Example message pairs in our data annotated for claim similarity.

Item #1 | Item #2 | Label
पाकिस्तान में गनपॉइंट पर हुई एक डकैती को बताया जा रहा है मुंबई की घटना | कराची पाकिस्तान में घटित लूट को मुंबई का बताया जा रहा है। | Very Similar
பாகிஸ்தானில் உள்ள இந்திய தூதர் உடனடியாக டெல்லி திரும்ப மத்திய அரசு உத்தரவு | செய்திகள் 24/7 FLASH பாகிஸ்தானில் உள்ள இந்திய தூதர் டெல்லி திரும்ப மத்திய அரசு உத்தரவு என தகவல் ..