EMNLP 2021
Event Coreference Data (Almost) for Free: Mining Hyperlinks from Online News
Michael Bugert, Iryna Gurevych
Abstract
Cross-document event coreference resolution (CDCR) is the task of identifying which event mentions refer to the same events throughout a collection of documents. Annotating CDCR data is an arduous and expensive process, explaining why existing corpora are small and lack domain coverage. To overcome this bottleneck, we automatically extract event coreference data from hyperlinks in online news: When referring to a significant real-world event, writers often add a hyperlink to another article covering this event. We demonstrate that collecting hyperlinks which point to the same article(s) produces extensive and high-quality CDCR data and create a corpus of 2M documents and 2.7M silver-standard event mentions called HyperCoref. We evaluate a state-of-the-art system on three CDCR corpora and find that models trained on small subsets of HyperCoref are highly competitive, with performance similar to models trained on gold-standard data. With our work, we free CDCR research from depending on costly human-annotated training data and open up possibilities for research beyond English CDCR, as our data extraction approach can be easily adapted to other languages.
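The core idea of the abstract, treating hyperlinks that point to the same article as mentions of the same event, can be illustrated with a minimal Python sketch. This is not the authors' extraction pipeline: the function names `normalize_url` and `cluster_mentions`, the URL normalization rules, the example data, and the choice to drop single-mention clusters are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Illustrative normalization (an assumption, not the paper's rules):
    lowercase scheme and host, drop query and fragment, strip trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def cluster_mentions(links):
    """Group (source_doc, anchor_text, target_url) triples by target article.

    Anchor texts whose hyperlinks resolve to the same normalized URL are
    treated as silver-standard mentions of the same event.
    """
    clusters = defaultdict(list)
    for source_doc, anchor_text, target_url in links:
        clusters[normalize_url(target_url)].append((source_doc, anchor_text))
    # Hypothetical filtering choice: keep only targets linked at least twice,
    # since a single mention yields no coreference link.
    return {url: m for url, m in clusters.items() if len(m) >= 2}


# Made-up example data; the URLs and anchor texts are placeholders.
links = [
    ("doc_a", "the earthquake that struck on Tuesday",
     "https://news.example.com/quake-report/"),
    ("doc_b", "a magnitude 7 quake",
     "https://news.example.com/quake-report?ref=home"),
    ("doc_c", "won the election",
     "https://news.example.com/election-result"),
]
clusters = cluster_mentions(links)
```

In this toy input, the anchor texts from `doc_a` and `doc_b` end up in one cluster because their links normalize to the same target, while the lone link from `doc_c` is dropped by the filter.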