ACL 2025

NovelCR: A Large-Scale Bilingual Dataset Tailored for Long-Span Coreference Resolution

Meihan Tong, Shuai Wang

Abstract

Coreference resolution (CR) links pronouns and noun phrases to the entities they refer to, and is a key step in deep text understanding. Existing CR datasets are either small in scale or restrict coreference resolution to a limited text span. In this paper, we present NovelCR, a large-scale bilingual benchmark designed for long-span coreference resolution. NovelCR features extensive annotations, including 148k mentions in NovelCR-en and 311k mentions in NovelCR-zh. Moreover, the dataset is notably rich in long-span coreference pairs: 85% of pairs in NovelCR-en and 83% in NovelCR-zh span three or more sentences. Experiments on NovelCR reveal a large gap between state-of-the-art baselines and human performance, indicating that long-span coreference resolution on NovelCR remains an open problem.
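To make the long-span statistic concrete, here is a minimal Python sketch of how such a ratio could be computed. It is an illustration under stated assumptions, not the authors' procedure. The abstract does not say how pairs are formed, so the sketch assumes a pair is two consecutive mentions in the same coreference chain, and that a pair whose mentions sit in sentences i and j spans |i - j| + 1 sentences. The function name `long_span_ratio` and the input layout (one list of sentence indices per chain) are hypothetical.

```python
def long_span_ratio(clusters, min_sentences=3):
    """Fraction of coreference pairs that span at least `min_sentences` sentences.

    `clusters` is a list of coreference chains; each chain is a list of the
    sentence indices in which its mentions occur (hypothetical layout).
    Assumption: a "pair" is two consecutive mentions within one chain.
    """
    total = 0
    long_span = 0
    for sentence_ids in clusters:
        ordered = sorted(sentence_ids)
        for prev, curr in zip(ordered, ordered[1:]):
            total += 1
            # Mentions in sentences prev and curr cover (curr - prev + 1) sentences.
            if curr - prev + 1 >= min_sentences:
                long_span += 1
    return long_span / total if total else 0.0


# Toy example: two chains with mentions in the listed sentences.
toy_clusters = [[0, 1, 4, 9], [2, 2, 7]]
print(long_span_ratio(toy_clusters))
```

In the toy example, three of the five consecutive pairs span three or more sentences. A different pairing rule (for example, all mention pairs within a chain) would yield a different ratio, so the pairing definition should be fixed before comparing datasets.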