ACL2021

Constrained Labeled Data Generation for Low-Resource Named Entity Recognition

Ruohao Guo, Dan Roth

Abstract

We explore whether synthetic datasets generated by large language models using a few high-quality seed samples are useful for low-resource named entity recognition, considering 11 languages from three language families. Our results suggest that synthetic data created with such seed data is a reasonable choice when there is no available labeled data, and is better than using entirely automatically labeled data. However, a small amount of high-quality data, coupled with cross-lingual transfer from a related language, always offers better performance.