EMNLP 2022
Does Corpus Quality Really Matter for Low-Resource Languages?
Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa
Abstract
The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues with the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it is of much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream NLU tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is not primarily constrained by the quality of the data, and that other factors like corpus size and domain coverage can play a more important role.

LangID
EGOKIA: Dokumentua euskaraz dago.
CORRECT: The document is in Basque.
ARAZOAK: Dokumentuaren zati esanguratsu bat ez dago euskaraz.
PROBLEMATIC: A significant portion of the document is not in Basque.

Hizkuntza / Lang. variety
EGOKIA: Dokumentua hizkuntza estandar eta zuzenean idatzia dago.
CORRECT: The document is written in standard and correct language.
ARAZOAK: Dokumentua ez dago hizkuntza estandar edo zuzenean idatzia (adb. euskalkiren batean dago ala itzulpen automatikoaren bidez sortua dirudi).
PROBLEMATIC: The document is not written in standard and correct language (e.g., it is written in a dialect using non-standard Basque, or it seems to be generated through machine translation).

Koherentzia / Coherence
EGOKIA: Dokumentua koherentea da, eta hasieratik bukaerara unitate bat osatzen du.
CORRECT: The document is coherent, and it constitutes a single unit from the beginning to the end.
ARAZOAK: Dokumentua ez da koherentea: hutsuneak ditu edota atal batzuk ez dute elkarren artean loturarik (dokumentu ezberdinak dirudite).
PROBLEMATIC: The document is not coherent: it has gaps and/or some portions do not seem connected (they seem to come from separate documents).

Garbitasuna / Noise
EGOKIA: Dokumentuko testua garbia da.
CORRECT: The text in the document is clean.
ARAZOAK: Dokumentua ez da erabat garbia, eta benetako testuaz gain webguneko bestelako elementuak daude (menuetako testua, html kodea...).
PROBLEMATIC: The document is not entirely clean, and there are other elements in addition to the real content (text from menus, HTML code...).

Edukia / Content
EGOKIA: Dokumentua pertsona batek sortua dirudi eta gutxieneko mami bat du.
CORRECT: The document seems to have been created by a human and has some minimum meat.
ARAZOAK: Dokumentuak automatikoki sortua dirudi edota ez du inolako mamirik (adb futbol ligako sailkapen-taula).
PROBLEMATIC: The document seems to have been generated automatically and/or has no meat at all (e.g., a soccer standings table).

Kalitate orokorra / Overall quality
ALTUA: Dokumentua kalitatezkoa da, eta corpusean izatea komeniko litzatekeela uste dut.
HIGH: The document is of good quality, and I think that it would be good to have it in the corpus.
ERTAINA: Dokumentuak arazo batzuk ditu baina ez dira larriak, eta ez nago ziur ea corpusean izatea komeniko litzatekeen.
MEDIUM: The document has minor issues, and I am not sure if it would be good to have it in the corpus.
BAXUA: Dokumentuak arazo nabarmenak ditu. Ez dut uste corpusean izatea komeniko litzatekeenik.
LOW: The document has major issues. I think that it would be better not to have it in the corpus.

Table 4: Annotation instructions used for the qualitative evaluation. We report the original instructions in Basque, as well as the corresponding translation into English.
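
As a rough illustration of how overall-quality ratings of the kind described in Table 4 could be turned into per-corpus percentages (such as the share of documents rated high-quality), the sketch below tallies one label per document. It is not the authors' evaluation code: the function name `overall_quality_shares`, the label tuple, and the toy ratings are all assumptions made for this example, and the toy ratings are not data from the paper.

```python
from collections import Counter

# Overall-quality labels taken from the English column of Table 4.
OVERALL_LABELS = ("HIGH", "MEDIUM", "LOW")


def overall_quality_shares(ratings):
    """Return the fraction of documents assigned each overall-quality label.

    `ratings` is a list with one label per annotated document.
    (Hypothetical helper; the paper does not describe its tallying code.)
    """
    if not ratings:
        raise ValueError("ratings must contain at least one label")
    unknown = set(ratings) - set(OVERALL_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(ratings)
    total = len(ratings)
    # Counter returns 0 for labels that never occur, so every label gets a share.
    return {label: counts[label] / total for label in OVERALL_LABELS}


# Toy ratings invented for illustration only; not results from the paper.
toy_ratings = ["HIGH", "HIGH", "MEDIUM", "LOW"]
shares = overall_quality_shares(toy_ratings)
```

Running the same tally separately on the ratings collected for each corpus would give directly comparable proportions, under the assumption that each document receives exactly one overall label.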