EMNLP 2022
IDK-MRC: Unanswerable Questions for Indonesian Machine Reading Comprehension
Rifki Afina Putri, Alice Oh
Abstract
Machine Reading Comprehension (MRC) has become one of the essential tasks in Natural Language Understanding (NLU), as it is often included in NLU benchmarks (Liang et al., 2020; Wilie et al., 2020). However, most MRC datasets contain only answerable questions, overlooking the importance of unanswerable ones. MRC models trained only on answerable questions will select the span that is most likely to be the answer, even when the answer does not actually exist in the given passage (Rajpurkar et al., 2018). This problem is especially persistent in medium- to low-resource languages like Indonesian. Existing Indonesian MRC datasets (Purwarianti et al., 2007; Clark et al., 2020) are still inadequate because of their small size and limited question types, i.e., they cover only answerable questions. To fill this gap, we build a new Indonesian MRC dataset called I(n)don'tKnow-MRC (IDK-MRC) by combining automatic and manual unanswerable question generation to minimize the cost of manual dataset construction while maintaining dataset quality. Combined with the existing answerable questions, IDK-MRC consists of more than 10K questions in total. Our analysis shows that our dataset significantly improves the performance of Indonesian MRC models, showing a large improvement for unanswerable questions.¹

Type: Negation
Description: Negation word inserted or removed
Example:
  Context: Kambing memiliki lemak dalam kandungan susunya. (Goats have fat in their milk.)
  Ans Q: Apakah kandungan yang ada dalam susu kambing? (What are the ingredients in goat's milk?)
  UnAns Q: Apakah kandungan yang tidak ada dalam susu kambing? (What are the ingredients that do not exist in goat's milk?)

Type: Antonym
Description: Antonym used
Example:
  Context: Aristokrasi adalah sebuah kelas sosial yang tertinggi di masyarakat. (Aristocracy is the highest social class in society.)
  Ans Q: Apakah nama kelas sosial tertinggi? (What is the name of the highest social class?)
  UnAns Q: Apa nama kelas sosial terendah? (What is the name of the lowest social class?)

Type: Entity Swap
Description: Entity, date, number, or term replaced with other entity, date, number, or term
Example:
  Context: Salah satu kandidat standar untuk 4G yang dikomersilkan di dunia yaitu standar Long Term Evolution (LTE) (Swedia sejak 2009). (One of the standards for 4G commercialized in the world is the Long Term Evolution (LTE) standard (Sweden since 2009).)
  Ans Q: Di manakah LTE pertama kali diciptakan? (Where was LTE first invented?)
  UnAns Q: Di manakah 3G pertama diciptakan? (Where was 3G first invented?)

Type: Question Tag Swap
Description: Question tag replaced with other question tag
Example:
  Context: Suaka margasatwa Muara Angke adalah sebuah kawasan konservasi di wilayah hutan bakau di pesisir utara Jakarta. (Muara Angke Wildlife Sanctuary is a conservation area in the mangrove forest area on the north coast of Jakarta.)
  Ans Q: Di mana Suaka margasatwa Muara Angke dibangun? (Where was Muara Angke Wildlife Sanctuary built?)
  UnAns Q: Kapan Suaka margasatwa Muara Angke dibangun? (When was Muara Angke Wildlife Sanctuary built?)

Type: Specific Condition
Description: Asks for specific condition that is not satisfied by the information in the paragraph