NeurIPS 2021
Introspective Distillation for Robust Question Answering
Yulei Niu, Hanwang Zhang
Cited by 72
Abstract
Question answering (QA) models are well known to exploit data bias, e.g., the language prior in visual QA and the position bias in reading comprehension. Recent debiasing methods achieve good out-of-distribution (OOD) generalizability with a considerable sacrifice of the in-distribution (ID) performance. Therefore, they are only applicable in domains where the test distribution is known in advance. In this paper, we present a novel debiasing method called Introspective Distillation (IntroD) to make the best of both worlds for QA. Our key technical contribution is to blend the inductive bias of OOD and ID by introspecting whether a training sample fits in the factual ID world or the counterfactual OOD one. Experiments on visual QA datasets VQA v2, VQA-CP, and reading comprehension dataset SQuAD demonstrate that our proposed IntroD maintains competitive OOD performance compared to other debiasing methods, while sacrificing little or even achieving better ID performance compared to the non-debiasing ones.
Question answering (QA), which requires machines to answer questions given a context, is one of the most fundamental AI tasks. Popular contexts are vision (e.g., image for VQA [10]) and natural language (e.g., passage for extractive QA [38]). A common observation is that QA models tend to over-exploit the training bias, bypassing context comprehension in favor of a shortcut answer. For example, by only using the linguistic correlations between questions and answers, VQA models can answer most questions correctly [22, 7, 10, 27]. Similarly, extractive QA models may use spurious positional cues to locate the answer in the passage [30]. As a result, QA models that have already achieved strong in-distribution (ID) performance may inevitably fail in out-of-distribution (OOD) test scenarios, regardless of the scale of training data and models [20, 30, 50].
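The language-prior shortcut described above can be illustrated with a toy sketch. This is not the paper's model or data: the question types, answers, and counts below are hypothetical, and the "model" simply memorizes the most frequent training answer for each question type while ignoring the image entirely. It only aims to show how such a shortcut can look strong when the test answers follow the training distribution and weak when that distribution is reversed.

```python
from collections import Counter, defaultdict

def fit_prior(train):
    """Memorize the most frequent answer per question type (no image is used)."""
    counts = defaultdict(Counter)
    for question_type, answer in train:
        counts[question_type][answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}

def accuracy(prior, data):
    """Fraction of samples where the memorized answer matches the label."""
    correct = sum(prior.get(q) == a for q, a in data)
    return correct / len(data)

# Hypothetical toy samples: (question type, ground-truth answer).
train = (
    [("is there", "yes")] * 8 + [("is there", "no")] * 2
    + [("what color", "white")] * 7 + [("what color", "black")] * 3
)
# ID test: answer proportions resemble the training set.
id_test = (
    [("is there", "yes")] * 4 + [("is there", "no")] * 1
    + [("what color", "white")] * 3 + [("what color", "black")] * 2
)
# OOD test: answer proportions are reversed relative to training.
ood_test = (
    [("is there", "yes")] * 1 + [("is there", "no")] * 4
    + [("what color", "white")] * 2 + [("what color", "black")] * 3
)

prior = fit_prior(train)
id_acc = accuracy(prior, id_test)    # 7 of 10 correct
ood_acc = accuracy(prior, ood_test)  # 3 of 10 correct
print(f"ID accuracy:  {id_acc:.1f}")
print(f"OOD accuracy: {ood_acc:.1f}")
```

On this toy data the question-only shortcut answers 7 of 10 ID samples correctly but only 3 of 10 OOD samples, mirroring the ID/OOD gap in miniature.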
Recently, several debiasing methods aim to close the gap between the ID and OOD performances [12, 17, 13, 35]. However, many of them hold the assumption that the training and test distributions are very different or even reversed, e.g., if there are more "yes" answers in training, there must be more "no" answers in testing. As a result, these methods encounter a severe performance drop under the ID evaluation, although they significantly outperform non-debiasing baselines in terms of OOD performance. An interesting observation from Figure 1 is that non-debiasing methods (circles) obtain high ID but low OOD performance, while debiasing methods (squares) achieve high OOD but low ID performance. This observation motivates us to ask: can we make the best of both worlds?
35th Conference on Neural Information Processing Systems (NeurIPS 2021).