EMNLP 2022

RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners

Soumya Sanyal, Zeyi Liao, Xiang Ren

Abstract

Transformers have been shown to be able to perform deductive reasoning on inputs containing rules and statements written in English natural language. However, it is unclear whether these models indeed follow rigorous logical reasoning to arrive at their predictions, or rely on spurious correlation patterns when making decisions. A strong deductive reasoning model should consistently understand the semantics of different logical operators. To this end, we present RobustLR, a deductive reasoning-based diagnostic benchmark that evaluates the robustness of language models to minimal logical edits in the inputs and to different logical equivalence conditions. In our experiments with RoBERTa, T5, and GPT-3, we show that models trained on deductive reasoning datasets with various logical operations do not perform consistently on the RobustLR test set, which shows that the models are not robust to our proposed logical perturbations. Further, we observe that the models find it especially hard to learn the logical negation operator. Our results demonstrate the shortcomings of current language models in logical reasoning, and call for the development of better inductive biases to teach logical semantics to language models. All the datasets and the code base have been made publicly available.

(a) Original Theory
f1: Charlie is tall.
r1: Erin is kind, if Charlie is tall.
statement: Erin is kind.
Label: True

(b) Conjunction Perturbation
f1: Charlie is tall.
r1: Erin is kind, if Charlie is tall and round.
statement: Erin is kind.
Label: Unknown

f1: Charlie is tall.
r1: Erin is kind, if Charlie is tall or round.
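
The label change between panels (a) and (b) can be reproduced with a toy symbolic check. The sketch below is not the benchmark's code: the `Rule` class and the `forward_chain` and `label` helpers are hypothetical names introduced here for illustration, and the sketch assumes rules with purely conjunctive bodies and only the labels True and Unknown (no negation, no False label).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule of the form 'head, if every atom in body holds' (hypothetical helper)."""
    body: frozenset
    head: str


def forward_chain(facts, rules):
    """Return every atom derivable from the facts by repeatedly firing rules."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # A conjunctive rule fires only when all of its body atoms are known.
            if rule.body <= known and rule.head not in known:
                known.add(rule.head)
                changed = True
    return known


def label(facts, rules, statement):
    """Toy labelling: 'True' if the statement is derivable, otherwise 'Unknown'."""
    return "True" if statement in forward_chain(facts, rules) else "Unknown"


facts = {"Charlie is tall"}

# (a) Original theory: Erin is kind, if Charlie is tall.
original = [Rule(frozenset({"Charlie is tall"}), "Erin is kind")]

# (b) Conjunction perturbation: Erin is kind, if Charlie is tall and round.
perturbed = [Rule(frozenset({"Charlie is tall", "Charlie is round"}), "Erin is kind")]

print(label(facts, original, "Erin is kind"))   # True
print(label(facts, perturbed, "Erin is kind"))  # Unknown
```

Adding one conjunct to the rule body is a minimal edit to the text, yet it flips the gold label because "Charlie is round" is never established. A model that relies on surface overlap between the fact and the rule would be expected to keep predicting True.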