ACL2024

Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs

Siyuan Wang, Zhongyu Wei, Yejin Choi, Xiang Ren

Abstract

Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. To investigate this, we propose a logic scaffolding inferential rule generation framework to construct an inferential rule base, ULogic, comprising both primitive and compositional rules across five domains. Our analysis of GPT-series models over a rule subset reveals significant gaps in LLMs' logic understanding compared to human performance, especially in compositional and structurally complex rules with certain bias patterns. We further distill these rules into a smaller-scale inference engine for flexible rule generation and enhancing downstream reasoning. Through a multi-judger evaluation, our inference engine proves effective in generating accurate, complex and abstract conclusions and premises, and improves various commonsense reasoning tasks. Overall, our work sheds light on LLMs' limitations in grasping inferential rules and suggests ways to enhance their logical reasoning abilities.

Figure 1: The underlying logic to answer Q1 and Q2.
Q1: Did Leonardo da Vinci ever use a laptop for drawing pictures?
Q2: Jane wrote a novel published by Jimmy, a publisher born in 1750. Did Jane's grandmother often work by car?
Underlying Logic: If Person X died before year A and Object Y was invented in year B, and A is earlier than B, then Person X cannot access Object Y.

Humans naturally abstract underlying logic (e.g., inferential rules) from extensive real-world observations (Barwise, 1993), which is beneficial for addressing diverse reasoning situations. An inferential rule is typically defined as a premise with a set of facts (e.g., "Person X died before ... earlier than B") leading to a conclusion (e.g., "Person X cannot access Object Y") (Boghossian, 2014).
Grasping this rule enables the deduction that a person cannot access an object invented posthumously. This work utilizes symbolic logic as a scaffold to generate challenging reasoning situations for GPT-series LLMs. We observe a discernible gap between LLMs and humans in understanding inferential rules, especially rules with complex premises.

However, collecting such inferential rules at scale presents a major challenge. Previous work mainly relies on manual curation (Sap et al., 2019a; Sinha et al., 2019) or inductive logic programming (Qu et al., 2020), which are either labor-intensive or limited in diversity. Besides, manually crafted rules often appear simple and overly specified, struggling to move beyond basic intuition or generalize across diverse situations. For example, the rule "If X runs out of steam, then X becomes tired" from Sap et al. (2019a) has only one premise fact and narrowly specifies exhaustion.

Inferential Rules of Different Complexities
(Complexity=1) If Person X has an adverse reaction to Food Y, then Person X cannot eat Food Y.
(Complexity=2) If Person X has inherited Disease Z2 and Food Y should be avoided by those with Disease Z2, then Person X cannot eat Food Y.
(Complexity=3) If Person X earns Money Z1 and Material Y is sold for Money Z2, and Money Z1 is bigger than Money Z2, then Person X can buy Material Y.
(Complexity=4) If Person X works at Job A and Job A pays Money Z1, and Material Y is sold for Money Z2, and Money Z1 is bigger than Money Z2, then Person X can purchase Material Y.
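The premise-conclusion structure of the rules listed above can be made concrete with a small sketch. It is illustrative only and is not the ULogic implementation: the `InferentialRule` class and the `cannot_access` helper are hypothetical names, and the sketch assumes that a rule's complexity equals its number of premise facts, which matches the four examples above. The laptop invention year used in the usage note is likewise an illustrative value, not one taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InferentialRule:
    """Hypothetical container: a tuple of premise facts and one conclusion."""
    premise: Tuple[str, ...]
    conclusion: str

    @property
    def complexity(self) -> int:
        # Assumption: complexity is the number of premise facts.
        return len(self.premise)

    def verbalize(self) -> str:
        # Render the rule in the "If ..., then ..." form used in the text.
        return f"If {' and '.join(self.premise)}, then {self.conclusion}."


# The rule underlying Q1 and Q2 (Figure 1).
figure1_rule = InferentialRule(
    premise=(
        "Person X died before year A",
        "Object Y was invented in year B",
        "A is earlier than B",
    ),
    conclusion="Person X cannot access Object Y",
)

# The Complexity=4 example from the list above.
purchase_rule = InferentialRule(
    premise=(
        "Person X works at Job A",
        "Job A pays Money Z1",
        "Material Y is sold for Money Z2",
        "Money Z1 is bigger than Money Z2",
    ),
    conclusion="Person X can purchase Material Y",
)


def cannot_access(died_before_year: int, invented_in_year: int) -> bool:
    """Apply the Figure 1 rule once years A and B are bound to numbers.

    The first two premise facts are taken as given; only the comparison
    "A is earlier than B" is checked here.
    """
    return died_before_year < invented_in_year
```

For Q1, Leonardo da Vinci died in 1519, so year A can be bound to 1520; taking 1981 as an illustrative year B for the laptop, `cannot_access(1520, 1981)` holds and the rule's conclusion applies. Answering Q2 additionally requires chaining facts about Jimmy, Jane and her grandmother before the same rule can fire, which this sketch does not attempt.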