EMNLP 2023

An Investigation of LLMs' Inefficacy in Understanding Converse Relations

Chengwen Qi, Bowen Li, Binyuan Hui, Bailin Wang, Jinyang Li, Jinwang Wu, Yuanjun Laili

Cited by 4

Abstract

Large Language Models (LLMs) have achieved remarkable success in many formal-language-oriented tasks, such as structural data-to-text and semantic parsing. However, current benchmarks mostly follow the data distribution of the LLMs' pre-training data. A natural question therefore arises: do LLMs really understand the structured semantics of formal languages? In this paper, we investigate this question in a special case: converse binary relations. We introduce a new benchmark, ConvRe, focusing on converse relations, which contains 17 relations and 1240 triples extracted from popular knowledge graph completion datasets. ConvRe features two tasks, Re2Text and Text2Re, which are formulated as multi-choice question answering to evaluate LLMs' ability to determine the matching between relations and associated text. For the evaluation protocol, apart from different prompting methods, we further introduce variants to the test text and the few-shot example text. We conduct experiments on three popular LLM families and observe various scaling trends. The results suggest that LLMs often resort to shortcut learning and still face challenges on our proposed benchmark.

Re2Text Task
Read the instruction and then answer the question using A or B.
Instruction: (x, has part, y) indicates that y has a part called x.
Question: (?, has part, hilt)
A: Find an entity that has a part called hilt.
B: Find an entity that is a part of hilt.
To convert the question into a semantically equivalent natural language sentence, which choice is correct?
Answer:

Re2Text Task (hard)
Read the instruction and then answer the question using A or B.
Instruction: (x, has part, y) indicates that y has a part called x.
Question: (?, has part, hilt)
A: Find an entity that has a part called hilt.
B: Find an entity that hilt contains.
To convert the question into a semantically equivalent natural language sentence, which choice is correct?
Answer:

Text2Re Task
Read the instruction and then answer the question using A or B.
Instruction: (x, has part, y) indicates that y has a part called x.
Question: Find an entity that possesses a specific component named hilt.
A: (?, has part, hilt)
B: (hilt, has part, ?)
To convert the question into a semantically equivalent triple query, which choice is correct?
Answer:

Text2Re Task (hard)
Read the instruction and then answer the question using A or B.
Instruction: (x, has part, y) indicates that y has a part called x.
Question: Find an entity that has a part called hilt.
A: (?, has part, hilt)
B: (hilt, has part, ?)
To convert the question into a semantically equivalent triple query, which choice is correct?
Answer:

Figure 3: Examples of Re2Text and Text2Re tasks on converse relation. We additionally paraphrase the natural language representations (answer candidates for Re2Text, question for Text2Re) to make them differ from the sentences in the Instruction.

to the two tasks, which will be evidenced by the empirical results in our experiments (section 4.2). An intuitive explanation is provided in figure 3. Detailed zero-shot prompting methods can be found in table 2.

Example Variants in Few-shot Prompting. Besides the variants on the test text, we additionally introduce variants to the text in the examples used for few-shot prompting. Since we have identified the most challenging settings for the two tasks in zero-shot, we employ these settings for the test text and dub them hard tests in few-shot. Accordingly, we incorporate text variants into the examples used in few-shot prompting. The full set of few-shot prompts used in our benchmark is listed in table 3. Details of arrangement of text