ACL 2024

I am a Strange Dataset: Metalinguistic Tests for Language Models

Tristan Thrush, Jared Moore, Miguel Monares, Christopher Potts, Douwe Kiela

Abstract

Statements involving metalinguistic self-reference ("This paper has six sections.") are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present "I am a Strange Dataset", a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like "The penultimate word in this sentence is" (where a correct continuation is "is"). In verification, models judge the truth of statements like "The penultimate word in this sentence is sentence." (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range.

one paper you read today is bound to contain "In this paper" (Anonymous, 2024). In this paper, we focus on metalinguistic self-reference, the complex kind of self-reference in which language is used to make claims about itself, as in "This sentence has five words" and "This paper has six sections".¹ Using such language involves reasoning about metalinguistic properties (counting words, naming parts of speech, etc.) and resolving self-reference. Humans generally have no trouble with such language, and may even enjoy its playful and sometimes paradoxical nature (Hof-