ACL 2025

Domain Regeneration: How well do LLMs match syntactic properties of text domains?

Da Ju, Hagen Blix, Adina Williams

Abstract

Recent improvements in large language model performance have, in all likelihood, been accompanied by improvements in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt commonly used, open-source LLMs to regenerate text from three domains of permissively licensed English text that are often contained in LLM training data: Wikipedia, news text, and ELI5. In a fairly semantically controlled setting, this regeneration paradigm allows us to investigate whether LLMs can faithfully match original human text domains. We investigate varying levels of syntactic abstraction, from simpler properties like sentence length and article readability to more complex, higher-order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.