EMNLP 2023

Language Model Quality Correlates with Psychometric Predictive Power in Multiple Languages

Ethan Wilcox, Clara Meister, Ryan Cotterell, Tiago Pimentel

7 citations

Abstract

Surprisal theory (Hale, 2001; Levy, 2008) posits that a word's reading time is proportional to its surprisal (i.e., to its negative log probability given the preceding context). It has been empirically tested using surprisal estimates from language models (LMs). Under the premise that surprisal theory holds, we would expect that higher-quality language models, whose predictions are more accurate, provide more powerful predictors of human reading behavior, a conjecture we dub the quality-power (QP) hypothesis. Unfortunately, empirical support for the QP hypothesis is mixed. Some studies in English have found correlations between LM quality and psychometric predictive power, but other studies using Japanese data, as well as studies using larger English LMs, find no such correlations. In this work, we conduct a systematic crosslinguistic assessment of the QP hypothesis. We train LMs from scratch on small- and medium-sized datasets from 13 languages (across five language families) and assess their ability to predict eye-tracking data. We find correlations between LM quality and psychometric predictive power in eleven of these thirteen languages, suggesting that, within the range of model classes and sizes tested, better language models provide better predictors of human language processing behaviors. https://github.com/rycolab/quality-power-hypothesis
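As an illustration of the quantity the abstract refers to, the following minimal sketch computes surprisal as the negative log probability of a word given its preceding context. The function name `surprisal` and the probabilities in `toy_probs` are hypothetical, chosen only for illustration; they are not taken from the paper or its repository, and a real study would obtain these probabilities from a trained LM.

```python
import math


def surprisal(prob: float) -> float:
    """Return the surprisal, in bits, of an event with probability `prob`.

    Surprisal is defined as -log2(prob); lower-probability words
    therefore receive higher surprisal.
    """
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must lie in the interval (0, 1]")
    return -math.log2(prob)


# Hypothetical conditional probabilities p(word | preceding context).
# These values are made up for illustration, not model outputs.
toy_probs = {"the": 0.5, "cat": 0.125, "sat": 0.25}

# Per-word surprisal in bits.
toy_surprisals = {word: surprisal(p) for word, p in toy_probs.items()}

# Averaging surprisal over words gives an empirical cross-entropy estimate.
mean_surprisal = sum(toy_surprisals.values()) / len(toy_surprisals)
```

Under surprisal theory, words with higher values in `toy_surprisals` would be expected to take longer to read.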