ACL 2024

Temperature-scaling surprisal estimates improve fit to human reading times - but does it do so for the "right reasons"?

Tong Liu, Iza Skrjanec, Vera Demberg

Cited by 3

Abstract

A wide body of evidence shows that human language processing difficulty is predicted by the information-theoretic measure surprisal, a word's negative log probability in context. However, it is still unclear how to best estimate these probabilities needed for predicting human processing difficulty: while a long-standing belief held that models with lower perplexity would provide more accurate estimates of word predictability, and therefore lead to better reading time predictions, recent work has shown that for very large models, psycholinguistic predictive power decreases. One reason could be that language models might be more confident of their predictions than humans, because they have had exposure to several magnitudes more data. In this paper, we test what effect temperature-scaling of large language model (LLM) predictions has on surprisal estimates and their predictive power of reading times of English texts. Firstly, we show that calibration of large language models typically improves with model size, i.e. poorer calibration cannot account for poorer fit to reading times. Secondly, we find that temperature-scaling probabilities lead to a systematically better fit to reading times (up to 89% improvement in delta log likelihood), across several reading time corpora. Finally, we show that this improvement in fit is chiefly driven by words that are composed of multiple subword tokens.¹

1 Introduction

In psycholinguistics, a key finding is that words with higher surprisal (= negative log probability of the word in context) require more time for processing (Hale, 2001; Levy, 2008). Numerous studies provided experimental evidence supporting this theory, demonstrating that surprisal is a powerful predictive measure of processing complexity (e.g.,