ACL 2025

Robust Estimation of Population-Level Effects in Repeated-Measures NLP Experimental Designs

Alejandro Benito-Santos, Adrián Ghajari, Víctor Fresno

Abstract

NLP research frequently grapples with multiple sources of variability (across runs, datasets, annotators, and more), yet conventional analysis methods often neglect these hierarchical structures, threatening the reproducibility of findings. To address this gap, we contribute a case study illustrating how linear mixed-effects models (LMMs) can rigorously capture systematic language-dependent differences (i.e., population-level effects) in a population of monolingual and multilingual language models. In the context of a bilingual hate speech detection task, we demonstrate that LMMs can uncover significant population-level effects, even under low-resource (small-N) experimental designs, while mitigating confounds and random noise. By setting out a transparent blueprint for repeated-measures experimentation, we encourage the NLP community to embrace variability as a feature, rather than a nuisance, in order to advance more robust, reproducible, and ultimately trustworthy results.