ACL2023

Exploring Variation of Results from Different Experimental Conditions

Maja Popovic, Mohammad Arvan, Natalie Parde, Anya Belz

Abstract

It might reasonably be expected that running experiments for the same task using the same data and model would yield very similar results. Recent research has, however, shown this not to be the case for many NLP experiments. In this paper, we report extensive work by two NLP groups to run the training and testing pipeline for three neural text simplification (NTS) models under varying experimental conditions, including different random seeds, run-time environments, and dependency versions, yielding a large number of results for each of the three models using the same data and train/dev/test set splits. From one perspective, these results can be interpreted as shedding light on the reproducibility of evaluation results for the three NTS models, and we present an in-depth analysis of the variation observed for different combinations of experimental conditions. From another perspective, the results raise the question of whether the averaged score should be considered the ‘true’ result for each model.