NeurIPS 2022

What are the best Systems? New Perspectives on NLP Benchmarking

Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stéphan Clémençon

Abstract

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate the performances of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and more robust.

How to aggregate performances? The multi-task setting has been investigated in recent works that provide benchmarks of state-of-the-art models across a great variety of tasks [28, 62, 80, 90, 108], sometimes more than fifty [2, 84, 85, 94]. These papers provide tables of scores across the considered tasks, but the only non-qualitative way to compare systems consists in averaging the performances across tasks and then ranking systems according to their mean scores.
This is, for instance, done with the GLUE benchmark [91] and its derivatives [92]. However, taking the mean is seriously flawed since the different metrics are usually not on the same scale and can even be unbounded [23, 102]. Even a pre-processing renormalization scheme would fail to capture the intrinsic difficulty of the tasks.

Contribution 1. Our first contribution is to provide a reliable tool to rank systems in a multi-task setting. We rely on a ranking aggregation procedure which, from a set of rankings induced by each criterion, returns a single ranking that aggregates them. This procedure, called the Kemeny consensus [52], can be seen as a voting rule and stems from social choice theory [66].

Aggregation when instance-level information is available. As illustrated by Ruder [83] and Zhong et al. [109], a fine-grained understanding of model performance should include instance-level scores. While taking the mean is quite natural in the classification setting, this is not always the case, as recently pointed out by [73] in the NLG setting. In that article, the authors investigate pairwise comparison of NLG systems for a single metric (e.g. BLEU [71], ROUGE [59], METEOR [5, 35, 49], CHRF [76, 77], BertScore [105]). They prove that a comparison based on the mean or the median of the scores across test utterances can be highly flawed. They instead advise relying on the Bradley-Terry [10] pairwise comparison method, which consists, for two systems A and B, in computing the proportion of utterances on which A achieves a better score than B. Their work is a significant advance but remains limited to pairwise comparisons.

Contribution 2. Our second contribution consists in going one step further than [73] by applying our ranking procedure to an arbitrarily large set of NLG systems with respect to a fixed set of criteria.
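
The contrast between mean aggregation and ranking aggregation discussed above can be illustrated with a minimal Python sketch. It assumes the standard definition of the Kemeny consensus as the ranking that minimizes the total Kendall tau distance to the input rankings, and it finds that ranking by exhaustive search, which is only feasible for a handful of systems; it is not necessarily how the consensus is computed in this paper. The systems `A`, `B`, `C`, the task names, all scores, and the function names are hypothetical and chosen for illustration only.

```python
from itertools import combinations, permutations

# Hypothetical scores (higher is better). The third task is on a much
# larger scale than the first two, as can happen with unbounded metrics.
scores = {
    "task_1": {"A": 0.90, "B": 0.85, "C": 0.80},
    "task_2": {"A": 0.88, "B": 0.86, "C": 0.70},
    "task_3": {"A": 10.0, "B": 12.0, "C": 40.0},
}
systems = sorted(scores["task_1"])


def ranking_from_scores(task_scores):
    """Order system names from best to worst according to their scores."""
    return sorted(task_scores, key=task_scores.get, reverse=True)


def kendall_tau_distance(r1, r2):
    """Count the pairs of systems that the two rankings order differently."""
    pos1 = {s: i for i, s in enumerate(r1)}
    pos2 = {s: i for i, s in enumerate(r2)}
    return sum(
        (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
        for a, b in combinations(r1, 2)
    )


def kemeny_consensus(rankings):
    """Exhaustive search for the ranking closest to all input rankings.

    The cost is factorial in the number of systems, so this is a toy
    illustration rather than a practical implementation.
    """
    candidates = permutations(sorted(rankings[0]))
    return min(
        candidates,
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )


# Mean aggregation: average the raw scores, then rank.
mean_scores = {
    s: sum(task[s] for task in scores.values()) / len(scores) for s in systems
}
mean_ranking = ranking_from_scores(mean_scores)

# Ranking aggregation: rank within each task, then aggregate the rankings.
task_rankings = [ranking_from_scores(task) for task in scores.values()]
consensus = kemeny_consensus(task_rankings)

print(mean_ranking)  # ['C', 'B', 'A']
print(consensus)     # ('A', 'B', 'C')
```

In this toy example the mean is dominated by the task with the largest scale, so it ranks `C` first, whereas the consensus follows the two tasks that agree and is unaffected by the scale of the scores.
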
Our evaluation methodology can be seen as a natural extension of [73] since it coincides with the latter in the particular case of pairwise comparison. In a more realistic multi-criteria scenario, we combine our two contributions and develop a two-stage ranking aggregation procedure which first aggregates along utterances and then along criteria.

Experiments. Our two contributions rely on our aggregation procedure, whose effectiveness is demonstrated through several experiments.

1. We explain on a simple synthetic example the superiority of our approach compared to the mean-aggregation procedure and the pairwise-aggregation procedure, both in terms of consistency and robustness.
2. We use our ranking procedure on 10 multi-task / multi-criteria benchmarks and observe that it leads to different conclusions than the mean- and pairwise-aggregation procedures.
3. We argue that our procedure is more robust by investigating its stability with respect to the addition of criteria and with respect to the addition of systems.

Our code and the collected data will be released to accelerate the adoption of what we think is a reliable evaluation method for multi-task and multi-criteria benchmarks.

2 Problem Formula