ICML 2022

Understanding Dataset Difficulty with V-Usable Information

Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta

337 citations

Abstract

Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty, with respect to a model V, as the lack of V-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for V. We further introduce pointwise V-information (PVI) for measuring the difficulty of individual instances with respect to a given distribution. While standard evaluation metrics typically only compare different models on the same dataset, V-usable information and PVI also permit the converse: for a given model V, we can compare different datasets, as well as different instances or slices of the same dataset. Furthermore, our framework allows different input attributes to be interpreted via transformations of the input, which we use to discover annotation artefacts in widely used NLP benchmarks.

Sentence | Label | PVI
Wash you! | No | -4.616
Who achieved the best result was Angela. | No | -4.584
Sue gave to Bill a book. | No | -3.649
Only Churchill remembered Churchill giving the Blood, Sweat and Tears speech.
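
The abstract does not spell out how PVI is computed. A minimal sketch follows, assuming the definition given in the full paper: PVI(x -> y) = -log2 g'[null](y) + log2 g[x](y), where g is a model from V fitted on real inputs and g' is the same kind of model fitted on empty (null) inputs. The function names and the probabilities below are hypothetical, chosen only for illustration; they are not taken from the paper's experiments.

```python
import math


def pvi(p_with_input: float, p_null_input: float) -> float:
    """Pointwise V-information in bits (assumed definition).

    p_with_input: probability the input-conditioned model gives the gold label.
    p_null_input: probability the null-input model gives the gold label.
    """
    return math.log2(p_with_input) - math.log2(p_null_input)


def v_information(pairs):
    """Estimate V-usable information as the mean PVI over held-out instances."""
    values = [pvi(p, q) for p, q in pairs]
    return sum(values) / len(values)


# Hypothetical (p_with_input, p_null_input) pairs for three held-out instances.
examples = [(0.9, 0.5), (0.25, 0.5), (0.5, 0.5)]

for p, q in examples:
    print(f"PVI = {pvi(p, q):+.3f} bits")
print(f"Estimated V-information = {v_information(examples):+.3f} bits")
```

Under this reading, a negative PVI means the model assigns the gold label a lower probability when it sees the input than when it sees nothing, which is the sense in which low-PVI instances are difficult (or possibly mislabelled) for V.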