ACL2025

Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks

Tom Calamai, Oana Balalau, Fabian M. Suchanek

Abstract

Significant efforts have been made in the NLP community to facilitate the automatic analysis of climate-related corpora through tasks such as climate-related topic detection, climate risk classification, question answering over climate topics, and many more. In this work, we perform a reproducibility study on 8 tasks and 29 datasets, testing 6 models. We find that many tasks rely heavily on surface-level keyword patterns rather than deeper semantic or contextual understanding. Moreover, we find that 96% of the datasets contain annotation issues: of the sampled wrong predictions of a zero-shot classifier, 16.6% are in fact clear annotation mistakes and 38.8% are ambiguous examples. These results call into question the reliability of current benchmarks for meaningfully comparing models and highlight the need for improved annotation practices. We conclude by outlining actionable recommendations to enhance dataset quality and evaluation robustness.

3 https://github.com/prasanthg3/cleantext

| Dataset | Input | Labels |
|---|---|---|
| Climate-Related Topic Detection | | |
| ClimateBug-data, Yu et al. (2024) | sentences from banks' reports | relevant/irrelevant: Climate change and sustainability (including ESG, SDGs related to the environment, recycling and more) |
| ClimateBERT's climate detection, Bingler et al. (2023) | paragraphs from reports | 1/0: Climate policy, climate change or an environmental topic |
| Climatext (Wikipedia, 10-K, claims), Varini et al. (2020) | sentences from Wikipedia, 10-Ks or web scraping | 1/0: Directly related to climate change |
| Climatext (wiki-doc), Varini et al. (2020) | sentences from a Wikipedia page | 1/0: Extracted from a Wikipedia page related to climate change |
| Sustainable signals's reviews, Lin et al. (2023) | online product reviews (user comments) | relevant/irrelevant: Contains terms related to sustainability |
| … | … | MANAGE, RISK PLAN, MITIGATE, ENGAGE, ASSESS, RISKS: The labels correspond to the 8 questions asked in the NAIC questionnaires |
| Classification of Deceptive Techniques | | |
| LogicClimate, Jin et al. (2022) | texts from climatefeedback.org | Faulty Generalization, Ad Hominem, Ad Populum, False Causality, etc.: Classifies fallacies (multi-label) |
| Contrarian Claims, Coan et al. (2021) | paragraphs from conservative think tanks | No Claim, Global Warming is not happening, Climate Solutions won't work, Climate impacts are not bad, etc.: Classifies arguments into super/sub-categories of climate science deniers' arguments |

Table 1: Description of the datasets we collected.
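
To make concrete what relying on "surface-level keyword patterns" could mean for the binary topic-detection datasets in Table 1, the following is a minimal sketch of a keyword-matching baseline. It is not the method or the models evaluated in this work: the keyword list, the function names `keyword_baseline` and `accuracy`, and the toy examples are all assumptions made for illustration.

```python
import re

# Hypothetical keyword list chosen for illustration; not taken from the paper.
CLIMATE_KEYWORDS = (
    "climate", "global warming", "greenhouse gas",
    "carbon", "emission", "renewable",
)

# One case-insensitive pattern that matches any of the keywords as a substring.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in CLIMATE_KEYWORDS), re.IGNORECASE
)


def keyword_baseline(text: str) -> int:
    """Return 1 if any keyword occurs in the text, else 0."""
    return int(_PATTERN.search(text) is not None)


def accuracy(examples) -> float:
    """Fraction of (text, label) pairs where the baseline equals the label."""
    correct = sum(keyword_baseline(text) == label for text, label in examples)
    return correct / len(examples)


# Toy examples written for this sketch; they are not drawn from any dataset.
toy = [
    ("The company aims to cut carbon emissions by 2030.", 1),
    ("Quarterly revenue grew by 4%.", 0),
    ("Rising sea levels threaten coastal assets.", 1),  # no keyword present
]

if __name__ == "__main__":
    print(f"keyword baseline accuracy on toy data: {accuracy(toy):.2f}")
```

If a baseline of this kind scores close to a fine-tuned model on a dataset, that suggests the labels can largely be recovered from vocabulary alone. The third toy example shows the kind of input such a baseline misses: it is climate-related but contains none of the listed keywords.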