ACL2023

Chemical Language Understanding Benchmark

Yunsoo Kim, Hyuk Ko, Jane Lee, Hyun Young Heo, Jinyoung Yang, Sungsoo Lee, Kyu-Hwang Lee

Abstract

In this paper, we introduce CLUB (Chemical Language Understanding Benchmark), a set of benchmark datasets intended to facilitate NLP research in the chemical industry. CLUB comprises 4 datasets covering text and token classification tasks. To the best of our knowledge, it is one of the first chemical language understanding benchmarks provided by an industrial organization that includes tasks for both patents and literature articles. All the datasets were built in-house from scratch by chemists. Finally, we evaluate various language models based on BERT and RoBERTa on the datasets and demonstrate that models perform better when the domain of the pre-trained model is closer to the chemistry domain. We provide baselines for our benchmark, with an average score of 0.7818, and we hope this benchmark will be used by many researchers in both industry and academia.