ACL 2024

MultiLegalPile: A 689GB Multilingual Legal Corpus

Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, Daniel E. Ho

Abstract

Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, few datasets are available so far for specialized, critical domains such as law, and the available ones are often small and only in English. To fill this gap, we curate and release MULTILEGALPILE, a 689GB corpus in 24 languages from 17 jurisdictions. MULTILEGALPILE includes diverse legal data sources and allows for pretraining NLP models under fair use, with most of the dataset licensed very permissively. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on the language-specific subsets, and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME, and our English models set a new SotA on LexGLUE. We release the dataset, trained models, and all code under the most open licenses possible.