NeurIPS 2024

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje

Abstract

Recent advances in self-supervised models for natural language, vision, and protein sequences have catalyzed the development of genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various downstream genomic prediction, interpretation, and design tasks. However, existing benchmarks do not adequately assess the capabilities of DNALMs on an important class of non-coding DNA elements critical for regulating gene activity. Here, we introduce DART-Eval, a suite of representative benchmarks focused on regulatory DNA to evaluate the performance of DNALMs across zero-shot, probed, and fine-tuned settings against contemporary ab initio models as baselines. DART-Eval addresses biologically relevant tasks including sequence motif discovery, cell-type-specific regulatory activity prediction, and counterfactual prediction of regulatory genetic variants. Our systematic evaluations reveal that current annotation-agnostic DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, despite requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our benchmark datasets and evaluation framework are available at https://github.com/kundajelab/DART-Eval