NeurIPS 2023
ProteinShake: Building datasets and benchmarks for deep learning on protein structures
Tim Kucera, Carlos G. Oliver, Dexiong Chen, Karsten M. Borgwardt
23 citations
Abstract
We present ProteinShake, a Python software package that simplifies dataset creation and model evaluation for deep learning on protein structures. Users can create custom datasets or load an extensive set of pre-processed datasets from biological data repositories such as the Protein Data Bank (PDB) and AlphaFoldDB. Each dataset is associated with prediction tasks and evaluation functions covering a broad array of biological challenges. A benchmark on these tasks shows that pretraining almost always improves performance, that the optimal data modality (graphs, voxel grids, or point clouds) is task-dependent, and that models struggle to generalize to new structures. ProteinShake makes protein structure data easily accessible and comparison among models straightforward, providing challenging benchmark settings with real-world implications. ProteinShake is available at https://proteinshake.ai.
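To make the three data modalities named in the abstract concrete, the following is a minimal, standard-library sketch of how one set of residue coordinates could be viewed as a point cloud, a graph, or a voxel grid. It is not ProteinShake's API: the function names (`to_point_cloud`, `to_graph`, `to_voxel_grid`), the parameters `eps` and `voxel_size`, and the toy coordinates are all assumptions chosen for illustration.

```python
import math

# Toy C-alpha coordinates in angstroms for a 4-residue fragment.
# These values are made up for illustration, not taken from a real structure.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 3.8, 0.0)]


def to_point_cloud(coords):
    """Point cloud: the plain list of residue coordinates."""
    return [tuple(c) for c in coords]


def to_graph(coords, eps=5.0):
    """Epsilon-neighborhood graph: connect residue pairs closer than eps angstroms.

    Returns an undirected edge list of index pairs (i, j) with i < j.
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < eps:
                edges.append((i, j))
    return edges


def to_voxel_grid(coords, voxel_size=4.0):
    """Sparse voxel grid: map each occupied voxel index to the residues inside it."""
    grid = {}
    for idx, (x, y, z) in enumerate(coords):
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid.setdefault(key, []).append(idx)
    return grid


point_cloud = to_point_cloud(coords)
edges = to_graph(coords, eps=5.0)
voxels = to_voxel_grid(coords, voxel_size=4.0)
```

With these toy inputs, the graph keeps only the three consecutive-residue contacts, while the voxel grid groups the four residues into two occupied cells. The same structure therefore yields quite different inputs depending on the representation, which is one intuition for why the best modality could vary by task.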