ICLR2025
Self-Supervised Diffusion Models for Electron-Aware Molecular Representation Learning
Gyoung S. Na, Chanyoung Park
Abstract
Characterizing biological and environmental samples at a molecular level primarily uses tandem mass spectroscopy (MS/MS), yet the interpretation of tandem mass spectra from untargeted metabolomics experiments remains a challenge. Existing computational methods for predictions from mass spectra rely on limited spectral libraries and on hard-coded human expertise. Here we introduce a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated tandem mass spectra from our GNPS Experimental Mass Spectra (GeMS) dataset mined from the MassIVE GNPS repository. We show that pre-training our model to predict masked spectral peaks a nd c hr om at og raphic retention orders leads to the emergence of rich representations of molecular structures, which we named Deep Representations Empowering the Annotation of Mass Spectra (DreaMS). Further fine-tuning the neural n e t w ork y i e lds state-of-the-art performance across a variety of tasks. We make our new dataset and model available to the community and release the DreaMS Atlas-a molecular network of 201 million MS/MS spectra constructed using DreaMS annotations.