ICML 2025

Diffusion on Language Model Encodings for Protein Sequence Generation

Viacheslav Meshchaninov, Pavel V. Strashnov, Andrey Shevtsov, Fedor Nikolaev, Nikita Ivanisenko, Olga L. Kardymon, Dmitry P. Vetrov

Abstract

Protein design requires a deep understanding of the intricate nature of the protein universe. While approaches based on discrete diffusion and autoregression are being actively developed for protein sequence generation, continuous diffusion remains underappreciated and underexplored. To address this gap, we introduce DiMA, a latent diffusion model that applies Gaussian diffusion to representations derived from protein language models, such as ESM-2 and CHEAP, to generate amino acid sequences. We quantitatively investigate the impact of various components of the latent diffusion model and of the protein encoders, revealing their contributions to improved protein generation performance. Additionally, we conduct an extensive evaluation of existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, novelty, diversity, and distribution matching of generated proteins. Our findings demonstrate that DiMA consistently produces novel, high-quality, and diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space. Furthermore, we show that the proposed model can be easily adapted to conditional tasks, such as protein family generation and inpainting. This work advances the field of protein design by providing a robust framework for latent diffusion on various protein representations, facilitating high-quality protein sequence generation.