ACL 2020

Amalgamation of protein sequence, structure and textual information for improving protein-protein interaction identification

Pratik Dutta, Sriparna Saha

Cited by 11

Abstract

An in-depth exploration of protein-protein interactions (PPI) is essential for understanding the metabolism and regulation of biological entities such as proteins and carbohydrates. Most recent PPI tasks in the BioNLP domain have been carried out using textual data alone. In this paper, we argue that incorporating multimodal cues can improve the automatic identification of PPI. As a first step towards enabling the development of multimodal approaches for PPI identification, we have developed two multimodal datasets, which are multimodal extensions of two popular benchmark PPI corpora (BioInfer and HPRD50). Besides the existing textual modality, two new modalities, 3D protein structure and underlying genomic sequence, are added to each instance. Further, a novel deep multimodal architecture is implemented to efficiently predict protein interactions from the developed datasets. A detailed experimental analysis reveals the superiority of the multimodal approach over strong baselines, including unimodal approaches and state-of-the-art methods, on both of the generated multimodal datasets. The developed multimodal datasets are available at https://github.com/sduttap16/MM_PPI_NLP.