ICLR 2026

Fast Proteome-Scale Protein Interaction Retrieval via Residue-Level Factorization

Jianan Zhao, Zhihao Zhan, Narendra Chaudhary, Xinyu Yuan, Zuobai Zhang, Qian Cong, Jian Zhou, Sanchit Misra, Jian Tang

Abstract

Protein-protein interactions (PPIs) are mediated at the residue level. Most sequence-based PPI models score residue-residue interactions across the two proteins, which can yield accurate interaction scores but is too slow to scale. At proteome scale, identifying candidate PPIs requires evaluating nearly all possible protein pairs. For $N$ proteins of average length $L$, exhaustive all-against-all search requires $\mathcal{O}(N^2L^2)$ computation, rendering conventional approaches computationally impractical. We introduce RaftPPI, a scalable framework that approximates residue-level PPI modeling while enabling efficient large-scale retrieval. RaftPPI represents residue interactions with a Gaussian kernel, approximated efficiently via structured random Fourier features, and applies a low-rank factorized attention mechanism that admits pooling into a compact embedding per protein. Each protein is encoded once into an indexable embedding, allowing approximate nearest-neighbor search to replace exhaustive pairwise scoring and reducing proteome-wide retrieval from months to minutes on a single GPU or CPU. On the human proteome with the D-SCRIPT dataset, RaftPPI retrieves the top 20% of pairs from $\sim$200M candidate pairs in 5.7 minutes on an A100 GPU, or 3.3 minutes on an Intel Xeon 6980P CPU, covering 75.1% of the true interacting pairs, compared to 4.9 GPU-months for the best prior method (61.2%). Across seven benchmarks with sequence- and degree-controlled splits, RaftPPI achieves state-of-the-art PPI classification and retrieval performance while enabling residue-aware, retrieval-friendly screening at proteome scale.
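To illustrate why a kernel over residue pairs can be pooled into one embedding per protein, the sketch below uses plain random Fourier features (i.i.d. Gaussian frequencies, not the structured variant named above) and uniform sum pooling (not the low-rank factorized attention). It is not the RaftPPI implementation. The residue vectors, the dimensions `d` and `D`, the bandwidth `sigma`, and the function names are all hypothetical choices made for illustration. The point is only that the sum of Gaussian kernel values over all residue pairs is approximately the dot product of two pooled feature vectors, so each protein needs to be encoded once.

```python
import numpy as np

def rff_features(x, omega, phase):
    """Map residue vectors x of shape (L, d) to random Fourier features (L, D)."""
    num_features = omega.shape[1]
    return np.sqrt(2.0 / num_features) * np.cos(x @ omega + phase)

def exact_pair_score(a, b, sigma):
    """Sum of Gaussian kernel values over all residue pairs; costs O(La * Lb)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2)).sum()

rng = np.random.default_rng(0)
d, D, sigma = 4, 16384, 2.0  # hypothetical residue dim, feature count, bandwidth

# Frequencies drawn from N(0, sigma^-2 I) and phases from U[0, 2*pi) give
# E[z(x) . z(y)] = exp(-||x - y||^2 / (2 * sigma^2)).
omega = rng.normal(scale=1.0 / sigma, size=(d, D))
phase = rng.uniform(0.0, 2.0 * np.pi, size=D)

# Stand-ins for per-residue representations of two proteins (random toy data).
a = rng.normal(size=(30, d))
b = rng.normal(size=(40, d))

# Encode each protein once: sum-pool residue features into a single vector.
emb_a = rff_features(a, omega, phase).sum(axis=0)
emb_b = rff_features(b, omega, phase).sum(axis=0)

approx = float(emb_a @ emb_b)              # O(D) per pair after encoding
exact = float(exact_pair_score(a, b, sigma))  # O(La * Lb) per pair
print(f"exact={exact:.2f} approx={approx:.2f}")
```

Because the pair score reduces to an inner product between fixed-length vectors, the embeddings of all proteins can be placed in a maximum-inner-product or nearest-neighbor index instead of scoring every pair residue by residue.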