Summary
Problem: Virtual compound-protein interaction (CPI) screening models often generalize poorly and offer little explainability, especially for unseen molecules or proteins.
Approach: Takes compound SMILES strings and protein sequences (plus an optional CPI graph) and encodes them via GCN + ChemBERTa and CNN + ESM, respectively. A Graph Autoencoder (teacher) embeds the graph structure, and a student network is trained via knowledge distillation (MSE + cosine loss), employing a Bilinear Attention Network (BAN) and domain adaptation (CDAN) for inductive generalization.
Impact: Delivers accurate CPI predictions (active vs. inactive interactions) with interpretable attention maps. Ideal for virtual screening, de novo drug discovery, and cross-domain generalization in molecular design pipelines.
I/O
Inputs:
- Compound SMILES strings
- Protein amino acid sequences
- Optional: bipartite CPI graph (nodes = compounds/proteins, edges = known interactions)
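A minimal sketch of how the optional bipartite CPI graph could be assembled from labelled interaction pairs. The `(smiles, sequence, label)` triple format and the helper name are illustrative assumptions, not taken from the repo:

```python
# Build a toy bipartite CPI graph from (smiles, sequence, label) triples.
# Nodes are compounds and proteins; only active pairs (label = 1) become edges.
def build_cpi_graph(triples):
    """Return adjacency sets keyed by typed node ids."""
    adj = {}
    for smiles, seq, label in triples:
        c, p = ("compound", smiles), ("protein", seq)
        adj.setdefault(c, set())
        adj.setdefault(p, set())
        if label == 1:          # inactive pairs contribute nodes but no edge
            adj[c].add(p)
            adj[p].add(c)
    return adj

triples = [
    ("CCO", "MKTAYIAK", 1),       # active pair -> edge
    ("c1ccccc1", "MKTAYIAK", 0),  # inactive pair -> no edge
]
graph = build_cpi_graph(triples)
```

In the inductive setting, a compound or protein absent from this graph is still scored through the sequence/SMILES encoders alone.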
Outputs:
- Binary prediction: active vs inactive compound-protein interaction
- Attention weights (interpretable evidence for the predicted binding)
Use Cases:
- Virtual drug screening
- De novo compound discovery
- Cross-domain CPI generalization
- Prediction for unseen molecules or proteins (inductive setting)
Architecture
Compound Encoder:
- GCN (graph structure)
- ChemBERTa (transformer pretrained on SMILES)
- Feature fusion with linear layers
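A minimal PyTorch sketch of the compound branch, assuming a single GCN layer, mean pooling over atoms, and a precomputed SMILES embedding standing in for ChemBERTa; all dimensions and layer names are illustrative:

```python
import torch
import torch.nn as nn

class ToyCompoundEncoder(nn.Module):
    """One GCN layer over the atom graph, fused with a SMILES-level
    embedding (a stand-in for ChemBERTa) via a linear layer."""
    def __init__(self, atom_dim, smiles_dim, out_dim):
        super().__init__()
        self.gcn_w = nn.Linear(atom_dim, out_dim)
        self.fuse = nn.Linear(out_dim + smiles_dim, out_dim)

    def forward(self, x, adj, smiles_emb):
        # Symmetrically normalised adjacency with self-loops: D^-1/2 (A+I) D^-1/2
        a = adj + torch.eye(adj.size(0))
        d = a.sum(-1).rsqrt().diag()
        h = torch.relu(self.gcn_w(d @ a @ d @ x))   # GCN message passing
        g = h.mean(0)                               # mean-pool atoms -> graph vector
        return self.fuse(torch.cat([g, smiles_emb]))  # fuse graph + SMILES views

enc = ToyCompoundEncoder(atom_dim=8, smiles_dim=16, out_dim=32)
x = torch.randn(5, 8)                     # 5 atoms, 8 features each
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.T) > 0).float()         # symmetrise the toy adjacency
out = enc(x, adj, torch.randn(16))
```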
Protein Encoder:
- CNN (sequence-based)
- ESM (protein language model pretrained on protein sequences)
- Feature fusion as above
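The protein branch can be sketched analogously: a 1-D CNN over one-hot residues, fused with a precomputed sequence embedding standing in for ESM. Kernel size, dimensions, and names are assumptions for illustration:

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids

class ToyProteinEncoder(nn.Module):
    """1-D CNN over one-hot residues, fused with a sequence-level
    embedding (a stand-in for ESM) via a linear layer."""
    def __init__(self, esm_dim, out_dim):
        super().__init__()
        self.conv = nn.Conv1d(len(AA), out_dim, kernel_size=3, padding=1)
        self.fuse = nn.Linear(out_dim + esm_dim, out_dim)

    def forward(self, seq, esm_emb):
        onehot = torch.zeros(len(AA), len(seq))
        for i, aa in enumerate(seq):
            onehot[AA.index(aa), i] = 1.0
        h = torch.relu(self.conv(onehot.unsqueeze(0)))  # (1, out_dim, L)
        g = h.mean(-1).squeeze(0)                       # global average pool
        return self.fuse(torch.cat([g, esm_emb]))       # fuse CNN + ESM views

enc = ToyProteinEncoder(esm_dim=16, out_dim=32)
out = enc("MKTAYIAK", torch.randn(16))
```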
Teacher Module:
- GAE (Graph Autoencoder) trained on the bipartite CPI graph
- Encodes structural/graph-level knowledge
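A toy graph autoencoder in the standard GCN-encoder / inner-product-decoder form; the sizes and the use of BCE reconstruction loss are illustrative assumptions, not details from the repo:

```python
import torch
import torch.nn as nn

class ToyGAE(nn.Module):
    """GCN encoder maps nodes to embeddings; an inner-product decoder
    reconstructs edge probabilities for every node pair."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, emb_dim)

    def encode(self, x, adj):
        a = adj + torch.eye(adj.size(0))        # add self-loops
        d = a.sum(-1).rsqrt().diag()            # D^-1/2 normalisation
        return torch.relu(self.w(d @ a @ d @ x))

    def decode(self, z):
        return torch.sigmoid(z @ z.T)           # pairwise edge probabilities

    def forward(self, x, adj):
        z = self.encode(x, adj)
        return z, self.decode(z)

gae = ToyGAE(in_dim=8, emb_dim=4)
x = torch.randn(6, 8)
adj = torch.zeros(6, 6)
adj[0, 3] = adj[3, 0] = 1.0                     # one compound-protein edge
z, recon = gae(x, adj)
loss = nn.functional.binary_cross_entropy(recon, adj)  # reconstruction loss
```

The node embeddings `z` are what the teacher passes on: they carry the graph-level knowledge that the student is distilled against.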
Student Module:
- Learns from teacher via knowledge distillation (MSE + cosine loss)
- Inputs = compound + protein features
- Bilinear Attention Network (BAN)
- Domain Adaptation (CDAN)
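The distillation objective named above (MSE + cosine loss) can be sketched as follows; the weighting `alpha` is an assumed hyperparameter, not a value from the repo:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def distill_loss(student_emb, teacher_emb, alpha=0.5):
    """MSE pulls the student's embedding toward the teacher's values;
    the cosine term (1 - cosine similarity) aligns their direction."""
    mse = F.mse_loss(student_emb, teacher_emb)
    cos = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cos

t = torch.randn(4, 32)             # teacher (GAE) embeddings for a batch
s = t + 0.1 * torch.randn(4, 32)   # student embeddings close to the teacher
loss_close = distill_loss(s, t)
loss_far = distill_loss(torch.randn(4, 32), t)  # unrelated embeddings
```

A student that tracks the teacher closely incurs a much smaller loss than one producing unrelated embeddings, which is what drives the graph knowledge into the sequence-only student used at inference time.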