GraphBAN: a DL model for predicting compound-protein interactions in Python

Summary

Problem: Virtual compound-protein interaction (CPI) screening models often generalize poorly, especially to unseen compounds or proteins, and typically offer little explainability.

Approach: GraphBAN takes compound SMILES strings and protein sequences (plus an optional bipartite CPI graph) and encodes them with GCN + ChemBERTa and CNN + ESM, respectively. A Graph Autoencoder (teacher) embeds the graph structure, and a student network learns from it via knowledge distillation (MSE + cosine loss); the student fuses compound and protein features with a Bilinear Attention Network (BAN) and applies domain adaptation (CDAN) for inductive generalization (a sketch of the combined training objective follows below).
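
A rough illustration of how these objectives might combine during training is sketched below: a binary classification loss plus the distillation terms (MSE + cosine distance to the teacher embeddings) plus a domain-adversarial term. The function name, loss weights, and tensor shapes are assumptions made for this sketch, not GraphBAN's actual API.

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, labels, student_emb, teacher_emb,
                   domain_logits, domain_labels, w_kd=1.0, w_da=1.0):
    """Illustrative composite objective: BCE classification + KD (MSE + cosine) + domain loss.
    Signature and weights are assumptions for this sketch, not GraphBAN's exact interface."""
    # Supervised CPI classification (active vs. inactive)
    cls_loss = F.binary_cross_entropy_with_logits(logits, labels)
    # Knowledge distillation: pull student embeddings toward the (frozen) GAE teacher's
    kd_mse = F.mse_loss(student_emb, teacher_emb)
    kd_cos = (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
    # Adversarial domain term (CDAN discriminator output vs. source/target labels)
    da_loss = F.binary_cross_entropy_with_logits(domain_logits, domain_labels)
    return cls_loss + w_kd * (kd_mse + kd_cos) + w_da * da_loss

# Dummy tensors only to make the shapes concrete
logits = torch.randn(8)                                   # one logit per compound-protein pair
labels = torch.randint(0, 2, (8,)).float()                # 1 = active, 0 = inactive
student_emb, teacher_emb = torch.randn(8, 128), torch.randn(8, 128)
domain_logits = torch.randn(8)                            # CDAN discriminator logits
domain_labels = torch.randint(0, 2, (8,)).float()         # 0 = source, 1 = target domain
print(composite_loss(logits, labels, student_emb, teacher_emb, domain_logits, domain_labels))
```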

Impact: Delivers binary CPI predictions (active vs. inactive) together with interpretable attention maps, making it well suited to virtual screening, de novo drug discovery, and cross-domain generalization in molecular design pipelines.

GitHub repo: GraphBAN

I/O

Inputs:

  • Compound SMILES strings
  • Protein amino acid sequences
  • Optional: bipartite CPI graph (nodes = compounds/proteins, edges = interactions)

Outputs:

  • Binary prediction: active vs. inactive compound-protein interaction
  • Attention weights from the BAN layer for interpreting predictions (a hypothetical I/O sketch follows below)
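
A hypothetical illustration of this I/O contract: pairs stored as a small table, with a prediction returned as a probability plus an attention map. The column names and the commented-out predict call are placeholders for illustration, not GraphBAN's actual interface.

```python
import pandas as pd

# Hypothetical input table: one row per compound-protein pair (column names are illustrative)
pairs = pd.DataFrame({
    "SMILES":  ["CC(=O)Oc1ccccc1C(=O)O",              # aspirin
                "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"],      # caffeine
    "Protein": ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # toy truncated sequence, not a real protein
                "MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQ"], # toy truncated sequence, not a real protein
    "Y":       [1, 0],                                # 1 = active, 0 = inactive (training labels)
})
print(pairs)

# Hypothetical output per pair (the actual call is repository-specific):
# prob, attn = model.predict(pairs.loc[0, "SMILES"], pairs.loc[0, "Protein"])
# prob -> float in [0, 1] (probability of an active interaction)
# attn -> attention map over compound atoms x protein positions for interpretation
```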

Use Cases:

  • Virtual drug screening
  • De novo compound discovery
  • Cross-domain CPI generalization
  • Prediction for unseen molecules or proteins (inductive setting)

Architecture

Compound Encoder:

  • GCN (graph structure)
  • ChemBERTa (transformer pretrained on SMILES)
  • Feature fusion with linear layers (see the sketch below)
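
A minimal sketch of this two-branch compound encoder, assuming a dense adjacency matrix for the molecular graph and using a random tensor as a stand-in for a precomputed ChemBERTa embedding so the snippet runs offline; layer sizes and mean pooling are illustrative, not the repository's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGCNLayer(nn.Module):
    """One GCN layer on a dense adjacency: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj, feats):
        a_hat = adj + torch.eye(adj.size(0))            # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt        # symmetric normalization
        return F.relu(a_norm @ self.lin(feats))

class CompoundEncoder(nn.Module):
    """GCN over the molecular graph + a ChemBERTa-style SMILES embedding, fused by a linear layer."""
    def __init__(self, atom_dim=34, gcn_dim=128, chemberta_dim=384, out_dim=128):
        super().__init__()
        self.gcn1 = DenseGCNLayer(atom_dim, gcn_dim)
        self.gcn2 = DenseGCNLayer(gcn_dim, gcn_dim)
        self.fuse = nn.Linear(gcn_dim + chemberta_dim, out_dim)

    def forward(self, adj, atom_feats, chemberta_emb):
        h = self.gcn2(adj, self.gcn1(adj, atom_feats))  # per-atom embeddings
        graph_vec = h.mean(dim=0)                       # mean-pool atoms into a graph vector
        return self.fuse(torch.cat([graph_vec, chemberta_emb], dim=-1))

# Toy molecule with 5 atoms; in practice atom features and adjacency would come from the SMILES graph
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.T) > 0).float()                       # make the adjacency symmetric
atom_feats = torch.randn(5, 34)
chemberta_emb = torch.randn(384)                        # stand-in for a pretrained ChemBERTa embedding
print(CompoundEncoder()(adj, atom_feats, chemberta_emb).shape)  # torch.Size([128])
```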

Protein Encoder:

  • CNN (sequence-based)
  • ESM (protein language model pretrained on protein sequences)
  • Feature fusion as above (see the sketch below)
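
The protein branch can be sketched the same way, assuming a learned residue embedding followed by 1D convolutions and a random stand-in for a precomputed ESM sequence embedding; the vocabulary size and layer dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ProteinEncoder(nn.Module):
    """1D CNN over the amino-acid sequence + an ESM-style sequence embedding, fused by a linear layer."""
    def __init__(self, vocab_size=26, emb_dim=64, cnn_dim=128, esm_dim=320, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.cnn = nn.Sequential(
            nn.Conv1d(emb_dim, cnn_dim, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(cnn_dim, cnn_dim, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.fuse = nn.Linear(cnn_dim + esm_dim, out_dim)

    def forward(self, seq_tokens, esm_emb):
        x = self.embed(seq_tokens).transpose(1, 2)   # (batch, emb_dim, seq_len) for Conv1d
        h = self.cnn(x).mean(dim=2)                  # mean-pool over sequence positions
        return self.fuse(torch.cat([h, esm_emb], dim=-1))

# Toy batch of 2 integer-encoded sequences of length 200 (0 reserved for padding)
seq_tokens = torch.randint(1, 26, (2, 200))
esm_emb = torch.randn(2, 320)                        # stand-in for a pretrained ESM sequence embedding
print(ProteinEncoder()(seq_tokens, esm_emb).shape)   # torch.Size([2, 128])
```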

Teacher Module:

  • GAE (Graph Autoencoder) trained on the bipartite CPI graph
  • Encodes structural/graph-level knowledge that the student distills (see the sketch below)
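
A minimal sketch of a graph autoencoder over the bipartite CPI graph, with an inner-product decoder reconstructing interaction edges. For brevity the encoder is a single linear layer per node type rather than message passing over the graph, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BipartiteGAE(nn.Module):
    """Teacher: encode compound/protein nodes, decode interactions with an inner product."""
    def __init__(self, comp_dim=128, prot_dim=128, emb_dim=64):
        super().__init__()
        self.comp_enc = nn.Linear(comp_dim, emb_dim)
        self.prot_enc = nn.Linear(prot_dim, emb_dim)

    def forward(self, comp_feats, prot_feats):
        zc = F.relu(self.comp_enc(comp_feats))        # compound node embeddings
        zp = F.relu(self.prot_enc(prot_feats))        # protein node embeddings
        adj_logits = zc @ zp.T                        # inner-product decoder over all pairs
        return zc, zp, adj_logits

# Toy bipartite graph: 10 compounds x 6 proteins with a binary interaction matrix
comp_feats = torch.randn(10, 128)
prot_feats = torch.randn(6, 128)
interactions = (torch.rand(10, 6) > 0.8).float()

zc, zp, adj_logits = BipartiteGAE()(comp_feats, prot_feats)
# Reconstruction loss on the observed edges; the trained embeddings later serve as distillation targets
recon_loss = F.binary_cross_entropy_with_logits(adj_logits, interactions)
print(recon_loss.item())
```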

Student Module:

  • Learns from the teacher via knowledge distillation (MSE + cosine losses)
  • Inputs = fused compound and protein features
  • Bilinear Attention Network (BAN)
  • Domain Adaptation (CDAN) for inductive generalization (see the BAN and CDAN sketches below)
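
A compact sketch of a single-head bilinear attention layer between compound atom features and protein position features, returning both an interaction logit and the attention map used for interpretation. It follows the general BAN recipe (pairwise bilinear scores, a softmax-normalized map, bilinear pooling); the single head and all dimensions are simplifications, not GraphBAN's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleBilinearAttention(nn.Module):
    """Single-head bilinear attention between compound atoms and protein positions (simplified BAN)."""
    def __init__(self, comp_dim=128, prot_dim=128, hidden=256):
        super().__init__()
        self.U = nn.Linear(comp_dim, hidden, bias=False)   # compound projection
        self.V = nn.Linear(prot_dim, hidden, bias=False)   # protein projection
        self.out = nn.Linear(hidden, 1)                    # binary interaction logit

    def forward(self, comp_feats, prot_feats):
        cq = self.U(comp_feats)                            # (n_atoms, hidden)
        pq = self.V(prot_feats)                            # (n_positions, hidden)
        scores = cq @ pq.T / cq.size(-1) ** 0.5            # pairwise attention logits
        attn = F.softmax(scores.flatten(), dim=0).view_as(scores)   # normalized attention map
        # Bilinear pooling: joint_k = sum_ij attn_ij * cq_ik * pq_jk
        joint = torch.einsum('ik,ij,jk->k', cq, attn, pq)
        return self.out(joint), attn                       # logit + interpretable attention map

ban = SimpleBilinearAttention()
logit, attn = ban(torch.randn(20, 128), torch.randn(300, 128))
print(logit.shape, attn.shape)                             # torch.Size([1]) torch.Size([20, 300])
```

The returned attention map is what provides the interpretable attention weights listed under Outputs.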
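
Finally, a sketch of the domain-adaptation piece in the standard CDAN style: a gradient-reversal function flips gradients flowing back into the feature extractor, and the domain discriminator is conditioned on the outer product of pair features and predicted class probabilities. This is the generic CDAN recipe with illustrative dimensions, not the repository's exact code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class CDANHead(nn.Module):
    """Domain discriminator conditioned on the outer product of features and class probabilities."""
    def __init__(self, feat_dim=256, num_classes=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                      # source-vs-target domain logit
        )

    def forward(self, feats, class_probs, lamb=1.0):
        # Per-sample outer product: (batch, feat_dim, num_classes), flattened into a conditioning vector
        cond = torch.bmm(feats.unsqueeze(2), class_probs.unsqueeze(1)).flatten(1)
        return self.net(GradReverse.apply(cond, lamb)).squeeze(-1)

feats = torch.randn(8, 256, requires_grad=True)        # joint pair features (e.g., from the BAN layer)
class_probs = torch.softmax(torch.randn(8, 2), dim=1)  # predicted active/inactive probabilities
domain_logits = CDANHead()(feats, class_probs)
print(domain_logits.shape)                             # torch.Size([8])
```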