Summary
Problem: Virtual compound-protein interaction (CPI) screening models often generalize poorly and offer little explainability, especially for unseen molecules or proteins.
Approach: Takes compound SMILES strings and protein sequences (plus an optional CPI graph) and encodes them via GCN + ChemBERTa and CNN + ESM, respectively. A Graph Autoencoder (teacher) embeds the graph structure, and a student network is trained via knowledge distillation (MSE + cosine loss), employing a Bilinear Attention Network (BAN) and domain adaptation (CDAN) for inductive generalization.
Impact: Delivers accurate CPI predictions (active vs. inactive interactions) with interpretable attention maps. Ideal for virtual screening, de novo drug discovery, and cross-domain generalization in molecular design pipelines.
I/O
Inputs:
- Compound SMILES strings
- Protein amino acid sequences
- Optional: bipartite CPI graph (nodes = compounds/proteins, edges = known interactions)
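A minimal sketch of how the optional bipartite CPI graph could be assembled from labelled interaction pairs. The `(smiles, sequence, label)` triple format and the helper name are illustrative assumptions, not taken from the repo:

```python
# Build a toy bipartite CPI graph from (smiles, sequence, label) triples.
# Nodes are compounds and proteins; only active pairs (label = 1) become edges.
def build_cpi_graph(triples):
    """Return adjacency sets keyed by typed node ids."""
    adj = {}
    for smiles, seq, label in triples:
        c, p = ("compound", smiles), ("protein", seq)
        adj.setdefault(c, set())
        adj.setdefault(p, set())
        if label == 1:          # inactive pairs contribute nodes but no edge
            adj[c].add(p)
            adj[p].add(c)
    return adj

triples = [
    ("CCO", "MKTAYIAK", 1),       # active pair -> edge
    ("c1ccccc1", "MKTAYIAK", 0),  # inactive pair -> no edge
]
graph = build_cpi_graph(triples)
```

In the inductive setting, a compound or protein absent from this graph is still scored through the sequence/SMILES encoders alone.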
Outputs:
- Binary prediction: active vs inactive compound-protein interaction
- Attention weights (interpretable evidence for the predicted binding)
Use Cases:
- Virtual drug screening
- De novo compound discovery
- Cross-domain CPI generalization
- Prediction for unseen molecules or proteins (inductive setting)
Architecture
Compound Encoder:
- GCN (graph structure)
- ChemBERTa (transformer pretrained on SMILES)
- Feature fusion with linear layers
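A minimal PyTorch sketch of the compound branch, assuming a single GCN layer, mean pooling over atoms, and a precomputed SMILES embedding standing in for ChemBERTa; all dimensions and layer names are illustrative:

```python
import torch
import torch.nn as nn

class ToyCompoundEncoder(nn.Module):
    """One GCN layer over the atom graph, fused with a SMILES-level
    embedding (a stand-in for ChemBERTa) via a linear layer."""
    def __init__(self, atom_dim, smiles_dim, out_dim):
        super().__init__()
        self.gcn_w = nn.Linear(atom_dim, out_dim)
        self.fuse = nn.Linear(out_dim + smiles_dim, out_dim)

    def forward(self, x, adj, smiles_emb):
        # Symmetrically normalised adjacency with self-loops: D^-1/2 (A+I) D^-1/2
        a = adj + torch.eye(adj.size(0))
        d = a.sum(-1).rsqrt().diag()
        h = torch.relu(self.gcn_w(d @ a @ d @ x))   # GCN message passing
        g = h.mean(0)                               # mean-pool atoms -> graph vector
        return self.fuse(torch.cat([g, smiles_emb]))  # fuse graph + SMILES views

enc = ToyCompoundEncoder(atom_dim=8, smiles_dim=16, out_dim=32)
x = torch.randn(5, 8)                     # 5 atoms, 8 features each
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.T) > 0).float()         # symmetrise the toy adjacency
out = enc(x, adj, torch.randn(16))
```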
Protein Encoder:
- CNN (sequence-based)
- ESM (protein language model pretrained on protein sequences)
- Feature fusion as above
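The protein branch can be sketched analogously: a 1-D CNN over one-hot residues, fused with a precomputed sequence embedding standing in for ESM. Kernel size, dimensions, and names are assumptions for illustration:

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids

class ToyProteinEncoder(nn.Module):
    """1-D CNN over one-hot residues, fused with a sequence-level
    embedding (a stand-in for ESM) via a linear layer."""
    def __init__(self, esm_dim, out_dim):
        super().__init__()
        self.conv = nn.Conv1d(len(AA), out_dim, kernel_size=3, padding=1)
        self.fuse = nn.Linear(out_dim + esm_dim, out_dim)

    def forward(self, seq, esm_emb):
        onehot = torch.zeros(len(AA), len(seq))
        for i, aa in enumerate(seq):
            onehot[AA.index(aa), i] = 1.0
        h = torch.relu(self.conv(onehot.unsqueeze(0)))  # (1, out_dim, L)
        g = h.mean(-1).squeeze(0)                       # global average pool
        return self.fuse(torch.cat([g, esm_emb]))       # fuse CNN + ESM views

enc = ToyProteinEncoder(esm_dim=16, out_dim=32)
out = enc("MKTAYIAK", torch.randn(16))
```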
Teacher Module:
- GAE (Graph Autoencoder) trained on the bipartite CPI graph
- Encodes structural/graph-level knowledge
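A toy graph autoencoder in the standard GCN-encoder / inner-product-decoder form; the sizes and the use of BCE reconstruction loss are illustrative assumptions, not details from the repo:

```python
import torch
import torch.nn as nn

class ToyGAE(nn.Module):
    """GCN encoder maps nodes to embeddings; an inner-product decoder
    reconstructs edge probabilities for every node pair."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, emb_dim)

    def encode(self, x, adj):
        a = adj + torch.eye(adj.size(0))        # add self-loops
        d = a.sum(-1).rsqrt().diag()            # D^-1/2 normalisation
        return torch.relu(self.w(d @ a @ d @ x))

    def decode(self, z):
        return torch.sigmoid(z @ z.T)           # pairwise edge probabilities

    def forward(self, x, adj):
        z = self.encode(x, adj)
        return z, self.decode(z)

gae = ToyGAE(in_dim=8, emb_dim=4)
x = torch.randn(6, 8)
adj = torch.zeros(6, 6)
adj[0, 3] = adj[3, 0] = 1.0                     # one compound-protein edge
z, recon = gae(x, adj)
loss = nn.functional.binary_cross_entropy(recon, adj)  # reconstruction loss
```

The node embeddings `z` are what the teacher passes on: they carry the graph-level knowledge that the student is distilled against.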
Student Module:
- Learns from teacher via knowledge distillation (MSE + cosine loss)
- Inputs = compound + protein features
- Bilinear Attention Network (BAN)
- Domain Adaptation (CDAN)
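The distillation objective named above (MSE + cosine loss) can be sketched as follows; the weighting `alpha` is an assumed hyperparameter, not a value from the repo:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def distill_loss(student_emb, teacher_emb, alpha=0.5):
    """MSE pulls the student's embedding toward the teacher's values;
    the cosine term (1 - cosine similarity) aligns their direction."""
    mse = F.mse_loss(student_emb, teacher_emb)
    cos = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cos

t = torch.randn(4, 32)             # teacher (GAE) embeddings for a batch
s = t + 0.1 * torch.randn(4, 32)   # student embeddings close to the teacher
loss_close = distill_loss(s, t)
loss_far = distill_loss(torch.randn(4, 32), t)  # unrelated embeddings
```

A student that tracks the teacher closely incurs a much smaller loss than one producing unrelated embeddings, which is what drives the graph knowledge into the sequence-only student used at inference time.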