Business Problem: predict, fully computationally, whether an edit (AAV- or LNP-delivered payload) reaches its intended tissue(s) under a given context
- Decision unit: p(hit | vector, payload, dose, route, species, tissue)
- Root action: score() candidate vectors for a target tissue
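A minimal sketch of what the score() action could look like, assuming a simple dataclass for the decision-unit inputs; the field names mirror p(hit | vector, payload, dose, route, species, tissue) and everything else is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Candidate:
    # Inputs mirroring p(hit | vector, payload, dose, route, species, tissue)
    vector: str    # AAV capsid sequence or LNP formulation ID
    payload: str   # edit cassette description
    dose: float    # e.g. vector genomes/kg or mg/kg
    route: str     # "IV", "IM", "IT", ...
    species: str   # "human", "NHP", "mouse", ...

def score(candidates: List[Candidate], target_tissue: str) -> Dict[int, float]:
    """Return p(hit | candidate, target_tissue) per candidate; ranking/top-k happens downstream."""
    raise NotImplementedError  # filled in by the trained model described below
```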
Goal Definition: deliver AUROC ≥ 0.80 per tissue and strong top-k ranking (recall@3), starting with human liver, muscle, CNS/brain, and eye over IM/IV/IT routes. We are looking for efficient delivery, which depends on:
- Cell entry success: how well the capsid binds and gets inside the target cell
- Intracellular processing: how effectively the vector escapes degradation, uncoats, and delivers its genome to the nucleus
- Expression yield: how much functional RNA/protein is made from that delivered genome
- Target specificity: doing the above in the desired cell type and region rather than broadly and off-target
Data Collection and Preparation:
- NCBI GEO/SRA: biodistribution capsid library screens, RNA-seq
- Addgene: AAV serotype/capsid variant metadata
- Human Protein Atlas + GTEx: receptor/entry-factor expression by tissue
- Literature tables (LNP): lipid compositions, doses, routes (manual seed set to bootstrap model before automated large-scale ingestion)
Tools
- Ingestion: Biopython (Entrez), SRA-tools, requests (ingestion sketch after this list)
- QC/quant: FastQC, Cutadapt, Salmon (transcript quantification to TPM), MultiQC (aggregated read-quality reports)
- Orchestration: Nextflow/Snakemake
- Data versioning: DVC on AWS S3
- Tables: Parquet via Pandas/Polars
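A hedged sketch of the GEO ingestion step via Biopython's Entrez module; the search term, retmax, and email are illustrative placeholders, and the summary fields returned by the gds database may differ from what's shown.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact email; placeholder

# Illustrative query against GEO DataSets for capsid-library biodistribution studies
handle = Entrez.esearch(db="gds", term="AAV capsid library biodistribution", retmax=50)
record = Entrez.read(handle)
handle.close()

for gds_id in record["IdList"]:
    summary = Entrez.read(Entrez.esummary(db="gds", id=gds_id))
    print(gds_id, summary[0].get("title", ""))
```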
Feature Engineering:
AAV: capsid sequence (one-hot/k-mer), known motif flags, edit cassette (the full DNA/RNA payload carried by the vector: promoter, coding sequence, UTRs, polyA), dose, route, species (featurizer sketch after this list)
LNP: lipid SMILES → RDKit descriptors, N/P ratio, PEG %, dose, route
Tissue context: HPA/GTEx receptor expression, immune markers
Tools: Pandas/Polars, scikit-learn transformers, RDKit, NumPy; caching with Feather/Parquet
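Illustrative featurizers for the two branches, assuming a simple overlapping k-mer count for AAV capsid protein sequences and a handful of RDKit descriptors for LNP lipid SMILES; k, the descriptor set, and the normalization are assumptions, not a fixed recipe.

```python
from collections import Counter
from itertools import product

import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def kmer_features(seq: str, k: int = 3) -> np.ndarray:
    """Normalized counts of overlapping amino-acid k-mers over the 20-letter alphabet."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    vocab = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vec = np.zeros(len(vocab), dtype=np.float32)
    for kmer, n in counts.items():
        if kmer in vocab:
            vec[vocab[kmer]] = n
    return vec / max(vec.sum(), 1.0)

def lipid_descriptors(smiles: str) -> np.ndarray:
    """A few RDKit descriptors for an ionizable-lipid SMILES string (assumes valid SMILES)."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array([
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumRotatableBonds(mol),
    ], dtype=np.float32)
```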
Model Training:
Baseline:
- logistic regression and XGBoost (super quick MVP)
v1: multi-task classifier (per tissue) + calibration
- Seq branch (AAV): small transformer on ESM embeddings (compact ESM model) or a 1D CNN
- Chem branch (LNP): MLP on RDKit features
- Fusion: concat + MLP → sigmoid heads per tissue (model sketch after this list)
Stack: PyTorch + Lightning, Optuna for hyperparameter optimization; class imbalance handled with focal loss, mixup optional
Tracking: Weights & Biases for runs, MLflow for the model registry
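A minimal Lightning sketch of the v1 fusion architecture (seq branch + chem branch, concatenation, per-tissue sigmoid heads); the tissue list, layer sizes, embedding dimensions, and the use of precomputed ESM embeddings as inputs are assumptions, and plain BCE stands in for the focal loss mentioned above.

```python
import torch
import torch.nn as nn
import lightning as L  # PyTorch Lightning 2.x import style

TISSUES = ["liver", "muscle", "CNS", "eye"]  # illustrative per-tissue heads

class FusionClassifier(L.LightningModule):
    def __init__(self, seq_dim: int = 320, chem_dim: int = 4, hidden: int = 128):
        super().__init__()
        self.seq_branch = nn.Sequential(nn.Linear(seq_dim, hidden), nn.ReLU())   # precomputed ESM embedding in
        self.chem_branch = nn.Sequential(nn.Linear(chem_dim, hidden), nn.ReLU()) # RDKit descriptors in
        self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.heads = nn.Linear(hidden, len(TISSUES))  # one logit per tissue

    def forward(self, seq_emb, chem_feats):
        z = torch.cat([self.seq_branch(seq_emb), self.chem_branch(chem_feats)], dim=-1)
        return self.heads(self.fusion(z))  # raw logits; sigmoid applied in loss/metrics

    def training_step(self, batch, batch_idx):
        seq_emb, chem_feats, labels = batch
        logits = self(seq_emb, chem_feats)
        # BCE here; a focal loss would swap in for class imbalance, as noted in the stack above
        loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)
```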
Model Evaluation:
Protocol: leave-study-out splits to avoid leakage, species/route stratification
Metrics: AUROC/PR-AUC per tissue, expected calibration error (ECE), Brier score, reliability plots (metrics sketch below)
Data validation: Great Expectations
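A sketch of the leave-study-out protocol and per-tissue metrics, assuming scikit-learn's GroupKFold grouped by study ID and a simple equal-width-bin ECE helper (scikit-learn has no built-in ECE); the model interface and array layout are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: |observed hit rate - mean confidence| weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

def evaluate(model, X, y, study_ids, n_splits: int = 5):
    """Leave-study-out CV: each fold holds out whole studies to avoid leakage."""
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=study_ids):
        model.fit(X[train_idx], y[train_idx])
        p = model.predict_proba(X[test_idx])[:, 1]
        print(
            f"AUROC={roc_auc_score(y[test_idx], p):.3f}",
            f"PR-AUC={average_precision_score(y[test_idx], p):.3f}",
            f"Brier={brier_score_loss(y[test_idx], p):.3f}",
            f"ECE={expected_calibration_error(y[test_idx], p):.3f}",
        )
```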
Model Deployment:
Artifact: TorchScript or ONNX export (export sketch below)
API: FastAPI + Uvicorn
Container: Docker, GPU-enabled images since training runs on CUDA
Registry: MLflow model registry (see Tracking above)
Infra: Triton Inference Server (v1), S3 for artifacts
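A hedged sketch of the ONNX export step for Triton, reusing the FusionClassifier sketch from the training section; input names, shapes, and the output filename are assumptions.

```python
import torch

model = FusionClassifier()  # trained weights would be loaded here (e.g. from a checkpoint)
model.eval()

dummy_seq = torch.randn(1, 320)  # ESM embedding size assumed in the model sketch
dummy_chem = torch.randn(1, 4)   # RDKit descriptor count assumed in the featurizer sketch

torch.onnx.export(
    model,
    (dummy_seq, dummy_chem),
    "fusion_classifier.onnx",
    input_names=["seq_emb", "chem_feats"],
    output_names=["tissue_logits"],
    dynamic_axes={"seq_emb": {0: "batch"}, "chem_feats": {0: "batch"}},
)
```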
Model Serving:
Interface: REST /score accepts JSON {vector, payload, context}, returns per-tissue probabilities + calibrated confidence (endpoint sketch below)
Batch: AWS Batch
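A minimal FastAPI sketch of the /score contract; the request schema follows the {vector, payload, context} JSON above, while the field types, response shape, and stubbed probabilities are placeholders for the real Triton-backed model call.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScoreRequest(BaseModel):
    vector: str   # AAV capsid sequence or LNP formulation ID
    payload: str  # edit cassette description
    context: dict # dose, route, species, target tissues, ...

@app.post("/score")
def score(req: ScoreRequest) -> dict:
    # In production this would call the Triton/ONNX model; stubbed values here.
    probs = {"liver": 0.0, "muscle": 0.0, "CNS": 0.0, "eye": 0.0}
    return {"tissue_probabilities": probs, "calibrated_confidence": 0.0}
```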
Model Maintenance + Updates:
CI/CD: GitHub Actions (or Hugging Face Hub for model weights) → Triton
Retraining/updating weights: scheduled Snakemake workflow
Versioning: semantic versions for data, code, model
Model Monitoring:
Live: Prometheus exporters via FastAPI (instrumentation sketch below), Grafana dashboards for latency, error rate, QPS
Quality: Evidently AI for drift detection, alerting via an agent
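One way the latency/error-rate/QPS metrics could be exposed from the FastAPI service using prometheus_client; the metric names and the middleware approach are assumptions, not a prescribed setup.

```python
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # scraped by Prometheus, graphed in Grafana

REQUESTS = Counter("score_requests_total", "Total /score requests", ["status"])
LATENCY = Histogram("score_latency_seconds", "/score request latency in seconds")

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    LATENCY.observe(time.perf_counter() - start)
    REQUESTS.labels(status=response.status_code).inc()
    return response
```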
Application Security:
Secrets: HashiCorp Vault, AWS KMS
Data at rest: S3 with SSE-KMS; data in transit: TLS
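An illustrative boto3 upload showing SSE-KMS encryption for model artifacts at rest (TLS covers transit by default); the bucket name and KMS key alias are placeholders.

```python
import boto3

s3 = boto3.client("s3")  # boto3 uses TLS for data in transit by default

s3.upload_file(
    Filename="fusion_classifier.onnx",
    Bucket="my-model-artifacts",              # placeholder bucket
    Key="models/fusion_classifier.onnx",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": "alias/model-artifacts",  # placeholder KMS key alias
    },
)
```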
MVP tools: Python, Pandas, Biopython, ESM, PyTorch Lightning, W&B, FastAPI, Docker, Hugging Face Hub, GitHub Actions