Modeling Accurate Delivery of RNA + DNA Edits

Business Problem: predict, entirely in silico, whether an edit (an AAV- or LNP-delivered payload) reaches its intended tissue(s) in a given context

  • Decision unit: p(hit | vector, payload, dose, route, species, tissue)
  • Root action: score() candidate vectors for a target tissue
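The decision unit and root action above can be sketched as a minimal API. All names here (`Candidate`, `score`) are placeholders for illustration, not a fixed interface:

```python
from dataclasses import dataclass

# Hypothetical sketch of the decision unit p(hit | vector, payload, dose,
# route, species, tissue). Field names are illustrative assumptions.

@dataclass
class Candidate:
    vector: str    # e.g. "AAV9" or a capsid variant ID
    payload: str   # edit cassette description
    dose: float    # vg/kg (AAV) or mg/kg (LNP)
    route: str     # "IV", "IM", "IT"
    species: str   # "human"

def score(candidate: Candidate, tissue: str) -> float:
    """Return p(hit) in [0, 1] for one candidate/tissue pair.

    Stubbed with a constant prior; the trained model replaces this body.
    """
    return 0.5  # placeholder prior until a model is trained

# Root action: rank candidate vectors for a target tissue.
ranked = sorted(
    [Candidate("AAV9", "cassette-A", 1e13, "IV", "human"),
     Candidate("AAV8", "cassette-B", 1e13, "IV", "human")],
    key=lambda c: score(c, "liver"),
    reverse=True,
)
```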

Goal Definition: AUROC ≥ 0.80 per tissue and strong top-k recall (recall@3), starting with human liver, muscle, CNS, eye, and brain via IM/IV/IT routes. Efficient delivery depends on:

  1. Cell entry success: how well the capsid binds and gets inside the target cell
  2. Intracellular processing: how effectively the vector escapes degradation, uncoats, and delivers its genome to the nucleus
  3. Expression yield: how much functional RNA/protein is made from that delivered genome
  4. Target specificity: doing the above processes in the desired cell type + area rather than broadly and off-target
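The recall@3 target from the goal definition can be checked with a small helper. Vector names here are illustrative, not real benchmark results:

```python
def recall_at_k(ranked_vectors, true_hits, k=3):
    """Fraction of known-hit vectors recovered in the top-k of a ranking."""
    top_k = set(ranked_vectors[:k])
    hits = set(true_hits)
    return len(top_k & hits) / len(hits)

# Toy example: 2 of the 3 true liver hits appear in the top 3.
ranking = ["AAV9", "AAV-LK03", "AAV8", "AAV2", "AAV5"]
truth = ["AAV9", "AAV8", "AAV5"]
r = recall_at_k(ranking, truth, k=3)  # 2/3
```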

Data Collection and Preparation:

  • NCBI GEO/SRA: capsid-library biodistribution screens, RNA-seq
  • AddGene: AAV serotype/capsid variant metadata
  • Human Protein Atlas + GTEx: receptor/entry-factor expression by tissue
  • Literature tables (LNP): lipid compositions, doses, routes (manual seed set to bootstrap model before automated large-scale ingestion)

Tools

  • Ingestion: Biopython (Entrez), SRA-tools, requests
  • QC/quant: FastQC (read quality), Cutadapt (adapter trimming), Salmon (transcript quantification to TPM), MultiQC (aggregated QC reports)
  • Orchestration: Nextflow/Snakemake
  • Data versioning: DVC on AWS S3
  • Tables: Parquet via Pandas/Polars

Feature Engineering:

AAV: capsid sequence (one-hot/k-mer), known motif flags, edit cassette (the full DNA/RNA payload carried by the vector: promoter, coding sequence, UTRs, polyA), dose, route, species
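The k-mer featurization of a capsid protein sequence can be sketched with a fixed-length count vector; the example sequence is an arbitrary fragment, not a real capsid:

```python
from itertools import product

def kmer_counts(seq: str, k: int = 2, alphabet: str = "ACDEFGHIKLMNPQRSTVWY"):
    """Fixed-length k-mer count vector over the 20 standard amino acids."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)  # 20**k dimensions (400 for k=2)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing non-standard residues
            vec[index[kmer]] += 1
    return vec

v = kmer_counts("MAADGYLPDWLEDTLS", k=2)  # illustrative 16-residue fragment
```

For k=3 this grows to 8000 dimensions, which is still manageable for tree models and linear baselines.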

LNP: lipid SMILES → RDKit descriptors, N/P ratio, PEG %, dose, route
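A minimal sketch of the formulation-level LNP features: in the real pipeline, lipid SMILES additionally go through RDKit for physicochemical descriptors; here only the tabular fields are encoded, and all field names are illustrative assumptions:

```python
ROUTES = ["IV", "IM", "IT"]  # one-hot route encoding

def lnp_features(formulation: dict) -> list:
    """Flat feature vector from an LNP formulation record (hypothetical schema)."""
    route_onehot = [1.0 if formulation["route"] == r else 0.0 for r in ROUTES]
    return [
        formulation["np_ratio"],    # N/P ratio (ionizable amine : phosphate)
        formulation["peg_pct"],     # mol% PEG-lipid
        formulation["dose_mg_kg"],  # dose in mg/kg
    ] + route_onehot

x = lnp_features({"np_ratio": 6.0, "peg_pct": 1.5, "dose_mg_kg": 0.3, "route": "IV"})
```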

Tissue context: HPA/GTEx receptor expression, immune markers

Tools: Pandas/Polars, scikit-learn transformers, RDKit, NumPy, cache with Feather/Parquet

Model Training:

Baseline:

  • logistic regression and XGBoost (super quick MVP)
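The MVP baseline shape, sketched on fabricated data (the features and the "hit" rule are synthetic, purely to show the fit/score loop):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for engineered features: e.g. reduced k-mer counts,
# dose, and route one-hots. Labels follow an arbitrary linear "hit" rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # p(hit) per candidate
acc = clf.score(X, y)
```

XGBoost slots into the same fit/predict_proba interface, so the two baselines share one evaluation harness.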

v1: multi-task classifier (per tissue) + calibration

  • Seq branch (AAV): small transformer over protein-LM embeddings (e.g., a compact ESM model) or a 1D CNN
  • Chem branch (LNP): MLP on RDKit features
  • Fusion: concat + MLP → sigmoid heads per tissue
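A shape-level sketch of the fusion head (concat + MLP → per-tissue sigmoids). The real model is PyTorch; NumPy is used here only to pin down tensor shapes, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_seq, d_chem, d_hidden, n_tissues = 64, 32, 128, 4  # e.g. liver/muscle/CNS/eye

W1 = rng.normal(scale=0.1, size=(d_seq + d_chem, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_hidden, n_tissues))

def forward(seq_emb, chem_emb):
    h = np.concatenate([seq_emb, chem_emb], axis=-1)  # fusion: concat
    h = np.maximum(h @ W1, 0.0)                       # MLP hidden layer (ReLU)
    return 1.0 / (1.0 + np.exp(-(h @ W2)))            # sigmoid head per tissue

p = forward(rng.normal(size=(5, d_seq)), rng.normal(size=(5, d_chem)))
```

Independent sigmoid heads (rather than a softmax) let one vector score highly for several tissues at once, which matches the biology of broadly tropic serotypes.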

Stack: PyTorch + Lightning, Optuna for hyperparameter optimization; focal loss for class imbalance, mixup optional
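The binary focal loss down-weights easy examples so rare positives dominate the gradient; a NumPy sketch of the standard formulation (the training code would use the PyTorch equivalent inside the Lightning module):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # class weighting
    return float(np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t)))

loss_easy = focal_loss(np.array([0.95]), np.array([1]))  # confident, correct
loss_hard = focal_loss(np.array([0.30]), np.array([1]))  # misclassified
```

The `(1 - p_t)^gamma` term is what suppresses the easy example's contribution relative to the hard one.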

Tracking: Weights & Biases for runs, MLflow for the model registry

Model Evaluation:

Protocol: leave-study-out splits to avoid leakage; stratify folds by species and route
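Leave-study-out splitting can be sketched as grouping sample indices by study ID, so screen-specific batch effects never leak into training. The GEO-style accessions below are illustrative placeholders:

```python
from collections import defaultdict

def leave_study_out(samples):
    """Yield (study, train_idx, test_idx) with one whole study held out per fold."""
    by_study = defaultdict(list)
    for i, s in enumerate(samples):
        by_study[s["study"]].append(i)
    for study, test_idx in by_study.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        yield study, train_idx, test_idx

samples = [{"study": "GSE0001"}, {"study": "GSE0001"}, {"study": "GSE0002"}]
folds = list(leave_study_out(samples))
```

scikit-learn's `GroupKFold`/`LeaveOneGroupOut` provide the same behavior with the study ID passed as `groups`.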

Metrics: AUROC/PR-AUC per tissue, ECE, Brier, reliability plots
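Expected calibration error (ECE) can be computed by binning predictions by confidence and averaging |accuracy − mean confidence| weighted by bin size; a minimal sketch:

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the left edge only for the first bin
        mask = (probs >= lo if lo == 0.0 else probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()   # mean predicted confidence in bin
            acc = labels[mask].mean()   # empirical hit rate in bin
            total += mask.mean() * abs(acc - conf)
    return float(total)

e = ece([0.95, 0.95, 0.05, 0.05], [1, 1, 0, 0])  # slightly underconfident
```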

Data validation: Great Expectations

Model Deployment:

Artifact: TorchScript, ONNX

API: FastAPI + Uvicorn

Container: Docker, GPU-enabled image (CUDA) since training runs on GPU

Registry: MLflow model registry (see Tracking)

Infra: Triton inference server (v1), S3 for artifacts

Model Serving:

Interface: REST: /score accepts JSON {vector, payload, context}, returns tissue probabilities + calibrated confidence
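A sketch of the /score request/response contract in plain Python. Field names beyond the documented {vector, payload, context} triple are assumptions; the real handler would live behind FastAPI with pydantic validation:

```python
import json

# Example request body (hypothetical field names inside each section).
REQUEST = json.dumps({
    "vector": {"type": "AAV", "capsid": "AAV9"},
    "payload": {"cassette": "CAG-GFP-polyA"},
    "context": {"species": "human", "route": "IV", "dose": 1e13},
})

def handle_score(raw: str) -> dict:
    """Validate the documented top-level fields and return the response shape."""
    req = json.loads(raw)
    for key in ("vector", "payload", "context"):
        if key not in req:
            raise ValueError(f"missing field: {key}")
    # Placeholder probabilities; the deployed model fills these in.
    return {"tissue_probs": {"liver": 0.5, "muscle": 0.5},
            "calibrated": True}

resp = handle_score(REQUEST)
```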

Batch: AWS Batch

Model Maintenance + Updates:

CI/CD: GitHub Actions (or Hugging Face Hub for model weights) → Triton

Retraining/updating weights: scheduled Snakemake workflow

Versioning: semantic versions for data, code, model

Model Monitoring:

Live: Prometheus exporters via FastAPI, Grafana dashboard for latency, error rate, QPS

Quality: EvidentlyAI for data/prediction drift, with alerts routed through an agent

Application Security:

Secrets: HashiCorp Vault, AWS KMS

Data protection: S3 SSE-KMS at rest, TLS in transit

MVP tools: Python, Pandas, Biopython, ESM, PyTorch Lightning, W&B, FastAPI, Docker, Hugging Face Hub, GitHub Actions