Business Problem: predict, fully computationally, whether an edit (AAV- or LNP-delivered payload) reaches its intended tissue(s) under a given context
- Decision unit: p(hit | vector, payload, dose, route, species, tissue)
- Root action: score() candidate vectors for a target tissue
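A minimal sketch of what the score() action could look like, assuming a simple dataclass for the decision-unit inputs; the field names mirror p(hit | vector, payload, dose, route, species, tissue) and everything else is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Candidate:
    # Inputs mirroring p(hit | vector, payload, dose, route, species, tissue)
    vector: str    # AAV capsid sequence or LNP formulation ID
    payload: str   # edit cassette description
    dose: float    # e.g. vector genomes/kg or mg/kg
    route: str     # "IV", "IM", "IT", ...
    species: str   # "human", "NHP", "mouse", ...

def score(candidates: List[Candidate], target_tissue: str) -> Dict[int, float]:
    """Return p(hit | candidate, target_tissue) per candidate; ranking/top-k happens downstream."""
    raise NotImplementedError  # filled in by the trained model described below
```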
Goal Definition: deliver AUROC ≥ 0.80 per tissue and strong top-k ranking (recall@3), starting with human liver, muscle, CNS/brain, and eye over IM/IV/IT routes. We are looking for efficient delivery, which depends on:
- Cell entry success: how well the capsid binds and gets inside the target cell
- Intracellular processing: how effectively the vector escapes degradation, uncoats, and delivers its genome to the nucleus
- Expression yield: how much functional RNA/protein is made from that delivered genome
- Target specificity: doing the above in the desired cell type and region rather than broadly and off-target
Data Collection and Preparation:
- NCBI GEO/SRA: biodistribution capsid library screens, RNA-seq
- Addgene: AAV serotype/capsid variant metadata
- Human Protein Atlas + GTEx: receptor/entry-factor expression by tissue
- Literature tables (LNP): lipid compositions, doses, routes (manual seed set to bootstrap model before automated large-scale ingestion)
Tools
- Ingestion: Biopython (Entrez), SRA-tools, requests (ingestion sketch after this list)
- QC/quant: FastQC, Cutadapt, Salmon (transcript quantification to TPM), MultiQC (aggregated read-quality reports)
- Orchestration: Nextflow/Snakemake
- Data versioning: DVC on AWS S3
- Tables: Parquet via Pandas/Polars
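A hedged sketch of the GEO ingestion step via Biopython's Entrez module; the search term, retmax, and email are illustrative placeholders, and the summary fields returned by the gds database may differ from what's shown.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact email; placeholder

# Illustrative query against GEO DataSets for capsid-library biodistribution studies
handle = Entrez.esearch(db="gds", term="AAV capsid library biodistribution", retmax=50)
record = Entrez.read(handle)
handle.close()

for gds_id in record["IdList"]:
    summary = Entrez.read(Entrez.esummary(db="gds", id=gds_id))
    print(gds_id, summary[0].get("title", ""))
```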
Feature Engineering:
AAV: capsid sequence (one-hot/k-mer), known motif flags, edit cassette (the full DNA/RNA payload carried by the vector: promoter, coding sequence, UTRs, polyA), dose, route, species (featurizer sketch after this list)
LNP: lipid SMILES → RDKit descriptors, N/P ratio, PEG %, dose, route
Tissue context: HPA/GTEx receptor expression, immune markers
Tools: Pandas/Polars, scikit-learn transformers, RDKit, NumPy; caching with Feather/Parquet
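Illustrative featurizers for the two branches, assuming a simple overlapping k-mer count for AAV capsid protein sequences and a handful of RDKit descriptors for LNP lipid SMILES; k, the descriptor set, and the normalization are assumptions, not a fixed recipe.

```python
from collections import Counter
from itertools import product

import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def kmer_features(seq: str, k: int = 3) -> np.ndarray:
    """Normalized counts of overlapping amino-acid k-mers over the 20-letter alphabet."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    vocab = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vec = np.zeros(len(vocab), dtype=np.float32)
    for kmer, n in counts.items():
        if kmer in vocab:
            vec[vocab[kmer]] = n
    return vec / max(vec.sum(), 1.0)

def lipid_descriptors(smiles: str) -> np.ndarray:
    """A few RDKit descriptors for an ionizable-lipid SMILES string (assumes valid SMILES)."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array([
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumRotatableBonds(mol),
    ], dtype=np.float32)
```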
Model Training:
Baseline:
- logistic regression and XGBoost (super quick MVP)
v1: multi-task classifier (per tissue) + calibration
- Seq branch (AAV): small transformer on ESM embeddings (compact ESM model) or a 1D CNN
- Chem branch (LNP): MLP on RDKit features
- Fusion: concat + MLP → sigmoid heads per tissue (model sketch after this list)
Stack: PyTorch + Lightning, Optuna for hyperparameter optimization; class imbalance handled with focal loss, mixup optional
Tracking: Weights & Biases for runs, MLflow for the model registry
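A minimal Lightning sketch of the v1 fusion architecture (seq branch + chem branch, concatenation, per-tissue sigmoid heads); the tissue list, layer sizes, embedding dimensions, and the use of precomputed ESM embeddings as inputs are assumptions, and plain BCE stands in for the focal loss mentioned above.

```python
import torch
import torch.nn as nn
import lightning as L  # PyTorch Lightning 2.x import style

TISSUES = ["liver", "muscle", "CNS", "eye"]  # illustrative per-tissue heads

class FusionClassifier(L.LightningModule):
    def __init__(self, seq_dim: int = 320, chem_dim: int = 4, hidden: int = 128):
        super().__init__()
        self.seq_branch = nn.Sequential(nn.Linear(seq_dim, hidden), nn.ReLU())   # precomputed ESM embedding in
        self.chem_branch = nn.Sequential(nn.Linear(chem_dim, hidden), nn.ReLU()) # RDKit descriptors in
        self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.heads = nn.Linear(hidden, len(TISSUES))  # one logit per tissue

    def forward(self, seq_emb, chem_feats):
        z = torch.cat([self.seq_branch(seq_emb), self.chem_branch(chem_feats)], dim=-1)
        return self.heads(self.fusion(z))  # raw logits; sigmoid applied in loss/metrics

    def training_step(self, batch, batch_idx):
        seq_emb, chem_feats, labels = batch
        logits = self(seq_emb, chem_feats)
        # BCE here; a focal loss would swap in for class imbalance, as noted in the stack above
        loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)
```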
Model Evaluation:
Protocol: leave-study-out splits to avoid leakage, species/route stratification
Metrics: AUROC/PR-AUC per tissue, expected calibration error (ECE), Brier score, reliability plots (metrics sketch below)
Data validation: Great Expectations
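A sketch of the leave-study-out protocol and per-tissue metrics, assuming scikit-learn's GroupKFold grouped by study ID and a simple equal-width-bin ECE helper (scikit-learn has no built-in ECE); the model interface and array layout are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: |observed hit rate - mean confidence| weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

def evaluate(model, X, y, study_ids, n_splits: int = 5):
    """Leave-study-out CV: each fold holds out whole studies to avoid leakage."""
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=study_ids):
        model.fit(X[train_idx], y[train_idx])
        p = model.predict_proba(X[test_idx])[:, 1]
        print(
            f"AUROC={roc_auc_score(y[test_idx], p):.3f}",
            f"PR-AUC={average_precision_score(y[test_idx], p):.3f}",
            f"Brier={brier_score_loss(y[test_idx], p):.3f}",
            f"ECE={expected_calibration_error(y[test_idx], p):.3f}",
        )
```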
Model Deployment:
Artifact: TorchScript or ONNX export (export sketch below)
API: FastAPI + Uvicorn
Container: Docker, GPU-enabled images since training runs on CUDA
Registry: MLflow model registry (see Tracking above)
Infra: Triton Inference Server (v1), S3 for artifacts
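A hedged sketch of the ONNX export step for Triton, reusing the FusionClassifier sketch from the training section; input names, shapes, and the output filename are assumptions.

```python
import torch

model = FusionClassifier()  # trained weights would be loaded here (e.g. from a checkpoint)
model.eval()

dummy_seq = torch.randn(1, 320)  # ESM embedding size assumed in the model sketch
dummy_chem = torch.randn(1, 4)   # RDKit descriptor count assumed in the featurizer sketch

torch.onnx.export(
    model,
    (dummy_seq, dummy_chem),
    "fusion_classifier.onnx",
    input_names=["seq_emb", "chem_feats"],
    output_names=["tissue_logits"],
    dynamic_axes={"seq_emb": {0: "batch"}, "chem_feats": {0: "batch"}},
)
```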
Model Serving:
Interface: REST /score accepts JSON {vector, payload, context}, returns per-tissue probabilities + calibrated confidence (endpoint sketch below)
Batch: AWS Batch
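A minimal FastAPI sketch of the /score contract; the request schema follows the {vector, payload, context} JSON above, while the field types, response shape, and stubbed probabilities are placeholders for the real Triton-backed model call.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScoreRequest(BaseModel):
    vector: str   # AAV capsid sequence or LNP formulation ID
    payload: str  # edit cassette description
    context: dict # dose, route, species, target tissues, ...

@app.post("/score")
def score(req: ScoreRequest) -> dict:
    # In production this would call the Triton/ONNX model; stubbed values here.
    probs = {"liver": 0.0, "muscle": 0.0, "CNS": 0.0, "eye": 0.0}
    return {"tissue_probabilities": probs, "calibrated_confidence": 0.0}
```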
Model Maintenance + Updates:
CI/CD: GitHub Actions (or Hugging Face Hub for model weights) → Triton
Retraining/updating weights: scheduled Snakemake workflow
Versioning: semantic versions for data, code, model
Model Monitoring:
Live: Prometheus exporters via FastAPI (instrumentation sketch below), Grafana dashboards for latency, error rate, QPS
Quality: Evidently AI for drift detection, alerting via an agent
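One way the latency/error-rate/QPS metrics could be exposed from the FastAPI service using prometheus_client; the metric names and the middleware approach are assumptions, not a prescribed setup.

```python
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # scraped by Prometheus, graphed in Grafana

REQUESTS = Counter("score_requests_total", "Total /score requests", ["status"])
LATENCY = Histogram("score_latency_seconds", "/score request latency in seconds")

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    LATENCY.observe(time.perf_counter() - start)
    REQUESTS.labels(status=response.status_code).inc()
    return response
```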
Application Security:
Secrets: HashiCorp Vault, AWS KMS
Data at rest: S3 with SSE-KMS; data in transit: TLS
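An illustrative boto3 upload showing SSE-KMS encryption for model artifacts at rest (TLS covers transit by default); the bucket name and KMS key alias are placeholders.

```python
import boto3

s3 = boto3.client("s3")  # boto3 uses TLS for data in transit by default

s3.upload_file(
    Filename="fusion_classifier.onnx",
    Bucket="my-model-artifacts",              # placeholder bucket
    Key="models/fusion_classifier.onnx",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": "alias/model-artifacts",  # placeholder KMS key alias
    },
)
```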
MVP tools: Python, Pandas, Biopython, ESM, PyTorch Lightning, W&B, FastAPI, Docker, Hugging Face Hub, GitHub Actions