BioRAG

I came across this great repo that builds an entire RAG ecosystem from scratch. I wondered if we could do the same for multimodal biological datasets, meaning data from RNA, DNA, imaging, etc. The idea is to skip over the English words in papers that describe the data and work directly with what we actually want, which is the real-world data itself. What would the components of that bio-RAG pipeline be?

The normal flow of a RAG pipeline is basically: chunking text → embedding chunks → retrieving chunks → generating output (which obviously depends on the input we are asking about). The basic structure would be:

Data ingestion

raw FASTQ, BAM, FASTA files of DNA/RNA sequences
PDB, mmCIF protein data
OME-TIFF microscopy data
h5ad single-cell data
SDF, MOL2 molecular data
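
As a rough sketch of the ingestion step, the file types above could be routed to a modality by extension before any parsing happens. The extension mapping, the modality labels, and the function names `detect_modality` and `group_by_modality` are assumptions made for illustration; actually reading each format would be left to dedicated libraries.

```python
from collections import defaultdict
from pathlib import Path

# Assumed mapping from file extension to a modality label (illustrative only).
MODALITY_BY_EXTENSION = {
    ".fastq": "sequence", ".fq": "sequence",
    ".fasta": "sequence", ".fa": "sequence", ".bam": "sequence",
    ".pdb": "structure", ".cif": "structure", ".mmcif": "structure",
    ".tif": "image", ".tiff": "image",
    ".h5ad": "single_cell",
    ".sdf": "molecule", ".mol2": "molecule",
}
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def detect_modality(path):
    """Return the modality label for a file path, ignoring compression suffixes."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes or suffixes[-1] not in MODALITY_BY_EXTENSION:
        raise ValueError(f"unrecognized file type: {path}")
    return MODALITY_BY_EXTENSION[suffixes[-1]]


def group_by_modality(paths):
    """Bucket file paths by modality so each bucket can go to its own embedder."""
    groups = defaultdict(list)
    for p in paths:
        groups[detect_modality(p)].append(str(p))
    return dict(groups)
```

Only the last non-compression suffix is used, so a name like `embryo.ome.tiff` is routed as an image and `reads_R1.fastq.gz` as a sequence file.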

Domain-specific embedding models

sequence → transformer embeddings (DNABERT, ESM)
structure → geometric graph embeddings (GNN, I’m sure there is one that exists somewhere)
images → ViT/CLIP-like latent space
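
To make the sequence → embedding step concrete without depending on a pretrained model, here is a deliberately simple stand-in: a normalized k-mer frequency vector. This is not how DNABERT or ESM work; it only shows the shape of the interface (sequence in, fixed-length vector out). The function name `kmer_embedding` and the default `k=3` are assumptions.

```python
import math
from itertools import product


def kmer_embedding(seq, k=3):
    """Map a DNA sequence to an L2-normalized vector of k-mer counts.

    The vector has 4**k dimensions, one per possible k-mer over A, C, G, T.
    k-mers containing other characters (for example N) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0.0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity; inputs are assumed to be L2-normalized already."""
    return sum(x * y for x, y in zip(a, b))
```

In a real pipeline the body of `kmer_embedding` would be swapped for a call into a learned model, while the rest of the pipeline would keep consuming fixed-length vectors.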

Indexing

vector database per input modality with Milvus, Pinecone, Weaviate (I think you can just use normal vector databases?)
multi-vector fusion across input data (chromatin + expression)
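
A minimal in-memory sketch of both ideas, one index per modality plus a fused vector across modalities, might look like the following. The `ModalityIndex` class and the `fuse` function are hypothetical, brute-force stand-ins for a real vector database, and weighted concatenation is just one simple fusion choice assumed here, not something the pipeline prescribes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


class ModalityIndex:
    """Brute-force cosine index for a single modality (toy stand-in)."""

    def __init__(self, name, dim):
        self.name = name
        self.dim = dim
        self.ids = []
        self.vectors = []

    def add(self, item_id, vec):
        if len(vec) != self.dim:
            raise ValueError(f"{self.name}: expected dim {self.dim}, got {len(vec)}")
        self.ids.append(item_id)
        self.vectors.append(l2_normalize(vec))

    def search(self, query, k=5):
        """Return up to k (item_id, cosine_similarity) pairs, best first."""
        q = l2_normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


def fuse(vectors_by_modality, weights=None):
    """Fuse per-modality vectors by weighted concatenation, then renormalize.

    Modalities are concatenated in sorted-name order so the layout is stable.
    """
    weights = weights or {}
    fused = []
    for name in sorted(vectors_by_modality):
        w = weights.get(name, 1.0)
        fused.extend(w * v for v in l2_normalize(vectors_by_modality[name]))
    return l2_normalize(fused)
```

Normalizing each modality before concatenation keeps one modality from dominating just because its raw vectors have larger magnitudes; the weights then make the trade-off explicit.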

Retrieval

K-NN search in biological latent space (embeddings represent biology, not language)
cross-model retrieval (through some harmonic ensemble of the models, which is easier said than done)
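
One simple way to combine the K-NN results from several models is reciprocal rank fusion (RRF), sketched below. This is a swapped-in technique chosen for illustration, not necessarily the ensemble meant above; the function name, the constant `k=60`, and the example IDs are assumptions. It needs only each model's ranked list of IDs, so the similarity scores never have to be comparable across latent spaces.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Merge ranked ID lists from several models into one ranking.

    rankings: dict mapping a model/modality name to a list of item IDs,
              best first (for example the output of a per-modality K-NN search).
    Each appearance contributes 1 / (k + rank), with rank starting at 1.
    Ties are broken alphabetically by ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical per-modality K-NN results for the same query.
example = {
    "sequence": ["geneA", "geneB", "geneC"],
    "expression": ["geneB", "geneA", "geneD"],
    "image": ["geneB", "geneC", "geneA"],
}
fused = reciprocal_rank_fusion(example)
```

Items that rank well in several modalities float to the top, while items seen by only one model still appear, just lower down.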

Reasoning

retrieval fed to downstream models and physics simulators (like in the case of drug simulations in virtual cells)

recall top-k protein folds → feed into docking engine → generate candidate binders
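
The fold → docking → candidates flow can be sketched as a function that takes the retrieved folds and a docking callable. Everything here is hypothetical: `propose_binders`, the `Candidate` record, the toy score table, and the convention that a lower score means a better pose. A real docking engine would sit behind the `dock` argument.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    fold_id: str
    ligand_id: str
    score: float  # assumed convention: lower is better


def propose_binders(fold_hits, ligands, dock, top_n=3):
    """Dock every ligand against every retrieved fold; keep the best pairs.

    fold_hits: IDs of the top-k folds returned by retrieval.
    dock:      callable (fold_id, ligand_id) -> float score.
    """
    candidates = [
        Candidate(fold_id, ligand_id, dock(fold_id, ligand_id))
        for fold_id in fold_hits
        for ligand_id in ligands
    ]
    candidates.sort(key=lambda c: c.score)
    return candidates[:top_n]


# Made-up scores standing in for a real docking engine.
TOY_SCORES = {
    ("foldA", "lig1"): -7.2,
    ("foldA", "lig2"): -5.1,
    ("foldB", "lig1"): -6.3,
    ("foldB", "lig2"): -8.0,
}


def toy_dock(fold_id, ligand_id):
    return TOY_SCORES[(fold_id, ligand_id)]


best = propose_binders(["foldA", "foldB"], ["lig1", "lig2"], toy_dock, top_n=2)
```

Passing the docking engine in as a callable keeps retrieval and simulation decoupled, so the same retrieved folds could feed a different downstream model without changing the retrieval code.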

API for other users

as Paul Graham has said, “just create an API. See what happens.” I see this as perfectly reasonable advice that we should listen to

Why this doesn’t exist as a seamless abstraction layer

cross-modality alignment in biology is still niche, I think because prices are too high
each modality has massive preprocessing, like FASTQ → clean → embedding, and this has to be done for all the different cell types, which really lie along a multidimensional spectrum rather than fitting into English-based language
biology needs RAG that is not word-based. I think Evo2 is a step in this direction, since it bridges sequences with relevant information, but the bioRAG needs to push on to simulation, statistical inference, or generative models in various organisms
the amount of information is just so, so massive