BioRAG

I came across a great repo that builds an entire RAG ecosystem from scratch, and I wondered whether we could do the same for multimodal biological datasets, meaning data from RNA, DNA, imaging, etc. The idea is to skip over the English words in papers that describe the data and retrieve over what we actually want, which is the real-world data itself. What would the components of that bio-RAG pipeline be?

The normal flow of a RAG pipeline is basically: chunking text → embedding chunks → retrieving chunks → generating output (which depends on the query being asked). For biological data, the basic structure would be:

  1. Data ingestion
    1. raw FASTQ, BAM, FASTA files of DNA/RNA sequences
    2. PDB, mmCIF protein data
    3. OME-TIF microscopy data
    4. h5ad single-cell data
    5. SDF, MOL2 molecular data
  2. Domain-specific embedding models
    1. sequence → transformer embeddings (DNABERT, ESM)
    2. structure → geometric graph embeddings (GNN, I’m sure there is one that exists somewhere)
    3. images → ViT/CLIP-like latent space
  3. Indexing
    1. one vector database per input modality, e.g. Milvus, Pinecone, or Weaviate (I think you can just use normal vector databases?)
    2. multi-vector fusion across input data (chromatin + expression)
  4. Retrieval
    1. K-NN search in biological latent space (embeddings represent biology, not language)
    2. cross-modal retrieval (through some harmonic ensemble of the models; easier said than done)
  5. Reasoning
    1. retrieved results fed to downstream models and physics simulators (as in the case of drug simulations in virtual cells)
      1. recall top-k protein folds → feed into docking engine → generate candidate binders
  6. API for other users
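
To make the indexing and retrieval steps (3 and 4) a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: `ModalityIndex` is a toy in-memory stand-in for a real vector database collection, the vectors are hand-made rather than produced by any embedding model, and samples are assumed to share an ID across modalities. For the cross-modal step I swapped in reciprocal rank fusion (RRF), a common way to merge ranked lists, because it is simple. It is not the "harmonic ensemble" mentioned above, just one possible placeholder for it.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ModalityIndex:
    """Hypothetical toy index: one instance per modality, held in memory."""

    def __init__(self):
        self.items = []  # list of (sample_id, vector)

    def add(self, sample_id, vector):
        self.items.append((sample_id, vector))

    def knn(self, query, k):
        """Brute-force k-NN: the k most similar (sample_id, score) pairs."""
        scored = [(sid, cosine(query, vec)) for sid, vec in self.items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def fused_search(indexes, queries, k=2, rrf_k=60):
    """Merge per-modality k-NN lists with reciprocal rank fusion (RRF)."""
    fused = defaultdict(float)
    for modality, query in queries.items():
        hits = indexes[modality].knn(query, k)
        for rank, (sid, _score) in enumerate(hits, start=1):
            fused[sid] += 1.0 / (rrf_k + rank)
    return sorted(fused, key=lambda sid: fused[sid], reverse=True)


# Hand-made toy embeddings; the two modalities use different dimensions on purpose.
indexes = {"expression": ModalityIndex(), "chromatin": ModalityIndex()}
indexes["expression"].add("cell_A", [1.0, 0.0, 0.0])
indexes["expression"].add("cell_B", [0.0, 1.0, 0.0])
indexes["expression"].add("cell_C", [0.7, 0.7, 0.0])
indexes["chromatin"].add("cell_A", [0.0, 1.0])
indexes["chromatin"].add("cell_B", [1.0, 0.0])
indexes["chromatin"].add("cell_C", [0.6, 0.8])

hits = fused_search(
    indexes,
    {"expression": [1.0, 0.1, 0.0], "chromatin": [0.1, 1.0]},
    k=2,
)
print(hits)  # ['cell_A', 'cell_C']
```

Fusing ranks instead of raw similarity scores sidesteps the fact that scores from different embedding spaces are not directly comparable. A real system would replace the brute-force `knn` with an approximate nearest-neighbor search in whichever vector database is used.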

Why this doesn’t exist as a seamless abstraction layer

  1. cross-modality alignment in biology is still niche, I think because the costs are too high
  2. each modality has massive preprocessing, like FASTQ → clean → embedding, and that has to be done for all the different cell types, which really sit along a multidimensional spectrum rather than looking anything like English-based language
  3. biology needs RAG that is not word-based. I think Evo2 is a step in this direction, since it bridges sequences with relevant information, but the bioRAG needs to push further, into simulation, statistical inference, or generative models across various organisms
  4. the amount of information is just so, so massive
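
As a rough illustration of the FASTQ → clean → embedding chain from item 2, here is a small stdlib-only Python sketch. It rests on several assumptions: records are in the common four-lines-per-record FASTQ layout, qualities are Phred+33 encoded, "clean" means nothing more than a mean-quality cutoff, and the "embedding" is a normalized k-mer frequency vector standing in for a learned model such as DNABERT. The function names and the two reads are made up for the example.

```python
import io
from itertools import product


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of stream (a blank line also stops this toy parser)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def kmer_vector(seq, k=2):
    """Normalized k-mer frequencies: a crude stand-in for a learned embedding."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips k-mers containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[km] / total for km in kmers]


# Two made-up reads: read1 is high quality ('I' = Phred 40), read2 is not ('!' = Phred 0).
fastq_text = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTNNNN\n+\n!!!!!!!!\n"

embeddings = {
    read_id: kmer_vector(seq)
    for read_id, seq, qual in read_fastq(io.StringIO(fastq_text))
    if mean_phred(qual) >= 20  # the "clean" step: drop low-quality reads
}
print(list(embeddings))  # ['read1']
```

Even this toy version shows where the cost comes from: every modality needs its own parser, its own notion of "clean", and its own embedding step before anything can be indexed.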