Automated Scraping from UniProt, PubMed, Reactome, and KEGG for Human Biology


Summary

Problem: Collating structured biological entity data (e.g., for enzymes or pathways) from diverse databases is manual, inconsistent, and hard to scale.

Approach: BioSheetAgent is a Streamlit-based app with a scraping pipeline that:

  • Lets users select an entity type (e.g., hormones, enzymes, amino acids)
  • Gathers structured metadata from UniProt, PubMed, Reactome, and KEGG
  • Returns outputs as .csv or .xlsx
  • Enables in-browser visualization and (future) report auto-generation

Impact: This simplifies the generation of curated biological data sheets for research and teaching. It facilitates scalable, reproducible retrieval of multi-source bio-entity metadata with minimal coding effort.

GitHub repo: https://github.com/bmwoolf/human_function_search_agent

Intro

Okay, so this project started out pretty boring to me, but it quickly became interesting and quite useful.

One big difference between programmers in finance, artificial intelligence, smart contracts, and other software domains versus programmers in biology is the immense leverage the former command. That leverage is obviously heavily skewed toward the third standard deviation, where most contribute little (like most hierarchies in nature) and the top contribute asymmetrically, but biology seems to have very little, if any, of even that skewed distribution relative to tech. In my opinion, the primary reason is that biology hasn't yet become a complete information science, where programmers can work their magic in digital domains to innovate and print cash for the companies that hire them.

That being said, agents do change this leverage, even if they are only one small part of it.

Project

This project demonstrates some of that leverage with automated search and analysis across many databases and entity types. On the main homepage of this site (under Curated Databases), I list 100+ other databases this exact agent could be scaled to search. One can imagine how powerful this would be on a $100B cluster, constantly searching through in-body sensors around the world that translate molecular-level information into matrices programmers can build DL models on, using these databases as references (just like AlphaFold).

That is a pretty cool playground for programmers to build in and find signal for commercial innovation. I would imagine that in an environment like this, programmers in biology could command the same performance salaries and options as they do in the tech world, if not far more, as shown by Ozempic's insane returns in such a short period of time.

*sidenote: Tech is ~$8-10T of the global economy, with finance at $100T, real estate at $350T, and health around $15T, so there is still a long way to go on the upside if we assume everything will be encoded into bits, which I think it will be (heavily biased)

Anyways, back to the project.

Biology as Structured Data

After writing the text above, I wanted to test whether that leverage could be made real. So I built a system that programmatically extracts detailed, structured knowledge from biological databases across domains like hormones, enzymes, cells, and amino acids, then writes it to clean spreadsheets.

Practically, it's just a scraping tool. Conceptually, it's a prototype for what happens when software treats biology as a structured, queryable substrate, which it should be but isn't yet. The exact same engine that powers this can scale to the 100+ biological databases I've already indexed, and in theory it could even run continuously in a high-performance environment, like a cluster scraping signals from body-level sensor feeds.


What It Does

BioSheetAgent is a Streamlit app + backend pipeline that lets you:

  • Select an entity type (e.g., hormones, enzymes, amino acids)
  • Pull structured data about them from online sources (UniProt, PubMed, KEGG)
  • Output .csv, .xlsx, or both formats
  • Visualize the data in-browser
  • Eventually, auto-generate reports and summaries
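Under the hood, the "pull structured data" step amounts to one REST round-trip per entity. Here is a minimal sketch against the real UniProt REST search endpoint, kept human-only via organism ID 9606; the function names (`build_query`, `parse_entry`, `fetch_entity`) are illustrative, not the repo's actual code:

```python
import json
import urllib.parse
import urllib.request

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"

def build_query(entity_name: str, organism_id: int = 9606) -> dict:
    """Build query params for UniProt's REST search (9606 = human)."""
    return {
        "query": f"{entity_name} AND organism_id:{organism_id}",
        "format": "json",
        "fields": "accession,protein_name,cc_function",
        "size": 1,
    }

def parse_entry(entry: dict) -> dict:
    """Flatten one UniProtKB JSON entry into the sheet's flat columns."""
    name = (entry.get("proteinDescription", {})
                 .get("recommendedName", {})
                 .get("fullName", {})
                 .get("value", ""))
    function = next((c["texts"][0]["value"]
                     for c in entry.get("comments", [])
                     if c.get("commentType") == "FUNCTION" and c.get("texts")), "")
    return {"Name": name,
            "Accession": entry.get("primaryAccession", ""),
            "Function": function,
            "Sources": "UniProt"}

def fetch_entity(entity_name: str) -> dict:
    """One network round-trip; returns the first match, or {} if none."""
    url = UNIPROT_SEARCH + "?" + urllib.parse.urlencode(build_query(entity_name))
    with urllib.request.urlopen(url, timeout=30) as resp:
        results = json.load(resp).get("results", [])
    return parse_entry(results[0]) if results else {}
```

PubMed (Entrez) and KEGG follow the same shape: build a query, hit the endpoint, flatten the response into the shared column schema.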

Here’s what it looks like in action:

Each dataset contains columns like:

  • Name
  • Type
  • Function
  • Location
  • Related Molecules
  • Related Systems
  • Diseases
  • Sources
This schema gives us a lightweight ontology that maps form to function to dysfunction — programmable and queryable.
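The schema is small enough to model directly. A sketch of how those columns could map onto a dataclass and serialize to CSV (the `BioEntity` name and `to_csv` helper are hypothetical, not the repo's code):

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

@dataclass
class BioEntity:
    """One row of the sheet: the lightweight ontology described above."""
    name: str = ""
    type: str = ""
    function: str = ""
    location: str = ""
    related_molecules: str = ""
    related_systems: str = ""
    diseases: str = ""
    sources: str = ""

def to_csv(entities: list) -> str:
    """Serialize BioEntity rows to CSV text, header included."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(BioEntity)])
    writer.writeheader()
    for e in entities:
        writer.writerow(asdict(e))
    return buf.getvalue()
```

The same rows feed pandas for the .xlsx path, so both output formats share one schema.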

Why This Matters

Imagine if:

  • Every hormone, enzyme, or cell were continuously monitored
  • All their metadata were normalized and queryable
  • The output was vectorized, streamed, and cross-validated against curated databases like AlphaFold, HuBMAP, Gene Ontology, etc.

This gives you an infinite feed of biological knowledge → digital substrate → innovation surface. You could use it to:

  • Build smarter health agents
  • Validate wet-lab experiments
  • Guide therapeutic target discovery
  • Train foundation models grounded in human biology

It's not just for humans, either: swap the database and species, and you've got structured knowledge extraction for plants, microbes, fungi, and synthetic systems.

Tech Stack

  • Frontend: Streamlit
  • Backend: Python (Requests, Pandas, Biopython)
  • Data Sources: UniProt, KEGG, PubMed (Entrez)
  • Output Formats: CSV, Excel, multi-tab reports
  • Scalability: Easily expandable to 100+ other datasets already cataloged

The repo is fully modular, with each domain (hormones, enzymes, etc.) having its own fetch_*.py script and API utilities in utils/.
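One common way to keep that one-module-per-domain layout pluggable is a fetcher registry that the Streamlit front end dispatches into. A sketch, assuming that structure (the decorator and stub data here are illustrative, not the repo's actual code):

```python
from typing import Callable, Dict, List

# Registry mapping entity types to fetcher callables, mirroring the
# one-module-per-domain layout (fetch_hormones.py, fetch_enzymes.py, ...).
FETCHERS: Dict[str, Callable[[], List[dict]]] = {}

def register(entity_type: str):
    """Decorator that files a fetcher under its entity type."""
    def wrap(fn):
        FETCHERS[entity_type] = fn
        return fn
    return wrap

@register("hormones")
def fetch_hormones() -> List[dict]:
    # A real module would call UniProt/KEGG/PubMed; stubbed here.
    return [{"Name": "Insulin", "Type": "hormone"}]

def run(entity_type: str) -> List[dict]:
    """Dispatch the UI's selected entity type to its fetcher."""
    try:
        return FETCHERS[entity_type]()
    except KeyError:
        raise ValueError(f"No fetcher registered for {entity_type!r}")
```

Adding a new domain then means adding one `fetch_*.py` file with a `@register` line, with no changes to the app shell.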

What’s Next

  • Add async streaming + rate limit handling
  • Expand to support 100+ other scientific databases
  • Integrate summary generation (via LLMs or embedding-based models)
  • Build a CLI agent or LangChain-powered research assistant
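As a taste of the first item: public APIs like NCBI Entrez throttle unauthenticated clients to roughly 3 requests per second, so even before going async, a small sliding-window limiter keeps the scraper polite. A minimal sketch (class name illustrative):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most max_calls per `period` seconds.
    NCBI Entrez, for example, allows ~3 requests/second without an API key."""

    def __init__(self, max_calls: int, period: float = 1.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of recent calls

    def wait(self) -> None:
        """Block just long enough to stay under the limit, then record the call."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] > self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Calling `limiter.wait()` before each request is enough for the synchronous pipeline; the async version would swap `time.sleep` for `asyncio.sleep` behind a semaphore.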

Try It

GitHub repo: https://github.com/bmwoolf/human_function_search_agent

To run:

git clone https://github.com/bmwoolf/human_function_search_agent.git
cd human_function_search_agent
pip install -r requirements.txt
streamlit run main.py

*ChatGPT couldn’t change testosterone to a shorter hormone that would fit without changing other words, so we are keeping it as an overhang