Molecular Biology
Azimuth: web app that maps single-cell RNA-seq or ATAC-seq data onto annotated reference datasets
BiGG Models: gold-standard genome-scale metabolic models
BioCyc: pathway/genome databases
BioModels: repository of mathematical models of biological systems
cBioPortal: portal that allows you to explore cancer genomics data visually
Database Commons: a database of biological databases
Dunbrack backbone-dependent rotamer library: provides side-chain conformations based on backbone dihedral angles, enhancing the accuracy of protein design and modeling
ENCODE: human regulatory DNA
Ensembl: genome data for vertebrates and model organisms
Gene Expression Omnibus (GEO): repository for gene expression datasets
Gene Ontology (GO): directed acyclic graph (DAG) of biological terms
GotEnzymes: database of 25M+ predicted enzyme turnover numbers (how fast specific enzymes work)
GPCRdb: data and tools focused on G protein-coupled receptors (GPCRs)
GTEx: Genotype-Tissue Expression (GTEx) Portal has expression data from 3 NIH projects
GWAS Catalog: database of SNP-trait associations
HGNC (HUGO Gene Nomenclature Committee): official names and symbols for human genes, including receptors
HuBMAP: modeling the human body at the single cell level
Human Cell Atlas: mapping and annotating every cell in humans
Human Metabolome Database: comprehensive database of human metabolites and biomarkers, with quantitative data
IEDB (Immune Epitope Database): catalogues experimentally verified immune epitopes (2M of them), which are fragments of antigens recognized by B cells, T cells, and/or MHC molecules
IUPHAR/BPS Guide to Pharmacology: curated list of receptors, ligands, drug targets
KEGG: pathway maps of molecular interactions, reactions, and relations
Metabolic map: metabolic map of E. coli and other organisms; human metabolism is planned eventually
Open Humans: platform for sharing data on many different topics; lots of microbiome, genetics, variant, and viral datasets
OpenWetWare.org: experimental protocol information
ORF finder: searches for open reading frames (ORFs) in the DNA sequence you enter
parts.igem.org: standard biological parts, which actually aren’t very standardized and need to be made engineering-friendly
PathBank: biochemical reactions and interactions within cells
PDB (Protein Data Bank): 3D structures of proteins and nucleic acids
PHASTER, PHASTEST, VIBRANT: tools for identifying phage and prophage sequences in genomes
PhysiCell: virtual laboratory for agent-based modeling of cells
Physiome: models organ and tissue function, including circulation, respiration, muscle dynamics (software: https://physiomeproject.org/software)
Reactome: molecular interactions of cellular processes, pathways
Saccharomyces Genome Database (SGD): 12M base pairs of yeast DNA sequence and the annotation of over 6k genes + thousands of experiments
Sequence Read Archive (SRA): repository of high-throughput sequencing data
STRING: protein-protein interaction networks, used to identify Ras-associated human proteins
TCGA (The Cancer Genome Atlas Project): molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types
Whole-cell model of Mycoplasma genitalium: a whole-cell model of a living organism, like the worm
WikiPathways (NetPath): signal transduction pathways in human cells
UniProt: protein sequence and functional information, including receptors
uORFdb: upstream ORFs for genetic editing
VDJdb: database of T-cell receptor (TCR) sequences with known antigen specificities
Virtual Physiological Human: simulates entire-body physiological interactions, up to disease progression
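To make the ORF finder entry above concrete, here is a minimal sketch of what an open-reading-frame search does. It is not NCBI's implementation: it assumes the standard start codon ATG and stop codons TAA/TAG/TGA only, and the function name `find_orfs` and its `min_codons` parameter are made up for illustration.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=2):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, start, end, orf) tuples. Coordinates are 0-based,
    end-exclusive, and relative to the strand that was scanned, so "-"
    coordinates refer to the reverse complement. min_codons counts the
    start and stop codons (hypothetical parameter).
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # open an ORF at the first ATG in this frame
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        results.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # close the ORF and keep scanning
    return results


print(find_orfs("CCATGAAATAGGG"))
```

Real tools also handle alternative start codons, other genetic codes, and nested ORFs, which this sketch ignores.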
Biotech
Drugs@FDA: FDA approved drugs by month
Enamine REAL Space: 38B make-on-demand, synthetically accessible compounds
- Is REAL Space limited by Lipinski's rule of five?
FDA Purple Book: approved biologics and biosimilars
KEGG DRUG: drug database
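The Lipinski question above can be checked compound by compound once descriptors are available. The sketch below applies the usual rule-of-five thresholds (molecular weight ≤ 500 Da, logP ≤ 5, at most 5 hydrogen-bond donors, at most 10 acceptors, with one violation commonly tolerated). It assumes the descriptors were already computed elsewhere (for example with a cheminformatics toolkit); the `Descriptors` class and function names are hypothetical, and the example values are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float   # daltons
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d):
    """Count how many rule-of-five thresholds the compound exceeds."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d, max_violations=1):
    """Common convention: at most one violation is tolerated."""
    return lipinski_violations(d) <= max_violations


# Illustrative, made-up descriptor values for a small and a large molecule.
small = Descriptors(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
large = Descriptors(mol_weight=820.0, logp=6.3, h_donors=7, h_acceptors=12)
print(passes_rule_of_five(small), passes_rule_of_five(large))
```

Running this filter over a sample of a library would show what fraction of it is rule-of-five compliant; it says nothing by itself about how any particular library was designed.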
Tools
aRNA amplification: linear amplification of single-cell RNA (what is logarithmic?)
RUM (RNA-seq Unified Mapper): alignment tool that maps RNA-seq reads to the genome (very old, don’t recommend using)
OakVar: collection of genome and variant annotation tools
TODO:
HTSeq + htseq-count: quantification tool that assigns reads to genes
like a neural network?
Outlier-sum statistic: statistical test that detects extreme variation in single-cell expression
what limit for them?
F-statistic (expression noise): variation metric that quantifies biological vs technical variability
DAVID / MGI / AmiGO: functional annotation tools used to compare gene lists across species
Jaccard index: similarity metric that compares gene lists across species
FISSEQ: fluorescent in situ sequencing, a technology that sequences RNA directly inside cells and tissues
principle of in situ: “in its original place”
Rolling Circle Amplification (RCA): molecular biology method that amplifies circularized cDNA in situ to form DNA nanoballs
Partition sequencing: optical trick that controls density of signal by using partially matched sequencing primers
SOLiD Sequencing by Ligation: NGS chemistry that uses fluorescence-based base-calling inside fixed samples
BS(PEG)9: cross-linking reagent that anchors amplified cDNA in place to preserve spatial information
Deconvolved microscopy: imaging method that improves resolution of in situ sequencing signal
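For the Jaccard index entry in the list above, a minimal sketch of the calculation: the size of the intersection of two gene lists divided by the size of their union. The function name is made up for illustration, returning 0.0 for two empty lists is an assumed convention, and a cross-species comparison would first need gene identifiers mapped to a shared namespace (for example via orthologs), which is not shown.

```python
def jaccard_index(genes_a, genes_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two gene lists."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty lists
    return len(a & b) / len(union)


list_a = ["TP53", "KRAS", "EGFR"]
list_b = ["KRAS", "EGFR", "MYC"]
print(jaccard_index(list_a, list_b))  # 2 shared genes / 4 distinct genes = 0.5
```

The index ranges from 0 (no shared genes) to 1 (identical lists) and ignores list order and duplicates.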
Add for binding assay data (data from validated experiments):
- ChEMBL - Most comprehensive
  - 2M+ compounds, 13K+ targets, 2M+ assays
  - Free API access
  - Focus: Drug discovery, medicinal chemistry
- BindingDB - Specialized binding data
  - 2M+ binding measurements
  - Free web interface
  - Focus: Protein-ligand binding affinities
- PubChem BioAssay - NIH database
  - 1M+ bioassays, 300K+ compounds
  - Free access
  - Focus: High-throughput screening data
Specialized Databases:
- PDB (Protein Data Bank) - Structural data
  - 200K+ protein structures
  - Binding site information
  - Focus: 3D structures and binding sites
- UniProt - Protein annotations
  - Experimental evidence for protein function
  - Binding partner information
  - Focus: Protein function and interactions
- IntAct - Molecular interactions
  - Protein-protein interactions
  - Experimental interaction data
  - Focus: Protein interaction networks