Problem: Interpreting raw soil metagenomic data remains challenging without reproducible workflows that convert FASTQ reads into species-level taxonomic reports.
Approach: We construct a Nextflow-powered metagenomic pipeline that:
- Downloads data via SRA Toolkit (fasterq-dump)
- Executes Kraken2 for k-mer-based taxonomic classification
- Aggregates results and logs into an interactive HTML report using MultiQC
- Supports reproducibility and scalability via Conda environments and optional Docker integration
Impact: We deliver a portable, reproducible pipeline for environmental sequencing analysis. Ideal for researchers studying soil ecosystems, plant–microbe interactions, or any setting where metagenomic insights must be accessible, interpretable, and scalable.
GitHub repo: https://github.com/bmwoolf/Soil_Microbiome_Pipeline
Why Soil Microbiome Data Matters
Soil is alive, with billions of microbes driving plant health, carbon cycling, and ecosystem resilience. But raw sequencing data doesn’t tell that story clearly.
You start with FASTQ files and end up needing species-level classifications, readable summaries, and lightweight tools that don’t require a full bioinformatics team.
This pipeline solves that using Kraken2 for classification and MultiQC for reporting, tied together in a clean Bash + Conda workflow.
[It would be interesting to know the price of soil data and in what markets…]
Tools Used
- Kraken2: Fast k-mer based taxonomy assignment
- MultiQC: Aggregates tool outputs and logs into one clean HTML report
- SRA Toolkit: Downloads public test data from NCBI
- Nextflow (optional): Future scaling and automation
- Conda: Manages dependencies and keeps the environment portable
Pipeline Structure
FASTQ → Kraken2 → MultiQC → HTML Report
Steps:
- Download sample data from SRA using fasterq-dump
- Classify reads with Kraken2
- Summarize outputs with MultiQC
- All results go into a timestamped outputs/ directory
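The steps above can be sketched as a small Bash script. This is only an illustration of the flow, not the repository's actual code: the function names, the output file names, and the assumption of a single-end run (so that fasterq-dump writes one ACCESSION.fastq file) are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the three pipeline stages. Function names and file names are
# hypothetical placeholders, not taken from the repository.

# Create outputs/<YYYYmmdd_HHMMSS>/ and print its path.
make_run_dir() {
  local run_dir
  run_dir="outputs/$(date +%Y%m%d_%H%M%S)"
  mkdir -p "$run_dir"
  printf '%s\n' "$run_dir"
}

# Download, classify, and summarize one single-end SRA run.
# Usage: run_pipeline <SRA accession> <Kraken2 database directory>
run_pipeline() {
  local accession="$1" db_dir="$2" run_dir
  run_dir="$(make_run_dir)"

  # 1. Fetch reads from SRA as FASTQ.
  fasterq-dump "$accession" --outdir "$run_dir/fastq"

  # 2. Classify reads; --report writes the per-taxon summary table.
  kraken2 --db "$db_dir" \
    --report "$run_dir/$accession.kraken2.report.txt" \
    --output "$run_dir/$accession.kraken2.out" \
    "$run_dir/fastq/$accession.fastq"

  # 3. Aggregate everything in the run directory into one HTML report.
  multiqc "$run_dir" --outdir "$run_dir/multiqc"
}
```

Nothing runs until run_pipeline is called with an accession and a database path, so the file can be sourced safely. Paired-end runs would need both mate files and kraken2's --paired flag.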
How to Run the Soil Microbiome Pipeline
1. Clone the Repository and Set Up the Environment
git clone https://github.com/bmwoolf/Soil_Microbiome_Pipeline.git
cd Soil_Microbiome_Pipeline
conda env create -f environment.yml
conda activate soil-microbiome-pipeline
2. Download the Kraken2 Database
Download and unpack the MiniKraken2 database (the one at the very bottom of the download list) with:
mkdir -p data/kraken2-db
cd data/kraken2-db
curl -O https://genome-idx.s3.amazonaws.com/kraken/minikraken2_v2_8GB_201904.tgz
tar -xzf minikraken2_v2_8GB_201904.tgz
cd ../..
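Before moving on, it can help to confirm that the archive unpacked correctly. A Kraken2 database directory is expected to contain hash.k2d, opts.k2d, and taxo.k2d; the helper below (a hypothetical name, not part of the repository) is a minimal sketch of that check.

```shell
# Return 0 if DIR contains the three files Kraken2 needs to load a database.
# check_kraken2_db is an illustrative helper, not part of the repository.
check_kraken2_db() {
  local db_dir="$1" f
  for f in hash.k2d opts.k2d taxo.k2d; do
    if [ ! -s "$db_dir/$f" ]; then
      echo "missing or empty: $db_dir/$f" >&2
      return 1
    fi
  done
}
```

From the repository root, a call might look like check_kraken2_db data/kraken2-db/minikraken2_v2_8GB_201904_UPDATE, using the same directory that is passed to the pipeline in step 4.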
3. Download Example Data
Place your own FASTQ files in the data/ directory, or use the provided script (if available) to download a test dataset:
bash scripts/download_data.sh
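If you bring your own reads, a quick check that data/ actually contains FASTQ files can save a failed run. The sketch below uses a hypothetical helper and assumes common FASTQ extensions; whether the pipeline accepts gzipped input is not stated in this post, so treat the .gz patterns as an assumption.

```shell
# Print FASTQ files directly under DIR, sorted; fail if none exist.
# list_fastq is an illustrative helper, and the extension list is an assumption.
list_fastq() {
  local dir="$1" found
  found="$(find "$dir" -maxdepth 1 -type f \
    \( -name '*.fastq' -o -name '*.fq' -o -name '*.fastq.gz' -o -name '*.fq.gz' \) | sort)"
  if [ -z "$found" ]; then
    echo "no FASTQ files found in $dir" >&2
    return 1
  fi
  printf '%s\n' "$found"
}
```

The -maxdepth 1 limit keeps the search out of data/kraken2-db/, which also lives under data/.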
4. Run the Pipeline
Make sure Docker is running on your computer, then execute:
nextflow run pipeline/main.nf -c pipeline/nextflow.config \
--kraken2_db "$PWD/data/kraken2-db/minikraken2_v2_8GB_201904_UPDATE" \
--data_dir "$PWD/data" \
-with-docker
5. View the Results
Kraken2 output:
Check the work/ and results/ directories for classification and report files.
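For a quick look at species-level hits without opening the HTML report, a Kraken2 report file can be filtered on the command line. The sketch below assumes the standard six-column, tab-separated report layout (percent, clade reads, direct reads, rank code, taxid, indented name); the function name and the report path in the usage note are hypothetical.

```shell
# Print the N most abundant species from a Kraken2 report as
# "<clade reads>\t<percent>\t<name>", highest read count first.
# top_species is an illustrative helper, not part of the repository.
top_species() {
  local report="$1" n="${2:-10}"
  awk -F '\t' '$4 == "S" {
      name = $6
      sub(/^ +/, "", name)   # strip the indentation that encodes tree depth
      pct = $1
      gsub(/ /, "", pct)     # the percent column is space-padded
      printf "%s\t%s\t%s\n", $2, pct, name
    }' "$report" | sort -t "$(printf '\t')" -k1,1nr | head -n "$n"
}
```

For example, top_species results/sample.kraken2.report.txt 5 would list the five species with the most reads, assuming a report exists at that path.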
MultiQC report:
Open reports/multiqc_report.html directly in your browser (double-click or drag-and-drop the file) for an interactive summary.
Troubleshooting Tips
- If you see errors about missing tools, re-create the environment (if it already exists, remove it first with conda env remove -n soil-microbiome-pipeline):
conda env create -f environment.yml
- If Docker errors occur, ensure Docker Desktop is running.
- If the MultiQC report shows raw code, open it directly in your browser, not via a local server.
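The first two tips can be turned into a small preflight check to run before step 4. This is a sketch under assumptions: the helper names are hypothetical, and the tool list (nextflow and docker on the host) reflects the Docker-based run command above rather than anything confirmed in the repository.

```shell
# Return 0 if the named command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report every missing host tool, then check that the Docker daemon responds.
# The tool list is an assumption based on the run command in step 4.
preflight() {
  local missing=0 tool
  for tool in nextflow docker; do
    if ! have_tool "$tool"; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  if have_tool docker && ! docker info >/dev/null 2>&1; then
    echo "docker is installed but the daemon is not responding" >&2
    missing=1
  fi
  return "$missing"
}
```

Running preflight && echo "ready" prints "ready" only when both tools are found and Docker answers; otherwise each problem is listed on stderr.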
TL;DR: What This Pipeline Does
> This pipeline classifies soil microbiome 16S rRNA sequencing data using Kraken2 and generates a summary report with MultiQC, providing a reproducible workflow for taxonomic analysis.