Building a Soil Microbiome Pipeline with Nextflow, Kraken2, and MultiQC

Problem: Interpreting raw soil metagenomic data remains challenging without reproducible workflows that convert FASTQ reads into species-level taxonomic reports.

Approach: We construct a Nextflow-powered metagenomic pipeline that:

Downloads data via SRA Toolkit (fasterq-dump)
Executes Kraken2 for k‑mer-based taxonomic classification
Aggregates results and logs into an interactive HTML report using MultiQC
Supports reproducibility and scalability via Conda environments and optional Docker integration

Impact: We deliver a portable, reproducible pipeline for environmental sequencing analysis. It suits researchers studying soil ecosystems, plant–microbe interactions, or any setting where metagenomic results must be accessible, interpretable, and scalable.

GitHub repo: https://github.com/bmwoolf/Soil_Microbiome_Pipeline

Why Soil Microbiome Data Matters

Soil is alive, with billions of microbes driving plant health, carbon cycling, and ecosystem resilience. But raw sequencing data doesn’t tell that story clearly.

You start with FASTQ files and end up needing species-level classifications, readable summaries, and lightweight tools that don’t require a full bioinformatics team.

This pipeline addresses that with Kraken2 for classification and MultiQC for reporting, tied together in a Nextflow workflow with Conda-managed dependencies.


Tools Used

Kraken2: Fast k-mer-based taxonomic assignment
MultiQC: Aggregates tool outputs and logs into one clean HTML report
SRA Toolkit: Downloads public test data from NCBI
Nextflow: Orchestrates the pipeline steps and supports scaling and automation
Conda: Manages dependencies and keeps the environment portable

Pipeline Structure

FASTQ → Kraken2 → MultiQC → HTML Report

Steps:

Download sample data from SRA using fasterq-dump
Classify reads with Kraken2
Summarize outputs with MultiQC
All results go into a timestamped outputs/ directory
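
The steps above can be sketched as a short shell script. This is a minimal illustration under stated assumptions, not the repository's actual implementation: the function names, the run_ directory prefix, the thread count, and the paired-end file naming are all hypothetical. Only the tool options used (fasterq-dump --split-files and -O, kraken2 --db, --paired, --report and --output, multiqc -o) are standard flags.

```shell
#!/bin/sh
# Sketch only: function names and directory layout are assumptions,
# not taken from the repository.

# Create outputs/run_<timestamp>/ under a base directory and print its path.
make_run_dir() {
    base="${1:-.}"
    run_dir="$base/outputs/run_$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$run_dir/fastq" "$run_dir/kraken2" "$run_dir/multiqc"
    printf '%s\n' "$run_dir"
}

# Run the three steps for one SRA accession (hypothetical wrapper).
#   $1 = SRA accession, $2 = Kraken2 database directory, $3 = run directory
run_sample() {
    acc="$1"; db="$2"; out="$3"

    # 1. Download reads. For paired-end runs, --split-files writes
    #    <acc>_1.fastq and <acc>_2.fastq.
    fasterq-dump --split-files -O "$out/fastq" "$acc" || return 1

    # 2. Classify reads. The report file is what MultiQC summarizes.
    kraken2 --db "$db" --threads 4 --paired \
        --report "$out/kraken2/$acc.report.txt" \
        --output "$out/kraken2/$acc.kraken2.txt" \
        "$out/fastq/${acc}_1.fastq" "$out/fastq/${acc}_2.fastq" || return 1

    # 3. Aggregate everything under the run directory into one HTML report.
    multiqc "$out" -o "$out/multiqc"
}
```

A run would then look like run_dir=$(make_run_dir .) followed by run_sample SRR_ACCESSION data/kraken2-db/DB_DIR "$run_dir", where SRR_ACCESSION and DB_DIR are placeholders for your own values.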

How to Run the Soil Microbiome Pipeline

1. Clone the Repository and Set Up the Environment

git clone https://github.com/bmwoolf/Soil_Microbiome_Pipeline.git
cd Soil_Microbiome_Pipeline
conda env create -f environment.yml
conda activate soil-microbiome-pipeline

2. Download the Kraken2 Database

Download the 8 GB MiniKraken2 database and unpack it into data/kraken2-db:

mkdir -p data/kraken2-db
cd data/kraken2-db
curl -O https://genome-idx.s3.amazonaws.com/kraken/minikraken2_v2_8GB_201904.tgz
tar -xzf minikraken2_v2_8GB_201904.tgz
cd ../..
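
A Kraken2 database directory is usable once it contains the three index files hash.k2d, opts.k2d, and taxo.k2d. A quick check after unpacking might look like the sketch below. The function name check_kraken2_db is an illustrative assumption, and the path in the usage note is the database directory passed to the run command in step 4.

```shell
# Sketch: verify that a directory holds the three files Kraken2 needs.
# The function name is an assumption made for illustration.
check_kraken2_db() {
    db_dir="$1"
    status=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        # -s is true only if the file exists and is not empty,
        # which also catches an interrupted extraction.
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

From the repository root, for example: check_kraken2_db data/kraken2-db/minikraken2_v2_8GB_201904_UPDATE && echo "database looks complete"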

3. Download Example Data

To analyze your own reads, place the FASTQ files in the data/ directory at the repository root. Otherwise, download a test dataset with the provided script (if available):

bash scripts/download_data.sh
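
If you supply your own reads, a quick sanity check can catch truncated files before a long classification run. FASTQ normally stores each read as four lines, so the line count should be a multiple of four. The helper below is a sketch with a hypothetical name (count_fastq_reads); it assumes uncompressed four-line FASTQ files that end with a newline.

```shell
# Sketch: count reads in an uncompressed FASTQ file (4 lines per record).
# Prints the read count; fails if the line count is not a multiple of 4.
# Assumes the file ends with a newline, since wc -l counts newline characters.
count_fastq_reads() {
    fq="$1"
    [ -r "$fq" ] || { echo "cannot read: $fq" >&2; return 1; }
    # tr removes the padding some wc implementations put around the number.
    lines=$(wc -l < "$fq" | tr -d ' ')
    if [ $((lines % 4)) -ne 0 ]; then
        echo "truncated or malformed FASTQ: $fq ($lines lines)" >&2
        return 1
    fi
    echo $((lines / 4))
}
```

To check every file in the data directory: for fq in data/*.fastq; do count_fastq_reads "$fq" || echo "check $fq"; done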

4. Run the Pipeline

Make sure Docker is running on your computer, then execute:

nextflow run pipeline/main.nf -c pipeline/nextflow.config \
  --kraken2_db "$PWD/data/kraken2-db/minikraken2_v2_8GB_201904_UPDATE" \
  --data_dir "$PWD/data" \
  -with-docker

5. View the Results

Kraken2 output:

Check the results/ directory for classification and report files. Nextflow keeps the intermediate files for each task under work/.
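
For a quick look at a Kraken2 report without opening the HTML summary, the species-level rows can be pulled out on the command line. The sketch below assumes the standard six-column, tab-separated Kraken2 report format. The function name top_species and the file name in the usage note are hypothetical, since the exact report file names the pipeline writes are not stated here.

```shell
# Sketch: list the N most abundant species from a Kraken2 report file.
# Standard report columns (tab-separated):
#   1 percent of reads in clade, 2 reads in clade, 3 reads assigned directly,
#   4 rank code, 5 NCBI taxid, 6 indented scientific name
# Output columns: reads in clade, percent, species name.
top_species() {
    report="$1"
    n="${2:-10}"
    awk -F '\t' '$4 == "S" {
        pct = $1
        gsub(/ /, "", pct)      # percent field is space-padded
        name = $6
        sub(/^ +/, "", name)    # drop the indentation that encodes tree depth
        printf "%s\t%s\t%s\n", $2, pct, name
    }' "$report" | sort -t "$(printf '\t')" -k1,1nr | head -n "$n"
}
```

For example, top_species results/SAMPLE.report.txt 5 would print the five species with the most assigned reads, where SAMPLE.report.txt stands in for whichever report file your run produced.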

MultiQC report:

Open reports/multiqc_report.html directly in your browser (double-click or drag-and-drop the file) for an interactive summary.

Troubleshooting Tips

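If you see errors about missing tools, make sure the environment is active (conda activate soil-microbiome-pipeline), then sync it with the environment file:

conda env update -f environment.yml --prune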

If Docker errors occur, ensure Docker Desktop is running.
If the MultiQC report shows raw code, open it directly in your browser, not via a local server.
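
Most of the issues above can be caught before launching a run. The following preflight check is a sketch, not part of the repository: the function names are hypothetical, and the list of commands to check is an assumption based on the tools named in this post.

```shell
# Sketch: check that required commands are on PATH before launching a run.
# Function names are assumptions made for illustration.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found on PATH: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Docker can be installed yet not running. docker info fails when it
# cannot reach the daemon, so its exit status works as a liveness check.
docker_running() {
    docker info >/dev/null 2>&1
}
```

A typical call before step 4 would be: require_cmds nextflow docker && docker_running || echo "fix the setup before running the pipeline"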

TL;DR: What This Pipeline Does

> This pipeline classifies soil metagenomic sequencing reads using Kraken2 and generates a summary report with MultiQC, providing a reproducible workflow for taxonomic analysis.