Workflow guide

This page covers the analysis workflows in DRAKKAR: the complete pipeline and the module-level commands used to run specific stages independently.

Workflow overview

Command

Purpose

Typical outputs

drakkar complete

Run the main workflow end-to-end from reads to downstream products.

Full output tree across preprocessing, cataloging, profiling, annotation, and optional expression.

drakkar preprocessing

Clean reads and optionally remove host DNA.

Cleaned read files, preprocessing summaries, microbial fraction, and optional Nonpareil outputs.

drakkar cataloging

Assemble reads, bin contigs, and summarize the MAG catalog.

Assemblies, bins, bin metadata, and cataloging.tsv.

drakkar profiling

Dereplicate MAGs and quantify genomes or pangenomes across samples.

Dereplicated genomes, abundance tables, and profiling outputs.

drakkar annotating

Annotate MAGs taxonomically and functionally.

Taxonomy tables plus gene- and cluster-level annotation tables.

drakkar expressing

Map metatranscriptomes to annotated genes.

Gene expression tables under expressing/.

drakkar dereplicating

Run dereplication only, without read mapping.

Dereplicated genomes in dereplicating/final.

drakkar inspecting

Run microdiversity and mapping inspection steps.

Inspection outputs derived from MAGs, coverage tables, and BAM files.

Complete workflow

Run the full pipeline in sequence:

$ drakkar complete -f input_info.tsv -o drakkar_output -m individual -t genomes

Options:

  • -i/--input: input directory for reads.

  • -f/--file: sample info table (TSV), with read pairs provided either as rawreads1/rawreads2 or as an ENA/SRA accession.

  • -o/--output: output directory.

  • -r/--reference: local path or URL to a host reference genome for preprocessing.

  • -x/--reference-index: local path or URL to a tarball containing a host reference FASTA and Bowtie2 index files; incompatible with -r/--reference.

  • -m/--mode: assembly modes such as individual and all.

  • -b/--binners: comma-separated binners for cataloging (metabat, maxbin, semibin, comebin; default: all).

  • -t/--type: profiling type (genomes or pangenomes).

  • --annotation-type: comma-separated annotation targets. See Annotating below for the full set.

  • --annotation-evalue: maximum e-value for merged gene annotation hits (default: 1e-10).

  • --annotation-identity: minimum percent identity for merged gene annotation hits with identity values, currently VFDB/MMseqs hits (default: 50).

  • -c/--multicoverage: enable multicoverage mapping.

  • --fraction: compute microbial fraction with SingleM.

  • --nonpareil: estimate metagenomic coverage and diversity with Nonpareil.

  • -a/--ani: dRep ANI threshold (default: 0.98).

  • -e/--env_path: shared Conda environment directory.

  • -p/--profile: Snakemake profile (default: slurm).

  • --overwrite: delete a locked output directory and rerun from scratch.

  • --skip-benchmark: skip SLURM resource benchmark collection after the run.

  • --memory-multiplier N / --time-multiplier N: scale per-rule resource requests before the configured caps are applied.

  • --snakemake-* / --slurm-*: Snakemake and SLURM override flags. See Snakemake and SLURM management in Operations and troubleshooting.

Module reference

Preprocessing

Quality filters reads, optionally removes host DNA, and writes cleaned reads and preprocessing summaries.

$ drakkar preprocessing -i /path/to/reads -o drakkar_output -r host.fna

Options:

  • -i/--input: input directory for raw reads.

  • -f/--file: sample info table, with read pairs provided either as rawreads1/rawreads2 or as an ENA/SRA accession.

  • -o/--output: output directory.

  • -r/--reference: local path or URL to a host reference genome file.

  • -x/--reference-index: local path or URL to a tarball containing a host reference FASTA and Bowtie2 index files; incompatible with -r/--reference.

  • --fraction: compute microbial fraction with SingleM after preprocessing.

  • --nonpareil: estimate metagenomic coverage and diversity with Nonpareil.

  • -e/--env_path: shared Conda environment directory.

  • -p/--profile: Snakemake profile.

  • --overwrite: delete a locked output directory and rerun from scratch.

  • --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.

Cataloging

Assembles reads, bins contigs into MAGs, generates bin metadata, and writes cataloging.tsv with assembly, mapping, and binning summary statistics.

$ drakkar cataloging -i /path/to/preprocessed -o drakkar_output -m individual

Options:

  • -i/--input: directory with preprocessed reads or compatible workflow input.

  • -f/--file: sample info table. See Read resolution below for how the workflow decides which reads to use for assembly and mapping.

  • -o/--output: output directory.

  • -m/--mode: assembly modes such as individual and all.

  • -b/--binners: comma-separated binners to run (metabat, maxbin, semibin, comebin; default: all).

  • -c/--multicoverage: enable multicoverage mapping.

  • -e/--env_path: shared Conda environment directory.

  • -p/--profile: Snakemake profile.

  • --overwrite: delete a locked output directory and rerun from scratch.

  • --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.

Read resolution when using -f/--file

When a sample info table is provided, cataloging resolves the reads to assemble and map in the following priority order per sample:

  1. Explicit preprocessed columns — if the table contains preprocessedreads1 and preprocessedreads2 columns for a sample, those paths are used directly. Both columns must be present together.

  2. Auto-detected preprocessed reads — if no preprocessedreads1/ preprocessedreads2 columns are supplied, DRAKKAR checks whether preprocessing/final/<sample>_1.fq.gz and the matching R2 file exist inside the output directory. If they do, those quality-filtered reads are used. This covers the standard case of running cataloging after preprocessing in the same output directory without changing the input file.

  3. Raw reads or accession — if neither of the above is available, cataloging falls back to rawreads1/rawreads2 paths or an ENA/SRA accession from the table.

This allows a single input file to carry assembly grouping (assembly) and coverage grouping (coverage) metadata alongside raw read paths, while cataloging automatically upgrades to quality-filtered reads whenever they are available.

Profiling

Dereplicates MAGs and maps reads to estimate abundance, with optional microbial fraction estimation.

$ drakkar profiling -b /path/to/bins -R reads.tsv -o drakkar_output

Options:

  • -b/--bins_dir: directory with MAG/bin FASTA files.

  • -B/--bins_file: file listing MAG/bin paths.

  • -r/--reads_dir: directory with reads.

  • -R/--reads_file: sample info table with reads, using either rawreads1/rawreads2 or an ENA/SRA accession.

  • -o/--output: output directory.

  • -t/--type: profiling type (genomes or pangenomes).

  • -f/--fraction: compute microbial fraction with SingleM.

  • -a/--ani: dRep ANI threshold.

  • -n/--ignore_quality: pass --ignoreGenomeQuality to dRep.

  • -q/--quality: CSV/TSV with genome, completeness, and contamination; use this instead of CheckM2.

  • -e/--env_path: shared Conda environment directory.

  • -p/--profile: Snakemake profile.

  • --overwrite: delete a locked output directory and rerun from scratch.

  • --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.

Annotating

Annotates dereplicated MAGs taxonomically and/or functionally. When taxonomy annotation is enabled, DRAKKAR also writes annotating/bacteria.tree and, when archaeal MAGs are present, annotating/archaea.tree by pruning GTDB-Tk classify trees down to the input genomes only.

$ drakkar annotating -b /path/to/mags -o drakkar_output --annotation-type taxonomy,function
$ drakkar annotating -b /path/to/mags -o drakkar_output --annotation-type genes

Options:

  • -b/--bins_dir: directory with MAG/bin FASTA files.

  • -B/--bins_file: file listing MAG/bin paths.

  • -o/--output: output directory.

  • --annotation-type: comma-separated annotation targets:

    • taxonomy: run GTDB-Tk taxonomy.

    • function: run all functional components below.

    • genes: run only gene-level components (kegg,cazy,pfam,virulence,amr,signalp).

    • kegg: KEGG ortholog HMM annotation.

    • cazy: CAZy HMM annotation.

    • pfam: PFAM HMM annotation.

    • virulence (alias: vfdb): VFDB-based virulence annotation.

    • amr: AMR HMM annotation.

    • signalp: signal peptide prediction.

    • dbcan: dbCAN/CGC annotation.

    • antismash: biosynthetic cluster annotation.

    • defense: DefenseFinder systems and genes.

    • mobile (alias: genomad): geNomad mobile and viral regions.

    • network: metabolic network reconstruction.

  • --gtdb-version: GTDB release number for taxonomy annotation. DRAKKAR uses GTDB_DB_<version> from config.yaml; if omitted, it uses GTDB_DB.

  • --annotation-evalue: maximum e-value for merged gene annotation hits (default: 1e-10).

  • --annotation-identity: minimum percent identity for merged gene annotation hits with identity values, currently VFDB/MMseqs hits (default: 50).

  • -e/--env_path: shared Conda environment directory.

  • -p/--profile: Snakemake profile.

  • --overwrite: delete a locked output directory and rerun from scratch.

  • --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.

Output behavior for partial functional runs:

  • annotating/gene_annotations.tsv.xz is generated when any gene-level source is selected (kegg,cazy,pfam,virulence,amr,signalp,defense).

  • annotating/cluster_annotations.tsv.xz is generated when any cluster-level source is selected (dbcan,antismash,defense,mobile).

  • Merged tables are still generated from the available sources when only a subset of functional components is selected.

Expressing

Maps metatranscriptomic reads to annotated genes to quantify expression.

$ drakkar expressing -b /path/to/mags -R transcriptome.tsv -o drakkar_output

Options:

  • -b/--bins_dir: directory with MAG/bin FASTA files.

  • -B/--bins_file: file listing MAG/bin paths.

  • -r/--reads_dir: directory with transcriptome reads.

  • -R/--reads_file: transcriptome sample table, using either rawreads1/rawreads2 or an ENA/SRA accession.

  • -o/--output: output directory.

  • -e/--env_path: shared Conda environment directory.

  • -p/--profile: Snakemake profile.

  • --overwrite: delete a locked output directory and rerun from scratch.

  • --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.

Dereplicating

Runs only the dereplication step and outputs dereplicated genomes to dereplicating/final.

$ drakkar dereplicating -b /path/to/bins -o drakkar_output

Options:

  • -b/--bins_dir: directory with MAG/bin FASTA files.

  • -B/--bins_file: file listing MAG/bin paths.

  • -o/--output: output directory.

  • -a/--ani: dRep ANI threshold.

  • -n/--ignore_quality: pass --ignoreGenomeQuality to dRep.

  • -q/--quality: CSV/TSV with genome, completeness, and contamination; use this instead of CheckM2.

  • -e/--env_path: shared Conda environment directory.

  • -p/--profile: Snakemake profile.

  • --overwrite: delete a locked output directory and rerun from scratch.

  • --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.

Inspecting

Runs microdiversity and mapping inspection workflows.

$ drakkar inspecting -b /path/to/mags -m /path/to/bams -c coverage.tsv -o drakkar_output

Options:

  • -b/--bins_dir: directory with MAG/bin FASTA files.

  • -B/--bins_file: file listing MAG/bin paths.

  • -m/--mapping_dir: directory with BAM files.

  • -c/--cov_file: coverage table per genome per sample.

  • -o/--output: output directory.

  • -e/--env_path: shared Conda environment directory.

  • -p/--profile: Snakemake profile.

  • --overwrite: delete a locked output directory and rerun from scratch.

  • --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.