Workflow guide
This page covers the analysis workflows in DRAKKAR: the complete pipeline and the module-level commands used to run specific stages independently.
Workflow overview
| Command | Purpose | Typical outputs |
|---|---|---|
| drakkar complete | Run the main workflow end-to-end from reads to downstream products. | Full output tree across preprocessing, cataloging, profiling, annotation, and optional expression. |
| drakkar preprocessing | Clean reads and optionally remove host DNA. | Cleaned read files, preprocessing summaries, microbial fraction, and optional Nonpareil outputs. |
| drakkar cataloging | Assemble reads, bin contigs, and summarize the MAG catalog. | Assemblies, bins, bin metadata, and cataloging.tsv. |
| drakkar profiling | Dereplicate MAGs and quantify genomes or pangenomes across samples. | Dereplicated genomes, abundance tables, and profiling outputs. |
| drakkar annotating | Annotate MAGs taxonomically and functionally. | Taxonomy tables plus gene- and cluster-level annotation tables. |
| drakkar expressing | Map metatranscriptomes to annotated genes. | Gene expression tables under |
| drakkar dereplicating | Run dereplication only, without read mapping. | Dereplicated genomes in dereplicating/final. |
| drakkar inspecting | Run microdiversity and mapping inspection steps. | Inspection outputs derived from MAGs, coverage tables, and BAM files. |
Complete workflow
Run the full pipeline in sequence:
$ drakkar complete -f input_info.tsv -o drakkar_output -m individual -t genomes
Options:
- -i/--input: input directory for reads.
- -f/--file: sample info table (TSV), with read pairs provided either as rawreads1/rawreads2 or as an ENA/SRA accession.
- -o/--output: output directory.
- -r/--reference: local path or URL to a host reference genome for preprocessing.
- -x/--reference-index: local path or URL to a tarball containing a host reference FASTA and Bowtie2 index files; incompatible with -r/--reference.
- -m/--mode: assembly modes such as individual and all.
- -b/--binners: comma-separated binners for cataloging (metabat, maxbin, semibin, comebin; default: all).
- -t/--type: profiling type (genomes or pangenomes).
- --annotation-type: comma-separated annotation targets. See Annotating below for the full set.
- --annotation-evalue: maximum e-value for merged gene annotation hits (default: 1e-10).
- --annotation-identity: minimum percent identity for merged gene annotation hits with identity values, currently VFDB/MMseqs hits (default: 50).
- -c/--multicoverage: enable multicoverage mapping.
- --fraction: compute microbial fraction with SingleM.
- --nonpareil: estimate metagenomic coverage and diversity with Nonpareil.
- -a/--ani: dRep ANI threshold (default: 0.98).
- -e/--env_path: shared Conda environment directory.
- -p/--profile: Snakemake profile (default: slurm).
- --overwrite: delete a locked output directory and rerun from scratch.
- --skip-benchmark: skip SLURM resource benchmark collection after the run.
- --memory-multiplier N / --time-multiplier N: scale per-rule resource requests before the configured caps are applied.
- --snakemake-* / --slurm-*: Snakemake and SLURM override flags. See Snakemake and SLURM management in Operations and troubleshooting.
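The effect of --memory-multiplier and --time-multiplier can be pictured as "scale first, then clamp to the cap". The sketch below is only an illustration of that ordering: the function name, the rounding rule, and the example numbers are assumptions, not DRAKKAR internals.

```python
import math


def scaled_request(base: int, multiplier: float, cap: int) -> int:
    """Scale a per-rule resource request, then clamp it to the configured cap."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    scaled = math.ceil(base * multiplier)  # rounding up is an assumption
    return min(scaled, cap)


# Hypothetical rule that requests 16 units of memory with a cap of 64.
print(scaled_request(16, 2, 64))  # 32: scaled value stays below the cap
print(scaled_request(16, 8, 64))  # 64: scaled value (128) is clamped to the cap
```

Because the cap is applied last, a large multiplier can never push a rule beyond the configured limit.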
Module reference
Preprocessing
Quality-filters reads, optionally removes host DNA, and writes cleaned reads and preprocessing summaries.
$ drakkar preprocessing -i /path/to/reads -o drakkar_output -r host.fna
Options:
- -i/--input: input directory for raw reads.
- -f/--file: sample info table, with read pairs provided either as rawreads1/rawreads2 or as an ENA/SRA accession.
- -o/--output: output directory.
- -r/--reference: local path or URL to a host reference genome file.
- -x/--reference-index: local path or URL to a tarball containing a host reference FASTA and Bowtie2 index files; incompatible with -r/--reference.
- --fraction: compute microbial fraction with SingleM after preprocessing.
- --nonpareil: estimate metagenomic coverage and diversity with Nonpareil.
- -e/--env_path: shared Conda environment directory.
- -p/--profile: Snakemake profile.
- --overwrite: delete a locked output directory and rerun from scratch.
- --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.
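When preprocessing is launched from a script, it can help to assemble the command from parameters and reject the one incompatible pair (-r and -x) before submitting anything. The helper below is a hypothetical wrapper, not part of DRAKKAR; it only uses flags listed above and does not run the command.

```python
import shlex
from typing import Optional


def preprocessing_command(
    output: str,
    input_dir: Optional[str] = None,
    sample_file: Optional[str] = None,
    reference: Optional[str] = None,
    reference_index: Optional[str] = None,
    fraction: bool = False,
    nonpareil: bool = False,
) -> list:
    """Build a `drakkar preprocessing` argument list from documented flags."""
    # The options list states that -x/--reference-index cannot be combined with -r/--reference.
    if reference and reference_index:
        raise ValueError("-r/--reference and -x/--reference-index are incompatible")
    cmd = ["drakkar", "preprocessing"]
    if input_dir:
        cmd += ["-i", input_dir]
    if sample_file:
        cmd += ["-f", sample_file]
    cmd += ["-o", output]
    if reference:
        cmd += ["-r", reference]
    if reference_index:
        cmd += ["-x", reference_index]
    if fraction:
        cmd.append("--fraction")
    if nonpareil:
        cmd.append("--nonpareil")
    return cmd


cmd = preprocessing_command("drakkar_output", input_dir="/path/to/reads", reference="host.fna")
print(shlex.join(cmd))  # drakkar preprocessing -i /path/to/reads -o drakkar_output -r host.fna
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where DRAKKAR is installed.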
Cataloging
Assembles reads, bins contigs into MAGs, generates bin metadata, and writes
cataloging.tsv with assembly, mapping, and binning summary statistics.
$ drakkar cataloging -i /path/to/preprocessed -o drakkar_output -m individual
Options:
- -i/--input: directory with preprocessed reads or compatible workflow input.
- -f/--file: sample info table. See Read resolution below for how the workflow decides which reads to use for assembly and mapping.
- -o/--output: output directory.
- -m/--mode: assembly modes such as individual and all.
- -b/--binners: comma-separated binners to run (metabat, maxbin, semibin, comebin; default: all).
- -c/--multicoverage: enable multicoverage mapping.
- -e/--env_path: shared Conda environment directory.
- -p/--profile: Snakemake profile.
- --overwrite: delete a locked output directory and rerun from scratch.
- --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.
Read resolution when using -f/--file
When a sample info table is provided, cataloging resolves the reads to assemble and map in the following priority order per sample:
1. Explicit preprocessed columns: if the table contains preprocessedreads1 and preprocessedreads2 columns for a sample, those paths are used directly. Both columns must be present together.
2. Auto-detected preprocessed reads: if no preprocessedreads1/preprocessedreads2 columns are supplied, DRAKKAR checks whether preprocessing/final/<sample>_1.fq.gz and the matching R2 file exist inside the output directory. If they do, those quality-filtered reads are used. This covers the standard case of running cataloging after preprocessing in the same output directory without changing the input file.
3. Raw reads or accession: if neither of the above is available, cataloging falls back to rawreads1/rawreads2 paths or an ENA/SRA accession from the table.
This allows a single input file to carry assembly grouping (assembly) and
coverage grouping (coverage) metadata alongside raw read paths, while
cataloging automatically upgrades to quality-filtered reads whenever they are
available.
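The priority order above can be sketched as a small function. This is an illustration under stated assumptions, not DRAKKAR's code: the function and helper names are hypothetical, the R2 file is assumed to be named <sample>_2.fq.gz, and raw read paths are assumed to be tried before the accession.

```python
import tempfile
from pathlib import Path


def _cell(row: dict, column: str) -> str:
    """Return a stripped table cell, treating a missing column as empty."""
    return (row.get(column) or "").strip()


def resolve_reads(row: dict, sample: str, output_dir: str) -> tuple:
    """Pick the reads for one sample following the documented priority order."""
    # 1. Explicit preprocessed columns: both must be given together.
    pre1, pre2 = _cell(row, "preprocessedreads1"), _cell(row, "preprocessedreads2")
    if pre1 or pre2:
        if not (pre1 and pre2):
            raise ValueError(f"{sample}: both preprocessed read columns are required")
        return "explicit", (pre1, pre2)

    # 2. Auto-detected preprocessed reads inside the output directory.
    final_dir = Path(output_dir) / "preprocessing" / "final"
    auto1 = final_dir / f"{sample}_1.fq.gz"
    auto2 = final_dir / f"{sample}_2.fq.gz"  # assumed name of the matching R2 file
    if auto1.exists() and auto2.exists():
        return "auto", (str(auto1), str(auto2))

    # 3. Raw reads, then accession (this relative order is an assumption).
    raw1, raw2 = _cell(row, "rawreads1"), _cell(row, "rawreads2")
    if raw1 and raw2:
        return "raw", (raw1, raw2)
    accession = _cell(row, "accession")
    if accession:
        return "accession", (accession,)
    raise ValueError(f"{sample}: no reads could be resolved")


# Same table row before and after preprocessing outputs appear in the output directory.
with tempfile.TemporaryDirectory() as out:
    row = {"rawreads1": "raw/S1_1.fq.gz", "rawreads2": "raw/S1_2.fq.gz"}
    print(resolve_reads(row, "S1", out)[0])  # raw
    final = Path(out) / "preprocessing" / "final"
    final.mkdir(parents=True)
    (final / "S1_1.fq.gz").touch()
    (final / "S1_2.fq.gz").touch()
    print(resolve_reads(row, "S1", out)[0])  # auto
```

The demo shows why the input file does not need editing between modules: the same row resolves to raw reads at first and to the quality-filtered reads once they exist.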
Profiling
Dereplicates MAGs and maps reads to estimate abundance, with optional microbial fraction estimation.
$ drakkar profiling -b /path/to/bins -R reads.tsv -o drakkar_output
Options:
- -b/--bins_dir: directory with MAG/bin FASTA files.
- -B/--bins_file: file listing MAG/bin paths.
- -r/--reads_dir: directory with reads.
- -R/--reads_file: sample info table with reads, using either rawreads1/rawreads2 or an ENA/SRA accession.
- -o/--output: output directory.
- -t/--type: profiling type (genomes or pangenomes).
- -f/--fraction: compute microbial fraction with SingleM.
- -a/--ani: dRep ANI threshold.
- -n/--ignore_quality: pass --ignoreGenomeQuality to dRep.
- -q/--quality: CSV/TSV with genome, completeness, and contamination; use this instead of CheckM2.
- -e/--env_path: shared Conda environment directory.
- -p/--profile: Snakemake profile.
- --overwrite: delete a locked output directory and rerun from scratch.
- --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.
Annotating
Annotates dereplicated MAGs taxonomically and/or functionally.
When taxonomy annotation is enabled, DRAKKAR also writes
annotating/bacteria.tree and, when archaeal MAGs are present,
annotating/archaea.tree by pruning GTDB-Tk classify trees down to the
input genomes only.
$ drakkar annotating -b /path/to/mags -o drakkar_output --annotation-type taxonomy,function
$ drakkar annotating -b /path/to/mags -o drakkar_output --annotation-type genes
Options:
- -b/--bins_dir: directory with MAG/bin FASTA files.
- -B/--bins_file: file listing MAG/bin paths.
- -o/--output: output directory.
- --annotation-type: comma-separated annotation targets:
  - taxonomy: run GTDB-Tk taxonomy.
  - function: run all functional components below.
  - genes: run only gene-level components (kegg, cazy, pfam, virulence, amr, signalp).
  - kegg: KEGG ortholog HMM annotation.
  - cazy: CAZy HMM annotation.
  - pfam: PFAM HMM annotation.
  - virulence (alias: vfdb): VFDB-based virulence annotation.
  - amr: AMR HMM annotation.
  - signalp: signal peptide prediction.
  - dbcan: dbCAN/CGC annotation.
  - antismash: biosynthetic cluster annotation.
  - defense: DefenseFinder systems and genes.
  - mobile (alias: genomad): geNomad mobile and viral regions.
  - network: metabolic network reconstruction.
- --gtdb-version: GTDB release number for taxonomy annotation. DRAKKAR uses GTDB_DB_<version> from config.yaml; if omitted, it uses GTDB_DB.
- --annotation-evalue: maximum e-value for merged gene annotation hits (default: 1e-10).
- --annotation-identity: minimum percent identity for merged gene annotation hits with identity values, currently VFDB/MMseqs hits (default: 50).
- -e/--env_path: shared Conda environment directory.
- -p/--profile: Snakemake profile.
- --overwrite: delete a locked output directory and rerun from scratch.
- --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.
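To make the interaction of --annotation-evalue and --annotation-identity concrete, here is a minimal sketch of a hit filter using the documented defaults. The function name, the inclusive comparisons, and the example hits are invented for illustration; DRAKKAR's actual merging code may differ.

```python
from typing import Optional


def keep_hit(
    evalue: float,
    identity: Optional[float] = None,
    max_evalue: float = 1e-10,
    min_identity: float = 50.0,
) -> bool:
    """Return True if a hit passes the e-value and, when reported, identity thresholds."""
    if evalue > max_evalue:
        return False
    # Identity is only checked for sources that report it (currently VFDB/MMseqs hits).
    if identity is not None and identity < min_identity:
        return False
    return True


# Invented example hits, not real annotation output.
hits = [
    {"source": "kegg", "evalue": 1e-30, "identity": None},
    {"source": "vfdb", "evalue": 1e-40, "identity": 42.0},
    {"source": "pfam", "evalue": 1e-5, "identity": None},
]
kept = [h["source"] for h in hits if keep_hit(h["evalue"], h["identity"])]
print(kept)  # ['kegg']
```

In this sketch the HMM-based hits are filtered on e-value alone, while the VFDB hit is dropped for low identity despite a strong e-value.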
Output behavior for partial functional runs:
- annotating/gene_annotations.tsv.xz is generated when any gene-level source is selected (kegg, cazy, pfam, virulence, amr, signalp, defense).
- annotating/cluster_annotations.tsv.xz is generated when any cluster-level source is selected (dbcan, antismash, defense, mobile).
- Merged tables are still generated from the available sources when only a subset of functional components is selected.
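The rules above can be checked for a given --annotation-type value with a short lookup. The sketch below encodes only the groups and aliases listed on this page; the function names are hypothetical, and how DRAKKAR parses the option internally is not shown here.

```python
GENES = {"kegg", "cazy", "pfam", "virulence", "amr", "signalp"}
FUNCTION = GENES | {"dbcan", "antismash", "defense", "mobile", "network"}
ALIASES = {"vfdb": "virulence", "genomad": "mobile"}

# Sources feeding each merged table, as listed under "Output behavior".
GENE_LEVEL = GENES | {"defense"}
CLUSTER_LEVEL = {"dbcan", "antismash", "defense", "mobile"}


def expand(annotation_type: str) -> set:
    """Expand a comma-separated --annotation-type value into individual components."""
    selected = set()
    for raw in annotation_type.split(","):
        name = ALIASES.get(raw.strip(), raw.strip())
        if name == "function":
            selected |= FUNCTION
        elif name == "genes":
            selected |= GENES
        elif name:
            selected.add(name)
    return selected


def merged_tables(annotation_type: str) -> list:
    """List the merged annotation tables expected for the selected components."""
    selected = expand(annotation_type)
    tables = []
    if selected & GENE_LEVEL:
        tables.append("annotating/gene_annotations.tsv.xz")
    if selected & CLUSTER_LEVEL:
        tables.append("annotating/cluster_annotations.tsv.xz")
    return tables


for value in ("taxonomy,function", "genes", "dbcan", "defense", "taxonomy"):
    print(value, "->", merged_tables(value))
```

Note that defense contributes to both tables, so selecting it alone yields both merged files, while taxonomy alone yields neither.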
Expressing
Maps metatranscriptomic reads to annotated genes to quantify expression.
$ drakkar expressing -b /path/to/mags -R transcriptome.tsv -o drakkar_output
Options:
- -b/--bins_dir: directory with MAG/bin FASTA files.
- -B/--bins_file: file listing MAG/bin paths.
- -r/--reads_dir: directory with transcriptome reads.
- -R/--reads_file: transcriptome sample table, using either rawreads1/rawreads2 or an ENA/SRA accession.
- -o/--output: output directory.
- -e/--env_path: shared Conda environment directory.
- -p/--profile: Snakemake profile.
- --overwrite: delete a locked output directory and rerun from scratch.
- --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.
Dereplicating
Runs only the dereplication step and outputs dereplicated genomes to
dereplicating/final.
$ drakkar dereplicating -b /path/to/bins -o drakkar_output
Options:
- -b/--bins_dir: directory with MAG/bin FASTA files.
- -B/--bins_file: file listing MAG/bin paths.
- -o/--output: output directory.
- -a/--ani: dRep ANI threshold.
- -n/--ignore_quality: pass --ignoreGenomeQuality to dRep.
- -q/--quality: CSV/TSV with genome, completeness, and contamination; use this instead of CheckM2.
- -e/--env_path: shared Conda environment directory.
- -p/--profile: Snakemake profile.
- --overwrite: delete a locked output directory and rerun from scratch.
- --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.
Inspecting
Runs microdiversity and mapping inspection workflows.
$ drakkar inspecting -b /path/to/mags -m /path/to/bams -c coverage.tsv -o drakkar_output
Options:
- -b/--bins_dir: directory with MAG/bin FASTA files.
- -B/--bins_file: file listing MAG/bin paths.
- -m/--mapping_dir: directory with BAM files.
- -c/--cov_file: coverage table per genome per sample.
- -o/--output: output directory.
- -e/--env_path: shared Conda environment directory.
- -p/--profile: Snakemake profile.
- --overwrite: delete a locked output directory and rerun from scratch.
- --skip-benchmark / --memory-multiplier / --time-multiplier / --snakemake-* / --slurm-*: see Snakemake and SLURM management.