Workflow guide
==============

This page covers the analysis workflows in DRAKKAR: the complete pipeline and
the module-level commands used to run specific stages independently.

Workflow overview
-----------------

.. list-table::
   :header-rows: 1
   :widths: 18 42 40

   * - Command
     - Purpose
     - Typical outputs
   * - ``drakkar complete``
     - Run the main workflow end-to-end from reads to downstream products.
     - Full output tree across preprocessing, cataloging, profiling,
       annotation, and optional expression.
   * - ``drakkar preprocessing``
     - Clean reads and optionally remove host DNA.
     - Cleaned read files, preprocessing summaries, microbial fraction, and
       optional Nonpareil outputs.
   * - ``drakkar cataloging``
     - Assemble reads, bin contigs, and summarize the MAG catalog.
     - Assemblies, bins, bin metadata, and ``cataloging.tsv``.
   * - ``drakkar profiling``
     - Dereplicate MAGs and quantify genomes or pangenomes across samples.
     - Dereplicated genomes, abundance tables, and profiling outputs.
   * - ``drakkar annotating``
     - Annotate MAGs taxonomically and functionally.
     - Taxonomy tables plus gene- and cluster-level annotation tables.
   * - ``drakkar expressing``
     - Map metatranscriptomes to annotated genes.
     - Gene expression tables under ``expressing/``.
   * - ``drakkar dereplicating``
     - Run dereplication only, without read mapping.
     - Dereplicated genomes in ``dereplicating/final``.
   * - ``drakkar inspecting``
     - Run microdiversity and mapping inspection steps.
     - Inspection outputs derived from MAGs, coverage tables, and BAM files.

Complete workflow
-----------------

Run the full pipeline in sequence:

.. code-block:: console

   $ drakkar complete -f input_info.tsv -o drakkar_output -m individual -t genomes

Options:

- ``-i/--input``: input directory for reads.
- ``-f/--file``: sample info table (TSV), with read pairs provided either as
  ``rawreads1``/``rawreads2`` or as an ENA/SRA ``accession``.
- ``-o/--output``: output directory.
- ``-r/--reference``: local path or URL to a host reference genome for preprocessing.
- ``-x/--reference-index``: local path or URL to a tarball containing a host
  reference FASTA and Bowtie2 index files; incompatible with ``-r/--reference``.
- ``-m/--mode``: assembly modes such as ``individual`` and ``all``.
- ``-b/--binners``: comma-separated binners for cataloging
  (``metabat``, ``maxbin``, ``semibin``, ``comebin``; default: all).
- ``-t/--type``: profiling type (``genomes`` or ``pangenomes``).
- ``--annotation-type``: comma-separated annotation targets. See *Annotating*
  below for the full set.
- ``--annotation-evalue``: maximum e-value for merged gene annotation hits
  (default: ``1e-10``).
- ``--annotation-identity``: minimum percent identity for merged gene
  annotation hits with identity values, currently VFDB/MMseqs hits
  (default: ``50``).
- ``-c/--multicoverage``: enable multicoverage mapping.
- ``--fraction``: compute microbial fraction with SingleM.
- ``--nonpareil``: estimate metagenomic coverage and diversity with Nonpareil.
- ``-a/--ani``: dRep ANI threshold (default: ``0.98``).
- ``-e/--env_path``: shared Conda environment directory.
- ``-p/--profile``: Snakemake profile (default: ``slurm``).
- ``--overwrite``: delete a locked output directory and rerun from scratch.
- ``--skip-benchmark``: skip SLURM resource benchmark collection after the run.
- ``--memory-multiplier N`` / ``--time-multiplier N``: scale per-rule resource
  requests before the configured caps are applied.
- ``--snakemake-*`` / ``--slurm-*``: Snakemake and SLURM override flags. See
  :ref:`snakemake-slurm-management` in :doc:`operations`.

Module reference
----------------

Preprocessing
^^^^^^^^^^^^^

Quality filters reads, optionally removes host DNA, and writes cleaned reads
and preprocessing summaries.

.. code-block:: console

   $ drakkar preprocessing -i /path/to/reads -o drakkar_output -r host.fna

Options:

- ``-i/--input``: input directory for raw reads.
- ``-f/--file``: sample info table, with read pairs provided either as
  ``rawreads1``/``rawreads2`` or as an ENA/SRA ``accession``.
- ``-o/--output``: output directory.
- ``-r/--reference``: local path or URL to a host reference genome file.
- ``-x/--reference-index``: local path or URL to a tarball containing a host
  reference FASTA and Bowtie2 index files; incompatible with ``-r/--reference``.
- ``--fraction``: compute microbial fraction with SingleM after preprocessing.
- ``--nonpareil``: estimate metagenomic coverage and diversity with Nonpareil.
- ``-e/--env_path``: shared Conda environment directory.
- ``-p/--profile``: Snakemake profile.
- ``--overwrite``: delete a locked output directory and rerun from scratch.
- ``--skip-benchmark`` / ``--memory-multiplier`` / ``--time-multiplier`` /
  ``--snakemake-*`` / ``--slurm-*``: see :ref:`snakemake-slurm-management`.

Cataloging
^^^^^^^^^^

Assembles reads, bins contigs into MAGs, generates bin metadata, and writes
``cataloging.tsv`` with assembly, mapping, and binning summary statistics.

.. code-block:: console

   $ drakkar cataloging -i /path/to/preprocessed -o drakkar_output -m individual

Options:

- ``-i/--input``: directory with preprocessed reads or compatible workflow input.
- ``-f/--file``: sample info table. See *Read resolution* below for how the
  workflow decides which reads to use for assembly and mapping.
- ``-o/--output``: output directory.
- ``-m/--mode``: assembly modes such as ``individual`` and ``all``.
- ``-b/--binners``: comma-separated binners to run
  (``metabat``, ``maxbin``, ``semibin``, ``comebin``; default: all).
- ``-c/--multicoverage``: enable multicoverage mapping.
- ``-e/--env_path``: shared Conda environment directory.
- ``-p/--profile``: Snakemake profile.
- ``--overwrite``: delete a locked output directory and rerun from scratch.
- ``--skip-benchmark`` / ``--memory-multiplier`` / ``--time-multiplier`` /
  ``--snakemake-*`` / ``--slurm-*``: see :ref:`snakemake-slurm-management`.

Read resolution when using ``-f/--file``
"""""""""""""""""""""""""""""""""""""""""

When a sample info table is provided, cataloging resolves the reads to assemble
and map in the following priority order per sample:

1. **Explicit preprocessed columns** — if the table contains
   ``preprocessedreads1`` and ``preprocessedreads2`` columns for a sample,
   those paths are used directly. Both columns must be present together.

2. **Auto-detected preprocessed reads** — if no ``preprocessedreads1``/
   ``preprocessedreads2`` columns are supplied, DRAKKAR checks whether
   ``preprocessing/final/<sample>_1.fq.gz`` and the matching R2 file exist
   inside the output directory. If they do, those quality-filtered reads are
   used. This covers the standard case of running cataloging after
   preprocessing in the same output directory without changing the input file.

3. **Raw reads or accession** — if neither of the above is available,
   cataloging falls back to ``rawreads1``/``rawreads2`` paths or an ENA/SRA
   ``accession`` from the table.

This allows a single input file to carry assembly grouping (``assembly``) and
coverage grouping (``coverage``) metadata alongside raw read paths, while
cataloging automatically upgrades to quality-filtered reads whenever they are
available.

Profiling
^^^^^^^^^

Dereplicates MAGs and maps reads to estimate abundance, with optional microbial
fraction estimation.

.. code-block:: console

   $ drakkar profiling -b /path/to/bins -R reads.tsv -o drakkar_output

Options:

- ``-b/--bins_dir``: directory with MAG/bin FASTA files.
- ``-B/--bins_file``: file listing MAG/bin paths.
- ``-r/--reads_dir``: directory with reads.
- ``-R/--reads_file``: sample info table with reads, using either
  ``rawreads1``/``rawreads2`` or an ENA/SRA ``accession``.
- ``-o/--output``: output directory.
- ``-t/--type``: profiling type (``genomes`` or ``pangenomes``).
- ``-f/--fraction``: compute microbial fraction with SingleM.
- ``-a/--ani``: dRep ANI threshold.
- ``-n/--ignore_quality``: pass ``--ignoreGenomeQuality`` to dRep.
- ``-q/--quality``: CSV/TSV with genome, completeness, and contamination; use
  this instead of CheckM2.
- ``-e/--env_path``: shared Conda environment directory.
- ``-p/--profile``: Snakemake profile.
- ``--overwrite``: delete a locked output directory and rerun from scratch.
- ``--skip-benchmark`` / ``--memory-multiplier`` / ``--time-multiplier`` /
  ``--snakemake-*`` / ``--slurm-*``: see :ref:`snakemake-slurm-management`.

Annotating
^^^^^^^^^^

Annotates dereplicated MAGs taxonomically and/or functionally.
When taxonomy annotation is enabled, DRAKKAR also writes
``annotating/bacteria.tree`` and, when archaeal MAGs are present,
``annotating/archaea.tree`` by pruning GTDB-Tk classify trees down to the
input genomes only.

.. code-block:: console

   $ drakkar annotating -b /path/to/mags -o drakkar_output --annotation-type taxonomy,function

.. code-block:: console

   $ drakkar annotating -b /path/to/mags -o drakkar_output --annotation-type genes

Options:

- ``-b/--bins_dir``: directory with MAG/bin FASTA files.
- ``-B/--bins_file``: file listing MAG/bin paths.
- ``-o/--output``: output directory.
- ``--annotation-type``: comma-separated annotation targets:

  - ``taxonomy``: run GTDB-Tk taxonomy.
  - ``function``: run all functional components below.
  - ``genes``: run only gene-level components
    (``kegg,cazy,pfam,virulence,amr,signalp``).
  - ``kegg``: KEGG ortholog HMM annotation.
  - ``cazy``: CAZy HMM annotation.
  - ``pfam``: PFAM HMM annotation.
  - ``virulence`` (alias: ``vfdb``): VFDB-based virulence annotation.
  - ``amr``: AMR HMM annotation.
  - ``signalp``: signal peptide prediction.
  - ``dbcan``: dbCAN/CGC annotation.
  - ``antismash``: biosynthetic cluster annotation.
  - ``defense``: DefenseFinder systems and genes.
  - ``mobile`` (alias: ``genomad``): geNomad mobile and viral regions.
  - ``network``: metabolic network reconstruction.
- ``--gtdb-version``: GTDB release number for taxonomy annotation. DRAKKAR uses
  ``GTDB_DB_<version>`` from ``config.yaml``; if omitted, it uses ``GTDB_DB``.
- ``--annotation-evalue``: maximum e-value for merged gene annotation hits
  (default: ``1e-10``).
- ``--annotation-identity``: minimum percent identity for merged gene
  annotation hits with identity values, currently VFDB/MMseqs hits
  (default: ``50``).
- ``-e/--env_path``: shared Conda environment directory.
- ``-p/--profile``: Snakemake profile.
- ``--overwrite``: delete a locked output directory and rerun from scratch.
- ``--skip-benchmark`` / ``--memory-multiplier`` / ``--time-multiplier`` /
  ``--snakemake-*`` / ``--slurm-*``: see :ref:`snakemake-slurm-management`.

Output behavior for partial functional runs:

- ``annotating/gene_annotations.tsv.xz`` is generated when any gene-level
  source is selected
  (``kegg,cazy,pfam,virulence,amr,signalp,defense``).
- ``annotating/cluster_annotations.tsv.xz`` is generated when any cluster-level
  source is selected (``dbcan,antismash,defense,mobile``).
- Merged tables are still generated from the available sources when only a
  subset of functional components is selected.

Expressing
^^^^^^^^^^

Maps metatranscriptomic reads to annotated genes to quantify expression.

.. code-block:: console

   $ drakkar expressing -b /path/to/mags -R transcriptome.tsv -o drakkar_output

Options:

- ``-b/--bins_dir``: directory with MAG/bin FASTA files.
- ``-B/--bins_file``: file listing MAG/bin paths.
- ``-r/--reads_dir``: directory with transcriptome reads.
- ``-R/--reads_file``: transcriptome sample table, using either
  ``rawreads1``/``rawreads2`` or an ENA/SRA ``accession``.
- ``-o/--output``: output directory.
- ``-e/--env_path``: shared Conda environment directory.
- ``-p/--profile``: Snakemake profile.
- ``--overwrite``: delete a locked output directory and rerun from scratch.
- ``--skip-benchmark`` / ``--memory-multiplier`` / ``--time-multiplier`` /
  ``--snakemake-*`` / ``--slurm-*``: see :ref:`snakemake-slurm-management`.

Dereplicating
^^^^^^^^^^^^^

Runs only the dereplication step and outputs dereplicated genomes to
``dereplicating/final``.

.. code-block:: console

   $ drakkar dereplicating -b /path/to/bins -o drakkar_output

Options:

- ``-b/--bins_dir``: directory with MAG/bin FASTA files.
- ``-B/--bins_file``: file listing MAG/bin paths.
- ``-o/--output``: output directory.
- ``-a/--ani``: dRep ANI threshold.
- ``-n/--ignore_quality``: pass ``--ignoreGenomeQuality`` to dRep.
- ``-q/--quality``: CSV/TSV with genome, completeness, and contamination; use
  this instead of CheckM2.
- ``-e/--env_path``: shared Conda environment directory.
- ``-p/--profile``: Snakemake profile.
- ``--overwrite``: delete a locked output directory and rerun from scratch.
- ``--skip-benchmark`` / ``--memory-multiplier`` / ``--time-multiplier`` /
  ``--snakemake-*`` / ``--slurm-*``: see :ref:`snakemake-slurm-management`.

Inspecting
^^^^^^^^^^

Runs microdiversity and mapping inspection workflows.

.. code-block:: console

   $ drakkar inspecting -b /path/to/mags -m /path/to/bams -c coverage.tsv -o drakkar_output

Options:

- ``-b/--bins_dir``: directory with MAG/bin FASTA files.
- ``-B/--bins_file``: file listing MAG/bin paths.
- ``-m/--mapping_dir``: directory with BAM files.
- ``-c/--cov_file``: coverage table per genome per sample.
- ``-o/--output``: output directory.
- ``-e/--env_path``: shared Conda environment directory.
- ``-p/--profile``: Snakemake profile.
- ``--overwrite``: delete a locked output directory and rerun from scratch.
- ``--skip-benchmark`` / ``--memory-multiplier`` / ``--time-multiplier`` /
  ``--snakemake-*`` / ``--slurm-*``: see :ref:`snakemake-slurm-management`.