User guide

This page introduces how DRAKKAR is organized, what kinds of inputs it expects, and where to find the detailed workflow and operations documentation.

Quickstart

Run the complete pipeline with a sample info table:

$ drakkar complete -f input_info.tsv -o drakkar_output

Run the complete pipeline using a directory of reads:

$ drakkar complete -i /path/to/reads -o drakkar_output

Core concepts

  • Modules: DRAKKAR can be run end-to-end with drakkar complete or as independent modules such as preprocessing, cataloging, profiling, annotating, expressing, dereplicating, inspecting, database, status, logging, config, and transfer.

  • Output directory: all outputs are written under -o/--output and organized into predictable module-specific folders.

  • Profiles: use -p/--profile to select a Snakemake profile. The default is slurm.

  • Environments: use -e/--env_path to select a shared Conda environment directory.

  • Run logs: every workflow run writes a metadata file drakkar_YYYYMMDD-HHMMSS.yaml and captures Snakemake stdout/stderr in log/drakkar_<run_id>.snakemake.log.

  • Locked runs: output-writing workflows support --overwrite to delete a locked output directory and rerun after a broken Snakemake session.
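
Because the run metadata files carry a sortable timestamp in their names, past runs can be listed in chronological order with a few lines of Python. This is a minimal sketch, not part of DRAKKAR: the function name list_runs is hypothetical, and the directory that holds the metadata files is an assumption the caller must supply.

```python
import re
from datetime import datetime
from pathlib import Path

# Matches the documented metadata file name drakkar_YYYYMMDD-HHMMSS.yaml.
RUN_FILE = re.compile(r"^drakkar_(\d{8}-\d{6})\.yaml$")


def list_runs(directory):
    """Return (start_time, path) tuples for run metadata files, oldest first.

    Hypothetical helper: `directory` is wherever the metadata files live,
    which this sketch does not assume to be any particular folder.
    """
    runs = []
    for path in Path(directory).iterdir():
        match = RUN_FILE.match(path.name)
        if not match:
            continue
        try:
            started = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
        except ValueError:
            # Digits in the right shape but not a real date/time: skip.
            continue
        runs.append((started, path))
    return sorted(runs)
```

The last element of the returned list is the most recent run.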

Input formats

You can provide inputs as read directories or as a sample info table.

Directory input

Provide a directory with paired-end reads. DRAKKAR expects matching read-pair names such as *_1.fq.gz and *_2.fq.gz.

$ drakkar preprocessing -i /path/to/reads -o drakkar_output
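
Before pointing DRAKKAR at a directory, it can help to confirm that every R1 file has a matching R2 file. The sketch below illustrates the *_1.fq.gz / *_2.fq.gz naming convention described above; find_read_pairs is a hypothetical helper and does not reproduce DRAKKAR's own discovery logic.

```python
from pathlib import Path

R1_SUFFIX = "_1.fq.gz"
R2_SUFFIX = "_2.fq.gz"


def find_read_pairs(read_dir):
    """Return (pairs, unpaired) for a directory of paired-end reads.

    pairs maps each sample name to its (R1, R2) paths; unpaired lists the
    names of R1 files that have no matching R2 file next to them.
    """
    read_dir = Path(read_dir)
    pairs = {}
    unpaired = []
    for r1 in sorted(read_dir.glob("*" + R1_SUFFIX)):
        # The sample name is the file name with the R1 suffix removed.
        sample = r1.name[: -len(R1_SUFFIX)]
        r2 = read_dir / (sample + R2_SUFFIX)
        if r2.is_file():
            pairs[sample] = (r1, r2)
        else:
            unpaired.append(r1.name)
    return pairs, unpaired
```

A non-empty unpaired list usually points to a missing or misnamed file.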

Sample info table (TSV)

A tab-separated table can include any of these columns. Only the columns needed for the chosen workflow are required.

  • sample: sample name.

  • rawreads1: path or URL to R1 reads (raw, before preprocessing).

  • rawreads2: path or URL to R2 reads (raw, before preprocessing).

  • accession: ENA/SRA paired-end run accession such as ERR4303216 or SRR12345678. Use this instead of rawreads1 and rawreads2 when you want DRAKKAR to download the read pair automatically.

  • preprocessedreads1: explicit path to quality-filtered R1 reads for use in cataloging. Takes priority over all other read columns. See Cataloging read resolution below.

  • preprocessedreads2: explicit path to quality-filtered R2 reads for use in cataloging. Must be provided together with preprocessedreads1.

  • reference_name: host reference label for host-removal workflows.

  • reference_path: local path or URL to a host FASTA, or to a tarball containing the FASTA plus Bowtie2 index files.

  • assembly: labels defining assembly groups. The legacy column name coassembly is still accepted.

  • coverage: labels defining coverage-sharing groups for multicoverage cataloging.

Example:

sample\trawreads1\trawreads2\taccession\treference_name\treference_path\tassembly\tcoverage
sample1\tpath/sample1_1.fq.gz\tpath/sample1_2.fq.gz\t\tref1\tpath/ref1.fna\tassembly1,all\tcoverage1
sample2\t\t\tERR4303216\tref1\tpath/ref1.fna\tassembly2,all\tcoverage2
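
Assuming the \t sequences in the example above stand for real tab characters, the same table can be generated with Python's csv module, which avoids tabs being silently converted to spaces by an editor. This is a sketch only: write_sample_table is a hypothetical helper, and the rows simply repeat the placeholder paths from the example.

```python
import csv

COLUMNS = [
    "sample", "rawreads1", "rawreads2", "accession",
    "reference_name", "reference_path", "assembly", "coverage",
]

# Same placeholder values as the example table; missing keys become empty cells.
ROWS = [
    {"sample": "sample1", "rawreads1": "path/sample1_1.fq.gz",
     "rawreads2": "path/sample1_2.fq.gz", "reference_name": "ref1",
     "reference_path": "path/ref1.fna", "assembly": "assembly1,all",
     "coverage": "coverage1"},
    {"sample": "sample2", "accession": "ERR4303216", "reference_name": "ref1",
     "reference_path": "path/ref1.fna", "assembly": "assembly2,all",
     "coverage": "coverage2"},
]


def write_sample_table(path, rows, columns=COLUMNS):
    """Write rows to a tab-separated file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t",
            restval="", lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_sample_table("input_info.tsv", ROWS) produces a file that can be passed with -f.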

Input notes

  • Read files can be local paths or remote URLs (http/https/ftp/sftp).

  • Sample tables can also use an accession column with ENA/SRA paired-end run accessions; DRAKKAR resolves and downloads the matching R1 and R2 FASTQ files automatically.

  • -r/--reference, -x/--reference-index, and reference_path values can be local files or remote URLs.

  • Reference inputs may be FASTA files, compressed FASTA files, or tarballs containing a FASTA plus Bowtie2 index files.

  • Genome lists passed through options such as -B/--bins_file can also use remote URLs; DRAKKAR caches them locally before execution.

  • Directory-style inputs such as -i/--input and -b/--bins_dir must be local filesystem paths.

  • Before Snakemake starts, DRAKKAR checks downloaded and local input files for existence and non-zero size. Remote downloads retry up to five times with exponential backoff; sftp:// URLs require curl with SFTP support.

  • The preferred sample-table column name is assembly. The legacy column name coassembly is still accepted.

  • Assembly labels can be any identifiers you choose; they do not need to match sample names.

  • -m individual adds per-sample assemblies alongside grouped assemblies.

  • -b/--binners selects the binners used in cataloging. Use a comma-separated list of metabat, maxbin, semibin, and comebin; the default is all four.

  • --multicoverage maps samples sharing the same coverage label to each other’s individual assemblies.
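
The pre-flight check and download retry behaviour mentioned in the notes above can be illustrated with a short sketch. It is not DRAKKAR's implementation: check_inputs and retry_with_backoff are hypothetical names, the one-second base delay is an assumption, and "up to five times" is read here as five attempts in total.

```python
import time
from pathlib import Path


def check_inputs(paths):
    """Return a list of problems for input files that are missing or empty."""
    problems = []
    for path in map(Path, paths):
        if not path.is_file():
            problems.append(f"missing: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty: {path}")
    return problems


def retry_with_backoff(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure.

    Assumptions: failures surface as OSError, and the delays are
    base_delay, 2 * base_delay, 4 * base_delay, and so on.
    """
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller see the error.
            sleep(base_delay * 2 ** attempt)
```

Passing sleep as a parameter keeps the retry loop easy to test without real waiting.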

Cataloging read resolution

When drakkar cataloging (or drakkar complete) loads a sample info table with -f/--file, it resolves the reads to use for assembly and mapping in the following priority order for each sample:

  1. preprocessedreads1 / preprocessedreads2 columns: if both are present, the cataloging workflow uses these paths directly. This is the explicit override for cases where preprocessed reads live outside the default output tree.

  2. preprocessing/final/<sample>_1.fq.gz: if neither preprocessedreads1 nor preprocessedreads2 is supplied but a prior drakkar preprocessing run has already written quality-filtered reads into the output directory, cataloging detects and uses them automatically. This is the typical case when running cataloging as a follow-up step after preprocessing in the same output directory.

  3. rawreads1 / rawreads2 or accession: fall back to raw input paths. This path is taken when neither preprocessed column is present and no preprocessing/final/ files are found. The assembly then runs directly on unfiltered reads.

This means you can keep a single input table that contains raw read paths (or accessions) together with assembly and coverage grouping columns, and cataloging will automatically pick up the quality-filtered reads from a completed preprocessing run without any changes to the file.
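
The priority order above can be expressed as a small function, which may help when predicting which reads a given row will resolve to. This is a sketch under stated assumptions rather than DRAKKAR's code: resolve_cataloging_reads and the returned source labels are hypothetical, the R2 file in preprocessing/final/ is assumed to be named <sample>_2.fq.gz, and a row that supplies only one of the two preprocessed columns is treated as if it supplied neither.

```python
from pathlib import Path


def resolve_cataloging_reads(row, output_dir):
    """Return (source, reads) for one sample-table row.

    row is a dict keyed by column name; reads is an (R1, R2) tuple of
    paths, or the accession string when only an accession is available.
    """
    # 1. Explicit preprocessed-read columns take priority.
    pre1 = row.get("preprocessedreads1")
    pre2 = row.get("preprocessedreads2")
    if pre1 and pre2:
        return "preprocessed_columns", (pre1, pre2)

    # 2. Reads left by an earlier preprocessing run in the same output tree.
    final_dir = Path(output_dir) / "preprocessing" / "final"
    r1 = final_dir / f"{row['sample']}_1.fq.gz"
    r2 = final_dir / f"{row['sample']}_2.fq.gz"  # assumed R2 name
    if r1.is_file() and r2.is_file():
        return "preprocessing_output", (str(r1), str(r2))

    # 3. Fall back to raw reads, then to an accession.
    if row.get("rawreads1") and row.get("rawreads2"):
        return "raw_columns", (row["rawreads1"], row["rawreads2"])
    if row.get("accession"):
        return "accession", row["accession"]

    raise ValueError(f"no reads found for sample {row['sample']!r}")
```

Running the function over every row of a table before launching cataloging shows at a glance which samples would be assembled from unfiltered reads.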

Guide map

Use the next pages depending on what you need:

Topic                                                                Where to go next

Running the complete workflow or a specific module                   See Workflow guide.

Databases, logging, config, transfer, outputs, and troubleshooting   See Operations and troubleshooting.

Command list only                                                    See CLI Reference.