User guide

This page introduces how DRAKKAR is organized, what kinds of inputs it expects, and where to find the detailed workflow and operations documentation.

Quickstart

Run the complete pipeline with a sample info table:

$ drakkar complete -f input_info.tsv -o drakkar_output

Run the complete pipeline using a directory of reads:

$ drakkar complete -i /path/to/reads -o drakkar_output

Core concepts

Modules: DRAKKAR can be run end-to-end with drakkar complete or as independent modules such as preprocessing, cataloging, profiling, annotating, expressing, dereplicating, inspecting, database, status, logging, config, and transfer.
Output directory: all outputs are written under -o/--output and organized into predictable module-specific folders.
Profiles: use -p/--profile to select a Snakemake profile. The default is slurm.
Environments: use -e/--env_path to select a shared Conda environment directory.
Run logs: every workflow run writes a metadata file drakkar_YYYYMMDD-HHMMSS.yaml and captures Snakemake stdout/stderr in log/drakkar_<run_id>.snakemake.log.
Locked runs: output-writing workflows support --overwrite to delete a locked output directory and rerun after a broken Snakemake session.
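
As a small illustration of the run-log naming above, the following sketch parses the timestamp out of metadata filenames of the form drakkar_YYYYMMDD-HHMMSS.yaml and picks the most recent one. The helper names run_timestamp and latest_run are hypothetical and not part of DRAKKAR; the sketch works on bare filenames because the guide does not state where the metadata files are written.

```python
import re
from datetime import datetime

# Filename pattern taken from this guide: drakkar_YYYYMMDD-HHMMSS.yaml
RUN_FILE = re.compile(r"^drakkar_(\d{8}-\d{6})\.yaml$")


def run_timestamp(filename):
    """Return the datetime encoded in a run metadata filename, or None."""
    match = RUN_FILE.match(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
    except ValueError:
        # Digits matched the pattern but do not form a valid date/time.
        return None


def latest_run(filenames):
    """Pick the most recent run metadata filename from a list of names."""
    dated = [(run_timestamp(name), name) for name in filenames]
    dated = [(ts, name) for ts, name in dated if ts is not None]
    return max(dated)[1] if dated else None
```

Called on the filenames in a directory listing, latest_run returns the name of the newest run file, or None when no name matches the pattern.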

Input formats

You can provide inputs as read directories or as a sample info table.

Directory input

Provide a directory with paired-end reads. DRAKKAR expects matching read-pair names such as *_1.fq.gz and *_2.fq.gz.

$ drakkar preprocessing -i /path/to/reads -o drakkar_output
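
Before pointing -i at a directory, it can help to confirm that every R1 file has a partner. The sketch below is a hypothetical pre-flight check, not DRAKKAR's own logic: it assumes the *_1.fq.gz / *_2.fq.gz naming shown above and does not cover any other suffixes DRAKKAR may accept.

```python
from pathlib import Path


def pair_reads(read_dir):
    """Group *_1.fq.gz / *_2.fq.gz files by their shared sample prefix.

    Returns (pairs, unmatched): pairs maps sample name -> (R1 path, R2 path),
    unmatched lists filenames whose mate is missing.
    """
    read_dir = Path(read_dir)
    pairs, unmatched = {}, []
    for r1 in sorted(read_dir.glob("*_1.fq.gz")):
        sample = r1.name[: -len("_1.fq.gz")]
        r2 = read_dir / f"{sample}_2.fq.gz"
        if r2.exists():
            pairs[sample] = (r1, r2)
        else:
            unmatched.append(r1.name)
    # Catch R2 files that have no corresponding R1.
    for r2 in sorted(read_dir.glob("*_2.fq.gz")):
        sample = r2.name[: -len("_2.fq.gz")]
        if sample not in pairs:
            unmatched.append(r2.name)
    return pairs, unmatched
```

An empty unmatched list suggests the directory is consistently paired under this naming assumption.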

Sample info table (TSV)

A tab-separated table can include any of these columns. Only the columns needed for the chosen workflow are required.

sample: sample name.
rawreads1: path or URL to R1 reads (raw, before preprocessing).
rawreads2: path or URL to R2 reads (raw, before preprocessing).
accession: ENA/SRA paired-end run accession such as ERR4303216 or SRR12345678. Use this instead of rawreads1 and rawreads2 when you want DRAKKAR to download the read pair automatically.
preprocessedreads1: explicit path to quality-filtered R1 reads for use in cataloging. Takes priority over all other read columns. See Cataloging read resolution below.
preprocessedreads2: explicit path to quality-filtered R2 reads for use in cataloging. Must be provided together with preprocessedreads1.
reference_name: host reference label for host-removal workflows.
reference_path: local path or URL to a host FASTA, or to a tarball containing the FASTA plus Bowtie2 index files.
assembly: labels defining assembly groups. The legacy column name coassembly is still accepted.
coverage: labels defining coverage-sharing groups for multicoverage cataloging.

Example (\t marks a tab character):

sample\trawreads1\trawreads2\taccession\treference_name\treference_path\tassembly\tcoverage
sample1\tpath/sample1_1.fq.gz\tpath/sample1_2.fq.gz\t\tref1\tpath/ref1.fna\tassembly1,all\tcoverage1
sample2\t\t\tERR4303216\tref1\tpath/ref1.fna\tassembly2,all\tcoverage2
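
To see how the assembly column could translate into groups, the sketch below reads the example table with Python's csv module and collects samples per label. It assumes the columns are separated by real tab characters and that a comma-separated value such as assembly1,all places the sample in each listed group; that reading is inferred from the example and is not a statement of how DRAKKAR parses the column. The function name assembly_groups is hypothetical.

```python
import csv
import io
from collections import defaultdict

# The example table from this guide, written with real tab characters.
TABLE = (
    "sample\trawreads1\trawreads2\taccession\treference_name\treference_path\tassembly\tcoverage\n"
    "sample1\tpath/sample1_1.fq.gz\tpath/sample1_2.fq.gz\t\tref1\tpath/ref1.fna\tassembly1,all\tcoverage1\n"
    "sample2\t\t\tERR4303216\tref1\tpath/ref1.fna\tassembly2,all\tcoverage2\n"
)


def assembly_groups(handle):
    """Map each assembly label to the list of samples carrying it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        # Fall back to the legacy column name when 'assembly' is absent.
        labels = row.get("assembly") or row.get("coassembly") or ""
        for label in labels.split(","):
            label = label.strip()
            if label:
                groups[label].append(row["sample"])
    return dict(groups)


groups = assembly_groups(io.StringIO(TABLE))
print(groups)
```

Under these assumptions, sample1 and sample2 each get their own group and also share the group labeled all.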

Input notes

Read files can be local paths or remote URLs (http/https/ftp/sftp).
Sample tables can also use an accession column with ENA/SRA paired-end run accessions; DRAKKAR resolves and downloads the matching R1 and R2 FASTQ files automatically.
-r/--reference, -x/--reference-index, and reference_path values can be local files or remote URLs.
Reference inputs may be FASTA files, compressed FASTA files, or tarballs containing a FASTA plus Bowtie2 index files.
Genome lists passed through options such as -B/--bins_file can also use remote URLs; DRAKKAR caches them locally before execution.
Directory-style inputs such as -i/--input and -b/--bins_dir must be local filesystem paths.
Before Snakemake starts, DRAKKAR checks downloaded and local input files for existence and non-zero size. Remote downloads retry up to five times with exponential backoff; sftp:// URLs require curl with SFTP support.
The preferred sample-table column name is assembly. The legacy column name coassembly is still accepted.
Assembly labels can be any identifiers you choose; they do not need to match sample names.
-m individual adds per-sample assemblies alongside grouped assemblies.
-b/--binners selects the binners used in cataloging. Use a comma-separated list of metabat, maxbin, semibin, and comebin; the default is all four.
--multicoverage maps samples sharing the same coverage label to each other’s individual assemblies.
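
The pre-run checks and download retries described in the notes above can be pictured roughly as follows. This is a generic sketch, not DRAKKAR's implementation: the function names, the one-second base delay, the doubling factor, and the reading of "up to five times" as five attempts in total are all assumptions made for illustration.

```python
import os
import time


def fetch_with_backoff(fetch, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits of 1, 2, 4, 8 ... seconds


def check_input(path):
    """Raise if the file is missing or has zero size; otherwise return the path."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    if os.path.getsize(path) == 0:
        raise ValueError(f"empty input file: {path}")
    return path
```

Passing the download step in as a callable keeps the retry logic independent of the transfer tool, which matters here because sftp:// URLs depend on curl rather than on a Python library.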

Cataloging read resolution

When drakkar cataloging (or drakkar complete) loads a sample info table with -f/--file, it resolves the reads to use for assembly and mapping in the following priority order for each sample:

1. preprocessedreads1 / preprocessedreads2 columns: if both are present, the cataloging workflow uses these paths directly. This is the explicit override for cases where preprocessed reads live outside the default output tree.
2. preprocessing/final/<sample>_1.fq.gz: if neither preprocessedreads1 nor preprocessedreads2 is supplied but a prior drakkar preprocessing run has already written quality-filtered reads into the output directory, cataloging detects and uses them automatically. This is the typical case when running cataloging as a follow-up step after preprocessing in the same output directory.
3. rawreads1 / rawreads2 or accession: fallback to the raw inputs. This path is taken when neither preprocessed column is present and no preprocessing/final/ files are found. The assembly then runs directly on unfiltered reads.

This means you can keep a single input table that contains raw read paths (or accessions) together with assembly and coverage grouping columns, and cataloging will automatically pick up the quality-filtered reads from a completed preprocessing run without any changes to the file.
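
The priority order above can be sketched as a small function. This is an illustrative model only, not DRAKKAR's code: resolve_reads and its return labels are hypothetical, the <sample>_2.fq.gz name is assumed by analogy with the documented <sample>_1.fq.gz, and a row with only one preprocessed column simply falls through to the next rule here, which may differ from the real behavior.

```python
from pathlib import Path


def resolve_reads(row, output_dir):
    """Return (source, r1, r2) for one sample-table row, following the guide's order."""
    # 1. Explicit preprocessed columns win when both are filled in.
    pre1, pre2 = row.get("preprocessedreads1"), row.get("preprocessedreads2")
    if pre1 and pre2:
        return ("preprocessed_columns", pre1, pre2)

    # 2. Reads left behind by an earlier preprocessing run in the same output tree.
    final = Path(output_dir) / "preprocessing" / "final"
    r1 = final / f"{row['sample']}_1.fq.gz"
    r2 = final / f"{row['sample']}_2.fq.gz"  # assumed mate name
    if r1.is_file() and r2.is_file():
        return ("preprocessing_output", str(r1), str(r2))

    # 3. Fall back to raw reads, then to an accession to be downloaded.
    if row.get("rawreads1") and row.get("rawreads2"):
        return ("raw", row["rawreads1"], row["rawreads2"])
    if row.get("accession"):
        return ("accession", row["accession"], None)

    raise ValueError(f"no reads found for sample {row['sample']!r}")
```

The same row can therefore resolve differently before and after a preprocessing run, which is what allows a single table to serve both steps.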

Guide map

Use the next pages depending on what you need:

Topic	Where to go next
Running the complete workflow or a specific module	See Workflow guide.
Databases, logging, config, transfer, outputs, and troubleshooting	See Operations and troubleshooting.
Command list only	See CLI Reference.