How to create a workflow 

PodiumASM lets you build a workflow from a single config.yaml configuration file:

First, provide the data paths.
Second, activate the requested tools for assembly and correction.
Third, activate the tools for quality checking of assemblies.
And last, manage the tool parameters.

To create this file, just run:

create_config 

Create config.yaml for run

podiumASM create_config [OPTIONS]

Options

-c, --configyaml <configyaml>: Required Path to create config.yaml

Then, edit the relevant sections of the file to customize your flavor of a workflow.

For FASTQ files, PodiumASM accepts the naming conventions NAME.fastq.gz, NAME.fq.gz, NAME.fastq and NAME.fq. Prefer short names and avoid special characters, which can make the report fail. Please do not use the long names produced directly by the sequencing machine.

All FASTQ files must share the same extension; they may be compressed.

The reference FASTA file must be uncompressed and have a fasta or fa extension.
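The naming rules above can be checked before launching a run. The following is a minimal sketch, not part of PodiumASM: the function names are hypothetical, and it only looks at file names, assuming the four FASTQ extensions and two reference extensions listed above.

```python
from pathlib import Path

# Extensions listed in the text above (assumed to be the complete set).
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")
REFERENCE_EXTENSIONS = (".fasta", ".fa")


def fastq_extension(filename):
    """Return the accepted FASTQ extension of filename, or None."""
    name = Path(filename).name
    for ext in FASTQ_EXTENSIONS:
        if name.endswith(ext):
            return ext
    return None


def check_fastq_names(filenames):
    """Return a list of problems; an empty list means the names look fine."""
    problems = []
    seen = set()
    for filename in filenames:
        ext = fastq_extension(filename)
        if ext is None:
            problems.append(f"{filename}: unsupported extension")
        else:
            seen.add(ext)
    # All FASTQ files are expected to share one extension.
    if len(seen) > 1:
        problems.append(f"mixed extensions: {sorted(seen)}")
    return problems


def is_valid_reference(filename):
    """True when the name ends in .fasta or .fa, so compressed files are rejected."""
    return Path(filename).name.endswith(REFERENCE_EXTENSIONS)
```

For example, `check_fastq_names(["a.fastq.gz", "b.fq"])` reports mixed extensions, and `is_valid_reference("ref.fa.gz")` returns `False` because the file is compressed.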

2. Parameters for some specific tools 

You can manage tool parameters in the params section of the config.yaml file.

Busco specific options:

If BUSCO is activated, you must provide PodiumASM with either the path to a BUSCO database or only the database name (see the BUSCO documentation). This parameter cannot be empty.

The standard parameters used in PodiumASM are shown below. Feel free to adapt them to your own requirements.

TOOLS_PARAM:
    BUSCO_DATABASE: "capnodiales_odb10"
    QUAST : "--fragmented -m 3000"
    BUSCO : ""
    REMOVE_CONTIGS_TRESHOLD: "0.8"
    BWA_MEM: ""
    SAMTOOLS_VIEW: "-bh"
    SAMTOOLS_SORT: ""
    SAMTOOLS_DEPTH: ""
    REPEAT_MASKER: "-no_is -gff -pa 8"
    MINIMAP2_REF: "-ax map-ont"
    SAMTOOLS_VIEW_LONG_READ_REF: ""
    SAMTOOLS_SORT_LONG_READ_REF: ""
    SAMTOOLS_INDEX_LONG_READ: ""
    SNIFFLES: ""
    MINIMAP2_ASSEMBLY: "-ax map-ont"
    SAMTOOLS_VIEW_LONG_READ_ASSEMBLY: ""
    SAMTOOLS_SORT_LONG_READ_ASSEMBLY: ""

Warning

Please check the documentation of each tool (outside of PodiumASM) and make sure that the settings are correct!
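The rule that BUSCO_DATABASE cannot be empty when BUSCO is activated can be expressed as a small check. This is an illustrative sketch only: the function name is hypothetical, the parameters are given as a plain dictionary mirroring TOOLS_PARAM, and how "BUSCO is activated" is stored in config.yaml is not shown here, so it is passed in as a boolean.

```python
def check_busco_database(tools_param, busco_activated):
    """Raise ValueError if BUSCO is on but BUSCO_DATABASE is empty or missing."""
    if not busco_activated:
        return
    # Treat a missing key, None, or a whitespace-only string as empty.
    database = str(tools_param.get("BUSCO_DATABASE") or "").strip()
    if not database:
        raise ValueError("BUSCO is activated but BUSCO_DATABASE is empty")


# Values copied from the TOOLS_PARAM example above.
tools_param = {"BUSCO_DATABASE": "capnodiales_odb10", "BUSCO": ""}
check_busco_database(tools_param, busco_activated=True)  # passes silently
```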

How to run the workflow 

Before attempting to run PodiumASM, please verify that you have already modified the config.yaml file as explained in 1. Providing data.

Warning

Due to a bug in CookieCutter, before attempting to run PodiumASM you have to go to the PodiumASM profile and comment out one line in slurm-submit.py:

cd PodiumASM/podiumASM/default_profile
nano slurm-submit.py

Comment out this line:

If you installed PodiumASM on a HPC cluster with a job scheduler, you can run:

run_cluster 

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters will take precedence over Snakemake ones, which were defined in the profile.
See: https://snakemake.readthedocs.io/en/stable/executing/cli.html
Example:
podiumASM run_cluster -c config.yaml --dry-run --jobs 200

podiumASM run_cluster [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>: Required Configuration file for run tool

-pdf, --pdf

Run snakemake with --dag, --rulegraph and --filegraph

Default: False

Arguments

SNAKEMAKE_OTHER: Optional argument(s)

run_local 

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters will take precedence over Snakemake ones, which were defined in the profile.
See: https://snakemake.readthedocs.io/en/stable/executing/cli.html
Example:
podiumASM run_local -c config.yaml --threads 8 --dry-run
podiumASM run_local -c config.yaml --threads 8 --singularity-args '--bind /mnt:/mnt'

podiumASM run_local [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>: Required Configuration file for run tool

-t, --threads <threads>: Required Number of threads

-p, --pdf: Run snakemake with –dag, –rulegraph and –filegraph

Arguments

SNAKEMAKE_OTHER: Optional argument(s)

Advanced run

Providing more resources 

If the cluster default resources are not sufficient, you can edit the cluster_config.yaml file. See 2. Adapting cluster_config.yaml:

edit_cluster_config

Edit cluster_config.yaml used by profile

podiumASM edit_cluster_config [OPTIONS]

Output of PodiumASM

The PodiumASM output is organised as follows:

OUTPUT_PODIUMASM/
├── 1_FASTA_SORTED
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 2_GENOME_STATS
│   ├── BUSCO
│   │   ├── file_versions.tsv
│   │   ├── lineages
│   │   └── result_busco
│   ├── COVERAGE
│   │   ├── SAMPLE_1
│   │   ├── SAMPLE_2
│   │   └── ...
│   ├── QUAST
│   │   └── REPORT_QUAST
│   ├── STAT_CSV
│   │   ├── SAMPLE_1
│   │   ├── SAMPLE_2
│   │   └── ...
│   └── TAPESTRY
│       ├── SAMPLE_1
│       ├── SAMPLE_2
│       └── ...
├── 3_REPEATMASKER
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 4_STRUCTURAL_VAR
│   ├── csv_variants
│   ├── minimap2
│   └── sniffles
├── 5_FINAL_FASTA
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 6_MAPPING_ILLUMINA
│   ├── BWA_MEM
│   └── STATS
├── 7_ALIGNMENTS
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── LOGS
└── FINAL_REPORT
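After a run, the layout above can be used for a quick sanity check of the output directory. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it expects every top-level directory from the tree to exist, although some of them (for example 6_MAPPING_ILLUMINA) may only be produced when the corresponding tools are activated.

```python
from pathlib import Path

# Top-level entries copied from the tree above.
EXPECTED_DIRECTORIES = (
    "1_FASTA_SORTED",
    "2_GENOME_STATS",
    "3_REPEATMASKER",
    "4_STRUCTURAL_VAR",
    "5_FINAL_FASTA",
    "6_MAPPING_ILLUMINA",
    "7_ALIGNMENTS",
    "LOGS",
    "FINAL_REPORT",
)


def missing_output_directories(output_dir):
    """Return the expected top-level directories absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRECTORIES if not (root / name).is_dir()]
```

An empty list means every expected directory was found; otherwise the returned names point to steps worth checking in LOGS.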

Report 

PodiumASM generates an HTML report that includes the versions of the tools used and, for each FASTQ, a summary of statistics. Please have a look at example … and enjoy!

Important

To visualise the report created by PodiumASM, transfer the folder FINAL_RESULTS to your local computer and open it in any web browser.

Input	Description
LONG_READ	Path to the directory containing the long-read sequence data (fastq.gz format) to be mapped with minimap2.
REFERENCE	Only one REFERENCE genome file is used in each PodiumASM run. This REFERENCE is used for various quality steps (i.e. ASSEMBLYTICS, QUAST).
ASSEMBLY	Provide your assembly files in a single directory.
REPEAT_DATABASE	Provide a single repeat element database for your organism; it is used during the RepeatMasker step to annotate and mask transposable elements (TEs) in the assemblies.
ILLUMINA	True or False: activates the rules that use Illumina short reads.
SHORT_READ	OPTIONAL: path to the directory containing the Illumina sequence data (fastq.gz format); use paired-end data. All FASTQ files must share the same extension. Please use the run1_R1 and run1_R2 nomenclature.
OUTPUT	Path to the output directory.
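
The run1_R1 / run1_R2 nomenclature requested for SHORT_READ can be verified by grouping file names into pairs. This is a minimal sketch under stated assumptions: the function name and the regular expression are hypothetical, and the pattern assumes names of the form NAME_R1 or NAME_R2 followed by one of the accepted FASTQ extensions.

```python
import re
from collections import defaultdict

# Assumed pattern: <name>_R1 or <name>_R2 followed by an accepted FASTQ extension.
PAIR_PATTERN = re.compile(r"^(?P<name>.+)_R(?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$")


def pair_short_reads(filenames):
    """Group file names by sample; return (pairs, unpaired)."""
    mates = defaultdict(dict)
    unpaired = []
    for filename in filenames:
        match = PAIR_PATTERN.match(filename)
        if match is None:
            # Name does not follow the _R1/_R2 nomenclature.
            unpaired.append(filename)
        else:
            mates[match["name"]][match["mate"]] = filename
    pairs = {}
    for name, found in mates.items():
        if set(found) == {"1", "2"}:
            pairs[name] = (found["1"], found["2"])
        else:
            # Only one mate of the pair is present.
            unpaired.extend(found.values())
    return pairs, unpaired
```

Any file returned in `unpaired` either lacks its mate or does not follow the expected naming, and is worth renaming before the run.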