How to create a workflow

PodiumASM allows you to build a workflow using a simple config.yaml configuration file:

  • First, provide the data paths

  • Second, activate the requested tools for assembly and correction.

  • Third, activate the tools for quality checking of assemblies.

  • Finally, set the tool parameters.

To create this file, just run:

create_config

Create config.yaml for run

podiumASM create_config [OPTIONS]

Options

-c, --configyaml <configyaml>

Required Path to create config.yaml

Then, edit the relevant sections of the file to customize your flavor of a workflow.

1. Providing data

First, indicate the data path in the config.yaml configuration file:

DATA:
    REFERENCE: "./ref/Mycfi2_wg_ed.fasta"
    ASSEMBLY: "./CulebrONT_assembly_SUP/"
    OUTPUT: "./FIJIENSIS_ASSEMBLY_FINDER/RESULTS/"
    LONG_READS: "./fastq_guppy6/"
    REPEAT_DB: "./P_Fijiensis-families.fa"
    SHORT_READS: "./illumina/"

The table below describes each input needed to run PodiumASM:

| Input | Description |
|---|---|
| LONG_READ | Path to the directory containing the long-read sequence data (fastq.gz format), which is mapped with minimap2. |
| REFERENCE | Only one REFERENCE genome file is used in each PodiumASM run. This REFERENCE is used for several quality steps (e.g. ASSEMBLYTICS, QUAST). |
| ASSEMBLY | Provide your assembly files in a single directory. |
| REPEAT_DATABASE | Provide a single repeat element database for your organism. It is used during the RepeatMasker step to annotate and mask transposable elements (TEs) in the assemblies. |
| ILLUMINA | True or False. Set to True to activate the rules that use Illumina short reads. |
| SHORT_READ | OPTIONAL: Path to the directory containing Illumina sequence data (fastq.gz format). Paired-end data is expected. All fastq files must share the same extension. Please use the run1_R1 and run1_R2 naming scheme. |
| OUTPUT | Path to the output directory. |
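The paired-end naming scheme mentioned for short reads (run1_R1 and run1_R2) can be checked before a run. The following is a minimal sketch and not part of PodiumASM: the function name `find_unpaired` and the exact pattern (a sample name, then `_R1` or `_R2`, then a FASTQ extension) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: "<sample>_R1<ext>" / "<sample>_R2<ext>", as in run1_R1 / run1_R2.
PAIR_PATTERN = re.compile(
    r"^(?P<sample>.+)_R(?P<mate>[12])(?P<ext>\.(?:fastq|fq)(?:\.gz)?)$"
)


def find_unpaired(filenames):
    """Return the sorted sample names that lack either the R1 or the R2 file."""
    mates = defaultdict(set)
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match:  # files that do not follow the scheme are ignored here
            mates[match.group("sample")].add(match.group("mate"))
    return sorted(s for s, found in mates.items() if found != {"1", "2"})


files = ["run1_R1.fastq.gz", "run1_R2.fastq.gz", "run2_R1.fastq.gz"]
print(find_unpaired(files))  # prints ['run2']
```

An empty list means every sample that follows the scheme has both mates.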

Warning

For FASTQ files, PodiumASM accepts the following naming conventions: NAME.fastq.gz, NAME.fq.gz, NAME.fastq or NAME.fq. Prefer short names and avoid special characters, which can cause the report to fail. Please do not use the long names generated by the sequencing machine.

All fastq files must have the same extension and can be compressed.

The reference FASTA file must be uncompressed and have a fasta or fa extension.
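The extension rules in this warning can be verified with a short pre-flight check. This is a sketch under stated assumptions, not PodiumASM code: `check_fastq_dir` and `fastq_extension` are hypothetical helper names, and only the four extensions listed above are treated as accepted.

```python
from pathlib import Path

# Extensions listed in the warning above; anything else is reported.
ACCEPTED_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def fastq_extension(filename):
    """Return the accepted extension of filename, or None if it has none."""
    for ext in ACCEPTED_EXTENSIONS:
        if filename.endswith(ext):
            return ext
    return None


def check_fastq_dir(directory):
    """Return a list of problems found in a FASTQ directory (empty if none)."""
    problems = []
    seen = set()
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        ext = fastq_extension(path.name)
        if ext is None:
            problems.append(f"{path.name}: extension not accepted")
        else:
            seen.add(ext)
    # All files must share one extension.
    if len(seen) > 1:
        problems.append("mixed extensions: " + ", ".join(sorted(seen)))
    return problems
```

Calling `check_fastq_dir("./fastq_guppy6/")` on the long-read directory from the example configuration would return an empty list when every file uses the same accepted extension.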

2. Parameters for some specific tools

You can manage tool parameters in the TOOLS_PARAM section of the config.yaml file.

Busco specific options:

  • If BUSCO is activated, you must provide PodiumASM with either the path to a Busco database or only the database name (see the Busco documentation). This parameter cannot be empty.

The standard parameters used in PodiumASM are shown below. Feel free to adapt them to your own requirements.

TOOLS_PARAM:
    BUSCO_DATABASE: "capnodiales_odb10"
    QUAST: "--fragmented -m 3000"
    BUSCO : ""
    REMOVE_CONTIGS_TRESHOLD: "0.8"
    BWA_MEM: ""
    SAMTOOLS_VIEW: "-bh"
    SAMTOOLS_SORT: ""
    SAMTOOLS_DEPTH: ""
    REPEAT_MASKER: "-no_is -gff -pa 8"
    MINIMAP2_REF: "-ax map-ont"
    SAMTOOLS_VIEW_LONG_READ_REF: ""
    SAMTOOLS_SORT_LONG_READ_REF: ""
    SAMTOOLS_INDEX_LONG_READ: ""
    SNIFFLES: ""
    MINIMAP2_ASSEMBLY: "-ax map-ont"
    SAMTOOLS_VIEW_LONG_READ_ASSEMBLY: ""
    SAMTOOLS_SORT_LONG_READ_ASSEMBLY: ""

Warning

Please check the documentation of each tool (outside of PodiumASM) and make sure that the settings are correct!
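The BUSCO rule stated in this section (the database cannot be empty when BUSCO is activated) can be checked on the TOOLS_PARAM mapping once the YAML file has been loaded into a dictionary. This is only a sketch: the function name `check_tools_param` and the `busco_enabled` flag are hypothetical, and the check that every value is a string is an assumption drawn from the example above, where all values are quoted.

```python
def check_tools_param(tools_param, busco_enabled):
    """Return a list of problems in a TOOLS_PARAM mapping (empty if none)."""
    problems = []
    database = str(tools_param.get("BUSCO_DATABASE") or "").strip()
    if busco_enabled and not database:
        problems.append("BUSCO_DATABASE cannot be empty when BUSCO is activated")
    # Assumption: every value is a quoted string, as in the example configuration.
    for key, value in tools_param.items():
        if not isinstance(value, str):
            problems.append(
                f"{key}: expected a quoted string, got {type(value).__name__}"
            )
    return problems


params = {"BUSCO_DATABASE": "capnodiales_odb10", "QUAST": "--fragmented -m 3000"}
print(check_tools_param(params, busco_enabled=True))  # prints []
```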


How to run the workflow

Before attempting to run PodiumASM, please verify that you have already modified the config.yaml file as explained in 1. Providing data.

Warning

Due to a bug in CookieCutter, before attempting to run PodiumASM you have to go to the PodiumASM profile directory and comment out one line in slurm-submit.py:

cd PodiumASM/podiumASM/default_profile
nano slurm-submit.py

Comment out this line:

If you installed PodiumASM on an HPC cluster with a job scheduler, you can run:

run_cluster

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters will take precedence over Snakemake ones, which were defined in the profile.
Example:
podiumASM run_cluster -c config.yaml --dry-run --jobs 200
podiumASM run_cluster [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>

Required Configuration file for run tool

-pdf, --pdf

Run snakemake with --dag, --rulegraph and --filegraph

Default:

False

Arguments

SNAKEMAKE_OTHER

Optional argument(s)


run_local

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters using snakemake syntax.
These parameters will take precedence over Snakemake ones, which were defined in the profile.
Example:
podiumASM run_local -c config.yaml --threads 8 --dry-run
podiumASM run_local -c config.yaml --threads 8 --singularity-args '--bind /mnt:/mnt'
podiumASM run_local [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>

Required Configuration file for run tool

-t, --threads <threads>

Required Number of threads

-p, --pdf

Run snakemake with --dag, --rulegraph and --filegraph

Arguments

SNAKEMAKE_OTHER

Optional argument(s)
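When the local run is launched from a script, the command line can be assembled as an argument list. The sketch below is an illustration rather than part of PodiumASM: `build_run_local_command` is a hypothetical helper, and it only uses the options shown in the examples above (`-c`, `--threads`, `--dry-run`, `--singularity-args`).

```python
import shlex


def build_run_local_command(config, threads, dry_run=False, singularity_args=None):
    """Assemble the argument list for `podiumASM run_local`."""
    command = ["podiumASM", "run_local", "-c", str(config), "--threads", str(threads)]
    # Extra Snakemake parameters are appended after the mandatory ones.
    if dry_run:
        command.append("--dry-run")
    if singularity_args:
        command += ["--singularity-args", singularity_args]
    return command


command = build_run_local_command("config.yaml", 8, singularity_args="--bind /mnt:/mnt")
print(shlex.join(command))
# prints: podiumASM run_local -c config.yaml --threads 8 --singularity-args '--bind /mnt:/mnt'
```

The returned list can be passed to `subprocess.run(command, check=True)`. Keeping the arguments in a list avoids quoting problems with values that contain spaces, such as the bind option.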


Advanced run

Providing more resources

If the cluster default resources are not sufficient, you can edit the cluster_config.yaml file. See 2. Adapting cluster_config.yaml:

edit_cluster_config

Edit the cluster_config.yaml used by the profile

podiumASM edit_cluster_config [OPTIONS]

Output of PodiumASM

The PodiumASM output is organized as follows:

OUTPUT_PODIUMASM/
├── 1_FASTA_SORTED
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 2_GENOME_STATS
│   ├── BUSCO
│   │   ├── file_versions.tsv
│   │   ├── lineages
│   │   └── result_busco
│   ├── COVERAGE
│   │   ├── SAMPLE_1
│   │   ├── SAMPLE_2
│   │   └── ...
│   ├── QUAST
│   │   └── REPORT_QUAST
│   ├── STAT_CSV
│   │   ├── SAMPLE_1
│   │   ├── SAMPLE_2
│   │   └── ...
│   └── TAPESTRY
│       ├── SAMPLE_1
│       ├── SAMPLE_2
│       └── ...
├── 3_REPEATMASKER
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 4_STRUCTURAL_VAR
│   ├── csv_variants
│   ├── minimap2
│   └── sniffles
├── 5_FINAL_FASTA
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 6_MAPPING_ILLUMINA
│   ├── BWA_MEM
│   └── STATS
├── 7_ALIGNMENTS
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── LOGS
└── FINAL_REPORT
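After a run, the top-level layout shown above can be compared with what is actually on disk. This is a minimal sketch, with `missing_output_dirs` as a hypothetical helper name. It assumes that every top-level directory in the tree is always created, which may not hold when some tools are deactivated (for example, the Illumina mapping directory when no short reads are provided).

```python
from pathlib import Path

# Top-level directories taken from the output tree above.
EXPECTED_DIRS = (
    "1_FASTA_SORTED",
    "2_GENOME_STATS",
    "3_REPEATMASKER",
    "4_STRUCTURAL_VAR",
    "5_FINAL_FASTA",
    "6_MAPPING_ILLUMINA",
    "7_ALIGNMENTS",
    "LOGS",
    "FINAL_REPORT",
)


def missing_output_dirs(output_dir):
    """Return the expected top-level directories that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

For the example configuration, `missing_output_dirs("./FIJIENSIS_ASSEMBLY_FINDER/RESULTS/")` lists whatever is still missing, in the order of the tree.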

Report

PodiumASM generates an HTML report that includes the versions of the tools used and, for each fastq, a summary of statistics. Please have a look at example … and enjoy!

Important

To view the report created by PodiumASM, transfer the FINAL_RESULTS folder to your local computer and open it in any web browser.