How to create a workflow
PodiumASM allows you to build a workflow using a simple config.yaml configuration file:
First, provide the data paths.
Second, activate the requested tools for assembly and correction.
Third, activate the tools for quality checking of assemblies.
And last, manage the tool parameters.
To create this file, just run:
create_config
Create config.yaml for run
podiumASM create_config [OPTIONS]
Options
- -c, --configyaml <configyaml>
Required Path to create config.yaml
Then, edit the relevant sections of the file to customize your flavor of a workflow.
1. Providing data
First, indicate the data paths in the config.yaml configuration file:
DATA:
    REFERENCE: "./ref/Mycfi2_wg_ed.fasta"
    ASSEMBLY: "./CulebrONT_assembly_SUP/"
    OUTPUT: "./FIJIENSIS_ASSEMBLY_FINDER/RESULTS/"
    LONG_READS: "./fastq_guppy6/"
    REPEAT_DB: "./P_Fijiensis-families.fa"
    SHORT_READS: "./illumina/"
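Before launching a run, it can help to confirm that every path listed under DATA actually exists. The following is a minimal Python sketch and is not part of PodiumASM: the helper name `missing_data_paths` is hypothetical, the dictionary simply mirrors the DATA section above, and OUTPUT is skipped on the assumption that the output directory does not have to exist beforehand.

```python
from pathlib import Path


def missing_data_paths(data, skip=("OUTPUT",)):
    """Return the DATA keys whose path does not exist on disk.

    Keys in `skip` are ignored; skipping OUTPUT assumes the output
    directory does not need to exist before the run (an assumption).
    Empty values are ignored as well.
    """
    missing = []
    for key, value in data.items():
        if key in skip or not value:
            continue
        if not Path(value).exists():
            missing.append(key)
    return missing


# Hypothetical usage: the dictionary mirrors the DATA section of config.yaml.
data = {
    "REFERENCE": "./ref/Mycfi2_wg_ed.fasta",
    "ASSEMBLY": "./CulebrONT_assembly_SUP/",
    "OUTPUT": "./FIJIENSIS_ASSEMBLY_FINDER/RESULTS/",
}
print(missing_data_paths(data))
```

An empty list means every checked path was found.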
The table below describes each input needed to run PodiumASM:
Input | Description |
---|---|
LONG_READ | Path to the directory containing long-read sequence data (fastq.gz format), used for the minimap2 mapping steps. |
REFERENCE | Only one REFERENCE genome file will be used in each PodiumASM run. This REFERENCE will be used for various quality steps (e.g. ASSEMBLYTICS, QUAST). |
ASSEMBLY | Provide your assembly files in a single directory. |
REPEAT_DATABASE | Provide a unique repeat-element database for your organism. It will be used during the RepeatMasker step to annotate and mask transposable elements (TEs) in the assemblies. |
ILLUMINA | True or False, to activate the rules that use Illumina short reads. |
SHORT_READ | OPTIONAL: path to the directory containing Illumina sequence data (fastq.gz format). Use paired-end data. All fastq files must share the same extension. Please use the run1_R1 and run1_R2 nomenclature. |
OUTPUT | Path to the output directory. |
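
The run1_R1 / run1_R2 nomenclature requested for SHORT_READ can be checked before a run. The sketch below is only an illustration, not PodiumASM code: the helper `unpaired_runs` and the regular expression are hypothetical, and they assume that each paired-end file is named `<run>_R1` or `<run>_R2` followed by one of the fastq extensions.

```python
import re

# Matches names such as run1_R1.fastq.gz and captures the run name and the mate.
PAIR_RE = re.compile(r"^(?P<run>.+)_R(?P<mate>[12])\.(fastq|fq)(\.gz)?$")


def unpaired_runs(filenames):
    """Return the sorted run names that lack either their _R1 or _R2 file.

    File names that do not follow the <run>_R1 / <run>_R2 pattern are ignored.
    """
    mates = {}
    for name in filenames:
        match = PAIR_RE.match(name)
        if match:
            mates.setdefault(match.group("run"), set()).add(match.group("mate"))
    return sorted(run for run, found in mates.items() if found != {"1", "2"})


# Hypothetical file listing: run2 has no _R2 mate.
files = ["run1_R1.fastq.gz", "run1_R2.fastq.gz", "run2_R1.fastq.gz"]
print(unpaired_runs(files))
```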
Warning
For FASTQ files, PodiumASM accepts the naming conventions NAME.fastq.gz, NAME.fq.gz, NAME.fastq or NAME.fq. Prefer short names and avoid special characters, which can cause the report to fail. Please do not use the long names generated by the sequencing machine.
All fastq files must share the same extension; they can be compressed.
The reference fasta file needs a fasta or fa extension and must be uncompressed.
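The naming rules in this warning are easy to verify programmatically. Here is a minimal Python sketch, not taken from PodiumASM: the helper names `fastq_extension` and `check_fastq_names` are hypothetical, and the accepted extensions are the four listed above.

```python
# Accepted FASTQ extensions, as listed in the warning above.
ACCEPTED = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def fastq_extension(name):
    """Return the accepted extension that `name` ends with, or None."""
    for ext in ACCEPTED:
        if name.endswith(ext):
            return ext
    return None


def check_fastq_names(names):
    """Return (invalid_names, extensions_found) for a list of file names.

    The set of files is homogeneous when `extensions_found` holds
    at most one extension and `invalid_names` is empty.
    """
    invalid = [n for n in names if fastq_extension(n) is None]
    found = {fastq_extension(n) for n in names} - {None}
    return invalid, found


# Hypothetical file listing mixing two extensions and one invalid name.
invalid, found = check_fastq_names(["a.fastq.gz", "b.fq", "c.txt"])
print(invalid, sorted(found))
```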
2. Parameters for some specific tools
You can manage tool parameters in the TOOLS_PARAM section of the config.yaml file.
BUSCO-specific options:
If BUSCO is activated, you must provide PodiumASM with either the path to a BUSCO database or just the database name (see the BUSCO documentation). This parameter cannot be empty.
The standard parameters used in PodiumASM are shown below. Feel free to adapt them to your own requirements.
TOOLS_PARAM:
    BUSCO_DATABASE: "capnodiales_odb10"
    QUAST: "--fragmented -m 3000"
    BUSCO: ""
    REMOVE_CONTIGS_TRESHOLD: "0.8"
    BWA_MEM: ""
    SAMTOOLS_VIEW: "-bh"
    SAMTOOLS_SORT: ""
    SAMTOOLS_DEPTH: ""
    REPEAT_MASKER: "-no_is -gff -pa 8"
    MINIMAP2_REF: "-ax map-ont"
    SAMTOOLS_VIEW_LONG_READ_REF: ""
    SAMTOOLS_SORT_LONG_READ_REF: ""
    SAMTOOLS_INDEX_LONG_READ: ""
    SNIFFLES: ""
    MINIMAP2_ASSEMBLY: "-ax map-ont"
    SAMTOOLS_VIEW_LONG_READ_ASSEMBLY: ""
    SAMTOOLS_SORT_LONG_READ_ASSEMBLY: ""
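A quick sanity check of these values can catch simple mistakes before the workflow starts. The sketch below is an assumption-laden illustration rather than PodiumASM's own validation: `check_tools_param` is a hypothetical helper, it assumes BUSCO is activated, and it assumes each value is passed to the corresponding tool as a command-line option string, so it only verifies that the string can be tokenised.

```python
import shlex


def check_tools_param(params):
    """Return a list of human-readable problems found in TOOLS_PARAM.

    Assumes BUSCO is activated, so BUSCO_DATABASE must not be empty.
    """
    problems = []
    if not str(params.get("BUSCO_DATABASE", "")).strip():
        problems.append("BUSCO_DATABASE cannot be empty when BUSCO is activated")
    for key, value in params.items():
        try:
            shlex.split(str(value))  # raises ValueError on unbalanced quotes
        except ValueError as err:
            problems.append(f"{key}: {err}")
    return problems


# Hypothetical usage with two of the default values shown above.
params = {"BUSCO_DATABASE": "capnodiales_odb10", "QUAST": "--fragmented -m 3000"}
print(check_tools_param(params))
```

This does not check that the options are valid for each tool; that still requires the tool's own documentation.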
Warning
Please check the documentation of each tool (outside of PodiumASM) and make sure that the settings are correct!
How to run the workflow
Before attempting to run PodiumASM, please verify that you have already modified the config.yaml file as explained in 1. Providing data.
Warning
Due to a bug in CookieCutter, before attempting to run PodiumASM you have to go to the PodiumASM profile directory and comment out one line in slurm-submit.py:
cd PodiumASM/podiumASM/default_profile
nano slurm-submit.py
Comment out this line:
If you installed PodiumASM on a HPC cluster with a job scheduler, you can run:
run_cluster
podiumASM run_cluster [OPTIONS] [SNAKEMAKE_OTHER]...
Options
- -c, --config <config>
Required Configuration file for run tool
- -pdf, --pdf
Run snakemake with --dag, --rulegraph and --filegraph
- Default:
False
Arguments
- SNAKEMAKE_OTHER
Optional argument(s)
run_local
podiumASM run_local [OPTIONS] [SNAKEMAKE_OTHER]...
Options
- -c, --config <config>
Required Configuration file for run tool
- -t, --threads <threads>
Required Number of threads
- -p, --pdf
Run snakemake with --dag, --rulegraph and --filegraph
Arguments
- SNAKEMAKE_OTHER
Optional argument(s)
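When PodiumASM is launched from a script, the command line can be assembled from the options documented above. This is a minimal sketch under stated assumptions: `build_run_command` is a hypothetical helper, it uses only the long options listed in this section (`--config`, `--threads`, `--pdf`), and anything in `snakemake_other` is appended unchanged as SNAKEMAKE_OTHER arguments.

```python
def build_run_command(mode, config, threads=None, pdf=False, snakemake_other=()):
    """Build the argument list for `podiumASM run_local` or `run_cluster`."""
    if mode not in ("run_local", "run_cluster"):
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["podiumASM", mode, "--config", str(config)]
    if mode == "run_local":
        # --threads is documented as required for run_local only.
        if threads is None:
            raise ValueError("run_local requires a number of threads")
        cmd += ["--threads", str(threads)]
    if pdf:
        cmd.append("--pdf")
    cmd += list(snakemake_other)
    return cmd


cmd = build_run_command("run_local", "config.yaml", threads=8)
print(" ".join(cmd))
# To actually launch the run (requires PodiumASM to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when a path contains spaces.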
Advanced run
Providing more resources
If the cluster default resources are not sufficient, you can edit the cluster_config.yaml file. See 2. Adapting cluster_config.yaml:
edit_cluster_config
Edit the cluster_config.yaml used by the profile
podiumASM edit_cluster_config [OPTIONS]
Output on PodiumASM
The architecture of the PodiumASM output is designed as follows:
OUTPUT_PODIUMASM/
├── 1_FASTA_SORTED
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 2_GENOME_STATS
│   ├── BUSCO
│   │   ├── file_versions.tsv
│   │   ├── lineages
│   │   └── result_busco
│   ├── COVERAGE
│   │   ├── SAMPLE_1
│   │   ├── SAMPLE_2
│   │   └── ...
│   ├── QUAST
│   │   └── REPORT_QUAST
│   ├── STAT_CSV
│   │   ├── SAMPLE_1
│   │   ├── SAMPLE_2
│   │   └── ...
│   └── TAPESTRY
│       ├── SAMPLE_1
│       ├── SAMPLE_2
│       └── ...
├── 3_REPEATMASKER
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 4_STRUCTURAL_VAR
│   ├── csv_variants
│   ├── minimap2
│   └── sniffles
├── 5_FINAL_FASTA
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── 6_MAPPING_ILLUMINA
│   ├── BWA_MEM
│   └── STATS
├── 7_ALIGNMENTS
│   ├── SAMPLE_1
│   ├── SAMPLE_2
│   └── ...
├── LOGS
└── FINAL_REPORT
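After a run, the top level of this layout can be compared against what is actually on disk. The following Python sketch is illustrative only: `missing_outputs` is a hypothetical helper, and the expected names are copied from the tree above. Some directories may legitimately be absent depending on which tools were activated (an assumption, for instance the Illumina mapping directory when short reads are not used).

```python
from pathlib import Path

# Top-level entries taken from the output tree above.
EXPECTED = [
    "1_FASTA_SORTED", "2_GENOME_STATS", "3_REPEATMASKER", "4_STRUCTURAL_VAR",
    "5_FINAL_FASTA", "6_MAPPING_ILLUMINA", "7_ALIGNMENTS", "LOGS", "FINAL_REPORT",
]


def missing_outputs(output_dir, expected=EXPECTED):
    """Return the expected top-level directories absent from `output_dir`."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_dir()]


# Hypothetical usage: the path mirrors the OUTPUT value of the DATA section.
print(missing_outputs("./FIJIENSIS_ASSEMBLY_FINDER/RESULTS/"))
```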
Report
PodiumASM generates an HTML report that includes the versions of the tools used and, for each fastq, a summary of statistics. Please have a look at example … and enjoy!
Important
To visualise the report created by PodiumASM, transfer the FINAL_RESULTS folder to your local computer and open it in any web browser.