diff PanExplorer_workflow/README.md @ 1:032f6b3806a3 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:16:08 +0000
parents
children
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/PanExplorer_workflow/README.md	Thu May 30 11:16:08 2024 +0000
@@ -0,0 +1,198 @@
+# PanExplorer_workflow
+
+# About
+
+This is a Snakemake workflow that can be run in the backend of the PanExplorer web application.
+
+**Homepage:** [https://panexplorer.southgreen.fr/](https://panexplorer.southgreen.fr/)
+
+It performs a pan-genome analysis of published, annotated bacterial genomes using one of several tools that can be invoked: Roary, PGAP, PanACoTA.
+
+It provides a presence/absence matrix of genes, an UpSetR diagram summarizing the matrix information, and a COG assignment summary for each strain.
+
+
+## Citation
+
+[https://doi.org/10.1093/bioinformatics/btac504](https://doi.org/10.1093/bioinformatics/btac504)
+
+## Authors
+
+* Alexis Dereeper (IRD)
+
+## Prerequisites - Tool dependencies
+
+When using the Singularity container, the only dependency you need is **singularity**.
+
+The Singularity image (panexplorer.sif) already contains all dependencies required to run the workflow:
+
+- Snakemake
+- Roary
+- PGAP
+- Panaroo
+- PanACoTA
+- Minigraph/Cactus
+- PanGenome Graph Builder (PGGB)
+- ncbi-blast+ (version BLAST 2.4.0+)
+- R (version 4.2.0) and the following packages:
+  - optparse : ``install.packages("optparse")``
+  - dendextend : ``install.packages("dendextend")``
+  - svglite : ``install.packages("svglite")``
+  - heatmaply : ``install.packages("heatmaply")``
+  - gplots : ``install.packages("gplots")``
+  - UpSetR : ``install.packages("UpSetR")``
+
+## Install
+
+1- Git clone
+
+```
+git clone https://github.com/SouthGreenPlatform/PanExplorer_workflow.git
+```
+
+2- Define the PANEX_PATH environment variable
+
+```
+cd PanExplorer_workflow
+export PANEX_PATH=$PWD
+```
+
+3- Get the preformatted RPS-BLAST+ database of the CDD COG distribution
+
+```
+wget https://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz
+tar -xzvf Cog_LE.tar.gz -C $PANEX_PATH/COG
+```
+
+4- Get the singularity container
+
+```
+wget -P $PANEX_PATH/singularity https://panexplorer.southgreen.fr/singularity/panexplorer.sif
+```
+
+
+## Prepare your list of genomes to be analyzed
+
+Edit the configuration file config.yaml to list the GenBank identifiers of completely assembled and annotated genomes.
+```
+#########################################################
+# Complete one of the following input data
+# Remove the other lines if not needed
+#########################################################
+
+# GenBank assembly accessions (GCA, GCF)
+ids:
+  - GCA_001042775.1
+  - GCA_001021915.1
+  - GCA_022406815.1
+
+# Path of genbank files
+input_genbanks:
+  - data/GCA_001518895.1.gb
+  - data/GCA_001746615.1.gb
+  - data/GCA_003382895.1.gb
+
+# Input genomes as fasta and annotation files in GFF format
+# Only applied when using Orthofinder or PGGB workflows, starting from fasta and GFF
+# To be used preferentially for eukaryotes
+input_genomes:
+  "MSU7":
+    "fasta": "/share/banks/Oryza/sativa/japonica/MSU7/all.con"
+    "gff3": "/share/banks/Oryza/sativa/japonica/MSU7/all.gff3"
+    "name": "MSU7"
+  "kitaake":
+    "fasta": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.assembly.fna"
+    "gff3": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.gff3"
+    "name": "kitaake"
+  "nivara":
+    "fasta": "/share/banks/Oryza/nivara/Oryza_nivara.assembly.fna"
+    "gff3": "/share/banks/Oryza/nivara/Oryza_nivara.gff3"
+    "name": "nivara"
+```
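+
+As a minimal sketch, assuming only NCBI assembly accessions are used as input, a trimmed config.yaml could keep the `ids` block and drop the other two blocks, as the header comment of the template suggests. The accessions are the ones from the template above; replace them with your own.
+
+```
+# GenBank assembly accessions (GCA, GCF)
+# Assumption: input_genbanks and input_genomes are removed because they are not needed here
+ids:
+  - GCA_001042775.1
+  - GCA_001021915.1
+  - GCA_022406815.1
+```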
+
+It is best not to mix NCBI genomes with your own annotated genomes, to avoid biases due to differing annotation methods or software. Use a homogeneous annotation procedure for all genomes fed to the workflow.
+
+## Run the workflow
+
+**For prokaryotes**
+
+Creating a pangenome using Roary
+
+```
+singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_roary_heatmap_upset_COG
+```
+
+Creating a pangenome using PanACoTA
+
+```
+singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_panacota_heatmap_upset_COG
+```
+
+Creating a pangenome graph using Minigraph/Cactus and deriving a pangene matrix
+
+```
+singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_cactus_heatmap_upset_COG
+```
+
+Creating a pangenome graph using PanGenome Graph Builder (PGGB) and deriving a pangene matrix
+
+```
+singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_pggb_heatmap_upset_COG
+```
+
+**For eukaryotes**
+
+Creating a pangenome using Orthofinder
+
+```
+singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_orthofinder_heatmap_upset
+```
+
+## Graphical outputs
+
+In all cases, you should obtain a new directory named "outputs" containing all output files.
+
+For a pangenome graph analysis with PGGB, you will obtain visualizations of the graph (generated using ODGI):
+
+* **2D graph visualization** : outputs/pggb_out/all_genomes.fa.lay.draw.png
+
+ <img src="images/all_genomes.fa.lay.draw.png" align="center" width="40%" style="display: block; margin: auto;"/>
+
+* **1D graph visualization** : outputs/pggb_out/all_genomes.fa.og.viz_multiqc.png
+
+ <img src="images/all_genomes.fa.og.viz_multiqc.png" align="center" width="90%" style="display: block; margin: auto;"/>
+
+
+In all cases, the outputs directory also includes:
+
+* **ANI (Average Nucleotide Identity)** : outputs/fastani.out.svg
+
+A heatmap generated from distances computed from the ANI values.
+ANI values are calculated using the FastANI software.
+
+ <img src="images/fastani.out.svg" align="center" width="90%" style="display: block; margin: auto;"/>
+ 
+* **Presence/absence matrix of accessory genes**: outputs/heatmap.svg.complete.new.svg
+
+Both gene clusters and samples are ordered by hierarchical clustering.
+ 
+ <img src="images/heatmap.svg.complete.new.svg" align="center" width="90%" style="display: block; margin: auto;"/>
+
+* **Upset plot**: outputs/upsetr.svg
+
+An UpSet plot is an alternative to the Venn diagram for dealing with more than 3 sets.
+The total size of each set is represented on the left barplot.
+Every possible intersection is represented in the bottom plot, and its number of occurrences is shown on the top barplot.
+Each column corresponds to a possible intersection: the filled-in cells show which sets are part of that intersection.
+
+ <img src="images/upsetr.svg" align="center" width="90%" style="display: block; margin: auto;"/>
+
+* **Rarefaction curve**: outputs/rarefaction_curves.svg
+
+The rarefaction curve (computed with the micropan R package) shows the cumulative number of gene clusters observed as more and more genomes are considered.
+
+ <img src="images/rarefaction_curves.svg" align="center" width="70%" style="display: block; margin: auto;"/>
+ 
+
+## License
+
+GNU General Public License v3 (GPLv3)