# PanExplorer_workflow

# About

This workflow is a Snakemake workflow that can be run in the backend of the PanExplorer web application.

**Homepage:** [https://panexplorer.southgreen.fr/](https://panexplorer.southgreen.fr/)

It performs a pan-genome analysis of published, annotated bacterial genomes, using one of several tools: Roary, PGAP, PanACoTA.

It provides a presence/absence matrix of genes, an UpSetR diagram summarizing the matrix information, and a COG assignment summary for each strain.

## Citation

[https://doi.org/10.1093/bioinformatics/btac504](https://doi.org/10.1093/bioinformatics/btac504)

## Authors

* Alexis Dereeper (IRD)

## Prerequisites - Tool dependencies

When using the Singularity container, the only dependency you need is **singularity**.

The Singularity image (panexplorer.sif) already contains all dependencies required to run the workflow:

- Snakemake
- Roary
- PGAP
- Panaroo
- PanACoTA
- Minigraph/Cactus
- PanGenome Graph Builder (PGGB)
- ncbi-blast+ (version BLAST 2.4.0+)
- R (version 4.2.0) and the following packages:
  - optparse : ``install.packages("optparse")``
  - dendextend : ``install.packages("dendextend")``
  - svglite : ``install.packages("svglite")``
  - heatmaply : ``install.packages("heatmaply")``
  - gplots : ``install.packages("gplots")``
  - UpSetR : ``install.packages("UpSetR")``

## Install

1- Git clone

```
git clone https://github.com/SouthGreenPlatform/PanExplorer_workflow.git
```

2- Define the PANEX_PATH environment variable

```
cd PanExplorer_workflow
export PANEX_PATH=$PWD
```

3- Get the preformatted RPS-BLAST+ database of the CDD COG distribution

```
wget https://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz
tar -xzvf Cog_LE.tar.gz -C $PANEX_PATH/COG
```

4- Get the Singularity container

```
wget -P $PANEX_PATH/singularity https://panexplorer.southgreen.fr/singularity/panexplorer.sif
```

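Before launching a run, it can help to confirm that the install steps above left the expected files in place. The following is a minimal sketch, not part of the repository: `check_panex_install` is a hypothetical helper name, and it only checks the two locations written by steps 3 and 4 (`$PANEX_PATH/COG` and `$PANEX_PATH/singularity/panexplorer.sif`). It makes no assumption about the file names inside the COG archive.

```shell
# Hypothetical helper (not shipped with the workflow): report which
# install steps look incomplete. Returns 0 only if everything is found.
check_panex_install() {
    # Use the first argument if given, otherwise fall back to PANEX_PATH
    local root="${1:-${PANEX_PATH:-}}"
    local status=0

    if [ -z "$root" ]; then
        echo "PANEX_PATH is not set" >&2
        return 1
    fi

    # Step 4: the container image
    if [ ! -f "$root/singularity/panexplorer.sif" ]; then
        echo "missing: $root/singularity/panexplorer.sif" >&2
        status=1
    fi

    # Step 3: the directory that receives the extracted COG database
    if [ ! -d "$root/COG" ] || [ -z "$(ls -A "$root/COG" 2>/dev/null)" ]; then
        echo "missing or empty: $root/COG" >&2
        status=1
    fi

    return "$status"
}
```

With `PANEX_PATH` exported as in step 2, `check_panex_install && echo "install looks complete"` prints the message only when both locations are populated.
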
## Prepare your list of genomes to be analyzed

Edit the configuration file config.yaml to list the GenBank identifiers of completely assembled and annotated genomes.
```
#########################################################
# Complete one of the following input data
# Remove the other lines if not needed
#########################################################

# GenBank assembly accessions (GCA, GCF)
ids:
  - GCA_001042775.1
  - GCA_001021915.1
  - GCA_022406815.1

# Path of genbank files
input_genbanks:
  - data/GCA_001518895.1.gb
  - data/GCA_001746615.1.gb
  - data/GCA_003382895.1.gb

# Input genomes as fasta and annotation files in GFF format
# Only applied when using Orthofinder or PGGB workflows, starting from fasta and GFF
# To be used preferentially for eukaryotes
input_genomes:
  "MSU7":
    "fasta": "/share/banks/Oryza/sativa/japonica/MSU7/all.con"
    "gff3": "/share/banks/Oryza/sativa/japonica/MSU7/all.gff3"
    "name": "MSU7"
  "kitaake":
    "fasta": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.assembly.fna"
    "gff3": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.gff3"
    "name": "kitaake"
  "nivara":
    "fasta": "/share/banks/Oryza/nivara/Oryza_nivara.assembly.fna"
    "gff3": "/share/banks/Oryza/nivara/Oryza_nivara.gff3"
    "name": "nivara"
```

It's best not to mix NCBI genomes with your own annotated genomes, to avoid biases due to the annotation method or software. Keep a homogeneous annotation procedure to feed the workflow.

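For a long list of genomes, the `ids:` block of config.yaml can be produced from a plain text file instead of being typed by hand. This is a small sketch under stated assumptions: `make_ids_block` is a hypothetical function name, the input file (one accession per line) is an assumed format, and lines that do not look like `GCA_`/`GCF_` assembly accessions are silently skipped.

```shell
# Hypothetical helper: turn a text file of assembly accessions
# (one per line) into the "ids:" block used in config.yaml.
make_ids_block() {
    echo "ids:"
    # Keep only lines shaped like GCA_<digits>.<version> or GCF_<digits>.<version>
    grep -E '^GC[AF]_[0-9]+\.[0-9]+$' "$1" | while read -r acc; do
        echo "  - $acc"
    done
}
```

For example, `make_ids_block accessions.txt` prints the block to standard output (`accessions.txt` is a placeholder name), so it can be reviewed before being pasted into config.yaml.
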
## Run the workflow

**For prokaryotes**

Creating a pangenome using Roary

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_roary_heatmap_upset_COG
```

Creating a pangenome using PanACoTA

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_panacota_heatmap_upset_COG
```

Creating a pangenome graph using Minigraph/Cactus and the derived pan-gene matrix

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_cactus_heatmap_upset_COG
```

Creating a pangenome graph using PanGenome Graph Builder (PGGB) and the derived pan-gene matrix

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_pggb_heatmap_upset_COG
```

**For eukaryotes**

Creating a pangenome using Orthofinder

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_orthofinder_heatmap_upset
```

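The five commands above differ only in the Snakefile name and, if you choose to change it, the value passed to `--cores`. As an illustration, the sketch below builds the command line from a tool name. `panex_command` is a hypothetical wrapper, not something provided by the repository. It only prints the command so it can be inspected before being run.

```shell
# Hypothetical wrapper: map a tool name to the Snakefile used in the
# commands above and print the resulting command (nothing is executed).
panex_command() {
    local tool="$1" cores="${2:-1}" snakefile

    if [ -z "${PANEX_PATH:-}" ]; then
        echo "PANEX_PATH is not set" >&2
        return 1
    fi

    case "$tool" in
        roary|panacota|cactus|pggb)
            snakefile="Snakefile_wget_${tool}_heatmap_upset_COG" ;;
        orthofinder)
            snakefile="Snakefile_orthofinder_heatmap_upset" ;;
        *)
            echo "unknown tool: $tool" >&2
            return 1 ;;
    esac

    echo "singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores $cores -s $PANEX_PATH/Snakemake_files/$snakefile"
}
```

For instance, `panex_command roary 4` prints the Roary command with `--cores 4`. Appending Snakemake's `-n` (dry-run) flag to the printed command is one way to preview the planned jobs without running them.
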
## Graphical outputs

In all cases, you should see a new directory named "outputs" containing all output files.

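A quick way to confirm that a run finished is to look for the four graphical files that this section describes as produced in all cases. The sketch below assumes exactly those file names. `list_missing_outputs` is a hypothetical helper that prints any file that is absent or empty and returns a non-zero status if at least one is missing. It takes the output directory as an optional argument, defaulting to `outputs`.

```shell
# Hypothetical helper: list the expected graphical outputs that are
# missing or empty in the given directory (default: ./outputs).
list_missing_outputs() {
    local dir="${1:-outputs}" f missing=0
    for f in fastani.out.svg \
             heatmap.svg.complete.new.svg \
             upsetr.svg \
             rarefaction_curves.svg; do
        # -s is true only for files that exist and are not empty
        if [ ! -s "$dir/$f" ]; then
            echo "$dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```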

In the case of a pangenome graph analysis with PGGB, you will also obtain visualizations of the graph (using ODGI):

* **2D graph visualization** : outputs/pggb_out/all_genomes.fa.lay.draw.png

<img src="images/all_genomes.fa.lay.draw.png" align="center" width="40%" style="display: block; margin: auto;"/>

* **1D graph visualization** : outputs/pggb_out/all_genomes.fa.og.viz_multiqc.png

<img src="images/all_genomes.fa.og.viz_multiqc.png" align="center" width="90%" style="display: block; margin: auto;"/>


In all cases, the outputs also include:

* **ANI (Average Nucleotide Identity)** : outputs/fastani.out.svg

The heatmap is generated from distances calculated from the ANI values.
ANI values are calculated using the FastANI software.

<img src="images/fastani.out.svg" align="center" width="90%" style="display: block; margin: auto;"/>

* **Presence/absence matrix of accessory genes**: outputs/heatmap.svg.complete.new.svg

Both gene clusters and samples are ordered using hierarchical clustering.

<img src="images/heatmap.svg.complete.new.svg" align="center" width="90%" style="display: block; margin: auto;"/>

* **Upset plot**: outputs/upsetr.svg

An UpSet plot is an alternative to the Venn diagram for dealing with more than 3 sets.
The total size of each set is shown in the left bar plot.
Every possible intersection is represented in the bottom plot, and its occurrence is shown in the top bar plot.
Each column of the bottom plot corresponds to one intersection: the filled-in cells show which sets are part of that intersection.

<img src="images/upsetr.svg" align="center" width="90%" style="display: block; margin: auto;"/>

* **Rarefaction curve**: outputs/rarefaction_curves.svg

The rarefaction curve (computed by the micropan R package) shows the cumulative number of gene clusters observed as more and more genomes are considered.

<img src="images/rarefaction_curves.svg" align="center" width="70%" style="display: block; margin: auto;"/>

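To make the reading of the UpSet plot concrete, the sketch below counts how many gene clusters share each exact presence pattern in a presence/absence matrix, which is the kind of number the top bar plot displays. The input format is an assumption made for this illustration, not the workflow's actual matrix file: a tab-separated table whose header holds a cluster column followed by strain names, with cells set to 0 or 1. `count_intersections` is a hypothetical name.

```shell
# Hypothetical helper: count gene clusters per exact presence pattern.
# Assumed input: tab-separated, header = cluster name then strain names,
# one row per gene cluster, cells = 0 (absent) or 1 (present).
count_intersections() {
    awk -F'\t' '
        # Header row: remember the strain name of each column
        NR == 1 { for (i = 2; i <= NF; i++) strain[i] = $i; next }
        {
            # Build a key such as "A&B" from the strains that carry the cluster
            key = ""
            for (i = 2; i <= NF; i++)
                if ($i == 1) key = key (key == "" ? "" : "&") strain[i]
            if (key != "") n[key]++
        }
        # One line per pattern: count, then the strains involved
        END { for (k in n) print n[k] "\t" k }
    ' "$1" | sort -k1,1nr -k2,2
}
```

Each output line gives a count followed by the strains sharing those clusters, largest counts first. A pattern naming every strain corresponds to the core genome, and a pattern naming a single strain corresponds to strain-specific clusters.
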
## License

GNU General Public License v3 (GPLv3)