comparison PanExplorer_workflow/README.md @ 1:032f6b3806a3 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:16:08 +0000
0:3cbb01081cde 1:032f6b3806a3
1 # PanExplorer_workflow
2
3 # About
4
5 This workflow is a Snakemake workflow that can be run in the backend of the PanExplorer web application.
6
7 **Homepage:** [https://panexplorer.southgreen.fr/](https://panexplorer.southgreen.fr/)
8
9 It performs a pan-genome analysis of published, annotated bacterial genomes using one of several tools: Roary, PGAP, or PanACoTA.
10
11 It provides a gene presence/absence matrix, an UpSetR diagram summarizing the matrix information, and a COG assignment summary for each strain.
12
13
14 ## Citation
15
16 [https://doi.org/10.1093/bioinformatics/btac504](https://doi.org/10.1093/bioinformatics/btac504)
17
18 ## Authors
19
20 * Alexis Dereeper (IRD)
21
22 ## Prerequisites - Tool dependencies
23
24 When using the Singularity container, the only dependency you need is **singularity**.
25
26 The Singularity image (panexplorer.sif) already contains all dependencies required to run the workflow:
27
28 - Snakemake
29 - Roary
30 - PGAP
31 - Panaroo
32 - PanACoTA
33 - Minigraph/cactus
34 - PanGenome Graph Builder (PGGB)
35 - ncbi-blast+ (version BLAST 2.4.0+)
36 - R (version 4.2.0) and the following packages:
37   - optparse : ``install.packages("optparse")``
38   - dendextend : ``install.packages("dendextend")``
39   - svglite : ``install.packages("svglite")``
40   - heatmaply : ``install.packages("heatmaply")``
41   - gplots : ``install.packages("gplots")``
42   - UpSetR : ``install.packages("UpSetR")``
43
44 ## Install
45
46 1- Git clone
47
48 ```
49 git clone https://github.com/SouthGreenPlatform/PanExplorer_workflow.git
50 ```
51
52 2- Define the PANEX_PATH environment variable
53
54 ```
55 cd PanExplorer_workflow
56 export PANEX_PATH=$PWD
57 ```
58
59 3- Get preformatted RPS-BLAST+ database of the CDD COG distribution
60
61 ```
62 wget https://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz
63 tar -xzvf Cog_LE.tar.gz -C $PANEX_PATH/COG
64 ```
65
66 4- Get the singularity container
67
68 ```
69 wget -P $PANEX_PATH/singularity https://panexplorer.southgreen.fr/singularity/panexplorer.sif
70 ```
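Before running anything, it can help to confirm that the four install steps left the expected files in place. The following is only a sketch: `check_panex_install` is a hypothetical helper that is not part of the workflow, and it assumes the layout implied by the commands above (a `COG` directory, a `Snakemake_files` directory and `singularity/panexplorer.sif` under `$PANEX_PATH`).

```shell
# Hypothetical helper (not shipped with PanExplorer_workflow).
# Reports which expected pieces of the install are missing under the given root.
# Assumed layout: <root>/COG, <root>/Snakemake_files, <root>/singularity/panexplorer.sif
check_panex_install() {
    root="$1"
    missing=0
    if [ ! -d "$root/COG" ]; then
        echo "missing: $root/COG (COG database directory)"
        missing=1
    fi
    if [ ! -d "$root/Snakemake_files" ]; then
        echo "missing: $root/Snakemake_files"
        missing=1
    fi
    if [ ! -f "$root/singularity/panexplorer.sif" ]; then
        echo "missing: $root/singularity/panexplorer.sif"
        missing=1
    fi
    # Return 0 when everything was found, 1 otherwise
    return "$missing"
}
```

Typical use would be `check_panex_install "$PANEX_PATH" && echo "install looks complete"`. The check only looks at file presence; it does not verify the contents of the COG database or the image.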
71
72
73 ## Prepare your list of genomes to be analyzed
74
75 Edit the configuration file config.yaml to list the GenBank identifiers of completely assembled and annotated genomes.
76 ```
77 #########################################################
78 # Complete one of the following input data
79 # Remove the other lines if not needed
80 #########################################################
81
82 # GenBank/RefSeq assembly accessions (GCA, GCF)
83 ids:
84 - GCA_001042775.1
85 - GCA_001021915.1
86 - GCA_022406815.1
87
88 # Path of genbank files
89 input_genbanks:
90 - data/GCA_001518895.1.gb
91 - data/GCA_001746615.1.gb
92 - data/GCA_003382895.1.gb
93
94 # Input genomes as fasta and annotation files in GFF format
95 # Only applied when using Orthofinder or PGGB workflows, starting from fasta and GFF
96 # To be used preferentially for eukaryotes
97 input_genomes:
98   "MSU7":
99     "fasta": "/share/banks/Oryza/sativa/japonica/MSU7/all.con"
100     "gff3": "/share/banks/Oryza/sativa/japonica/MSU7/all.gff3"
101     "name": "MSU7"
102   "kitaake":
103     "fasta": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.assembly.fna"
104     "gff3": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.gff3"
105     "name": "kitaake"
106   "nivara":
107     "fasta": "/share/banks/Oryza/nivara/Oryza_nivara.assembly.fna"
108     "gff3": "/share/banks/Oryza/nivara/Oryza_nivara.gff3"
109     "name": "nivara"
110 ```
111
112 It's best not to mix NCBI genomes with your own annotated genomes, to avoid biases due to the annotation method or software. Keep a homogeneous annotation procedure to feed the workflow.
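A quick sanity check of the `ids:` entries can catch typos before any download starts. The sketch below is not part of the workflow: `is_assembly_accession` and `list_config_ids` are hypothetical helpers, and the parser assumes the simple layout shown above (an `ids:` key followed by one `- accession` item per line), not arbitrary YAML.

```shell
# Hypothetical helpers (not shipped with PanExplorer_workflow).

# Succeeds if the argument looks like an NCBI assembly accession:
# GCA_ or GCF_, nine digits, a dot, then a version number.
is_assembly_accession() {
    printf '%s\n' "$1" | grep -Eq '^GC[AF]_[0-9]{9}\.[0-9]+$'
}

# Prints the items listed under the top-level "ids:" key of a config file.
# Assumes one "- value" item per line; blank lines and comments are skipped,
# and the list ends at the next top-level key.
list_config_ids() {
    awk '
        /^ids:/                { in_ids = 1; next }
        in_ids && /^[ \t]*-/   { sub(/^[ \t]*-[ \t]*/, ""); print; next }
        in_ids && /^[^ \t#]/   { in_ids = 0 }
    ' "$1"
}
```

The two helpers can be combined to flag suspicious entries:

```shell
list_config_ids config.yaml | while read -r id; do
    is_assembly_accession "$id" || echo "suspicious accession: $id"
done
```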
113
114 ## Run the workflow
115
116 **For prokaryotes**
117
118 Creating a pangenome using Roary
119
120 ```
121 singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_roary_heatmap_upset_COG
122 ```
123
124 Creating a pangenome using PanACoTA
125
126 ```
127 singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_panacota_heatmap_upset_COG
128 ```
129
130 Creating a pangenome graph using Minigraph/Cactus and derived pangenes matrix
131
132 ```
133 singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_cactus_heatmap_upset_COG
134 ```
135
136 Creating a pangenome graph using PanGenome Graph Builder (PGGB) and derived pangenes matrix
137
138 ```
139 singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_pggb_heatmap_upset_COG
140 ```
141
142 **For eukaryotes**
143
144 Creating a pangenome using OrthoFinder
145
146 ```
147 singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_orthofinder_heatmap_upset
148 ```
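The commands above differ only in the Snakefile passed to `-s`, so the choice can be wrapped in a small function. This is a sketch under assumptions: `panex_snakefile` and `panex_command` are hypothetical names, the short tool keywords (`roary`, `panacota`, `cactus`, `pggb`, `orthofinder`) are invented for illustration, and only the Snakefile names themselves come from the commands listed in this section.

```shell
# Hypothetical helpers (not shipped with PanExplorer_workflow).

# Maps a short tool keyword to the Snakefile path used in the commands above.
panex_snakefile() {
    case "$1" in
        roary)       name="Snakefile_wget_roary_heatmap_upset_COG" ;;
        panacota)    name="Snakefile_wget_panacota_heatmap_upset_COG" ;;
        cactus)      name="Snakefile_wget_cactus_heatmap_upset_COG" ;;
        pggb)        name="Snakefile_wget_pggb_heatmap_upset_COG" ;;
        orthofinder) name="Snakefile_orthofinder_heatmap_upset" ;;
        *)           echo "unknown tool: $1" >&2; return 1 ;;
    esac
    printf '%s/Snakemake_files/%s\n' "$PANEX_PATH" "$name"
}

# Prints (does not run) the full command for a tool and an optional core count.
panex_command() {
    snakefile=$(panex_snakefile "$1") || return 1
    cores="${2:-1}"
    printf 'singularity exec %s/singularity/panexplorer.sif snakemake --cores %s -s %s\n' \
        "$PANEX_PATH" "$cores" "$snakefile"
}
```

For example, `panex_command roary 4` prints the Roary command with `--cores 4`, which can then be reviewed and pasted into the terminal. Printing the command instead of executing it keeps the helper safe to try.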
149
150 ## Graphical outputs
151
152 In all cases, you should obtain a new directory named "outputs" containing all output files.
153
154 In case of a pangenome graph analysis with PGGB, you will obtain visualizations of the graph (using ODGI)
155
156 * **2D graph visualization** : outputs/pggb_out/all_genomes.fa.lay.draw.png
157
158 <img src="images/all_genomes.fa.lay.draw.png" align="center" width="40%" style="display: block; margin: auto;"/>
159
160 * **1D graph visualization** : outputs/pggb_out/all_genomes.fa.og.viz_multiqc.png
161
162 <img src="images/all_genomes.fa.og.viz_multiqc.png" align="center" width="90%" style="display: block; margin: auto;"/>
163
164
165 In all cases, the outputs also include:
166
167 * **ANI (Average Nucleotide Identity)** : outputs/fastani.out.svg
168
169 The heatmap is generated from distances calculated from the ANI values.
170 ANI values are calculated using the FastANI software.
171
172 <img src="images/fastani.out.svg" align="center" width="90%" style="display: block; margin: auto;"/>
173
174 * **Presence/absence matrix of accessory genes**: outputs/heatmap.svg.complete.new.svg
175
176 Both gene clusters and samples are ordered by hierarchical clustering.
177
178 <img src="images/heatmap.svg.complete.new.svg" align="center" width="90%" style="display: block; margin: auto;"/>
179
180 * **Upset plot**: outputs/upsetr.svg
181
182 An UpSet plot is an alternative to the Venn diagram for visualizing more than three sets.
183 The total size of each set is shown in the left bar plot.
184 Every possible intersection is represented in the bottom plot, and its occurrence count is shown in the top bar plot.
185 Each column of the bottom plot corresponds to one intersection: the filled-in cells show which sets take part in it.
186
187 <img src="images/upsetr.svg" align="center" width="90%" style="display: block; margin: auto;"/>
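To make the link between the presence/absence matrix and the UpSet bars concrete, the intersection counts can be recomputed by hand. The sketch below is illustrative only: `upset_counts` is a hypothetical helper, and it assumes a tab-separated matrix with a header row of strain names, one gene cluster per row, and `1`/`0` cells. The actual matrix file written by the workflow may use a different layout.

```shell
# Hypothetical helper (not shipped with PanExplorer_workflow).
# Input: tab-separated matrix; header = cluster label + strain names,
# then one row per gene cluster with 1 (present) or 0 (absent) per strain.
# Output: "count<TAB>strainA&strainB..." for each observed combination,
# largest count first. Each output line corresponds to one bar of the top bar plot.
upset_counts() {
    awk -F '\t' '
        NR == 1 { for (i = 2; i <= NF; i++) strain[i] = $i; next }
        {
            combo = ""
            for (i = 2; i <= NF; i++)
                if ($i == 1) combo = combo (combo == "" ? "" : "&") strain[i]
            if (combo != "") count[combo]++
        }
        END { for (c in count) printf "%d\t%s\n", count[c], c }
    ' "$1" | sort -k1,1nr -k2,2
}
```

Each gene cluster is counted in exactly one combination (the exact set of strains that carry it), which is how UpSet intersections are defined by default.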
188
189 * **Rarefaction curve**: outputs/rarefaction_curves.svg
190
191 The rarefaction curve (computed with the micropan R package) shows the cumulative number of gene clusters observed as more genomes are considered.
192
193 <img src="images/rarefaction_curves.svg" align="center" width="70%" style="display: block; margin: auto;"/>
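The idea behind the curve can be illustrated with a single pass over a presence/absence matrix. This is a simplified sketch, not the micropan computation: `cumulative_clusters` is a hypothetical helper, it assumes a tab-separated matrix with strain names in the header and `1`/`0` cells, and it adds genomes in column order only, whereas rarefaction curves are usually built from random genome orderings.

```shell
# Hypothetical helper (not shipped with PanExplorer_workflow).
# Input: tab-separated matrix; header = cluster label + strain names,
# then one row per gene cluster with 1 (present) or 0 (absent) per strain.
# Output: "strain<TAB>cumulative number of clusters" when genomes are
# added one at a time in column order.
cumulative_clusters() {
    awk -F '\t' '
        NR == 1 { n = NF; for (i = 2; i <= NF; i++) strain[i] = $i; next }
        {
            # Credit each cluster to the first genome in which it appears
            for (i = 2; i <= NF; i++)
                if ($i == 1) { first_seen[i]++; break }
        }
        END {
            total = 0
            for (i = 2; i <= n; i++) {
                total += first_seen[i]
                printf "%s\t%d\n", strain[i], total
            }
        }
    ' "$1"
}
```

A curve that keeps rising as genomes are added suggests an open pangenome, while a curve that flattens suggests that most gene clusters have already been sampled.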
194
195
196 ## License
197
198 GNU General Public License v3 (GPLv3)