annotate PanExplorer_workflow/README.md @ 1:032f6b3806a3 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:16:08 +0000
# PanExplorer_workflow

# About

This workflow is a Snakemake workflow that can be run in the backend of the PanExplorer web application.

**Homepage:** [https://panexplorer.southgreen.fr/](https://panexplorer.southgreen.fr/)

It performs a pan-genome analysis of published and annotated bacterial genomes, using one of several tools: Roary, PGAP, PanACoTA.

It provides a presence/absence matrix of genes, an UpSetR diagram summarizing the matrix information, and a COG assignment summary for each strain.

## Citation

[https://doi.org/10.1093/bioinformatics/btac504](https://doi.org/10.1093/bioinformatics/btac504)

## Authors

* Alexis Dereeper (IRD)

## Prerequisites - Tool dependencies

When using the Singularity container, the only dependency you need to install is **singularity**.

The Singularity image (panexplorer.sif) already contains all dependencies required for running the workflow:

- Snakemake
- Roary
- PGAP
- Panaroo
- PanACoTA
- Minigraph/Cactus
- PanGenome Graph Builder (PGGB)
- ncbi-blast+ (version BLAST 2.4.0+)
- R (version 4.2.0) and the following packages:
  - optparse : ``install.packages("optparse")``
  - dendextend : ``install.packages("dendextend")``
  - svglite : ``install.packages("svglite")``
  - heatmaply : ``install.packages("heatmaply")``
  - gplots : ``install.packages("gplots")``
  - UpSetR : ``install.packages("UpSetR")``
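
Since `singularity` is the only host-side dependency, it can be worth confirming that it is on the `PATH` before going further. The sketch below is only an illustration: the function names `have_cmd` and `check_singularity` are hypothetical helpers, not part of the workflow.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given command is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: report whether the singularity executable is available.
check_singularity() {
    if have_cmd singularity; then
        echo "singularity found: $(command -v singularity)"
    else
        echo "singularity not found on PATH" >&2
        return 1
    fi
}
```

After sourcing these definitions, calling `check_singularity` prints the location of the executable or returns a non-zero status.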

## Install

1- Git clone

```
git clone https://github.com/SouthGreenPlatform/PanExplorer_workflow.git
```

2- Define the PANEX_PATH environment variable

```
cd PanExplorer_workflow
export PANEX_PATH=$PWD
```

3- Get the preformatted RPS-BLAST+ database of the CDD COG distribution

```
wget https://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz
tar -xzvf Cog_LE.tar.gz -C $PANEX_PATH/COG
```

4- Get the singularity container

```
wget -P $PANEX_PATH/singularity https://panexplorer.southgreen.fr/singularity/panexplorer.sif
```
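
The four steps above leave a few files and directories under `$PANEX_PATH`. The following sketch checks for them; it is only a convenience, assuming the layout implied by the commands in this README (`COG/`, `singularity/panexplorer.sif`, `Snakemake_files/`). The function name `panex_check_install` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify the layout produced by install steps 1-4.
# Prints one message per missing item and returns non-zero if any is missing.
panex_check_install() {
    rc=0
    if [ -z "${PANEX_PATH:-}" ]; then
        echo "PANEX_PATH is not set" >&2
        return 1
    fi
    # Directory into which step 3 extracts the COG database.
    if [ ! -d "$PANEX_PATH/COG" ]; then
        echo "missing directory: $PANEX_PATH/COG" >&2
        rc=1
    fi
    # Container image downloaded in step 4.
    if [ ! -f "$PANEX_PATH/singularity/panexplorer.sif" ]; then
        echo "missing image: $PANEX_PATH/singularity/panexplorer.sif" >&2
        rc=1
    fi
    # Snakefiles referenced by the run commands of this README.
    if [ ! -d "$PANEX_PATH/Snakemake_files" ]; then
        echo "missing directory: $PANEX_PATH/Snakemake_files" >&2
        rc=1
    fi
    return "$rc"
}
```

Running `panex_check_install` in the shell where `PANEX_PATH` was exported prints nothing and returns 0 when everything is in place.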

## Prepare your list of genomes to be analyzed

Edit the configuration file config.yaml to list the GenBank identifiers of complete, assembled and annotated genomes.

```
#########################################################
# Complete one of the following input data
# Remove the other lines if not needed
#########################################################

# GenBank assembly accessions (GCA, GCF)
ids:
  - GCA_001042775.1
  - GCA_001021915.1
  - GCA_022406815.1

# Paths of GenBank files
input_genbanks:
  - data/GCA_001518895.1.gb
  - data/GCA_001746615.1.gb
  - data/GCA_003382895.1.gb

# Input genomes as FASTA and annotation files in GFF format
# Only applied when using Orthofinder or PGGB workflows, starting from FASTA and GFF
# To be used preferentially for eukaryotes
input_genomes:
  "MSU7":
    "fasta": "/share/banks/Oryza/sativa/japonica/MSU7/all.con"
    "gff3": "/share/banks/Oryza/sativa/japonica/MSU7/all.gff3"
    "name": "MSU7"
  "kitaake":
    "fasta": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.assembly.fna"
    "gff3": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.gff3"
    "name": "kitaake"
  "nivara":
    "fasta": "/share/banks/Oryza/nivara/Oryza_nivara.assembly.fna"
    "gff3": "/share/banks/Oryza/nivara/Oryza_nivara.gff3"
    "name": "nivara"
```

It's best not to mix NCBI genomes with your own annotated genomes, to avoid biases due to the annotation method or software. Keep a homogeneous annotation procedure for all genomes fed to the workflow.
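
A mistyped accession under `ids:` is easy to miss. The sketch below checks that a string has the usual shape of a versioned assembly accession (`GCA_` or `GCF_`, nine digits, a dot, and a version number). It only checks the format, not whether the accession exists at NCBI, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 looks like a versioned assembly
# accession such as GCA_001042775.1 (format check only).
is_assembly_accession() {
    printf '%s\n' "$1" | grep -Eq '^GC[AF]_[0-9]{9}\.[0-9]+$'
}

# Hypothetical helper: read candidate accessions on stdin (one per line)
# and print those that do not match the expected format.
list_bad_accessions() {
    while IFS= read -r acc; do
        [ -z "$acc" ] && continue
        is_assembly_accession "$acc" || printf '%s\n' "$acc"
    done
}
```

For example, `printf '%s\n' GCA_001042775.1 GCA_00104 | list_bad_accessions` prints only the malformed second entry.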

## Run the workflow

**For prokaryotes**

Creating a pangenome using Roary

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_roary_heatmap_upset_COG
```

Creating a pangenome using PanACoTA

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_panacota_heatmap_upset_COG
```

Creating a pangenome graph and the derived pangene matrix using Minigraph/Cactus

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_cactus_heatmap_upset_COG
```

Creating a pangenome graph and the derived pangene matrix using PanGenome Graph Builder (PGGB)

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_pggb_heatmap_upset_COG
```

**For eukaryotes**

Creating a pangenome using Orthofinder

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_orthofinder_heatmap_upset
```
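
The commands above differ only in the Snakefile passed to `-s`, so they can be wrapped in a small helper. This is a sketch rather than part of the workflow: the function names and the tool keywords (`roary`, `panacota`, `cactus`, `pggb`, `orthofinder`) are hypothetical, while the Snakefile names are those listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the Snakefile path for a tool keyword.
panex_snakefile() {
    case "$1" in
        roary)       name=Snakefile_wget_roary_heatmap_upset_COG ;;
        panacota)    name=Snakefile_wget_panacota_heatmap_upset_COG ;;
        cactus)      name=Snakefile_wget_cactus_heatmap_upset_COG ;;
        pggb)        name=Snakefile_wget_pggb_heatmap_upset_COG ;;
        orthofinder) name=Snakefile_orthofinder_heatmap_upset ;;
        *) echo "unknown tool: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "$PANEX_PATH/Snakemake_files/$name"
}

# Hypothetical helper: run the chosen workflow inside the container.
# $1 = tool keyword, $2 = number of cores (defaults to 1, as in the README).
panex_run() {
    snakefile=$(panex_snakefile "$1") || return 1
    singularity exec "$PANEX_PATH/singularity/panexplorer.sif" \
        snakemake --cores "${2:-1}" -s "$snakefile"
}
```

With these definitions, `panex_run roary` is equivalent to the first command of this section, and `panex_run pggb 4` would request four cores instead of one.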

## Graphical outputs

In all cases, you should obtain a new directory named "outputs" containing all output files.

For a pangenome graph analysis with PGGB, you will obtain visualizations of the graph (produced with ODGI):

* **2D graph visualization** : outputs/pggb_out/all_genomes.fa.lay.draw.png

<img src="images/all_genomes.fa.lay.draw.png" align="center" width="40%" style="display: block; margin: auto;"/>

* **1D graph visualization** : outputs/pggb_out/all_genomes.fa.og.viz_multiqc.png

<img src="images/all_genomes.fa.og.viz_multiqc.png" align="center" width="90%" style="display: block; margin: auto;"/>

In all cases, the outputs also include:

* **ANI (Average Nucleotide Identity)** : outputs/fastani.out.svg

A heatmap generated from distances calculated from the ANI values.
ANI values are calculated using the FastANI software.

<img src="images/fastani.out.svg" align="center" width="90%" style="display: block; margin: auto;"/>

* **Presence/absence matrix of accessory genes**: outputs/heatmap.svg.complete.new.svg

Both gene clusters and samples are ordered by hierarchical clustering.

<img src="images/heatmap.svg.complete.new.svg" align="center" width="90%" style="display: block; margin: auto;"/>

* **Upset plot**: outputs/upsetr.svg

An Upset plot is an alternative to the Venn diagram for dealing with more than 3 sets.
The total size of each set is represented on the left barplot.
Every possible intersection is represented by the bottom plot, and its number of occurrences is shown on the top barplot.
Each column corresponds to a possible intersection: the filled-in cells show which sets are part of an intersection.

<img src="images/upsetr.svg" align="center" width="90%" style="display: block; margin: auto;"/>
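
To make the counts behind an Upset plot concrete, the sketch below tallies gene clusters by the exact combination of genomes that contain them. It assumes each cluster is counted once, under its exact presence pattern, and it assumes a hypothetical tab-separated input (a header with a cluster id column followed by genome names, then one row of 0/1 values per cluster). The workflow's own matrix file may be laid out differently.

```shell
#!/bin/sh
# Hypothetical helper: count gene clusters per exact genome combination.
# stdin : tab-separated presence/absence matrix (assumed layout, see above)
# stdout: "<count><TAB><genome1&genome2...>", largest count first
count_intersections() {
    awk -F '\t' '
        # Header row: remember the genome name of each column.
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
        {
            # Build the list of genomes in which this cluster is present.
            combo = ""
            for (i = 2; i <= NF; i++)
                if ($i + 0 > 0) combo = combo (combo == "" ? "" : "&") name[i]
            if (combo != "") count[combo]++
        }
        END { for (c in count) printf "%d\t%s\n", count[c], c }
    ' | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Each output line corresponds to one bar of the top barplot: the count is the bar height, and the genome list is the set of filled-in cells.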

* **Rarefaction curve**: outputs/rarefaction_curves.svg

The rarefaction curve (computed with the micropan R package) shows the cumulative number of gene clusters observed as more and more genomes are considered.

<img src="images/rarefaction_curves.svg" align="center" width="70%" style="display: block; margin: auto;"/>
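
The idea of a cumulative cluster count can be illustrated with a few lines of awk. This is a simplified sketch, not the micropan computation: it uses a single fixed genome order (the column order), whereas a rarefaction analysis typically considers several genome orderings. The tab-separated 0/1 input layout and the function name are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: for a fixed genome order (the column order), print
# how many distinct gene clusters have been seen after adding 1, 2, ... genomes.
# stdin : tab-separated matrix, header = cluster id + genome names, cells = 0/1
# stdout: "<number of genomes><TAB><cumulative number of clusters>"
cumulative_clusters() {
    awk -F '\t' '
        NR == 1 { n = NF; next }
        {
            # first[i] counts clusters whose first occurrence is in column i.
            for (i = 2; i <= NF; i++)
                if ($i + 0 > 0) { first[i]++; break }
        }
        END {
            total = 0
            for (i = 2; i <= n; i++) {
                total += first[i]
                printf "%d\t%d\n", i - 1, total
            }
        }
    '
}
```

A curve that keeps rising as genomes are added suggests that new gene clusters are still being discovered, while a plateau suggests that most clusters have already been observed.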

## License

GNU General Public License v3 (GPLv3)