comparison PanExplorer_workflow/README.md @ 1:032f6b3806a3 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:16:08 +0000 |
# PanExplorer_workflow

## About

This is a Snakemake workflow that can be run in the backend of the PanExplorer web application.

**Homepage:** [https://panexplorer.southgreen.fr/](https://panexplorer.southgreen.fr/)

It performs a pan-genome analysis of published, annotated bacterial genomes using one of several tools: Roary, PGAP, or PanACoTA.

It provides a presence/absence matrix of genes, an UpSetR diagram summarizing the matrix, and a summary of COG assignments for each strain.


## Citation

[https://doi.org/10.1093/bioinformatics/btac504](https://doi.org/10.1093/bioinformatics/btac504)

## Authors

* Alexis Dereeper (IRD)

## Prerequisites - Tool dependencies

When using the singularity container, the only dependency you need is **singularity**.

The singularity image (panexplorer.sif) already contains all dependencies required to run the workflow:

- Snakemake
- Roary
- PGAP
- Panaroo
- PanACoTA
- Minigraph/Cactus
- PanGenome Graph Builder (PGGB)
- ncbi-blast+ (version BLAST 2.4.0+)
- R (version 4.2.0) and the following packages:
  - optparse : ``install.packages("optparse")``
  - dendextend : ``install.packages("dendextend")``
  - svglite : ``install.packages("svglite")``
  - heatmaply : ``install.packages("heatmaply")``
  - gplots : ``install.packages("gplots")``
  - UpSetR : ``install.packages("UpSetR")``

## Install

1- Git clone

```
git clone https://github.com/SouthGreenPlatform/PanExplorer_workflow.git
```

2- Define the PANEX_PATH environment variable

```
cd PanExplorer_workflow
export PANEX_PATH=$PWD
```

3- Get the preformatted RPS-BLAST+ database of the CDD COG distribution

```
wget https://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz
tar -xzvf Cog_LE.tar.gz -C $PANEX_PATH/COG
```

4- Get the singularity container

```
wget -P $PANEX_PATH/singularity https://panexplorer.southgreen.fr/singularity/panexplorer.sif
```
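
Before moving on, it can help to confirm that the four install steps left everything in place. The function below is a minimal sketch, not part of the PanExplorer repository: the name `check_panex_install` is hypothetical, and it assumes only the paths used in the steps above (`$PANEX_PATH/COG` and `$PANEX_PATH/singularity/panexplorer.sif`).

```shell
# Hypothetical helper, not shipped with PanExplorer.
# Usage: check_panex_install [install_dir]   (defaults to $PANEX_PATH)
check_panex_install() {
    panex_root=${1:-${PANEX_PATH:-}}
    panex_missing=0

    # Step 2: PANEX_PATH must point at the cloned repository
    if [ -z "$panex_root" ] || [ ! -d "$panex_root" ]; then
        echo "PANEX_PATH is unset or is not a directory" >&2
        return 1
    fi
    # Step 3: the COG directory should contain the extracted database files
    if [ -z "$(ls -A "$panex_root/COG" 2>/dev/null)" ]; then
        echo "missing or empty: $panex_root/COG" >&2
        panex_missing=1
    fi
    # Step 4: the container image should exist and be non-empty
    if [ ! -s "$panex_root/singularity/panexplorer.sif" ]; then
        echo "missing or empty: $panex_root/singularity/panexplorer.sif" >&2
        panex_missing=1
    fi
    # Prerequisite: the singularity executable must be on PATH
    if ! command -v singularity >/dev/null 2>&1; then
        echo "missing: singularity executable on PATH" >&2
        panex_missing=1
    fi
    return "$panex_missing"
}
```

Calling `check_panex_install` with no argument checks the directory named by `PANEX_PATH`; an exit status of zero means every check passed, and each problem found is reported on standard error.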


## Prepare your list of genomes to be analyzed

Edit the configuration file config.yaml to list the GenBank identifiers of completely assembled and annotated genomes.

```
#########################################################
# Fill in one of the following input sections
# Remove the other lines if not needed
#########################################################

# GenBank assembly accessions (GCA, GCF)
ids:
  - GCA_001042775.1
  - GCA_001021915.1
  - GCA_022406815.1

# Paths of GenBank files
input_genbanks:
  - data/GCA_001518895.1.gb
  - data/GCA_001746615.1.gb
  - data/GCA_003382895.1.gb

# Input genomes as fasta and annotation files in GFF format
# Only applied when using Orthofinder or PGGB workflows, starting from fasta and GFF
# To be used preferentially for eukaryotes
input_genomes:
  "MSU7":
    "fasta": "/share/banks/Oryza/sativa/japonica/MSU7/all.con"
    "gff3": "/share/banks/Oryza/sativa/japonica/MSU7/all.gff3"
    "name": "MSU7"
  "kitaake":
    "fasta": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.assembly.fna"
    "gff3": "/share/banks/Oryza/sativa/japonica/kitaake/Oryza_sativa_japonica_Kitaake.gff3"
    "name": "kitaake"
  "nivara":
    "fasta": "/share/banks/Oryza/nivara/Oryza_nivara.assembly.fna"
    "gff3": "/share/banks/Oryza/nivara/Oryza_nivara.gff3"
    "name": "nivara"
```

It is best not to mix NCBI genomes with your own annotated genomes, to avoid biases due to the annotation method or software. Keep a homogeneous annotation procedure for all genomes fed to the workflow.
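
A typo in an accession is easier to catch before the workflow starts downloading genomes. The sketch below is one possible check, not something PanExplorer provides: the function names are hypothetical, it assumes the `ids:` entries are written one per line as `- ACCESSION` under a top-level `ids:` key (as in the example config), and it tests each entry against the usual NCBI assembly accession shape (`GCA_` or `GCF_`, nine digits, a dot, and a version number).

```shell
# Hypothetical helpers, not shipped with PanExplorer.

# Print the entries of the top-level "ids:" list, one per line.
list_config_ids() {
    awk '
        /^ids:[ \t]*$/ { in_ids = 1; next }   # the ids block starts here
        /^[^ \t#-]/    { in_ids = 0 }         # any other top-level key ends it
        in_ids && /^[ \t]*-[ \t]*/ {
            sub(/^[ \t]*-[ \t]*/, "")         # drop the list marker
            sub(/[ \t]+$/, "")                # drop trailing whitespace
            print
        }
    ' "$1"
}

# Succeed only if every listed id looks like GCA_#########.# or GCF_#########.#
check_config_ids() {
    bad_ids=$(list_config_ids "$1" | grep -Ev '^GC[AF]_[0-9]{9}\.[0-9]+$' || true)
    if [ -n "$bad_ids" ]; then
        echo "$bad_ids" | sed 's/^/unexpected accession format: /' >&2
        return 1
    fi
}
```

For example, `check_config_ids config.yaml` returns a non-zero status and names the offending entries if any accession does not match. This is a plain text scan rather than a YAML parser, so it only understands the simple layout shown above.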

## Run the workflow

**For prokaryotes**

Creating a pangenome using Roary

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_roary_heatmap_upset_COG
```

Creating a pangenome using PanACoTA

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_panacota_heatmap_upset_COG
```

Creating a pangenome graph using Minigraph/Cactus and the derived pangene matrix

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_cactus_heatmap_upset_COG
```

Creating a pangenome graph using PanGenome Graph Builder (PGGB) and the derived pangene matrix

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_wget_pggb_heatmap_upset_COG
```

**For eukaryotes**

Creating a pangenome using Orthofinder

```
singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores 1 -s $PANEX_PATH/Snakemake_files/Snakefile_orthofinder_heatmap_upset
```
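
The five commands above differ only in the Snakefile name, so a small wrapper can assemble them. This is a sketch under stated assumptions: the function names and the tool keywords (`roary`, `panacota`, `cactus`, `pggb`, `orthofinder`) are our own labels, while the Snakefile names and the command layout are copied from the commands above. The wrapper only prints the command; it does not run it.

```shell
# Hypothetical helpers, not shipped with PanExplorer.

# Map a tool keyword (our own naming) to the Snakefile used above.
panex_snakefile() {
    case "$1" in
        roary)       echo "Snakefile_wget_roary_heatmap_upset_COG" ;;
        panacota)    echo "Snakefile_wget_panacota_heatmap_upset_COG" ;;
        cactus)      echo "Snakefile_wget_cactus_heatmap_upset_COG" ;;
        pggb)        echo "Snakefile_wget_pggb_heatmap_upset_COG" ;;
        orthofinder) echo "Snakefile_orthofinder_heatmap_upset" ;;
        *)           echo "unknown tool: $1" >&2; return 1 ;;
    esac
}

# Print the full command for a tool and an optional core count (default 1).
panex_command() {
    panex_file=$(panex_snakefile "$1") || return 1
    echo "singularity exec $PANEX_PATH/singularity/panexplorer.sif snakemake --cores ${2:-1} -s $PANEX_PATH/Snakemake_files/$panex_file"
}
```

For instance, `panex_command pggb 8` prints the PGGB command with `--cores 8`, which can then be reviewed and pasted into the terminal. Appending Snakemake's `-n` option to the printed command requests a dry run that lists the planned jobs without executing them.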

## Graphical outputs

In all cases, you should obtain a new directory named "outputs" containing all output files.

For a pangenome graph analysis with PGGB, you will also obtain visualizations of the graph (generated with ODGI):

* **2D graph visualization** : outputs/pggb_out/all_genomes.fa.lay.draw.png

<img src="images/all_genomes.fa.lay.draw.png" align="center" width="40%" style="display: block; margin: auto;"/>

* **1D graph visualization** : outputs/pggb_out/all_genomes.fa.og.viz_multiqc.png

<img src="images/all_genomes.fa.og.viz_multiqc.png" align="center" width="90%" style="display: block; margin: auto;"/>


In all cases, the outputs directory also includes:

* **ANI (Average Nucleotide Identity)** : outputs/fastani.out.svg

A heatmap generated from distances calculated from the ANI values.
ANI values are calculated using the FastANI software.

<img src="images/fastani.out.svg" align="center" width="90%" style="display: block; margin: auto;"/>

* **Presence/absence matrix of accessory genes**: outputs/heatmap.svg.complete.new.svg

Both gene clusters and samples are ordered by hierarchical clustering.

<img src="images/heatmap.svg.complete.new.svg" align="center" width="90%" style="display: block; margin: auto;"/>

* **Upset plot**: outputs/upsetr.svg

An UpSet plot is an alternative to the Venn diagram for dealing with more than 3 sets.
The total size of each set is shown in the left barplot.
Every possible intersection is represented in the bottom plot, and the number of occurrences of each is shown in the top barplot.
Each row corresponds to a set: the filled-in cells show which sets are part of a given intersection.

<img src="images/upsetr.svg" align="center" width="90%" style="display: block; margin: auto;"/>

* **Rarefaction curve**: outputs/rarefaction_curves.svg

The rarefaction curve (computed by the micropan R package) is the cumulative number of gene clusters observed as more and more genomes are considered.

<img src="images/rarefaction_curves.svg" align="center" width="70%" style="display: block; margin: auto;"/>
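
After a run, the files listed above can be checked in one pass. The function below is a minimal sketch with a hypothetical name; it assumes only the output paths named in this section and treats the two PGGB images as optional, checked when `pggb` is passed as the second argument.

```shell
# Hypothetical helper, not shipped with PanExplorer.
# Usage: check_panex_outputs [outputs_dir] [pggb]
check_panex_outputs() {
    out_dir=${1:-outputs}
    out_missing=0
    # Files described above as produced in all cases
    out_files="fastani.out.svg heatmap.svg.complete.new.svg upsetr.svg rarefaction_curves.svg"
    # Graph images described above for PGGB runs only
    if [ "${2:-}" = "pggb" ]; then
        out_files="$out_files pggb_out/all_genomes.fa.lay.draw.png pggb_out/all_genomes.fa.og.viz_multiqc.png"
    fi
    for out_file in $out_files; do
        if [ ! -s "$out_dir/$out_file" ]; then
            echo "missing or empty: $out_dir/$out_file" >&2
            out_missing=1
        fi
    done
    return "$out_missing"
}
```

Run `check_panex_outputs` from the directory where the workflow was launched, or `check_panex_outputs outputs pggb` after a PGGB run. A non-zero exit status means at least one expected figure was not written.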


## License

GNU General Public License v3 (GPLv3)