comparison README.txt @ 0:69e8f12c8b31 draft
"planemo upload"
author | bioit_sciensano |
---|---|
date | Fri, 11 Mar 2022 15:06:20 +0000 |
parents | |
children |
-1:000000000000 | 0:69e8f12c8b31 |
---|---|
1 PROGRAM | |
2 ======= | |
3 | |
4 PhageTerm.py - run as command line in a shell | |
5 | |
6 | |
7 VERSION | |
8 ======= | |
9 | |
10 Version 4.0.0 | |
11 Compatible with python 3.7 | |
12 | |
13 | |
14 INTRODUCTION | |
15 ============ | |
16 | |
17 PhageTermVirome software is a tool to determine phage genome termini and genome packaging mode on single phage or multiple contigs at once. | |
18 The software uses phage and virome sequencing reads obtained from libraries prepared with DNA fragmented randomly (e.g. Covaris fragmentation, | |
19 and library preparation using Illumina TruSeq). Phage or virome sequencing reads (fastq files) are aligned to the assembled phage genome or assembled | |
20 virome (fasta or multifasta files) in order to calculate two types of coverage values (whole genome coverage and the Starting Position Coverage (SPC)). The starting position coverage is used to perform a detailed termini and packaging mode analysis. | |
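To make the two coverage notions above concrete, here is a minimal sketch. It is not PhageTerm's actual implementation: it assumes the reads are already aligned and given as hypothetical (start, length) pairs on the forward strand of a linear genome, and the function name coverage_profiles is purely illustrative.

```python
def coverage_profiles(genome_length, alignments):
    """Return (whole_coverage, starting_position_coverage) as two lists.

    alignments: iterable of (start, read_length) pairs, 0-based,
    forward strand only (an assumption made for this sketch).
    """
    whole = [0] * genome_length
    spc = [0] * genome_length
    for start, read_length in alignments:
        # Starting Position Coverage: count only where the read begins.
        spc[start] += 1
        # Whole genome coverage: count every position the read spans.
        end = min(start + read_length, genome_length)
        for pos in range(start, end):
            whole[pos] += 1
    return whole, spc


# Three toy reads on a 10 bp genome; two of them start at position 2.
whole, spc = coverage_profiles(10, [(2, 4), (2, 3), (5, 4)])
print(whole)
print(spc)
```

In this toy example, position 2 has a starting position coverage equal to its whole coverage, which is the kind of signal a termini analysis would look at more closely.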
21 | |
22 Mu-type phage analysis: available if the user suspects the phage genome to be of the Mu-like type (single phage genome analysis only, not possible with a multifasta file). | |
23 The user can also provide the host (bacterial) genome sequence. The Mu-type phage analysis takes the reads that do not match the phage | |
24 genome and aligns them to the bacterial genome using the same mapping function. The analysis to identify Mu-like phages is available only when a single phage genome is provided (not possible if the user provides a multifasta file with multiple assembled phage contigs). | |
25 | |
26 | |
27 The previous PhageTerm program (single phage analysis only) is still available at https://sourceforge.net/projects/phageterm/ (for versions <3.0.0) | |
28 | |
29 | |
30 A Galaxy wrapper is also available for the previous version at https://galaxy.pasteur.fr (only for the first version of PhageTerm; | |
31 PhageTermVirome is not implemented on Galaxy yet). | |
32 | |
33 Since version 3.0.0, PhageTerm can work in 2 modes: | |
34 - the usual mono machine mode (parallelization on several cores on the same machine). | |
35 - a new multi machine mode (advanced users) with parallelization on several machines, using intermediate files for data exchange. | |
36 | |
37 The default mode is mono machine. | |
38 Versions from 3.0.0 up to (but not including) 4.0 work with python 2.7 | |
39 | |
40 Since version 4.0, PhageTerm (now PhageTermVirome) works with python 3.7 | |
41 | |
42 | |
43 PREREQUISITES | |
44 ============= | |
45 | |
46 | |
47 For version 4.0 | |
48 | |
49 Unix/Linux | |
50 | |
51 - backports | |
52 - backports.functools_lru_cache | |
53 - backports_abc | |
54 - cycler | |
55 - libwebp-base | |
56 - lz4-c | |
57 - matplotlib-base | |
58 - matplotlib | |
59 - numpy | |
60 - openssl | |
61 - pandas | |
62 - patsy | |
63 - pillow | |
64 - pip | |
65 - pyparsing | |
66 - python=3.7 | |
67 - python-dateutil | |
68 - python_abi | |
69 - pytz | |
70 - readline | |
71 - reportlab | |
72 - scikit-learn | |
73 - scipy | |
74 - setuptools | |
75 - singledispatch | |
76 - statsmodels | |
77 - tk | |
78 - tornado | |
79 | |
80 A conda virtualenv containing python3.7 and all dependencies is provided for convenience, so that users | |
81 don't need to install anything other than miniconda or conda (see below). | |
82 | |
83 | |
84 FOR IMPATIENT USERS : INSTALLING PHAGETERMVIROME USING THE CONDA VIRTUALENV (easiest option) | |
85 ============================================================================================ | |
86 | |
87 First install miniconda if you don't have it already (you don't even need to have python 2.7 or python 3.7 installed on your machine for that since | |
88 miniconda contains it): https://docs.conda.io/en/latest/miniconda.html | |
89 | |
90 Download and decompress/extract the PhageTermVirome directory available at https://gitlab.pasteur.fr/vlegrand/ptv. | |
91 | |
92 Then go into the PTV directory and create the conda environment using the PhageTerm_env_3.yml file for version >=4.0 (python3) | |
93 | |
94 $ conda env create -f PhageTerm_env_3.yml | |
95 | |
96 Then activate the environment so you can launch PhageTermVirome: | |
97 | |
98 $ conda activate PhageTerm_env_py3 | |
99 | |
100 | |
101 NOTE: | |
102 | |
103 You can still use the old PhageTerm under python 2.7 (but no multifasta analysis is possible) using the miniconda environment from the PhageTerm_env.yml file for version<4.0 (python2), with the following commands: | |
104 | |
105 $ conda env create -f PhageTerm_env.yml | |
106 | |
107 $ conda activate PhageTerm_env | |
108 | |
109 | |
110 | |
111 COMMAND LINE USAGE | |
112 ================== | |
113 | |
114 Basic usage with mandatory options (PhageTermVirome needs at least one read file, but the user can provide a second, corresponding paired-end read file if available, using the -p option). | |
115 | |
116 ./PhageTerm.py -f reads.fastq -r phage_sequence(s).fasta | |
117 | |
118 | |
119 Help: | |
120 | |
121 ./PhageTerm.py -h | |
122 ./PhageTerm.py --help | |
123 | |
124 | |
125 After installation, we recommend performing a software test run using any of the following values: | |
126 -t TEST_VALUE, --test=TEST_VALUE | |
127 TEST_VALUE=C5 : Test run for a 5' cohesive end (e.g. Lambda) | |
128 TEST_VALUE=C3 : Test run for a 3' cohesive end (e.g. HK97) | |
129 TEST_VALUE=DS : Test run for a short Direct Terminal Repeats end (e.g. T7) | |
130 TEST_VALUE=DL : Test run for a long Direct Terminal Repeats end (e.g. T5) | |
131 TEST_VALUE=H : Test run for a Headful packaging (e.g. P1) | |
132 TEST_VALUE=M : Test run for a Mu-like packaging (e.g. Mu) | |
133 | |
134 | |
135 Non-mandatory options | |
136 | |
137 [-p reads_paired -c nbr_core_threads --report_title name_to_write_on_report_outputs -s seed_lenght -d surrounding -g host.fasta -l contig_size_limit_multi-fasta -v virome_run_time_estimation] | |
138 | |
139 | |
140 Additional advanced options (only for multi-machine users) | |
141 | |
142 | |
143 [--mm --dir_cov_mm path_to_coverage_results -c nb_cores --core_id idx_core -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta] | |
144 [--mm --dir_cov_mm path_to_coverage_results --dir_seq_mm path_to_sequence_results --DR_path path_to_results --seq_id index_of_sequence --nb_pieces nbr_of_read_chunks -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta] [--mm --DR_path path_to_results --dir_seq_mm path_to_sequence_results -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta] | |
145 | |
146 | |
147 | |
148 | |
149 Detailed options: | |
150 | |
151 | |
152 Raw reads file in fastq format: | |
153 -f INPUT_FILE, --fastq=INPUT_FILE | |
154 Fastq reads | |
155 (NGS sequences from random fragmentation DNA only, | |
156 e.g. Illumina TruSeq) | |
157 | |
158 Phage genome(s) in fasta format: | |
159 -r INPUT_FILE, --ref=INPUT_FILE | |
160 Reference phage genome(s) as unique contig in fasta format | |
161 | |
162 | |
163 | |
164 Other options common to both modes: | |
165 | |
166 Paired raw reads file in fastq format: | |
167 -p INPUT_FILE, --paired=INPUT_FILE | |
168 Paired fastq reads | |
169 (NGS sequences from random fragmentation DNA only, | |
170 e.g. Illumina TruSeq) | |
171 | |
172 Analysis_name to write on output reports: | |
173 --report_title USER_REPORT_NAME, --report_title=REPORT_NAME | |
174 Manually enter the name you want to have on your report outputs. | |
175 Used as prefix for output files. | |
176 | |
177 Length of the seed used for reads in the mapping process: | |
178 -s SEED_LENGHT, --seed=SEED_LENGHT | |
179 Manually enter the length of the seed used for reads | |
180 in the mapping process (Default: 20). | |
181 | |
182 Number of nucleotides around the main peak to consider for merging adjacent significant peaks (set to 1 to discover secondary terminus sites): | |
183 -d SUROUNDING_LENGHT, --surrounding=SUROUNDING_LENGHT | |
184 Manually enter the length of the surrounding region used to | |
185 merge close peaks in the analysis process (Default: 20). | |
186 | |
187 Host genome in fasta format (option available only for analysis with a single phage genome): | |
188 -g INPUT_FILE, --host=INPUT_FILE | |
189 Genome of reference host (bacterial genome) in fasta format | |
190 Warning: drastically increases processing time | |
191 This option can be used only when analyzing a single phage genome (not available for virome contigs as multifasta) | |
192 | |
193 Define phage mean coverage: | |
194 -m MEAN_NBR, --mean=MEAN_NBR | |
195 Phage mean coverage to use (Default: 250). | |
196 | |
197 Define minimum phage contig length: | |
198 -l LIMIT_FASTA, --limit=LIMIT_FASTA | |
199 Minimum phage fasta length (Default: 500). | |
200 | |
201 | |
202 Options for mono machine (default) mode only | |
203 | |
204 Software run test: | |
205 -t TEST_VALUE, --test=TEST_VALUE | |
206 TEST_VALUE=C5 : Test run for a 5' cohesive end (e.g. Lambda) | |
207 TEST_VALUE=C3 : Test run for a 3' cohesive end (e.g. HK97) | |
208 TEST_VALUE=DS : Test run for a short Direct Terminal Repeats end (e.g. T7) | |
209 TEST_VALUE=DL : Test run for a long Direct Terminal Repeats end (e.g. T5) | |
210 TEST_VALUE=H : Test run for a Headful packaging (e.g. P1) | |
211 TEST_VALUE=M : Test run for a Mu-like packaging (e.g. Mu) | |
212 | |
213 Core processor number to use: | |
214 -c CORE_NBR, --core=CORE_NBR | |
215 Number of processor cores to use (Default: 1). | |
216 | |
217 | |
218 | |
219 Options for multi machine mode only | |
220 | |
221 Indicate that PhageTerm should run on several machines: | |
222 --mm | |
223 | |
224 | |
225 Options for step 1 of multi-machine mode (calculating reads coverage) on several machines | |
226 | |
227 Directory for coverage results: | |
228 --dir_cov_mm=DIR_PATH/DIR_NAME | |
229 Directory where to put coverage results. | |
230 Note: it is up to the user to delete the files in this directory. | |
231 | |
232 Total number of cores to use: | |
233 -c CORE_NBR, --core=CORE_NBR | |
234 Total number of cores used across all machines. | |
235 | |
236 Index of read chunk to process on current core | |
237 --core_id=IDX | |
238 A number between 0 and CORE_NBR-1 | |
239 | |
240 Directory for checkpoint files: | |
241 --dir_chk=DIR_PATH/DIR_NAME | |
242 Directory where phageTerm will put its checkpoints. | |
243 Note: the directory must exist before launching phageTerm. | |
244 If the directory already contains a file, phageTerm will start from the results contained in this file. | |
245 | |
246 --chk_freq=FREQUENCY | |
247 The frequency in minutes at which checkpoints must be created. | |
248 Note: default value is 0 which means that no checkpoint is created. | |
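The step 1 options above imply one PhageTerm process per read chunk, each with its own --core_id between 0 and CORE_NBR-1. The sketch below builds those command lines. It is an illustration under assumptions: the function name step1_commands and the file and directory names are hypothetical, and it assumes the mandatory -f and -r options are passed in multi-machine mode as well.

```python
def step1_commands(fastq, ref, cov_dir, core_nbr):
    """Build one step 1 command per read chunk (hypothetical helper)."""
    commands = []
    for core_id in range(core_nbr):  # core_id runs from 0 to CORE_NBR-1
        commands.append([
            "./PhageTerm.py",
            "-f", fastq,
            "-r", ref,
            "--mm",
            "--dir_cov_mm=" + cov_dir,
            "-c", str(core_nbr),
            "--core_id=" + str(core_id),
        ])
    return commands


# Placeholder file names, 4 cores in total across all machines.
cmds = step1_commands("reads.fastq", "virome.fasta", "cov_results", 4)
for cmd in cmds:
    print(" ".join(cmd))
```

How the commands are dispatched to the machines (job scheduler, ssh, and so on) is left to the user; the checkpoint options --dir_chk and --chk_freq can be appended to each list in the same way.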
249 | |
250 | |
251 | |
252 Options for step 2 of multi-machine mode (calculating per sequence statistics from reads coverage results) on several machines | |
253 | |
254 Directory for coverage results: | |
255 --dir_cov_mm=DIR_PATH/DIR_NAME | |
256 Directory where to put coverage results. | |
257 Note: it is up to the user to delete the files in this directory. | |
258 | |
259 Directory for per sequence results | |
260 --dir_seq_mm=DIR_PATH/DIR_NAME | |
261 Directory where to put the information if no match was found for one/several sequences. | |
262 Note: it is up to the user to delete the files in this directory. | |
263 | |
264 Directory for DR results | |
265 --DR_path=DIR_PATH/DIR_NAME | |
266 Directory where to put the information necessary to step 3 (final report generation). | |
267 This information typically includes names of phage found and per sequence statistics. | |
268 Note: it is up to the user to delete the files in this directory. | |
269 | |
270 Sequence identifier | |
271 --seq_id=IDX | |
272 Index of the sequence to be processed by the current phageTerm process. | |
273 Let N be the number of sequences given at the end of step 1. | |
274 Then IDX is a number between 0 and N-1. | |
275 | |
276 Number of pieces | |
277 --nb_pieces=NP | |
278 Number of parts in which the reads were divided. | |
279 Must be the same value as given via -c at step 1 (CORE_NBR). | |
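Step 2 follows the same pattern, but iterates over sequences instead of read chunks: one process per --seq_id between 0 and N-1, with --nb_pieces set to the CORE_NBR used at step 1. The sketch below is illustrative only: the function name step2_commands and all file and directory names are hypothetical, and it assumes -f and -r are still required at this step.

```python
def step2_commands(fastq, ref, cov_dir, seq_dir, dr_dir, nb_sequences, nb_pieces):
    """Build one step 2 command per sequence (hypothetical helper).

    nb_pieces must equal the value given via -c at step 1 (CORE_NBR).
    """
    commands = []
    for seq_id in range(nb_sequences):  # seq_id runs from 0 to N-1
        commands.append([
            "./PhageTerm.py",
            "-f", fastq,
            "-r", ref,
            "--mm",
            "--dir_cov_mm=" + cov_dir,
            "--dir_seq_mm=" + seq_dir,
            "--DR_path=" + dr_dir,
            "--seq_id=" + str(seq_id),
            "--nb_pieces=" + str(nb_pieces),
        ])
    return commands


# Placeholder names: 3 sequences reported by step 1, reads split in 4 pieces.
step2 = step2_commands("reads.fastq", "virome.fasta", "cov_results",
                       "seq_results", "dr_results", 3, 4)
for cmd in step2:
    print(" ".join(cmd))
```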
280 | |
281 | |
282 Options for step 3 of multi-machine mode (final report generation) | |
283 | |
284 Directory for DR results | |
285 --DR_path=DIR_PATH/DIR_NAME | |
286 Directory where to read the information necessary to step 3 (final report generation). | |
287 This information typically includes names of phage found and per sequence statistics. | |
288 Note: it is up to the user to delete the files in this directory. | |
289 | |
290 Directory for per sequence results | |
291 --dir_seq_mm=DIR_PATH/DIR_NAME | |
292 Directory where to get the information if no match was found for one/several sequences. | |
293 Note: it is up to the user to delete the files in this directory. | |
294 | |
295 | |
296 | |
297 | |
298 | |
299 | |
300 OUTPUT FILES | |
301 ============ | |
302 | |
303 (i) Report (.pdf) | |
304 | |
305 (ii) Statistical table (.csv) | |
306 | |
307 (iii) File containing the sequence(s) re-organized to start at the predicted termini (.fasta) | |
308 | |
309 | |
310 CONTACT | |
311 ======= | |
312 | |
313 Julian Garneau <julian.garneau@usherbrooke.ca> | |
314 Marc Monot <marc.monot@pasteur.fr> | |
315 David Bikard <david.bikard@pasteur.fr> | |
316 Véronique Legrand <vlegrand@pasteur.fr> |