comparison README.md @ 0:a5f1638b73be draft

Uploaded
author petr-novak
date Wed, 26 Jun 2019 08:01:42 -0400
-1:000000000000 0:a5f1638b73be
1 # REPEATS ANNOTATION TOOLS FOR ASSEMBLIES #
2
3
4 ## 1. PROFREP ##
5 *- **PROF**iles of **REP**eats -*
6
7 The ProfRep main tool uses RepeatExplorer outputs to annotate repeats in DNA sequences (typically assemblies, but not necessarily). It also produces repetitive profiles of the sequence, showing the quantitative representation of individual repeats along the sequence as well as its overall repetitiveness.
8
9 ### DEPENDENCIES ###
10
11 * python 3.4 or higher with packages:
12 * numpy
13 * matplotlib
14 * biopython
15 * [BLAST 2.2.28+](https://www.ncbi.nlm.nih.gov/books/NBK279690/) or higher
16 * [wigToBigWig](http://hgdownload.cse.ucsc.edu/admin/exe/)
17 * [cd-hit](http://weizhongli-lab.org/cd-hit/)
18 * [JBrowse](http://jbrowse.org/install/) - **Only bin needed, does not have to be installed under a web server**
19
20 * ProfRep Modules:
21 * gff.py
22 * visualization.py
23 * configuration.py
24 * protein_domains.py
25 * domains_filtering.py
26
27 * Profrep databases
28
29 Precompiled ProfRep annotation datasets are available for a limited number of species. The list of species can be found in the file [prepared_datasets.txt](tool_data/prepared_datasets). The databases include large files and must be downloaded from our website:
30
31 cd tool_data
32 wget http://repeatexplorer.org/repeatexplorer/wp-content/uploads/profrep.tar.gz
33 tar xzvf profrep.tar.gz
34
35
36 #### INPUTS ####
37
38 * **DNA sequence(s) to annotate** [multiFASTA]
39
40 * **Species specific dataset** available from RepeatExplorer archive consisting of:
41
42 * NGS reads sequences [multiFASTA]
43 * In RE archive: *seqclust -> sequences -> sequences.fasta*
44 * CLS file of clusters and the reads belonging to them [multiFASTA]
45 * in RE archive: *seqclust -> clustering -> hitsort.cls*
46 * Classification table [TSV, CSV]
47 * in RE archive: *PROFREP_CLASSIFICATION_TEMPLATE.csv* (automatic classification)
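
To make the CLS input more concrete, here is a minimal Python sketch that reads a cluster file into a dictionary of cluster name to read IDs. The layout it assumes (a `>`-prefixed header carrying the cluster name and an optional read count, followed by whitespace-separated read IDs) is an assumption about `hitsort.cls`, not something this README specifies. The function name `parse_cls` and the sample reads are hypothetical.

```python
import io


def parse_cls(handle):
    """Map cluster name -> list of read IDs from a CLS-like file.

    Assumed layout (not confirmed by the README):
        >CL1<TAB>3
        read_a read_b read_c
    """
    clusters = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: cluster name, optionally followed by a read count.
            current = line[1:].split()[0]
            clusters[current] = []
        elif current is not None:
            # Data line: whitespace-separated read identifiers.
            clusters[current].extend(line.split())
    return clusters


# Hypothetical example data standing in for a real hitsort.cls file.
example = io.StringIO(">CL1\t3\nread_a read_b read_c\n>CL2\t2\nread_d read_e\n")
clusters = parse_cls(example)
print({name: len(reads) for name, reads in clusters.items()})
```

With a real file, `open("hitsort.cls")` would replace the in-memory example.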
48
49
50 #### OUTPUTS ####
51
52 * **HTML summary report, JBrowse Data Directory** showing basic information and repetitive profile graphs as well as protein domains (optional) for individual sequences (up to 50). This output also serves as a data directory for the [JBrowse](https://jbrowse.org/) genome browser. You can create a standalone JBrowse instance for further detailed visualization of the output tracks using the Galaxy-integrated tool. This output can also be downloaded as an archive containing all relevant data for visualization via a locally installed JBrowse server (see more about visualization in OUTPUT VISUALIZATION below)
53 * **Ns GFF** - reports unspecified (N) bases regions in the sequence
54 * **Repeats GFF** - reports repetitive regions of at least a certain length (**80** by default) that exceed the hits/copy numbers threshold (**3** by default)
55 * **Domains GFF** - reports protein domains, classification of domain, chain orientation and alignment sequences
56 * Log file
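
The two Repeats GFF thresholds can be illustrated with a short Python sketch. It takes a per-base profile of hits (or copy numbers) and returns the runs that stay at or above the threshold for at least the minimum length. This is only an illustration of the idea, not ProfRep's actual code: the function name is hypothetical, and whether the real comparison is inclusive (`>=`) or strict (`>`) is an assumption.

```python
def repetitive_segments(profile, threshold=3, min_length=80):
    """Return 1-based inclusive (start, end) runs where profile >= threshold.

    Runs shorter than min_length are dropped. Inclusive comparison is an
    assumption; the README only says "above" the threshold.
    """
    segments = []
    run_start = None
    for pos, value in enumerate(profile, start=1):
        if value >= threshold:
            if run_start is None:
                run_start = pos  # a new repetitive run begins here
        elif run_start is not None:
            # The run ended at pos - 1; its length is pos - run_start.
            if pos - run_start >= min_length:
                segments.append((run_start, pos - 1))
            run_start = None
    # Close a run that reaches the end of the sequence.
    if run_start is not None and len(profile) - run_start + 1 >= min_length:
        segments.append((run_start, len(profile)))
    return segments


# Hypothetical profile: a 100 bp repetitive run and a 30 bp run that is too short.
profile = [0] * 10 + [5] * 100 + [1] * 20 + [4] * 30
print(repetitive_segments(profile))
```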
57
58
59 ### Running ProfRep ###
60
61 usage: profrep.py [-h] -q QUERY -rdb READS -a ANN_TBL -c CLS [-id DB_ID]
62 [-bs BIT_SCORE] [-m MAX_ALIGNMENTS] [-e E_VALUE]
63 [-df DUST_FILTER] [-ws WORD_SIZE] [-t TASK] [-n NEW_DB]
64 [-w WINDOW] [-o OVERLAP] [-pd PROTEIN_DOMAINS]
65 [-pdb PROTEIN_DATABASE] [-cs CLASSIFICATION] [-wd WIN_DOM]
66 [-od OVERLAP_DOM] [-thsc THRESHOLD_SCORE]
67 [-thl {float range 0.0..1.0}] [-thi {float range 0.0..1.0}]
68 [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
69 [-mlen MAX_LEN_PROPORTION] [-lg LOG_FILE] [-ouf OUTPUT_GFF]
70 [-oug DOMAIN_GFF] [-oun N_GFF] [-hf HTML_FILE]
71 [-hp HTML_PATH] [-cn COPY_NUMBERS] [-gs GENOME_SIZE]
72 [-thr THRESHOLD_REPEAT] [-thsg THRESHOLD_SEGMENT]
73 [-jb JBROWSE_BIN]
74
75
76 optional arguments:
77 -h, --help show this help message and exit
78
79 required arguments:
80 -q QUERY, --query QUERY
81 input DNA sequence in (multi)fasta format (default:
82 None)
83 -rdb READS, --reads READS
84 blast database of all sequencing reads (default: None)
85 -a ANN_TBL, --ann_tbl ANN_TBL
86 clusters annotation table, tab-separated number of
87 cluster and its classification (default: None)
88 -c CLS, --cls CLS cls file containing reads assigned to clusters
89 (hitsort.cls) (default: None)
90
91 alternative required arguments - prepared datasets:
92 -id DB_ID, --db_id DB_ID
93 annotation dataset ID (first column of datasets table)
94 (default: None)
95
96 optional arguments - BLAST Search:
97 -bs BIT_SCORE, --bit_score BIT_SCORE
98 bitscore threshold (default: 50)
99 -m MAX_ALIGNMENTS, --max_alignments MAX_ALIGNMENTS
100 blast filtering option: maximal number of alignments
101 in the output (default: 10000000)
102 -e E_VALUE, --e_value E_VALUE
103 blast setting option: e-value (default: 0.1)
104 -df DUST_FILTER, --dust_filter DUST_FILTER
105 dust filters low-complexity regions during BLAST
106 search (default: '20 64 1')
107 -ws WORD_SIZE, --word_size WORD_SIZE
108 blast search option: initial word size for alignment
109 (default: 11)
110 -t TASK, --task TASK type of blast to be triggered (default: blastn)
111 -n NEW_DB, --new_db NEW_DB
112 create a new blast database, USE THIS OPTION IF YOU
113 RUN PROFREP WITH NEW DATABASE FOR THE FIRST TIME
114 (default: True)
115
116 optional arguments - Parallel Processing:
117 -w WINDOW, --window WINDOW
118 sliding window size for parallel processing (default:
119 5000)
120 -o OVERLAP, --overlap OVERLAP
121 overlap for parallely processed regions, set greater
122 than a read size (default: 150)
123
124 optional arguments - Protein Domains:
125 -pd PROTEIN_DOMAINS, --protein_domains PROTEIN_DOMAINS
126 use module for protein domains (default: False)
127 -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
128 protein domains database (default: None)
129 -cs CLASSIFICATION, --classification CLASSIFICATION
130 protein domains classification file (default: None)
131 -wd WIN_DOM, --win_dom WIN_DOM
132 protein domains module: sliding window to process
133 large input sequences sequentially (default: 10000000)
134 -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
135 protein domains module: overlap of sequences in two
136 consecutive windows (default: 10000)
137 -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
138 protein domains module: percentage of the best score
139 within the cluster to significant domains (default:
140 80)
141 -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
142 proportion of alignment length threshold (default:
143 0.8)
144 -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
145 proportion of alignment identity threshold (default:
146 0.35)
147 -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
148 threshold for alignment proportional similarity
149 (default: 0.45)
150 -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
151 interruptions (frameshifts + stop codons) tolerance
152 threshold per 100 AA (default: 3)
153 -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
154 maximal proportion of alignment length to the original
155 length of protein domain from database (default: 1.2)
156
157 optional arguments - Output Paths:
158 -lg LOG_FILE, --log_file LOG_FILE
159 path to log file (default: log.txt)
160 -ouf OUTPUT_GFF, --output_gff OUTPUT_GFF
161 path to output gff of repetitive regions (default:
162 output_repeats.gff)
163 -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
164 path to output gff of protein domains (default:
165 output_domains.gff)
166 -oun N_GFF, --n_gff N_GFF
167 path to output gff of N regions (default:
168 N_regions.gff)
169 -hf HTML_FILE, --html_file HTML_FILE
170 path to output html file (default: output.html)
171 -hp HTML_PATH, --html_path HTML_PATH
172 path to html extra files (default: profrep_output_dir)
173
174 optional arguments - Copy Numbers/Hits :
175 -cn COPY_NUMBERS, --copy_numbers COPY_NUMBERS
176 convert hits to copy numbers (default: False)
177 -gs GENOME_SIZE, --genome_size GENOME_SIZE
178 genome size is required when converting hits to copy
179 numbers and you use custom data (default: None)
180 -thr THRESHOLD_REPEAT, --threshold_repeat THRESHOLD_REPEAT
181 threshold for hits/copy numbers per position to be
182 considered repetitive (default: 3)
183 -thsg THRESHOLD_SEGMENT, --threshold_segment THRESHOLD_SEGMENT
184 threshold for the length of repetitive segment to be
185 reported (default: 80)
186
187 optional arguments - Enviroment Variables:
188 -jb JBROWSE_BIN, --jbrowse_bin JBROWSE_BIN
189 path to JBrowse bin directory (default: None)
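
The `--window` and `--overlap` options above describe cutting the query into overlapping pieces for parallel processing. The following Python sketch shows one way such windows could be laid out, using the listed defaults (5000 and 150). How ProfRep actually splits and merges regions is not described in this README, so the function name and the 0-based half-open coordinates are assumptions made for illustration.

```python
def split_windows(seq_length, window=5000, overlap=150):
    """Yield 0-based half-open (start, end) windows covering the sequence.

    Consecutive windows share `overlap` bases, so a read-sized hit that
    spans a window boundary falls entirely inside at least one window.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if seq_length <= 0:
        return
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, seq_length)
        yield (start, end)
        if end == seq_length:
            break
        start += step


print(list(split_windows(12000)))
```

The help text advises setting the overlap greater than the read size, which matches the reasoning in the docstring.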
190
191
192 #### HOW TO RUN EXAMPLE ####
193
194 ./profrep.py --query PATH_TO_DNA_SEQ --reads PATH_TO_READS --ann_tbl PATH_TO_CLUSTERS_CLASSIFICATION --cls PATH_TO_hitsort.cls
195
196 When running for the first time with a new reads database use:
197
198 --new_db True
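
When ProfRep is called from another script, the command line can be assembled as a list of arguments. The Python sketch below builds such a list from the four required options shown in the usage section and appends `--new_db True` on request. The helper name `build_profrep_cmd` and the file names are hypothetical, and nothing is executed here.

```python
import shlex


def build_profrep_cmd(query, reads, ann_tbl, cls, new_db=False, extra=None):
    """Assemble a profrep.py command line as an argument list."""
    cmd = [
        "./profrep.py",
        "--query", query,
        "--reads", reads,
        "--ann_tbl", ann_tbl,
        "--cls", cls,
    ]
    if new_db:
        # First run with a new reads database, as described above.
        cmd += ["--new_db", "True"]
    if extra:
        cmd += list(extra)  # any further options from the usage listing
    return cmd


# Hypothetical file names used only for illustration.
cmd = build_profrep_cmd(
    "assembly.fasta", "sequences.fasta", "classification.csv", "hitsort.cls",
    new_db=True,
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point to real files.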
199
200
201 ### ProfRep Data Preparation ###
202
203 When using custom input datasets, these tools help to obtain the correct files and to prepare reduced datasets that speed up the main ProfRep analysis:
204
205 * Extract Data For ProfRep (extract_data_for_profrep.py)
206 * ProfRep DB Reducing (profrep_db_reducing.py)
207
208 ### ProfRep Supplementary Tools ###
209
210 These additional tools can be used for further work with the ProfRep outputs:
211
212 * ProfRep Refiner (profrep_refining.py)
213 * ProfRep Masker (profrep_masking.py)
214 * GFF Region Selector (gff_selection.py)
215
216 ### FOR MORE INFO ABOUT PREPARATION AND SUPPLEMENTARY TOOLS PLEASE READ PROFREP WIKI ###
217
218 ## 2. DANTE ##
219 *- **D**omain based **AN**notation of **T**ransposable **E**lements -*
220
221
222 * Protein Domains Finder [protein_domains.py]
223 * The script scans the given DNA sequence(s) in (multi)fasta format to discover protein domains using our protein domains database.
224 * The domain search is performed with the LASTAL alignment tool.
225 * Domains are subsequently annotated and classified. If a domain has multiple annotations assigned, its classification is derived from the common classification level of all of them.
226
227 * Protein Domains Filter [domains_filtering.py]
228 * filters the GFF3 output from the previous step to obtain a certain kind of domain and/or to adjust quality filtering
229
230
231 ### DEPENDENCIES ###
232
233 * python3.4 or higher with packages:
234 * numpy
235 * biopython
236 * [lastal](http://last.cbrc.jp/doc/last.html) 744 or higher
237 * ProfRep/DANTE modules:
238 * configuration.py
239
240
241 ### Protein Domains Finder ###
242
243 This tool provides **preliminary** output of all domain types, not filtered for quality.
244
245 #### INPUTS ####
246
247 * DNA sequence [multiFasta]
248
249 #### OUTPUTS ####
250
251 * **All protein domains GFF3** - individual domains are reported per line as regions (start-end) on the original DNA sequence including the seq ID and strand orientation. The last "Attributes" column contains several comma-separated pieces of information related to the domain annotation, the alignment and its quality. This file can be filtered further using the Protein Domains Filter tool.
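
As a rough illustration of working with this output, the Python sketch below splits the comma-separated "Attributes" column of one GFF3 line into a dictionary. The `Name` and `Final_Classification` keys are mentioned in this README. The `Identity` key, the sample values and the function name are hypothetical, and treating a piece without `=` as a continuation of the previous value is an assumption.

```python
def parse_attributes(column):
    """Split a comma-separated key=value attributes column into a dict."""
    attrs = {}
    last_key = None
    for piece in column.strip().split(","):
        if "=" in piece:
            key, value = piece.split("=", 1)
            last_key = key.strip()
            attrs[last_key] = value.strip()
        elif last_key is not None:
            # No '=' in this piece: assume the comma belonged to the previous value.
            attrs[last_key] += "," + piece
    return attrs


# Hypothetical GFF3 record; only the column layout follows the GFF3 convention.
line = (
    "seq1\tdante\tprotein_domain\t100\t850\t.\t+\t.\t"
    "Name=RT,Final_Classification=Class_I|LTR|Ty3/gypsy,Identity=0.52"
)
fields = line.split("\t")
attrs = parse_attributes(fields[8])
print(fields[0], fields[3], fields[4], fields[6], attrs["Name"])
```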
252
253 #### USAGE ####
254 usage: protein_domains.py [-h] -q QUERY -pdb PROTEIN_DATABASE -cs
255 CLASSIFICATION [-oug DOMAIN_GFF] [-nld NEW_LDB]
256 [-dir OUTPUT_DIR] [-thsc THRESHOLD_SCORE]
257 [-wd WIN_DOM] [-od OVERLAP_DOM]
258
259 optional arguments:
260 -h, --help show this help message and exit
261 -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
262 output domains gff format (default: None)
263 -nld NEW_LDB, --new_ldb NEW_LDB
264 create indexed database files for lastal in case of
265 working with new protein db (default: False)
266 -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
267 specify if you want to change the output directory
268 (default: None)
269 -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
270 percentage of the best score in the cluster to be
271 tolerated when assigning annotations per base
272 (default: 80)
273 -wd WIN_DOM, --win_dom WIN_DOM
274 window to process large input sequences sequentially
275 (default: 10000000)
276 -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
277 overlap of sequences in two consecutive windows
278 (default: 10000)
279
280 required named arguments:
281 -q QUERY, --query QUERY
282 input DNA sequence to search for protein domains in a
283 fasta format. Multifasta format allowed. (default:
284 None)
285 -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
286 protein domains database file (default: None)
287 -cs CLASSIFICATION, --classification CLASSIFICATION
288 protein domains classification file (default: None)
289
290
291
292 #### HOW TO RUN EXAMPLE ####
293 ./protein_domains.py -q PATH_TO_INPUT_SEQ -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE
294
295 When running for the first time with a new database, use the -nld option to let lastal create indexed database files:
296
297 -nld True
298
299 Use the other arguments if you wish to rename your outputs; otherwise they are created automatically with standard names.
300
301 ### Protein Domains Filter ###
302
303 The script filters the Protein Domains Finder output for quality and/or extracts a specific type of protein domain or mobile element of origin. For the filtered domains it reports the protein sequence translated from the original DNA.
304
305 WHEN NO PARAMETERS ARE GIVEN, IT PERFORMS QUALITY FILTERING USING THE DEFAULT PARAMETERS (optimized for Viridiplantae species)
306
307 #### INPUTS ####
308 * GFF3 file produced by protein_domains.py OR already filtered GFF3
309
310 #### Filtering options ####
311 * QUALITY:
312 - Min relative length of alignment to the protein domain from DB (without gaps)
313 - Identity
314 - Similarity (scoring matrix: BLOSUM80)
315 - Interruptions in the reading frame (frameshifts + stop codons) per every starting 100 AA
316 - Max alignment proportion to the original length of database domain sequence
317 * DOMAIN TYPE: 'Name' attribute in GFF - see choices below
318 Records for ambiguous domain type (e.g. INT/RH) are filtered out automatically
319
320 * MOBILE ELEMENT TYPE:
321 arbitrary substring of the element classification ('Final_Classification' attribute in GFF)
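
The quality criteria listed above can be pictured as a simple conjunction of threshold checks. The Python sketch below uses the default values from the usage listing (0.8, 0.35, 0.45, 3 and 1.2). The record fields and function name are hypothetical, the inclusive comparisons are an assumption, and the interruption count is taken as already normalized per 100 AA, so this is an illustration rather than the script's actual logic.

```python
from dataclasses import dataclass


@dataclass
class DomainQuality:
    """Hypothetical per-domain quality values (field names are assumptions)."""
    rel_length: float               # alignment length relative to the DB domain
    identity: float                 # proportion of identical positions
    similarity: float               # proportion of similar positions
    interruptions_per_100aa: float  # frameshifts + stop codons per 100 AA
    len_proportion: float           # alignment length / original DB domain length


def passes_quality(d, th_length=0.8, th_identity=0.35, th_similarity=0.45,
                   interruptions=3, max_len_proportion=1.2):
    """Return True if the domain satisfies every quality threshold."""
    return (
        d.rel_length >= th_length
        and d.identity >= th_identity
        and d.similarity >= th_similarity
        and d.interruptions_per_100aa <= interruptions
        and d.len_proportion <= max_len_proportion
    )


good = DomainQuality(0.9, 0.5, 0.6, 1, 1.0)
low_identity = DomainQuality(0.9, 0.3, 0.6, 1, 1.0)
print(passes_quality(good), passes_quality(low_identity))
```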
322
323 #### OUTPUTS ####
324 * filtered GFF3 file
325 * fasta file of translated protein sequences for the aligned domains that match the filtering criteria
326 ! as it is taken from the best hit alignment reported by LAST, it does not necessarily cover the whole region reported as a domain in the GFF
327
328 #### USAGE ####
329 usage: domains_filtering.py [-h] -dg DOM_GFF [-ouf DOMAINS_FILTERED]
330 [-dps DOMAINS_PROT_SEQ]
331 [-thl {float range 0.0..1.0}]
332 [-thi {float range 0.0..1.0}]
333 [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
334 [-mlen MAX_LEN_PROPORTION]
335 [-sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}]
336 [-el ELEMENT_TYPE] [-dir OUTPUT_DIR]
337
338
339
340 optional arguments:
341 -h, --help show this help message and exit
342 -ouf DOMAINS_FILTERED, --domains_filtered DOMAINS_FILTERED
343 output filtered domains gff file (default: None)
344 -dps DOMAINS_PROT_SEQ, --domains_prot_seq DOMAINS_PROT_SEQ
345 output file containg domains protein sequences
346 (default: None)
347 -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
348 proportion of alignment length threshold (default:
349 0.8)
350 -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
351 proportion of alignment identity threshold (default:
352 0.35)
353 -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
354 threshold for alignment proportional similarity
355 (default: 0.45)
356 -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
357 interruptions (frameshifts + stop codons) tolerance
358 threshold per 100 AA (default: 3)
359 -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
360 maximal proportion of alignment length to the original
361 length of protein domain from database (default: 1.2)
362 -sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}, --selected_dom {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}
363 filter output domains based on the domain type
364 (default: All)
365 -el ELEMENT_TYPE, --element_type ELEMENT_TYPE
366 filter output domains by typing substring from
367 classification (default: )
368 -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
369 specify if you want to change the output directory
370 (default: None)
371
372 required named arguments:
373 -dg DOM_GFF, --dom_gff DOM_GFF
374 basic unfiltered gff file of all domains (default:
375 None)
376
377
378
379 #### HOW TO RUN EXAMPLE ####
380 e.g. getting quality filtered integrase(INT) domains of all gypsy transposable elements:
381
382 ./domains_filtering.py --dom_gff PATH_TO_INPUT_GFF --selected_dom INT --element_type Ty3/gypsy
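
The selection in this example combines two criteria: the domain type from the 'Name' attribute and a substring of 'Final_Classification'. The Python sketch below shows that logic on already parsed records. The function name and classification strings are hypothetical, and detecting ambiguous domain types by the presence of `/` in the name (as in INT/RH) is an assumption.

```python
def select_domains(records, selected_dom="All", element_type=""):
    """Keep records matching the domain type and classification substring."""
    kept = []
    for rec in records:
        name = rec["Name"]
        if "/" in name:
            # Ambiguous domain type such as INT/RH: always dropped.
            continue
        if selected_dom != "All" and name != selected_dom:
            continue
        # An empty element_type matches every classification.
        if element_type not in rec["Final_Classification"]:
            continue
        kept.append(rec)
    return kept


# Hypothetical records as they might look after parsing the GFF3 attributes.
records = [
    {"Name": "INT", "Final_Classification": "Class_I|LTR|Ty3/gypsy|chromovirus"},
    {"Name": "INT", "Final_Classification": "Class_I|LTR|Ty1/copia"},
    {"Name": "INT/RH", "Final_Classification": "Class_I|LTR|Ty3/gypsy"},
    {"Name": "RT", "Final_Classification": "Class_I|LTR|Ty3/gypsy"},
]
selected = select_domains(records, selected_dom="INT", element_type="Ty3/gypsy")
print(selected)
```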
383
384
385 ### Extract Domains Nucleotide Sequences ###
386
387 This tool extracts nucleotide sequences of protein domains from reference DNA based on DANTE's output. It can be used e.g. for deriving phylogenetic relations of individual mobile elements classes within a species.
388
389 #### INPUTS ####
390
391 * original DNA sequence in multifasta format to extract the domains from
392 * GFF3 file of protein domains (**DANTE's output** - preferably filtered for quality and specific domain type)
393 * Domains database classification table (to check the classification level)
394
395 #### OUTPUTS ####
396
397 * fasta files of domain nucleotide sequences for individual transposon lineages
398 * txt file of domain counts extracted for individual lineages
399
400 **- For GALAXY usage, all sequences are concatenated into a single fasta file**
401
402 #### USAGE ####
403 usage: extract_domains_seqs.py [-h] -i INPUT_DNA -d DOMAINS_GFF -cs
404 CLASSIFICATION [-out OUT_DIR] [-ex EXTENDED]
405
406 optional arguments:
407 -h, --help show this help message and exit
408 -i INPUT_DNA, --input_dna INPUT_DNA
409 path to input DNA sequence
410 -d DOMAINS_GFF, --domains_gff DOMAINS_GFF
411 GFF file of protein domains
412 -cs CLASSIFICATION, --classification CLASSIFICATION
413 protein domains classification file
414 -out OUT_DIR, --out_dir OUT_DIR
415 output directory
416 -ex EXTENDED, --extended EXTENDED
417 extend the domains edges if not the whole datatabase
418 sequence was aligned
419
420 #### HOW TO RUN EXAMPLE ####
421 ./extract_domains_seqs.py --domains_gff PATH_PROTEIN_DOMAINS_GFF --input_dna PATH_TO_INPUT_DNA --classification PROTEIN_DOMAINS_DB_CLASS_TBL --extended True
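
The core step of such an extraction is cutting a region out of the reference by its GFF coordinates. The Python sketch below does this for a single region, using the 1-based inclusive coordinates of GFF3. Returning the reverse complement for minus-strand domains is an assumption about the desired output, not a documented behavior of extract_domains_seqs.py, and the function name is hypothetical.

```python
# Translation table for complementing unambiguous bases and N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_region(sequence, start, end, strand="+"):
    """Return the subsequence for 1-based inclusive GFF coordinates."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    region = sequence[start - 1:end]
    if strand == "-":
        # Assumed behavior: report minus-strand domains as reverse complement.
        region = region.translate(_COMPLEMENT)[::-1]
    return region


dna = "AAACCCGGGTTT"  # hypothetical toy sequence
forward = extract_region(dna, 2, 5, "+")
reverse = extract_region(dna, 1, 6, "-")
print(forward, reverse)
```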
422
423 ### GALAXY implementation ###
424
425 #### Dependencies ####
426
427 * python3.4 or higher with packages:
428 * numpy
429 * matplotlib
430 * biopython
431 * [BLAST 2.2.28+](https://www.ncbi.nlm.nih.gov/books/NBK279671/) or higher
432 * [LAST](http://last.cbrc.jp/doc/last.html) 744 or higher:
433 * [download](http://last.cbrc.jp/)
434 * [install](http://last.cbrc.jp/doc/last.html)
435 * [wigToBigWig](http://hgdownload.cse.ucsc.edu/admin/exe/)
436 * [cd-hit](http://weizhongli-lab.org/cd-hit/)
437 * [JBrowse](http://jbrowse.org/install/) - **Only bin needed, does not have to be installed under a web server**
438
439 #### Source ####
440
441 https://nina_h@bitbucket.org/nina_h/profrep.git
442
443 branch "cerit" --> only Pisum Sativum Terno in preparad annotation datasets
444
445 branch "develop"/"master" --> extended internal database of species (not published, or for internal purposes)
446
447 #### Configuration ####
448
449 Add tools
450
451 <section name="Assembly annotation" id="annotation">
452 <label id="profrep_prepare" text="ProfRep Data Preparation" />
453 <tool file="profrep/extract_data_for_profrep.xml" />
454 <tool file="profrep/db_reducing.xml" />
455 <label id="profrep_main" text="Profrep" />
456 <tool file="profrep/profrep.xml" />
457 <label id="profrep_supplementary" text="Profrep Supplementary" />
458 <tool file="profrep/profrep_refine.xml" />
459 <tool file="profrep/profrep_masking.xml" />
460 <tool file="profrep/gff_select_region.xml" />
461 <label id="domains" text="DANTE" />
462 <tool file="profrep/protein_domains.xml" />
463 <tool file="profrep/domains_filtering.xml" />
464 <tool file="profrep/extract_domains_seqs.xml" />
465 </section>
466
467
468 to
469
470 $__root_dir__/config/tool_conf.xml
471
472 ------------------------------------------------------------------------
473
474 Place PROFREP_DB files to
475
476 $__tool_data_path__/profrep
477
478 *REMARK* PROFREP_DB files contain prepared annotation data for the species in the drop-down menu:
479
480 * sequences.fasta - including BLAST database files, which were created by:
481 makeblastdb -in sequences.fasta -dbtype nucl
482 * hitsort.cls file
483 * classification table
484
485 Place DANTE_DB files to
486
487 $__tool_data_path__/protein_domains
488
489 *REMARK* DANTE_DB files contain protein domains database files:
490 * protein domains database including LASTAL database files, which were created by:
491 lastdb -p -cR01 DATABASE_NAME DATABASE_NAME
492 (the lastal database files are actually enough; the original database table does not have to be present)
493 * classification table
494
495 ------------------------------------------------------------------------
496
497 Create
498
499 $__root_dir__/database/dependencies/profrep/1.0.0/env.sh
500
501 containing:
502
503 export JBROWSE_BIN=PATH_TO_JBROWSE_DIR/bin
504
505 ------------------------------------------------------------------------
506
507 Link the following files into galaxy tool-data dir
508
509 ln -s $__tool_directory__/profrep/domains_data/select_domain.txt $__tool_data_path__
510 ln -s $__tool_directory__/profrep/profrep_data/prepared_datasets.txt $__tool_data_path__
519