# REPEATS ANNOTATION TOOLS FOR ASSEMBLIES #

## 1. PROFREP ##

*- **PROF**iles of **REP**eats -*

The main ProfRep tool uses RepeatExplorer outputs to annotate repeats in DNA sequences (typically, but not necessarily, assemblies). It also produces repetitive profiles of the sequence, showing the quantitative representation of individual repeats along the sequence as well as the overall repetitiveness.

### DEPENDENCIES ###

* python 3.4 or higher with packages:
    * numpy
    * matplotlib
    * biopython
* [BLAST 2.2.28+](https://www.ncbi.nlm.nih.gov/books/NBK279690/) or higher
* [wigToBigWig](http://hgdownload.cse.ucsc.edu/admin/exe/)
* [cd-hit](http://weizhongli-lab.org/cd-hit/)
* [JBrowse](http://jbrowse.org/install/) - **Only bin needed, does not have to be installed under a web server**

* ProfRep Modules:
    * gff.py
    * visualization.py
    * configuration.py
    * protein_domains.py
    * domains_filtering.py

* ProfRep databases

Precompiled ProfRep annotation datasets are available for a limited number of species. The list of species can be found in the file [prepared_datasets.txt](tool_data/prepared_datasets). The databases include large files and must be downloaded from our website:

    cd tool_data
    wget http://repeatexplorer.org/repeatexplorer/wp-content/uploads/profrep.tar.gz
    tar xzvf profrep.tar.gz

#### INPUTS ####

* **DNA sequence(s) to annotate** [multiFASTA]

* **Species-specific dataset** available from the RepeatExplorer archive, consisting of:

    * NGS read sequences [multiFASTA]
        * in RE archive: *seqclust -> sequences -> sequences.fasta*
    * CLS file of clusters and the reads belonging to them [multiFASTA]
        * in RE archive: *seqclust -> clustering -> hitsort.cls*
    * Classification table [TSV, CSV]
        * in RE archive: *PROFREP_CLASSIFICATION_TEMPLATE.csv* (automatic classification)

#### OUTPUTS ####

* **HTML summary report, JBrowse Data Directory** - shows basic information and repetitive profile graphs, as well as protein domains (optional), for individual sequences (up to 50). This output also serves as a data directory for the [JBrowse](https://jbrowse.org/) genome browser. You can create a standalone JBrowse instance for further detailed visualization of the output tracks using the Galaxy-integrated tool. This output can also be downloaded as an archive containing all data relevant for visualization via a locally installed JBrowse server (see more about visualization in OUTPUT VISUALIZATION below)
* **Ns GFF** - reports regions of unspecified (N) bases in the sequence
* **Repeats GFF** - reports repetitive regions that meet the length threshold (default **80**) and the hits/copy numbers threshold (default **3**)
* **Domains GFF** - reports protein domains, classification of the domain, chain orientation and alignment sequences
* Log file

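The two thresholds behind the Repeats GFF can be illustrated with a minimal sketch. It is not ProfRep's own code: the function name, the plain list used as a per-base profile, and the choice to treat both thresholds as inclusive are assumptions made for illustration only.

```python
from typing import List, Tuple


def repetitive_segments(
    profile: List[float],
    threshold_repeat: float = 3,   # mirrors the -thr default
    threshold_segment: int = 80,   # mirrors the -thsg default
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) runs of repetitive positions.

    Assumption: a position counts as repetitive when its value is
    >= threshold_repeat, and a run is reported when it spans at least
    threshold_segment bases.
    """
    segments = []
    run_start = None
    for i, value in enumerate(profile):
        if value >= threshold_repeat:
            if run_start is None:
                run_start = i          # a new run begins here
        elif run_start is not None:
            if i - run_start >= threshold_segment:
                segments.append((run_start + 1, i))
            run_start = None
    # Close a run that reaches the end of the sequence.
    if run_start is not None and len(profile) - run_start >= threshold_segment:
        segments.append((run_start + 1, len(profile)))
    return segments


# Hypothetical profile: a 100 bp repetitive run and a 20 bp run that is too short.
example_profile = [0] * 10 + [5] * 100 + [0] * 5 + [4] * 20
print(repetitive_segments(example_profile))
```

With the hypothetical profile above, only the 100 bp run is long enough to be reported; the 20 bp run is dropped by the segment length threshold.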
### Running ProfRep ###

    usage: profrep.py [-h] -q QUERY -rdb READS -a ANN_TBL -c CLS [-id DB_ID]
                      [-bs BIT_SCORE] [-m MAX_ALIGNMENTS] [-e E_VALUE]
                      [-df DUST_FILTER] [-ws WORD_SIZE] [-t TASK] [-n NEW_DB]
                      [-w WINDOW] [-o OVERLAP] [-pd PROTEIN_DOMAINS]
                      [-pdb PROTEIN_DATABASE] [-cs CLASSIFICATION] [-wd WIN_DOM]
                      [-od OVERLAP_DOM] [-thsc THRESHOLD_SCORE]
                      [-thl {float range 0.0..1.0}] [-thi {float range 0.0..1.0}]
                      [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
                      [-mlen MAX_LEN_PROPORTION] [-lg LOG_FILE] [-ouf OUTPUT_GFF]
                      [-oug DOMAIN_GFF] [-oun N_GFF] [-hf HTML_FILE]
                      [-hp HTML_PATH] [-cn COPY_NUMBERS] [-gs GENOME_SIZE]
                      [-thr THRESHOLD_REPEAT] [-thsg THRESHOLD_SEGMENT]
                      [-jb JBROWSE_BIN]


    optional arguments:
      -h, --help            show this help message and exit

    required arguments:
      -q QUERY, --query QUERY
                            input DNA sequence in (multi)fasta format (default:
                            None)
      -rdb READS, --reads READS
                            blast database of all sequencing reads (default: None)
      -a ANN_TBL, --ann_tbl ANN_TBL
                            clusters annotation table, tab-separated number of
                            cluster and its classification (default: None)
      -c CLS, --cls CLS     cls file containing reads assigned to clusters
                            (hitsort.cls) (default: None)

    alternative required arguments - prepared datasets:
      -id DB_ID, --db_id DB_ID
                            annotation dataset ID (first column of datasets table)
                            (default: None)

    optional arguments - BLAST Search:
      -bs BIT_SCORE, --bit_score BIT_SCORE
                            bitscore threshold (default: 50)
      -m MAX_ALIGNMENTS, --max_alignments MAX_ALIGNMENTS
                            blast filtering option: maximal number of alignments
                            in the output (default: 10000000)
      -e E_VALUE, --e_value E_VALUE
                            blast setting option: e-value (default: 0.1)
      -df DUST_FILTER, --dust_filter DUST_FILTER
                            dust filters low-complexity regions during BLAST
                            search (default: '20 64 1')
      -ws WORD_SIZE, --word_size WORD_SIZE
                            blast search option: initial word size for alignment
                            (default: 11)
      -t TASK, --task TASK  type of blast to be triggered (default: blastn)
      -n NEW_DB, --new_db NEW_DB
                            create a new blast database, USE THIS OPTION IF YOU
                            RUN PROFREP WITH NEW DATABASE FOR THE FIRST TIME
                            (default: True)

    optional arguments - Parallel Processing:
      -w WINDOW, --window WINDOW
                            sliding window size for parallel processing (default:
                            5000)
      -o OVERLAP, --overlap OVERLAP
                            overlap for parallely processed regions, set greater
                            than a read size (default: 150)

    optional arguments - Protein Domains:
      -pd PROTEIN_DOMAINS, --protein_domains PROTEIN_DOMAINS
                            use module for protein domains (default: False)
      -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
                            protein domains database (default: None)
      -cs CLASSIFICATION, --classification CLASSIFICATION
                            protein domains classification file (default: None)
      -wd WIN_DOM, --win_dom WIN_DOM
                            protein domains module: sliding window to process
                            large input sequences sequentially (default: 10000000)
      -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
                            protein domains module: overlap of sequences in two
                            consecutive windows (default: 10000)
      -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
                            protein domains module: percentage of the best score
                            within the cluster to significant domains (default:
                            80)
      -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
                            proportion of alignment length threshold (default:
                            0.8)
      -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
                            proportion of alignment identity threshold (default:
                            0.35)
      -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
                            threshold for alignment proportional similarity
                            (default: 0.45)
      -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
                            interruptions (frameshifts + stop codons) tolerance
                            threshold per 100 AA (default: 3)
      -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
                            maximal proportion of alignment length to the original
                            length of protein domain from database (default: 1.2)

    optional arguments - Output Paths:
      -lg LOG_FILE, --log_file LOG_FILE
                            path to log file (default: log.txt)
      -ouf OUTPUT_GFF, --output_gff OUTPUT_GFF
                            path to output gff of repetitive regions (default:
                            output_repeats.gff)
      -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
                            path to output gff of protein domains (default:
                            output_domains.gff)
      -oun N_GFF, --n_gff N_GFF
                            path to output gff of N regions (default:
                            N_regions.gff)
      -hf HTML_FILE, --html_file HTML_FILE
                            path to output html file (default: output.html)
      -hp HTML_PATH, --html_path HTML_PATH
                            path to html extra files (default: profrep_output_dir)

    optional arguments - Copy Numbers/Hits :
      -cn COPY_NUMBERS, --copy_numbers COPY_NUMBERS
                            convert hits to copy numbers (default: False)
      -gs GENOME_SIZE, --genome_size GENOME_SIZE
                            genome size is required when converting hits to copy
                            numbers and you use custom data (default: None)
      -thr THRESHOLD_REPEAT, --threshold_repeat THRESHOLD_REPEAT
                            threshold for hits/copy numbers per position to be
                            considered repetitive (default: 3)
      -thsg THRESHOLD_SEGMENT, --threshold_segment THRESHOLD_SEGMENT
                            threshold for the length of repetitive segment to be
                            reported (default: 80)

    optional arguments - Enviroment Variables:
      -jb JBROWSE_BIN, --jbrowse_bin JBROWSE_BIN
                            path to JBrowse bin directory (default: None)

#### HOW TO RUN EXAMPLE ####

    ./profrep.py --query PATH_TO_DNA_SEQ --reads PATH_TO_READS --ann_tbl PATH_TO_CLUSTERS_CLASSIFICATION --cls PATH_TO_hitsort.cls

When running for the first time with a new reads database, use:

    --new_db True

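When ProfRep is launched from a script, the same command line can be assembled programmatically. The sketch below is illustrative only: the helper name `build_profrep_command`, its `first_run` parameter and the file names are hypothetical, and only the flags shown in the usage text above are used.

```python
import shlex


def build_profrep_command(query, reads, ann_tbl, cls, first_run=False,
                          script="./profrep.py"):
    """Assemble the profrep.py argument list from the four required inputs."""
    cmd = [
        script,
        "--query", query,
        "--reads", reads,
        "--ann_tbl", ann_tbl,
        "--cls", cls,
    ]
    if first_run:
        # Recommended above for the first run with a new reads database.
        cmd += ["--new_db", "True"]
    return cmd


# Placeholder file names, not files shipped with ProfRep.
cmd = build_profrep_command(
    "assembly.fasta", "sequences.fasta", "classification.tsv", "hitsort.cls",
    first_run=True,
)
print(" ".join(shlex.quote(part) for part in cmd))
# To launch the analysis: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell quoting problems with paths that contain spaces.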
### ProfRep Data Preparation ###

When using custom input datasets, these tools can be used to obtain the correct files easily and to prepare reduced datasets that speed up the main ProfRep analysis:

* Extract Data For ProfRep (extract_data_for_profrep.py)
* ProfRep DB Reducing (profrep_db_reducing.py)

### ProfRep Supplementary Tools ###

These additional tools can be used for further work with the ProfRep outputs:

* ProfRep Refiner (profrep_refining.py)
* ProfRep Masker (profrep_masking.py)
* GFF Region Selector (gff_selection.py)

### FOR MORE INFO ABOUT PREPARATION AND SUPPLEMENTARY TOOLS PLEASE READ PROFREP WIKI ###

## 2. DANTE ##
*- **D**omain based **AN**notation of **T**ransposable **E**lements -*

* Protein Domains Finder [protein_domains.py]
    * The script scans the given DNA sequence(s) in (multi)fasta format to discover protein domains using our protein domains database.
    * The domain search is performed with the LASTAL alignment tool.
    * Domains are subsequently annotated and classified - if a domain has multiple annotations assigned, its classification is derived from the common classification level of all of them.

* Protein Domains Filter [domains_filtering.py]
    * filters the GFF3 output of the previous step to obtain a certain kind of domain and/or to adjust quality filtering

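The idea of a "common classification level" can be sketched as the longest shared prefix of hierarchical classification strings. This is an illustration rather than DANTE's implementation: the `|` separator and the example lineage strings are assumptions.

```python
def common_classification(classifications, sep="|"):
    """Return the deepest classification level shared by all annotations.

    Assumption: each classification is a hierarchy written from the most
    general to the most specific level, joined by `sep`.
    """
    if not classifications:
        return ""
    split = [c.split(sep) for c in classifications]
    common = []
    # zip stops at the shortest hierarchy, so the prefix never overruns it.
    for levels in zip(*split):
        if all(level == levels[0] for level in levels):
            common.append(levels[0])
        else:
            break
    return sep.join(common)


# Hypothetical annotations assigned to one domain.
hits = [
    "Class_I|LTR|Ty3/gypsy|chromovirus|Tekay",
    "Class_I|LTR|Ty3/gypsy|chromovirus|CRM",
]
print(common_classification(hits))
```

In this example the two annotations disagree only at the last level, so the domain would be classified one level higher.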
### DEPENDENCIES ###

* python3.4 or higher with packages:
    * numpy
    * biopython
* [lastal](http://last.cbrc.jp/doc/last.html) 744 or higher
* ProfRep/DANTE modules:
    * configuration.py

### Protein Domains Finder ###

This tool provides **preliminary** output of all domain types, not filtered for quality.

#### INPUTS ####

* DNA sequence [multiFasta]

#### OUTPUTS ####

* **All protein domains GFF3** - individual domains are reported one per line as regions (start-end) on the original DNA sequence, including the seq ID and strand orientation. The last "Attributes" column contains several comma-separated pieces of information related to the domain annotation, the alignment and its quality. This file can undergo further filtering using the Protein Domains Filter tool.

#### USAGE ####

    usage: protein_domains.py [-h] -q QUERY -pdb PROTEIN_DATABASE -cs
                              CLASSIFICATION [-oug DOMAIN_GFF] [-nld NEW_LDB]
                              [-dir OUTPUT_DIR] [-thsc THRESHOLD_SCORE]
                              [-wd WIN_DOM] [-od OVERLAP_DOM]

    optional arguments:
      -h, --help            show this help message and exit
      -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
                            output domains gff format (default: None)
      -nld NEW_LDB, --new_ldb NEW_LDB
                            create indexed database files for lastal in case of
                            working with new protein db (default: False)
      -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
                            specify if you want to change the output directory
                            (default: None)
      -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
                            percentage of the best score in the cluster to be
                            tolerated when assigning annotations per base
                            (default: 80)
      -wd WIN_DOM, --win_dom WIN_DOM
                            window to process large input sequences sequentially
                            (default: 10000000)
      -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
                            overlap of sequences in two consecutive windows
                            (default: 10000)

    required named arguments:
      -q QUERY, --query QUERY
                            input DNA sequence to search for protein domains in a
                            fasta format. Multifasta format allowed. (default:
                            None)
      -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
                            protein domains database file (default: None)
      -cs CLASSIFICATION, --classification CLASSIFICATION
                            protein domains classification file (default: None)

#### HOW TO RUN EXAMPLE ####

    ./protein_domains.py -q PATH_TO_INPUT_SEQ -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE

When running for the first time with a new database, use the -nld option to let lastal create the indexed database files:

    -nld True

Use the other arguments if you wish to rename your outputs; otherwise they will be created automatically with standard names.

### Protein Domains Filter ###

The script filters the Protein Domains Finder output for quality and/or extracts a specific type of protein domain or mobile element of origin. For the filtered domains, it reports the protein sequences translated from the original DNA.

WHEN NO PARAMETERS ARE GIVEN, IT PERFORMS QUALITY FILTERING USING THE DEFAULT PARAMETERS (optimized for Viridiplantae species)

#### INPUTS ####
* GFF3 file produced by protein_domains.py OR already filtered GFF3

#### Filtering options ####
* QUALITY:
    - Min relative length of alignment to the protein domain from DB (without gaps)
    - Identity
    - Similarity (scoring matrix: BLOSUM80)
    - Interruptions in the reading frame (frameshifts + stop codons) per every starting 100 AA
    - Max alignment proportion to the original length of database domain sequence
* DOMAIN TYPE: 'Name' attribute in GFF - see choices below.
    Records for ambiguous domain type (e.g. INT/RH) are filtered out automatically

* MOBILE ELEMENT TYPE:
    arbitrary substring of the element classification ('Final_Classification' attribute in GFF)

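How the three groups of filtering options combine can be sketched as two small predicates. This is a hedged illustration, not the code of domains_filtering.py: the function names and inputs are hypothetical, the defaults are copied from the usage text, and both the comparison directions (inclusive) and the use of `/` to detect ambiguous domain types are assumptions.

```python
def passes_quality(rel_length, identity, similarity,
                   interruptions_per_100aa, len_proportion,
                   th_length=0.8, th_identity=0.35, th_similarity=0.45,
                   max_interruptions=3, max_len_proportion=1.2):
    """Apply the five QUALITY criteria with the documented default thresholds."""
    return (
        rel_length >= th_length
        and identity >= th_identity
        and similarity >= th_similarity
        and interruptions_per_100aa <= max_interruptions
        and len_proportion <= max_len_proportion
    )


def matches_type(name, final_classification, selected_dom="All", element_type=""):
    """Apply the DOMAIN TYPE and MOBILE ELEMENT TYPE criteria."""
    if "/" in name:
        # Assumed marker of an ambiguous domain type such as INT/RH.
        return False
    if selected_dom != "All" and name != selected_dom:
        return False
    # An empty element_type matches every classification.
    return element_type in final_classification


# Hypothetical record: an integrase domain from a Ty3/gypsy element.
record_ok = (
    passes_quality(0.92, 0.48, 0.61, 1, 1.0)
    and matches_type("INT", "Class_I|LTR|Ty3/gypsy|chromovirus",
                     selected_dom="INT", element_type="Ty3/gypsy")
)
print(record_ok)
```

Splitting quality and type checks keeps the default behaviour (quality filtering only) easy to see: with `selected_dom="All"` and an empty `element_type`, only ambiguous domain types are rejected by the second predicate.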
#### OUTPUTS ####
* filtered GFF3 file
* fasta file of translated protein sequences for the aligned domains that match the filtering criteria
    ! as it is taken from the best hit alignment reported by LAST, it does not necessarily cover the whole region reported as domain in GFF

#### USAGE ####

    usage: domains_filtering.py [-h] -dg DOM_GFF [-ouf DOMAINS_FILTERED]
                                [-dps DOMAINS_PROT_SEQ]
                                [-thl {float range 0.0..1.0}]
                                [-thi {float range 0.0..1.0}]
                                [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
                                [-mlen MAX_LEN_PROPORTION]
                                [-sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}]
                                [-el ELEMENT_TYPE] [-dir OUTPUT_DIR]

    optional arguments:
      -h, --help            show this help message and exit
      -ouf DOMAINS_FILTERED, --domains_filtered DOMAINS_FILTERED
                            output filtered domains gff file (default: None)
      -dps DOMAINS_PROT_SEQ, --domains_prot_seq DOMAINS_PROT_SEQ
                            output file containg domains protein sequences
                            (default: None)
      -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
                            proportion of alignment length threshold (default:
                            0.8)
      -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
                            proportion of alignment identity threshold (default:
                            0.35)
      -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
                            threshold for alignment proportional similarity
                            (default: 0.45)
      -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
                            interruptions (frameshifts + stop codons) tolerance
                            threshold per 100 AA (default: 3)
      -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
                            maximal proportion of alignment length to the original
                            length of protein domain from database (default: 1.2)
      -sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}, --selected_dom {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}
                            filter output domains based on the domain type
                            (default: All)
      -el ELEMENT_TYPE, --element_type ELEMENT_TYPE
                            filter output domains by typing substring from
                            classification (default: )
      -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
                            specify if you want to change the output directory
                            (default: None)

    required named arguments:
      -dg DOM_GFF, --dom_gff DOM_GFF
                            basic unfiltered gff file of all domains (default:
                            None)

#### HOW TO RUN EXAMPLE ####
e.g. getting quality-filtered integrase (INT) domains of all gypsy transposable elements:

    ./domains_filtering.py --dom_gff PATH_TO_INPUT_GFF --selected_dom INT --element_type Ty3/gypsy

### Extract Domains Nucleotide Sequences ###

This tool extracts nucleotide sequences of protein domains from reference DNA based on DANTE's output. It can be used e.g. for deriving phylogenetic relations of individual mobile element classes within a species.

#### INPUTS ####

* original DNA sequence in multifasta format to extract the domains from
* GFF3 file of protein domains (**DANTE's output** - preferably filtered for quality and specific domain type)
* Domains database classification table (to check the classification level)

#### OUTPUTS ####

* fasta files of domain nucleotide sequences for individual transposon lineages
* txt file of domain counts extracted for individual lineages

**- For GALAXY usage all concatenated in a single fasta file**

#### USAGE ####

    usage: extract_domains_seqs.py [-h] -i INPUT_DNA -d DOMAINS_GFF -cs
                                   CLASSIFICATION [-out OUT_DIR] [-ex EXTENDED]

    optional arguments:
      -h, --help            show this help message and exit
      -i INPUT_DNA, --input_dna INPUT_DNA
                            path to input DNA sequence
      -d DOMAINS_GFF, --domains_gff DOMAINS_GFF
                            GFF file of protein domains
      -cs CLASSIFICATION, --classification CLASSIFICATION
                            protein domains classification file
      -out OUT_DIR, --out_dir OUT_DIR
                            output directory
      -ex EXTENDED, --extended EXTENDED
                            extend the domains edges if not the whole datatabase
                            sequence was aligned

#### HOW TO RUN EXAMPLE ####

    ./extract_domains_seqs.py --domains_gff PATH_PROTEIN_DOMAINS_GFF --input_dna PATH_TO_INPUT_DNA --classification PROTEIN_DOMAINS_DB_CLASS_TBL --extended True

### GALAXY implementation ###

#### Dependencies ####

* python3.4 or higher with packages:
    * numpy
    * matplotlib
    * biopython
* [BLAST 2.2.28+](https://www.ncbi.nlm.nih.gov/books/NBK279671/) or higher
* [LAST](http://last.cbrc.jp/doc/last.html) 744 or higher:
    * [download](http://last.cbrc.jp/)
    * [install](http://last.cbrc.jp/doc/last.html)
* [wigToBigWig](http://hgdownload.cse.ucsc.edu/admin/exe/)
* [cd-hit](http://weizhongli-lab.org/cd-hit/)
* [JBrowse](http://jbrowse.org/install/) - **Only bin needed, does not have to be installed under a web server**

#### Source ####

https://nina_h@bitbucket.org/nina_h/profrep.git

branch "cerit" --> only Pisum sativum Terno in prepared annotation datasets

branch "develop"/"master" --> extended internal database of species (not published, or for internal purposes)

#### Configuration ####

Add tools

    <section name="Assembly annotation" id="annotation">
      <label id="profrep_prepare" text="ProfRep Data Preparation" />
      <tool file="profrep/extract_data_for_profrep.xml" />
      <tool file="profrep/db_reducing.xml" />
      <label id="profrep_main" text="Profrep" />
      <tool file="profrep/profrep.xml" />
      <label id="profrep_supplementary" text="Profrep Supplementary" />
      <tool file="profrep/profrep_refine.xml" />
      <tool file="profrep/profrep_masking.xml" />
      <tool file="profrep/gff_select_region.xml" />
      <label id="domains" text="DANTE" />
      <tool file="profrep/protein_domains.xml" />
      <tool file="profrep/domains_filtering.xml" />
      <tool file="profrep/extract_domains_seqs.xml" />
    </section>

to

    $__root_dir__/config/tool_conf.xml

------------------------------------------------------------------------

Place PROFREP_DB files to

    $__tool_data_path__/profrep

*REMARK* PROFREP_DB files contain prepared annotation data for the species in the drop-down menu:

* sequences.fasta - including BLAST database files, which were created by:

        makeblastdb -in sequences.fasta -dbtype nucl

* hitsort.cls file
* classification table

Place DANTE_DB files to

    $__tool_data_path__/protein_domains

*REMARK* DANTE_DB files contain protein domains database files:

* protein domains database including LASTAL database files, which were created by:

        lastdb -p -cR01 <database_name> <database_name>

    (the lastal database files are actually enough, the original database table does not have to be present)
* classification table

------------------------------------------------------------------------

Create

    $__root_dir__/database/dependencies/profrep/1.0.0/env.sh

containing:

    export JBROWSE_BIN=PATH_TO_JBROWSE_DIR/bin

------------------------------------------------------------------------

Link the following files into the Galaxy tool-data dir

    ln -s $__tool_directory__/profrep/domains_data/select_domain.txt $__tool_data_path__
    ln -s $__tool_directory__/profrep/profrep_data/prepared_datasets.txt $__tool_data_path__