annotate profrep_refine.xml @ 5:ad3bbf392135 draft

Uploaded
author petr-novak
date Wed, 26 Jun 2019 11:14:05 -0400
parents a5f1638b73be
children
<tool id="profrep_refine" name="ProfRep Refiner" version="1.0.0">
  <requirements>
    <requirement type="package" version="1.0.0">profrep</requirement>
    <requirement type="package">numpy</requirement>
  </requirements>
  <description>Tool to polish the raw ProfRep output by evaluating overlapping regions of different classifications and interconnecting fragmented parts of individual repeats, optionally supported by protein domain information</description>
  <command>
    python3 ${__tool_directory__}/profrep_refining.py --repeats_gff ${repeats_gff} --out_refined ${out_refined} --gap_threshold ${gap_th} --include_dom ${include_domains.domains}
    #if $include_domains.domains:
      --domains_gff ${include_domains.dom_gff}
      --dom_number ${include_domains.dom_num}
      --class_tbl ${__tool_data_path__ }/protein_domains/${include_domains.db_type}_class
    #end if
  </command>
  <inputs>
    <param format="gff" type="data" name="repeats_gff" label="Repeats GFF" help="Choose repeats GFF3 file from ProfRep output" />
    <param name="gap_th" type="integer" value="250" label="Gap tolerance" help="Maximum gap (in bp) tolerated between two consecutive repeat regions of the same class for them to be interconnected" />
    <conditional name="include_domains">
      <param name="domains" type="boolean" display="checkbox" truevalue="true" falsevalue="false" checked="true" label="Include protein domains information" help="This helps to improve the confidence of region merging" />
      <when value="true">
        <param format="gff" type="data" name="dom_gff" label="Domains GFF" help="Choose GFF3 file containing protein domains" />
        <param name="dom_num" type="integer" value="2" min="0" max="10" label="Minimum domains" help="Minimum number of domains per mobile element required to confirm merging of the regions" />
        <param name="db_type" type="select" label="Select taxon and protein domain database version (REXdb)" help="">
          <options from_file="rexdb_versions.txt">
            <column name="name" index="0"/>
            <column name="value" index="1"/>
          </options>
        </param>
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data format="gff3" name="out_refined" label="Refined GFF3 file from dataset ${repeats_gff.hid}" />
  </outputs>

  <help>
**WHAT IT DOES**

The REFINING PROCESS of the repeat annotation runs in two consecutive steps:

1. REMOVING LOW-CONFIDENCE REPEAT REGIONS

Prior to interconnecting the regions, it is necessary to filter out some nested regions of different classification, which might be false positives and disrupt the merging process. However, not all of these overlapping regions of different classification are necessarily wrong; they can reveal inserted or chimeric elements. That is why only regions of significantly lower quality are removed. First, clusters of overlapping repeat regions are created. Within a cluster, regions are checked one by one in order of descending PID. Any other region lying inside the current one (with a border tolerance on each side, 10 bp by default) is removed if its PID is more than 5% lower than that of the current region; otherwise it is preserved. The average PID (percentage of identity) is computed as the mean per position over equally classified hits and is then averaged over the whole region reported in the repeats GFF.
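As a rough illustration of the filtering step described above, the following Python sketch clusters overlapping regions and drops nested regions whose PID is clearly lower. It is not the code of profrep_refining.py: the names Region, cluster_overlapping and filter_cluster are hypothetical, "more than 5% lower" is assumed to mean more than five percentage points, and nested regions are tested regardless of their class.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical record for one repeat region from the repeats GFF."""
    start: int
    end: int
    cls: str
    pid: float  # average percentage of identity of the region


def cluster_overlapping(regions: List[Region]) -> List[List[Region]]:
    """Group regions into clusters linked by chains of overlaps."""
    clusters: List[List[Region]] = []
    cluster_end = 0
    for reg in sorted(regions, key=lambda r: (r.start, r.end)):
        if clusters and cluster_end >= reg.start:
            # The region overlaps the running cluster, so it joins it.
            clusters[-1].append(reg)
            cluster_end = max(cluster_end, reg.end)
        else:
            clusters.append([reg])
            cluster_end = reg.end
    return clusters


def filter_cluster(cluster: List[Region],
                   border_tol: int = 10,
                   pid_drop: float = 5.0) -> List[Region]:
    """Remove nested regions with a clearly lower PID (assumed rule)."""
    removed = set()
    # Visit regions from the highest PID to the lowest.
    for current in sorted(cluster, key=lambda r: r.pid, reverse=True):
        if current in removed:
            continue
        for other in cluster:
            if other is current or other in removed:
                continue
            # "Inside" allows border_tol bp of overhang on each side.
            nested = (other.start >= current.start - border_tol
                      and current.end + border_tol >= other.end)
            if nested and current.pid - other.pid > pid_drop:
                removed.add(other)
    return [reg for reg in cluster if reg not in removed]
```

For example, given a region A with PID 90 that contains a region B with PID 80 and a region C with PID 87, filter_cluster keeps A and C and drops B, because only B falls more than five points below A.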

2. INTERCONNECTING REGIONS

These "cleaned" regions are subsequently interconnected in the next step. The tool searches for consecutive repeats in order to create segments of the same classification whose parts are no further from each other than the gap threshold (250 bp by default). These segments cannot be interrupted by repeats of a different classification. By default, the confidence of merging is supported by protein domain information (optional). In this case, a minimum number of protein domains of the same orientation must be present inside the expanded segment. These domains need to be unambiguously classified down to the very last level (checked against the domain classification table), and at the same time their classification must correspond to the repeat classification of the region. Repetitive elements that do not encode protein domains, such as satellites or MITEs, are not checked for domains, and regions of the same classification are merged based on the gap criterion alone.
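The merging rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tool's implementation: Region, Domain, has_domain_support, merge_regions and the domain_free parameter are hypothetical names, the gap is taken as the next start minus the previous end, and a domain counts as matching when its class string equals the region class (the real tool checks a classification table).

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical repeat region: coordinates and classification."""
    start: int
    end: int
    cls: str


class Domain(NamedTuple):
    """Hypothetical protein domain hit with its orientation."""
    start: int
    end: int
    cls: str
    strand: str  # "+" or "-"


def has_domain_support(seg_start: int, seg_end: int, cls: str,
                       domains: List[Domain], min_domains: int) -> bool:
    """True if enough same-class domains of one orientation lie in the segment."""
    for strand in ("+", "-"):
        hits = [d for d in domains
                if d.cls == cls and d.strand == strand
                and d.start >= seg_start and seg_end >= d.end]
        if len(hits) >= min_domains:
            return True
    return False


def merge_regions(regions: List[Region],
                  gap_th: int = 250,
                  domains: Optional[List[Domain]] = None,
                  min_domains: int = 2,
                  domain_free: FrozenSet[str] = frozenset()) -> List[Region]:
    """Join consecutive same-class regions separated by at most gap_th bp."""
    merged: List[Region] = []
    for reg in sorted(regions, key=lambda r: (r.start, r.end)):
        prev = merged[-1] if merged else None
        # Only the immediately preceding region is considered, so a region
        # of another class lying in between blocks the merge.
        ok = (prev is not None and prev.cls == reg.cls
              and gap_th >= reg.start - prev.end)
        # Classes listed in domain_free (e.g. satellites) skip the domain check.
        if ok and domains is not None and reg.cls not in domain_free:
            ok = has_domain_support(prev.start, max(prev.end, reg.end),
                                    reg.cls, domains, min_domains)
        if ok:
            merged[-1] = Region(prev.start, max(prev.end, reg.end), prev.cls)
        else:
            merged.append(reg)
    return merged
```

Passing domains=None corresponds to running without protein domain information, in which case only the class and gap criteria decide whether two regions are joined.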
  </help>
</tool>