comparison clipkit_repo/docs/performance_assessment/index.rst @ 0:49b058e85902 draft
"planemo upload for repository https://github.com/jlsteenwyk/clipkit commit cbe1e8577ecb1a46709034a40dff36052e876e7a-dirty"
author | padge |
---|---|
date | Fri, 25 Mar 2022 13:04:31 +0000 |
.. _performance:


Performance Assessment
======================

|

Benchmarking
------------

^^^^^

In brief, assessing and comparing the performance of multiple sequence alignment trimming
software revealed that ClipKIT is a top-performing software.

.. image:: ../_static/img/Performance_summary.jpg

**ClipKIT is a top-performing software for trimming multiple sequence alignments.**
Across a total of 138,152 multiple sequence alignments (MSAs) from empirical (left) and
simulated (right) datasets, desirability-based integration of accuracy and support metrics
per MSA facilitated the comparison of relative software performance and revealed that ClipKIT
is a top-performing software. MSA trimming approaches are ordered along the x-axis from
the highest-performing software (left) to the lowest-performing software (right) according to average
desirability-based rank, which is derived from measures of tree accuracy (i.e., normalized
Robinson-Foulds distance) and tree support (i.e., average bipartition support).
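
As a rough illustration of the accuracy metric named above, the sketch below computes a
normalized Robinson-Foulds distance between two small unrooted trees. The toy trees, the
``normalized_rf`` helper, and the choice to represent each tree as a set of bipartitions are
assumptions made for this example; it is not the code used in the benchmark.

```python
def normalized_rf(splits_a, splits_b, n_tips):
    """Robinson-Foulds distance scaled to [0, 1] for unrooted binary trees.

    Each tree is given as a set of non-trivial bipartitions, with every
    bipartition stored as the frozenset of tips on the side that does
    not contain a fixed reference tip (here, "A").
    """
    rf = len(splits_a ^ splits_b)  # bipartitions found in only one tree
    max_rf = 2 * (n_tips - 3)      # two fully resolved trees sharing nothing
    return rf / max_rf


# Hypothetical five-tip trees: ((A,B),C,(D,E)) and ((A,C),B,(D,E)).
reference_tree = {frozenset("CDE"), frozenset("DE")}
inferred_tree = {frozenset("BDE"), frozenset("DE")}

nrf = normalized_rf(reference_tree, inferred_tree, n_tips=5)
print(nrf)  # 0.5
```

A value of 0 means the two trees share every bipartition, and a value of 1 means they share none.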

Abbreviations of trimmers and parameters are as follows:
ClipKIT: g = gappy mode; ClipKIT: kc = kpic mode; ClipKIT: kcg = kpic-gappy mode; ClipKIT: k = kpi mode;
ClipKIT: kg = kpi-gappy mode; BMGE = BMGE default; BMGE 0.3 = 0.3 entropy threshold;
BMGE 0.7 = 0.7 entropy threshold; trimAl: s = strict; trimAl: sp = strictplus; Noisy = default;
Gblocks = default; No trim = no trimming.

For additional details about performance assessment, please see *ClipKIT: a multiple sequence
alignment trimming software for accurate phylogenomic inference*. Steenwyk et al. PLoS Biology. doi: |doiLink|_.

.. _doiLink: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007
.. |doiLink| replace:: 10.1371/journal.pbio.3001007

|

smart-gap
---------

^^^^^

Starting with version 1.1.0, a dynamic gappyness threshold determination approach (referred to
as smart-gap) has been introduced into ClipKIT and is now the default trimming approach.
smart-gap was motivated by the excessive trimming observed among highly divergent sequences.

.. image:: ../_static/img/smart_gaps_trimming.png

For example, in the figure above, we simulated 100 sequences for various trees with 100 tips.
Each tree had a different total tree length, a measure of total evolutionary divergence (x-axis).
Differences in total tree length were generated by multiplying the branch lengths of the starting
random tree (generated using IQTREE2) by a factor ranging from 0.25 to 10. Thus, the same tree
shape and relative branch lengths were used across the simulations. Sequences were simulated using
INDELible. Examining the percentage of the alignment remaining after trimming revealed that using a
strict gappyness threshold of 90% resulted in 'extreme' trimming, which is not recommended (|TanLink|_).
In contrast, smart-gap retains a large fraction of the alignment and only removes the most
gappy sites. Thus, smart-gap is a better approach for sequence alignments that span deep and
shallow evolutionary timescales.
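
The branch-length scaling described above can be sketched as follows. This is only an
illustration: the ``scale_branch_lengths`` helper and the four-tip Newick string are
hypothetical, and the regular expression assumes a plain Newick tree whose tip labels contain
no colons. It does not reproduce the IQTREE2 or INDELible steps.

```python
import re

# Matches ":<number>" branch lengths such as ":0.1", ":2", or ":1e-05".
BRANCH_LENGTH = re.compile(r":([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def scale_branch_lengths(newick, factor):
    """Multiply every branch length in a Newick string by `factor`.

    The topology and the relative branch lengths stay the same; only the
    total tree length changes.
    """
    def _scale(match):
        return ":" + format(float(match.group(1)) * factor, "g")

    return BRANCH_LENGTH.sub(_scale, newick)


# Hypothetical starting tree (the simulations in the text used 100 tips).
start_tree = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);"

for factor in (0.25, 1, 10):
    print(factor, scale_branch_lengths(start_tree, factor))
```

Each scaled tree would then be handed to a sequence simulator, so that any difference in
trimming between alignments reflects overall divergence and not tree shape.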

More specifically, when implementing the smart-gap approach, ClipKIT first examines the
distribution of gaps across the alignment. Next, ClipKIT determines the gap-to-gap slope
between each pair of adjacent gappyness bins. By examining the maximum difference in slope
between adjacent bins, ClipKIT determines which step would correspond to removing a large number
of sites in comparison to other steps. Of note, ClipKIT only examines the first half of
the calculated slopes so as to not trim too much of the alignment. ClipKIT will then choose
a threshold that ensures this large number of sites will not be trimmed.
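
The slope comparison can be sketched as below. This is one reading of the description above
and not ClipKIT's actual implementation: the function names, the use of absolute slopes, the
rounding used for "the first half", and the example points are all assumptions.

```python
from fractions import Fraction


def slopes_between(points):
    """Absolute slope between consecutive (threshold, fraction trimmed) points."""
    return [
        abs((y2 - y1) / (x2 - x1))
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]


def largest_slope_change(slopes):
    """Index i maximizing |slopes[i] - slopes[i + 1]| within the first half of slopes.

    Returns None when the first half holds fewer than two slopes.
    """
    first_half = slopes[: (len(slopes) + 1) // 2]  # assumption: round up
    changes = [abs(a - b) for a, b in zip(first_half, first_half[1:])]
    if not changes:
        return None
    return changes.index(max(changes))


# Hypothetical (gappyness threshold, cumulative fraction of sites trimmed)
# points, ordered from the most gappy bin downward.
points = [
    (Fraction("0.9"), Fraction("0.02")),
    (Fraction("0.8"), Fraction("0.04")),
    (Fraction("0.7"), Fraction("0.30")),
    (Fraction("0.6"), Fraction("0.33")),
    (Fraction("0.5"), Fraction("0.35")),
]

slopes = slopes_between(points)
step = largest_slope_change(slopes)
print([str(s) for s in slopes])  # ['1/5', '13/5', '3/10', '1/5']
print(step)                      # 0
```

In these made-up points, the largest change lies between the first and second slopes, which
flags the step from a threshold of 0.8 down to 0.7 as the one that would trim a
disproportionate share of sites.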

.. _TanLink: https://academic.oup.com/sysbio/article/64/5/778/1685763
.. |TanLink| replace:: Tan *et al.* (2015)

For example, in the following test alignment:

.. code-block:: shell

   >1
   A-GTAT-
   >2
   A-G-AT-
   >3
   A-G-TA-
   >4
   AGA-TA-
   >5
   ACa-T-G

there are two sites with four gaps, one site with three gaps, and one
site with one gap. ClipKIT will first calculate the slope between two
points: trimming sites with greater than or equal to 80% gaps, which
removes 2/7ths of the alignment, and trimming sites with greater than or
equal to 60% gaps, which removes 3/7ths of the alignment. Next, ClipKIT
will determine the slope between trimming sites with greater than or equal
to 60% gaps (removing 3/7ths of the alignment) and trimming sites with
greater than or equal to 20% gaps (removing 4/7ths of the alignment), and
so on. ClipKIT will then examine the first half of the slope values and use
the less strict gaps threshold from the two points that generated the
greatest difference between consecutive slopes.
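
The numbers in this example can be checked with a short script. It is a sketch for
illustration only: the variable names are made up, and taking the absolute value of each
slope is an assumption, not something stated in the text.

```python
from collections import Counter
from fractions import Fraction

# The five sequences of the test alignment, in order.
sequences = ["A-GTAT-", "A-G-AT-", "A-G-TA-", "AGA-TA-", "ACa-T-G"]
n_seqs = len(sequences)
n_sites = len(sequences[0])

# Number of gap characters in each column (site).
gaps_per_site = [sum(seq[i] == "-" for seq in sequences) for i in range(n_sites)]
print(gaps_per_site)  # [0, 3, 0, 4, 0, 1, 4]

# How many sites share each non-zero gap count, most gappy first.
bins = sorted(Counter(g for g in gaps_per_site if g > 0).items(), reverse=True)

# Cumulative fraction of the alignment removed at each gappyness threshold.
points = []
removed = 0
for gap_count, site_count in bins:
    removed += site_count
    points.append((Fraction(gap_count, n_seqs), Fraction(removed, n_sites)))
print([(str(x), str(y)) for x, y in points])
# [('4/5', '2/7'), ('3/5', '3/7'), ('1/5', '4/7')]

# Slope between consecutive (threshold, fraction removed) points.
slopes = [
    abs((y2 - y1) / (x2 - x1))
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]
print([str(s) for s in slopes])  # ['5/7', '5/14']
```

The three points correspond to the 80%, 60%, and 20% thresholds and the 2/7, 3/7, and 4/7
fractions given above.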