comparison clipkit_repo/docs/performance_assessment/index.rst @ 0:49b058e85902 draft

"planemo upload for repository https://github.com/jlsteenwyk/clipkit commit cbe1e8577ecb1a46709034a40dff36052e876e7a-dirty"
author padge
date Fri, 25 Mar 2022 13:04:31 +0000
-1:000000000000 0:49b058e85902
1 .. _performance:
2
3
4 Performance Assessment
5 ======================
6
7 |
8
9 Benchmarking
10 ------------
11
12 ^^^^^
13
14 In brief, performance assessment and comparison of multiple alignment trimming software
15 revealed that ClipKIT is a top-performing software.
16
17 .. image:: ../_static/img/Performance_summary.jpg
18
19 **ClipKIT is a top-performing software for trimming multiple sequence alignments.**
20 Across a total of 138,152 multiple sequence alignments (MSAs) from empirical (left) and
21 simulated (right) datasets, desirability-based integration of accuracy and support metrics
22 per MSA facilitated the comparison of relative software performance and revealed ClipKIT
23 is a top-performing software. MSA trimming approaches are ordered along the x-axis from
24 the highest-performing software (left) to the lowest-performing software (right) according to average
25 desirability-based rank, which is derived from measures of tree accuracy (i.e., normalized
26 Robinson-Foulds distance) and tree support (i.e., average bipartition support).
27
28 Abbreviations of trimmers and parameters are as follows:
29 ClipKIT: g = gappy mode; ClipKIT: kc = kpic mode; ClipKIT: kcg = kpic-gappy mode; ClipKIT: k = kpi mode;
30 ClipKIT: kg = kpi-gappy mode; BMGE = BMGE default; BMGE 0.3 = 0.3 entropy threshold;
31 BMGE 0.7 = 0.7 entropy threshold; trimAl: s = strict; trimAl: sp = strictplus; Noisy = default;
32 Gblocks = default; No trim = no trimming.
33
34 For additional details about performance assessment, please see *ClipKIT: a multiple sequence
35 alignment trimming software for accurate phylogenomic inference*. Steenwyk et al. PLoS Biology. doi: |doiLink|_.
36
37 .. _doiLink: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007
38 .. |doiLink| replace:: 10.1371/journal.pbio.3001007
39
40 |
41
42 smart-gap
43 ---------
44
45 ^^^^^
46
47 Starting with version 1.1.0, a dynamic gappyness threshold determination approach (referred to
48 as smart-gap) has been introduced into ClipKIT and is now the default trimming approach.
49 smart-gap was motivated by the excessive trimming observed among highly divergent sequences.
50
51 .. image:: ../_static/img/smart_gaps_trimming.png
52
53 For example, in the figure above, we simulated 100 sequences for various trees with 100 tips.
54 Each tree had a different total tree length, a measure of total evolutionary divergence (x-axis).
55 Differences in total tree length were generated by multiplying the branch lengths of the starting
56 random tree (generated using IQ-TREE 2) by a factor ranging from 0.25 to 10. Thus, the same tree
57 shape and relative branch lengths were used during the simulations. Simulations were generated using
58 INDELible. Examining the percentage of the alignment remaining after trimming revealed that using a strict
59 gappy threshold of 90% resulted in 'extreme' trimming, which is not recommended (|TanLink|_).
60 In contrast, smart-gap retains a large fraction of the alignment and only removes the most
61 gappy sites. Thus, smart-gap is a better approach for sequence alignments that span deep and
62 shallow evolutionary timescales.
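
To make the quantity plotted above concrete, the following is a minimal sketch of fixed-threshold gappy trimming and of the "percentage of the alignment remaining" measure. It is not ClipKIT's implementation: the function names ``trim_gappy_sites`` and ``percent_remaining`` are hypothetical, only ``-`` is treated as a gap character, the toy alignment is invented for illustration, and the "greater than or equal to" comparison is an assumption based on the wording used later in this section.

```python
def trim_gappy_sites(seqs, threshold=0.9, gap_chars="-"):
    """Drop every site (column) whose gap fraction is >= threshold.

    Assumption: a site is removed when its gap fraction is greater than
    or equal to the threshold; this is an illustrative rule, not taken
    from ClipKIT's source code.
    """
    n_seqs = len(seqs)
    keep = [
        i
        for i, column in enumerate(zip(*seqs))
        if sum(char in gap_chars for char in column) / n_seqs < threshold
    ]
    return ["".join(seq[i] for i in keep) for seq in seqs]


def percent_remaining(original, trimmed):
    """Percentage of alignment columns that survive trimming."""
    return 100.0 * len(trimmed[0]) / len(original[0])


# Toy alignment (hypothetical data): per-site gap fractions are
# 0, 0.75, 0.5, 0 and 0.75.
toy = ["ACGT-", "A-GT-", "A--T-", "A--TG"]

trimmed = trim_gappy_sites(toy, threshold=0.75)
print(trimmed, percent_remaining(toy, trimmed))
```

With a threshold of 0.75 the two sites that are 75% gaps are dropped, while a threshold of 0.9 leaves this toy alignment untouched. A fixed threshold therefore behaves very differently depending on how gappy the alignment is overall.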
63
64 More specifically, when implementing the smart-gap approach, ClipKIT first examines the
65 distribution of gaps across the alignment. Next, ClipKIT determines the gap-to-gap slope
66 between each gappyness bin. By examining the maximum difference in the slope between each
67 adjacent bin, ClipKIT determines what step would correspond to removing a large number
68 of sites in comparison to other steps. Of note, ClipKIT only examines the first half of
69 slopes calculated so as to not trim too much of the alignment. ClipKIT will then choose
70 the threshold that ensures this large number of sites will not be trimmed.
71
72 .. _TanLink: https://academic.oup.com/sysbio/article/64/5/778/1685763
73 .. |TanLink| replace:: Tan *et al.* (2015)
74
75 For example, in the following test alignment:
76
77 .. code-block:: shell
78
79 >1
80 A-GTAT-
81 >2
82 A-G-AT-
83 >3
84 A-G-TA-
85 >4
86 AGA-TA-
87 >5
88 ACa-T-G
89
90 there are two sites with four gaps (80% gaps), one site with three gaps
91 (60% gaps), and one site with one gap (20% gaps). ClipKIT first calculates
92 the slope between two points: trimming sites with greater than or equal to
93 80% gaps, which removes 2/7ths of the alignment, and trimming sites with
94 greater than or equal to 60% gaps, which removes 3/7ths of the alignment.
95 Next, ClipKIT calculates the slope between trimming sites with greater than
96 or equal to 60% gaps (3/7ths removed) and trimming sites with greater than
97 or equal to 20% gaps (4/7ths removed), and so on. ClipKIT will then examine
98 the first half of slope values and use the less strict gaps threshold
99 from the two points that generated the greatest difference between
100 consecutive slopes.
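
The arithmetic in this example can be checked with a short sketch. It only tabulates the quantities named in the text (per-site gap fractions, the fraction of the alignment removed at each threshold, and the slopes between consecutive thresholds); it is written from the prose description rather than from ClipKIT's source, and it deliberately stops before the final threshold selection. The helper names (``parse_fasta``, ``gap_fractions``, ``removal_curve``, ``slopes``) are hypothetical, and treating only ``-`` as a gap is an assumption.

```python
from collections import Counter
from fractions import Fraction

ALIGNMENT = """
>1
A-GTAT-
>2
A-G-AT-
>3
A-G-TA-
>4
AGA-TA-
>5
ACa-T-G
"""


def parse_fasta(text):
    """Return the sequences of a FASTA string, in file order."""
    seqs = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seqs.append("")
        else:
            seqs[-1] += line
    return seqs


def gap_fractions(seqs, gap_chars="-"):
    """Exact gap fraction of every site (column)."""
    n_seqs = len(seqs)
    return [
        Fraction(sum(char in gap_chars for char in column), n_seqs)
        for column in zip(*seqs)
    ]


def removal_curve(fractions):
    """(threshold, fraction of sites removed) pairs, highest threshold first.

    A site counts as removed when its gap fraction is >= the threshold.
    """
    counts = Counter(fractions)
    removed = 0
    curve = []
    for threshold in sorted(counts, reverse=True):
        removed += counts[threshold]
        curve.append((threshold, Fraction(removed, len(fractions))))
    return curve


def slopes(curve):
    """Absolute slope between each pair of consecutive curve points."""
    return [
        abs((y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    ]


curve = removal_curve(gap_fractions(parse_fasta(ALIGNMENT)))
slope_values = slopes(curve)
# Differences between consecutive slopes, the quantity the text says is maximized.
slope_diffs = [abs(a - b) for a, b in zip(slope_values, slope_values[1:])]

for threshold, removed in curve:
    print(f">= {float(threshold):.0%} gaps removes {removed} of the alignment")
print("slopes:", [str(s) for s in slope_values])
print("slope differences:", [str(d) for d in slope_diffs])
```

Exact fractions are used so that values such as 2/7 and 3/7 match the text without floating-point rounding. How "the first half of slope values" is delimited and which of the two points supplies the final threshold are left to ClipKIT itself.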
101
102