diff clipkit_repo/docs/performance_assessment/index.rst @ 0:49b058e85902 draft

"planemo upload for repository https://github.com/jlsteenwyk/clipkit commit cbe1e8577ecb1a46709034a40dff36052e876e7a-dirty"
author padge
date Fri, 25 Mar 2022 13:04:31 +0000
parents
children
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/clipkit_repo/docs/performance_assessment/index.rst	Fri Mar 25 13:04:31 2022 +0000
@@ -0,0 +1,102 @@
+.. _performance:
+
+
+Performance Assessment
+======================
+
+|
+
+Benchmarking
+------------
+
+^^^^^
+
+In brief, a performance assessment comparing multiple alignment trimming software
+revealed that ClipKIT is a top-performing software.
+
+.. image:: ../_static/img/Performance_summary.jpg
+
+**ClipKIT is a top-performing software for trimming multiple sequence alignments.** 
+Across a total of 138,152 multiple sequence alignments (MSAs) from empirical (left) and
+simulated (right) datasets, desirability-based integration of accuracy and support metrics
+per MSA facilitated the comparison of relative software performance and revealed ClipKIT
+is a top-performing software. MSA trimming approaches are ordered along the x-axis from
+the highest-performing software (left) to the lowest-performing software (right) according to average
+desirability-based rank, which is derived from measures of tree accuracy (i.e., normalized
+Robinson-Foulds distance) and tree support (i.e., average bipartition support).
+
+Abbreviations of trimmers and parameters are as follows: 
+ClipKIT: g = gappy mode; ClipKIT: kc = kpic mode; ClipKIT: kcg = kpic-gappy mode; ClipKIT: k = kpi mode;
+ClipKIT: kg = kpi-gappy mode; BMGE = BMGE default; BMGE 0.3 = 0.3 entropy threshold;
+BMGE 0.7 = 0.7 entropy threshold; trimAl: s = strict; trimAl: sp = strictplus; Noisy = default;
+Gblocks = default; No trim = no trimming.
+
+For additional details about performance assessment, please see *ClipKIT: a multiple sequence
+alignment trimming software for accurate phylogenomic inference*. Steenwyk et al. PLoS Biology. doi: |doiLink|_.
+
+.. _doiLink: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007
+.. |doiLink| replace:: 10.1371/journal.pbio.3001007 
+
+|
+
+smart-gap
+---------
+
+^^^^^
+
+Starting with version 1.1.0, ClipKIT includes a dynamic approach for determining the gappyness
+threshold (referred to as smart-gap), which is now the default trimming approach. The approach
+was motivated by the excessive trimming that a fixed threshold causes among highly divergent sequences.
+
+.. image:: ../_static/img/smart_gaps_trimming.png
+
+For example, in the figure above, we simulated 100 sequences for various trees with 100 tips. 
+Each tree had a different total tree length, a measure of total evolutionary divergence (x-axis).
+Differences in total tree length were generated by multiplying the branch lengths of the starting
+random tree (generated using IQ-TREE 2) by a factor ranging from 0.25 to 10. Thus, the same tree
+shape and relative branch lengths were used during the simulations. Simulations were generated using
+INDELible. Examining the percentage of the alignment remaining after trimming revealed that using a
+strict gappy threshold of 90% resulted in 'extreme' trimming, which is not recommended (|TanLink|_).
+In contrast, smart-gap retains a large fraction of the alignment and only removes the most
+gappy sites. Thus, smart-gap is a better approach for sequence alignments that span deep and
+shallow evolutionary timescales.
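
To make the fixed-threshold behavior concrete, the following is a minimal sketch of gappy trimming, assuming that a site is removed when its fraction of gaps is greater than or equal to the threshold. The function name ``trim_gappy_sites`` and the toy alignment are hypothetical and are not taken from the ClipKIT source code.

```python
def trim_gappy_sites(alignment, threshold, gap_chars="-"):
    """Drop every site (column) whose gap fraction is >= threshold.

    `alignment` maps sequence names to equal-length aligned sequences.
    This is an illustrative sketch, not ClipKIT's implementation.
    """
    names = list(alignment)
    n_seqs = len(names)
    # Transpose the rows so that each tuple is one alignment column.
    columns = zip(*(alignment[name] for name in names))
    kept = [
        col for col in columns
        if sum(ch in gap_chars for ch in col) / n_seqs < threshold
    ]
    # Transpose back to rows; handle the case where every site was removed.
    rows = ["".join(chars) for chars in zip(*kept)] if kept else [""] * n_seqs
    return dict(zip(names, rows))


# Hypothetical toy alignment: site 2 is 50% gaps, site 3 is 100% gaps,
# and site 5 is 25% gaps.
toy = {"a": "AC-GT", "b": "A--GT", "c": "A--G-", "d": "AT-GT"}

trimmed_90 = trim_gappy_sites(toy, 0.9)  # removes only the all-gap site
trimmed_50 = trim_gappy_sites(toy, 0.5)  # also removes the 50% gap site

for label, result in (("0.9", trimmed_90), ("0.5", trimmed_50)):
    remaining = 100 * len(result["a"]) / len(toy["a"])
    print(f"threshold {label}: {remaining:.0f}% of sites remain")
```

With a single fixed threshold, the fraction of the alignment that remains depends entirely on how gappy the input is, which is the behavior smart-gap is meant to soften.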
+
+More specifically, when implementing the smart-gap approach, ClipKIT first examines the
+distribution of gaps across the alignment. Next, ClipKIT determines the gap-to-gap slope
+between each pair of adjacent gappyness bins. By examining the maximum difference between
+consecutive slopes, ClipKIT determines which step would correspond to removing a large number
+of sites in comparison to other steps. Of note, ClipKIT only examines the first half of the
+slopes calculated so as to not trim too much of the alignment. ClipKIT will then choose
+the threshold that ensures that this large number of sites will not be trimmed.
+
+.. _TanLink: https://academic.oup.com/sysbio/article/64/5/778/1685763
+.. |TanLink| replace:: Tan *et al.* (2015)
+
+For example, in the following test alignment:
+
+.. code-block:: text
+
+    >1
+    A-GTAT-
+    >2
+    A-G-AT-
+    >3
+    A-G-TA-
+    >4
+    AGA-TA-
+    >5
+    ACa-T-G
+
+there are two sites with four gaps (80% gaps), one site with three gaps
+(60% gaps), and one site with one gap (20% gaps). ClipKIT will first
+calculate the slope between two points: trimming sites with greater than
+or equal to 80% gaps, which removes 2/7ths of the alignment, and trimming
+sites with greater than or equal to 60% gaps, which removes 3/7ths of the
+alignment. Next, ClipKIT will determine the slope between trimming sites
+with greater than or equal to 60% gaps (removing 3/7ths of the alignment)
+and trimming sites with greater than or equal to 20% gaps (removing 4/7ths
+of the alignment), and so on. ClipKIT will then examine the first half of
+the slope values and use the less strict gaps threshold from the two
+points that generated the greatest difference between consecutive slopes.
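
The numbers in this example can be checked with a short script. The sketch below counts the gaps per site in the test alignment, derives the fraction of the alignment removed at each gappyness threshold, and computes the slope between consecutive points. The variable names and the slope convention (change in fraction removed per unit decrease in threshold) are assumptions made for illustration; the final threshold selection is deliberately not reproduced, and the ClipKIT source code remains the reference for the exact calculation.

```python
from collections import Counter
from fractions import Fraction

# The test alignment shown above, one string per sequence.
alignment = ["A-GTAT-", "A-G-AT-", "A-G-TA-", "AGA-TA-", "ACa-T-G"]
n_seqs = len(alignment)
n_sites = len(alignment[0])

# Number of gaps at each site (column).
gap_counts = [sum(ch == "-" for ch in col) for col in zip(*alignment)]

# Number of sites observed for each non-zero gap count.
sites_per_count = Counter(count for count in gap_counts if count > 0)

# For each threshold, from the gappiest sites downward, record
# (gappyness threshold, cumulative fraction of sites removed).
points = []
removed = 0
for count in sorted(sites_per_count, reverse=True):
    removed += sites_per_count[count]
    points.append((Fraction(count, n_seqs), Fraction(removed, n_sites)))

# Assumed convention: slope = increase in fraction removed divided by
# the decrease in threshold between two consecutive points.
slopes = [
    (removed_b - removed_a) / (threshold_a - threshold_b)
    for (threshold_a, removed_a), (threshold_b, removed_b)
    in zip(points, points[1:])
]

print("gaps per site:", gap_counts)
for threshold, fraction_removed in points:
    print(f">= {float(threshold):.0%} gaps removes {fraction_removed} of sites")
print("slopes:", [str(slope) for slope in slopes])
```

Exact fractions are used instead of floats so that values such as 2/7 and 3/7 match the text directly.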
+
+