view clipkit_repo/docs/performance_assessment/index.rst @ 0:49b058e85902 draft

"planemo upload for repository https://github.com/jlsteenwyk/clipkit commit cbe1e8577ecb1a46709034a40dff36052e876e7a-dirty"
author padge
date Fri, 25 Mar 2022 13:04:31 +0000
parents
children
line wrap: on
line source

.. _performance:


Performance Assessment
======================

|

Benchmarking
------------

^^^^^

In brief, performance assessment and comparison of multiple trimming alignment software
revealed that ClipKIT is a top-performing software.

.. image:: ../_static/img/Performance_summary.jpg

**ClipKIT is a top-performing software for trimming multiple sequence alignments.** 
Across a total of 138,152 multiple sequence alignments (MSAs) from empirical (left) and
simulated (right) datasets, desirability-based integration of accuracy and support metrics
per MSA facilitated the comparison of relative software performance and revealed ClipKIT
is a top-performing software. MSA trimming approaches are ordered along the x-axis from
the highest-performing software (left) to the lowest-performing software (right) according to average
desirability-based rank, which is derived from measures of tree accuracy (i.e., normalized Robinson
Foulds distance) and tree support (i.e., average bipartition support). 

Abbreviations of trimmers and parameters are as follows: 
ClipKIT: g = gappy mode; ClipKIT: kc = kpic; ClipKIT: kcg = kpic-gappy; ClipKIT: k = kpi mode;
ClipKIT: kg = kpi-gappy mode; BMGE = BMGE default; BMGE 0.3 = 0.3 entropy threshold;
BMGE 0.7 = 0.7 entropy threshold; trimAl: s = strict; trimAl: sp = strictplus; Noisy = default;
Gblocks = default; No trim = no trimming.

For additional details about performance assessment, please see *ClipKIT: a multiple sequence
alignment trimming software for accurate phylogenomic inference*. Steenwyk et al. PLoS Biology. doi: |doiLink|_.

.. _doiLink: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007
.. |doiLink| replace:: 10.1371/journal.pbio.3001007 

|

smart-gap
---------

^^^^^

Starting with version 1.1.0, a dynamic gappyness threshold determination approach (referred to 
as smart-gap) has been introduced into ClipKIT and is now the default trimming approach. The 
motivation of smart-gap stems from excessive trimming among highly divergent sequences.

.. image:: ../_static/img/smart_gaps_trimming.png

For example, in the figure above, we simulated 100 sequences for various trees with 100 tips. 
Each tree had a different total tree length, a measure of total evolutionary divergence (x-axis).
Differences in total tree length were generated by multiplying the branch lengths of the starting
random tree (generated using IQTREE2) by a factor ranging from 0.25 to 10. Thus, the same tree
shape and relative branch lengths were used during the simulations. Simulations were generated using
INDELible. Examining the percentage of the alignment remaining after trimming revealed using a strict 
gappy threshold of 90% resulted in 'extreme' trimming, which is not recommended (|TanLink|_).
In contrast, smart-gap retains a large fraction of the alignment and only removes the most
gappy sites. Thus, smart-gap is a better approach for sequence alignments that span deep and
shallow evolutionary timescales.

More specifically, when implementing the smart-gap approach, ClipKIT first examines the 
distribution of gaps across the alignment. Next, ClipKIT determines the gap-to-gap slope
between each gappyness bin. By examining the maximum difference in the slope between each
adjacent bin, ClipKIT determines what step would correspond to removing a large number
of sites in comparison to other steps. Of note, ClipKIT only examines the first half of
slopes calculated so as to not trim too much of the alignment. ClipKIT will then choose
the threshold that ensures the large number of sites will not be trimmed.

.. _TanLink: https://academic.oup.com/sysbio/article/64/5/778/1685763
.. |TanLink| replace:: Tan *et al.* (2015)

For example, in the the following test alignment:

.. code-block:: shell

    >1
    A-GTAT-
    >2
    A-G-AT-
    >3
    A-G-TA-
    >4
    AGA-TA-
    >5
    ACa-T-G

there are two sites with four gaps, one site with three gaps, and one
site with one gap. ClipKIT will calculate the slope between sites with
greater than or equal to 80% gaps and removing 2/7ths of the alignment
and sites with greater than or equal to 60% gaps and removing 3/7ths
of the alignment. Next, ClipKIT will determine the slope between sites
with greater than or equal to 60% gaps and removing 3/7ths of the
alignment and sites with greater than or equal to 20% gaps and removing 
4/7ths of the alignment and so on and so forth. ClipKIT will then examine
the first half of slope values and use the less strict gaps threshold
from the two points that generated the greatest difference between 
consecutive slopes.