annotate test-data/2021-04-21/supporting_information/data_prep_description.md @ 6:437e28791761 draft

"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/pangolin commit f53dc92d3cf6997da9ad21f01ad3fd3477aae068"
author iuc
date Wed, 07 Jul 2021 09:24:31 +0000
parents 42126b414951
children
rev   line source
4
42126b414951 "planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/pangolin commit ab174c9f8cbfc741501068dfa4f6ccf229a54489"
iuc
parents:
diff changeset
# Data preparation

### Source
All GISAID data is downloaded and run through [`grapevine`](https://github.com/cov-ert/grapevine), which excludes records without proper dates, removes duplicate sequences (keeping the earliest sample of each set of duplicates), omits some sequences with known issues, filters by length and coverage, and trims the sequences to the CDS.

It also aligns the sequences using `mafft` and builds an ML tree using `iqtree`. A lineage is assigned to each sequence using `pangolin` with the previous data release.
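
The record-level filters described above can be sketched in a few lines of Python. This is only an illustration of the logic, not `grapevine`'s actual code: the `Record` type, the function names, the date format (`YYYY-MM-DD`), the coverage definition (fraction of unambiguous A/C/G/T bases) and both thresholds are assumptions made for the example. The known-issues exclusion list and the CDS trimming are not modelled.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical thresholds, chosen only for illustration.
MIN_LENGTH = 29000
MIN_COVERAGE = 0.95


@dataclass
class Record:
    name: str
    sequence: str
    sample_date: str  # assumed to be YYYY-MM-DD when complete


def parse_date(text: str) -> Optional[date]:
    """Return a date for a complete year-month-day string, otherwise None."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


def coverage(sequence: str) -> float:
    """Fraction of positions that are an unambiguous A, C, G or T."""
    if not sequence:
        return 0.0
    called = sum(sequence.upper().count(base) for base in "ACGT")
    return called / len(sequence)


def filter_records(
    records: Iterable[Record],
    min_length: int = MIN_LENGTH,
    min_coverage: float = MIN_COVERAGE,
) -> List[Record]:
    """Drop undated, short or low-coverage records, then keep the
    earliest-sampled record for each distinct sequence."""
    earliest: Dict[str, Tuple[date, Record]] = {}
    for rec in records:
        sampled = parse_date(rec.sample_date)
        if sampled is None:
            continue  # no proper date
        if len(rec.sequence) < min_length:
            continue  # too short
        if coverage(rec.sequence) < min_coverage:
            continue  # too many ambiguous bases
        kept = earliest.get(rec.sequence)
        if kept is None or sampled < kept[0]:
            earliest[rec.sequence] = (sampled, rec)
    return [rec for _, rec in earliest.values()]
```

Because identical sequences share the same length and coverage, applying those two filters before or after de-duplication gives the same result in this sketch.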
### Lineage Curation

The phylogeny is annotated with lineages, which are then manually curated in `FigTree`. Curation draws together several pieces of information, including monophyly in the ML phylogeny (generally a bootstrap value > 70 is required) and epidemiological data such as country and travel history. Any changes to lineage definitions, and any new lineages, are documented during this process.
- The lineage may have been defined earlier in the outbreak and, with added sequence data, now has less support. In these cases the associated epidemiological metadata is examined, and the lineage may be refined or even dropped entirely. The lineage number will not be 'recycled', but the members will be reassigned the parent lineage designation.
- The lineage may have very clear epidemiological support, while ambiguities or homoplasies in the sequences or tree could contribute to low bootstrap values. In these cases, if the support is strong, the lineage is called. Recall rates for these lineages within `pangolin` may be lower, however.
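
Curation is a manual process, so the following Python sketch is only a rough summary of the decision logic described above, not a tool that is actually used. The names `Decision`, `triage` and `parent_lineage` are hypothetical. The parent lookup simply assumes that a sub-lineage name is its parent's name plus one more dot-separated number, and it ignores lineage aliases.

```python
from enum import Enum
from typing import Optional

# "generally a bootstrap > 70 is required"
BOOTSTRAP_THRESHOLD = 70


class Decision(Enum):
    KEEP = "keep lineage"
    CALL_WITH_CAVEAT = "call lineage; recall in pangolin may be lower"
    REVIEW = "examine metadata; refine or drop lineage"


def triage(is_monophyletic: bool, bootstrap: float,
           strong_epi_support: bool) -> Decision:
    """Summarise the curation outcome for one candidate lineage."""
    if is_monophyletic and bootstrap > BOOTSTRAP_THRESHOLD:
        return Decision.KEEP
    if strong_epi_support:
        # Weak tree support, but clear epidemiological support.
        return Decision.CALL_WITH_CAVEAT
    return Decision.REVIEW


def parent_lineage(lineage: str) -> Optional[str]:
    """Drop the last dot-separated component of a lineage name.

    Returns None when the name has no dot. Alias handling is
    deliberately left out of this sketch.
    """
    head, sep, _ = lineage.rpartition(".")
    return head if sep else None
```

If a lineage is dropped after review, its members would take the name returned by `parent_lineage`, and the dropped name itself is not reused.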