annotate test-data/2021-04-21/supporting_information/data_prep_description.md @ 6:437e28791761 draft
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/pangolin commit f53dc92d3cf6997da9ad21f01ad3fd3477aae068"
author | iuc |
---|---|
date | Wed, 07 Jul 2021 09:24:31 +0000 |
parents | 42126b414951 |
children |
rev | line source |
---|---|
4
42126b414951
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/pangolin commit ab174c9f8cbfc741501068dfa4f6ccf229a54489"
iuc
parents:
diff
changeset
|
# Data preparation

### Source

All GISAID data is downloaded and run through [`grapevine`](https://github.com/cov-ert/grapevine), which excludes records without proper dates, removes duplicate sequences (keeping the earliest sample among the duplicates), omits some sequences with known issues, filters by length and coverage, and trims the sequences to the CDS.
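
The date, duplicate, length and coverage filters described above can be sketched roughly as follows. This is not `grapevine`'s code: the `Record` type, the `coverage` definition (fraction of unambiguous A/C/G/T bases) and both thresholds are assumptions made for illustration, and the known-issue exclusions and CDS trimming are left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """Hypothetical record type; grapevine's own data model may differ."""
    name: str
    sequence: str
    sample_date: Optional[date]  # None when the date is missing or incomplete


def coverage(sequence: str) -> float:
    """Fraction of positions that are unambiguous bases (A, C, G or T)."""
    if not sequence:
        return 0.0
    called = sum(base in "ACGT" for base in sequence.upper())
    return called / len(sequence)


def filter_records(records, min_length=29000, min_coverage=0.95):
    """Drop undated, short or low-coverage records, then deduplicate.

    The default thresholds are illustrative assumptions, not grapevine's settings.
    """
    earliest = {}  # sequence -> earliest-dated record carrying that sequence
    for rec in records:
        if rec.sample_date is None:
            continue  # no usable date
        if len(rec.sequence) < min_length or coverage(rec.sequence) < min_coverage:
            continue  # too short or too many ambiguous bases
        kept = earliest.get(rec.sequence)
        if kept is None or rec.sample_date < kept.sample_date:
            earliest[rec.sequence] = rec
    # Sort by date, then name, so the output order is deterministic.
    return sorted(earliest.values(), key=lambda r: (r.sample_date, r.name))
```

Keying the dictionary on the sequence string means identical sequences collapse to a single entry, and the date comparison ensures that the surviving entry is the earliest sample.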

It also aligns the sequences using `mafft` and builds an ML tree using `iqtree`. A lineage is assigned to each sequence using `pangolin` with the previous data release.
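
As a rough illustration of how these three tools chain together, the sketch below only builds the argument lists and runs nothing. The text does not say which options `grapevine` passes, so the flags (`--auto`, `-bb 1000`, `--outdir`), the file names and the `build_commands` helper are assumptions. Recent IQ-TREE releases also accept `-B` for ultrafast bootstraps.

```python
from pathlib import Path


def build_commands(fasta: Path, workdir: Path):
    """Return the alignment path and argument lists for the three steps.

    Hypothetical helper: the option choices are assumptions, not grapevine's.
    """
    alignment = workdir / "aligned.fasta"
    commands = {
        # mafft writes the alignment to stdout, so the caller has to
        # redirect it into `alignment` when running this command.
        "align": ["mafft", "--auto", str(fasta)],
        # -s names the input alignment; -bb 1000 requests 1000 ultrafast
        # bootstrap replicates so that branches carry support values.
        "tree": ["iqtree", "-s", str(alignment), "-bb", "1000"],
        # pangolin takes the query FASTA as a positional argument and
        # writes its report into the directory given by --outdir.
        "assign": ["pangolin", str(fasta), "--outdir", str(workdir)],
    }
    return alignment, commands
```

Each list could then be passed to `subprocess.run`, with the `mafft` output redirected into the alignment file before `iqtree` is started.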

### Lineage Curation

The phylogeny is annotated with lineage assignments, and the lineages are then manually curated in `FigTree`, drawing together several pieces of information, including monophyly in the ML phylogeny (a bootstrap value > 70 is generally required) and epidemiological data such as country and travel history. Any changes to lineage definitions and any new lineages are documented during this process.

- The lineage may have been defined earlier in the outbreak, and with added sequence data there is less support for that lineage. In these cases the associated epidemiological metadata is examined and the lineage may be refined or even dropped entirely. The lineage number will not be 'recycled', but its members will be reassigned to the parent lineage designation.
- The lineage may have very clear epidemiological support, while ambiguities or homoplasies in the sequences or tree contribute to low bootstrap values. In these cases, if the epidemiological support is strong, the lineages are called. Recall rates for these lineages within `pangolin` may be lower, however.
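
The curation itself is manual, but the decision logic described above can be summarised in a small sketch. The `Decision` enum, the `triage_lineage` function and its boolean inputs are hypothetical names introduced here. Only the bootstrap threshold of 70 comes from the text, and a curator would weigh far more context than three arguments.

```python
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    CALL_WITH_CAVEAT = "call, expect lower recall in pangolin"
    REVIEW = "refine, or drop and reassign members to the parent lineage"


def triage_lineage(is_monophyletic: bool, bootstrap: float,
                   strong_epi_support: bool, threshold: float = 70.0) -> Decision:
    """Rough triage of a candidate lineage (illustrative only)."""
    # Well-supported monophyletic clade: the general requirement is met.
    if is_monophyletic and bootstrap > threshold:
        return Decision.KEEP
    # Weak tree support but clear epidemiological evidence: still called.
    if strong_epi_support:
        return Decision.CALL_WITH_CAVEAT
    # Otherwise the lineage is a candidate for refinement or removal.
    return Decision.REVIEW
```

For example, `triage_lineage(True, 40.0, True)` returns `Decision.CALL_WITH_CAVEAT`, matching the second case in the list: the lineage is still called, with the caveat that recall may be lower.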