annotate README.md @ 1:d1c88b118a3f draft

Uploaded
author damion
date Fri, 13 Mar 2015 20:59:28 -0400
parents
children 671667722d3d
Feature Frequency Profile Phylogenies
=====================================


Introduction
------------

FFP (feature frequency profile) is an alignment-free comparison tool for phylogenetic analysis. It can be applied to nucleotide sequences, complete genomes, and proteomes, and can even be used for text comparison. This software is a Galaxy (http://galaxyproject.org) tool for calculating FFPs on one or more FASTA sequence or text datasets.

The original command-line ffp-phylogeny code is at http://ffp-phylogeny.sourceforge.net/ . This tool uses Aaron Petkau's modified version: https://github.com/apetkau/ffp-3.19-custom . Aaron also has a good writeup of the technique at https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny .

This Galaxy tool prepares a mini-pipeline consisting of **[ffpry | ffpaa | ffptxt] > [ffpfilt | ffpcol > ffprwn] > ffpjsd > ffptree**. The last step is optional: if the "Generate Tree Phylogeny" checkbox is deselected, the tool outputs a distance matrix rather than a Newick-formatted (.nhx) tree file.

Each sequence or text file gets a profile containing a tally of each feature found. A feature is a string of valid characters of a given length.
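
As a rough illustration of what such a profile looks like, here is a minimal Python sketch. It is not the ffp-phylogeny implementation: the function name `feature_profile` is hypothetical, and it assumes features are read from overlapping windows that advance one character at a time.

```python
from collections import Counter

def feature_profile(text, length):
    """Tally every substring of the given length.

    Assumption: windows overlap and advance one character at a time.
    """
    return Counter(text[i:i + length] for i in range(len(text) - length + 1))

# The windows of length 3 in "ATGATG" are ATG, TGA, GAT, ATG.
profile = feature_profile("ATGATG", 3)
```

Here `profile["ATG"]` is 2 and the other two features are each seen once. A text shorter than the feature length simply yields an empty profile in this sketch.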

For nucleotide data, by default each character (ATGC) is grouped as either a purine (R) or a pyrimidine (Y) before being counted.

For amino acid data, by default each character is grouped into one of the following: (ST), (DE), (KQR), (IVLM), (FWY), C, G, A, N, H, P. Each group is represented by the first character in its series.
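
The two default groupings can be pictured as simple character translations. The sketch below is only an illustration in Python; the names `RY_MAP`, `AA_GROUPS`, `group_nucleotides` and `group_amino_acids` are hypothetical, and leaving unlisted characters unchanged is an assumption of the sketch, not a statement about how ffpry or ffpaa behave.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_MAP = str.maketrans("AGCT", "RRYY")

# Amino acid groups as listed above; each group maps to its first character.
AA_GROUPS = ["ST", "DE", "KQR", "IVLM", "FWY", "C", "G", "A", "N", "H", "P"]
AA_MAP = {residue: group[0] for group in AA_GROUPS for residue in group}

def group_nucleotides(seq):
    """Recode a nucleotide string into the two-letter R/Y alphabet."""
    return seq.upper().translate(RY_MAP)

def group_amino_acids(seq):
    """Recode a protein string; unlisted characters pass through unchanged
    (an assumption made only for this sketch)."""
    return "".join(AA_MAP.get(ch, ch) for ch in seq.upper())
```

For example, `group_nucleotides("ATGC")` gives `"RYRY"`, and `group_amino_acids("MKTE")` gives `"IKSD"`.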

Another key concept is that a given feature, e.g. "TAA", is counted in both the forward AND reverse directions, reflecting the idea that a feature's orientation matters little in alignment-free comparison. The counts for "TAA" and "AAT" are merged.
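
A small Python sketch of the merging idea follows. The function name `merge_reversed` is hypothetical, and picking the alphabetically smaller string as the shared key is an assumption made for the example; the actual programs may choose the representative differently.

```python
from collections import Counter

def merge_reversed(profile):
    """Combine the count of each feature with the count of its reversed string."""
    merged = Counter()
    for feature, count in profile.items():
        # Assumption: the alphabetically smaller of the pair is used as the key.
        key = min(feature, feature[::-1])
        merged[key] += count
    return merged

merged = merge_reversed(Counter({"TAA": 3, "AAT": 2, "ATA": 1}))
```

In this sketch "TAA" and "AAT" end up under the single key "AAT" with a count of 5, while the palindromic "ATA" keeps its count of 1.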

The labeling of the resulting counted feature items is perhaps the trickiest concept to master. Due to computational efficiency measures taken by the developers, a feature that we see on paper as "TAC" may be stored and labeled internally as "GTA", its reverse complement. If the original feature is not found, look for its reverse complement instead.
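
When inspecting a profile by hand, the "look for the alternative" step can be sketched as below. This is an illustrative Python helper, not part of ffp-phylogeny: the names `reverse_complement` and `lookup` are hypothetical, and the sketch assumes ungrouped A/C/G/T features stored in a plain dictionary.

```python
# Watson-Crick complements for ungrouped nucleotide features.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(feature):
    """Complement each base, then reverse the string."""
    return feature.translate(COMPLEMENT)[::-1]

def lookup(profile, feature):
    """Return the count stored under the feature or, failing that,
    under its reverse complement (0 if neither is present)."""
    if feature in profile:
        return profile[feature]
    return profile.get(reverse_complement(feature), 0)
```

With a profile of `{"GTA": 4}`, `lookup(profile, "TAC")` returns 4 because "GTA" is the reverse complement of "TAC".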

Note that in amino acid sequences the stop codon "*" (or any other character that is not in the amino acid alphabet) causes any character frame containing it not to be counted. Character frames also never span FASTA entries.
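
Both rules can be seen in a short Python sketch. The names `AMINO_ACIDS` and `count_frames` are hypothetical, and the alphabet used here (the 20 standard one-letter codes) is an assumption; the alphabet accepted by ffpaa may differ.

```python
from collections import Counter

# Assumption: the 20 standard amino acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def count_frames(records, length, alphabet=AMINO_ACIDS):
    """Count frames per record, skipping any frame with an invalid character."""
    counts = Counter()
    for seq in records:
        # Each record is scanned on its own, so no frame spans two entries.
        for i in range(len(seq) - length + 1):
            frame = seq[i:i + length]
            if all(ch in alphabet for ch in frame):
                counts[frame] += 1
    return counts

counts = count_frames(["MK*LV", "AC"], 2)
```

Here the frames "K*" and "*L" are dropped because of the stop character, and no "VA" frame appears even though "V" ends the first entry and "A" starts the second.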

A few tutorials:
* http://sourceforge.net/projects/ffp-phylogeny/files/Documentation/tutorial.pdf
* https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny

-------
**Note**

Taxonomy label details: If each file contains one profile, the file's name is used to label the profile. If each file contains FASTA sequences to profile individually, their FASTA identifiers are used to label them. The "short labels" option finds the shortest label that uniquely identifies each profile. Either way, there are some quirks: ffpjsd clips labels to 10 characters if they are longer than 50 characters, so all labels are trimmed to 50 characters first. Also, "id" is prefixed to any numeric label, since some tree visualizers won't show purely numeric labels. In the accidental case where a FASTA sequence label duplicates a previous one, it is prefixed with "DupLabel-".
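
The label quirks can be summarized in a few lines of Python. This is a sketch only: the function name `normalize_labels` is hypothetical, the order in which the three rules are applied is an assumption, and "numeric" is taken to mean "digits only".

```python
def normalize_labels(labels, max_len=50):
    """Apply the trimming, numeric-prefix and duplicate-prefix rules in turn."""
    seen = set()
    result = []
    for label in labels:
        label = label[:max_len]           # trim to 50 characters first
        if label.isdigit():               # assumption: numeric means digits only
            label = "id" + label
        if label in seen:                 # repeat of an earlier label
            label = "DupLabel-" + label
        seen.add(label)
        result.append(label)
    return result
```

For example, `normalize_labels(["123", "ecoli", "ecoli"])` gives `["id123", "ecoli", "DupLabel-ecoli"]`.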

The command-line ffpjsd can hang if it is given an l-mer length greater than the length of the file content. If that happens, identify its process id ("ps aux | grep ffpjsd") and kill it ("kill [process id]").
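
One way to avoid the hang is to check the l-mer length before launching the pipeline. The Python sketch below shows the idea; the function names are hypothetical, and counting only non-header sequence characters as "file content" is an assumption about what the length limit refers to.

```python
def content_length(fasta_text):
    """Count sequence characters, skipping FASTA header lines
    (assumed to be the lines starting with '>')."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )

def lmer_length_is_safe(fasta_text, length):
    """True when the requested l-mer length does not exceed the content length."""
    return length <= content_length(fasta_text)
```

With the input `">s1\nACGT\n"`, a length of 4 passes the check and a length of 5 does not.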

Finally, the ffptree program can generate a tree in which some of the branch distances are negative. See https://www.biostars.org/p/45597/

-------
**References**

The development of the ffp-phylogeny command-line software should be attributed to:

Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 2009;106(8):2677-2682. doi:10.1073/pnas.0813249106.