annotate README.md @ 3:79a4a86981d3 draft default tip

Uploaded
author damion
date Thu, 23 Apr 2015 17:47:39 -0400
parents 671667722d3d
Feature Frequency Profile Phylogenies
=====================================

Introduction
------------

FFP (Feature Frequency Profile) is an alignment-free comparison method for phylogenetic analysis. It can be applied to nucleotide sequences, complete genomes, and proteomes, and can also be used to compare plain text. This software is a Galaxy (http://galaxyproject.org) tool for calculating FFPs on one or more FASTA sequence or text datasets.

The original command-line ffp-phylogeny code is at http://ffp-phylogeny.sourceforge.net/ . This tool uses Aaron Petkau's modified version: https://github.com/apetkau/ffp-3.19-custom . Aaron has also written a good explanation of the technique at https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny .

**Installation Note**: The groff package must be installed on your Galaxy server first (it is needed to generate the ffp-phylogeny man pages). If it is missing, a cryptic error occurs: "troff: fatal error: can't find macro file s". Note that this is the full groff package, which is different from the "groff-base" package.

This Galaxy tool prepares a mini-pipeline consisting of **[ffpry | ffpaa | ffptxt] > [ ffpfilt | ffpcol > ffprwn] > ffpjsd > ffptree** . The last step is optional: if the "Generate Tree Phylogeny" checkbox is deselected, the tool outputs a distance matrix rather than a Newick-formatted (.nhx) tree file.
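
To give a feel for what the middle of the pipeline computes, here is a minimal Python sketch of turning raw feature counts into frequencies and comparing two profiles. It assumes, from the program names, that ffprwn performs row normalization and ffpjsd computes a Jensen-Shannon divergence; the function names and the base-2 logarithm are illustrative assumptions, not the ffp source code.

```python
import math


def normalize(counts):
    """Convert raw feature counts into frequencies that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, an assumption) of two profiles."""
    def kl(a, m):
        # Kullback-Leibler term; zero-frequency features contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up counts for two profiles over R/Y features.
a = normalize({"RYR": 2, "YRY": 1, "YRR": 1})
b = normalize({"RYR": 1, "YRY": 3})
print(jensen_shannon(a, b))
```

Identical profiles give a divergence of 0, and profiles with no features in common give 1 under the base-2 convention used here.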

Each sequence or text file yields a profile containing a tally of each feature found in it. A feature is a string of valid characters of a given length (an l-mer).

For nucleotide data, by default each character (A, T, G, C) is classified as either a purine (R) or a pyrimidine (Y) before being counted.
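
As a rough illustration of feature counting with R/Y coding, the Python sketch below recodes a nucleotide string and tallies every overlapping l-mer. It is not the ffpry implementation: the names `ry_code` and `count_features` are hypothetical, and the merging of forward and reverse features is left out.

```python
from collections import Counter

# Purines (A, G) map to R; pyrimidines (C, T) map to Y.
RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def ry_code(seq):
    """Recode a nucleotide string into the two-letter R/Y alphabet."""
    return "".join(RY_MAP[base] for base in seq.upper())


def count_features(seq, length):
    """Tally every overlapping feature (l-mer) of the given length."""
    return Counter(seq[i:i + length] for i in range(len(seq) - length + 1))


profile = count_features(ry_code("ATGCGA"), 3)
print(profile)
```

Here "ATGCGA" becomes "RYRYRR", which contains four overlapping 3-mers, two of them "RYR".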

For amino acid data, by default each character is grouped into one of the following: (ST), (DE), (KQR), (IVLM), (FWY), C, G, A, N, H, P. Each group is represented by the first character in its series.
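
The grouping above can be sketched in Python as a lookup table that maps each residue to the first character of its group. The names `GROUPS`, `AA_MAP` and `reduce_amino_acids` are hypothetical, and passing unknown characters through unchanged is an assumption made only for this illustration.

```python
# Default groups as listed above; single letters form their own group.
GROUPS = ["ST", "DE", "KQR", "IVLM", "FWY", "C", "G", "A", "N", "H", "P"]

# Map every member of a group to the group's first character.
AA_MAP = {member: group[0] for group in GROUPS for member in group}


def reduce_amino_acids(seq):
    """Replace each residue with its group representative.

    Characters outside the alphabet (e.g. "*") are passed through unchanged.
    """
    return "".join(AA_MAP.get(ch, ch) for ch in seq.upper())


print(reduce_amino_acids("MKTWE"))
```

For example, M is represented by I, T by S, W by F and E by D, so "MKTWE" is reduced to "IKSFD".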

Another key concept is that a given feature, e.g. "TAA", is counted in both the forward and reverse directions, reflecting the idea that a feature's orientation matters little in alignment-free comparison. The counts for "TAA" and "AAT" are merged.

The labeling of the resulting counted feature items is perhaps the trickiest concept to master. Due to computational efficiency measures taken by the developers, a feature that we see on paper as "TAC" may be stored and labeled internally as "GTA", its reverse complement. If you cannot find a feature under its original label, look for it under the alternative.
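
When inspecting a profile by hand, the "look for the alternative" advice can be sketched in Python as a lookup that tries a feature first and then its reverse complement. The helper names and the example count are made up, and the sketch makes no claim about which of the two labels the ffp programs actually choose to store.

```python
# Translation table for the nucleotide complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(feature):
    """Complement each base, then reverse the string."""
    return feature.translate(COMPLEMENT)[::-1]


def lookup_count(profile, feature):
    """Return the count stored under the feature or its reverse complement."""
    if feature in profile:
        return profile[feature]
    return profile.get(reverse_complement(feature), 0)


profile = {"GTA": 7}  # made-up count for illustration
print(reverse_complement("TAC"))
print(lookup_count(profile, "TAC"))
```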

Note also that in amino acid sequences the stop codon "*" (or any other character outside the amino acid alphabet) causes any character frame containing it to be skipped. Character frames never span FASTA entries.
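
Both rules can be illustrated with a short Python sketch: each entry is windowed on its own, and any frame containing a character outside the alphabet is dropped. The function name `count_valid_frames` and the example sequences are hypothetical, and residue grouping is omitted to keep the example small.

```python
from collections import Counter

# The 20 standard amino acid letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def count_valid_frames(entries, length):
    """Tally l-mers per entry, skipping any frame with an invalid character."""
    counts = Counter()
    for seq in entries:  # each FASTA entry is windowed separately
        for i in range(len(seq) - length + 1):
            frame = seq[i:i + length]
            if all(ch in AMINO_ACIDS for ch in frame):
                counts[frame] += 1
    return counts


counts = count_valid_frames(["MKV*LA", "GH"], 2)
print(sorted(counts.items()))
```

The frames "V*" and "*L" are skipped because of the stop codon, and no "AG" frame is formed across the boundary between the two entries.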

A few tutorials:
* http://sourceforge.net/projects/ffp-phylogeny/files/Documentation/tutorial.pdf
* https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny

-------
**Note**

Taxonomy label details: If each file contains one profile, the file's name is used to label the profile. If each file contains FASTA sequences to be profiled individually, their FASTA identifiers are used as labels. The "short labels" option finds the shortest label that uniquely identifies each profile. Either way, there are some quirks: ffpjsd clips labels to 10 characters if they are longer than 50 characters, so all labels are trimmed to 50 characters first. Also, "id" is prefixed to any numeric label, since some tree visualizers won't show purely numeric labels. If a FASTA sequence label accidentally duplicates a previous one, it is prefixed with "DupLabel-".
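
The label quirks can be summarized in a small Python sketch. The function `clean_labels` is hypothetical, and the order in which the rules are applied (trim, then numeric prefix, then duplicate prefix) is an assumption; the "short labels" option and labels repeated more than twice are not handled.

```python
def clean_labels(labels, max_len=50):
    """Apply the label rules described above to a list of raw labels."""
    seen = set()
    cleaned = []
    for label in labels:
        label = label[:max_len]          # trim to 50 characters
        if label.isdigit():
            label = "id" + label         # avoid purely numeric labels
        if label in seen:
            label = "DupLabel-" + label  # mark a repeated label
        seen.add(label)
        cleaned.append(label)
    return cleaned


print(clean_labels(["123", "seqA", "seqA", "x" * 60]))
```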

The command-line ffpjsd program can hang if it is given an l-mer length greater than the length of the file content. If this happens, find its process id ("ps aux | grep ffpjsd") and kill it ("kill [process id]").
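
One way to avoid the hang is to check the input before running the pipeline. The Python sketch below is only a guess at such a guard: it treats "length of file content" as the total number of sequence characters in a FASTA file, which is an assumption, and the function names are hypothetical.

```python
def content_length(fasta_text):
    """Count sequence characters, ignoring FASTA header lines and whitespace."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


def lmer_length_ok(fasta_text, length):
    """Return True if the l-mer length does not exceed the file content."""
    return 0 < length <= content_length(fasta_text)


sample = ">seq1\nACGT\nAC\n"  # made-up FASTA content with 6 bases
print(lmer_length_ok(sample, 6), lmer_length_ok(sample, 7))
```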

Finally, the ffptree program can generate a tree in which some of the branch distances are negative. See https://www.biostars.org/p/45597/
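
To check whether a generated tree is affected, the branch lengths can be pulled out of the Newick text. This Python sketch uses a simple regular expression and a made-up example tree; it assumes unquoted taxon labels that contain no colons, and it is not part of the Galaxy tool.

```python
import re

# A branch length follows a colon in Newick, e.g. "A:0.12" or "B:-0.03".
BRANCH_LENGTH = re.compile(r":(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def negative_branches(newick):
    """Return all negative branch lengths found in a Newick string."""
    lengths = [float(m) for m in BRANCH_LENGTH.findall(newick)]
    return [x for x in lengths if x < 0]


tree = "((A:0.10,B:-0.02):0.05,C:0.30);"  # made-up example tree
print(negative_branches(tree))
```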

-------
**References**

The development of the ffp-phylogeny command-line software should be attributed to:

Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 2009;106(8):2677-2682. doi:10.1073/pnas.0813249106.