annotate rapidcluster.xml @ 0:12f2dd9ac1fd draft

Uploaded
author hathkul
date Mon, 26 Dec 2016 11:04:51 -0500
parents
children
<tool id="rapidcluster_2" name="RapidCluster" version="2">

<description>Cluster closely-related sequences using Levenshtein edit distance filtering.</description>

<version_command>rapidcluster -v</version_command>

<command interpreter="perl">rapidcluster -i $input -o $output -d $distance -f $filter -c $max_clusters > $report
</command>

<inputs>
<param name="input" type="data" format="fasta" label="Input file" help="Must be FASTA output from FASTAptamer-Count"></param>
<param name="distance" type="integer" label="Levenshtein Edit Distance" value="1" help="Minimum number of insertions, deletions, or substitutions required to transform one sequence into another"></param>
<param name="filter" type="integer" label="Read Filter" optional="true" value="1" help="Only sequences with total reads greater than the value supplied will be clustered."></param>
<param name="max_clusters" type="integer" label="Maximum number of clusters to find" optional="true" value="500" help="The script stops after finding this many clusters"></param>
</inputs>

<outputs>
<data name="output" format="fasta" label="$input RapidCluster output"></data>
<data name="report" format="txt" label="$input RapidCluster Report"></data>
</outputs>

<help>

.. class:: warningmark

RapidCluster requires a FASTA-formatted input file generated by FASTAptamer-Count.

.. class:: warningmark

RapidCluster uses an exhaustive approach to clustering and can take *several* hours to run. For faster processing, use the "Read Filter" option to exclude low-read sequences and set a reasonable maximum number of clusters to find.

------

This version does not calculate the exact Levenshtein distance for each pair of sequences; it only checks whether that distance is within or above the user-defined value. This makes the script much faster when clustering highly similar sequences.
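
To illustrate how a threshold check can avoid computing the exact distance, here is a minimal Python sketch. It is not the code RapidCluster runs: the function name ``within_distance`` and the banded dynamic-programming approach with early exit are assumptions chosen for this example.

```python
def within_distance(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k.

    Hypothetical helper for illustration; not taken from the RapidCluster script.
    """
    # A length difference above k already needs more than k insertions or deletions.
    if abs(len(a) - len(b)) > k:
        return False
    too_far = k + 1  # stands in for "more than k"; exact values beyond k are never needed
    # prev[j] = capped distance between the empty prefix of a and b[:j]
    prev = [min(j, too_far) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [too_far] * (len(b) + 1)
        if k >= i:
            cur[0] = i
        # Only cells within k of the diagonal can hold a value of at most k.
        lo = max(1, i - k)
        hi = min(len(b), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from a
                         cur[j - 1] + 1,      # insertion into a
                         too_far)
        # Row minima never decrease, so once a whole row exceeds k we can stop.
        if min(cur) > k:
            return False
        prev = cur
    return k >= prev[len(b)]
```

For example, ``within_distance("ACGT", "AGT", 1)`` is true (one deletion), while ``within_distance("AAAA", "TTTT", 3)`` is false because four substitutions are needed.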

RapidCluster begins with the most abundant sequence in the population, referred to as the "seed sequence," and clusters with it every sequence in the file whose edit distance from the seed is less than or equal to the specified edit distance (Cluster #1). The next most abundant unclustered sequence then serves as the seed for the second cluster, which is assembled from the remaining sequences (Cluster #2), followed by the next most abundant unclustered sequence (Cluster #3), and so on. The process repeats until every sequence is clustered or the maximum number of clusters has been found.
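
The seed-based procedure described above can be sketched in Python as follows. This is an illustrative reading of the description, not the actual script: the names ``edit_distance`` and ``greedy_cluster``, the in-memory list of (sequence, reads) pairs, and the use of an exact distance calculation are all assumptions made to keep the example short.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (exact, for illustration only)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # deletion
                           cur[j - 1] + 1))           # insertion
        prev = cur
    return prev[-1]


def greedy_cluster(records, max_distance=1, read_filter=1, max_clusters=500):
    """Cluster (sequence, reads) pairs around the most abundant unclustered seed.

    Returns a list of clusters; each cluster is a list of
    (sequence, reads, distance_from_seed) tuples with the seed first.
    """
    # Keep only sequences with more reads than the filter, most abundant first.
    pool = sorted((r for r in records if r[1] > read_filter),
                  key=lambda r: r[1], reverse=True)
    clusters = []
    while pool and max_clusters > len(clusters):
        seed_seq, seed_reads = pool[0]
        cluster = [(seed_seq, seed_reads, 0)]
        remaining = []
        for seq, reads in pool[1:]:
            d = edit_distance(seed_seq, seq)
            if d > max_distance:
                remaining.append((seq, reads))   # stays available for later seeds
            else:
                cluster.append((seq, reads, d))  # joins the current seed's cluster
        clusters.append(cluster)
        pool = remaining
    return clusters
```

Each pass removes the seed and its neighbours from the pool, so a sequence is assigned to the first (most abundant) seed it is close enough to and is never reconsidered.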

The output is in FASTA format, with the following fields on each FASTA identifier line:

>Rank-Reads-RPM-Cluster#-RankWithinCluster-EditDistanceFromSeedSequence
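
For downstream processing, an identifier line in this layout can be split into its six fields. The Python sketch below assumes that every field is numeric, that RPM is written as a decimal number, and that no field contains a hyphen; the names ``ClusterHeader`` and ``parse_header`` and the sample identifier in the usage note are made up for illustration.

```python
from typing import NamedTuple


class ClusterHeader(NamedTuple):
    rank: int                # overall rank of the sequence
    reads: int               # total reads
    rpm: float               # reads per million (assumed to be a decimal number)
    cluster: int             # cluster number
    rank_in_cluster: int     # rank within the cluster
    distance_from_seed: int  # edit distance from the seed sequence


def parse_header(line: str) -> ClusterHeader:
    """Parse one identifier line of the form shown above into named fields."""
    fields = line.strip().lstrip(">").split("-")
    if len(fields) != 6:
        raise ValueError(f"expected 6 hyphen-separated fields, got {len(fields)}")
    rank, reads, rpm, cluster, rank_in_cluster, distance = fields
    return ClusterHeader(int(rank), int(reads), float(rpm),
                         int(cluster), int(rank_in_cluster), int(distance))
```

With a made-up identifier, ``parse_header(">1-1500-25000.00-1-1-0")`` yields rank 1, 1500 reads, cluster 1, and an edit distance of 0, which is what a seed sequence would carry.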

.. class:: infomark

The "Read Filter" excludes from clustering any sequence whose total number of reads is less than or equal to the integer supplied. Because clustering large datasets is computationally expensive, the default filter setting of 1 removes singleton sequences from clustering.

------

</help>

</tool>