comparison mothur/tools/mothur/pre.cluster.xml @ 35:95d75b35e4d2

Updated tools to use Mothur 1.33. Added some misc. fixes and updates (blast repository, tool fixes)
author certain cat
date Fri, 31 Oct 2014 15:09:32 -0400
parents 49058b1f8d3f
children
34:1be61ceb20d7 35:95d75b35e4d2
1 <tool id="mothur_pre_cluster" name="Pre.cluster" version="1.23.0"> 1 <tool id="mothur_pre_cluster" name="Pre.cluster" version="1.24.0">
2 <description>Remove sequences due to pyrosequencing errors</description> 2 <description>Remove sequences due to pyrosequencing errors</description>
3 <command interpreter="python"> 3 <command interpreter="python">
4 mothur_wrapper.py 4 mothur_wrapper.py
5 #import re, os.path 5 #import re, os.path
6 #set results = ["'^mothur.\S+\.logfile$:'" + $logfile.__str__] 6 #set results = ["'^mothur.\S+\.logfile$:'" + $logfile.__str__]
8 #set results = $results + ["'" + $re.sub(r'(^.*)\.(.*?)',r'\1.precluster.\2',$os.path.basename($fasta.__str__)) + ":'" + $fasta_out.__str__] 8 #set results = $results + ["'" + $re.sub(r'(^.*)\.(.*?)',r'\1.precluster.\2',$os.path.basename($fasta.__str__)) + ":'" + $fasta_out.__str__]
9 #set results = $results + ["'" + $re.sub(r'(^.*)\.(.*?)$',r'\1.precluster.names',$os.path.basename($fasta.__str__)) + ":'" + $names_out.__str__] 9 #set results = $results + ["'" + $re.sub(r'(^.*)\.(.*?)$',r'\1.precluster.names',$os.path.basename($fasta.__str__)) + ":'" + $names_out.__str__]
10 #set results = $results + ["'" + $re.sub(r'(^.*)\.(.*?)$',r'\1.precluster.map',$os.path.basename($fasta.__str__)) + ":'" + $map_out.__str__] 10 #set results = $results + ["'" + $re.sub(r'(^.*)\.(.*?)$',r'\1.precluster.map',$os.path.basename($fasta.__str__)) + ":'" + $map_out.__str__]
11 --cmd='pre.cluster' 11 --cmd='pre.cluster'
12 --outputdir='$logfile.extra_files_path' 12 --outputdir='$logfile.extra_files_path'
13 --fasta=$fasta 13 --fasta=$fasta
14 #if $name.__str__ != "None" and len($name.__str__) > 0: 14 #if isinstance($name.datatype, $__app__.datatypes_registry.get_datatype_by_extension('name').__class__):
15 --name=$name 15 --name=$name
16 #end if 16 #else
17 --count=$name
18 #end if
17 #if $group.__str__ != "None" and len($group.__str__) > 0: 19 #if $group.__str__ != "None" and len($group.__str__) > 0:
18 --group=$group 20 --group=$group
19 #end if 21 #end if
20 #if 20 >= int($diffs.__str__) >= 0: 22 #if 20 >= int($diffs.__str__) >= 0:
21 --diffs=$diffs 23 --diffs=$diffs
22 #end if 24 #end if
23 --result=#echo ','.join($results) 25 --result=#echo ','.join($results)
24 --processors=8 26 --processors=8
27 --topdown
25 </command> 28 </command>
26 <inputs> 29 <inputs>
27 <param name="fasta" type="data" format="fasta" label="fasta - Sequence Fasta"/> 30 <param name="fasta" type="data" format="fasta" label="fasta - Sequence Fasta"/>
28 <param name="name" type="data" format="names" optional="true" label="name - Sequences Name reference"/> 31 <param name="name" type="data" format="names,count_table" optional="true" label="name - Sequences Name reference"/>
29 <param name="group" type="data" format="groups" optional="true" label="group - Sequences Name reference"/> 32 <param name="group" type="data" format="groups" optional="true" label="group - Sequences Name reference"/>
30 <param name="diffs" type="integer" value="1" label="diffs - Number of mismatched bases to allow between sequences in a group (default 1)"/> 33 <param name="diffs" type="integer" value="1" label="diffs - Number of mismatched bases to allow between sequences in a group (default 1)"/>
34 <param name="topdown" type="boolean" truevalue="--topdown=true" falsevalue="" checked="false" label="allows you to specify whether to cluster from largest abundance to smallest or vice versa. Default =T, which is largest to smallest"/>
31 </inputs> 35 </inputs>
32 <outputs> 36 <outputs>
33 <data format="html" name="logfile" label="${tool.name} on ${on_string}: logfile" /> 37 <data format="html" name="logfile" label="${tool.name} on ${on_string}: logfile" />
34 <data format_source="fasta" name="fasta_out" label="${tool.name} on ${on_string}: precluster.fasta" /> 38 <data format_source="fasta" name="fasta_out" label="${tool.name} on ${on_string}: precluster.fasta" />
35 <data format="names" name="names_out" label="${tool.name} on ${on_string}: precluster.names" /> 39 <data format="names" name="names_out" label="${tool.name} on ${on_string}: precluster.names" />
36 <data format="tabular" name="map_out" label="${tool.name} on ${on_string}: precluster.map" /> 40 <data format="tabular" name="map_out" label="${tool.name} on ${on_string}: precluster.map" />
37 </outputs> 41 </outputs>
38 <requirements> 42 <requirements>
39 <requirement type="package" version="1.27">mothur</requirement> 43 <requirement type="package" version="1.33">mothur</requirement>
40 </requirements> 44 </requirements>
41 <tests> 45 <tests>
42 </tests> 46 </tests>
43 <help> 47 <help>
44 **Mothur Overview** 48 **Mothur Overview**
53 57
54 The pre.cluster_ command implements a pseudo-single linkage algorithm with the goal of removing sequences that are likely due to pyrosequencing errors. The basic idea is that abundant sequences are more likely to generate erroneous sequences than rare sequences. With that in mind, the algorithm proceeds by ranking sequences in order of their abundance. Then we walk through the list of sequences looking for rarer sequences that are within some threshold of the original sequence. Those that are within the threshold are merged with the larger sequence. The original Huse method performs this task on a distance matrix, whereas we do it based on the original sequences. The advantage of our approach is that the algorithm works on aligned sequences instead of a distance matrix. This is advantageous because by pre-clustering you remove a large number of sequences making the distance calculation much faster. 58 The pre.cluster_ command implements a pseudo-single linkage algorithm with the goal of removing sequences that are likely due to pyrosequencing errors. The basic idea is that abundant sequences are more likely to generate erroneous sequences than rare sequences. With that in mind, the algorithm proceeds by ranking sequences in order of their abundance. Then we walk through the list of sequences looking for rarer sequences that are within some threshold of the original sequence. Those that are within the threshold are merged with the larger sequence. The original Huse method performs this task on a distance matrix, whereas we do it based on the original sequences. The advantage of our approach is that the algorithm works on aligned sequences instead of a distance matrix. This is advantageous because by pre-clustering you remove a large number of sequences making the distance calculation much faster.
55 59
56 .. _pre.cluster: http://www.mothur.org/wiki/Pre.cluster 60 .. _pre.cluster: http://www.mothur.org/wiki/Pre.cluster
57 61
62 v1.24.0: Updated to mothur 1.33, added count and topdown parameter
58 63
59 </help> 64 </help>
60 </tool> 65 </tool>
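
The help text above describes the pre-clustering idea in prose: rank sequences by abundance, then merge rarer sequences that fall within a mismatch threshold of a more abundant one. The following is a minimal illustrative sketch of that idea in Python, not mothur's actual implementation. The function names (count_mismatches, pre_cluster) and the dict-based input are hypothetical, and the sketch assumes that a sequence already merged into another does not recruit further sequences, which the help text does not specify.

```python
def count_mismatches(a, b, limit):
    """Count differing positions between two aligned sequences.

    Stops early once the count exceeds `limit`, since the caller only
    needs to know whether the pair is within the threshold.
    """
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def pre_cluster(abundances, diffs=1):
    """Merge rare sequences into abundant ones within `diffs` mismatches.

    `abundances` maps each aligned sequence (hypothetical input format)
    to its count. Returns a dict of surviving sequences to merged counts.
    """
    # Rank by abundance, most abundant first; ties broken by sequence
    # text so the result is deterministic.
    ranked = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    counts = [count for _, count in ranked]
    merged = [False] * len(ranked)

    for i, (seq_i, _) in enumerate(ranked):
        if merged[i]:
            continue  # assumption: merged sequences do not recruit others
        for j in range(i + 1, len(ranked)):
            if merged[j]:
                continue
            seq_j = ranked[j][0]
            # Aligned sequences are expected to have equal length.
            if len(seq_i) != len(seq_j):
                continue
            if count_mismatches(seq_i, seq_j, diffs) <= diffs:
                counts[i] += counts[j]
                merged[j] = True

    return {ranked[i][0]: counts[i]
            for i in range(len(ranked)) if not merged[i]}
```

Calling pre_cluster({"ACGT": 10, "ACGA": 2, "TTTT": 3, "ACCA": 1}, diffs=1) folds "ACGA" into "ACGT" because they differ at one position, while "TTTT" and "ACCA" survive on their own. The diffs argument plays the role of the wrapper's diffs parameter (default 1).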