comparison tools/get_orfs_or_cdss/get_orfs_or_cdss.xml @ 5:5208c15805ec draft

Uploaded v0.0.5 dependent on Biopython 1.62
author peterjc
date Mon, 28 Oct 2013 05:19:38 -0400
parents
children 705a2e2df7fb
comparison
4:d51819d2d7e2 5:5208c15805ec
<tool id="get_orfs_or_cdss" name="Get open reading frames (ORFs) or coding sequences (CDSs)" version="0.0.5">
    <description>e.g. to get peptides from ESTs</description>
    <requirements>
        <requirement type="package" version="1.62">biopython</requirement>
        <requirement type="python-module">Bio</requirement>
    </requirements>
    <version_command interpreter="python">get_orfs_or_cdss.py --version</version_command>
    <command interpreter="python">
get_orfs_or_cdss.py $input_file $input_file.ext $table $ftype $ends $mode $min_len $strand $out_nuc_file $out_prot_file
    </command>
    <stdio>
        <!-- Anything other than zero is an error -->
        <exit_code range="1:" />
        <exit_code range=":-1" />
    </stdio>
    <inputs>
        <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file (nucleotides)" help="FASTA, FASTQ, or SFF format." />
        <param name="table" type="select" label="Genetic code" help="NCBI genetic code tables; these determine the start and stop codons">
            <option value="1">1. Standard</option>
            <option value="2">2. Vertebrate Mitochondrial</option>
            <option value="3">3. Yeast Mitochondrial</option>
            <option value="4">4. Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma</option>
            <option value="5">5. Invertebrate Mitochondrial</option>
            <option value="6">6. Ciliate Macronuclear and Dasycladacean</option>
            <option value="9">9. Echinoderm Mitochondrial</option>
            <option value="10">10. Euplotid Nuclear</option>
            <option value="11">11. Bacterial</option>
            <option value="12">12. Alternative Yeast Nuclear</option>
            <option value="13">13. Ascidian Mitochondrial</option>
            <option value="14">14. Flatworm Mitochondrial</option>
            <option value="15">15. Blepharisma Macronuclear</option>
            <option value="16">16. Chlorophycean Mitochondrial</option>
            <option value="21">21. Trematode Mitochondrial</option>
            <option value="22">22. Scenedesmus obliquus</option>
            <option value="23">23. Thraustochytrium Mitochondrial</option>
        </param>
        <param name="ftype" type="select" value="True" label="Look for ORFs or CDSs">
            <option value="ORF">Look for ORFs (check for stop codons only, ignore start codons)</option>
            <option value="CDS">Look for CDSs (with start and stop codons)</option>
        </param>
        <param name="ends" type="select" value="open" label="Sequence end treatment">
            <option value="open">Open ended (will allow missing start/stop codons at the ends)</option>
            <option value="closed">Complete (will check for start/stop codons at the ends)</option>
            <!-- TODO? Circular, for using this on finished bacteria etc -->
        </param>
        <param name="mode" type="select" label="Selection criteria" help="Suppose a sequence has ORFs/CDSs of lengths 100, 102 and 102 -- which should be taken? These options would return 3, 2 or 1 ORF.">
            <option value="all">All ORFs/CDSs from each sequence</option>
            <option value="top">All ORFs/CDSs from each sequence with the maximum length</option>
            <option value="one">First ORF/CDS from each sequence with the maximum length</option>
        </param>
        <param name="min_len" type="integer" size="5" value="30" label="Minimum length ORF/CDS (in amino acids, e.g. 30 aa = 90 bp plus any stop codon)" />
        <param name="strand" type="select" label="Strand to search" help="Use the forward only option if your sequence directionality is known (e.g. from poly-A tails, or strand-specific RNA sequencing).">
            <option value="both">Search both the forward and reverse strand</option>
            <option value="forward">Only search the forward strand</option>
            <option value="reverse">Only search the reverse strand</option>
        </param>
    </inputs>
    <outputs>
        <data name="out_nuc_file" format="fasta" label="${ftype.value}s (nucleotides)" />
        <data name="out_prot_file" format="fasta" label="${ftype.value}s (amino acids)" />
    </outputs>
    <tests>
        <test>
            <param name="input_file" value="get_orf_input.fasta" />
            <param name="table" value="1" />
            <param name="ftype" value="CDS" />
            <param name="ends" value="open" />
            <param name="mode" value="all" />
            <param name="min_len" value="10" />
            <param name="strand" value="forward" />
            <output name="out_nuc_file" file="get_orf_input.t1_nuc_out.fasta" />
            <output name="out_prot_file" file="get_orf_input.t1_prot_out.fasta" />
        </test>
        <test>
            <param name="input_file" value="get_orf_input.fasta" />
            <param name="table" value="11" />
            <param name="ftype" value="CDS" />
            <param name="ends" value="closed" />
            <param name="mode" value="all" />
            <param name="min_len" value="10" />
            <param name="strand" value="forward" />
            <output name="out_nuc_file" file="get_orf_input.t11_nuc_out.fasta" />
            <output name="out_prot_file" file="get_orf_input.t11_prot_out.fasta" />
        </test>
        <test>
            <param name="input_file" value="get_orf_input.fasta" />
            <param name="table" value="11" />
            <param name="ftype" value="CDS" />
            <param name="ends" value="open" />
            <param name="mode" value="all" />
            <param name="min_len" value="10" />
            <param name="strand" value="forward" />
            <output name="out_nuc_file" file="get_orf_input.t11_open_nuc_out.fasta" />
            <output name="out_prot_file" file="get_orf_input.t11_open_prot_out.fasta" />
        </test>
        <test>
            <param name="input_file" value="Ssuis.fasta" />
            <param name="table" value="11" />
            <param name="ftype" value="ORF" />
            <param name="ends" value="open" />
            <param name="mode" value="all" />
            <param name="min_len" value="100" />
            <param name="strand" value="both" />
            <output name="out_nuc_file" file="get_orf_input.Suis_ORF.nuc.fasta" />
            <output name="out_prot_file" file="get_orf_input.Suis_ORF.prot.fasta" />
        </test>
    </tests>
    <help>
**What it does**

Takes an input file of nucleotide sequences (typically FASTA, but FASTQ and
Standard Flowgram Format (SFF) are also supported), and searches each sequence
for open reading frames (ORFs) or potential coding sequences (CDSs) of the
given minimum length. These are returned as FASTA files of nucleotide and
protein sequences.

For each sequence you can choose to have all the ORFs/CDSs above the minimum
length (similar to the EMBOSS getorf tool), all those tied for the longest
length, or only the first ORF/CDS with the longest length (this differs from
the previous option only in the special case where a sequence encodes two or
more long ORFs/CDSs of the same length). The last option is a reasonable
choice when the input sequences represent EST or mRNA sequences, where only
one ORF/CDS is expected.

Note that if no ORFs/CDSs in a sequence match the criteria, there will be no
output for that sequence.

Also note that the ORFs/CDSs are assigned modified identifiers to distinguish
them from the original full length sequences, by appending a suffix.

The start and stop codons are taken from the `NCBI Genetic Codes
&lt;http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi&gt;`_.
When searching for ORFs, the sequences will run from stop codon to stop
codon, and any start codons are ignored. When searching for CDSs, the first
potential start codon will be used, giving the longest possible CDS within
each ORF, and thus the longest possible protein sequence. This is useful
for things like BLAST or domain searching, but since this may not be the
correct start codon, the result may not be appropriate for signal peptide
detection etc.

**Example Usage**

Given some EST sequences (Sanger capillary reads) assembled into unigenes,
or a transcriptome assembly from some RNA-Seq, each of your nucleotide
sequences should (barring sequencing errors, assembly errors, frame-shifts
etc.) encode one protein as a single ORF/CDS, which you wish to extract (and
perhaps translate into amino acids).

If your RNA-Seq data was strand-specific, and assembled taking this into
account, you should only search for ORFs/CDSs on the forward strand.

**Citation**

If you use this Galaxy tool in work leading to a scientific publication please
cite the following paper:

Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013).
Galaxy tools and workflows for sequence analysis with applications
in molecular plant pathology. PeerJ 1:e167
http://dx.doi.org/10.7717/peerj.167

This tool uses Biopython, so you may also wish to cite the Biopython
application note (and Galaxy too of course):

Cock et al. (2009). Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This tool is available to install into other Galaxy Instances via the Galaxy
Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/get_orfs_or_cdss
    </help>
</tool>
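
The help text above describes two search types (ORFs run from stop codon to stop codon, CDSs begin at the first start codon inside an ORF) and three selection modes (all, top, one). The following is a minimal pure-Python sketch of that logic, not the code in get_orfs_or_cdss.py. It assumes NCBI table 1 with only ATG as a start codon, searches the forward strand only, treats sequence ends as open for stop codons, and always requires a start codon in CDS mode. The names `find_regions` and `select` are hypothetical and chosen for illustration.

```python
# Illustrative sketch only; NOT the implementation in get_orfs_or_cdss.py.
# Assumptions: NCBI table 1 stop codons, ATG as the only start codon,
# forward strand only, open-ended sequences (a missing stop at either end is allowed).
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG"}


def find_regions(seq, ftype="ORF", min_len=30):
    """Return sorted (start, end, nucleotides) tuples for one sequence.

    Coordinates are 0-based and half-open, the stop codon is excluded,
    and min_len is measured in codons (amino acids), as in the tool's form.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        begin = 0  # codon index where the current stop-free run starts
        for idx in range(len(codons) + 1):
            at_end = idx == len(codons)
            if not at_end and codons[idx] not in STOPS:
                continue
            run_start = begin
            if ftype == "CDS":
                # Use the first start codon in the run, giving the longest CDS.
                starts = [j for j in range(begin, idx) if codons[j] in STARTS]
                run_start = starts[0] if starts else idx
            if idx - run_start >= min_len:
                s = frame + 3 * run_start
                e = frame + 3 * idx
                found.append((s, e, seq[s:e]))
            begin = idx + 1
    return sorted(found)


def select(regions, mode="all"):
    """Apply the 'all', 'top' or 'one' selection criteria to one sequence."""
    if not regions or mode == "all":
        return list(regions)
    longest = max(len(nuc) for _, _, nuc in regions)
    top = [r for r in regions if len(r[2]) == longest]
    return top if mode == "top" else top[:1]


if __name__ == "__main__":
    toy = "ATGAAACCCTAAATGGGGTTTTGA"  # two 3-codon CDSs in frame 0
    cds = find_regions(toy, ftype="CDS", min_len=3)
    for mode in ("all", "top", "one"):
        print(mode, select(cds, mode))
```

In this sketch "first" means lowest start coordinate; the real tool may order candidates differently. Using Biopython's codon tables instead of the hard-coded sets would be the natural way to support the other genetic codes listed in the form.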