e.g. to get peptides from ESTs
biopython
get_orfs_or_cdss.py --version
get_orfs_or_cdss.py -i $input_file -f $input_file.ext --table $table -t $ftype -e $ends -m $mode --min_len $min_len -s $strand --on $out_nuc_file --op $out_prot_file --ob $out_bed_file --og $out_gff3_file
**What it does**
Takes an input file of nucleotide sequences (typically FASTA, although FASTQ
and Standard Flowgram Format (SFF) are also supported), and searches each
sequence for open reading frames (ORFs) or potential coding sequences (CDSs)
of at least the given minimum length. These are returned as FASTA files of
nucleotide and protein sequences.
You can choose to report all the ORFs/CDSs above the minimum length for each
sequence (similar to the EMBOSS getorf tool), all those tied for the longest
length, or only the first ORF/CDS with the longest length. The last two
options differ only in the special case where a sequence encodes two or more
longest ORFs/CDSs of the same length. Reporting only the first is a reasonable
choice when the input sequences represent EST or mRNA sequences, where only
one ORF/CDS is expected.
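The three reporting choices can be sketched in a few lines of Python. This is
only an illustration, not the tool's own code: the function name
``select_candidates`` and the mode labels ``all``, ``top`` and ``one`` are
hypothetical, and the candidates are assumed to have already passed the
minimum length filter.

```python
def select_candidates(candidates, mode):
    """Filter (start, sequence) pairs that already meet the minimum length.

    The mode labels are hypothetical names used for this sketch only:
    "all" keeps every candidate,
    "top" keeps every candidate tied for the longest length,
    "one" keeps only the first candidate with the longest length.
    """
    if mode not in ("all", "top", "one"):
        raise ValueError(f"Unknown mode: {mode!r}")
    if not candidates or mode == "all":
        return list(candidates)
    longest = max(len(seq) for _, seq in candidates)
    tied = [item for item in candidates if len(item[1]) == longest]
    # "top" returns all ties, "one" returns just the first of them
    return tied if mode == "top" else tied[:1]


# Made-up candidates: two of them share the longest length (6 residues)
candidates = [(0, "MKT"), (12, "MAAAAG"), (40, "MCCCCH")]
for mode in ("all", "top", "one"):
    print(mode, select_candidates(candidates, mode))
```

With a tie like the one above, the ``top`` and ``one`` choices give different
results; without a tie they return the same single sequence.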
Note that if no ORFs/CDSs in a sequence match the criteria, there will be no
output for that sequence.
Also note that each ORF/CDS is given a modified identifier, formed by
appending a suffix, to distinguish it from the original full-length sequence.
The start and stop codons are taken from the `NCBI Genetic Codes
<http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi>`_.
When searching for ORFs, the sequences will run from stop codon to stop
codon, and any start codons are ignored. When searching for CDSs, the first
potential start codon will be used, giving the longest possible CDS within
each ORF, and thus the longest possible protein sequence. This is useful
for things like BLAST or domain searching, but since this may not be the
correct start codon, it may not be appropriate for signal peptide detection
etc.
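The difference between the ORF and CDS searches can be illustrated with a
minimal pure-Python sketch for a single reading frame. It is not the tool's
implementation: the helper names are hypothetical, only the stop codons of the
standard genetic code (NCBI table 1) are used, ``ATG`` is assumed to be the
only start codon (a simplification), and sequence ends are treated as open.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard code (table 1)
START_CODONS = {"ATG"}  # assumption: ATG only, alternative starts are ignored


def split_codons(seq, frame=0):
    """Split a nucleotide string into complete codons from the given offset."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def find_orfs(seq, frame=0):
    """Return stop-to-stop runs of codons (stops excluded) in one frame."""
    orfs, current = [], []
    for codon in split_codons(seq, frame):
        if codon in STOP_CODONS:
            if current:
                orfs.append("".join(current))
            current = []
        else:
            current.append(codon)
    if current:
        orfs.append("".join(current))  # run reaching the end (open ends)
    return orfs


def orf_to_cds(orf):
    """Trim an ORF to start at its first start codon, or return None."""
    codons = split_codons(orf)
    for index, codon in enumerate(codons):
        if codon in START_CODONS:
            return "".join(codons[index:])
    return None


# Codons: CCC TAA GGG ATG AAA ATG CCC TGA TTT
seq = "CCCTAAGGGATGAAAATGCCCTGATTT"
for orf in find_orfs(seq):
    print(orf, orf_to_cds(orf))
```

The middle ORF contains two ``ATG`` codons; taking the first one gives the
longest possible CDS, while ORFs with no start codon yield no CDS at all.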
**Example Usage**
Given some EST sequences (Sanger capillary reads) assembled into unigenes,
or a transcriptome assembly from RNA-Seq data, each of your nucleotide
sequences should (barring sequencing errors, assembly errors, frame-shifts,
etc.) encode one protein as a single ORF/CDS, which you wish to extract (and
perhaps translate into amino acids).
If your RNA-Seq data was strand-specific, and assembled taking this into
account, you should only search for ORFs/CDSs on the forward strand.
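To see what restricting the search to the forward strand means in practice,
here is a small sketch that lists the reading frames to scan. The function
names and the strand labels ``both``, ``forward`` and ``reverse`` are
hypothetical choices for this example and only unambiguous DNA letters are
handled.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def frames_to_search(seq, strand="both"):
    """List (strand_label, frame_offset, sequence) for each frame to scan.

    The strand values used here are hypothetical labels for this sketch.
    """
    if strand not in ("both", "forward", "reverse"):
        raise ValueError(f"Unknown strand: {strand!r}")
    templates = []
    if strand in ("both", "forward"):
        templates.append(("+", seq))
    if strand in ("both", "reverse"):
        templates.append(("-", reverse_complement(seq)))
    # Three frame offsets per strand
    return [
        (label, offset, template[offset:])
        for label, template in templates
        for offset in range(3)
    ]


for label, offset, frame_seq in frames_to_search("ATGAAATAG", "forward"):
    print(label, offset, frame_seq)
```

Searching only the forward strand halves the number of frames examined (three
instead of six), which avoids reporting spurious ORFs/CDSs from the antisense
strand of strand-specific assemblies.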
**Citation**
If you use this Galaxy tool in work leading to a scientific publication please
cite the following paper:
Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013).
Galaxy tools and workflows for sequence analysis with applications
in molecular plant pathology. PeerJ 1:e167
http://dx.doi.org/10.7717/peerj.167
This tool uses Biopython, so you may also wish to cite the Biopython
application note (and Galaxy too of course):
Cock et al (2009). Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
This tool is available to install into other Galaxy Instances via the Galaxy
Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/get_orfs_or_cdss