Mercurial > repos > cpt > cpt_get_orfs
view get_orfs_or_cdss.xml @ 6:f56389d8abda draft default tip
planemo upload commit f33bdf952d796c5d7a240b132af3c4cbd102decc
author | cpt |
---|---|
date | Fri, 05 Jan 2024 05:52:07 +0000 |
parents | eb0ebbfb714a |
children |
line wrap: on
line source
<tool id="get_orfs_or_cdss" name="Get open reading frames (ORFs) or coding sequences (CDSs)" version="19.1.0.0"> <description>e.g. to get peptides from ESTs</description> <macros> <import>macros.xml</import> </macros> <expand macro="requirements"> <requirement type="package" version="3.9.16">python</requirement> <requirement type="package" version="1.2.2">cpt_gffparser</requirement> <requirement type="package" version="1.81">biopython</requirement> <requirement type="package" version="2022.1.18">regex</requirement> </expand> <command interpreter="python" detect_errors="aggressive"><![CDATA[ get_orfs_or_cdss.py '$input_file' -f '$input_file.ext' --table '$table' -t '$ftype' -e "closed" -m "all" --min_len '$min_len' --strand '$strand' --on '$out_nuc_file' --op '$out_prot_file' --ob '$out_bed_file' --og '$out_gff3_file' ]]> </command> <inputs> <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file (nucleotides)" help="FASTA, FASTQ, or SFF format."/> <param name="table" type="select" label="Genetic code" help="Tables from the NCBI, these determine the start and stop codons"> <option value="1">1. Standard</option> <option value="2">2. Vertebrate Mitochondrial</option> <option value="3">3. Yeast Mitochondrial</option> <option value="4">4. Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma</option> <option value="5">5. Invertebrate Mitochondrial</option> <option value="6">6. Ciliate Macronuclear and Dasycladacean</option> <option value="9">9. Echinoderm Mitochondrial</option> <option value="10">10. Euplotid Nuclear</option> <option value="11">11. Bacterial</option> <option value="12">12. Alternative Yeast Nuclear</option> <option value="13">13. Ascidian Mitochondrial</option> <option value="14">14. Flatworm Mitochondrial</option> <option value="15">15. Blepharisma Macronuclear</option> <option value="16">16. Chlorophycean Mitochondrial</option> <option value="21">21. Trematode Mitochondrial</option> <option value="22">22. Scenedesmus obliquus</option> <option value="23">23. Thraustochytrium Mitochondrial</option> <option value="24">24. Pterobranchia Mitochondrial</option> </param> <param name="ftype" type="select" value="True" label="Look for ORFs or CDSs"> <option value="ORF">Look for ORFs (check for stop codons only, ignore start codons)</option> <option value="CDS">Look for CDSs (with start and stop codons)</option> </param> <param name="min_len" type="integer" size="5" value="30" label="Minimum length ORF/CDS (in amino acids, e.g. 30 aa = 90 bp plus any stop codon)"/> <param name="strand" type="select" label="Strand to search" help="Use the forward only option if your sequence directionality is known (e.g. from poly-A tails, or strand specific RNA sequencing)."> <option value="both">Search both the forward and reverse strand</option> <option value="forward">Only search the forward strand</option> <option value="reverse">Only search the reverse strand</option> </param> </inputs> <outputs> <data name="out_nuc_file" format="fasta" label="${ftype.value}s (nucleotides)"/> <data name="out_prot_file" format="fasta" label="${ftype.value}s (amino acids)"/> <data name="out_bed_file" format="bed6" label="${ftype.value}s (bed)"/> <data name="out_gff3_file" format="gff3" label="${ftype.value}s (gff3)"/> </outputs> <tests> <test> <param name="input_file" value="Orf_T7In.fasta"/> <param name="table" value="11"/> <param name="ftype" value="ORF"/> <param name="min_len" value="30"/> <param name="strand" value="both"/> <output name="out_nuc_file" file="Orf_T7Out_Nuc.fasta"/> <output name="out_prot_file" file="Orf_T7Out_AA.fasta"/> <output name="out_bed_file" file="Orf_T7Out_Bed.bed"/> <output name="out_gff3_file" file="Orf_T7Out_Gff.gff3"/> </test> <test> <param name="input_file" value="Orf_In2.fasta"/> <param name="table" value="1"/> <param name="ftype" value="CDS"/> <param name="min_len" value="10"/> <param name="strand" value="forward"/> <output name="out_nuc_file" file="Orf_Out2T1_Nuc.fasta"/> <output name="out_prot_file" file="Orf_Out2T1_AA.fasta"/> <output name="out_bed_file" file="Orf_Out2T1_Bed.bed"/> <output name="out_gff3_file" file="Orf_Out2T1_Gff.gff3"/> </test> <test> <param name="input_file" value="Orf_In2.fasta"/> <param name="table" value="11"/> <param name="ftype" value="CDS"/> <param name="min_len" value="10"/> <param name="strand" value="forward"/> <output name="out_nuc_file" file="Orf_Out2T11_Nuc.fasta"/> <output name="out_prot_file" file="Orf_Out2T11_AA.fasta"/> <output name="out_bed_file" file="Orf_Out2T11_Bed.bed"/> <output name="out_gff3_file" file="Orf_Out2T11_Gff.gff3"/> </test> </tests> <help> **What it does** Takes an input file of nucleotide sequences (typically FASTA, but also FASTQ and Standard Flowgram Format (SFF) are supported), and searches each sequence for open reading frames (ORFs) or potential coding sequences (CDSs) of the given minimum length. These are returned as FASTA files of nucleotides and protein sequences. You can choose to have all the ORFs/CDSs above the minimum length for each sequence (similar to the EMBOSS getorf tool), those with the longest length equal, or the first ORF/CDS with the longest length (in the special case where a sequence encodes two or more long ORFs/CDSs of the same length). The last option is a reasonable choice when the input sequences represent EST or mRNA sequences, where only one ORF/CDS is expected. Note that if no ORFs/CDSs in a sequence match the criteria, there will be no output for that sequence. Also note that the ORFs/CDSs are assigned modified identifiers to distinguish them from the original full length sequences, by appending a suffix. The start and stop codons are taken from the `NCBI Genetic Codes <http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi>`_. When searching for ORFs, the sequences will run from stop codon to stop codon, and any start codons are ignored. When searching for CDSs, the first potential start codon will be used, giving the longest possible CDS within each ORF, and thus the longest possible protein sequence. This is useful for things like BLAST or domain searching, but since this may not be the correct start codon, it may not be appropriate for signal peptide detection etc. **Example Usage** Given some EST sequences (Sanger capillary reads) assembled into unigenes, or a transcriptome assembly from some RNA-Seq, each of your nucleotide sequences should (barring sequencing, assembly errors, frame-shifts etc) encode one protein as a single ORF/CDS, which you wish to extract (and perhaps translate into amino acids). If your RNA-Seq data was strand specific, and assembled taking this into account, you should only search for ORFs/CDSs on the forward strand. **Citation** If you use this Galaxy tool in work leading to a scientific publication please cite the following paper: Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology. PeerJ 1:e167 http://dx.doi.org/10.7717/peerj.167 This tool uses Biopython, so you may also wish to cite the Biopython application note (and Galaxy too of course): Cock et al (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878. This tool is available to install into other Galaxy Instances via the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/get_orfs_or_cdss </help> <citations> <citation type="doi">10.7717/peerj.167</citation> <citation type="doi">10.1093/bioinformatics/btp163</citation> </citations> </tool>