# HG changeset patch
# User bgruening
# Date 1377240855 14400
# Node ID 9d5515db592046994f3bd3e7926ba5cf5ffa361d
# Parent 2405efd751a01b492ee65ef6b90309ec5f8ec8df
Uploaded

diff -r 2405efd751a0 -r 9d5515db5920 glimmer3_with_trainingset.ga
--- a/glimmer3_with_trainingset.ga	Mon Aug 12 12:56:19 2013 -0400
+++ b/glimmer3_with_trainingset.ga	Fri Aug 23 02:54:15 2013 -0400
@@ -1,16 +1,16 @@
 {
     "a_galaxy_workflow": "true",
-    "annotation": "",
+    "annotation": "Runs Glimmer3 to predict genes in microbial DNA using a training set of known genes.\n\nThis workflow is available on the Galaxy Tool Shed with a README file giving more information including sample data, and full citation details (Cock et al. 2013).",
     "format-version": "0.1",
     "name": "Glimmer gene calling with trainingset.",
     "steps": {
         "0": {
-            "annotation": "The set of genes is used to train the gene calling of glimmer.",
+            "annotation": "Nucleotide FASTA file of gene sequences used to train the gene calling of glimmer.",
             "id": 0,
             "input_connections": {},
             "inputs": [
                 {
-                    "description": "The set of genes is used to train the gene calling of glimmer.",
+                    "description": "Nucleotide FASTA file of gene sequences used to train the gene calling of glimmer.",
                     "name": "Set of Genes"
                 }
             ],
@@ -28,12 +28,12 @@
             "user_outputs": []
         },
         "1": {
-            "annotation": "",
+            "annotation": "Nucleotide FASTA file of genomic DNA (e.g. assembled contigs) to find genes in.",
             "id": 1,
             "input_connections": {},
             "inputs": [
                 {
-                    "description": "",
+                    "description": "Nucleotide FASTA file of genomic DNA (e.g. assembled contigs) to find genes in.",
                     "name": "Input Genomic Sequence"
                 }
             ],
diff -r 2405efd751a0 -r 9d5515db5920 readme.rst
--- a/readme.rst	Mon Aug 12 12:56:19 2013 -0400
+++ b/readme.rst	Fri Aug 23 02:54:15 2013 -0400
@@ -1,28 +1,108 @@
-==============================
-Glimmer3 gene calling workflow
-==============================
+This package is a Galaxy workflow for gene prediction using Glimmer3.
+
+It uses the Glimmer3 tool (Delcher et al. 2007) trained on a known set of
+genes to generate gene predictions on a new genome, and then calls EMBOSS
+(Rice et al. 2000) to translate the predictions into a FASTA file of
+predicted protein sequences. The workflow requires two input files:
+
+* Nucleotide FASTA file of know gene sequences (training set)
+* Nucleotide FASTA file of genome sequence or assembled contigs
+
+First an interpolated context model (ICM) is built from the set of known
+genes, preferably from the closest relative organism(s) available. Next this
+ICM model is used to predict genes on the genomic FASTA file. This produces
+a FASTA file of the predicted gene nucleotide sequences, which is translated
+into protein sequences using the EMBOSS tool transeq.
+
+Glimmer is intended for finding genes in microbial DNA, especially bacteria,
+archaea, and viruses.
+
+See http://www.galaxyproject.org for information about the Galaxy Project.
 
-This Tool Shed Repository contains a workflow for the gene prediction of from a given nucleotide FASTA file.
+
+Sample Data
+===========
+
+As an example, we will use the first public assembly of the 2011 Shiga-toxin
+producing *Escherichia coli* O104:H4 outbreak in Germany. This was part of the
+open-source crowd-sourcing analysis described in Rohde et al. (2011) and here:
+https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki
 
-At first an interpolated context model (ICM) is build from a know set of genes, preferable from the closest relative available organism(s). In a following step this ICM model is used to predict genes on the second input. The output is a FASTA file with nucleotide sequences that is further converted to proteins sequences.
+You can upload this assembly directly into Galaxy using the "Upload File" tool
+with either of these URLs - Galaxy should recognise this is a FASTA file with
+3,057 sequences:
 
-To run that worflow glimmer_ und the EMBOSS_ suite is required. Both can be installed from the Tool Shed.
+* http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt
+* https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/seqProject/BGI/assemblies/NickLoman/TY2482.fasta.txt
+
+This FASTA file ``TY2482.fasta.txt`` was the initial TY-2482 strain assembled
+by Nick Loman from 5 runs of Ion Torrent data released by the BGI, using the
+MIRA 3.2 assembler. It was initially released via his blog,
+http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/
 
-.. _glimmer: http://www.cbcb.umd.edu/software/glimmer/
-.. _EMBOSS: http://emboss.sourceforge.net/
+We will also need a training set of known *E. coli* genes, for example the
+model strain *Escherichia coli* str. K-12 substr. MG1655 which is well
+annotated. You can upload the NCBI FASTA file ``NC_000913.ffn`` of the
+gene nucleotide sequences directly into Galaxy via this URL, which Galaxy
+should recognise as a FASTA file with 4,321 sequences:
+
+* ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn
+
+Then run the workflow, which should produce 2,333 predicted genes for the
+TY2482 assembly (two FASTA files, nucleotide and protein sequences).
+
 
-| A. L. Delcher, K.A. Bratke, E.C. Powers, and S.L. Salzberg. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics (Advance online version) (2007).
+Citation
+========
+
+If you use this workflow directly, or a derivative of it, or the associated
+Glimmer wrappers for Galaxy, in work leading to a scientific publication,
+please cite:
+
+Cock, P.J.A., Grüning, B., Paszkiewicz, K. and Pritchard, L. (2013)
+Galaxy tools and workflows for sequence analysis with applications in
+molecular plant pathology. (Submitted).
+
+For Glimmer3 please cite:
 
-EMBOSS: The European Molecular Biology Open Software Suite (2000)
-Rice,P. Longden,I. and Bleasby,A.
-Trends in Genetics 16, (6) pp276--277
+Delcher, A.L., Bratke, K.A., Powers, E.C., and Salzberg, S.L. (2007)
+Identifying bacterial genes and endosymbiont DNA with Glimmer.
+Bioinformatics 23(6), 673-679.
+http://dx.doi.org/10.1093/bioinformatics/btm009
+
+For EMBOSS please cite:
+
+Rice, P., Longden, I. and Bleasby, A. (2000)
+EMBOSS: The European Molecular Biology Open Software Suite
+Trends in Genetics 16(6), 276-277.
+http://dx.doi.org/10.1016/S0168-9525(00)02024-2
 
-************
+
+Additional References
+=====================
+
+Rohde, H., Qin, J., Cui, Y., Li, D., Loman, N.J., et al. (2011)
+Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4.
+New England Journal of Medicine 365, 718-724.
+http://dx.doi.org/10.1056/NEJMoa1107643
+
+
 Availability
-************
+============
 
 This workflow is available on the main Galaxy Tool Shed:
+
 http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow
 
 Development is being done on github:
+
 https://github.com/bgruening/galaxytools/workflows/glimmer3/
+
+
+Dependencies
+============
+
+These dependencies should be resolved automatically via the Galaxy Tool Shed:
+
+* http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer3
+* http://toolshed.g2.bx.psu.edu/view/devteam/emboss_5
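
Note (not part of the patch): the new Sample Data section of readme.rst quotes expected record counts (3,057 sequences in TY2482.fasta.txt, 4,321 in NC_000913.ffn, and 2,333 predicted genes). A minimal Python sketch for checking such counts on local copies of the files could look like the following. The function names are illustrative and do not come from the workflow or from Galaxy, and the sketch assumes plain, uncompressed FASTA in which every record begins with a '>' header line.

```python
def count_fasta_records(lines):
    """Count FASTA records in an iterable of text lines.

    Assumption: each record starts with exactly one header line
    beginning with '>'; sequence lines never start with '>'.
    """
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path):
    """Count records in a FASTA file at `path` (a hypothetical local copy)."""
    with open(path) as handle:
        return count_fasta_records(handle)


# Example usage, assuming the files were downloaded to the working directory:
#   count_fasta_file("TY2482.fasta.txt")
#   count_fasta_file("NC_000913.ffn")
```

Comparing these counts with the figures quoted in the readme is a quick sanity check that the uploads completed before running the workflow.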