comparison readme.rst @ 6:9d5515db5920 draft default tip

Uploaded
author bgruening
date Fri, 23 Aug 2013 02:54:15 -0400
parents ad01b12e0a0c
children
5:2405efd751a0 6:9d5515db5920
1 ============================== 1 This package is a Galaxy workflow for gene prediction using Glimmer3.
2 Glimmer3 gene calling workflow
3 ==============================
4 2
5 This Tool Shed Repository contains a workflow for the gene prediction of from a given nucleotide FASTA file. 3 It uses the Glimmer3 tool (Delcher et al. 2007) trained on a known set of
4 genes to generate gene predictions on a new genome, and then calls EMBOSS
5 (Rice et al. 2000) to translate the predictions into a FASTA file of
6 predicted protein sequences. The workflow requires two input files:
6 7
7 At first an interpolated context model (ICM) is build from a know set of genes, preferable from the closest relative available organism(s). In a following step this ICM model is used to predict genes on the second input. The output is a FASTA file with nucleotide sequences that is further converted to proteins sequences. 8 * Nucleotide FASTA file of known gene sequences (training set)
9 * Nucleotide FASTA file of genome sequence or assembled contigs
8 10
9 To run that worflow glimmer_ und the EMBOSS_ suite is required. Both can be installed from the Tool Shed. 11 First an interpolated context model (ICM) is built from the set of known
12 genes, preferably from the closest relative organism(s) available. Next this
13 ICM model is used to predict genes on the genomic FASTA file. This produces
14 a FASTA file of the predicted gene nucleotide sequences, which is translated
15 into protein sequences using the EMBOSS tool transeq.
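To illustrate what the final translation step does conceptually, here is a minimal Python sketch that translates a nucleotide sequence into protein using the standard genetic code. It is not the EMBOSS transeq implementation, and the function name ``translate`` is hypothetical. It assumes the sequence starts in frame, writes stop codons as ``*``, and writes any codon containing an ambiguous base as ``X``.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position
# (the ordering used by the NCBI translation tables).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate an in-frame nucleotide sequence into a protein string.

    Hypothetical helper for illustration only. Trailing bases that do not
    form a complete codon are ignored.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        # Codons with ambiguous bases (e.g. N) are reported as X.
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))  # prints MA*
```

In the workflow itself this step is carried out by transeq inside Galaxy, so the sketch is only useful for spot-checking individual predicted genes by hand.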
10 16
11 .. _glimmer: http://www.cbcb.umd.edu/software/glimmer/ 17 Glimmer is intended for finding genes in microbial DNA, especially bacteria,
12 .. _EMBOSS: http://emboss.sourceforge.net/ 18 archaea, and viruses.
13 19
14 | A. L. Delcher, K.A. Bratke, E.C. Powers, and S.L. Salzberg. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics (Advance online version) (2007). 20 See http://www.galaxyproject.org for information about the Galaxy Project.
15 21
16 EMBOSS: The European Molecular Biology Open Software Suite (2000)
17 Rice,P. Longden,I. and Bleasby,A.
18 Trends in Genetics 16, (6) pp276--277
19 22
20 ************ 23 Sample Data
24 ===========
25
26 As an example, we will use the first public assembly of the 2011 Shiga-toxin
27 producing *Escherichia coli* O104:H4 outbreak in Germany. This was part of the
28 open-source crowd-sourcing analysis described in Rohde et al. (2011) and here:
29 https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki
30
31 You can upload this assembly directly into Galaxy using the "Upload File" tool
32 with either of these URLs - Galaxy should recognise this is a FASTA file with
33 3,057 sequences:
34
35 * http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt
36 * https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/seqProject/BGI/assemblies/NickLoman/TY2482.fasta.txt
37
38 This FASTA file ``TY2482.fasta.txt`` was the initial TY-2482 strain assembled
39 by Nick Loman from 5 runs of Ion Torrent data released by the BGI, using the
40 MIRA 3.2 assembler. It was initially released via his blog,
41 http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/
42
43 We will also need a training set of known *E. coli* genes, for example the
44 model strain *Escherichia coli* str. K-12 substr. MG1655 which is well
45 annotated. You can upload the NCBI FASTA file ``NC_000913.ffn`` of the
46 gene nucleotide sequences directly into Galaxy via this URL, which Galaxy
47 should recognise as a FASTA file with 4,321 sequences:
48
49 * ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn
50
51 Then run the workflow, which should produce 2,333 predicted genes for the
52 TY2482 assembly (two FASTA files, nucleotide and protein sequences).
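If you want to check the sequence counts quoted above outside Galaxy, a minimal Python sketch such as the following counts FASTA records by their ``>`` header lines. The function name ``count_fasta_records`` is hypothetical, and the sketch assumes plain uncompressed FASTA text where each record starts with a ``>`` line.

```python
import io


def count_fasta_records(lines):
    """Count FASTA records in an iterable of text lines.

    Hypothetical helper: every line starting with '>' is treated as the
    header of one record.
    """
    return sum(1 for line in lines if line.startswith(">"))


# Small in-memory demonstration with two records.
demo = io.StringIO(">gene1\nATGGCC\nTAA\n>gene2\nATGTAA\n")
print(count_fasta_records(demo))  # prints 2
```

For a downloaded file, you could pass an open file handle instead, for example ``with open("TY2482.fasta.txt") as handle: print(count_fasta_records(handle))``, and compare the result with the counts Galaxy reports.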
53
54
55 Citation
56 ========
57
58 If you use this workflow directly, or a derivative of it, or the associated
59 Glimmer wrappers for Galaxy, in work leading to a scientific publication,
60 please cite:
61
62 Cock, P.J.A., Grüning, B., Paszkiewicz, K. and Pritchard, L. (2013)
63 Galaxy tools and workflows for sequence analysis with applications in
64 molecular plant pathology. (Submitted).
65
66 For Glimmer3 please cite:
67
68 Delcher, A.L., Bratke, K.A., Powers, E.C., and Salzberg, S.L. (2007)
69 Identifying bacterial genes and endosymbiont DNA with Glimmer.
70 Bioinformatics 23(6), 673-679.
71 http://dx.doi.org/10.1093/bioinformatics/btm009
72
73 For EMBOSS please cite:
74
75 Rice, P., Longden, I. and Bleasby, A. (2000)
76 EMBOSS: The European Molecular Biology Open Software Suite
77 Trends in Genetics 16(6), 276-277.
78 http://dx.doi.org/10.1016/S0168-9525(00)02024-2
79
80
81 Additional References
82 =====================
83
84 Rohde, H., Qin, J., Cui, Y., Li, D., Loman, N.J., et al. (2011)
85 Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4.
86 New England Journal of Medicine 365, 718-724.
87 http://dx.doi.org/10.1056/NEJMoa1107643
88
89
21 Availability 90 Availability
22 ************ 91 ============
23 92
24 This workflow is available on the main Galaxy Tool Shed: 93 This workflow is available on the main Galaxy Tool Shed:
94
25 http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow 95 http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow
26 96
27 Development is being done on github: 97 Development is being done on github:
98
28 https://github.com/bgruening/galaxytools/workflows/glimmer3/ 99 https://github.com/bgruening/galaxytools/workflows/glimmer3/
100
101
102 Dependencies
103 ============
104
105 These dependencies should be resolved automatically via the Galaxy Tool Shed:
106
107 * http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer3
108 * http://toolshed.g2.bx.psu.edu/view/devteam/emboss_5