comparison README.rst @ 0:68d65aeb3567 draft

Uploaded v0.0.1
author peterjc
date Mon, 30 Mar 2015 11:25:10 -0400
parents
children 165f0b05fa25
-1:000000000000 0:68d65aeb3567
1 Introduction
2 ============
3
4 Galaxy is a web-based platform for biological data analysis, supporting
5 extension with additional tools (often wrappers for existing command line
6 tools) and datatypes. See http://www.galaxyproject.org/ and the public
7 server at http://usegalaxy.org for an example.
8
9 The NCBI BLAST suite is a widely used set of tools for biological sequence
10 comparison. It is available as standalone binaries for use at the command
11 line, and via the NCBI website for smaller searches. For more details see
12 http://blast.ncbi.nlm.nih.gov/Blast.cgi
13
14 This is an example workflow using the Galaxy wrappers for NCBI BLAST+,
15 see https://github.com/peterjc/galaxy_blast
16
17
18 Galaxy workflow for counting species of top BLAST hits
19 ======================================================
20
21 This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an
22 initial assessment of a transcriptome assembly to give a crude indication of
23 any major contamination present based on the species of the top BLAST hit
24 of 1000 representative sequences.
25
26 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png
27
28 In words, the workflow proceeds as follows:
29
30 1. Upload/import your transcriptome assembly or any nucleotide FASTA file.
31 2. Sample 1000 representative sequences, selected uniformly/evenly through
32 the file.
33 3. Convert the sampled FASTA file into a three column tabular file.
34 4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr``
35 database (assuming this is already set up on your local Galaxy
36 under the alias ``nr``), requesting tabular output including the taxonomy
37 fields, and at most one matching target sequence.
38 5. Remove any duplicate alignments (multiple HSPs for the same match).
39 6. Combine the filtered BLAST output with the tabular version of the 1000
40 sequences to give a new tabular file with exactly 1000 lines, adding
41 ``None`` for sequences missing a BLAST hit.
42 7. Count the BLAST species names in this file.
43 8. Sort the counts.
44
45 Finally we would suggest visualising the sorted tally table as a Pie Chart.
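
For readers who want to see what steps 4 to 8 amount to outside Galaxy, here
is a rough command line sketch. It is not the workflow itself: the file names,
the function names, and the choice of ``6 std sblastnames`` as the output
format (which puts the BLAST name in column 13) are all assumptions made for
illustration, and the taxonomy columns are only filled in if the NCBI taxonomy
database files are installed alongside ``nr``.

```shell
#!/bin/sh
# Hypothetical sketch of workflow steps 4-8; names and columns are assumptions.

# Step 4: BLASTX against nr, at most one target per query, tabular output
# with the 12 standard columns plus the subject BLAST name as column 13.
run_blastx() {
    # $1 = sampled FASTA file, $2 = tabular output file
    blastx -query "$1" -db nr -max_target_seqs 1 \
        -outfmt "6 std sblastnames" -out "$2"
}

# Steps 5-8: keep one line per query (drops extra HSPs), label queries with
# no hit as "None", then count the names and sort by descending count.
tally_blast_names() {
    # $1 = tabular BLAST output, $2 = tabular file with all query IDs in column 1
    awk -F '\t' '
        FILENAME == ARGV[1] { if (!($1 in name)) name[$1] = $13; next }
        { print (($1 in name) ? name[$1] : "None") }
    ' "$1" "$2" |
        sort | uniq -c | sort -k1,1nr |
        # Rewrite "   N name" as "N<TAB>name", keeping spaces inside the name.
        awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}
```

Something like ``run_blastx sampled.fasta hits.tsv`` followed by
``tally_blast_names hits.tsv sampled.tsv`` would then print a two column
count table similar to the one produced by the workflow.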
46
47
48 Sample Data
49 ===========
50
51 As an example, you can upload the transcriptome assembly of the nematode
52 *Nacobbus aberrans* from Eves van den Akker *et al.* (2015),
53 http://dx.doi.org/10.1093/gbe/evu171 using this URL:
54
55 http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip
56
57 Running this workflow with a copy of the NCBI non-redundant ``nr`` database
58 from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave
59 the following results - note 609 out of the 1000 sequences gave no BLAST hit.
60
61 ===== ==================
62 Count Subject Blast Name
63 ----- ------------------
64 609 None
65 244 nematodes
66 30 ascomycetes
67 27 eukaryotes
68 8 basidiomycetes
69 6 aphids
70 5 eudicots
71 5 flies
72 ... ...
73 ===== ==================
74
75 As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,
76 this transcriptome assembly has already had obvious contamination removed.
77
78 At the time of writing, Galaxy's visualizations could not be included in
79 a workflow. You can generate a pie chart from the final count file using
80 the counts (c1) and labels (c2), like this:
81
82 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png
83
84 Note the nematode count in this image was shown as a mouse-over effect.
85
86
87 Disclaimer
88 ==========
89
90 Species assignment by top BLAST hit is not suitable for any in-depth
91 analysis. It is particularly prone to false positives where contaminants
92 in public datasets are mislabelled. See for example Ed Yong (2015),
93 "There's No Plague on the NYC Subway. No Platypuses Either.":
94
95 http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/
96
97
98 Known Issues
99 ============
100
101 This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with
102 the current stable release (Galaxy v15.03, i.e. March 2015).
103
104 The updated "Count" tool version 1.0.1 includes a fix not to remove spaces
105 in the fields being counted. In the example above, while the top hits are
106 not affected, minor entries like "cellular slime molds" are shown as
107 "cellularslimemolds" instead (look closely at the Pie Chart key).
108
109 The updated "Count" tool version 1.0.1 also adds a new option to sort the
110 output, which avoids the additional sorting step in the current version of
111 the workflow.
112
113 A future update to this workflow will use the revised "Count" tool, once
114 this is included in the next stable Galaxy release - or migrated to the
115 Galaxy Tool Shed.
116
117
118 Availability
119 ============
120
121 This workflow is available to download and/or install from the main Galaxy Tool Shed:
122
123 http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
124
125 Test releases (which should not normally be used) are on the Test Tool Shed:
126
127 http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
128
129 Development is being done on GitHub here:
130
131 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
132
133
134 Citation
135 ========
136
137 Please cite the following paper (currently available as a preprint):
138
139 NCBI BLAST+ integrated into Galaxy.
140 P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
141 bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint)
142
143 You should also cite Galaxy, and the NCBI BLAST+ tools:
144
145 BLAST+: architecture and applications.
146 C. Camacho et al. BMC Bioinformatics 2009, 10:421.
147 DOI: http://dx.doi.org/10.1186/1471-2105-10-421
148
149
150 Automated Installation
151 ======================
152
153 Installation via the Galaxy Tool Shed should take care of the dependencies
154 on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries.
155
156 However, this workflow requires a current version of the NCBI nr protein
157 BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower
158 case).
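
As a quick sanity check before running the workflow, you could confirm that
the key is present. The sketch below assumes the usual tab separated layout of
the ``.loc`` file (unique ID, caption, database path, with ``#`` comment
lines); the function name and the path in the usage note are hypothetical.

```shell
#!/bin/sh
# Succeeds only if the given .loc file has a non-comment entry whose first
# tab separated column is exactly "nr" (lower case).
has_nr_entry() {
    # $1 = path to blastdb_p.loc (location depends on your Galaxy setup)
    awk -F '\t' '!/^#/ && $1 == "nr" { found = 1 } END { exit !found }' "$1"
}
```

For example, ``has_nr_entry tool-data/blastdb_p.loc || echo "nr is missing"``
would report a missing or differently named entry.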
159
160
161 History
162 =======
163
164 ======= ======================================================================
165 Version Changes
166 ------- ----------------------------------------------------------------------
167 v0.1.0 - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29
168 ======= ======================================================================
169
170
171 Developers
172 ==========
173
174 This workflow is under source code control here:
175
176 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
177
178 To prepare the tarball for uploading to the Tool Shed, I use this:
179
180 $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png
181
182 Check this:
183
184 $ tar -tzf blast_top_hit_species.tar.gz
185 README.rst
186 repository_dependencies.xml
187 blast_top_hit_species.ga
188 blast_top_hit_species.png
189 N_abberans_piechart_mouseover.png
190
191
192 Licence (MIT)
193 =============
194
195 Permission is hereby granted, free of charge, to any person obtaining a copy
196 of this software and associated documentation files (the "Software"), to deal
197 in the Software without restriction, including without limitation the rights
198 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
199 copies of the Software, and to permit persons to whom the Software is
200 furnished to do so, subject to the following conditions:
201
202 The above copyright notice and this permission notice shall be included in
203 all copies or substantial portions of the Software.
204
205 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
206 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
207 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
208 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
209 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
210 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
211 THE SOFTWARE.