Introduction
============

Galaxy is a web-based platform for biological data analysis, supporting
extension with additional tools (often wrappers for existing command line
tools) and datatypes. See http://www.galaxyproject.org/ and the public
server at http://usegalaxy.org for an example.

The NCBI BLAST suite is a widely used set of tools for biological sequence
comparison. It is available as standalone binaries for use at the command
line, and via the NCBI website for smaller searches. For more details see
http://blast.ncbi.nlm.nih.gov/Blast.cgi

This is an example workflow using the Galaxy wrappers for NCBI BLAST+,
see https://github.com/peterjc/galaxy_blast


Galaxy workflow for counting species of top BLAST hits
======================================================

This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an
initial assessment of a transcriptome assembly to give a crude indication of
any major contamination present based on the species of the top BLAST hit
of 1000 representative sequences.

.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png

In words, the workflow proceeds as follows:

1. Upload/import your transcriptome assembly or any nucleotide FASTA file.
2. Sample 1000 representative sequences, selected uniformly/evenly through
   the file.
3. Convert the sampled FASTA file into a three column tabular file.
4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr``
   database (assuming this is already set up on your local Galaxy
   under the alias ``nr``), requesting tabular output including the taxonomy
   fields, and at most one matching target sequence.
5. Remove any duplicate alignments (multiple HSPs for the same match).
6. Combine the filtered BLAST output with the tabular version of the 1000
   sequences to give a new tabular file with exactly 1000 lines, adding
   ``None`` for sequences missing a BLAST hit.
7. Count the BLAST species names in this file.
8. Sort the counts.

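
The sampling and conversion steps (2 and 3) can be sketched in plain Python.
This is a rough illustration, not the code of the Galaxy tools used by the
workflow: the function names ``read_fasta``, ``sample_evenly`` and
``to_tabular`` are hypothetical, taking every k-th record is only one reading
of "evenly through the file", and the three columns (identifier, description,
sequence) are an assumption.

.. code-block:: python

    def _split_header(header):
        """Split a FASTA title line into (identifier, description)."""
        identifier, _, description = header.partition(" ")
        return identifier, description.strip()


    def read_fasta(lines):
        """Yield (identifier, description, sequence) from FASTA lines."""
        header, chunks = None, []
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield _split_header(header) + ("".join(chunks),)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
        if header is not None:
            yield _split_header(header) + ("".join(chunks),)


    def sample_evenly(records, n=1000):
        """Return up to n records spaced evenly through the input."""
        records = list(records)
        if len(records) <= n:
            return records
        # Integer arithmetic gives n distinct, increasing indices
        # whenever there are more than n records.
        return [records[i * len(records) // n] for i in range(n)]


    def to_tabular(records):
        """Format each record as one tab-separated line (assumed layout)."""
        return ["\t".join(record) for record in records]

Files with fewer than ``n`` records are returned unchanged, so a small test
FASTA file still passes through the later steps.
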
Finally, we suggest visualising the sorted tally table as a pie chart.


Sample Data
===========

As an example, you can upload the transcriptome assembly of the nematode
*Nacobbus aberrans* from Eves van den Akker *et al.* (2015),
http://dx.doi.org/10.1093/gbe/evu171 using this URL:

http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip

Running this workflow with a copy of the NCBI non-redundant ``nr`` database
from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave
the following results. Note that 609 of the 1000 sequences gave no BLAST hit.

===== ==================
Count Subject Blast Name
----- ------------------
  609 None
  244 nematodes
   30 ascomycetes
   27 eukaryotes
    8 basidiomycetes
    6 aphids
    5 eudicots
    5 flies
  ... ...
===== ==================

As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,
this transcriptome assembly has already had obvious contamination removed.

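
A count table of this shape comes from the de-duplicate, combine, count and
sort steps of the workflow. The sketch below shows that logic in Python,
assuming the BLAST output has already been parsed into (query identifier,
subject BLAST name) pairs. The function name ``tally_top_hits`` is
hypothetical and this is not the code of the Galaxy tools themselves.

.. code-block:: python

    from collections import Counter


    def tally_top_hits(query_ids, blast_rows):
        """Count top-hit labels per query, using "None" where there is no hit.

        query_ids  -- identifiers of all sampled sequences
        blast_rows -- (query_id, label) pairs, possibly several per query
                      (one per HSP); the first row seen for a query wins
        """
        top_hit = {}
        for query_id, label in blast_rows:
            top_hit.setdefault(query_id, label)  # drop repeated HSP rows
        counts = Counter(top_hit.get(query_id, "None") for query_id in query_ids)
        # Largest count first; ties are broken alphabetically so the
        # output order is stable.
        return sorted(((n, label) for label, n in counts.items()),
                      key=lambda pair: (-pair[0], pair[1]))

Because every sampled identifier contributes exactly one label, the counts
always sum to the number of sampled sequences.
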
At the time of writing, Galaxy's visualizations could not be included in
a workflow. You can generate a pie chart from the final count file using
the counts (c1) and labels (c2), like this:

.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png

Note the nematode count in this image was shown as a mouse-over effect.


Disclaimer
==========

Species assignment by top BLAST hit is not suitable for any in-depth
analysis. It is particularly prone to false positives where contaminants
in public datasets are mislabelled. See for example Ed Yong (2015),
"There's No Plague on the NYC Subway. No Platypuses Either.":

http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/


Known Issues
============

This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with
the current stable release (Galaxy v15.03, i.e. March 2015).

The updated "Count" tool version 1.0.1 includes a fix not to remove spaces
in the fields being counted. In the example above, while the top hits are
not affected, minor entries like "cellular slime molds" are shown as
"cellularslimemolds" instead (look closely at the pie chart key).

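
The symptom is easy to picture with a toy tally in Python. The
``strip_spaces`` flag below only mimics the visible behaviour described
above. It is an illustration, not the Count tool's code, and the function
name ``count_labels`` is hypothetical.

.. code-block:: python

    from collections import Counter


    def count_labels(labels, strip_spaces=False):
        """Tally labels, optionally removing spaces first (the old symptom)."""
        if strip_spaces:
            labels = [label.replace(" ", "") for label in labels]
        return Counter(labels)

With ``strip_spaces=True`` the totals are unchanged, but a label such as
"cellular slime molds" is reported under the key "cellularslimemolds".
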
The updated "Count" tool version 1.0.1 also adds a new option to sort the
output, which avoids the additional sorting step in the current version of
the workflow.

A future update to this workflow will use the revised "Count" tool, once
this is included in the next stable Galaxy release or migrated to the
Galaxy Tool Shed.


Availability
============

This workflow is available to download and/or install from the main Galaxy Tool Shed:

http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species

Test releases (which should not normally be used) are on the Test Tool Shed:

http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species

Development is being done on GitHub here:

https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species


Citation
========

Please cite the following paper (currently available as a preprint):

NCBI BLAST+ integrated into Galaxy.
P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint)

You should also cite Galaxy, and the NCBI BLAST+ tools:

BLAST+: architecture and applications.
C. Camacho et al. BMC Bioinformatics 2009, 10:421.
DOI: http://dx.doi.org/10.1186/1471-2105-10-421


Automated Installation
======================

Installation via the Galaxy Tool Shed should take care of the dependencies
on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries.

However, this workflow requires a current version of the NCBI nr protein
BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower
case).

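
Before running the workflow it may help to confirm that this key is present.
The Python sketch below assumes the usual Galaxy ``.loc`` layout of
tab-separated columns, with the key in the first column and ``#`` marking
comment lines. The function name ``loc_has_key`` is hypothetical.

.. code-block:: python

    def loc_has_key(lines, key="nr"):
        """Return True if any data line has `key` in its first column.

        The comparison is case sensitive, so "NR" does not match "nr".
        """
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comments
            if line.rstrip("\n").split("\t")[0] == key:
                return True
        return False

The function accepts any iterable of lines, for example an open file handle
on your ``blastdb_p.loc`` file.

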
History
=======

======= ======================================================================
Version Changes
------- ----------------------------------------------------------------------
v0.1.0  - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29
======= ======================================================================


Developers
==========

This workflow is under source code control here:

https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species

To prepare the tar-ball for uploading to the Tool Shed, I use this::

    $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png

Check this::

    $ tar -tzf blast_top_hit_species.tar.gz
    README.rst
    repository_dependencies.xml
    blast_top_hit_species.ga
    blast_top_hit_species.png
    N_abberans_piechart_mouseover.png

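
The same check can be scripted. The Python sketch below uses the standard
library ``tarfile`` module. The function name ``missing_from_tarball`` is
hypothetical, and the expected file list is simply the one shown above.

.. code-block:: python

    import tarfile

    EXPECTED = [
        "README.rst",
        "repository_dependencies.xml",
        "blast_top_hit_species.ga",
        "blast_top_hit_species.png",
        "N_abberans_piechart_mouseover.png",
    ]


    def missing_from_tarball(path, expected=EXPECTED):
        """Return the expected file names absent from a gzipped tar-ball."""
        # Mode "r:gz" insists on gzip compression, so an archive written
        # without compression raises tarfile.ReadError instead of passing.
        with tarfile.open(path, "r:gz") as archive:
            names = set(archive.getnames())
        return [name for name in expected if name not in names]

An empty list means every expected file is in the archive.

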
Licence (MIT)
=============

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.