diff README.rst @ 0:68d65aeb3567 draft
Uploaded v0.0.1
author | peterjc |
---|---|
date | Mon, 30 Mar 2015 11:25:10 -0400 |
parents | |
children | 165f0b05fa25 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.rst Mon Mar 30 11:25:10 2015 -0400
@@ -0,0 +1,211 @@
+Introduction
+============
+
+Galaxy is a web-based platform for biological data analysis, supporting
+extension with additional tools (often wrappers for existing command line
+tools) and datatypes. See http://www.galaxyproject.org/ and the public
+server at http://usegalaxy.org for an example.
+
+The NCBI BLAST suite is a widely used set of tools for biological sequence
+comparison. It is available as standalone binaries for use at the command
+line, and via the NCBI website for smaller searches. For more details see
+http://blast.ncbi.nlm.nih.gov/Blast.cgi
+
+This is an example workflow using the Galaxy wrappers for NCBI BLAST+,
+see https://github.com/peterjc/galaxy_blast
+
+
+Galaxy workflow for counting species of top BLAST hits
+======================================================
+
+This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an
+initial assessment of a transcriptome assembly to give a crude indication of
+any major contamination present based on the species of the top BLAST hit
+of 1000 representative sequences.
+
+.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png
+
+In words, the workflow proceeds as follows:
+
+1. Upload/import your transcriptome assembly or any nucleotide FASTA file.
+2. Sample 1000 representative sequences, selected evenly through
+   the file.
+3. Convert the sampled FASTA file into a three column tabular file.
+4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr``
+   database (assuming this is already set up on your local Galaxy
+   under the alias ``nr``), requesting tabular output including the taxonomy
+   fields, and at most one matching target sequence.
+5. Remove any duplicate alignments (multiple HSPs for the same match).
+6. Combine the filtered BLAST output with the tabular version of the 1000
+   sequences to give a new tabular file with exactly 1000 lines, adding
+   ``None`` for sequences missing a BLAST hit.
+7. Count the BLAST species names in this file.
+8. Sort the counts.
+
+Finally, we suggest visualising the sorted tally table as a pie chart.
+
+
+Sample Data
+===========
+
+As an example, you can upload the transcriptome assembly of the nematode
+*Nacobbus aberrans* from Eves-van den Akker *et al.* (2015),
+http://dx.doi.org/10.1093/gbe/evu171 using this URL:
+
+http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip
+
+Running this workflow with a copy of the NCBI non-redundant ``nr`` database
+from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave
+the following results. Note that 609 of the 1000 sequences gave no BLAST hit.
+
+===== ==================
+Count Subject Blast Name
+----- ------------------
+  609 None
+  244 nematodes
+   30 ascomycetes
+   27 eukaryotes
+    8 basidiomycetes
+    6 aphids
+    5 eudicots
+    5 flies
+  ... ...
+===== ==================
+
+As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,
+this transcriptome assembly has already had obvious contamination removed.
+
+At the time of writing, Galaxy's visualizations could not be included in
+a workflow. You can generate a pie chart from the final count file using
+the counts (c1) and labels (c2), like this:
+
+.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png
+
+Note the nematode count in this image was shown as a mouse-over effect.
+
+
+Disclaimer
+==========
+
+Species assignment by top BLAST hit is not suitable for any in-depth
+analysis. It is particularly prone to false positives where contaminants
+in public datasets are mislabelled. See for example Ed Yong (2015),
+"There's No Plague on the NYC Subway. No Platypuses Either.":
+
+http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/
+
+
+Known Issues
+============
+
+This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with
+the current stable release (Galaxy v15.03, i.e. March 2015).
+
+The updated "Count" tool version 1.0.1 includes a fix not to remove spaces
+in the fields being counted. In the example above, while the top hits are
+not affected, minor entries like "cellular slime molds" are shown as
+"cellularslimemolds" instead (look closely at the pie chart key).
+
+The updated "Count" tool version 1.0.1 also adds a new option to sort the
+output, which avoids the additional sorting step in the current version of
+the workflow.
+
+A future update to this workflow will use the revised "Count" tool, once
+this is included in the next stable Galaxy release or migrated to the
+Galaxy Tool Shed.
+
+
+Availability
+============
+
+This workflow is available to download and/or install from the main Galaxy Tool Shed:
+
+http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
+
+Test releases (which should not normally be used) are on the Test Tool Shed:
+
+http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
+
+Development is being done on GitHub here:
+
+https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
+
+
+Citation
+========
+
+Please cite the following paper (currently available as a preprint):
+
+NCBI BLAST+ integrated into Galaxy.
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
+bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint)
+
+You should also cite Galaxy, and the NCBI BLAST+ tools:
+
+BLAST+: architecture and applications.
+C. Camacho et al. BMC Bioinformatics 2009, 10:421.
+DOI: http://dx.doi.org/10.1186/1471-2105-10-421
+
+
+Automated Installation
+======================
+
+Installation via the Galaxy Tool Shed should take care of the dependencies
+on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries.
+
+However, this workflow requires a current version of the NCBI nr protein
+BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower
+case).
+
+
+History
+=======
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.1.0  - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29
+======= ======================================================================
+
+
+Developers
+==========
+
+This workflow is under source code control here:
+
+https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
+
+To prepare the tar-ball for uploading to the Tool Shed, I use this:
+
+    $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png
+
+Check this:
+
+    $ tar -tzf blast_top_hit_species.tar.gz
+    README.rst
+    repository_dependencies.xml
+    blast_top_hit_species.ga
+    blast_top_hit_species.png
+    N_abberans_piechart_mouseover.png
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.