Repository 'blast_top_hit_species'
hg clone https://toolshed.g2.bx.psu.edu/repos/peterjc/blast_top_hit_species

Changeset 0:68d65aeb3567 (2015-03-30)
Next changeset 1:165f0b05fa25 (2015-03-30)
Commit message:
Uploaded v0.0.1
added:
N_abberans_piechart_mouseover.png
README.rst
blast_top_hit_species.ga
blast_top_hit_species.png
repository_dependencies.xml
diff -r 000000000000 -r 68d65aeb3567 N_abberans_piechart_mouseover.png
Binary file N_abberans_piechart_mouseover.png has changed
diff -r 000000000000 -r 68d65aeb3567 README.rst
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.rst Mon Mar 30 11:25:10 2015 -0400
b'@@ -0,0 +1,211 @@\n+Introduction\n+============\n+\n+Galaxy is a web-based platform for biological data analysis, supporting\n+extension with additional tools (often wrappers for existing command line\n+tools) and datatypes. See http://www.galaxyproject.org/ and the public\n+server at http://usegalaxy.org for an example.\n+\n+The NCBI BLAST suite is a widely used set of tools for biological sequence\n+comparison. It is available as standalone binaries for use at the command\n+line, and via the NCBI website for smaller searches. For more details see\n+http://blast.ncbi.nlm.nih.gov/Blast.cgi\n+\n+This is an example workflow using the Galaxy wrappers for NCBI BLAST+,\n+see https://github.com/peterjc/galaxy_blast\n+\n+\n+Galaxy workflow for counting species of top BLAST hits\n+======================================================\n+\n+This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an\n+initial assessment of a transcriptome assembly to give a crude indication of\n+any major contamination present based on the species of the top BLAST hit\n+of 1000 representative sequences.\n+\n+.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png\n+\n+In words, the workflow proceeds as follows:\n+\n+1. Upload/import your transcriptome assembly or any nucleotide FASTA file.\n+2. Sample 1000 representative sequences, selected uniformly/evenly through\n+   the file.\n+3. Convert the sampled FASTA file into a three-column tabular file.\n+4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr``\n+   database (assuming this is already set up on your local Galaxy\n+   under the alias ``nr``), requesting tabular output including the taxonomy\n+   fields, and at most one matching target sequence.\n+5. Remove any duplicate alignments (multiple HSPs for the same match).\n+6. 
Combine the filtered BLAST output with the tabular version of the 1000\n+   sequences to give a new tabular file with exactly 1000 lines, adding\n+   ``None`` for sequences missing a BLAST hit.\n+7. Count the BLAST species names in this file.\n+8. Sort the counts.\n+\n+Finally we would suggest visualising the sorted tally table as a Pie Chart.\n+\n+\n+Sample Data\n+===========\n+\n+As an example, you can upload the transcriptome assembly of the nematode\n+*Nacobbus aberrans* from Eves van den Akker *et al.* (2015),\n+http://dx.doi.org/10.1093/gbe/evu171 using this URL:\n+\n+http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip\n+\n+Running this workflow with a copy of the NCBI non-redundant ``nr`` database\n+from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave\n+the following results - note 609 out of the 1000 sequences gave no BLAST hit.\n+\n+===== ==================\n+Count Subject Blast Name\n+----- ------------------\n+  609 None\n+  244 nematodes\n+   30 ascomycetes\n+   27 eukaryotes\n+    8 basidiomycetes\n+    6 aphids\n+    5 eudicots\n+    5 flies\n+  ... ...\n+===== ==================\n+\n+As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,\n+this transcriptome assembly has already had obvious contamination removed.\n+\n+At the time of writing, Galaxy\'s visualizations could not be included in\n+a workflow. You can generate a pie chart from the final count file using\n+the counts (c1) and labels (c2), like this:\n+\n+.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png\n+\n+Note the nematode count in this image was shown as a mouse-over effect.\n+\n+\n+Disclaimer\n+==========\n+\n+Species assignment by top BLAST hit is not suitable for any in-depth\n+analysis. It is particularly prone to false positives where contaminants\n+in public datasets are mislabelled. 
See for example Ed Yong (2015),\n+"There\'s No Plague on the NYC Subway. No Platypuses Either.":\n+\n+http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-eithe'..b'v15.03, i.e. March 2015).\n+\n+The updated "Count" tool version 1.0.1 includes a fix not to remove spaces\n+in the fields being counted. In the example above, while the top hits are\n+not affected, minor entries like "cellular slime molds" are shown as\n+"cellularslimemolds" instead (look closely at the Pie Chart key).\n+\n+The updated "Count" tool version 1.0.1 also adds a new option to sort the\n+output, which avoids the additional sorting step in the current version of\n+the workflow.\n+\n+A future update to this workflow will use the revised "Count" tool, once\n+this is included in the next stable Galaxy release - or migrated to the\n+Galaxy Tool Shed.\n+\n+\n+Availability\n+============\n+\n+This workflow is available to download and/or install from the main Galaxy Tool Shed:\n+\n+http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species\n+\n+Test releases (which should not normally be used) are on the Test Tool Shed:\n+\n+http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species\n+\n+Development is being done on GitHub here:\n+\n+https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species\n+\n+\n+Citation\n+========\n+\n+Please cite the following paper (currently available as a preprint):\n+\n+NCBI BLAST+ integrated into Galaxy.\n+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo\n+bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint)\n+\n+You should also cite Galaxy, and the NCBI BLAST+ tools:\n+\n+BLAST+: architecture and applications.\n+C. Camacho et al. 
BMC Bioinformatics 2009, 10:421.\n+DOI: http://dx.doi.org/10.1186/1471-2105-10-421\n+\n+\n+Automated Installation\n+======================\n+\n+Installation via the Galaxy Tool Shed should take care of the dependencies\n+on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries.\n+\n+However, this workflow requires a current version of the NCBI nr protein\n+BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower\n+case).\n+\n+\n+History\n+=======\n+\n+======= ======================================================================\n+Version Changes\n+------- ----------------------------------------------------------------------\n+v0.1.0  - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29\n+======= ======================================================================\n+\n+\n+Developers\n+==========\n+\n+This workflow is under source code control here:\n+\n+https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species\n+\n+To prepare the tar-ball for uploading to the Tool Shed, I use this:\n+\n+    $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png\n+\n+Check this,\n+\n+    $ tar -tzf blast_top_hit_species.tar.gz\n+    README.rst\n+    repository_dependencies.xml\n+    blast_top_hit_species.ga\n+    blast_top_hit_species.png\n+    N_abberans_piechart_mouseover.png\n+\n+\n+Licence (MIT)\n+=============\n+\n+Permission is hereby granted, free of charge, to any person obtaining a copy\n+of this software and associated documentation files (the "Software"), to deal\n+in the Software without restriction, including without limitation the rights\n+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n+copies of the Software, and to permit persons to whom the Software is\n+furnished to do so, subject to the following conditions:\n+\n+The above copyright notice and this permission 
notice shall be included in\n+all copies or substantial portions of the Software.\n+\n+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\n+THE SOFTWARE.\n'
diff -r 000000000000 -r 68d65aeb3567 blast_top_hit_species.ga
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/blast_top_hit_species.ga Mon Mar 30 11:25:10 2015 -0400
b'@@ -0,0 +1,331 @@\n+{\n+    "a_galaxy_workflow": "true", \n+    "annotation": "", \n+    "format-version": "0.1", \n+    "name": "Species of top BLAST hits", \n+    "steps": {\n+        "0": {\n+            "annotation": "", \n+            "id": 0, \n+            "input_connections": {}, \n+            "inputs": [\n+                {\n+                    "description": "", \n+                    "name": "Transcriptome FASTA file"\n+                }\n+            ], \n+            "label": null, \n+            "name": "Input dataset", \n+            "outputs": [], \n+            "position": {\n+                "left": 242, \n+                "top": 119\n+            }, \n+            "tool_errors": null, \n+            "tool_id": null, \n+            "tool_state": "{\\"name\\": \\"Transcriptome FASTA file\\"}", \n+            "tool_version": null, \n+            "type": "data_input", \n+            "user_outputs": [], \n+            "uuid": "e445b44b-02a7-4fd1-8944-cd680f967062"\n+        }, \n+        "1": {\n+            "annotation": "This workflow is deliberately a simple/crude assessment, and there is no need to run BLASTX on all the sequences - a sample of 1000 should be enough.", \n+            "id": 1, \n+            "input_connections": {\n+                "input_file": {\n+                    "id": 0, \n+                    "output_name": "output"\n+                }\n+            }, \n+            "inputs": [], \n+            "label": null, \n+            "name": "Sub-sample sequences files", \n+            "outputs": [\n+                {\n+                    "name": "output_file", \n+                    "type": "input"\n+                }\n+            ], \n+            "position": {\n+                "left": 435, \n+                "top": 119\n+            }, \n+            "post_job_actions": {\n+                "RenameDatasetActionoutput_file": {\n+                    "action_arguments": {\n+                        "newname": "1000 sequences 
from #{input_file}"\n+                    }, \n+                    "action_type": "RenameDatasetAction", \n+                    "output_name": "output_file"\n+                }\n+            }, \n+            "tool_errors": null, \n+            "tool_id": "toolshed.g2.bx.psu.edu/repos/peterjc/sample_seqs/sample_seqs/0.2.1", \n+            "tool_state": "{\\"__page__\\": 0, \\"input_file\\": \\"null\\", \\"__rerun_remap_job_id__\\": null, \\"sampling\\": \\"{\\\\\\"count\\\\\\": \\\\\\"1000\\\\\\", \\\\\\"type\\\\\\": \\\\\\"desired_count\\\\\\", \\\\\\"__current_case__\\\\\\": 2}\\", \\"chromInfo\\": \\"\\\\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\\\\"\\", \\"interleaved\\": \\"\\\\\\"False\\\\\\"\\"}", \n+            "tool_version": "0.2.1", \n+            "type": "tool", \n+            "user_outputs": [], \n+            "uuid": "87ce69ef-5fb0-41b0-9575-d3b96544f8be"\n+        }, \n+        "2": {\n+            "annotation": "We only want one line per query, so limit this to the best scoring target sequence. 
Assumes current NCBI nr database is available locally as \\"nr\\".", \n+            "id": 2, \n+            "input_connections": {\n+                "query": {\n+                    "id": 1, \n+                    "output_name": "output_file"\n+                }\n+            }, \n+            "inputs": [], \n+            "label": null, \n+            "name": "NCBI BLAST+ blastx", \n+            "outputs": [\n+                {\n+                    "name": "output1", \n+                    "type": "tabular"\n+                }\n+            ], \n+            "position": {\n+                "left": 489, \n+                "top": 263\n+            }, \n+            "post_job_actions": {\n+                "RenameDatasetActionoutput1": {\n+                    "action_arguments": {\n+                        "newname": "Top BLAST match"\n+                    }, \n+                    "action_type": "RenameDatasetAction", \n+                    "output_name": "output1"\n+                }\n+            }, \n+            "tool_errors": null, \n+            "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_pl'..b'lue\\\\\\": \\\\\\"None\\\\\\", \\\\\\"__current_case__\\\\\\": 0}, \\\\\\"fill_columns_by\\\\\\": \\\\\\"fill_unjoined_only\\\\\\", \\\\\\"__current_case__\\\\\\": 1}\\", \\"unmatched\\": \\"\\\\\\"-u\\\\\\"\\", \\"input1\\": \\"null\\", \\"chromInfo\\": \\"\\\\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\\\\"\\"}", \n+            "tool_version": "2.0.2", \n+            "type": "tool", \n+            "user_outputs": [], \n+            "uuid": "4c280b0e-b4a6-4ae4-8a81-d6e93932ef71"\n+        }, \n+        "6": {\n+            "annotation": "Here we make a tally table of the BLAST species name column", \n+            "id": 6, \n+            "input_connections": {\n+                "input": {\n+                    "id": 5, \n+                    "output_name": "out_file1"\n+                }\n+            }, \n+            "inputs": [], 
\n+            "label": null, \n+            "name": "Count", \n+            "outputs": [\n+                {\n+                    "name": "out_file1", \n+                    "type": "tabular"\n+                }\n+            ], \n+            "position": {\n+                "left": 952, \n+                "top": 398\n+            }, \n+            "post_job_actions": {\n+                "HideDatasetActionout_file1": {\n+                    "action_arguments": {}, \n+                    "action_type": "HideDatasetAction", \n+                    "output_name": "out_file1"\n+                }, \n+                "RenameDatasetActionout_file1": {\n+                    "action_arguments": {\n+                        "newname": "Top BLAST hit species counts (unsorted)"\n+                    }, \n+                    "action_type": "RenameDatasetAction", \n+                    "output_name": "out_file1"\n+                }\n+            }, \n+            "tool_errors": null, \n+            "tool_id": "Count1", \n+            "tool_state": "{\\"__page__\\": 0, \\"column\\": \\"{\\\\\\"__class__\\\\\\": \\\\\\"UnvalidatedValue\\\\\\", \\\\\\"value\\\\\\": [\\\\\\"19\\\\\\"]}\\", \\"__rerun_remap_job_id__\\": null, \\"delim\\": \\"\\\\\\"T\\\\\\"\\", \\"input\\": \\"null\\", \\"chromInfo\\": \\"\\\\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\\\\"\\"}", \n+            "tool_version": "1.0.0", \n+            "type": "tool", \n+            "user_outputs": [], \n+            "uuid": "d3322137-1911-426d-87a7-c82b5fc16825"\n+        }, \n+        "7": {\n+            "annotation": "Sorting the counts makes the results easier to interpret directly.", \n+            "id": 7, \n+            "input_connections": {\n+                "input": {\n+                    "id": 6, \n+                    "output_name": "out_file1"\n+                }\n+            }, \n+            "inputs": [], \n+            "label": null, \n+            "name": "Sort", \n+            
"outputs": [\n+                {\n+                    "name": "out_file1", \n+                    "type": "input"\n+                }\n+            ], \n+            "position": {\n+                "left": 1056, \n+                "top": 506\n+            }, \n+            "post_job_actions": {\n+                "RenameDatasetActionout_file1": {\n+                    "action_arguments": {\n+                        "newname": "Top BLAST hit species counts"\n+                    }, \n+                    "action_type": "RenameDatasetAction", \n+                    "output_name": "out_file1"\n+                }\n+            }, \n+            "tool_errors": null, \n+            "tool_id": "sort1", \n+            "tool_state": "{\\"__page__\\": 0, \\"style\\": \\"\\\\\\"num\\\\\\"\\", \\"column\\": \\"{\\\\\\"__class__\\\\\\": \\\\\\"UnvalidatedValue\\\\\\", \\\\\\"value\\\\\\": \\\\\\"1\\\\\\"}\\", \\"__rerun_remap_job_id__\\": null, \\"column_set\\": \\"[]\\", \\"input\\": \\"null\\", \\"chromInfo\\": \\"\\\\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\\\\"\\", \\"order\\": \\"\\\\\\"DESC\\\\\\"\\"}", \n+            "tool_version": "1.0.3", \n+            "type": "tool", \n+            "user_outputs": [], \n+            "uuid": "c81cc61d-52a3-44ee-b646-b23e0e004c38"\n+        }\n+    }, \n+    "uuid": "9fe8754a-3a87-4f6a-89a2-141b02b4793e"\n+}\n\\ No newline at end of file\n'
diff -r 000000000000 -r 68d65aeb3567 blast_top_hit_species.png
Binary file blast_top_hit_species.png has changed
diff -r 000000000000 -r 68d65aeb3567 repository_dependencies.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/repository_dependencies.xml Mon Mar 30 11:25:10 2015 -0400
@@ -0,0 +1,9 @@
+<?xml version="1.0"?>
+<repositories description="This workflow requires the NCBI BLAST+ tools etc">
+    <repository changeset_revision="2fe07f50a41e" name="ncbi_blast_plus" owner="devteam" toolshed="https://toolshed.g2.bx.psu.edu" />
+    <repository changeset_revision="9d189d08f2ad" name="fasta_to_tabular" owner="devteam" toolshed="https://toolshed.g2.bx.psu.edu" />
+    <repository changeset_revision="02c13ef1a669" name="sample_seqs" owner="peterjc" toolshed="https://toolshed.g2.bx.psu.edu" />
+    <repository changeset_revision="7ce75adb93be" name="unique" owner="bgruening" toolshed="https://toolshed.g2.bx.psu.edu" />
+    <!-- Also uses tool_id join1, Count1, and sort1 which are currently
+         still shipped with Galaxy itself rather than via the Tool Shed -->
+</repositories>
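
To see what the dependency file above declares, the following is a minimal sketch that lists the pinned Tool Shed repositories using only the Python standard library. The XML is copied from the diff with the leading ``+`` markers removed. The helper name ``list_dependencies`` is hypothetical and is not part of Galaxy or the Tool Shed.

```python
import xml.etree.ElementTree as ET

# Contents of repository_dependencies.xml as added in this changeset
# (diff "+" prefixes stripped).
DEPENDENCIES_XML = """<?xml version="1.0"?>
<repositories description="This workflow requires the NCBI BLAST+ tools etc">
    <repository changeset_revision="2fe07f50a41e" name="ncbi_blast_plus" owner="devteam" toolshed="https://toolshed.g2.bx.psu.edu" />
    <repository changeset_revision="9d189d08f2ad" name="fasta_to_tabular" owner="devteam" toolshed="https://toolshed.g2.bx.psu.edu" />
    <repository changeset_revision="02c13ef1a669" name="sample_seqs" owner="peterjc" toolshed="https://toolshed.g2.bx.psu.edu" />
    <repository changeset_revision="7ce75adb93be" name="unique" owner="bgruening" toolshed="https://toolshed.g2.bx.psu.edu" />
    <!-- Also uses tool_id join1, Count1, and sort1 which are currently
         still shipped with Galaxy itself rather than via the Tool Shed -->
</repositories>
"""


def list_dependencies(xml_text):
    """Return (owner, name, changeset_revision) for each <repository> element.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    return [
        (repo.get("owner"), repo.get("name"), repo.get("changeset_revision"))
        for repo in root.iter("repository")
    ]


deps = list_dependencies(DEPENDENCIES_XML)
for owner, name, revision in deps:
    # One line per pinned repository, e.g. "owner/name @ revision".
    print(f"{owner}/{name} @ {revision}")
```

The built-in tools mentioned in the XML comment (join1, Count1, sort1) do not appear in this listing, since they are not declared as repository elements.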