view readme.md @ 0:db68266f7364 draft

Uploaded
author bgruening
date Tue, 24 Feb 2015 04:48:44 -0500

Galaxy workflow for the identification of candidate gene clusters
-----------------------------------------------------------------

This approach screens two protein sequences against all nucleotide sequences in the
NCBI nt database within hours on our cluster, identifying all organisms with an
interesting gene structure for further investigation. As usual in Galaxy workflows,
every parameter, including the proximity distance, can be changed, and additional
steps can easily be added, for example extra filtering to refine the initial BLAST
hits or the inclusion of a third query sequence.

![Workflow Image](https://raw.githubusercontent.com/bgruening/galaxytools/master/workflows/ncbi_blast_plus/find_genes_located_nearby/find_genes_located_nearby.png)
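
The proximity idea behind the workflow can be sketched outside Galaxy in a few lines of Python. This is only an illustration, not the workflow's actual implementation: it assumes BLAST hits in the standard 12-column tabular format (`-outfmt 6`), and the function names, the subject identifiers, the hit values and the 10000 bp default for `max_distance` are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    query: str    # query sequence identifier
    subject: str  # subject (database) sequence identifier
    start: int    # leftmost subject coordinate of the hit
    end: int      # rightmost subject coordinate of the hit


def parse_blast_tabular(lines):
    """Parse standard 12-column BLAST+ tabular output into Hit records."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        sstart, send = int(cols[8]), int(cols[9])
        # Reverse-strand hits have sstart > send, so normalise the order.
        hits.append(Hit(cols[0], cols[1], min(sstart, send), max(sstart, send)))
    return hits


def gap(a, b):
    """Distance between the nearest ends of two hits; 0 if they overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def find_nearby_pairs(hits, max_distance=10000):
    """Return (subject, hit_a, hit_b) for hits from different queries
    that lie within max_distance of each other on the same subject."""
    by_subject = defaultdict(list)
    for hit in hits:
        by_subject[hit.subject].append(hit)
    pairs = []
    for subject, subject_hits in by_subject.items():
        for i, a in enumerate(subject_hits):
            for b in subject_hits[i + 1:]:
                if a.query != b.query and gap(a, b) <= max_distance:
                    pairs.append((subject, a, b))
    return pairs


# Hypothetical hits: both queries hit subject_A close together,
# while on subject_B the two hits are far apart.
EXAMPLE_LINES = [
    "WP_037658557.1\tsubject_A\t85.0\t400\t60\t0\t1\t400\t1000\t2200\t1e-100\t500",
    "WP_037658548.1\tsubject_A\t80.0\t420\t84\t0\t1\t420\t5000\t3741\t1e-90\t450",
    "WP_037658557.1\tsubject_B\t70.0\t400\t120\t0\t1\t400\t100\t1300\t1e-50\t300",
    "WP_037658548.1\tsubject_B\t65.0\t420\t147\t0\t1\t420\t900000\t901260\t1e-40\t250",
]

if __name__ == "__main__":
    for subject, a, b in find_nearby_pairs(parse_blast_tabular(EXAMPLE_LINES)):
        print(subject, a.query, b.query, gap(a, b))
```

With the example data only `subject_A` is reported, because the two hits on `subject_B` are further apart than `max_distance`. Lowering or raising that value plays the same role as changing the proximity distance parameter in the workflow.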


Sample Data
===========

As an example, we will use two protein sequences from *Streptomyces aurantiacus*
that are part of a gene cluster responsible for metabolite production.

You can upload both sequences directly into Galaxy using the "Upload File" tool
with these URLs; Galaxy should recognise them as FASTA files.

 * https://raw.githubusercontent.com/bgruening/galaxytools/master/workflows/ncbi_blast_plus/find_genes_located_nearby/WP_037658548.fasta
 * https://raw.githubusercontent.com/bgruening/galaxytools/master/workflows/ncbi_blast_plus/find_genes_located_nearby/WP_037658557.fasta
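
If you want to inspect the files outside Galaxy first, a minimal Python sketch along the following lines may help. It is not part of the workflow; the helper names `parse_fasta` and `fetch_fasta` are hypothetical, and `fetch_fasta` assumes the URLs above are reachable from your machine.

```python
import urllib.request


def parse_fasta(text):
    """Return a dict mapping each record's header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; drop the leading '>' from the header.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are wrapped, so collect and join them later.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def fetch_fasta(url):
    """Download a FASTA file and parse it (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling `fetch_fasta` with one of the URLs above should return a dictionary with a single entry, whose key is the FASTA header and whose value is the protein sequence without line breaks.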

In addition, you can find both sequences on the NCBI server:

 * http://www.ncbi.nlm.nih.gov/protein/739806622 (cytochrome P450)

```text
>gi|739806622|ref|WP_037658557.1| cytochrome P450 [Streptomyces aurantiacus]
MQRTCPFSVPPVYTKFREESPITQVVLPDGGKAWLVTKYDDVRAVMANPKLSSDRRAPDFPVVVPGQNAA
LAKHAPFMIILDGAEHAAARRPVISEFSVRRVAAMKPRIQEIVDGFIDDMLKMPKPVDLNQVFSLPVPSL
VVSEILGMPYEGHEYFMELAEILLRRTTDEQGRIAVSVELRKYMDKLVEEKIENPGDDLLSRQIELQRQQ
GGIDRPQLASLCLLVLLAGHETTANMINLGVFSMLTKPELLAEIKADPSKTPKAVDELLRFYTIPDFGAH
RLALDDVEIGGVLIRKGEAVIASTFAANRDPAVFDDPEELDFGRDARHHVAFGYGPHQCLGQNLGRLELQ
VVFDTLFRRLPELRLAVPEEELSFKSDALVYGLYELPVTW
```

 * http://www.ncbi.nlm.nih.gov/protein/739806613 (beta-ACP synthase)

```text
>gi|739806613|ref|WP_037658548.1| beta-ACP synthase [Streptomyces aurantiacus]
MSGRRVVVTGMEVLAPGGVGTDNFWSLLSEGRTATRGITFFDPAQFRSRVAAEIDFDPYAHGLTPQEVRR
MDRAAQFAVVAARGAVADSGLDTDTLDPYRIGVTIGSAVGATMSLDEDYRVVSDAGRLDLVDHTYADPFF
YNYFVPSSFATEVARLVGAQGPSSVVSAGCTSGLDSVGYAVELIREGTADVMVAGATDAPISPITMACFD
AIKATTPRHDDPEHASRPFDDTRNGFVLGEGTAVFVLEELESARRRGARIYAEIAGYATRSNAYHMTGLR
PDGAEMAEAITVALDEARMNPTAIDYINAHGSGTKQNDRHETAAFKRSLGEHAYRTPVSSIKSMVGHSLG
AIGSIEIAASILAIQHDVVPPTANLHTPDPQCDLDYVPLNAREQIVDAVLTVGSGFGGFQSAMVLAQPER
NAA
```


Citation
========

If you use this workflow directly, or a derivative of it, or the associated
NCBI BLAST wrappers for Galaxy, in work leading to a scientific publication,
please cite:

Peter J. A. Cock, John M. Chilton, Björn Grüning, James E. Johnson, Nicola Soranzo
NCBI BLAST+ integrated into Galaxy

http://biorxiv.org/content/early/2015/01/21/014043
http://dx.doi.org/10.1101/014043


Availability
============

This workflow is available on the main Galaxy Tool Shed:

http://toolshed.g2.bx.psu.edu/view/bgruening/find_genes_located_nearby_workflow

Development takes place on GitHub:

https://github.com/bgruening/galaxytools/workflows/ncbi_blast_plus/


Dependencies
============

These dependencies should be resolved automatically via the Galaxy Tool Shed:

* http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus