Galaxy workflow for the identification of candidate gene clusters
-----------------------------------------------------------------

This approach screens two proteins against all nucleotide sequences in the
NCBI nt database within hours on our cluster, yielding all organisms with an
interesting gene structure for further investigation. As usual in Galaxy
workflows, every parameter, including the proximity distance, can be changed,
and additional steps can easily be added, for example additional filtering to
refine the initial BLAST hits, or inclusion of a third query sequence.

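
The proximity screen described above can be illustrated outside Galaxy with a minimal Python sketch. This is not the workflow's actual implementation, which is assembled from Galaxy tools. It assumes that the hits for each query protein are available in the standard 12-column BLAST+ tabular format, where column 2 is the subject ID and columns 9 and 10 are the subject start and end. The function names, the query and subject identifiers, and the 10,000 bp threshold are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def parse_hits(tabular_text):
    """Map subject ID -> list of (start, end) from 12-column BLAST tabular text."""
    hits = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        sseqid, sstart, send = fields[1], int(fields[8]), int(fields[9])
        # Minus-strand hits are reported with sstart > send, so normalise them.
        hits[sseqid].append((min(sstart, send), max(sstart, send)))
    return hits


def gap(a, b):
    """Number of bases separating two intervals (0 if they overlap)."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def subjects_with_nearby_hits(hits_a, hits_b, max_distance):
    """Return subjects carrying a hit from each query within max_distance."""
    nearby = {}
    for sseqid in hits_a.keys() & hits_b.keys():
        pairs = [
            (a, b)
            for a in hits_a[sseqid]
            for b in hits_b[sseqid]
            if gap(a, b) <= max_distance
        ]
        if pairs:
            nearby[sseqid] = pairs
    return nearby


# Hypothetical hits for two query proteins (not real BLAST results).
HITS_A = "\n".join([
    "\t".join(["protA", "subj1", "60.0", "100", "40", "0", "1", "100", "1000", "1300", "1e-30", "120"]),
    "\t".join(["protA", "subj2", "55.0", "100", "45", "0", "1", "100", "500", "800", "1e-25", "110"]),
])
HITS_B = "\n".join([
    "\t".join(["protB", "subj1", "58.0", "100", "42", "0", "1", "100", "5000", "4700", "1e-28", "115"]),
    "\t".join(["protB", "subj2", "52.0", "100", "48", "0", "1", "100", "90000", "90300", "1e-20", "100"]),
])

result = subjects_with_nearby_hits(parse_hits(HITS_A), parse_hits(HITS_B), 10000)
print(sorted(result))  # subjects where both queries hit close together
```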
![Workflow Image](https://raw.githubusercontent.com/bgruening/galaxytools/master/workflows/ncbi_blast_plus/find_genes_located_nearby/find_genes_located_nearby.png)


Sample Data
===========

As an example, we will use two protein sequences from *Streptomyces aurantiacus*
that are part of a gene cluster responsible for metabolite production.

You can upload both sequences directly into Galaxy using the "Upload File" tool
with the following URLs; Galaxy should recognise them as FASTA files.

* https://raw.githubusercontent.com/bgruening/galaxytools/master/workflows/ncbi_blast_plus/find_genes_located_nearby/WP_037658548.fasta
* https://raw.githubusercontent.com/bgruening/galaxytools/master/workflows/ncbi_blast_plus/find_genes_located_nearby/WP_037658557.fasta

In addition, you can find both sequences on the NCBI server:

* http://www.ncbi.nlm.nih.gov/protein/739806622 (cytochrome P450)

```text
>gi|739806622|ref|WP_037658557.1| cytochrome P450 [Streptomyces aurantiacus]
MQRTCPFSVPPVYTKFREESPITQVVLPDGGKAWLVTKYDDVRAVMANPKLSSDRRAPDFPVVVPGQNAA
LAKHAPFMIILDGAEHAAARRPVISEFSVRRVAAMKPRIQEIVDGFIDDMLKMPKPVDLNQVFSLPVPSL
VVSEILGMPYEGHEYFMELAEILLRRTTDEQGRIAVSVELRKYMDKLVEEKIENPGDDLLSRQIELQRQQ
GGIDRPQLASLCLLVLLAGHETTANMINLGVFSMLTKPELLAEIKADPSKTPKAVDELLRFYTIPDFGAH
RLALDDVEIGGVLIRKGEAVIASTFAANRDPAVFDDPEELDFGRDARHHVAFGYGPHQCLGQNLGRLELQ
VVFDTLFRRLPELRLAVPEEELSFKSDALVYGLYELPVTW
```

* http://www.ncbi.nlm.nih.gov/protein/739806613 (beta-ACP synthase)

```text
>gi|739806613|ref|WP_037658548.1| beta-ACP synthase [Streptomyces aurantiacus]
MSGRRVVVTGMEVLAPGGVGTDNFWSLLSEGRTATRGITFFDPAQFRSRVAAEIDFDPYAHGLTPQEVRR
MDRAAQFAVVAARGAVADSGLDTDTLDPYRIGVTIGSAVGATMSLDEDYRVVSDAGRLDLVDHTYADPFF
YNYFVPSSFATEVARLVGAQGPSSVVSAGCTSGLDSVGYAVELIREGTADVMVAGATDAPISPITMACFD
AIKATTPRHDDPEHASRPFDDTRNGFVLGEGTAVFVLEELESARRRGARIYAEIAGYATRSNAYHMTGLR
PDGAEMAEAITVALDEARMNPTAIDYINAHGSGTKQNDRHETAAFKRSLGEHAYRTPVSSIKSMVGHSLG
AIGSIEIAASILAIQHDVVPPTANLHTPDPQCDLDYVPLNAREQIVDAVLTVGSGFGGFQSAMVLAQPER
NAA
```
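
Before uploading, it can be useful to confirm that a file really is well-formed FASTA, that is, a `>` header line followed by one or more sequence lines, as in the two records above. The following is a minimal sketch under that assumption. `read_fasta` and the toy records are hypothetical and are not part of Galaxy or the NCBI BLAST+ wrappers.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


# Toy input with two short, made-up records.
EXAMPLE = ">seq1 example protein\nMQRTCP\nFSVPPV\n>seq2\nMSGRR\n"

for name, seq in read_fasta(EXAMPLE):
    print(name, len(seq))
```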


Citation
========

If you use this workflow directly, or a derivative of it, or the associated
NCBI BLAST wrappers for Galaxy, in work leading to a scientific publication,
please cite:

Peter J. A. Cock, John M. Chilton, Björn Grüning, James E. Johnson, Nicola Soranzo
NCBI BLAST+ integrated into Galaxy

http://biorxiv.org/content/early/2015/01/21/014043
http://dx.doi.org/10.1101/014043


Availability
============

This workflow is available on the main Galaxy Tool Shed:

http://toolshed.g2.bx.psu.edu/view/bgruening/find_genes_located_nearby_workflow

Development is being done on GitHub:

https://github.com/bgruening/galaxytools/workflows/ncbi_blast_plus/


Dependencies
============

These dependencies should be resolved automatically via the Galaxy Tool Shed:

* http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus