Changeset 11:99b82a2b1272 (2013-04-03)

Previous changeset 10:09ff180d1615 (2013-03-27) Next changeset 12:6753a9261390 (2013-04-03)

Commit message:
Uploaded v0.2.0 which added PSORTb wrapper (written with Konrad Paszkiewicz)

modified:
tools/protein_analysis/LICENSE
tools/protein_analysis/README
tools/protein_analysis/signalp3.py
tools/protein_analysis/signalp3.xml
tools/protein_analysis/tmhmm2.py
tools/protein_analysis/tmhmm2.xml
tools/protein_analysis/wolf_psort.py
tools/protein_analysis/wolf_psort.xml

added:
test-data/four_human_proteins.blast2go.tabular
test-data/four_human_proteins.blastp_nr.top2.tabular
test-data/four_human_proteins.blastp_nr.top3.tabular
test-data/four_human_proteins.blastp_nr.top4.xml
test-data/k12_ten_proteins.fasta
test-data/k12_ten_proteins_psortb_p_terse.tabular
tools/protein_analysis/psortb.py
tools/protein_analysis/psortb.xml

removed:
test-data/four_human_proteins.fasta.orig

diff -r 09ff180d1615 -r 99b82a2b1272 test-data/four_human_proteins.blast2go.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/four_human_proteins.blast2go.tabular Wed Apr 03 10:49:10 2013 -0400

@@ -0,0 +1,96 @@
+sp|Q9BS26|ERP44_HUMAN GO:0005789 endoplasmic reticulum resident protein 44
+sp|Q9BS26|ERP44_HUMAN GO:0006457 endoplasmic reticulum resident protein 44
+sp|Q9BS26|ERP44_HUMAN GO:0005788 endoplasmic reticulum resident protein 44
+sp|Q9BS26|ERP44_HUMAN GO:0009100 endoplasmic reticulum resident protein 44
+sp|Q9BS26|ERP44_HUMAN GO:0045454 endoplasmic reticulum resident protein 44
+sp|Q9BS26|ERP44_HUMAN GO:0005515 endoplasmic reticulum resident protein 44
+sp|Q9BS26|ERP44_HUMAN GO:0005793 endoplasmic reticulum resident protein 44
+sp|Q9BS26|ERP44_HUMAN GO:0006986 endoplasmic reticulum resident protein 44
+sp|Q9BS26|ERP44_HUMAN GO:0003756 endoplasmic reticulum resident protein 44
+sp|Q9NSY1|BMP2K_HUMAN GO:0006468 bmp-2-inducible protein kinase
+sp|Q9NSY1|BMP2K_HUMAN GO:0004674 bmp-2-inducible protein kinase
+sp|Q9NSY1|BMP2K_HUMAN GO:0005730 bmp-2-inducible protein kinase
+sp|Q9NSY1|BMP2K_HUMAN GO:0030500 bmp-2-inducible protein kinase
+sp|Q9NSY1|BMP2K_HUMAN GO:0005524 bmp-2-inducible protein kinase
+sp|Q9NSY1|BMP2K_HUMAN GO:0019208 bmp-2-inducible protein kinase
+sp|P06213|INSR_HUMAN GO:0032148 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005525 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045995 insulin receptor
+sp|P06213|INSR_HUMAN GO:0023014 insulin receptor
+sp|P06213|INSR_HUMAN GO:0031995 insulin receptor
+sp|P06213|INSR_HUMAN GO:0043548 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005901 insulin receptor
+sp|P06213|INSR_HUMAN GO:0008284 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045429 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005625 insulin receptor
+sp|P06213|INSR_HUMAN GO:0009749 insulin receptor
+sp|P06213|INSR_HUMAN GO:0018108 insulin receptor
+sp|P06213|INSR_HUMAN GO:0051384 insulin receptor
+sp|P06213|INSR_HUMAN GO:0043423 insulin receptor
+sp|P06213|INSR_HUMAN GO:0046326 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005829 insulin receptor
+sp|P06213|INSR_HUMAN GO:0042169 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045725 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045821 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045471 insulin receptor
+sp|P06213|INSR_HUMAN GO:0043410 insulin receptor
+sp|P06213|INSR_HUMAN GO:0001933 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005792 insulin receptor
+sp|P06213|INSR_HUMAN GO:0031994 insulin receptor
+sp|P06213|INSR_HUMAN GO:0042593 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005159 insulin receptor
+sp|P06213|INSR_HUMAN GO:0004716 insulin receptor
+sp|P06213|INSR_HUMAN GO:0033574 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045740 insulin receptor
+sp|P06213|INSR_HUMAN GO:0043559 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045444 insulin receptor
+sp|P06213|INSR_HUMAN GO:0010310 insulin receptor
+sp|P06213|INSR_HUMAN GO:0010042 insulin receptor
+sp|P06213|INSR_HUMAN GO:0043560 insulin receptor
+sp|P06213|INSR_HUMAN GO:0010629 insulin receptor
+sp|P06213|INSR_HUMAN GO:0048639 insulin receptor
+sp|P06213|INSR_HUMAN GO:0032403 insulin receptor
+sp|P06213|INSR_HUMAN GO:0051290 insulin receptor
+sp|P06213|INSR_HUMAN GO:0014823 insulin receptor
+sp|P06213|INSR_HUMAN GO:0010008 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005009 insulin receptor
+sp|P06213|INSR_HUMAN GO:0003007 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005524 insulin receptor
+sp|P06213|INSR_HUMAN GO:0030335 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045202 insulin receptor
+sp|P06213|INSR_HUMAN GO:0030238 insulin receptor
+sp|P06213|INSR_HUMAN GO:0007186 insulin receptor
+sp|P06213|INSR_HUMAN GO:0019087 insulin receptor
+sp|P06213|INSR_HUMAN GO:0046777 insulin receptor
+sp|P06213|INSR_HUMAN GO:0019903 insulin receptor
+sp|P06213|INSR_HUMAN GO:0010560 insulin receptor
+sp|P06213|INSR_HUMAN GO:0051425 insulin receptor
+sp|P06213|INSR_HUMAN GO:0033280 insulin receptor
+sp|P06213|INSR_HUMAN GO:0071363 insulin receptor
+sp|P06213|INSR_HUMAN GO:0051897 insulin receptor
+sp|P06213|INSR_HUMAN GO:0032355 insulin receptor
+sp|P06213|INSR_HUMAN GO:0000187 insulin receptor
+sp|P06213|INSR_HUMAN GO:0045840 insulin receptor
+sp|P06213|INSR_HUMAN GO:0032410 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005634 insulin receptor
+sp|P06213|INSR_HUMAN GO:0005899 insulin receptor
+sp|P06213|INSR_HUMAN GO:0034612 insulin receptor
+sp|P06213|INSR_HUMAN GO:0031405 insulin receptor
+sp|P06213|INSR_HUMAN GO:0060267 insulin receptor
+sp|P06213|INSR_HUMAN GO:0031017 insulin receptor
+sp|P06213|INSR_HUMAN GO:0008286 insulin receptor
+sp|P06213|INSR_HUMAN GO:0006355 insulin receptor
+sp|P08100|OPSD_HUMAN GO:0016056 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0071482 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0006468 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0009586 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0046872 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0004930 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0018298 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0060342 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0042622 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0005515 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0060041 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0009881 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0005794 rhodopsin
+sp|P08100|OPSD_HUMAN GO:0005887 rhodopsin
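
The file added above holds one GO annotation per row: query ID, GO term, description. A minimal sketch of grouping the GO terms by query follows, assuming the real file is tab-separated (the tabs appear as spaces in this view). The function name `go_terms_by_query` is hypothetical and is not part of any wrapper in this changeset.

```python
from collections import defaultdict


def go_terms_by_query(lines):
    """Group GO term IDs by query ID from three-column tabular rows."""
    terms = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and any comment/header lines.
        if not line or line.startswith("#"):
            continue
        # Assumed layout: query ID, GO term, free-text description.
        query_id, go_id, _description = line.split("\t", 2)
        terms[query_id].append(go_id)
    return dict(terms)


# Three rows copied from the diff above, with tabs restored (an assumption).
sample = [
    "sp|Q9BS26|ERP44_HUMAN\tGO:0005789\tendoplasmic reticulum resident protein 44\n",
    "sp|Q9BS26|ERP44_HUMAN\tGO:0006457\tendoplasmic reticulum resident protein 44\n",
    "sp|Q9NSY1|BMP2K_HUMAN\tGO:0006468\tbmp-2-inducible protein kinase\n",
]
grouped = go_terms_by_query(sample)
```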

diff -r 09ff180d1615 -r 99b82a2b1272 test-data/four_human_proteins.blastp_nr.top2.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/four_human_proteins.blastp_nr.top2.tabular Wed Apr 03 10:49:10 2013 -0400

@@ -0,0 +1,5 @@
+#Query BLAST hit 1 BLAST hit 2
+sp|Q9BS26|ERP44_HUMAN gi|52487191|ref|NP_055866.1| endoplasmic reticulum resident protein 44 precursor [Homo sapiens] >gi|332832471|ref|XP_003312248.1| PREDICTED: endoplasmic reticulum resident protein 44 [Pan troglodytes] >gi|395740740|ref|XP_002820091.2| PREDICTED: endoplasmic reticulum resident protein 44 [Pongo abelii] >gi|397499932|ref|XP_003820684.1| PREDICTED: endoplasmic reticulum resident protein 44 [Pan paniscus] >gi|31077035|sp|Q9BS26.1|ERP44_HUMAN RecName: Full=Endoplasmic reticulum resident protein 44; Short=ER protein 44; Short=ERp44; AltName: Full=Thioredoxin domain-containing protein 4; Flags: Precursor >gi|13529224|gb|AAH05374.1| Endoplasmic reticulum protein 44 [Homo sapiens] >gi|18857865|emb|CAC87611.1| ERp44 protein [Homo sapiens] >gi|168267418|dbj|BAG09765.1| thioredoxin domain-containing protein 4 [synthetic construct] >gi|193786731|dbj|BAG52054.1| unnamed protein product [Homo sapiens] >gi|410223880|gb|JAA09159.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410253060|gb|JAA14497.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410293858|gb|JAA25529.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410329535|gb|JAA33714.1| endoplasmic reticulum protein 44 [Pan troglodytes] gi|3043670|dbj|BAA25499.1| KIAA0573 protein [Homo sapiens]
+sp|Q9NSY1|BMP2K_HUMAN gi|38787935|ref|NP_942595.1| BMP-2-inducible protein kinase isoform a [Homo sapiens] >gi|34222653|sp|Q9NSY1.2|BMP2K_HUMAN RecName: Full=BMP-2-inducible protein kinase; Short=BIKe gi|332819458|ref|XP_526576.3| PREDICTED: BMP-2-inducible protein kinase [Pan troglodytes]
+sp|P06213|INSR_HUMAN gi|119395736|ref|NP_000199.2| insulin receptor isoform Long preproprotein [Homo sapiens] >gi|308153655|sp|P06213.4|INSR_HUMAN RecName: Full=Insulin receptor; Short=IR; AltName: CD_antigen=CD220; Contains: RecName: Full=Insulin receptor subunit alpha; Contains: RecName: Full=Insulin receptor subunit beta; Flags: Precursor gi|386830|gb|AAA59452.1| insulin receptor [Homo sapiens]
+sp|P08100|OPSD_HUMAN gi|4506527|ref|NP_000530.1| rhodopsin [Homo sapiens] >gi|114589117|ref|XP_516740.2| PREDICTED: rhodopsin [Pan troglodytes] >gi|297670049|ref|XP_002813191.1| PREDICTED: rhodopsin [Pongo abelii] >gi|332231791|ref|XP_003265078.1| PREDICTED: rhodopsin [Nomascus leucogenys] >gi|397518622|ref|XP_003829483.1| PREDICTED: rhodopsin [Pan paniscus] >gi|426342073|ref|XP_004036340.1| PREDICTED: rhodopsin [Gorilla gorilla gorilla] >gi|129207|sp|P08100.1|OPSD_HUMAN RecName: Full=Rhodopsin; AltName: Full=Opsin-2 >gi|1236137|gb|AAC31763.1| rhodopsin [Homo sapiens] >gi|21928611|dbj|BAC05894.1| seven transmembrane helix receptor [Homo sapiens] >gi|31873264|emb|CAD97623.1| hypothetical protein [Homo sapiens] >gi|85567017|gb|AAI12107.1| Rhodopsin [Homo sapiens] >gi|85567192|gb|AAI12105.1| Rhodopsin [Homo sapiens] >gi|108752084|gb|AAI11452.1| RHO protein [synthetic construct] >gi|117644328|emb|CAL37658.1| hypothetical protein [synthetic construct] >gi|119599650|gb|EAW79244.1| rhodopsin (opsin 2, rod pigment) (retinitis pigmentosa 4, autosomal dominant) [Homo sapiens] >gi|208967306|dbj|BAG73667.1| rhodopsin [synthetic construct] gi|403268285|ref|XP_003926208.1| PREDICTED: rhodopsin [Saimiri boliviensis boliviensis]

diff -r 09ff180d1615 -r 99b82a2b1272 test-data/four_human_proteins.blastp_nr.top3.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/four_human_proteins.blastp_nr.top3.tabular Wed Apr 03 10:49:10 2013 -0400

@@ -0,0 +1,5 @@
+#Query BLAST hit 1 BLAST hit 2 BLAST hit 3
+sp|Q9BS26|ERP44_HUMAN gi|52487191|ref|NP_055866.1| endoplasmic reticulum resident protein 44 precursor [Homo sapiens] >gi|332832471|ref|XP_003312248.1| PREDICTED: endoplasmic reticulum resident protein 44 [Pan troglodytes] >gi|395740740|ref|XP_002820091.2| PREDICTED: endoplasmic reticulum resident protein 44 [Pongo abelii] >gi|397499932|ref|XP_003820684.1| PREDICTED: endoplasmic reticulum resident protein 44 [Pan paniscus] >gi|31077035|sp|Q9BS26.1|ERP44_HUMAN RecName: Full=Endoplasmic reticulum resident protein 44; Short=ER protein 44; Short=ERp44; AltName: Full=Thioredoxin domain-containing protein 4; Flags: Precursor >gi|13529224|gb|AAH05374.1| Endoplasmic reticulum protein 44 [Homo sapiens] >gi|18857865|emb|CAC87611.1| ERp44 protein [Homo sapiens] >gi|168267418|dbj|BAG09765.1| thioredoxin domain-containing protein 4 [synthetic construct] >gi|193786731|dbj|BAG52054.1| unnamed protein product [Homo sapiens] >gi|410223880|gb|JAA09159.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410253060|gb|JAA14497.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410293858|gb|JAA25529.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410329535|gb|JAA33714.1| endoplasmic reticulum protein 44 [Pan troglodytes] gi|3043670|dbj|BAA25499.1| KIAA0573 protein [Homo sapiens] gi|37183214|gb|AAQ89407.1| TXNDC4 [Homo sapiens]
+sp|Q9NSY1|BMP2K_HUMAN gi|38787935|ref|NP_942595.1| BMP-2-inducible protein kinase isoform a [Homo sapiens] >gi|34222653|sp|Q9NSY1.2|BMP2K_HUMAN RecName: Full=BMP-2-inducible protein kinase; Short=BIKe gi|332819458|ref|XP_526576.3| PREDICTED: BMP-2-inducible protein kinase [Pan troglodytes] gi|119626236|gb|EAX05831.1| BMP2 inducible kinase, isoform CRA_c [Homo sapiens]
+sp|P06213|INSR_HUMAN gi|119395736|ref|NP_000199.2| insulin receptor isoform Long preproprotein [Homo sapiens] >gi|308153655|sp|P06213.4|INSR_HUMAN RecName: Full=Insulin receptor; Short=IR; AltName: CD_antigen=CD220; Contains: RecName: Full=Insulin receptor subunit alpha; Contains: RecName: Full=Insulin receptor subunit beta; Flags: Precursor gi|386830|gb|AAA59452.1| insulin receptor [Homo sapiens] gi|410220302|gb|JAA07370.1| insulin receptor [Pan troglodytes] >gi|410250978|gb|JAA13456.1| insulin receptor [Pan troglodytes] >gi|410291630|gb|JAA24415.1| insulin receptor [Pan troglodytes] >gi|410335477|gb|JAA36685.1| insulin receptor [Pan troglodytes]
+sp|P08100|OPSD_HUMAN gi|4506527|ref|NP_000530.1| rhodopsin [Homo sapiens] >gi|114589117|ref|XP_516740.2| PREDICTED: rhodopsin [Pan troglodytes] >gi|297670049|ref|XP_002813191.1| PREDICTED: rhodopsin [Pongo abelii] >gi|332231791|ref|XP_003265078.1| PREDICTED: rhodopsin [Nomascus leucogenys] >gi|397518622|ref|XP_003829483.1| PREDICTED: rhodopsin [Pan paniscus] >gi|426342073|ref|XP_004036340.1| PREDICTED: rhodopsin [Gorilla gorilla gorilla] >gi|129207|sp|P08100.1|OPSD_HUMAN RecName: Full=Rhodopsin; AltName: Full=Opsin-2 >gi|1236137|gb|AAC31763.1| rhodopsin [Homo sapiens] >gi|21928611|dbj|BAC05894.1| seven transmembrane helix receptor [Homo sapiens] >gi|31873264|emb|CAD97623.1| hypothetical protein [Homo sapiens] >gi|85567017|gb|AAI12107.1| Rhodopsin [Homo sapiens] >gi|85567192|gb|AAI12105.1| Rhodopsin [Homo sapiens] >gi|108752084|gb|AAI11452.1| RHO protein [synthetic construct] >gi|117644328|emb|CAL37658.1| hypothetical protein [synthetic construct] >gi|119599650|gb|EAW79244.1| rhodopsin (opsin 2, rod pigment) (retinitis pigmentosa 4, autosomal dominant) [Homo sapiens] >gi|208967306|dbj|BAG73667.1| rhodopsin [synthetic construct] gi|403268285|ref|XP_003926208.1| PREDICTED: rhodopsin [Saimiri boliviensis boliviensis] gi|109098032|ref|XP_001094250.1| PREDICTED: rhodopsin [Macaca mulatta] >gi|402887068|ref|XP_003906927.1| PREDICTED: rhodopsin [Papio anubis] >gi|355564526|gb|EHH21026.1| hypothetical protein EGK_04000 [Macaca mulatta] >gi|355786368|gb|EHH66551.1| hypothetical protein EGM_03566 [Macaca fascicularis]

diff -r 09ff180d1615 -r 99b82a2b1272 test-data/four_human_proteins.blastp_nr.top4.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/four_human_proteins.blastp_nr.top4.xml Wed Apr 03 10:49:10 2013 -0400

b'@@ -0,0 +1,546 @@\n+<?xml version="1.0"?>\n+<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">\n+<BlastOutput>\n+ <BlastOutput_program>blastp</BlastOutput_program>\n+ <BlastOutput_version>BLASTP 2.2.26+</BlastOutput_version>\n+ <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>\n+ <BlastOutput_db>/var/local/blast/ncbi/nr</BlastOutput_db>\n+ <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>\n+ <BlastOutput_query-def>sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1</BlastOutput_query-def>\n+ <BlastOutput_query-len>406</BlastOutput_query-len>\n+ <BlastOutput_param>\n+ <Parameters>\n+ <Parameters_matrix>BLOSUM62</Parameters_matrix>\n+ <Parameters_expect>0.001</Parameters_expect>\n+ <Parameters_gap-open>11</Parameters_gap-open>\n+ <Parameters_gap-extend>1</Parameters_gap-extend>\n+ <Parameters_filter>F</Parameters_filter>\n+ </Parameters>\n+ </BlastOutput_param>\n+ <BlastOutput_iterations>\n+ <Iteration>\n+ <Iteration_iter-num>1</Iteration_iter-num>\n+ <Iteration_query-ID>Query_1</Iteration_query-ID>\n+ <Iteration_query-def>sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1</Iteration_query-def>\n+ <Iteration_query-len>406</Iteration_query-len>\n+ <Iteration_hits>\n+ <Hit>\n+ <Hit_num>1</Hit_num>\n+ <Hit_id>gi|52487191|ref|NP_055866.1|</Hit_id>\n+ <Hit_def>endoplasmic reticulum resident protein 44 precursor [Homo sapiens] >gi|332832471|ref|XP_003312248.1| PREDICTED: endoplasmic reticulum resident protein 44 [Pan troglodytes] >gi|395740740|ref|XP_002820091.2| PREDICTED: endoplasmic reticulum resident protein 44 [Pongo abelii] >gi|397499932|ref|XP_003820684.1| 
PREDICTED: endoplasmic reticulum resident protein 44 [Pan paniscus] >gi|31077035|sp|Q9BS26.1|ERP44_HUMAN RecName: Full=Endoplasmic reticulum resident protein 44; Short=ER protein 44; Short=ERp44; AltName: Full=Thioredoxin domain-containing protein 4; Flags: Precursor >gi|13529224|gb|AAH05374.1| Endoplasmic reticulum protein 44 [Homo sapiens] >gi|18857865|emb|CAC87611.1| ERp44 protein [Homo sapiens] >gi|168267418|dbj|BAG09765.1| thioredoxin domain-containing protein 4 [synthetic construct] >gi|193786731|dbj|BAG52054.1| unnamed protein product [Homo sapiens] >gi|410223880|gb|JAA09159.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410253060|gb|JAA14497.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410293858|gb|JAA25529.1| endoplasmic reticulum protein 44 [Pan troglodytes] >gi|410329535|gb|JAA33714.1| endoplasmic reticulum protein 44 [Pan troglodytes]</Hit_def>\n+ <Hit_accession>NP_055866</Hit_accession>\n+ <Hit_len>406</Hit_len>\n+ <Hit_hsps>\n+ <Hsp>\n+ <Hsp_num>1</Hsp_num>\n+ <Hsp_bit-score>847.04</Hsp_bit-score>\n+ <Hsp_score>2187</Hsp_score>\n+ <Hsp_evalue>0</Hsp_evalue>\n+ <Hsp_query-from>1</Hsp_query-from>\n+ <Hsp_query-to>406</Hsp_query-to>\n+ <Hsp_hit-from>1</Hsp_hit-from>\n+ <Hsp_hit-to>406</Hsp_hit-to>\n+ <Hsp_query-frame>0</Hsp_query-frame>\n+ <Hsp_hit-frame>0</Hsp_hit-frame>\n+ <Hsp_identity>406</Hsp_identity>\n+ <Hsp_positive>406</Hsp_positive>\n+ <Hsp_gaps>0</Hsp_gaps>\n+ <Hsp_align-len>406</Hsp_align-len>\n+ 
<Hsp_qseq>MHPAVFLSLPDLRCSLLLLVTWVFTPVTTEITSLDTENIDEILNNADVALVNFYADWCRFSQMLHPIFEEASDVIKEEFPNENQVVFARVDCDQHSDIAQRYRISKYPTLKLFRNGMMMKREYRGQRSVKALADYIRQQKSDPIQEIRDLAEITTLDRSKRNIIGYFEQKDSDNYRVFERVANILHDDCA'..b'AEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA</Hsp_qseq>\n+ <Hsp_hseq>MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNAEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLFGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSASIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA</Hsp_hseq>\n+ <Hsp_midline>MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMV GGFT+TLYTSLHGYFVFGPTGCN EGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPL GWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMI+IFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSA+IYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA</Hsp_midline>\n+ </Hsp>\n+ </Hit_hsps>\n+ </Hit>\n+ <Hit>\n+ <Hit_num>4</Hit_num>\n+ <Hit_id>gi|3024288|sp|Q28886.1|OPSD_MACFA</Hit_id>\n+ <Hit_def>RecName: Full=Rhodopsin >gi|21466156|gb|AAB33079.2| opsin [Macaca fascicularis]</Hit_def>\n+ <Hit_accession>Q28886</Hit_accession>\n+ <Hit_len>348</Hit_len>\n+ <Hit_hsps>\n+ <Hsp>\n+ <Hsp_num>1</Hsp_num>\n+ <Hsp_bit-score>705.286</Hsp_bit-score>\n+ <Hsp_score>1819</Hsp_score>\n+ <Hsp_evalue>0</Hsp_evalue>\n+ <Hsp_query-from>1</Hsp_query-from>\n+ <Hsp_query-to>348</Hsp_query-to>\n+ <Hsp_hit-from>1</Hsp_hit-from>\n+ <Hsp_hit-to>348</Hsp_hit-to>\n+ <Hsp_query-frame>0</Hsp_query-frame>\n+ <Hsp_hit-frame>0</Hsp_hit-frame>\n+ <Hsp_identity>341</Hsp_identity>\n+ 
<Hsp_positive>344</Hsp_positive>\n+ <Hsp_gaps>0</Hsp_gaps>\n+ <Hsp_align-len>348</Hsp_align-len>\n+ <Hsp_qseq>MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA</Hsp_qseq>\n+ <Hsp_hseq>MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNAEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLFGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIVIFFCYGQLVFTVKEARAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSASIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA</Hsp_hseq>\n+ <Hsp_midline>MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMV GGFT+TLYTSLHGYFVFGPTGCN EGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPL GWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMI+IFFCYGQLVFTVKEA AQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSA+IYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA</Hsp_midline>\n+ </Hsp>\n+ </Hit_hsps>\n+ </Hit>\n+ </Iteration_hits>\n+ <Iteration_stat>\n+ <Statistics>\n+ <Statistics_db-num>23216691</Statistics_db-num>\n+ <Statistics_db-len>7978792554</Statistics_db-len>\n+ <Statistics_hsp-len>143</Statistics_hsp-len>\n+ <Statistics_eff-space>955055176905</Statistics_eff-space>\n+ <Statistics_kappa>0.041</Statistics_kappa>\n+ <Statistics_lambda>0.267</Statistics_lambda>\n+ <Statistics_entropy>0.14</Statistics_entropy>\n+ </Statistics>\n+ </Iteration_stat>\n+ </Iteration>\n+ </BlastOutput_iterations>\n+</BlastOutput>\n\\ No newline at end of file\n'

diff -r 09ff180d1615 -r 99b82a2b1272 test-data/four_human_proteins.fasta.orig
--- a/test-data/four_human_proteins.fasta.orig Wed Mar 27 11:21:05 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

@@ -1,61 +0,0 @@
->sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1
-MHPAVFLSLPDLRCSLLLLVTWVFTPVTTEITSLDTENIDEILNNADVALVNFYADWCRF
-SQMLHPIFEEASDVIKEEFPNENQVVFARVDCDQHSDIAQRYRISKYPTLKLFRNGMMMK
-REYRGQRSVKALADYIRQQKSDPIQEIRDLAEITTLDRSKRNIIGYFEQKDSDNYRVFER
-VANILHDDCAFLSAFGDVSKPERYSGDNIIYKPPGHSAPDMVYLGAMTNFDVTYNWIQDK
-CVPLVREITFENGEELTEEGLPFLILFHMKEDTESLEIFQNEVARQLISEKGTINFLHAD
-CDKFRHPLLHIQKTPADCPVIAIDSFRHMYVFGDFKDVLIPGKLKQFVFDLHSGKLHREF
-HHGPDPTDTAPGEQAQDVASSPPESSFQKLAPSEYRYTLLRDRDEL
->sp|Q9NSY1|BMP2K_HUMAN BMP-2-inducible protein kinase OS=Homo sapiens GN=BMP2K PE=1 SV=2
-MKKFSRMPKSEGGSGGGAAGGGAGGAGAGAGCGSGGSSVGVRVFAVGRHQVTLEESLAEG
-GFSTVFLVRTHGGIRCALKRMYVNNMPDLNVCKREITIMKELSGHKNIVGYLDCAVNSIS
-DNVWEVLILMEYCRAGQVVNQMNKKLQTGFTEPEVLQIFCDTCEAVARLHQCKTPIIHRD
-LKVENILLNDGGNYVLCDFGSATNKFLNPQKDGVNVVEEEIKKYTTLSYRAPEMINLYGG
-KPITTKADIWALGCLLYKLCFFTLPFGESQVAICDGNFTIPDNSRYSRNIHCLIRFMLEP
-DPEHRPDIFQVSYFAFKFAKKDCPVSNINNSSIPSALPEPMTASEAAARKSQIKARITDT
-IGPTETSIAPRQRPKANSATTATPSVLTIQSSATPVKVLAPGEFGNHRPKGALRPGNGPE
-ILLGQGPPQQPPQQHRVLQQLQQGDWRLQQLHLQHRHPHQQQQQQQQQQQQQQQQQQQQQ
-QQQQQQHHHHHHHHLLQDAYMQQYQHATQQQQMLQQQFLMHSVYQPQPSASQYPTMMPQY
-QQAFFQQQMLAQHQPSQQQASPEYLTSPQEFSPALVSYTSSLPAQVGTIMDSSYSANRSV
-ADKEAIANFTNQKNISNPPDMSGWNPFGEDNFSKLTEEELLDREFDLLRSNRLEERASSD
-KNVDSLSAPHNHPPEDPFGSVPFISHSGSPEKKAEHSSINQENGTANPIKNGKTSPASKD
-QRTGKKTSVQGQVQKGNDESESDFESDPPSPKSSEEEEQDDEEVLQGEQGDFNDDDTEPE
-NLGHRPLLMDSEDEEEEEKHSSDSDYEQAKAKYSDMSSVYRDRSGSGPTQDLNTILLTSA
-QLSSDVAVETPKQEFDVFGAVPFFAVRAQQPQQEKNEKNLPQHRFPAAGLEQEEFDVFTK
-APFSKKVNVQECHAVGPEAHTIPGYPKSVDVFGSTPFQPFLTSTSKSESNEDLFGLVPFD
-EITGSQQQKVKQRSLQKLSSRQRRTKQDMSKSNGKRHHGTPTSTKKTLKPTYRTPERARR
-HKKVGRRDSQSSNEFLTISDSKENISVALTDGKDRGNVLQPEESLLDPFGAKPFHSPDLS
-WHPPHQGLSDIRADHNTVLPGRPRQNSLHGSFHSADVLKMDDFGAVPFTELVVQSITPHQ
-SQQSQPVELDPFGAAPFPSKQ
->sp|P06213|INSR_HUMAN Insulin receptor OS=Homo sapiens GN=INSR PE=1 SV=4
-MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHL
-QILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYAL
-VIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNYIVLNKDDNE
-ECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECL
-GNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQG
-CHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGC
-TVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETL
-EIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQE
-RNDIALKTNGDQASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQ
-NVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFS
-DERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWE
-RQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQIL
-KELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAF
-PNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYV
-SARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCV
-SRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIG
-PLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSR
-EKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKG
-FTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMA
-AEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPV
-RWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDN
-CPERVTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEME
-FEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSN
-PS
->sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens GN=RHO PE=1 SV=1
-MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
-VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
-GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
-EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQES
-ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
-YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA

diff -r 09ff180d1615 -r 99b82a2b1272 test-data/k12_ten_proteins.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/k12_ten_proteins.fasta Wed Apr 03 10:49:10 2013 -0400

@@ -0,0 +1,60 @@
+>gi|16127995|ref|NP_414542.1| thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
+MKRISTTITTTITITTGNGAG
+>gi|16127996|ref|NP_414543.1| fused aspartokinase I and homoserine dehydrogenase I [Escherichia coli str. K-12 substr. MG1655]
+MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI
+FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
+RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS
+AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC
+LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
+QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
+ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
+LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV
+ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM
+KFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE
+IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK
+VKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
+>gi|16127997|ref|NP_414544.1| homoserine kinase [Escherichia coli str. K-12 substr. MG1655]
+MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWE
+RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY
+DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF
+IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA
+DWLGKNYLQNQEGFVHICRLDTAGARVLEN
+>gi|16127998|ref|NP_414545.1| threonine synthase [Escherichia coli str. K-12 substr. MG1655]
+MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLDFVTRSAKILSAFIGDEIPQE
+ILEERVRAAFAFPAPVANVESDVGCLELFHGPTLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAA
+VAHAFYGLPNVKVVILYPRGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNS
+ANSINISRLLAQICYYFEAVAQLPQETRNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVNDTVP
+RFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDETTQQTMRELKELGYTS
+EPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGETLDLPKELAERADLPLLSHNLPADFAAL
+RKLMMNHQ
+>gi|16127999|ref|NP_414546.1| hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655]
+MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL
+HGPPPPPRHHKKAPHDHHGGHGPGKHHR
+>gi|16128000|ref|NP_414547.1| peroxide resistance protein, lowers intracellular iron [Escherichia coli str. K-12 substr. MG1655]
+MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLMRISDKLAGINAARFHDWQPD
+FTPANARQAILAFKGDVYTGLQAETFSEDDFDFAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARG
+KDLYQFWGDIITNKLNEALAAQGDNVVINLASDEYFKSVKPKKLNAEIIKPVFLDEKNGKFKIISFYAKK
+ARGLMSRFIIENRLTKPEQLTGFNSEGYFFDEDSSSNGELVFKRYEQR
+>gi|16128001|ref|NP_414548.1| putative transporter [Escherichia coli str. K-12 substr. MG1655]
+MPDFFSFINSVLWGSVMIYLLFGAGCWFTFRTGFVQFRYIRQFGKSLKNSIHPQPGGLTSFQSLCTSLAA
+RVGSGNLAGVALAITAGGPGAVFWMWVAAFIGMATSFAECSLAQLYKERDVNGQFRGGPAWYMARGLGMR
+WMGVLFAVFLLIAYGIIFSGVQANAVARALSFSFDFPPLVTGIILAVFTLLAITRGLHGVARLMQGFVPL
+MAIIWVLTSLVICVMNIGQLPHVIWSIFESAFGWQEAAGGAAGYTLSQAITNGFQRSMFSNEAGMGSTPN
+AAAAAASWPPHPAAQGIVQMIGIFIDTLVICTASAMLILLAGNGTTYMPLEGIQLIQKAMRVLMGSWGAE
+FVTLVVILFAFSSIVANYIYAENNLFFLRLNNPKAIWCLRICTFATVIGGTLLSLPLMWQLADIIMACMA
+ITNLTAILLLSPVVHTIASDYLRQRKLGVRPVFDPLRYPDIGRQLSPDAWDDVSQE
+>gi|16128002|ref|NP_414549.1| transaldolase B [Escherichia coli str. K-12 substr. MG1655]
+MTDKLTSLRQYTTVVADTGDIAAMKLYQPQDATTNPSLILNAAQIPEYRKLIDDAVAWAKQQSNDRAQQI
+VDATDKLAVNIGLEILKLVPGRISTEVDARLSYDTEASIAKAKRLIKLYNDAGISNDRILIKLASTWQGI
+RAAEQLEKEGINCNLTLLFSFAQARACAEAGVFLISPFVGRILDWYKANTDKKEYAPAEDPGVVSVSEIY
+QYYKEHGYETVVMGASFRNIGEILELAGCDRLTIAPALLKELAESEGAIERKLSYTGEVKARPARITESE
+FLWQHNQDPMAVDKLAEGIRKFAIDQEKLEKMIGDLL
+>gi|16128003|ref|NP_414550.1| molybdochelatase incorporating molybdenum into molybdopterin [Escherichia coli str. K-12 substr. MG1655]
+MNTLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDEQAIIEQTLCELVDEMSCHLV
+LTTGGTGPARRDVTPDATLAVADREMPGFGEQMRQISLHFVPTAILSRQVGVIRKQALILNLPGQPKSIK
+ETLEGVKDAEGNVVVHGIFASVPYCIQLLEGPYVETAPEVVAAFRPKSARRDVSE
+>gi|16128004|ref|NP_414551.1| inner membrane protein, Grp1_Fun34_YaaH family [Escherichia coli str. K-12 substr. MG1655]
+MGNTKLANPAPLGLMGFGMTTILLNLHNVGYFALDGIILAMGIFYGGIAQIFAGLLEYKKGNTFGLTAFT
+SYGSFWLTLVAILLMPKLGLTDAPNAQFLGVYLGLWGVFTLFMFFGTLKGARVLQFVFFSLTVLFALLAI
+GNIAGNAAIIHFAGWIGLICGASAIYLAMGEVLNEQFGRTVLPIGESH
+

diff -r 09ff180d1615 -r 99b82a2b1272 test-data/k12_ten_proteins_psortb_p_terse.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/k12_ten_proteins_psortb_p_terse.tabular Wed Apr 03 10:49:10 2013 -0400

@@ -0,0 +1,11 @@
+#SeqID Localization Score
+gi|16127995|ref|NP_414542.1| Extracellular 8.91
+gi|16127996|ref|NP_414543.1| Cytoplasmic 7.50
+gi|16127997|ref|NP_414544.1| Cytoplasmic 7.50
+gi|16127998|ref|NP_414545.1| Cytoplasmic 7.50
+gi|16127999|ref|NP_414546.1| Unknown 3.33
+gi|16128000|ref|NP_414547.1| Cytoplasmic 7.50
+gi|16128001|ref|NP_414548.1| CytoplasmicMembrane 10.00
+gi|16128002|ref|NP_414549.1| Cytoplasmic 7.50
+gi|16128003|ref|NP_414550.1| CytoplasmicMembrane 8.16
+gi|16128004|ref|NP_414551.1| CytoplasmicMembrane 10.00
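
The expected terse PSORTb output above has three columns: SeqID, Localization and Score. A minimal sketch of reading such a file follows, assuming it is tab-separated with a single header line starting with `#`. The helper name `read_psortb_terse` is hypothetical; it is not taken from psortb.py, which is not shown in this part of the changeset.

```python
import csv
import io


def read_psortb_terse(handle):
    """Yield (seq_id, localization, score) tuples from terse PSORTb tabular output."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank rows and the '#SeqID ...' header row.
        if not row or row[0].startswith("#"):
            continue
        seq_id, localization, score = row[:3]
        yield seq_id, localization, float(score)


# Two rows copied from the diff above, with tabs restored (an assumption).
example = io.StringIO(
    "#SeqID\tLocalization\tScore\n"
    "gi|16127995|ref|NP_414542.1|\tExtracellular\t8.91\n"
    "gi|16128001|ref|NP_414548.1|\tCytoplasmicMembrane\t10.00\n"
)
records = list(read_psortb_terse(example))
```

Converting the score to `float` makes it straightforward to filter on a threshold, though any cutoff is the user's choice and is not specified in this changeset.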

diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/LICENSE
--- a/tools/protein_analysis/LICENSE Wed Mar 27 11:21:05 2013 -0400
+++ b/tools/protein_analysis/LICENSE Wed Apr 03 10:49:10 2013 -0400

@@ -1,8 +1,10 @@
-Copyright (c) 2010-2011 Peter Cock, The James Hutton Institute
-(formerly SCRI, Scottish Crop Research Institute), UK.
+These wrappers are copyright 2010-2013 by Peter Cock, James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+Contributions/revisions copyright 2011 Konrad Paszkiewicz. All rights reserved.

-License for TMHMM 2.0, SignalP 3.0, and WoLF PSORT wrappers for Galaxy
-(note that tools themselves are copyright and licensed separately).
+License for TMHMM 2.0, SignalP 3.0, WoLF PSORT and PSORTb wrappers for
+Galaxy (note that tools themselves are copyright and licensed separately)
+and the RXLR motif tool for Galaxy.

Permission to use, copy, modify, and distribute this software and its
documentation with or without modifications and for any purpose and

diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/README
--- a/tools/protein_analysis/README Wed Mar 27 11:21:05 2013 -0400
+++ b/tools/protein_analysis/README Wed Apr 03 10:49:10 2013 -0400

@@ -7,13 +7,16 @@

* WoLF PSORT v0.2 from http://wolfpsort.org/

+* PSORTb v3 from http://www.psort.org/downloads/index.html
+
Also, the RXLR motif tool uses SignalP 3.0 and HMMER 2.3.2 internally.

To use these Galaxy wrappers you must first install the command line tools.
-At the time of writing they are all free for academic use.
+At the time of writing they are all free for academic use, or open source.

-These wrappers are copyright 2010-2012 by Peter Cock, James Hutton Institute
+These wrappers are copyright 2010-2013 by Peter Cock, James Hutton Institute
(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+Contributions/revisions copyright 2011 Konrad Paszkiewicz. All rights reserved.
See the included LICENCE file for details (an MIT style open source licence).

Requirements
@@ -60,6 +63,9 @@
promoter2.xml (Galaxy tool definition)
promoter2.py (Python wrapper script)

+psortb.xml (Galaxy tool definition)
+psortb.py (Python wrapper script)
+
wolf_psort.xml (Galaxy tool definition)
wolf_psort.py (Python wrapper script)

@@ -77,6 +83,7 @@
   <section name="Protein sequence analysis" id="protein_analysis">
     <tool file="protein_analysis/tmhmm2.xml" />
     <tool file="protein_analysis/signalp3.xml" />
+    <tool file="protein_analysis/psortb.xml" />
     <tool file="protein_analysis/wolf_psort.xml" />
     <tool file="protein_analysis/rxlr_motifs.xml" />
   </section>
@@ -95,17 +102,23 @@
empty.fasta
empty_tmhmm2.tabular
empty_signalp3.tabular
+k12_ten_proteins.fasta
+k12_ten_proteins_psortb_p_terse.tabular

5. Run the Galaxy functional tests for these new wrappers with:

./run_functional_tests.sh -id tmhmm2
./run_functional_tests.sh -id signalp3
+./run_functional_tests.sh -id Psortb
+./run_functional_tests.sh -id rxlr_motifs

Alternatively, this should work (assuming you left the name and id as shown in
the XML file tool_conf.xml.sample):

./run_functional_tests.sh -sid Protein_sequence_analysis-protein_analysis

+To check the section ID expected, use ./run_functional_tests.sh -list
+
6. Restart Galaxy and check the new tools are shown and work.

@@ -130,6 +143,8 @@
v0.1.2 - Use the new <stdio> settings in the XML wrappers to catch errors
        - Use SGE style $NSLOTS for thread count (otherwise default to 4)
v0.1.3 - Added missing file whisson_et_al_rxlr_eer_cropped.hmm to Tool Shed
+v0.2.0 - Added PSORTb wrapper to the suite, based on earlier work
+         contributed by Konrad Paszkiewicz.

Developers
@@ -144,11 +159,11 @@
For making the "Galaxy Tool Shed" http://community.g2.bx.psu.edu/ tarball use
the following command from the Galaxy root folder:

-tar -czf ~/tmhmm_signalp_etc.tar.gz tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/suite_config.xml tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.xml tools/protein_analysis/signalp3.py tools/protein_analysis/tmhmm2.xml tools/protein_analysis/tmhmm2.py tools/protein_analysis/promoter2.xml tools/protein_analysis/promoter2.py tools/protein_analysis/wolf_psort.xml tools/protein_analysis/wolf_psort.py tools/protein_analysis/rxlr_motifs.xml tools/protein_analysis/rxlr_motifs.py tools/protein_analysis/whisson_et_al_rxlr_eer_cropped.hmm test-data/four_human_proteins.* test-data/empty.fasta test-data/empty_tmhmm2.tabular test-data/empty_signalp3.tabular
+tar -czf ~/tmhmm_signalp_etc.tar.gz tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/suite_config.xml tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.xml tools/protein_analysis/signalp3.py tools/protein_analysis/tmhmm2.xml tools/protein_analysis/tmhmm2.py tools/protein_analysis/promoter2.xml tools/protein_analysis/promoter2.py tools/protein_analysis/psortb.xml tools/protein_analysis/psortb.py tools/protein_analysis/wolf_psort.xml tools/protein_analysis/wolf_psort.py tools/protein_analysis/rxlr_motifs.xml tools/protein_analysis/rxlr_motifs.py tools/protein_analysis/whisson_et_al_rxlr_eer_cropped.hmm test-data/four_human_proteins.* test-data/empty.fasta test-data/empty_tmhmm2.tabular test-data/empty_signalp3.tabular test-data/k12_ten_proteins.fasta test-data/k12_ten_proteins_psortb_p_terse.tabular

Check this worked:

-$ tar -tzf tmhmm_signalp_etc.tar.gz
+$ tar -tzf ~/tmhmm_signalp_etc.tar.gz
tools/protein_analysis/LICENSE
tools/protein_analysis/README
tools/protein_analysis/suite_config.xml
@@ -159,6 +174,8 @@
tools/protein_analysis/tmhmm2.py
tools/protein_analysis/promoter2.xml
tools/protein_analysis/promoter2.py
+tools/protein_analysis/psortb.xml
+tools/protein_analysis/psortb.py
tools/protein_analysis/wolf_psort.xml
tools/protein_analysis/wolf_psort.py
tools/protein_analysis/rxlr_motifs.xml
@@ -170,3 +187,5 @@
test-data/empty.fasta
test-data/empty_tmhmm2.tabular
test-data/empty_signalp3.tabular
+test-data/k12_ten_proteins.fasta
+test-data/k12_ten_proteins_psortb_p_terse.tabular

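Reviewer note: the README checks the tarball by eye with tar -tzf. The same check could be scripted with the standard tarfile module, as sketched below. This assumes Python 3; missing_members and NEW_IN_V0_2_0 are hypothetical names, and the list only covers the four files this changeset adds to the tarball command.

```python
import tarfile

# Files added to the tarball command in this changeset (see the README diff).
NEW_IN_V0_2_0 = [
    "tools/protein_analysis/psortb.xml",
    "tools/protein_analysis/psortb.py",
    "test-data/k12_ten_proteins.fasta",
    "test-data/k12_ten_proteins_psortb_p_terse.tabular",
]


def missing_members(tar_path, expected):
    """Return the expected member names that are absent from the tarball."""
    with tarfile.open(tar_path, "r:gz") as tar:
        names = set(tar.getnames())
    return [name for name in expected if name not in names]
```

Calling missing_members on the path of the freshly built tarball with NEW_IN_V0_2_0 should return an empty list if the tarball was built as described.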
diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/psortb.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/protein_analysis/psortb.py Wed Apr 03 10:49:10 2013 -0400


@@ -0,0 +1,170 @@
+#!/usr/bin/env python
+"""Wrapper for psortb for use in Galaxy.
+
+This script takes exactly seven command line arguments - which includes the
+number of threads, and the input protein FASTA filename and output
+tabular filename. It then splits up the FASTA input and calls multiple
+copies of the standalone psortb v3 program, then collates the output.
+e.g. Rather than this,
+
+psort $type -c $cutoff -d $divergent -o $output $sequence > $outfile
+
+Call this:
+
+psortb.py $threads $type $output $cutoff $divergent $sequence $outfile
+
+If omitting -c or -d options, set $cutoff and $divergent to zero or blank.
+
+Note that this is somewhat redundant with job-splitting available in Galaxy
+itself (see the SignalP XML file for settings), but both can be applied.
+
+Additionally it ensures the header line (with the column names) starts
+with a # character as used elsewhere in Galaxy.
+"""
+import sys
+import os
+import tempfile
+from seq_analysis_utils import stop_err, split_fasta, run_jobs, thread_count
+
+FASTA_CHUNK = 500
+
+if "-v" in sys.argv or "--version" in sys.argv:
+    """Return underlying PSORTb's version"""
+    sys.exit(os.system("psort --version"))
+
+if len(sys.argv) != 8:
+    stop_err("Require 7 arguments, number of threads (int), type (e.g. archaea), "
+             "output (e.g. terse/normal/long), cutoff, divergent, input protein "
+             "FASTA file & output tabular file")
+
+num_threads = thread_count(sys.argv[1], default=4)
+org_type = sys.argv[2]
+out_type = sys.argv[3]
+cutoff = sys.argv[4]
+if cutoff.strip() and float(cutoff.strip()) != 0.0:
+    cutoff = "-c %s" % cutoff
+else:
+    cutoff = ""
+divergent = sys.argv[5]
+if divergent.strip() and float(divergent.strip()) != 0.0:
+    divergent = "-d %s" % divergent
+else:
+    divergent = ""
+fasta_file = sys.argv[6]
+tabular_file = sys.argv[7]
+
+if out_type == "terse":
+    header = ['SeqID', 'Localization', 'Score']
+elif out_type == "normal":
+    stop_err("Normal output not implemented yet, sorry.")
+elif out_type == "long":
+    if org_type == "-n":
+        #Gram negative bacteria
+        header = ['SeqID', 'CMSVM-_Localization', 'CMSVM-_Details', 'CytoSVM-_Localization', 'CytoSVM-_Details',
+                  'ECSVM-_Localization', 'ECSVM-_Details', 'ModHMM-_Localization', 'ModHMM-_Details',
+                  'Motif-_Localization', 'Motif-_Details', 'OMPMotif-_Localization', 'OMPMotif-_Details',
+                  'OMSVM-_Localization', 'OMSVM-_Details', 'PPSVM-_Localization', 'PPSVM-_Details',
+                  'Profile-_Localization', 'Profile-_Details',
+                  'SCL-BLAST-_Localization', 'SCL-BLAST-_Details', 'SCL-BLASTe-_Localization', 'SCL-BLASTe-_Details',
+                  'Signal-_Localization', 'Signal-_Details',
+                  'Cytoplasmic_Score', 'CytoplasmicMembrane_Score', 'Periplasmic_Score', 'OuterMembrane_Score',
+                  'Extracellular_Score', 'Final_Localization', 'Final_Localization_Details', 'Final_Score',
+                  'Secondary_Localization', 'PSortb_Version']
+    elif org_type == "-p":
+        #Gram positive bacteria
+        header = ['SeqID', 'CMSVM+_Localization', 'CMSVM+_Details', 'CWSVM+_Localization', 'CWSVM+_Details',
+                  'CytoSVM+_Localization', 'CytoSVM+_Details', 'ECSVM+_Localization', 'ECSVM+_Details',
+                  'ModHMM+_Localization', 'ModHMM+_Details', 'Motif+_Localization', 'Motif+_Details',
+                  'Profile+_Localization', 'Profile+_Details',
+                  'SCL-BLAST+_Localization', 'SCL-BLAST+_Details', 'SCL-BLASTe+_Localization', 'SCL-BLASTe+_Details',
+                  'Signal+_Localization', 'Signal+_Details',
+                  'Cytoplasmic_Score', 'CytoplasmicMembrane_Score', 'Cellwall_Score',
+                  'Extracellular_Score', 'Final_Localization', 'Final_Localization_Details', 'Final_Score',
+                  'Secondary_Localization', 'PSortb_Version']
+    elif org_type == "-a":
+        #Archaea
+        header = ['SeqID', 'CMSVM_a_Localization', 'CMSVM_a_Details', 'CWSVM_a_Localization', 'CWSVM_a_Details',
+                  'CytoSVM_a_Localization', 'CytoSVM_a_Details', 'ECSVM_a_Localization', 'ECSVM_a_Details',
+                  'ModHMM_a_Localization', 'ModHMM_a_Details', 'Motif_a_Localization', 'Motif_a_Details',
+                  'Profile_a_Localization', 'Profile_a_Details',
+                  'SCL-BLAST_a_Localization', 'SCL-BLAST_a_Details', 'SCL-BLASTe_a_Localization', 'SCL-BLASTe_a_Details',
+                  'Signal_a_Localization', 'Signal_a_Details',
+                  'Cytoplasmic_Score', 'CytoplasmicMembrane_Score', 'Cellwall_Score',
+                  'Extracellular_Score', 'Final_Localization', 'Final_Localization_Details', 'Final_Score',
+                  'Secondary_Localization', 'PSortb_Version']
+    else:
+        stop_err("Expected -n, -p or -a for the organism type, not %r" % org_type)
+else:
+    stop_err("Expected terse, normal or long for the output type, not %r" % out_type)
+
+tmp_dir = tempfile.mkdtemp()
+
+def clean_tabular(raw_handle, out_handle):
+    """Clean up tabular PSORTb output, returns output line count."""
+    global header
+    count = 0
+    for line in raw_handle:
+        if not line.strip() or line.startswith("#"):
+            #Ignore any blank lines or comment lines
+            continue
+        parts = [x.strip() for x in line.rstrip("\r\n").split("\t")]
+        if parts == header:
+            #Ignore the header line
+            continue
+        if not parts[-1] and len(parts) == len(header) + 1:
+            #Ignore dummy blank extra column, e.g.
+            #"...2.0\t\tPSORTb version 3.0\t\n"
+            parts = parts[:-1]
+        assert len(parts) == len(header), \
+            "%i fields, not %i, in line:\n%r" % (len(parts), len(header), line)
+        out_handle.write(line)
+        count += 1
+    return count
+
+#Note that if the input FASTA file contains no sequences,
+#split_fasta returns an empty list (i.e. zero temp files).
+fasta_files = split_fasta(fasta_file, os.path.join(tmp_dir, "tmhmm"), FASTA_CHUNK)
+temp_files = [f+".out" for f in fasta_files]
+jobs = ["psort %s %s %s -o %s %s > %s" % (org_type, cutoff, divergent, out_type, fasta, temp)
+        for fasta, temp in zip(fasta_files, temp_files)]
+
+def clean_up(file_list):
+    for f in file_list:
+        if os.path.isfile(f):
+            os.remove(f)
+    try:
+        os.rmdir(tmp_dir)
+    except OSError:
+        pass
+
+if len(jobs) > 1 and num_threads > 1:
+    #A small "info" message for Galaxy to show the user.
+    print "Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs))
+results = run_jobs(jobs, num_threads)
+for fasta, temp, cmd in zip(fasta_files, temp_files, jobs):
+    error_level = results[cmd]
+    if error_level:
+        try:
+            output = open(temp).readline()
+        except IOError:
+            output = ""
+        clean_up(fasta_files + temp_files)
+        stop_err("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output),
+                 error_level)
+del results
+del jobs
+
+out_handle = open(tabular_file, "w")
+out_handle.write("#%s\n" % "\t".join(header))
+count = 0
+for temp in temp_files:
+    data_handle = open(temp)
+    count += clean_tabular(data_handle, out_handle)
+    data_handle.close()
+    if not count:
+        clean_up(fasta_files + temp_files)
+        stop_err("No output from psortb")
+out_handle.close()
+print "%i records" % count
+
+clean_up(fasta_files + temp_files)

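Reviewer note: the per-line checks in clean_tabular can be read as a small pure function. The sketch below restates them in Python 3 for illustration. normalise_row is a hypothetical name, and returning the cleaned fields (rather than writing the raw line, as the wrapper does) is a choice made here, not the wrapper's behaviour.

```python
def normalise_row(line, header):
    """Split one raw PSORTb output line into fields matching ``header``.

    Returns None for blank lines, comment lines and the header line itself.
    Raises ValueError if the field count does not match the header.
    """
    if not line.strip() or line.startswith("#"):
        return None
    parts = [x.strip() for x in line.rstrip("\r\n").split("\t")]
    if parts == header:
        return None
    if not parts[-1] and len(parts) == len(header) + 1:
        # Drop the dummy blank extra column left by a trailing tab.
        parts = parts[:-1]
    if len(parts) != len(header):
        raise ValueError("%i fields, not %i, in line:\n%r"
                         % (len(parts), len(header), line))
    return parts
```
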
diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/psortb.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/protein_analysis/psortb.xml Wed Apr 03 10:49:10 2013 -0400

@@ -0,0 +1,89 @@
+<tool id="Psortb" name="psortb" version="0.0.1">
+  <description>Determines sub-cellular localisation of bacterial/archaeal protein sequences</description>
+  
+  
+  <parallelism method="basic" split_inputs="fasta_file" split_mode="to_size" split_size="2000" merge_outputs="tabular_file"></parallelism>
+  <version_command interpreter="python">psortb.py --version</version_command>
+  <command interpreter="python">psortb.py "\$NSLOTS" "$type" "$long" "$cutoff" "$divergent" "$sequence" "$outfile"</command>
+  <stdio>
+    
+    <exit_code range="1:" />
+    <exit_code range=":-1" />
+  </stdio>
+  <inputs>
+    <param format="fasta" name="sequence" type="data"
+    label="Input sequences for which to predict localisation (protein FASTA format)" />
+    <param name="type" type="select"
+    label="Organism type (N.B. all sequences in the above file must be of the same type)" >
+      <option value="-p">Gram positive bacteria</option>
+      <option value="-n">Gram negative bacteria</option>
+      <option value="-a">Archaea</option>
+    </param>
+    <param name="long" type="select" label="Output type">
+      <option value="terse">Short (terse, tabular with 3 columns)</option>
+      
+      <option value="long">Long (verbose, tabular with about 30 columns, depending on organism type)</option>
+    </param>
+    <param name="cutoff" size="10" type="float" optional="true" value=""
+    label="Sets a cutoff value for reported results (e.g. 7.5)"
+    help="Leave blank or use zero for no cutoff." />
+    <param name="divergent" size="10" type="float" optional="true" value=""
+    label="Sets a cutoff value for the multiple localization flag (e.g. 4.5)"
+    help="Leave blank or use zero for no cutoff." />
+  </inputs>
+  <outputs>
+    <data format="tabular" name="outfile" />
+  </outputs>
+  <requirements>
+    <requirement type="binary">psort</requirement>
+  </requirements>
+  <tests>
+    <test>
+      <param name="sequence" value="empty.fasta" ftype="fasta"/>
+      <param name="long" value="terse"/>
+      <output name="outfile" file="empty_psortb_terse.tabular" ftype="tabular"/>
+    </test>
+    <test>
+      <param name="sequence" value="k12_ten_proteins.fasta" ftype="fasta"/>
+      <param name="long" value="terse"/>
+      <output name="outfile" file="k12_ten_proteins_psortb_p_terse.tabular" ftype="tabular"/>
+    </test>
+  </tests>
+  <help>
+
+**What it does**
+
+This calls the command line tool PSORTb v3.0 for prediction of prokaryotic
+localization sites. The input dataset needs to be protein FASTA sequences.
+The default output is a simple tabular file with three columns, one row
+per query sequence:
+
+====== ==============================
+Column Description
+------ ------------------------------
+     1 Sequence identifier
+     2 Localisation, e.g. Cytoplasmic
+     3 Score
+====== ==============================
+
+The long output is also tabular with one row per query sequence, but has
+lots more columns (a different set for each supported organism type). In
+both cases, a simple header line is included (starting with a hash, #,
+so that Galaxy treats it as a comment) giving the column names.
+
+
+**References**
+
+N.Y. Yu, J.R. Wagner, M.R. Laird, G. Melli, S. Rey, R. Lo, P. Dao,
+S.C. Sahinalp, M. Ester, L.J. Foster, F.S.L. Brinkman (2010)
+PSORTb 3.0: Improved protein subcellular localization prediction with
+refined localization subcategories and predictive capabilities for all
+prokaryotes, Bioinformatics 26(13):1608-1615
+http://dx.doi.org/10.1093/bioinformatics/btq249
+
+http://www.psort.org/documentation/index.html
+
+  </help>
+</tool>

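Reviewer note: the help text for the cutoff and divergent parameters says a blank or zero value means no cutoff. The sketch below shows one way to express that rule as a helper, in Python 3. optional_flag is a hypothetical name; unlike psortb.py it also strips surrounding whitespace from the value before building the option string.

```python
def optional_flag(flag, value):
    """Return e.g. '-c 7.5', or '' when the value is blank or zero."""
    text = value.strip()
    if text and float(text) != 0.0:
        return "%s %s" % (flag, text)
    return ""


cutoff = optional_flag("-c", "7.5")    # option is passed through
divergent = optional_flag("-d", "")    # blank means the option is omitted
```
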
diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/signalp3.py
--- a/tools/protein_analysis/signalp3.py Wed Mar 27 11:21:05 2013 -0400
+++ b/tools/protein_analysis/signalp3.py Wed Apr 03 10:49:10 2013 -0400


@@ -63,34 +63,34 @@
MAX_LEN = 6000 #Found by trial and error

if len(sys.argv) not in  [6,8]:
-   stop_err("Require five (or 7) arguments, organism, truncate, threads, "
-            "input protein FASTA file & output tabular file (plus "
-            "optionally cut method and GFF3 output file). "
-            "Got %i arguments." % (len(sys.argv)-1))
+    stop_err("Require five (or 7) arguments, organism, truncate, threads, "
+             "input protein FASTA file & output tabular file (plus "
+             "optionally cut method and GFF3 output file). "
+             "Got %i arguments." % (len(sys.argv)-1))

organism = sys.argv[1]
if organism not in ["euk", "gram+", "gram-"]:
-   stop_err("Organism argument %s is not one of euk, gram+ or gram-" % organism)
+    stop_err("Organism argument %s is not one of euk, gram+ or gram-" % organism)

try:
-   truncate = int(sys.argv[2])
+    truncate = int(sys.argv[2])
except:
-   truncate = 0
+    truncate = 0
if truncate < 0:
-   stop_err("Truncate argument %s is not a positive integer (or zero)" % sys.argv[2])
+    stop_err("Truncate argument %s is not a positive integer (or zero)" % sys.argv[2])

num_threads = thread_count(sys.argv[3], default=4)
fasta_file = sys.argv[4]
tabular_file = sys.argv[5]

if len(sys.argv) == 8:
-   cut_method = sys.argv[6]
-   if cut_method not in ["NN_Cmax", "NN_Ymax", "NN_Smax", "HMM_Cmax"]:
-      stop_err("Invalid cut method %r" % cut_method)
-   gff3_file = sys.argv[7]
+    cut_method = sys.argv[6]
+    if cut_method not in ["NN_Cmax", "NN_Ymax", "NN_Smax", "HMM_Cmax"]:
+        stop_err("Invalid cut method %r" % cut_method)
+    gff3_file = sys.argv[7]
else:
-   cut_method = None
-   gff3_file = None
+    cut_method = None
+    gff3_file = None

tmp_dir = tempfile.mkdtemp()
@@ -98,18 +98,19 @@
def clean_tabular(raw_handle, out_handle, gff_handle=None, cut_method=None):
     """Clean up SignalP output to make it tabular."""
     if cut_method:
-       cut_col = {"NN_Cmax" : 2,
-                  "NN_Ymax" : 5,
-                  "NN_Smax" : 8,
-                  "HMM_Cmax" : 16}[cut_method]
+        cut_col = {"NN_Cmax" : 2,
+                   "NN_Ymax" : 5,
+                   "NN_Smax" : 8,
+                   "HMM_Cmax" : 16}[cut_method]
     else:
-       cut_col = None
+        cut_col = None
     for line in raw_handle:
         if not line or line.startswith("#"):
             continue
         parts = line.rstrip("\r\n").split()
         assert len(parts)==21, repr(line)
-        assert parts[14].startswith(parts[0])
+        assert parts[14].startswith(parts[0]), \
+            "Bad entry in SignalP output, ID mismatch:\n%r" % line
         #Remove redundant truncated name column (col 0)
         #and put full name at start (col 14)
         parts = parts[14:15] + parts[1:14] + parts[15:]
@@ -218,6 +219,6 @@

#GFF3:
if cut_method:
-   make_gff(fasta_file, tabular_file, gff3_file, cut_method)
+    make_gff(fasta_file, tabular_file, gff3_file, cut_method)

clean_up(fasta_files + temp_files)

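Reviewer note: the column shuffle in clean_tabular (drop the truncated name in column 0 and move the full name from column 14 to the front) turns 21 raw fields into the 20 columns documented in signalp3.xml. A small Python 3 sketch of just that step follows. reorder_signalp_columns is a hypothetical name, and it raises ValueError where the wrapper uses assert.

```python
def reorder_signalp_columns(parts):
    """Return the 20 output columns from 21 whitespace-split raw fields."""
    if len(parts) != 21:
        raise ValueError("Expected 21 fields, got %i" % len(parts))
    if not parts[14].startswith(parts[0]):
        raise ValueError("ID mismatch: %r vs %r" % (parts[0], parts[14]))
    # Full name first, then the 13 NN columns, then the 6 HMM columns.
    return parts[14:15] + parts[1:14] + parts[15:]
```
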
diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/signalp3.xml
--- a/tools/protein_analysis/signalp3.xml Wed Mar 27 11:21:05 2013 -0400
+++ b/tools/protein_analysis/signalp3.xml Wed Apr 03 10:49:10 2013 -0400

@@ -1,4 +1,4 @@
-<tool id="signalp3" name="SignalP 3.0" version="0.0.10">
+<tool id="signalp3" name="SignalP 3.0" version="0.0.11">
     <description>Find signal peptides in protein sequences</description>
     
     
@@ -71,9 +71,13 @@

The input is a FASTA file of protein sequences, and the output is tabular with twenty columns (one row per protein):

- * Sequence identifier
- * Neural Network (NN) predictions (13 columns)
- * Hidden Markov Model (HMM) predictions (6 columns)
+====== =================================================
+Column Description
+------ -------------------------------------------------
+     1 Sequence identifier
+  2-14 Neural Network (NN) predictions (13 columns)
+ 15-20 Hidden Markov Model (HMM) predictions (6 columns)
+====== =================================================

Internally the input FASTA file is divided into parts (to allow multiple processors to be used), and the proteins truncated as specified (see below). The raw output from SignalP is then reformatted into a tabular layout suitable for Galaxy (see below).

@@ -83,15 +87,47 @@

The NN output comprises three different scores (C-max, S-max and Y-max) and two scores derived from them (S-mean and D-score).

-The C-score is the 'cleavage site' score. For each position in the submitted sequence, a C-score is reported, which should only be significantly high at the cleavage site. Confusion is often seen with the position numbering of the cleavage site. When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein, meaning that a predicted cleavage site between amino acid 26-27 is reported as 27, corresponding to the mature protein starting at (and including) position 27.
-
-The S-score for the signal peptide prediction is calculated for every single amino acid position in the submitted sequence (not shown in the output via Galaxy), with high scores indicating that the corresponding amino acid is part of a signal peptide, and low scores indicating that the amino acid is part of a mature protein.
-
-Y-max is a derivative of the C-score combined with the S-score resulting in a better cleavage site prediction than the raw C-score alone. This is due to the fact that multiple high-peaking C-scores can be found in one sequence, where only one is the true cleavage site. The cleavage site is assigned from the Y-score where the slope of the S-score is steep and a significant C-score is found.
-
-The S-mean is the average of the S-score, ranging from the N-terminal amino acid to the amino acid assigned with the highest Y-max score, thus the S-mean score is calculated for the length of the predicted signal peptide. The S-mean score was in SignalP version 2.0 used as the criteria for discrimination of secretory and non-secretory proteins.
-
-The D-score was introduced in SignalP version 3.0 and is a simple average of the S-mean and Y-max score. The score shows superior discrimination performance of secretory and non-secretory proteins to that of the S-mean score which was used in SignalP version 1 and 2.
+====== ======= ===============================================================
+Column Name    Description
+------ ------- ---------------------------------------------------------------
+   2-4 C-score The C-score is the 'cleavage site' score. For each position in
+               the submitted sequence, a C-score is reported, which should
+               only be significantly high at the cleavage site. Confusion is
+               often seen with the position numbering of the cleavage site.
+               When a cleavage site position is referred to by a single number,
+               the number indicates the first residue in the mature protein,
+               meaning that a predicted cleavage site between amino acid 26-27
+               is reported as 27, corresponding to the mature protein starting
+               at (and including) position 27.
+------ ------- ---------------------------------------------------------------
+   5-7 S-score The S-score for the signal peptide prediction is calculated for
+               every single amino acid position in the submitted sequence (not
+               shown in the output via Galaxy), with high scores indicating
+               that the corresponding amino acid is part of a signal peptide,
+               and low scores indicating that the amino acid is part of a
+               mature protein.
+------ ------- ---------------------------------------------------------------
+  8-10 Y-max   Y-max is a derivative of the C-score combined with the S-score
+               resulting in a better cleavage site prediction than the raw
+               C-score alone. This is due to the fact that multiple high-peaking
+               C-scores can be found in one sequence, where only one is the
+               true cleavage site. The cleavage site is assigned from the
+               Y-score where the slope of the S-score is steep and a
+               significant C-score is found.
+------ ------- ---------------------------------------------------------------
+ 11-12 S-mean  The S-mean is the average of the S-score, ranging from the
+               N-terminal amino acid to the amino acid assigned with the
+               highest Y-max score, thus the S-mean score is calculated for
+               the length of the predicted signal peptide. The S-mean score
+               was used in SignalP version 2.0 as the criterion for
+               discrimination of secretory and non-secretory proteins.
+------ ------- ---------------------------------------------------------------
+ 13-14 D-score The D-score was introduced in SignalP version 3.0 and is a
+               simple average of the S-mean and Y-max score. The score shows
+               superior discrimination performance of secretory and
+               non-secretory proteins to that of the S-mean score which was
+               used in SignalP version 1 and 2.
+====== ======= ===============================================================

For non-secretory proteins all the scores represented in the SignalP3-NN output should ideally be very low.

diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/tmhmm2.py
--- a/tools/protein_analysis/tmhmm2.py Wed Mar 27 11:21:05 2013 -0400
+++ b/tools/protein_analysis/tmhmm2.py Wed Apr 03 10:49:10 2013 -0400


@@ -48,7 +48,7 @@
FASTA_CHUNK = 500

if len(sys.argv) != 4:
- stop_err("Require three arguments, number of threads (int), input protein FASTA file & output tabular file")
+ stop_err("Require three arguments, number of threads (int), input protein FASTA file & output tabular file")

num_threads = thread_count(sys.argv[1], default=4)
fasta_file = sys.argv[2]

diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/tmhmm2.xml
--- a/tools/protein_analysis/tmhmm2.xml Wed Mar 27 11:21:05 2013 -0400
+++ b/tools/protein_analysis/tmhmm2.xml Wed Apr 03 10:49:10 2013 -0400

@@ -1,4 +1,4 @@
-<tool id="tmhmm2" name="TMHMM 2.0" version="0.0.9">
+<tool id="tmhmm2" name="TMHMM 2.0" version="0.0.10">
     <description>Find transmembrane domains in protein sequences</description>
     
     
@@ -47,12 +47,19 @@

The input is a FASTA file of protein sequences, and the output is tabular with six columns (one row per protein):

- 1. Sequence identifier
- 2. Sequence length
- 3. Expected number of amino acids in TM helices (ExpAA). If this number is larger than 18 it is very likely to be a transmembrane protein (OR have a signal peptide).
- 4. Expected number of amino acids in TM helices in the first 60 amino acids of the protein (Exp60). If this number more than a few, be aware that a predicted transmembrane helix in the N-term could be a signal peptide.
- 5. Number of transmembrane helices predicted by N-best.
- 6. Topology predicted by N-best (encoded as a strip using o for output and i for inside)
+====== =====================================================================================
+Column Description
+------ -------------------------------------------------------------------------------------
+     1 Sequence identifier
+     2 Sequence length
+     3 Expected number of amino acids in TM helices (ExpAA). If this number is larger than
+       18 it is very likely to be a transmembrane protein (OR have a signal peptide).
+     4 Expected number of amino acids in TM helices in the first 60 amino acids of the
+       protein (Exp60). If this number is more than a few, be aware that a predicted
+       transmembrane helix in the N-term could be a signal peptide.
+     5 Number of transmembrane helices predicted by N-best.
+     6 Topology predicted by N-best (encoded as a string using o for outside and i for inside)
+====== =====================================================================================

Predicted TM segments in the n-terminal region sometimes turn out to be signal peptides.

@@ -60,6 +67,7 @@

Do not use the program to predict whether a non-membrane protein is cytoplasmic or not.

+
**Notes**

The short format output from TMHMM v2.0 looks like this (six columns tab separated, shown here as a table):
@@ -81,6 +89,7 @@
gi|3298468|dbj|BAA31520.1|          107 59.37   31.17       3 o23-45i52-74o89-106i
=================================== === ===== ======= ======= ====================

+
**References**

Krogh, Larsson, von Heijne, and Sonnhammer.

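Reviewer note: column 6 of the TMHMM output packs the predicted helices into a single topology string such as o23-45i52-74o89-106i. A minimal Python 3 sketch of pulling the helix coordinates back out follows. parse_topology is a hypothetical helper, and it assumes each helix appears as a start-end pair of integers.

```python
import re


def parse_topology(topology):
    """Return a list of (start, end) helix coordinates from a topology string."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)-(\d+)", topology)]


# Topology taken from the example row shown in the tool help.
helices = parse_topology("o23-45i52-74o89-106i")
```

The number of pairs returned should agree with column 5 (the number of predicted helices) for the same row.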
diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/wolf_psort.py
--- a/tools/protein_analysis/wolf_psort.py Wed Mar 27 11:21:05 2013 -0400
+++ b/tools/protein_analysis/wolf_psort.py Wed Apr 03 10:49:10 2013 -0400


@@ -59,11 +59,11 @@
"""

if len(sys.argv) != 5:
-   stop_err("Require four arguments, organism, threads, input protein FASTA file & output tabular file")
+    stop_err("Require four arguments, organism, threads, input protein FASTA file & output tabular file")

organism = sys.argv[1]
if organism not in ["animal", "plant", "fungi"]:
-   stop_err("Organism argument %s is not one of animal, plant, fungi" % organism)
+    stop_err("Organism argument %s is not one of animal, plant, fungi" % organism)

num_threads = thread_count(sys.argv[2], default=4)
fasta_file = sys.argv[3]

diff -r 09ff180d1615 -r 99b82a2b1272 tools/protein_analysis/wolf_psort.xml
--- a/tools/protein_analysis/wolf_psort.xml Wed Mar 27 11:21:05 2013 -0400
+++ b/tools/protein_analysis/wolf_psort.xml Wed Apr 03 10:49:10 2013 -0400

@@ -1,4 +1,4 @@
-<tool id="wolf_psort" name="WoLF PSORT" version="0.0.2">
+<tool id="wolf_psort" name="WoLF PSORT" version="0.0.3">
     <description>Eukaryote protein subcellular localization prediction</description>
     <command interpreter="python">
       wolf_psort.py $organism 8 $fasta_file $tabular_file
@@ -31,10 +31,14 @@

The input is a FASTA file of protein sequences, and the output is tabular with four columns (multiple rows per protein):

- * Sequence identifier
- * Compartment
- * Score
- * Prediction rank
+====== ===================
+Column Description
+------ -------------------
+     1 Sequence identifier
+     2 Compartment
+     3 Score
+     4 Prediction rank
+====== ===================

**Localization Compartments**