Mercurial > repos > peterjc > seq_filter_by_id
changeset 1:262f08104540 draft
Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
| author | peterjc | 
|---|---|
| date | Mon, 15 Apr 2013 12:27:30 -0400 | 
| parents | 5844f6a450ed | 
| children | abdd608c869b | 
| files | test-data/k12_hypothetical.fasta test-data/k12_hypothetical.tabular test-data/k12_ten_proteins.fasta tools/filters/seq_filter_by_id.py tools/filters/seq_filter_by_id.txt tools/filters/seq_filter_by_id.xml | 
| diffstat | 6 files changed, 177 insertions(+), 49 deletions(-) [+] | 
line wrap: on
 line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/k12_hypothetical.fasta Mon Apr 15 12:27:30 2013 -0400 @@ -0,0 +1,3 @@ +>gi|16127999|ref|NP_414546.1| hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655] +MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL +HGPPPPPRHHKKAPHDHHGGHGPGKHHR
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/k12_hypothetical.tabular Mon Apr 15 12:27:30 2013 -0400 @@ -0,0 +1,2 @@ +#ID Description +gi|16127999|ref|NP_414546.1| hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655]
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/k12_ten_proteins.fasta Mon Apr 15 12:27:30 2013 -0400 @@ -0,0 +1,60 @@ +>gi|16127995|ref|NP_414542.1| thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655] +MKRISTTITTTITITTGNGAG +>gi|16127996|ref|NP_414543.1| fused aspartokinase I and homoserine dehydrogenase I [Escherichia coli str. K-12 substr. MG1655] +MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI +FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA +RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS +AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC +LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT +QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL +ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW +LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV +ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM +KFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE +IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK +VKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV +>gi|16127997|ref|NP_414544.1| homoserine kinase [Escherichia coli str. K-12 substr. MG1655] +MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWE +RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY +DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF +IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA +DWLGKNYLQNQEGFVHICRLDTAGARVLEN +>gi|16127998|ref|NP_414545.1| threonine synthase [Escherichia coli str. K-12 substr. MG1655] +MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLDFVTRSAKILSAFIGDEIPQE +ILEERVRAAFAFPAPVANVESDVGCLELFHGPTLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAA +VAHAFYGLPNVKVVILYPRGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNS +ANSINISRLLAQICYYFEAVAQLPQETRNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVNDTVP +RFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDETTQQTMRELKELGYTS +EPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGETLDLPKELAERADLPLLSHNLPADFAAL +RKLMMNHQ +>gi|16127999|ref|NP_414546.1| hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655] +MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL +HGPPPPPRHHKKAPHDHHGGHGPGKHHR +>gi|16128000|ref|NP_414547.1| peroxide resistance protein, lowers intracellular iron [Escherichia coli str. K-12 substr. MG1655] +MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLMRISDKLAGINAARFHDWQPD +FTPANARQAILAFKGDVYTGLQAETFSEDDFDFAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARG +KDLYQFWGDIITNKLNEALAAQGDNVVINLASDEYFKSVKPKKLNAEIIKPVFLDEKNGKFKIISFYAKK +ARGLMSRFIIENRLTKPEQLTGFNSEGYFFDEDSSSNGELVFKRYEQR +>gi|16128001|ref|NP_414548.1| putative transporter [Escherichia coli str. K-12 substr. MG1655] +MPDFFSFINSVLWGSVMIYLLFGAGCWFTFRTGFVQFRYIRQFGKSLKNSIHPQPGGLTSFQSLCTSLAA +RVGSGNLAGVALAITAGGPGAVFWMWVAAFIGMATSFAECSLAQLYKERDVNGQFRGGPAWYMARGLGMR +WMGVLFAVFLLIAYGIIFSGVQANAVARALSFSFDFPPLVTGIILAVFTLLAITRGLHGVARLMQGFVPL +MAIIWVLTSLVICVMNIGQLPHVIWSIFESAFGWQEAAGGAAGYTLSQAITNGFQRSMFSNEAGMGSTPN +AAAAAASWPPHPAAQGIVQMIGIFIDTLVICTASAMLILLAGNGTTYMPLEGIQLIQKAMRVLMGSWGAE +FVTLVVILFAFSSIVANYIYAENNLFFLRLNNPKAIWCLRICTFATVIGGTLLSLPLMWQLADIIMACMA +ITNLTAILLLSPVVHTIASDYLRQRKLGVRPVFDPLRYPDIGRQLSPDAWDDVSQE +>gi|16128002|ref|NP_414549.1| transaldolase B [Escherichia coli str. K-12 substr. MG1655] +MTDKLTSLRQYTTVVADTGDIAAMKLYQPQDATTNPSLILNAAQIPEYRKLIDDAVAWAKQQSNDRAQQI +VDATDKLAVNIGLEILKLVPGRISTEVDARLSYDTEASIAKAKRLIKLYNDAGISNDRILIKLASTWQGI +RAAEQLEKEGINCNLTLLFSFAQARACAEAGVFLISPFVGRILDWYKANTDKKEYAPAEDPGVVSVSEIY +QYYKEHGYETVVMGASFRNIGEILELAGCDRLTIAPALLKELAESEGAIERKLSYTGEVKARPARITESE +FLWQHNQDPMAVDKLAEGIRKFAIDQEKLEKMIGDLL +>gi|16128003|ref|NP_414550.1| molybdochelatase incorporating molybdenum into molybdopterin [Escherichia coli str. K-12 substr. MG1655] +MNTLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDEQAIIEQTLCELVDEMSCHLV +LTTGGTGPARRDVTPDATLAVADREMPGFGEQMRQISLHFVPTAILSRQVGVIRKQALILNLPGQPKSIK +ETLEGVKDAEGNVVVHGIFASVPYCIQLLEGPYVETAPEVVAAFRPKSARRDVSE +>gi|16128004|ref|NP_414551.1| inner membrane protein, Grp1_Fun34_YaaH family [Escherichia coli str. K-12 substr. MG1655] +MGNTKLANPAPLGLMGFGMTTILLNLHNVGYFALDGIILAMGIFYGGIAQIFAGLLEYKKGNTFGLTAFT +SYGSFWLTLVAILLMPKLGLTDAPNAQFLGVYLGLWGVFTLFMFFGTLKGARVLQFVFFSLTVLFALLAI +GNIAGNAAIIHFAGWIGLICGASAIYLAMGEVLNEQFGRTVLPIGESH +
--- a/tools/filters/seq_filter_by_id.py Tue Jun 07 17:24:30 2011 -0400 +++ b/tools/filters/seq_filter_by_id.py Mon Apr 15 12:27:30 2013 -0400 @@ -25,10 +25,11 @@ molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878. -This script is copyright 2010 by Peter Cock, SCRI, UK. All rights reserved. +This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute +(formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved. See accompanying text file for licence details (MIT/BSD style). -This is version 0.0.1 of the script. +This is version 0.0.4 of the script, use -v or --version to get the version. """ import sys @@ -36,15 +37,21 @@ sys.stderr.write(msg.rstrip() + "\n") sys.exit(err) +if "-v" in sys.argv or "--version" in sys.argv: + print "v0.0.4" + sys.exit(0) + #Parse Command Line try: tabular_file, cols_arg, in_file, seq_format, out_positive_file, out_negative_file = sys.argv[1:] except ValueError: - stop_err("Expected six arguments, got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv))) + stop_err("Expected six arguments (tab file, columns, in seq, seq format, out pos, out neg), got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv))) try: columns = [int(arg)-1 for arg in cols_arg.split(",")] except ValueError: stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg) +if min(columns) < 0: + stop_err("Expect one-based column numbers (not zero-based counting), got %s" % cols_arg) if out_positive_file == "-" and out_negative_file == "-": stop_err("Neither output file requested") @@ -73,12 +80,80 @@ handle.close() +def crude_fasta_iterator(handle): + """Yields tuples, record ID and the full record as a string.""" + while True: + line = handle.readline() + if line == "": + return # Premature end of file, or just empty? + if line[0] == ">": + break + + while True: + if line[0] != ">": + raise ValueError( + "Records in Fasta files should start with '>' character") + id = line[1:].split(None, 1)[0] + lines = [line] + line = handle.readline() + while True: + if not line: + break + if line[0] == ">": + break + lines.append(line) + line = handle.readline() + yield id, "".join(lines) + if not line: + return # StopIteration + + +def fasta_filter(in_file, pos_file, neg_file, wanted): + """FASTA filter producing 60 character line wrapped outout.""" + pos_count = neg_count = 0 + #Galaxy now requires Python 2.5+ so can use with statements, + with open(in_file) as in_handle: + #Doing the if statement outside the loop for speed + #(with the downside of three very similar loops). + if pos_file != "-" and neg_file != "-": + print "Generating two FASTA files" + with open(pos_file, "w") as pos_handle: + with open(neg_file, "w") as neg_handle: + for identifier, record in crude_fasta_iterator(in_handle): + if identifier in wanted: + pos_handle.write(record) + pos_count += 1 + else: + neg_handle.write(record) + neg_count += 1 + elif pos_file != "-": + print "Generating matching FASTA file" + with open(pos_file, "w") as pos_handle: + for identifier, record in crude_fasta_iterator(in_handle): + if identifier in wanted: + pos_handle.write(record) + pos_count += 1 + else: + neg_count += 1 + else: + print "Generating non-matching FASTA file" + assert neg_file != "-" + with open(neg_file, "w") as neg_handle: + for identifier, record in crude_fasta_iterator(in_handle): + if identifier in wanted: + pos_count += 1 + else: + neg_handle.write(record) + neg_count += 1 + return pos_count, neg_count + + if seq_format.lower()=="sff": #Now write filtered SFF file based on IDs from BLAST file try: from Bio.SeqIO.SffIO import SffIterator, SffWriter except ImportError: - stop_err("Requires Biopython 1.54 or later") + stop_err("SFF filtering requires Biopython 1.54 or later") try: from Bio.SeqIO.SffIO import ReadRocheXmlManifest @@ -92,6 +167,7 @@ manifest = None #This makes two passes though the SFF file with isn't so efficient, #but this makes the code simple. + pos_count = neg_count = 0 if out_positive_file != "-": out_handle = open(out_positive_file, "wb") writer = SffWriter(out_handle, xml=manifest) @@ -108,44 +184,11 @@ in_handle.close() #At the time of writing, Galaxy doesn't show SFF file read counts, #so it is useful to put them in stdout and thus shown in job info. - if out_positive_file != "-" and out_negative_file != "-": - print "%i with and %i without specified IDs" % (pos_count, neg_count) - elif out_positive_file != "-": - print "%i with specified IDs" % pos_count - elif out_negative_file != "-": - print "%i without specified IDs" % neg_count + print "%i with and %i without specified IDs" % (pos_count, neg_count) elif seq_format.lower()=="fasta": #Write filtered FASTA file based on IDs from tabular file - from galaxy_utils.sequence.fasta import fastaReader, fastaWriter - reader = fastaReader(open(in_file, "rU")) - if out_positive_file != "-" and out_negative_file != "-": - print "Generating two FASTA files" - positive_writer = fastaWriter(open(out_positive_file, "w")) - negative_writer = fastaWriter(open(out_negative_file, "w")) - for record in reader: - #The [1:] is because the fastaReader leaves the > on the identifer. - if record.identifier and record.identifier.split()[0][1:] in ids: - positive_writer.write(record) - else: - negative_writer.write(record) - positive_writer.close() - negative_writer.close() - elif out_positive_file != "-": - print "Generating matching FASTA file" - positive_writer = fastaWriter(open(out_positive_file, "w")) - for record in reader: - #The [1:] is because the fastaReader leaves the > on the identifer. - if record.identifier and record.identifier.split()[0][1:] in ids: - positive_writer.write(record) - positive_writer.close() - elif out_negative_file != "-": - print "Generating non-matching FASTA file" - negative_writer = fastaWriter(open(out_negative_file, "w")) - for record in reader: - #The [1:] is because the fastaReader leaves the > on the identifer. - if not record.identifier or record.identifier.split()[0][1:] not in ids: - negative_writer.write(record) - negative_writer.close() + pos_count, neg_count = fasta_filter(in_file, out_positive_file, out_negative_file, ids) + print "%i with and %i without specified IDs" % (pos_count, neg_count) elif seq_format.lower().startswith("fastq"): #Write filtered FASTQ file based on IDs from tabular file from galaxy_utils.sequence.fastq import fastqReader, fastqWriter @@ -155,7 +198,7 @@ positive_writer = fastqWriter(open(out_positive_file, "w")) negative_writer = fastqWriter(open(out_negative_file, "w")) for record in reader: - #The [1:] is because the fastaReader leaves the @ on the identifer. + #The [1:] is because the fastaReader leaves the > on the identifier. if record.identifier and record.identifier.split()[0][1:] in ids: positive_writer.write(record) else: @@ -166,7 +209,7 @@ print "Generating matching FASTQ file" positive_writer = fastqWriter(open(out_positive_file, "w")) for record in reader: - #The [1:] is because the fastaReader leaves the @ on the identifer. + #The [1:] is because the fastaReader leaves the > on the identifier. if record.identifier and record.identifier.split()[0][1:] in ids: positive_writer.write(record) positive_writer.close() @@ -174,9 +217,10 @@ print "Generating non-matching FASTQ file" negative_writer = fastqWriter(open(out_negative_file, "w")) for record in reader: - #The [1:] is because the fastaReader leaves the @ on the identifer. + #The [1:] is because the fastaReader leaves the > on the identifier. if not record.identifier or record.identifier.split()[0][1:] not in ids: negative_writer.write(record) negative_writer.close() + reader.close() else: stop_err("Unsupported file type %r" % seq_format)
--- a/tools/filters/seq_filter_by_id.txt Tue Jun 07 17:24:30 2011 -0400 +++ b/tools/filters/seq_filter_by_id.txt Mon Apr 15 12:27:30 2013 -0400 @@ -1,7 +1,8 @@ Galaxy tool to filter FASTA, FASTQ or SFF sequences by ID ========================================================= -This tool is copyright 2010 by Peter Cock, SCRI, UK. All rights reserved. +This tool is copyright 2010-2013 by Peter Cock, The James Hutton Institute +(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. See the licence text below. This tool is a short Python script (using both the Galaxy and Biopython library @@ -10,6 +11,10 @@ include filtering based on search results from a tool like NCBI BLAST before assembly. + +Installation +============ + There are just two files to install: * seq_filter_by_id.py (the Python script) @@ -30,6 +35,9 @@ ======= v0.0.1 - Initial version, combining three separate scripts for each file format. +v0.0.4 - Record script version when run from Galaxy. + - Faster FASTA code which preserves the original line wrapping. + - Basic unit test included. Developers @@ -41,10 +49,10 @@ This incorporates the previously used hg branch: http://bitbucket.org/peterjc/galaxy-central/src/fasta_filter -For making the "Galaxy Tool Shed" http://community.g2.bx.psu.edu/ tarball use +For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use the following command from the Galaxy root folder: -tar -czf seq_filter_by_id.tar.gz tools/filters/seq_filter_by_id.* +$ tar -czf seq_filter_by_id.tar.gz tools/filters/seq_filter_by_id.* test-data/k12_ten_proteins.fasta test-data/k12_hypothetical.fasta test-data/k12_hypothetical.tabular Check this worked: @@ -52,6 +60,9 @@ filter/seq_filter_by_id.py filter/seq_filter_by_id.txt filter/seq_filter_by_id.xml +test-data/k12_ten_proteins.fasta +test-data/k12_hypothetical.fasta +test-data/k12_hypothetical.tabular Licence (MIT/BSD style)
--- a/tools/filters/seq_filter_by_id.xml Tue Jun 07 17:24:30 2011 -0400 +++ b/tools/filters/seq_filter_by_id.xml Mon Apr 15 12:27:30 2013 -0400 @@ -1,5 +1,6 @@ -<tool id="seq_filter_by_id" name="Filter sequences by ID" version="0.0.1"> +<tool id="seq_filter_by_id" name="Filter sequences by ID" version="0.0.4"> <description>from a tabular file</description> + <version_command interpreter="python">seq_filter_by_id.py --version</version_command> <command interpreter="python"> seq_filter_by_id.py $input_tabular $columns $input_file $input_file.ext #if $output_choice_cond.output_choice=="both" @@ -11,7 +12,7 @@ #end if </command> <inputs> - <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file to filter on the identifiers" description="FASTA, FASTQ, or SFF format." /> + <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file to filter on the identifiers" help="FASTA, FASTQ, or SFF format." /> <param name="input_tabular" type="data" format="tabular" label="Tabular file containing sequence identifiers"/> <param name="columns" type="data_column" data_ref="input_tabular" multiple="True" numerical="False" label="Column(s) containing sequence identifiers" help="Multi-select list - hold the appropriate key while clicking to select multiple columns"> <validator type="no_options" message="Pick at least one column"/> @@ -55,6 +56,13 @@ </data> </outputs> <tests> + <test> + <param name="input_file" value="k12_ten_proteins.fasta" ftype="fasta" /> + <param name="input_tabular" value="k12_hypothetical.tabular" ftype="tabular" /> + <param name="columns" value="1" /> + <param name="output_choice" value="pos" /> + <output name="output_pos" file="k12_hypothetical.fasta" ftype="fasta" /> + </test> </tests> <requirements> <requirement type="python-module">Bio</requirement> @@ -80,7 +88,7 @@ reads without BLAST matches (i.e. those which do not match your contaminant database). -You may have a file of FASTA sequences which has been run some some analysis +You may have a file of FASTA sequences which has been used with some analysis tool giving tabular output, which has then been filtered on some criteria. You can then use this tool to divide the original FASTA file into those entries matching or not matching your criteria (those with or without their identifier
