seq_filter_by_id: tools/filters/seq_filter_by

annotate tools/filters/seq_filter_by_id.py @ 2:abdd608c869b draft

Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.

author	peterjc
date	Wed, 24 Apr 2013 11:34:12 -0400
parents	262f08104540
children

rev	line source
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	1 #!/usr/bin/env python
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	2 """Filter a FASTA, FASTQ or SSF file with IDs from a tabular file.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	3
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	4 Takes six command line options, tabular filename, ID column numbers (comma
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	5 separated list using one based counting), input filename, input type (e.g.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	6 FASTA or SFF) and two output filenames (for records with and without the
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	7 given IDs, same format as input sequence file).
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	8
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	9 If either output filename is just a minus sign, that file is not created.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	10 This is intended to allow output for just the matched (or just the non-matched)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	11 records.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	12
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	13 When filtering an SFF file, any Roche XML manifest in the input file is
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	14 preserved in both output files.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	15
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	16 Note in the default NCBI BLAST+ tabular output, the query sequence ID is
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	17 in column one, and the ID of the match from the database is in column two.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	18 Here sensible values for the column numbers would therefore be "1" or "2".
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	19
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	20 This tool is a short Python script which requires Biopython 1.54 or later
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	21 for SFF file support. If you use this tool in scientific work leading to a
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	22 publication, please cite the Biopython application note:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	23
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	24 Cock et al 2009. Biopython: freely available Python tools for computational
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	25 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	26 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	27
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	28 This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	29 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	30 See accompanying text file for licence details (MIT/BSD style).
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	31
2 abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	32 This is version 0.0.5 of the script, use -v or --version to get the version.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	33 """
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	34 import sys
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	35
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	36 def stop_err(msg, err=1):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	37 sys.stderr.write(msg.rstrip() + "\n")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	38 sys.exit(err)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	39
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	40 if "-v" in sys.argv or "--version" in sys.argv:
2 abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	41 print "v0.0.5"
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	42 sys.exit(0)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	43
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	44 #Parse Command Line
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	45 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	46 tabular_file, cols_arg, in_file, seq_format, out_positive_file, out_negative_file = sys.argv[1:]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	47 except ValueError:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	48 stop_err("Expected six arguments (tab file, columns, in seq, seq format, out pos, out neg), got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv)))
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	49 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	50 columns = [int(arg)-1 for arg in cols_arg.split(",")]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	51 except ValueError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	52 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg)
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	53 if min(columns) < 0:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	54 stop_err("Expect one-based column numbers (not zero-based counting), got %s" % cols_arg)
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	55
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	56 if out_positive_file == "-" and out_negative_file == "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	57 stop_err("Neither output file requested")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	58
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	59
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	60 #Read tabular file and record all specified identifiers
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	61 ids = set()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	62 handle = open(tabular_file, "rU")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	63 if len(columns)>1:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	64 #General case of many columns
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	65 for line in handle:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	66 if line.startswith("#"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	67 #Ignore comments
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	68 continue
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	69 parts = line.rstrip("\n").split("\t")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	70 for col in columns:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	71 ids.add(parts[col])
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	72 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	73 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	74 #Single column, special case speed up
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	75 col = columns[0]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	76 for line in handle:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	77 if not line.startswith("#"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	78 ids.add(line.rstrip("\n").split("\t")[col])
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	79 print "Using %i IDs from tabular file" % (len(ids))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	80 handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	81
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	82
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	83 def crude_fasta_iterator(handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	84 """Yields tuples, record ID and the full record as a string."""
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	85 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	86 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	87 if line == "":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	88 return # Premature end of file, or just empty?
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	89 if line[0] == ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	90 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	91
2 abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	92 no_id_warned = False
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	93 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	94 if line[0] != ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	95 raise ValueError(
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	96 "Records in Fasta files should start with '>' character")
2 abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	97 try:
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	98 id = line[1:].split(None, 1)[0]
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	99 except IndexError:
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	100 if not no_id_warned:
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	101 sys.stderr.write("WARNING - Malformed FASTA entry with no identifier\n")
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	102 no_id_warned = True
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID. peterjc parents: 1 diff changeset	103 id = None
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	104 lines = [line]
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	105 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	106 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	107 if not line:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	108 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	109 if line[0] == ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	110 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	111 lines.append(line)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	112 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	113 yield id, "".join(lines)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	114 if not line:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	115 return # StopIteration
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	116
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	117
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	118 def fasta_filter(in_file, pos_file, neg_file, wanted):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	119 """FASTA filter producing 60 character line wrapped outout."""
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	120 pos_count = neg_count = 0
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	121 #Galaxy now requires Python 2.5+ so can use with statements,
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	122 with open(in_file) as in_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	123 #Doing the if statement outside the loop for speed
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	124 #(with the downside of three very similar loops).
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	125 if pos_file != "-" and neg_file != "-":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	126 print "Generating two FASTA files"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	127 with open(pos_file, "w") as pos_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	128 with open(neg_file, "w") as neg_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	129 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	130 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	131 pos_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	132 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	133 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	134 neg_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	135 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	136 elif pos_file != "-":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	137 print "Generating matching FASTA file"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	138 with open(pos_file, "w") as pos_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	139 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	140 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	141 pos_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	142 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	143 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	144 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	145 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	146 print "Generating non-matching FASTA file"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	147 assert neg_file != "-"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	148 with open(neg_file, "w") as neg_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	149 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	150 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	151 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	152 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	153 neg_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	154 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	155 return pos_count, neg_count
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	156
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	157
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	158 if seq_format.lower()=="sff":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	159 #Now write filtered SFF file based on IDs from BLAST file
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	160 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	161 from Bio.SeqIO.SffIO import SffIterator, SffWriter
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	162 except ImportError:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	163 stop_err("SFF filtering requires Biopython 1.54 or later")
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	164
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	165 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	166 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	167 except ImportError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	168 #Prior to Biopython 1.56 this was a private function
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	169 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	170 in_handle = open(in_file, "rb") #must be binary mode!
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	171 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	172 manifest = ReadRocheXmlManifest(in_handle)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	173 except ValueError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	174 manifest = None
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	175 #This makes two passes though the SFF file with isn't so efficient,
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	176 #but this makes the code simple.
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	177 pos_count = neg_count = 0
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	178 if out_positive_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	179 out_handle = open(out_positive_file, "wb")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	180 writer = SffWriter(out_handle, xml=manifest)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	181 in_handle.seek(0) #start again after getting manifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	182 pos_count = writer.write_file(rec for rec in SffIterator(in_handle) if rec.id in ids)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	183 out_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	184 if out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	185 out_handle = open(out_negative_file, "wb")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	186 writer = SffWriter(out_handle, xml=manifest)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	187 in_handle.seek(0) #start again
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	188 neg_count = writer.write_file(rec for rec in SffIterator(in_handle) if rec.id not in ids)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	189 out_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	190 #And we're done
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	191 in_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	192 #At the time of writing, Galaxy doesn't show SFF file read counts,
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	193 #so it is useful to put them in stdout and thus shown in job info.
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	194 print "%i with and %i without specified IDs" % (pos_count, neg_count)
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	195 elif seq_format.lower()=="fasta":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	196 #Write filtered FASTA file based on IDs from tabular file
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	197 pos_count, neg_count = fasta_filter(in_file, out_positive_file, out_negative_file, ids)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	198 print "%i with and %i without specified IDs" % (pos_count, neg_count)
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	199 elif seq_format.lower().startswith("fastq"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	200 #Write filtered FASTQ file based on IDs from tabular file
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	201 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	202 reader = fastqReader(open(in_file, "rU"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	203 if out_positive_file != "-" and out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	204 print "Generating two FASTQ files"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	205 positive_writer = fastqWriter(open(out_positive_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	206 negative_writer = fastqWriter(open(out_negative_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	207 for record in reader:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	208 #The [1:] is because the fastaReader leaves the > on the identifier.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	209 if record.identifier and record.identifier.split()[0][1:] in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	210 positive_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	211 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	212 negative_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	213 positive_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	214 negative_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	215 elif out_positive_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	216 print "Generating matching FASTQ file"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	217 positive_writer = fastqWriter(open(out_positive_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	218 for record in reader:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	219 #The [1:] is because the fastaReader leaves the > on the identifier.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	220 if record.identifier and record.identifier.split()[0][1:] in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	221 positive_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	222 positive_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	223 elif out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	224 print "Generating non-matching FASTQ file"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	225 negative_writer = fastqWriter(open(out_negative_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	226 for record in reader:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	227 #The [1:] is because the fastaReader leaves the > on the identifier.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	228 if not record.identifier or record.identifier.split()[0][1:] not in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	229 negative_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	230 negative_writer.close()
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	231 reader.close()
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	232 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	233 stop_err("Unsupported file type %r" % seq_format)

Mercurial > repos > peterjc > seq_filter_by_id

annotate tools/filters/seq_filter_by_id.py @ 2:abdd608c869b draft