seq_filter_by_id: tools/filters/seq_filter_by

annotate tools/filters/seq_filter_by_id.py @ 1:262f08104540 draft

Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)

author	peterjc
date	Mon, 15 Apr 2013 12:27:30 -0400
parents	5844f6a450ed
children	abdd608c869b

rev	line source
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	1 #!/usr/bin/env python
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	2 """Filter a FASTA, FASTQ or SSF file with IDs from a tabular file.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	3
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	4 Takes six command line options, tabular filename, ID column numbers (comma
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	5 separated list using one based counting), input filename, input type (e.g.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	6 FASTA or SFF) and two output filenames (for records with and without the
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	7 given IDs, same format as input sequence file).
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	8
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	9 If either output filename is just a minus sign, that file is not created.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	10 This is intended to allow output for just the matched (or just the non-matched)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	11 records.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	12
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	13 When filtering an SFF file, any Roche XML manifest in the input file is
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	14 preserved in both output files.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	15
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	16 Note in the default NCBI BLAST+ tabular output, the query sequence ID is
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	17 in column one, and the ID of the match from the database is in column two.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	18 Here sensible values for the column numbers would therefore be "1" or "2".
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	19
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	20 This tool is a short Python script which requires Biopython 1.54 or later
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	21 for SFF file support. If you use this tool in scientific work leading to a
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	22 publication, please cite the Biopython application note:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	23
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	24 Cock et al 2009. Biopython: freely available Python tools for computational
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	25 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	26 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	27
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	28 This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	29 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	30 See accompanying text file for licence details (MIT/BSD style).
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	31
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	32 This is version 0.0.4 of the script, use -v or --version to get the version.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	33 """
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	34 import sys
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	35
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	36 def stop_err(msg, err=1):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	37 sys.stderr.write(msg.rstrip() + "\n")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	38 sys.exit(err)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	39
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	40 if "-v" in sys.argv or "--version" in sys.argv:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	41 print "v0.0.4"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	42 sys.exit(0)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	43
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	44 #Parse Command Line
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	45 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	46 tabular_file, cols_arg, in_file, seq_format, out_positive_file, out_negative_file = sys.argv[1:]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	47 except ValueError:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	48 stop_err("Expected six arguments (tab file, columns, in seq, seq format, out pos, out neg), got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv)))
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	49 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	50 columns = [int(arg)-1 for arg in cols_arg.split(",")]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	51 except ValueError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	52 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg)
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	53 if min(columns) < 0:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	54 stop_err("Expect one-based column numbers (not zero-based counting), got %s" % cols_arg)
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	55
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	56 if out_positive_file == "-" and out_negative_file == "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	57 stop_err("Neither output file requested")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	58
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	59
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	60 #Read tabular file and record all specified identifiers
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	61 ids = set()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	62 handle = open(tabular_file, "rU")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	63 if len(columns)>1:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	64 #General case of many columns
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	65 for line in handle:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	66 if line.startswith("#"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	67 #Ignore comments
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	68 continue
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	69 parts = line.rstrip("\n").split("\t")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	70 for col in columns:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	71 ids.add(parts[col])
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	72 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	73 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	74 #Single column, special case speed up
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	75 col = columns[0]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	76 for line in handle:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	77 if not line.startswith("#"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	78 ids.add(line.rstrip("\n").split("\t")[col])
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	79 print "Using %i IDs from tabular file" % (len(ids))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	80 handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	81
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	82
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	83 def crude_fasta_iterator(handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	84 """Yields tuples, record ID and the full record as a string."""
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	85 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	86 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	87 if line == "":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	88 return # Premature end of file, or just empty?
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	89 if line[0] == ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	90 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	91
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	92 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	93 if line[0] != ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	94 raise ValueError(
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	95 "Records in Fasta files should start with '>' character")
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	96 id = line[1:].split(None, 1)[0]
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	97 lines = [line]
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	98 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	99 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	100 if not line:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	101 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	102 if line[0] == ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	103 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	104 lines.append(line)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	105 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	106 yield id, "".join(lines)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	107 if not line:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	108 return # StopIteration
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	109
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	110
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	111 def fasta_filter(in_file, pos_file, neg_file, wanted):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	112 """FASTA filter producing 60 character line wrapped outout."""
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	113 pos_count = neg_count = 0
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	114 #Galaxy now requires Python 2.5+ so can use with statements,
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	115 with open(in_file) as in_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	116 #Doing the if statement outside the loop for speed
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	117 #(with the downside of three very similar loops).
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	118 if pos_file != "-" and neg_file != "-":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	119 print "Generating two FASTA files"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	120 with open(pos_file, "w") as pos_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	121 with open(neg_file, "w") as neg_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	122 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	123 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	124 pos_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	125 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	126 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	127 neg_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	128 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	129 elif pos_file != "-":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	130 print "Generating matching FASTA file"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	131 with open(pos_file, "w") as pos_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	132 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	133 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	134 pos_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	135 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	136 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	137 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	138 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	139 print "Generating non-matching FASTA file"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	140 assert neg_file != "-"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	141 with open(neg_file, "w") as neg_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	142 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	143 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	144 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	145 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	146 neg_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	147 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	148 return pos_count, neg_count
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	149
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	150
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	151 if seq_format.lower()=="sff":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	152 #Now write filtered SFF file based on IDs from BLAST file
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	153 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	154 from Bio.SeqIO.SffIO import SffIterator, SffWriter
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	155 except ImportError:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	156 stop_err("SFF filtering requires Biopython 1.54 or later")
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	157
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	158 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	159 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	160 except ImportError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	161 #Prior to Biopython 1.56 this was a private function
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	162 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	163 in_handle = open(in_file, "rb") #must be binary mode!
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	164 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	165 manifest = ReadRocheXmlManifest(in_handle)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	166 except ValueError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	167 manifest = None
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	168 #This makes two passes though the SFF file with isn't so efficient,
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	169 #but this makes the code simple.
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	170 pos_count = neg_count = 0
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	171 if out_positive_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	172 out_handle = open(out_positive_file, "wb")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	173 writer = SffWriter(out_handle, xml=manifest)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	174 in_handle.seek(0) #start again after getting manifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	175 pos_count = writer.write_file(rec for rec in SffIterator(in_handle) if rec.id in ids)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	176 out_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	177 if out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	178 out_handle = open(out_negative_file, "wb")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	179 writer = SffWriter(out_handle, xml=manifest)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	180 in_handle.seek(0) #start again
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	181 neg_count = writer.write_file(rec for rec in SffIterator(in_handle) if rec.id not in ids)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	182 out_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	183 #And we're done
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	184 in_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	185 #At the time of writing, Galaxy doesn't show SFF file read counts,
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	186 #so it is useful to put them in stdout and thus shown in job info.
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	187 print "%i with and %i without specified IDs" % (pos_count, neg_count)
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	188 elif seq_format.lower()=="fasta":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	189 #Write filtered FASTA file based on IDs from tabular file
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	190 pos_count, neg_count = fasta_filter(in_file, out_positive_file, out_negative_file, ids)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	191 print "%i with and %i without specified IDs" % (pos_count, neg_count)
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	192 elif seq_format.lower().startswith("fastq"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	193 #Write filtered FASTQ file based on IDs from tabular file
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	194 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	195 reader = fastqReader(open(in_file, "rU"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	196 if out_positive_file != "-" and out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	197 print "Generating two FASTQ files"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	198 positive_writer = fastqWriter(open(out_positive_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	199 negative_writer = fastqWriter(open(out_negative_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	200 for record in reader:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	201 #The [1:] is because the fastaReader leaves the > on the identifier.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	202 if record.identifier and record.identifier.split()[0][1:] in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	203 positive_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	204 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	205 negative_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	206 positive_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	207 negative_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	208 elif out_positive_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	209 print "Generating matching FASTQ file"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	210 positive_writer = fastqWriter(open(out_positive_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	211 for record in reader:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	212 #The [1:] is because the fastaReader leaves the > on the identifier.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	213 if record.identifier and record.identifier.split()[0][1:] in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	214 positive_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	215 positive_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	216 elif out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	217 print "Generating non-matching FASTQ file"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	218 negative_writer = fastqWriter(open(out_negative_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	219 for record in reader:
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	220 #The [1:] is because the fastaReader leaves the > on the identifier.
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	221 if not record.identifier or record.identifier.split()[0][1:] not in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	222 negative_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	223 negative_writer.close()
1 262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes) peterjc parents: 0 diff changeset	224 reader.close()
0 5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	225 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository peterjc parents: diff changeset	226 stop_err("Unsupported file type %r" % seq_format)

Mercurial > repos > peterjc > seq_filter_by_id

annotate tools/filters/seq_filter_by_id.py @ 1:262f08104540 draft