annotate tools/filters/seq_filter_by_id.py @ 1:262f08104540 draft

Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
author peterjc
date Mon, 15 Apr 2013 12:27:30 -0400
parents 5844f6a450ed
children abdd608c869b
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
2 """Filter a FASTA, FASTQ or SSF file with IDs from a tabular file.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
3
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
4 Takes six command line options, tabular filename, ID column numbers (comma
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
5 separated list using one based counting), input filename, input type (e.g.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
6 FASTA or SFF) and two output filenames (for records with and without the
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
7 given IDs, same format as input sequence file).
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
8
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
9 If either output filename is just a minus sign, that file is not created.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
10 This is intended to allow output for just the matched (or just the non-matched)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
11 records.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
12
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
13 When filtering an SFF file, any Roche XML manifest in the input file is
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
14 preserved in both output files.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
15
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
16 Note in the default NCBI BLAST+ tabular output, the query sequence ID is
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
17 in column one, and the ID of the match from the database is in column two.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
18 Here sensible values for the column numbers would therefore be "1" or "2".
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
19
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
20 This tool is a short Python script which requires Biopython 1.54 or later
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
21 for SFF file support. If you use this tool in scientific work leading to a
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
22 publication, please cite the Biopython application note:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
23
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
24 Cock et al 2009. Biopython: freely available Python tools for computational
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
25 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
26 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
27
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
28 This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
29 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
30 See accompanying text file for licence details (MIT/BSD style).
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
31
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
32 This is version 0.0.4 of the script, use -v or --version to get the version.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
33 """
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
34 import sys
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
35
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
36 def stop_err(msg, err=1):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
37 sys.stderr.write(msg.rstrip() + "\n")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
38 sys.exit(err)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
39
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
40 if "-v" in sys.argv or "--version" in sys.argv:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
41 print "v0.0.4"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
42 sys.exit(0)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
43
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
44 #Parse Command Line
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
45 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
46 tabular_file, cols_arg, in_file, seq_format, out_positive_file, out_negative_file = sys.argv[1:]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
47 except ValueError:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
48 stop_err("Expected six arguments (tab file, columns, in seq, seq format, out pos, out neg), got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv)))
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
49 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
50 columns = [int(arg)-1 for arg in cols_arg.split(",")]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
51 except ValueError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
52 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg)
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
53 if min(columns) < 0:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
54 stop_err("Expect one-based column numbers (not zero-based counting), got %s" % cols_arg)
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
55
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
56 if out_positive_file == "-" and out_negative_file == "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
57 stop_err("Neither output file requested")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
58
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
59
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
60 #Read tabular file and record all specified identifiers
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
61 ids = set()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
62 handle = open(tabular_file, "rU")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
63 if len(columns)>1:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
64 #General case of many columns
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
65 for line in handle:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
66 if line.startswith("#"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
67 #Ignore comments
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
68 continue
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
69 parts = line.rstrip("\n").split("\t")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
70 for col in columns:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
71 ids.add(parts[col])
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
72 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
73 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
74 #Single column, special case speed up
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
75 col = columns[0]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
76 for line in handle:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
77 if not line.startswith("#"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
78 ids.add(line.rstrip("\n").split("\t")[col])
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
79 print "Using %i IDs from tabular file" % (len(ids))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
80 handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
81
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
82
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
83 def crude_fasta_iterator(handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
84 """Yields tuples, record ID and the full record as a string."""
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
85 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
86 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
87 if line == "":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
88 return # Premature end of file, or just empty?
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
89 if line[0] == ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
90 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
91
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
92 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
93 if line[0] != ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
94 raise ValueError(
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
95 "Records in Fasta files should start with '>' character")
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
96 id = line[1:].split(None, 1)[0]
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
97 lines = [line]
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
98 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
99 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
100 if not line:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
101 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
102 if line[0] == ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
103 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
104 lines.append(line)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
105 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
106 yield id, "".join(lines)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
107 if not line:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
108 return # StopIteration
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
109
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
110
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
111 def fasta_filter(in_file, pos_file, neg_file, wanted):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
112 """FASTA filter producing 60 character line wrapped outout."""
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
113 pos_count = neg_count = 0
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
114 #Galaxy now requires Python 2.5+ so can use with statements,
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
115 with open(in_file) as in_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
116 #Doing the if statement outside the loop for speed
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
117 #(with the downside of three very similar loops).
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
118 if pos_file != "-" and neg_file != "-":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
119 print "Generating two FASTA files"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
120 with open(pos_file, "w") as pos_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
121 with open(neg_file, "w") as neg_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
122 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
123 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
124 pos_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
125 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
126 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
127 neg_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
128 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
129 elif pos_file != "-":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
130 print "Generating matching FASTA file"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
131 with open(pos_file, "w") as pos_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
132 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
133 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
134 pos_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
135 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
136 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
137 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
138 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
139 print "Generating non-matching FASTA file"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
140 assert neg_file != "-"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
141 with open(neg_file, "w") as neg_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
142 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
143 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
144 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
145 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
146 neg_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
147 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
148 return pos_count, neg_count
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
149
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
150
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
151 if seq_format.lower()=="sff":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
152 #Now write filtered SFF file based on IDs from BLAST file
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
153 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
154 from Bio.SeqIO.SffIO import SffIterator, SffWriter
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
155 except ImportError:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
156 stop_err("SFF filtering requires Biopython 1.54 or later")
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
157
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
158 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
159 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
160 except ImportError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
161 #Prior to Biopython 1.56 this was a private function
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
162 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
163 in_handle = open(in_file, "rb") #must be binary mode!
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
164 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
165 manifest = ReadRocheXmlManifest(in_handle)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
166 except ValueError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
167 manifest = None
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
168 #This makes two passes though the SFF file with isn't so efficient,
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
169 #but this makes the code simple.
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
170 pos_count = neg_count = 0
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
171 if out_positive_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
172 out_handle = open(out_positive_file, "wb")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
173 writer = SffWriter(out_handle, xml=manifest)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
174 in_handle.seek(0) #start again after getting manifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
175 pos_count = writer.write_file(rec for rec in SffIterator(in_handle) if rec.id in ids)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
176 out_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
177 if out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
178 out_handle = open(out_negative_file, "wb")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
179 writer = SffWriter(out_handle, xml=manifest)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
180 in_handle.seek(0) #start again
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
181 neg_count = writer.write_file(rec for rec in SffIterator(in_handle) if rec.id not in ids)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
182 out_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
183 #And we're done
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
184 in_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
185 #At the time of writing, Galaxy doesn't show SFF file read counts,
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
186 #so it is useful to put them in stdout and thus shown in job info.
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
187 print "%i with and %i without specified IDs" % (pos_count, neg_count)
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
188 elif seq_format.lower()=="fasta":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
189 #Write filtered FASTA file based on IDs from tabular file
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
190 pos_count, neg_count = fasta_filter(in_file, out_positive_file, out_negative_file, ids)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
191 print "%i with and %i without specified IDs" % (pos_count, neg_count)
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
192 elif seq_format.lower().startswith("fastq"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
193 #Write filtered FASTQ file based on IDs from tabular file
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
194 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
195 reader = fastqReader(open(in_file, "rU"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
196 if out_positive_file != "-" and out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
197 print "Generating two FASTQ files"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
198 positive_writer = fastqWriter(open(out_positive_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
199 negative_writer = fastqWriter(open(out_negative_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
200 for record in reader:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
201 #The [1:] is because the fastaReader leaves the > on the identifier.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
202 if record.identifier and record.identifier.split()[0][1:] in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
203 positive_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
204 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
205 negative_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
206 positive_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
207 negative_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
208 elif out_positive_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
209 print "Generating matching FASTQ file"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
210 positive_writer = fastqWriter(open(out_positive_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
211 for record in reader:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
212 #The [1:] is because the fastaReader leaves the > on the identifier.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
213 if record.identifier and record.identifier.split()[0][1:] in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
214 positive_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
215 positive_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
216 elif out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
217 print "Generating non-matching FASTQ file"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
218 negative_writer = fastqWriter(open(out_negative_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
219 for record in reader:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
220 #The [1:] is because the fastaReader leaves the > on the identifier.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
221 if not record.identifier or record.identifier.split()[0][1:] not in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
222 negative_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
223 negative_writer.close()
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
224 reader.close()
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
225 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
226 stop_err("Unsupported file type %r" % seq_format)