annotate tools/filters/seq_filter_by_id.py @ 2:abdd608c869b draft

Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
author peterjc
date Wed, 24 Apr 2013 11:34:12 -0400
parents 262f08104540
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
2 """Filter a FASTA, FASTQ or SSF file with IDs from a tabular file.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
3
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
4 Takes six command line options, tabular filename, ID column numbers (comma
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
5 separated list using one based counting), input filename, input type (e.g.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
6 FASTA or SFF) and two output filenames (for records with and without the
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
7 given IDs, same format as input sequence file).
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
8
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
9 If either output filename is just a minus sign, that file is not created.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
10 This is intended to allow output for just the matched (or just the non-matched)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
11 records.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
12
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
13 When filtering an SFF file, any Roche XML manifest in the input file is
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
14 preserved in both output files.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
15
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
16 Note in the default NCBI BLAST+ tabular output, the query sequence ID is
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
17 in column one, and the ID of the match from the database is in column two.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
18 Here sensible values for the column numbers would therefore be "1" or "2".
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
19
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
20 This tool is a short Python script which requires Biopython 1.54 or later
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
21 for SFF file support. If you use this tool in scientific work leading to a
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
22 publication, please cite the Biopython application note:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
23
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
24 Cock et al 2009. Biopython: freely available Python tools for computational
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
25 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
26 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
27
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
28 This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
29 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
30 See accompanying text file for licence details (MIT/BSD style).
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
31
2
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
32 This is version 0.0.5 of the script, use -v or --version to get the version.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
33 """
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
34 import sys
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
35
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
36 def stop_err(msg, err=1):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
37 sys.stderr.write(msg.rstrip() + "\n")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
38 sys.exit(err)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
39
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
40 if "-v" in sys.argv or "--version" in sys.argv:
2
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
41 print "v0.0.5"
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
42 sys.exit(0)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
43
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
44 #Parse Command Line
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
45 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
46 tabular_file, cols_arg, in_file, seq_format, out_positive_file, out_negative_file = sys.argv[1:]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
47 except ValueError:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
48 stop_err("Expected six arguments (tab file, columns, in seq, seq format, out pos, out neg), got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv)))
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
49 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
50 columns = [int(arg)-1 for arg in cols_arg.split(",")]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
51 except ValueError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
52 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg)
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
53 if min(columns) < 0:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
54 stop_err("Expect one-based column numbers (not zero-based counting), got %s" % cols_arg)
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
55
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
56 if out_positive_file == "-" and out_negative_file == "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
57 stop_err("Neither output file requested")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
58
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
59
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
60 #Read tabular file and record all specified identifiers
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
61 ids = set()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
62 handle = open(tabular_file, "rU")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
63 if len(columns)>1:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
64 #General case of many columns
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
65 for line in handle:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
66 if line.startswith("#"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
67 #Ignore comments
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
68 continue
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
69 parts = line.rstrip("\n").split("\t")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
70 for col in columns:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
71 ids.add(parts[col])
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
72 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
73 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
74 #Single column, special case speed up
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
75 col = columns[0]
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
76 for line in handle:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
77 if not line.startswith("#"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
78 ids.add(line.rstrip("\n").split("\t")[col])
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
79 print "Using %i IDs from tabular file" % (len(ids))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
80 handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
81
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
82
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
83 def crude_fasta_iterator(handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
84 """Yields tuples, record ID and the full record as a string."""
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
85 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
86 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
87 if line == "":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
88 return # Premature end of file, or just empty?
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
89 if line[0] == ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
90 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
91
2
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
92 no_id_warned = False
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
93 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
94 if line[0] != ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
95 raise ValueError(
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
96 "Records in Fasta files should start with '>' character")
2
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
97 try:
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
98 id = line[1:].split(None, 1)[0]
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
99 except IndexError:
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
100 if not no_id_warned:
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
101 sys.stderr.write("WARNING - Malformed FASTA entry with no identifier\n")
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
102 no_id_warned = True
abdd608c869b Uploaded v0.0.5, checked script return code for errors X copes with a FASTA entry missing an ID.
peterjc
parents: 1
diff changeset
103 id = None
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
104 lines = [line]
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
105 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
106 while True:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
107 if not line:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
108 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
109 if line[0] == ">":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
110 break
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
111 lines.append(line)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
112 line = handle.readline()
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
113 yield id, "".join(lines)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
114 if not line:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
115 return # StopIteration
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
116
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
117
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
118 def fasta_filter(in_file, pos_file, neg_file, wanted):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
119 """FASTA filter producing 60 character line wrapped outout."""
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
120 pos_count = neg_count = 0
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
121 #Galaxy now requires Python 2.5+ so can use with statements,
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
122 with open(in_file) as in_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
123 #Doing the if statement outside the loop for speed
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
124 #(with the downside of three very similar loops).
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
125 if pos_file != "-" and neg_file != "-":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
126 print "Generating two FASTA files"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
127 with open(pos_file, "w") as pos_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
128 with open(neg_file, "w") as neg_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
129 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
130 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
131 pos_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
132 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
133 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
134 neg_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
135 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
136 elif pos_file != "-":
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
137 print "Generating matching FASTA file"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
138 with open(pos_file, "w") as pos_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
139 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
140 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
141 pos_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
142 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
143 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
144 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
145 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
146 print "Generating non-matching FASTA file"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
147 assert neg_file != "-"
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
148 with open(neg_file, "w") as neg_handle:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
149 for identifier, record in crude_fasta_iterator(in_handle):
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
150 if identifier in wanted:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
151 pos_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
152 else:
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
153 neg_handle.write(record)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
154 neg_count += 1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
155 return pos_count, neg_count
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
156
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
157
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
158 if seq_format.lower()=="sff":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
159 #Now write filtered SFF file based on IDs from BLAST file
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
160 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
161 from Bio.SeqIO.SffIO import SffIterator, SffWriter
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
162 except ImportError:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
163 stop_err("SFF filtering requires Biopython 1.54 or later")
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
164
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
165 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
166 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
167 except ImportError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
168 #Prior to Biopython 1.56 this was a private function
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
169 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
170 in_handle = open(in_file, "rb") #must be binary mode!
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
171 try:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
172 manifest = ReadRocheXmlManifest(in_handle)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
173 except ValueError:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
174 manifest = None
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
175 #This makes two passes though the SFF file with isn't so efficient,
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
176 #but this makes the code simple.
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
177 pos_count = neg_count = 0
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
178 if out_positive_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
179 out_handle = open(out_positive_file, "wb")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
180 writer = SffWriter(out_handle, xml=manifest)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
181 in_handle.seek(0) #start again after getting manifest
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
182 pos_count = writer.write_file(rec for rec in SffIterator(in_handle) if rec.id in ids)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
183 out_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
184 if out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
185 out_handle = open(out_negative_file, "wb")
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
186 writer = SffWriter(out_handle, xml=manifest)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
187 in_handle.seek(0) #start again
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
188 neg_count = writer.write_file(rec for rec in SffIterator(in_handle) if rec.id not in ids)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
189 out_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
190 #And we're done
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
191 in_handle.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
192 #At the time of writing, Galaxy doesn't show SFF file read counts,
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
193 #so it is useful to put them in stdout and thus shown in job info.
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
194 print "%i with and %i without specified IDs" % (pos_count, neg_count)
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
195 elif seq_format.lower()=="fasta":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
196 #Write filtered FASTA file based on IDs from tabular file
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
197 pos_count, neg_count = fasta_filter(in_file, out_positive_file, out_negative_file, ids)
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
198 print "%i with and %i without specified IDs" % (pos_count, neg_count)
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
199 elif seq_format.lower().startswith("fastq"):
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
200 #Write filtered FASTQ file based on IDs from tabular file
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
201 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
202 reader = fastqReader(open(in_file, "rU"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
203 if out_positive_file != "-" and out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
204 print "Generating two FASTQ files"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
205 positive_writer = fastqWriter(open(out_positive_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
206 negative_writer = fastqWriter(open(out_negative_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
207 for record in reader:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
208 #The [1:] is because the fastaReader leaves the > on the identifier.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
209 if record.identifier and record.identifier.split()[0][1:] in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
210 positive_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
211 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
212 negative_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
213 positive_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
214 negative_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
215 elif out_positive_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
216 print "Generating matching FASTQ file"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
217 positive_writer = fastqWriter(open(out_positive_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
218 for record in reader:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
219 #The [1:] is because the fastaReader leaves the > on the identifier.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
220 if record.identifier and record.identifier.split()[0][1:] in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
221 positive_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
222 positive_writer.close()
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
223 elif out_negative_file != "-":
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
224 print "Generating non-matching FASTQ file"
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
225 negative_writer = fastqWriter(open(out_negative_file, "w"))
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
226 for record in reader:
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
227 #The [1:] is because the fastaReader leaves the > on the identifier.
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
228 if not record.identifier or record.identifier.split()[0][1:] not in ids:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
229 negative_writer.write(record)
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
230 negative_writer.close()
1
262f08104540 Uploaded v0.0.4 which includes a unit test and is faster at filtering FASTA files with large records (e.g. whole chromosomes)
peterjc
parents: 0
diff changeset
231 reader.close()
0
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
232 else:
5844f6a450ed Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
233 stop_err("Unsupported file type %r" % seq_format)