annotate tools/fastq/fastq_filter_by_id.py @ 0:10e963c79a45

Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
author peterjc
date Tue, 07 Jun 2011 17:23:26 -0400
parents
children d570cc324779
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
2 """Filter a FASTQ file with IDs from a tabular file, e.g. from BLAST.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
3
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
4 Takes five command line options, tabular filename, ID column numbers
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
5 (comma separated list using one based counting), input FASTA filename, and
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
6 two output FASTA filenames (for records with and without the given IDs).
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
7
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
8 If either output filename is just a minus sign, that file is not created.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
9 This is intended to allow output for just the matched (or just the non-matched)
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
10 records.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
11
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
12 Note in the default NCBI BLAST+ tabular output, the query sequence ID is
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
13 in column one, and the ID of the match from the database is in column two.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
14 Here sensible values for the column numbers would therefore be "1" or "2".
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
15
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
16 This script is copyright 2010 by Peter Cock, SCRI, UK. All rights reserved.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
17 See accompanying text file for licence details (MIT/BSD style).
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
18
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
19 This is version 0.0.2 of the script.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
20 """
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
21 import sys
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
22 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
23
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
24 def stop_err( msg ):
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
25 sys.stderr.write( msg )
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
26 sys.exit()
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
27
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
28 #Parse Command Line
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
29 try:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
30 tabular_file, cols_arg, in_file, out_positive_file, out_negative_file = sys.argv[1:]
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
31 except ValueError:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
32 stop_err("Expected five arguments, got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv)))
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
33 try:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
34 columns = [int(arg)-1 for arg in cols_arg.split(",")]
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
35 except ValueError:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
36 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg)
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
37
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
38 #Read tabular file and record all specified identifiers
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
39 ids = set()
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
40 handle = open(tabular_file, "rU")
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
41 if len(columns)>1:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
42 #General case of many columns
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
43 for line in handle:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
44 if line.startswith("#"):
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
45 #Ignore comments
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
46 continue
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
47 parts = line.rstrip("\n").split("\t")
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
48 for col in columns:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
49 ids.add(parts[col])
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
50 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns))
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
51 else:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
52 #Single column, special case speed up
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
53 col = columns[0]
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
54 for line in handle:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
55 if not line.startswith("#"):
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
56 ids.add(line.rstrip("\n").split("\t")[col])
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
57 print "Using %i IDs from tabular file" % (len(ids))
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
58 handle.close()
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
59
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
60 #Write filtered FASTQ file based on IDs from tabular file
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
61 reader = fastqReader(open(in_file, "rU"))
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
62 if out_positive_file != "-" and out_negative_file != "-":
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
63 print "Generating two FASTQ files"
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
64 positive_writer = fastqWriter(open(out_positive_file, "w"))
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
65 negative_writer = fastqWriter(open(out_negative_file, "w"))
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
66 for record in reader:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
67 #The [1:] is because the fastaReader leaves the @ on the identifer.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
68 if record.identifier and record.identifier.split()[0][1:] in ids:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
69 positive_writer.write(record)
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
70 else:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
71 negative_writer.write(record)
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
72 positive_writer.close()
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
73 negative_writer.close()
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
74 elif out_positive_file != "-":
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
75 print "Generating matching FASTQ file"
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
76 positive_writer = fastqWriter(open(out_positive_file, "w"))
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
77 for record in reader:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
78 #The [1:] is because the fastaReader leaves the @ on the identifer.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
79 if record.identifier and record.identifier.split()[0][1:] in ids:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
80 positive_writer.write(record)
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
81 positive_writer.close()
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
82 elif out_negative_file != "-":
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
83 print "Generating non-matching FASTQ file"
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
84 negative_writer = fastqWriter(open(out_negative_file, "w"))
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
85 for record in reader:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
86 #The [1:] is because the fastaReader leaves the @ on the identifer.
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
87 if not record.identifier or record.identifier.split()[0][1:] not in ids:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
88 negative_writer.write(record)
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
89 positive_writer.close()
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
90 negative_writer.close()
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
91 else:
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
92 stop_err("Neither output file requested")
10e963c79a45 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
93 reader.close()