Mercurial > repos > peterjc > fastq_filter_by_id
annotate tools/fastq/fastq_filter_by_id.py @ 0:10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
author | peterjc |
---|---|
date | Tue, 07 Jun 2011 17:23:26 -0400 |
parents | |
children | d570cc324779 |
rev | line source |
---|---|
0
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
1 #!/usr/bin/env python |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
2 """Filter a FASTQ file with IDs from a tabular file, e.g. from BLAST. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
3 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
4 Takes five command line options, tabular filename, ID column numbers |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
5 (comma separated list using one based counting), input FASTA filename, and |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
6 two output FASTA filenames (for records with and without the given IDs). |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
7 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
8 If either output filename is just a minus sign, that file is not created. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
9 This is intended to allow output for just the matched (or just the non-matched) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
10 records. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
11 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
12 Note in the default NCBI BLAST+ tabular output, the query sequence ID is |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
13 in column one, and the ID of the match from the database is in column two. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
14 Here sensible values for the column numbers would therefore be "1" or "2". |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
15 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
16 This script is copyright 2010 by Peter Cock, SCRI, UK. All rights reserved. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
17 See accompanying text file for licence details (MIT/BSD style). |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
18 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
19 This is version 0.0.2 of the script. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
20 """ |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
21 import sys |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
22 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
23 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
24 def stop_err( msg ): |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
25 sys.stderr.write( msg ) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
26 sys.exit() |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
27 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
28 #Parse Command Line |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
29 try: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
30 tabular_file, cols_arg, in_file, out_positive_file, out_negative_file = sys.argv[1:] |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
31 except ValueError: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
32 stop_err("Expected five arguments, got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv))) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
33 try: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
34 columns = [int(arg)-1 for arg in cols_arg.split(",")] |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
35 except ValueError: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
36 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
37 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
38 #Read tabular file and record all specified identifiers |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
39 ids = set() |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
40 handle = open(tabular_file, "rU") |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
41 if len(columns)>1: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
42 #General case of many columns |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
43 for line in handle: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
44 if line.startswith("#"): |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
45 #Ignore comments |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
46 continue |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
47 parts = line.rstrip("\n").split("\t") |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
48 for col in columns: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
49 ids.add(parts[col]) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
50 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns)) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
51 else: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
52 #Single column, special case speed up |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
53 col = columns[0] |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
54 for line in handle: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
55 if not line.startswith("#"): |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
56 ids.add(line.rstrip("\n").split("\t")[col]) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
57 print "Using %i IDs from tabular file" % (len(ids)) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
58 handle.close() |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
59 |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
60 #Write filtered FASTQ file based on IDs from tabular file |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
61 reader = fastqReader(open(in_file, "rU")) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
62 if out_positive_file != "-" and out_negative_file != "-": |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
63 print "Generating two FASTQ files" |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
64 positive_writer = fastqWriter(open(out_positive_file, "w")) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
65 negative_writer = fastqWriter(open(out_negative_file, "w")) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
66 for record in reader: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
67 #The [1:] is because the fastaReader leaves the @ on the identifer. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
68 if record.identifier and record.identifier.split()[0][1:] in ids: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
69 positive_writer.write(record) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
70 else: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
71 negative_writer.write(record) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
72 positive_writer.close() |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
73 negative_writer.close() |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
74 elif out_positive_file != "-": |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
75 print "Generating matching FASTQ file" |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
76 positive_writer = fastqWriter(open(out_positive_file, "w")) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
77 for record in reader: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
78 #The [1:] is because the fastaReader leaves the @ on the identifer. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
79 if record.identifier and record.identifier.split()[0][1:] in ids: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
80 positive_writer.write(record) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
81 positive_writer.close() |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
82 elif out_negative_file != "-": |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
83 print "Generating non-matching FASTQ file" |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
84 negative_writer = fastqWriter(open(out_negative_file, "w")) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
85 for record in reader: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
86 #The [1:] is because the fastaReader leaves the @ on the identifer. |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
87 if not record.identifier or record.identifier.split()[0][1:] not in ids: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
88 negative_writer.write(record) |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
89 positive_writer.close() |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
90 negative_writer.close() |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
91 else: |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
92 stop_err("Neither output file requested") |
10e963c79a45
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
93 reader.close() |