Mercurial > repos > peterjc > fasta_filter_by_id
annotate tools/fasta_tools/fasta_filter_by_id.py @ 0:2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
author | peterjc |
---|---|
date | Tue, 07 Jun 2011 17:22:24 -0400 |
parents | |
children | 5cd569750e85 |
rev | line source |
---|---|
0
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
1 #!/usr/bin/env python |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
2 """Filter a FASTA file with IDs from a tabular file, e.g. from BLAST. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
3 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
4 Takes five command line options, tabular filename, ID column numbers |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
5 (comma separated list using one based counting), input FASTA filename, and |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
6 two output FASTA filenames (for records with and without any BLAST hits). |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
7 If the either output filename is just a minus sign, that file is not created. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
8 This is intended to allow output for just the matched (or just the non-matched) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
9 records. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
10 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
11 Note in the default NCBI BLAST+ tabular output, the query sequence ID is |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
12 in column one, and the ID of the match from the database is in column two. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
13 Here sensible values for the column numbers would therefore be "1" or "2". |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
14 """ |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
15 import sys |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
16 from galaxy_utils.sequence.fasta import fastaReader, fastaWriter |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
17 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
18 def stop_err( msg ): |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
19 sys.stderr.write( msg ) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
20 sys.exit() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
21 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
22 #Parse Command Line |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
23 try: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
24 tabular_file, cols_arg, in_file, out_positive_file, out_negative_file = sys.argv[1:] |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
25 except ValueError: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
26 stop_err("Expected five arguments, got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv))) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
27 try: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
28 columns = [int(arg)-1 for arg in cols_arg.split(",")] |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
29 except ValueError: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
30 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
31 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
32 #Read tabular file and record all specified identifiers |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
33 ids = set() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
34 handle = open(tabular_file, "rU") |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
35 if len(columns)>1: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
36 #General case of many columns |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
37 for line in handle: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
38 if line.startswith("#"): |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
39 #Ignore comments |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
40 continue |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
41 parts = line.rstrip("\n").split("\t") |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
42 for col in columns: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
43 ids.add(parts[col]) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
44 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns)) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
45 else: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
46 #Single column, special case speed up |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
47 col = columns[0] |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
48 for line in handle: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
49 if not line.startswith("#"): |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
50 ids.add(line.rstrip("\n").split("\t")[col]) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
51 print "Using %i IDs from tabular file" % (len(ids)) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
52 handle.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
53 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
54 #Write filtered FASTA file based on IDs from BLAST file |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
55 reader = fastaReader(open(in_file, "rU")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
56 if out_positive_file != "-" and out_negative_file != "-": |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
57 print "Generating two FASTA files" |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
58 positive_writer = fastaWriter(open(out_positive_file, "w")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
59 negative_writer = fastaWriter(open(out_negative_file, "w")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
60 for record in reader: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
61 #The [1:] is because the fastaReader leaves the > on the identifer. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
62 if record.identifier and record.identifier.split()[0][1:] in ids: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
63 positive_writer.write(record) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
64 else: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
65 negative_writer.write(record) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
66 positive_writer.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
67 negative_writer.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
68 elif out_positive_file != "-": |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
69 print "Generating matching FASTA file" |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
70 positive_writer = fastaWriter(open(out_positive_file, "w")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
71 for record in reader: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
72 #The [1:] is because the fastaReader leaves the > on the identifer. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
73 if record.identifier and record.identifier.split()[0][1:] in ids: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
74 positive_writer.write(record) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
75 positive_writer.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
76 elif out_negative_file != "-": |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
77 print "Generating non-matching FASTA file" |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
78 negative_writer = fastaWriter(open(out_negative_file, "w")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
79 for record in reader: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
80 #The [1:] is because the fastaReader leaves the > on the identifer. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
81 if not record.identifier or record.identifier.split()[0][1:] not in ids: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
82 negative_writer.write(record) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
83 negative_writer.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
84 else: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
85 stop_err("Neither output file requested") |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
86 reader.close() |