annotate tools/ncbi_blast_plus/check_no_duplicates.py @ 14:2fe07f50a41e draft

Uploaded v0.1.01 - Requires blastdbd datatype (blast_datatypes v0.0.19). Support for makeprofiledb to create protein domain databases and use them in RPS-BLAST and RPS-TBLASTN. Tools now support GI and SeqID filters, and embed the citations.
author peterjc
date Mon, 01 Dec 2014 05:59:16 -0500
parents 4c4a0da938ff
children 3034ce97dd33
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
11
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
2 """Check for duplicate sequence identifiers in FASTA files.
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
3
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
4 This is run as a pre-check before makeblastdb, in order to avoid
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
5 a regression bug in BLAST+ 2.2.28 which fails to catch this. See:
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
6 http://blastedbio.blogspot.co.uk/2012/10/my-ids-not-good-enough-for-ncbi-blast.html
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
7
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
8 This script takes one or more FASTA filenames as input, and
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
9 will return a non-zero error if any duplicate identifiers
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
10 are found.
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
11 """
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
12 import sys
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
13 import os
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
14
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
15 if "-v" in sys.argv or "--version" in sys.argv:
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
16 print("v0.0.22")
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
17 sys.exit(0)
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
18
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
19 def stop_err(msg, error=1):
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
20 sys.stderr.write("%s\n" % msg)
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
21 sys.exit(error)
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
22
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
23
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
24 identifiers = set()
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
25 files = 0
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
26 for filename in sys.argv[1:]:
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
27 if not os.path.isfile(filename):
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
28 stop_err("Missing FASTA file %r" % filename, 2)
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
29 files += 1
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
30 handle = open(filename)
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
31 for line in handle:
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
32 if line.startswith(">"):
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
33 #The split will also take care of the new line character,
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
34 #e.g. ">test\n" and ">test description here\n" both give "test"
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
35 seq_id = line[1:].split(None, 1)[0]
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
36 if seq_id in identifiers:
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
37 handle.close()
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
38 stop_err("Repeated identifiers, e.g. %r" % seq_id, 1)
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
39 identifiers.add(seq_id)
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
40 handle.close()
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
41 if not files:
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
42 stop_err("No FASTA files given to check for duplicates", 3)
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
43 elif files == 1:
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
44 print("%i sequences" % len(identifiers))
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
45 else:
4c4a0da938ff Uploaded v0.0.22, now wraps BLAST+ 2.2.28 allowing extended tabular output to include the hit descriptions as column 25.
peterjc
parents:
diff changeset
46 print("%i sequences in %i FASTA files" % (len(identifiers), files))