annotate linelisting.py @ 0:be856549e863 draft default tip

planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
author public-health-bioinformatics
date Thu, 04 Jul 2019 19:37:41 -0400
parents
children
rev   line source
0
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
#!/usr/bin/env python
'''Reads in a fasta file of extracted antigenic sites and one containing a
reference flu antigenic map, reading them in as protein SeqRecords. Compares each amino
acid of each sample antigenic map to corresponding sites in the reference and replaces
identical amino acids with dots. Writes headers (including amino acid position numbers
read in from the respective index array), the reference amino acid sequence and column
headings required for non-aggregated line lists. Outputs headers and modified (i.e. dotted)
sequences to a csv file.'''

'''Author: Diane Eisler, Molecular Microbiology & Genomics, BCCDC Public Health Laboratory, Nov 2017'''

#note: Bio.Alphabet (and IUPAC) was removed in Biopython 1.78, so these imports need an older Biopython
import sys, string, os, time, Bio, re, argparse
from Bio import Seq, SeqIO, SeqUtils, Alphabet, SeqRecord
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq

inputAntigenicMaps = sys.argv[1] #batch fasta file with antigenic map sequences
refAntigenicMap = sys.argv[2] #fasta file of reference antigenic map sequence
antigenicSiteIndexArray = sys.argv[3] #antigenic site index array csv file
cladeDefinitionFile = sys.argv[4] #clade definition csv file
outFileHandle = sys.argv[5] #user-specified output filename (a path string, despite the variable name)

lineListFile = open(outFileHandle,'w') #open a writable output file

indicesLine = "" #comma-separated antigenic site positions
cladeList = [] #list of clade names read from clade definition file
ref_seq = "" #reference antigenic map (protein sequence)
seqList = [] #list of aa sequences to compare to reference

BC_list = [] #empty list for BC samples
AB_list = [] #empty list for AB samples
ON_list = [] #empty list for ON samples
QC_list = [] #empty list for QC samples
nonprov_list = [] #empty list for samples not in above 4 provinces
#dictionary for location-separated sequence lists
segregated_lists = {'1_BC': BC_list, '2_AB': AB_list, '3_ON': ON_list, '4_QC': QC_list, '5_nonprov': nonprov_list}
uniqueSeqs = {} #empty dict with unique seqs as keys and lists of SeqRecords as values

def replace_matching_aa_with_dot(record):
    """Compare amino acids in record to reference sequence, replace matching symbols
    with dots, and return record with modified amino acid sequence."""
    orig_seq = str(record.seq) #get sequence string from SeqRecord
    mod_seq = ""
    #replace only those aa's matching the reference with dots
    #(ref_seq is the module-level reference and must be at least as long as orig_seq)
    for i in range(len(orig_seq)):
        if orig_seq[i] == ref_seq[i]:
            mod_seq = mod_seq + '.'
        else:
            mod_seq = mod_seq + orig_seq[i]
    #assign modified sequence to new SeqRecord and return it
    rec = SeqRecord(Seq(mod_seq, IUPAC.protein), id=record.id, name="", description="")
    return rec

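# Illustrative sketch, not part of the original script: the dotting step described above,
# shown on plain strings so that it runs without Biopython. The name dot_matches and the
# sample sequences are hypothetical. Unlike the function above, this version takes the
# reference as an argument and, as an added assumption, rejects inputs of unequal length.

```python
def dot_matches(sample, reference):
    """Return sample with every residue identical to reference replaced by '.'."""
    if len(sample) != len(reference):
        raise ValueError("sample and reference must have the same length")
    # Walk both sequences in step and keep only the residues that differ.
    return "".join("." if s == r else s for s, r in zip(sample, reference))


# Hypothetical 6-residue antigenic maps that differ only at the fourth site.
dotted = dot_matches("ACDEFG", "ACDKFG")
```

# Here dotted keeps only the substituted residue, which is what makes a line list of
# many near-identical sequences easy to scan.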
def extract_clade(record):
    """Extract clade name (or 'No_Match') from the end of the sequence name and return it.
    Also strips the clade name and the one-character separator before it from record.id.
    Returns None if the name ends with neither 'No_Match' nor a clade in cladeList."""
    if record.id.endswith('No_Match'):
        clade_name = 'No_Match'
        end_index = record.id.rindex(clade_name) #start of the trailing clade name
        record.id = record.id[:end_index - 1]
        return clade_name
    else:
        for clade in cladeList:
            if record.id.endswith(clade):
                clade_name = clade
                #rindex, not index: an earlier occurrence of the clade text must not truncate the id
                end_index = record.id.rindex(clade)
                record.id = record.id[:end_index - 1]
                return clade_name
    return None

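# Illustrative sketch, not part of the original script: splitting a trailing clade label
# off a sequence name without mutating a SeqRecord. The function name split_clade_suffix,
# the sample names and the clade names are hypothetical, and a single separator character
# before the clade label is assumed. Trying longer names first is an added design choice,
# so that a short clade name cannot shadow a longer one that ends the same way.

```python
def split_clade_suffix(seq_id, clade_names):
    """Return (sample_name, clade) if seq_id ends with a known clade, else (seq_id, None)."""
    candidates = sorted(list(clade_names) + ["No_Match"], key=len, reverse=True)
    for clade in candidates:
        if seq_id.endswith(clade):
            # Cut off the clade label plus the one separator character before it.
            cut = max(len(seq_id) - len(clade) - 1, 0)
            return seq_id[:cut], clade
    return seq_id, None


# Hypothetical clade definitions and sequence name.
name, clade = split_clade_suffix("A/Alberta/01/2017_3C.2a1", ["2a1", "3C.2a1"])
```

# Because the cut position is computed from the end of the name, a clade label that
# also appears earlier in the name does not affect where the name is truncated.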
def sort_by_location(record):
    """Search sequence name for province name or 2 letter province code and add SeqRecord to
    the matching province-specific list (or to nonprov_list if no province matches)."""
    seq_name = record.id
    if ('-BC-' in seq_name) or ('/British_Columbia/' in seq_name):
        BC_list.append(record) #add Sequence record to BC_list
    elif ('-AB-' in seq_name) or ('/Alberta/' in seq_name):
        AB_list.append(record) #add Sequence record to AB_list
    elif ('-ON-' in seq_name) or ('/Ontario/' in seq_name):
        ON_list.append(record) #add Sequence record to ON_list
    elif ('-QC-' in seq_name) or ('/Quebec/' in seq_name):
        QC_list.append(record) #add Sequence record to QC_list
    else:
        nonprov_list.append(record) #add Sequence record to nonprov_list
    return

def extract_province(record):
    """Search sequence name for province name or 2 letter province code and return province."""
    seq_name = record.id
    if ('-BC-' in seq_name) or ('/British_Columbia/' in seq_name):
        province = 'British Columbia'
    elif ('-AB-' in seq_name) or ('/Alberta/' in seq_name):
        province = 'Alberta' #slashes belong in the search pattern, not in the returned name
    elif ('-ON-' in seq_name) or ('/Ontario/' in seq_name):
        province = 'Ontario'
    elif ('-QC-' in seq_name) or ('/Quebec/' in seq_name):
        province = 'Quebec'
    else:
        province = "other"
    return province

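# Illustrative sketch, not part of the original script: the same province search written
# as a table lookup, so that the search patterns and the returned names sit side by side
# and cannot drift apart between branches. PROVINCE_MARKERS, lookup_province and the
# sample names are hypothetical; the patterns are the ones used in the functions above.

```python
# Ordered list of (province name, substrings that identify it in a sequence name).
PROVINCE_MARKERS = [
    ("British Columbia", ("-BC-", "/British_Columbia/")),
    ("Alberta", ("-AB-", "/Alberta/")),
    ("Ontario", ("-ON-", "/Ontario/")),
    ("Quebec", ("-QC-", "/Quebec/")),
]


def lookup_province(seq_name):
    """Return the first province whose marker occurs in seq_name, else 'other'."""
    for province, markers in PROVINCE_MARKERS:
        if any(marker in seq_name for marker in markers):
            return province
    return "other"


# Hypothetical sequence name carrying a two-letter province code.
province = lookup_province("flu-QC-0042")
```

# The list order mirrors the if/elif order above, so a name matching two provinces
# resolves the same way in both versions.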
def get_sequence_length(record):
    """Return the length of a sequence in a Sequence record."""
    sequenceLength = len(str(record.seq))
    return sequenceLength

def get_antigenic_site_substitutions(record):
    """Count number of non-dotted amino acids in SeqRecord sequence and return as substitutions."""
    sequenceLength = get_sequence_length(record)
    seqString = str(record.seq)
    matches = seqString.count('.')
    substitutions = sequenceLength - matches
    return substitutions

def calculate_percent_id(record, substitutions):
    """Calculate sequence identity to a reference (based on substitutions and sequence length)
    and return it as a fraction between 0 and 1 (not multiplied by 100)."""
    sequenceLength = get_sequence_length(record)
    percentID = (1.00 - (float(substitutions)/float(sequenceLength)))
    return percentID

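# Illustrative sketch, not part of the original script: the substitution count and the
# identity calculation above, worked through on a plain dotted string. The function names
# and the sample sequence are hypothetical, and the check for an empty sequence is an
# added assumption (the function above would divide by zero in that case).

```python
def count_substitutions(dotted):
    """Number of positions in a dotted sequence that are not '.'."""
    return len(dotted) - dotted.count(".")


def identity_fraction(dotted):
    """Fraction of positions (0 to 1) that match the reference."""
    if not dotted:
        raise ValueError("cannot compute identity for an empty sequence")
    return 1.0 - float(count_substitutions(dotted)) / float(len(dotted))


# Hypothetical 4-site antigenic map with one substitution: 1 - 1/4 = 0.75.
subs = count_substitutions("..K.")
frac = identity_fraction("..K.")
```

# Multiply the fraction by 100 when a percentage is wanted in the output.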
def output_linelist(sequenceList):
    """Output a list of SeqRecords to a non-aggregated line list in csv format."""
    for record in sequenceList:
        #get province, clade from sequence record
        province = extract_province(record)
        clade = extract_clade(record)
        #calculate number of substitutions and % id to reference
        substitutions = get_antigenic_site_substitutions(record)
        percentID = calculate_percent_id(record, substitutions)
public-health-bioinformatics
parents:
diff changeset
129 name_part = (record.id).rstrip() + ','
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
130 clade_part = clade + ','
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
131 substitutions_part = str(substitutions) + ','
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
132 percID_part = str(percentID) + ','
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
133 col = " ," #empty column
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
134 sequence = str(record.seq).strip()
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
135 csv_seq = ",".join(sequence) +","
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
136 #write linelisted antigenic maps to csv file
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
137 comma_sep_line = name_part + col + clade_part + col + csv_seq + substitutions_part + percID_part + "\n"
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
138 lineListFile.write(comma_sep_line)
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
139 return
be856549e863 planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
140
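# The row layout written by output_linelist (name, blank, clade, blank, one
# residue per column, substitution count, identity) can also be produced with
# the standard csv module. This is a sketch, not the script's method:
# build_linelist_row is a hypothetical helper, it uses empty strings where the
# script writes a single space, and it omits the script's trailing comma.

```python
import csv
import io


def build_linelist_row(name, clade, masked_seq, substitutions, identity):
    """Return one CSV line for a sample, with one residue per column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Empty strings stand in for the blank 'N' and 'Extra Substitutions' columns.
    row = [name, "", clade, ""] + list(masked_seq) + [substitutions, identity]
    writer.writerow(row)
    return buffer.getvalue()


# Example values are made up for illustration.
line = build_linelist_row("A/BC/1/2019", "3C.2a", "..K.", 1, 0.75)
```

# csv.writer also quotes any field that happens to contain a comma, which
# plain string concatenation does not do.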
with open (antigenicSiteIndexArray,'r') as siteIndices:
    """Read amino acid positions from antigenic site index array and print as header after one empty row."""
    col = "," #empty column
    #read items separated by commas to position list
    for line in siteIndices:
        #remove whitespace from the end of each line
        indicesLine = line.rstrip()
        row1 = "\n"
        #add comma-separated AA positions to header line
        row2 = col + col + col + col + indicesLine + "\n"
        #write first (empty) and 2nd (amino acid position) lines to linelist output file
        lineListFile.write(row1)
        lineListFile.write(row2)

with open (refAntigenicMap,'r') as refMapFile:
    """Read reference antigenic map from fasta and output amino acids, followed by column headers."""
    #read sequences from fasta to SeqRecord, uppercase, and store sequence string to ref_seq
    record = SeqIO.read(refMapFile,"fasta",alphabet=IUPAC.protein)
    record = record.upper()
    ref_seq = str(record.seq).strip() #store sequence in variable for comparison to sample seqs
    col = "," #empty column
    name_part = (record.id).rstrip() + ','
    sequence = str(record.seq).strip()
    csv_seq = ",".join(sequence)
    #output row with reference sequence displayed above sample sequences
    row3 = name_part + col + col + col + csv_seq + "\n"
    lineListFile.write(row3)
    #build one empty column per amino acid position so the trailing headers line up
    positions = indicesLine.split(',')
    numPos = len(positions)
    empty_indicesLine = ',' * numPos
    #print column headers for sample sequences
    row4 = "Sequence Name,N,Clade,Extra Substitutions," + empty_indicesLine + "Number of Amino Acid Substitutions in Antigenic Sites,% Identity of Antigenic Site Residues\n"
    lineListFile.write(row4)
    print("\nREFERENCE ANTIGENIC MAP: '%s' (%i amino acids)" % (record.id, len(record)))

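# The position row and the column-header row must line up: four fixed columns,
# one column per antigenic site position, then two summary columns. The sketch
# below rebuilds both rows from a comma-separated index line.
# build_header_rows and the two constants are hypothetical names introduced
# for illustration. The header text is copied from row4 above.

```python
FIXED_HEADERS = ["Sequence Name", "N", "Clade", "Extra Substitutions"]
SUMMARY_HEADERS = [
    "Number of Amino Acid Substitutions in Antigenic Sites",
    "% Identity of Antigenic Site Residues",
]


def build_header_rows(indices_line):
    """Return (position_row, column_row) for a comma-separated index line."""
    positions = indices_line.rstrip().split(",")
    # Four leading empty columns, then the amino acid positions.
    position_row = "," * len(FIXED_HEADERS) + ",".join(positions) + "\n"
    # Fixed headers, one blank header per position, then the summary headers.
    column_row = ",".join(FIXED_HEADERS + [""] * len(positions) + SUMMARY_HEADERS) + "\n"
    return position_row, column_row


# Example positions are made up for illustration.
position_row, column_row = build_header_rows("122,124,126\n")
```

# Both rows have the same number of leading columns, so each position sits
# directly above the residue column it labels.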
with open(cladeDefinitionFile,'r') as cladeFile:
    """Read clade definition file and store clade names in a list."""
    #remove whitespace from the end of each line and split elements at commas
    for line in cladeFile:
        elementList = line.rstrip().split(',')
        name = elementList[0] #move 1st element to name field
        cladeList.append(name)

with open(inputAntigenicMaps,'r') as extrAntigMapFile:
    """Read antigenic maps as protein SeqRecords and add to list."""
    #read Sequences from fasta file, uppercase and add to seqList
    for record in SeqIO.parse(extrAntigMapFile, "fasta", alphabet=IUPAC.protein):
        record = record.upper()
        seqList.append(record) #add Seq to list of Sequences

#print number of sequences to be processed as user check
print("\nCOMPARING %i flu antigenic map sequences to the reference..." % len(seqList))
#parse each antigenic map sequence object
for record in seqList:
    #assign Sequence to dictionaries according to location in name
    sort_by_location(record)
#sort dictionary keys that access province-segregated lists
sorted_segregated_list_keys = sorted(segregated_lists.keys())
print("\nSequence Lists Sorted by Province: ")
#process each province-segregated SeqRecord list
for listname in sorted_segregated_list_keys:
    #access list of sequences by the listname key
    a_list = segregated_lists[listname]
    # sort original SeqRecords by record id (i.e. name)
    a_list = [f for f in sorted(a_list, key = lambda x : x.id)]
    mod_list = [] # empty temporary list
    for record in a_list:
        #replace matching amino acid symbols with dots
        rec = replace_matching_aa_with_dot(record)
        mod_list.append(rec) #populate a list of modified records
    segregated_lists[listname] = mod_list
    print("\n'%s' List (Amino Acids identical to Reference Masked): " % (listname))
    #output the list to csv as non-aggregated linelist
    output_linelist(segregated_lists[listname])

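# The masking step replaces every residue that matches the reference with a
# dot, so only substitutions remain visible in the line list. The body of
# replace_matching_aa_with_dot is not shown in this section, so the sketch
# below is an assumed equivalent that works on plain strings. mask_identical
# is a hypothetical name, and the length check is an added assumption.

```python
def mask_identical(sample_seq, ref_seq, mask_char="."):
    """Return sample_seq with residues identical to ref_seq replaced by mask_char."""
    if len(sample_seq) != len(ref_seq):
        # Assumption: antigenic maps are extracted at the same positions,
        # so differing lengths indicate bad input.
        raise ValueError("sample and reference must be the same length")
    return "".join(
        mask_char if sample_aa == ref_aa else sample_aa
        for sample_aa, ref_aa in zip(sample_seq, ref_seq)
    )


# Made-up four-residue example: only the third position differs.
masked = mask_identical("NKSE", "NKTE")
```

# Counting the characters in the masked string that are not dots then gives
# the number of substitutions relative to the reference.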
extrAntigMapFile.close()
refMapFile.close()
lineListFile.close()