ecoli_mlst
==========

`ecoli_mlst` is a script to determine MLST sequence types for *E. coli* genomes and extract allele sequences.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
    * [Mandatory options](#mandatory-options)
    * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

# Synopsis

    perl ecoli_mlst.pl -a fas -g fasta

# Description

The script searches for multilocus sequence type (MLST) alleles in *E. coli* genomes according to
Mark Achtman's scheme with seven house-keeping genes (*adk*, *fumC*, *gyrB*,
*icd*, *mdh*, *purA*, and *recA*) [Wirth et al., 2006]. *NUCmer* from the
[*MUMmer package*](http://mummer.sourceforge.net/) is used to compare the given allele
sequences to bacterial genomes via nucleotide alignments.
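
As a minimal sketch of the idea behind MLST typing: a sequence type (ST) is a label for one particular combination of allele numbers at the seven loci. The Python snippet below is not taken from `ecoli_mlst.pl`; the function name `lookup_st`, the ST labels, and the allele numbers in the example table are hypothetical and only illustrate the lookup.

```python
# Illustrative only: not part of ecoli_mlst.pl.
GENES = ("adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA")


def lookup_st(profile, st_table):
    """Return the ST for an allele profile, or None if incomplete or unknown.

    profile:  dict mapping gene name -> allele number, or '?' if not found
    st_table: dict mapping a 7-tuple of allele numbers (GENES order) -> ST label
    """
    key = tuple(profile.get(gene, "?") for gene in GENES)
    if "?" in key:
        # At least one allele is missing, so no ST can be assigned.
        return None
    return st_table.get(key)


# Hypothetical table entries; real profiles come from 'publicSTs.txt'.
example_table = {
    (1, 1, 1, 1, 1, 1, 1): "ST-A",
    (1, 2, 1, 1, 1, 1, 1): "ST-B",
}
complete = dict(zip(GENES, (1, 2, 1, 1, 1, 1, 1)))
incomplete = dict(complete, icd="?")

print(lookup_st(complete, example_table))
print(lookup_st(incomplete, example_table))
```

A single differing allele number gives a different ST, and a profile with a missing allele cannot be assigned an ST at all.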

Download the allele files (adk.fas ...) and the sequence type file
('publicSTs.txt') from this website:
http://mlst.ucc.ie/mlst/dbs/Ecoli

To run `ecoli_mlst.pl`, include all *E. coli* genome files (file
extension e.g. 'fasta'), all allele sequence files (file extension
'fas'), and 'publicSTs.txt' in the current working directory. The
allele profiles are parsed from the created \*.coord files and written
to a result file, together with additional information from the file
'publicSTs.txt'. In addition, the corresponding allele sequences (obtained
from the allele input files) are concatenated for each *E. coli* genome
into a result multi-fasta file. Option **-c** can be used to initiate
an alignment of this multi-fasta file with [*ClustalW*](http://www.clustal.org/clustal2/) (standard
alignment parameters; *ClustalW* has to be in the `$PATH`, or change the variable
`$clustal_call`). The alignment output file ('ecoli_mlst_seq_aln.fasta') can be used
directly for [*RAxML*](http://sco.h-its.org/exelixis/web/software/raxml/index.html). Be careful: the Phylip alignment format from
*ClustalW* allows only 10 characters per strain ID.
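
Because of the 10-character limit, strain IDs that differ only after the tenth character may become indistinguishable in the alignment. The following Python sketch flags such IDs before running with **-c**. It assumes the strain IDs have already been collected into a list; the helper name `find_phylip_id_problems` and the two 'UPEC_isolate' IDs are hypothetical.

```python
from collections import defaultdict

PHYLIP_ID_LENGTH = 10  # limit stated above for the ClustalW Phylip output


def find_phylip_id_problems(strain_ids, limit=PHYLIP_ID_LENGTH):
    """Return (too_long, collisions).

    too_long:   IDs longer than `limit` characters
    collisions: dict of truncated ID -> distinct original IDs sharing it
    """
    too_long = [sid for sid in strain_ids if len(sid) > limit]
    by_prefix = defaultdict(list)
    for sid in strain_ids:
        if sid not in by_prefix[sid[:limit]]:
            by_prefix[sid[:limit]].append(sid)
    collisions = {p: ids for p, ids in by_prefix.items() if len(ids) > 1}
    return too_long, collisions


too_long, collisions = find_phylip_id_problems(
    ["Sakai", "55989", "UPEC_isolate_01", "UPEC_isolate_02"]
)
print(too_long)
print(collisions)
```

Shortening the reported IDs in the genome fasta headers before the run avoids ambiguous labels in downstream trees.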

`ecoli_mlst.pl` works with complete and draft genomes. However, each input file must contain only a single genome; several genomes cannot be combined in one file!

Results can only be obtained for genomes whose allele sequences have been
deposited in Achtman's allele database. If an
allele is not found in a genome, it is marked by a '?' in the result
profile file and by the placeholder 'XXX' in the result fasta file. In
these cases, a manual *NUCmer* or *BLASTN* search might be useful to fill the
gaps, and [`run_sub_seq.pl`](/run_sub_seq) can be used to get the corresponding 'new' allele
sequences.
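
To find the genomes that need such manual follow-up, the '?' marks in the profile file can be counted per row. The Python sketch below assumes only that the file is tab-separated with the genome ID in the first column; the exact column layout is not specified here, and the function name and example rows are hypothetical.

```python
def rows_with_missing_alleles(lines, missing="?"):
    """Map the first column of each tab-separated row to its count of missing fields.

    Rows without any missing field (including a possible header row) are skipped.
    """
    result = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        n_missing = fields[1:].count(missing)
        if n_missing:
            result[fields[0]] = n_missing
    return result


# Hypothetical rows; the real column layout of the profile file may differ.
example = [
    "genome_A\t1\t2\t3\t4\t5\t6\t7\n",
    "genome_B\t1\t?\t3\t4\t?\t6\t7\n",
]
print(rows_with_missing_alleles(example))
```

Since the function accepts any iterable of lines, an open file handle for 'ecoli_mlst_profile.txt' can be passed in directly.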

Non-NCBI fasta headers in the genome files must have a
unique ID directly following the '>' (e.g. 'Sakai', '55989' ...).
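
A minimal Python sketch for checking this requirement up front, assuming one header line per genome has already been collected. It treats the ID as the text between '>' and the first whitespace, which is an assumption about how the ID is delimited; the function names are hypothetical and NCBI-style headers are not handled.

```python
from collections import Counter


def header_id(header_line):
    """Return the token directly following '>' (up to the first whitespace), or ''."""
    if not header_line.startswith(">"):
        raise ValueError("not a fasta header: %r" % header_line)
    tokens = header_line[1:].split()
    # Whitespace right after '>' means no ID directly follows it.
    if not tokens or header_line[1:2].isspace():
        return ""
    return tokens[0]


def check_unique_ids(header_lines):
    """Return (missing, duplicated): headers without an ID, IDs used more than once."""
    ids = [header_id(h) for h in header_lines]
    missing = [h for h, i in zip(header_lines, ids) if not i]
    counts = Counter(i for i in ids if i)
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicated


headers = [">Sakai chromosome", ">55989", "> no_id_here", ">Sakai second file"]
print(check_unique_ids(headers))
```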

# Usage

    perl ecoli_mlst.pl -a fas -g fasta -c
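
Before starting a run, it can help to verify that the working directory contains everything the script expects. The Python sketch below is a hypothetical pre-flight check, not part of `ecoli_mlst.pl`. It assumes that the allele files are named after the seven genes (as in 'adk.fas') and reads the 9 MB genome size cutoff mentioned in the changelog as 9 MiB; both are assumptions.

```python
import os

GENES = ("adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA")
MAX_GENOME_BYTES = 9 * 1024 * 1024  # assumption: "9 MB" cutoff read as 9 MiB


def preflight(directory=".", allele_ext="fas", genome_ext="fasta"):
    """Return a list of human-readable problems found in `directory`."""
    problems = []
    present = set(os.listdir(directory))
    # Assumed naming: one allele file per gene, plus the sequence type file.
    expected = [f"{gene}.{allele_ext}" for gene in GENES] + ["publicSTs.txt"]
    for name in expected:
        if name not in present:
            problems.append(f"missing file: {name}")
    genomes = sorted(f for f in present if f.endswith("." + genome_ext))
    if not genomes:
        problems.append(f"no genome files with extension '{genome_ext}'")
    for name in genomes:
        if os.path.getsize(os.path.join(directory, name)) > MAX_GENOME_BYTES:
            problems.append(f"genome file larger than cutoff: {name}")
    return problems


for problem in preflight():
    print(problem)
```

The `allele_ext` and `genome_ext` arguments mirror the values passed to **-a** and **-g**; an empty result means no problem was detected by these checks.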

# Options

## Mandatory options

- -a, -alleles

    File extension of the MLST allele fasta files, e.g. 'fas' (<=> **-g**).

- -g, -genomes

    File extension of the *E. coli* genome fasta files, e.g. 'fasta' (<=> **-a**).

## Optional options

- -h, -help

    Help (perldoc POD)

- -c, -clustalw

    Call [*ClustalW*](http://www.clustal.org/clustal2/) for alignment

# Output

- ecoli_mlst_profile.txt

    Tab-separated allele profiles for the *E. coli* genomes, plus additional info from 'publicSTs.txt'

- ecoli_mlst_seq.fasta

    Multi-fasta file of all concatenated allele sequences for each genome

- \*.coord

    Text files that contain the coordinates of the *NUCmer* hits for each genome and allele

- (errors.txt)

    Error file, summarizing the number of alleles not found or unclear *NUCmer* hits

- (ecoli_mlst_seq_aln.fasta)

    Optional, [*ClustalW*](http://www.clustal.org/clustal2/) alignment in Phylip format

- (ecoli_mlst_seq_aln.dnd)

    Optional, *ClustalW* alignment guide tree

## Run environment

The Perl script runs only under UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.3 (30.01.2013)
    - additional info in POD
    - check if result files already exist and ask user what to do
    - changed script name from `ecoli_mlst_alleles.pl` to `ecoli_mlst.pl`
* v0.2 (20.10.2012)
    - included a POD
    - options with Getopt::Long
    - don't consider input *E. coli* genome query files, which are too big (set cutoff at 9 MB for a fasta *E. coli* file)
    - draft *E. coli* genomes can now be used as input query files
    - additional info in 'publicSTs.txt' now associated to found ST types in output
    - give text to STDOUT which files were created
    - new option **-c** to align the resulting allele sequences via *ClustalW*
* v0.1 (25.10.2011)