comparison COG/bac-genomics-scripts/ecoli_mlst/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
comparison 2:97e4e3e818b6 3:e42d30da7a74
ecoli_mlst
==========

`ecoli_mlst` is a script to determine MLST sequence types for *E. coli* genomes and extract allele sequences.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
    * [Mandatory options](#mandatory-options)
    * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

# Synopsis

    perl ecoli_mlst.pl -a fas -g fasta

# Description

The script searches for multilocus sequence type (MLST) alleles in *E. coli* genomes according to
Mark Achtman's scheme with seven house-keeping genes (*adk*, *fumC*, *gyrB*,
*icd*, *mdh*, *purA*, and *recA*) [Wirth et al., 2006]. *NUCmer* from the
[*MUMmer package*](http://mummer.sourceforge.net/) is used to compare the given allele
sequences to bacterial genomes via nucleotide alignments.

Download the allele files (adk.fas ...) and the sequence type file
('publicSTs.txt') from this website:
http://mlst.ucc.ie/mlst/dbs/Ecoli

To run `ecoli_mlst.pl`, place all *E. coli* genome files (file
extension e.g. 'fasta'), all allele sequence files (file extension
'fas'), and 'publicSTs.txt' in the current working directory. The
allele profiles are parsed from the created \*.coord files and written
to a result file, together with additional information from
'publicSTs.txt'. The corresponding allele sequences (obtained
from the allele input files) are also concatenated for each *E. coli* genome
into a result multi-fasta file. Option **-c** starts
an alignment of this multi-fasta file with [*ClustalW*](http://www.clustal.org/clustal2/) (standard
alignment parameters; *ClustalW* has to be in the `$PATH`, or change the variable
`$clustal_call`). The alignment output file ('ecoli_mlst_seq_aln.fasta') can be used
directly for [*RAxML*](http://sco.h-its.org/exelixis/web/software/raxml/index.html). Careful: the Phylip alignment format from
*ClustalW* allows only 10 characters per strain ID.

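As a pre-flight check for the setup described above, the following shell sketch verifies that the working directory holds the expected inputs before the script is started. `check_mlst_inputs` is a hypothetical helper and not part of `ecoli_mlst.pl`. It assumes that the allele files are named after the seven loci (`adk.fas`, `fumC.fas`, ...), which this README only confirms for `adk.fas`. It could be used as `check_mlst_inputs fas fasta && perl ecoli_mlst.pl -a fas -g fasta`.

```shell
# Hypothetical helper: check the working directory before running ecoli_mlst.pl.
# Usage: check_mlst_inputs ALLELE_EXT GENOME_EXT
check_mlst_inputs() {
    allele_ext=$1
    genome_ext=$2
    rc=0

    # One allele file per locus; file names other than adk.* are an assumption.
    for gene in adk fumC gyrB icd mdh purA recA; do
        if [ ! -s "${gene}.${allele_ext}" ]; then
            echo "missing or empty allele file: ${gene}.${allele_ext}" >&2
            rc=1
        fi
    done

    # Sequence type table downloaded from the MLST database.
    if [ ! -s publicSTs.txt ]; then
        echo "missing or empty publicSTs.txt" >&2
        rc=1
    fi

    # At least one genome file with the given extension must exist.
    genome_found=0
    for genome in ./*."${genome_ext}"; do
        if [ -f "$genome" ]; then
            genome_found=1
            break
        fi
    done
    if [ "$genome_found" -eq 0 ]; then
        echo "no genome files with extension .${genome_ext}" >&2
        rc=1
    fi

    return "$rc"
}
```
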
`ecoli_mlst.pl` works with complete and draft genomes. However, a single input file must not contain several genomes!

Results can only be obtained for genomes whose allele sequences have been
deposited in Achtman's allele database. If an
allele is not found in a genome, it is marked by a '?' in the result
profile file and by the placeholder 'XXX' in the result fasta file. In
these cases a manual *NUCmer* or *BLASTN* search might be useful to fill the
gaps, and [`run_sub_seq.pl`](/run_sub_seq) can be used to get the corresponding 'new' allele
sequences.

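To see which genomes need this manual follow-up, the result files can be scanned for the two markers. This is a minimal sketch with hypothetical helper names; it assumes that a missing allele appears as a tab-separated field containing only '?' and that 'XXX' occurs on sequence lines of the result fasta file. The column layout of the profile file is not assumed. Possible calls are `rows_with_missing_alleles ecoli_mlst_profile.txt` and `count_placeholder_records ecoli_mlst_seq.fasta`.

```shell
# Hypothetical helper: print every row of a tab-separated profile file
# in which at least one field is exactly '?'.
rows_with_missing_alleles() {
    awk -F '\t' '{ for (i = 1; i <= NF; i++) if ($i == "?") { print; next } }' "$1"
}

# Hypothetical helper: count the fasta records whose sequence lines
# contain the placeholder 'XXX'.
count_placeholder_records() {
    awk '/^>/  { if (hit) n++; hit = 0; next }
         /XXX/ { hit = 1 }
         END   { if (hit) n++; print n + 0 }' "$1"
}
```
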
Non-NCBI fasta headers in the genome files must have a
unique ID directly following the '>' (e.g. 'Sakai', '55989' ...).

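The header requirement can be checked before a run. The sketch below is an illustration under two assumptions: the ID is the first whitespace-delimited token after the '>', and only the first header of each genome file is compared. Both helper names are hypothetical.

```shell
# Hypothetical helper: print "ID<TAB>file" for the first header of each fasta file.
first_header_ids() {
    for f in "$@"; do
        awk '/^>/ { print substr($1, 2) "\t" FILENAME; exit }' "$f"
    done
}

# Hypothetical helper: report IDs that are empty (e.g. a space after '>')
# or that occur in more than one file. Returns 1 if any problem is found.
report_bad_ids() {
    first_header_ids "$@" | awk -F '\t' '
        $1 == ""               { print "empty ID in " $2; bad = 1 }
        $1 != "" && seen[$1]++ { print "duplicate ID " $1 " in " $2; bad = 1 }
        END                    { exit bad + 0 }'
}
```

Called as `report_bad_ids *.fasta` in the working directory, it prints nothing and returns 0 when all IDs are usable.
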
# Usage

    perl ecoli_mlst.pl -a fas -g fasta -c

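Because the run depends on external programs (*NUCmer*, and *ClustalW* with **-c**), a short check before the call can avoid a failed run. This is a minimal sketch; `missing_tools` is a hypothetical helper. `nucmer` is the executable from the *MUMmer* package, while the name of the *ClustalW* executable (often `clustalw` or `clustalw2`) depends on the installation and on `$clustal_call`, so it is an assumption here.

```shell
# Hypothetical helper: print every given program name that is not found in $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `[ -z "$(missing_tools perl nucmer clustalw2)" ] && perl ecoli_mlst.pl -a fas -g fasta -c` starts the script only when all three programs are available.
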
# Options

## Mandatory options

- -a, -alleles

    File extension of the MLST allele fasta files, e.g. 'fas' (<=> **-g**).

- -g, -genomes

    File extension of the *E. coli* genome fasta files, e.g. 'fasta' (<=> **-a**).

## Optional options

- -h, -help

    Help (perldoc POD)

- -c, -clustalw

    Call [*ClustalW*](http://www.clustal.org/clustal2/) for alignment

# Output

- ecoli_mlst_profile.txt

    Tab-separated allele profiles for the *E. coli* genomes, plus additional info from 'publicSTs.txt'

- ecoli_mlst_seq.fasta

    Multi-fasta file of all concatenated allele sequences for each genome

- \*.coord

    Text files that contain the coordinates of the *NUCmer* hits for each genome and allele

- (errors.txt)

    Error file, summarizing number of not found alleles or unclear *NUCmer* hits

- (ecoli_mlst_seq_aln.fasta)

    Optional, [*ClustalW*](http://www.clustal.org/clustal2/) alignment in Phylip format

- (ecoli_mlst_seq_aln.dnd)

    Optional, *ClustalW* alignment guide tree

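Since the Phylip alignment format from *ClustalW* allows only 10 characters per strain ID (see [Description](#description)), it can help to inspect the IDs in 'ecoli_mlst_seq.fasta' before aligning with **-c**. The following sketch uses a hypothetical helper and assumes that the fasta headers carry the strain IDs and that longer IDs would be cut to their first 10 characters.

```shell
# Hypothetical helper: report fasta IDs longer than 10 characters and IDs that
# become identical once cut to 10 characters. Returns 1 if any are found.
check_phylip_ids() {
    awk '/^>/ {
             id = substr($1, 2)
             short = substr(id, 1, 10)
             if (length(id) > 10) { print "too long: " id; bad = 1 }
             if (seen[short]++)   { print "collision after truncation: " id " -> " short; bad = 1 }
         }
         END { exit bad + 0 }' "$1"
}
```

A call such as `check_phylip_ids ecoli_mlst_seq.fasta` prints nothing when every ID fits the limit.
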
## Run environment

The Perl script runs only under UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

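## Changelog

* v0.3 (30.01.2013)
    - additional info in POD
    - check if result files already exist and ask user what to do
    - changed script name from `ecoli_mlst_alleles.pl` to `ecoli_mlst.pl`
* v0.2 (20.10.2012)
    - included a POD
    - options with Getopt::Long
    - skip input *E. coli* genome query files that are too big (cutoff set at 9 MB for an *E. coli* fasta file)
    - draft *E. coli* genomes can now be used as input query files
    - additional info in 'publicSTs.txt' now associated with found STs in output
    - print to STDOUT which files were created
    - new option **-c** to align the resulting allele sequences via *ClustalW*
* v0.1 (25.10.2011)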
130 - changed script name from `ecoli_mlst_alleles.pl` to `ecoli_mlst.pl`
131 * v0.2 (20.10.2012)
132 - included a POD
133 - options with Getopt::Long
134 - don't consider input *E. coli* genome query files, which are too big (set cutoff at 9 MB for a fasta *E. coli* file)
135 - draft *E. coli* genomes can now be used as input query files
136 - additional info in 'publicSTs.txt' now associated to found ST types in output
137 - give text to STDOUT which files were created
138 - new option **-c** to align the resulting allele sequences via *ClustalW*
139 * v0.1 (25.10.2011)