comparison COG/bac-genomics-scripts/ecoli_mlst/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
ecoli_mlst
==========

`ecoli_mlst` is a script to determine MLST sequence types for *E. coli* genomes and extract allele sequences.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

# Synopsis

    perl ecoli_mlst.pl -a fas -g fasta

# Description

The script searches for multilocus sequence type (MLST) alleles in *E. coli* genomes according to
Mark Achtman's scheme with seven house-keeping genes (*adk*, *fumC*, *gyrB*,
*icd*, *mdh*, *purA*, and *recA*) [Wirth et al., 2006]. *NUCmer* from the
[*MUMmer package*](http://mummer.sourceforge.net/) is used to compare the given allele
sequences to bacterial genomes via nucleotide alignments.

Download the allele files (adk.fas ...) and the sequence type file
('publicSTs.txt') from this website:
http://mlst.ucc.ie/mlst/dbs/Ecoli

To run `ecoli_mlst.pl`, place all *E. coli* genome files (file
extension e.g. 'fasta'), all allele sequence files (file extension
'fas'), and 'publicSTs.txt' in the current working directory. The
allele profiles are parsed from the created \*.coord files and written
to a result file, together with additional information from
'publicSTs.txt'. In addition, the corresponding allele sequences (obtained
from the allele input files) are concatenated for each *E. coli* genome
into a result multi-fasta file. Option **-c** starts an alignment of
this multi-fasta file with [*ClustalW*](http://www.clustal.org/clustal2/) using standard
alignment parameters; *ClustalW* has to be in the `$PATH`, or the variable
`$clustal_call` has to be changed. The alignment output file can be used
directly as input for [*RAxML*](http://sco.h-its.org/exelixis/web/software/raxml/index.html). Careful: the Phylip alignment format from
*ClustalW* allows only 10 characters per strain ID.

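A small pre-flight check can catch a missing input before the script is started. The following is only a sketch, not part of `ecoli_mlst.pl`: the helper name `check_inputs` is hypothetical, and it assumes that the allele files are named after the seven genes (e.g. 'adk.fas'), which the download instructions above suggest but do not state explicitly.

```python
from pathlib import Path

# Gene names as listed in the description; the "<gene>.<extension>"
# file naming is an assumption, not something the script guarantees.
GENES = ["adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA"]


def check_inputs(workdir, allele_ext="fas", genome_ext="fasta"):
    """Return a list of problems found in the working directory.

    An empty list means all expected input files were found.
    """
    workdir = Path(workdir)
    problems = []

    # The sequence type file has a fixed name.
    if not (workdir / "publicSTs.txt").is_file():
        problems.append("publicSTs.txt is missing")

    # One allele file per gene (assumed naming, see above).
    for gene in GENES:
        name = f"{gene}.{allele_ext}"
        if not (workdir / name).is_file():
            problems.append(f"allele file {name} is missing")

    # At least one genome file with the given extension.
    if not any(workdir.glob(f"*.{genome_ext}")):
        problems.append(f"no genome files with extension '{genome_ext}'")

    return problems
```

Calling `check_inputs(".")` in the working directory returns an empty list if everything is in place; otherwise each entry names one missing input.
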
`ecoli_mlst.pl` works with complete and draft genomes. However, each genome needs its own input file; several genomes cannot be included in a single input file!

Results can only be obtained for genomes whose allele sequences have been
deposited in Achtman's allele database. If an
allele is not found in a genome, it is marked by a '?' in the result
profile file and by a placeholder 'XXX' in the result fasta file. In
these cases a manual *NUCmer* or *BLASTN* search might be useful to fill the
gaps, and [`run_sub_seq.pl`](/run_sub_seq) to get the corresponding 'new' allele
sequences.

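To see quickly which genomes need such a manual follow-up, the profile file can be scanned for '?' fields. This is a minimal sketch: the function name `missing_alleles` is hypothetical, and it assumes that the genome ID is in the first tab-separated column of 'ecoli_mlst_profile.txt', which should be checked against a real result file.

```python
def missing_alleles(profile_lines):
    """Map genome ID -> number of '?' fields in its profile row.

    Assumes tab-separated rows with the genome ID in the first column
    (an assumption about the layout of the profile file).
    """
    result = {}
    for line in profile_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip empty or single-column lines
        # Count fields that consist only of the '?' marker.
        count = sum(1 for field in fields[1:] if field.strip() == "?")
        if count:
            result[fields[0]] = count
    return result
```

With an open file handle, `missing_alleles(open("ecoli_mlst_profile.txt"))` lists only the genomes that have at least one allele not found.
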
Non-NCBI fasta headers in the genome files must have a
unique ID directly following the '>' (e.g. 'Sakai', '55989' ...).

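The ID requirement can be checked before a run. The sketch below is an illustration under assumptions: it treats the first whitespace-delimited token after '>' as the ID and looks only at the first header of each genome file; the names `header_id` and `duplicate_ids` are hypothetical.

```python
from collections import defaultdict


def header_id(header_line):
    """Return the first whitespace-delimited token after '>'.

    Returns an empty string if nothing follows the '>'.
    """
    parts = header_line[1:].split()
    return parts[0] if parts else ""


def duplicate_ids(first_headers):
    """Find IDs that are used by more than one genome file.

    first_headers maps a file name to the first header line of that file.
    Returns a dict: ID -> sorted list of file names sharing that ID.
    """
    seen = defaultdict(list)
    for file_name, header in first_headers.items():
        seen[header_id(header)].append(file_name)
    return {i: sorted(files) for i, files in seen.items() if len(files) > 1}
```

An empty result from `duplicate_ids` means that no two genome files share an ID under these assumptions.
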
# Usage

    perl ecoli_mlst.pl -a fas -g fasta -c

# Options

## Mandatory options

- -a, -alleles

    File extension of the MLST allele fasta files, e.g. 'fas' (<=> **-g**).

- -g, -genomes

    File extension of the *E. coli* genome fasta files, e.g. 'fasta' (<=> **-a**).

## Optional options

- -h, -help

    Help (perldoc POD)

- -c, -clustalw

    Call [*ClustalW*](http://www.clustal.org/clustal2/) for alignment

# Output

- ecoli_mlst_profile.txt

    Tab-separated allele profiles for the *E. coli* genomes, plus additional info from 'publicSTs.txt'

- ecoli_mlst_seq.fasta

    Multi-fasta file of all concatenated allele sequences for each genome

- \*.coord

    Text files that contain the coordinates of the *NUCmer* hits for each genome and allele

- (errors.txt)

    Error file, summarizing the number of alleles not found or unclear *NUCmer* hits

- (ecoli_mlst_seq_aln.fasta)

    Optional, [*ClustalW*](http://www.clustal.org/clustal2/) alignment in Phylip format

- (ecoli_mlst_seq_aln.dnd)

    Optional, *ClustalW* alignment guide tree

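Because the Phylip alignment allows only 10 characters per strain ID, longer IDs can become indistinguishable in the alignment output. The following sketch flags such clashes beforehand; it assumes that IDs are simply cut to their first ten characters, and the function name `phylip_id_clashes` is hypothetical.

```python
PHYLIP_ID_WIDTH = 10  # maximum strain ID length in the Phylip format


def phylip_id_clashes(strain_ids, width=PHYLIP_ID_WIDTH):
    """Group strain IDs that are identical after cutting to `width` characters.

    Returns a dict: shortened ID -> list of original IDs, containing only
    groups with more than one member. Truncation to the first `width`
    characters is an assumption about how the IDs are shortened.
    """
    groups = {}
    for strain_id in strain_ids:
        groups.setdefault(strain_id[:width], []).append(strain_id)
    return {short: ids for short, ids in groups.items() if len(ids) > 1}
```

If the returned dict is not empty, renaming the affected IDs in the genome fasta headers before running with **-c** avoids ambiguous names in the alignment.
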
## Run environment

The Perl script runs only under UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.3 (30.01.2013)
    - additional info in POD
    - check if result files already exist and ask user what to do
    - changed script name from `ecoli_mlst_alleles.pl` to `ecoli_mlst.pl`
* v0.2 (20.10.2012)
    - included a POD
    - options with Getopt::Long
    - don't consider input *E. coli* genome query files, which are too big (set cutoff at 9 MB for a fasta *E. coli* file)
    - draft *E. coli* genomes can now be used as input query files
    - additional info in 'publicSTs.txt' now associated to found ST types in output
    - give text to STDOUT which files were created
    - new option **-c** to align the resulting allele sequences via *ClustalW*
* v0.1 (25.10.2011)