diff COG/bac-genomics-scripts/ecoli_mlst/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/COG/bac-genomics-scripts/ecoli_mlst/README.md	Thu May 30 11:52:25 2024 +0000
@@ -0,0 +1,139 @@
+ecoli_mlst
+==========
+
+`ecoli_mlst` is a script to determine MLST sequence types for *E. coli* genomes and extract allele sequences.
+
+* [Synopsis](#synopsis)
+* [Description](#description)
+* [Usage](#usage)
+* [Options](#options)
+  * [Mandatory options](#mandatory-options)
+  * [Optional options](#optional-options)
+* [Output](#output)
+* [Run environment](#run-environment)
+* [Author - contact](#author---contact)
+* [Citation, installation, and license](#citation-installation-and-license)
+* [Changelog](#changelog)
+
+# Synopsis
+
+    perl ecoli_mlst.pl -a fas -g fasta
+
+# Description
+
+The script searches for multilocus sequence type (MLST) alleles in *E. coli* genomes according to
+Mark Achtman's scheme with seven housekeeping genes (*adk*, *fumC*, *gyrB*,
+*icd*, *mdh*, *purA*, and *recA*) [Wirth et al., 2006]. *NUCmer* from the
+[*MUMmer package*](http://mummer.sourceforge.net/) is used to compare the given allele
+sequences to bacterial genomes via nucleotide alignments.
+
+Download the allele files (adk.fas ...) and the sequence type file
+('publicSTs.txt') from this website:
+<http://mlst.ucc.ie/mlst/dbs/Ecoli>
+
+To run `ecoli_mlst.pl`, place all *E. coli* genome files (file
+extension e.g. 'fasta'), all allele sequence files (file extension
+'fas'), and 'publicSTs.txt' in the current working directory. The
+allele profiles are parsed from the created \*.coord files and written
+to a result file, together with additional information from the file
+'publicSTs.txt'. In addition, the corresponding allele sequences (obtained
+from the allele input files) are concatenated for each *E. coli* genome
+into a result multi-fasta file. Option **-c** can be used to initiate
+an alignment of this multi-fasta file with [*ClustalW*](http://www.clustal.org/clustal2/) (standard
+alignment parameters; *ClustalW* has to be in the `$PATH`, otherwise change the variable
+`$clustal_call`). The alignment fasta output file can be used
+directly for [*RAxML*](http://sco.h-its.org/exelixis/web/software/raxml/index.html). Careful: the Phylip alignment format from
+*ClustalW* allows only 10 characters per strain ID.
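+
+Before starting a run, it can help to confirm that the working directory holds the inputs described above. The following shell function is a minimal sketch and not part of `ecoli_mlst.pl`: the name `check_mlst_inputs` is hypothetical, and the allele file names other than `adk.fas` are assumed to follow the gene names of the scheme.
+
+```shell
+# Hypothetical pre-flight check, not part of ecoli_mlst.pl.
+# Usage: check_mlst_inputs DIR ALLELE_EXT GENOME_EXT
+check_mlst_inputs() {
+    dir=$1
+    allele_ext=$2
+    genome_ext=$3
+    status=0
+
+    # One allele file per gene; names other than adk are an assumption.
+    for gene in adk fumC gyrB icd mdh purA recA; do
+        if [ ! -s "$dir/$gene.$allele_ext" ]; then
+            echo "missing allele file: $gene.$allele_ext" >&2
+            status=1
+        fi
+    done
+
+    # Sequence type file from the MLST database.
+    if [ ! -s "$dir/publicSTs.txt" ]; then
+        echo "missing sequence type file: publicSTs.txt" >&2
+        status=1
+    fi
+
+    # At least one non-empty genome file with the given extension.
+    genome_found=0
+    for file in "$dir"/*."$genome_ext"; do
+        if [ -s "$file" ]; then
+            genome_found=1
+            break
+        fi
+    done
+    if [ "$genome_found" -eq 0 ]; then
+        echo "no genome files with extension: $genome_ext" >&2
+        status=1
+    fi
+
+    return "$status"
+}
+```
+
+For example, `check_mlst_inputs . fas fasta && perl ecoli_mlst.pl -a fas -g fasta` starts the script only if the check passes.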
+
+`ecoli_mlst.pl` works with complete and draft genomes. However, each input file must contain only one genome; several genomes cannot be combined in a single input file.
+
+Results can only be obtained for genomes whose allele sequences have
+been deposited in Achtman's allele database. If an
+allele is not found in a genome, it is marked by a '?' in the result
+profile file and a placeholder 'XXX' in the result fasta file. In
+these cases, a manual *NUCmer* or *BLASTN* search might be useful to fill the
+gaps, and [`run_sub_seq.pl`](/run_sub_seq) to get the corresponding 'new' allele
+sequences.
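+
+To see which genomes need such manual follow-up, the profile file can be scanned for the '?' marker. This is only a sketch: the helper name `rows_with_missing_alleles` is hypothetical, and it assumes that a missing allele appears as a tab-separated field consisting solely of '?'.
+
+```shell
+# Hypothetical helper: print profile rows with at least one unassigned allele.
+# Assumes a tab-separated file in which a missing allele is a field equal to '?'.
+rows_with_missing_alleles() {
+    awk -F '\t' '{
+        for (i = 1; i <= NF; i++) {
+            if ($i == "?") { print; next }
+        }
+    }' "$1"
+}
+```
+
+A call such as `rows_with_missing_alleles ecoli_mlst_profile.txt` prints the affected rows unchanged.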
+
+Non-NCBI fasta headers in the genome files must have a
+unique ID directly following the '>' (e.g. 'Sakai', '55989' ...).
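+
+The header requirement, and the 10-character limit of the Phylip format mentioned above, can be checked before a run. The helpers below are a sketch under assumptions: the function names are hypothetical, the ID is taken to be the first whitespace-delimited token after '>', and the first header of each file is taken to represent the genome. How `ecoli_mlst.pl` itself parses headers is not covered here.
+
+```shell
+# Hypothetical helpers, not part of ecoli_mlst.pl.
+
+# Print the first whitespace-delimited token after '>' from the first
+# header line of each given fasta file.
+first_header_ids() {
+    awk 'FNR == 1 && /^>/ { sub(/^>/, "", $1); print $1 }' "$@"
+}
+
+# Read IDs from stdin (one per line) and print those occurring more than once.
+duplicate_ids() {
+    sort | uniq -d
+}
+
+# Read IDs from stdin and print those longer than 10 characters.
+long_ids() {
+    awk 'length($0) > 10'
+}
+```
+
+For example, `first_header_ids *.fasta | duplicate_ids` lists IDs shared by several genome files, and `first_header_ids *.fasta | long_ids` lists IDs that would exceed the Phylip limit.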
+
+# Usage
+
+    perl ecoli_mlst.pl -a fas -g fasta -c
+
+# Options
+
+## Mandatory options
+
+- -a, -alleles
+
+    File extension of the MLST allele fasta files, e.g. 'fas' (in contrast to **-g**).
+
+- -g, -genomes
+
+    File extension of the *E. coli* genome fasta files, e.g. 'fasta' (in contrast to **-a**).
+
+## Optional options
+
+- -h, -help
+
+    Help (perldoc POD)
+
+- -c, -clustalw
+
+    Call [*ClustalW*](http://www.clustal.org/clustal2/) for alignment
+
+# Output
+
+- ecoli_mlst_profile.txt
+
+    Tab-separated allele profiles for the *E. coli* genomes, plus additional info from 'publicSTs.txt'
+
+- ecoli_mlst_seq.fasta
+
+    Multi-fasta file of all concatenated allele sequences for each genome
+
+- \*.coord
+
+    Text files that contain the coordinates of the *NUCmer* hits for each genome and allele
+
+- (errors.txt)
+
+    Error file, summarizing the number of alleles not found and of unclear *NUCmer* hits
+
+- (ecoli_mlst_seq_aln.fasta)
+
+    Optional, [*ClustalW*](http://www.clustal.org/clustal2/) alignment in Phylip format
+
+- (ecoli_mlst_seq_aln.dnd)
+
+    Optional, *ClustalW* alignment guide tree
+
+# Run environment
+
+The Perl script runs only under UNIX flavors.
+
+# Author - contact
+
+Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
+
+# Citation, installation, and license
+
+For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
+
+# Changelog
+
+* v0.3 (30.01.2013)
+    - additional info in POD
+    - check if result files already exist and ask user what to do
+    - changed script name from `ecoli_mlst_alleles.pl` to `ecoli_mlst.pl`
+* v0.2 (20.10.2012)
+    - included a POD
+    - options with Getopt::Long
+    - don't consider input *E. coli* genome query files, which are too big (set cutoff at 9 MB for a fasta *E. coli* file)
+    - draft *E. coli* genomes can now be used as input query files
+    - additional info in 'publicSTs.txt' now associated to found ST types in output
+    - give text to STDOUT which files were created
+    - new option **-c** to align the resulting allele sequences via *ClustalW*
+* v0.1 (25.10.2011)