annotate COG/bac-genomics-scripts/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.215824.svg)](http://dx.doi.org/10.5281/zenodo.215824)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 bac-genomics-scripts
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 ====================
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6 A collection of scripts intended for **bacterial genomics** (some might also be useful for eukaryotes) from **high-throughput sequencing** (aka next-generation sequencing).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8 * [Summary](#summary)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 * [Introduction](#introduction)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10 * [Installation recommendations](#installation-recommendations)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 * [Dependencies](#dependencies)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12 * [UNIX loops](#unix-loops)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13 * [Windows - UNIX linebreak problems](#windows---unix-linebreak-problems)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14 * [Citation](#citation)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15 * [License](#license)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16 * [Author - contact](#author---contact)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18 ## Summary
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20 * Basic stats for bases and reads in FASTQ files: [`calc_fastq-stats`](/calc_fastq-stats)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 * Concatenate multi-sequence files (RichSeq EMBL or GENBANK format, or FASTA format) to a single artificial file: [`cat_seq`](/cat_seq)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 * COG ([cluster of orthologous groups](http://www.ncbi.nlm.nih.gov/COG/)) classification of proteins: [`cdd2cog`](/cdd2cog)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23 * Extraction of protein/nucleotide sequences from CDSs: [`cds_extractor`](/cds_extractor)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24 * MLST (multilocus sequence typing) assignment and allele extraction for *Escherichia coli* ([Achtman scheme](http://mlst.warwick.ac.uk/mlst/)): [`ecoli_mlst`](/ecoli_mlst)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 * Create a feature table for all annotated primary features in RichSeq (EMBL or GENBANK format) files: [`genomes_feature_table`](/genomes_feature_table)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 * **Deprecated!** Batch downloading of sequences from NCBI's FTP server: [`ncbi_ftp_download`](/ncbi_ftp_download)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27 * Order sequence entries in FASTA/FASTQ files according to an ID list: [`order_fastx`](/order_fastx)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 * Create an ortholog/paralog annotation comparison matrix from [*Proteinortho5*](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output: [`po2anno`](/po2anno)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29 * Calculate stats and plot venn diagrams for genome groups according to orthologs/paralogs from [*Proteinortho5*](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output, i.e. overall presence/absence statistics for groups of genomes and not simply single genomes: [`po2group_stats`](/po2group_stats)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 * Strain panel query protein search with **BLASTP** plus concise hit summary, optional alignment, and presence/absence matrix. Also included, scripts to transpose the matrix and calculate overall presence/absence statistics for groups of columns in the matrix: [`prot_finder`](/prot_finder)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31 * Rename FASTA ID lines and optionally numerate them: [`rename_fasta_id`](/rename_fasta_id)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 * Reverse complement (multi-)sequence files (RichSeq EMBL or GENBANK format, or FASTA format): [`revcom_seq`](/revcom_seq)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33 * Regions of difference (ROD) detection in genomes with **BLASTN**: [`rod_finder`](/rod_finder)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 * NGS paired-end library insert size estimation from BAM/SAM: [`sam_insert-size`](/sam_insert-size)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35 * Randomly subsample FASTA, FASTQ, or TEXT files with [*reservoir sampling*](https://en.wikipedia.org/wiki/Reservoir_sampling): [`sample_fastx-txt`](/sample_fastx-txt)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 * Convert a sequence file to another format with [BioPerl](http://www.bioperl.org): [`seq_format-converter`](/seq_format-converter)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37 * Manual curation of annotation in NCBI's TBL format (e.g. from [Prokka](http://www.vicbioinformatics.com/software.prokka.shtml) automatic annotation) in a spreadsheet software: [`tbl2tab`](/tbl2tab)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38 * Truncate sequence files (RichSeq EMBL or GENBANK format, or FASTA format) according to given coordinates: [`trunc_seq`](/trunc_seq)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39 * And an assortment of smaller scripts for tasks like (not yet uploaded to GitHub): alignment format converters, dnadiff, GC% calculation etc.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 ## Introduction
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 All the scripts here are written in [**Perl**](https://www.perl.org/) (some include bash shell wrappers).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45 Each script is hosted in its own folder, so that a separate *README.md* can be included for more information. However, all of the Perl scripts include additionally a usage/help text or a comprehensive [POD](http://perldoc.perl.org/perlpod.html) (Plain Old Documentation) by calling the script either without arguments/options or option **-h|-help**.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47 The scripts are only tested under UNIX, some won't run in a Windows environment (because of included UNIX commands). If you are on Windows an alternative might be [Cygwin](http://cygwin.com/).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49 ## Installation recommendations
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51 To download the repository, use either the '[Download ZIP](https://github.com/aleimba/bac-genomics-scripts/archive/master.zip)' link after clicking the green 'Clone or download' button at the top or clone the repository with `git`:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53 git clone https://github.com/aleimba/bac-genomics-scripts.git
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 If there is an update to this GitHub repository (see above [commits](https://github.com/aleimba/bac-genomics-scripts/commits/master) and [releases](https://github.com/aleimba/bac-genomics-scripts/releases)), you can refresh your **local** repository by using the following command **inside** the local folder:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 git pull
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59 To install the scripts, copy them e.g. to a home */bin* folder in your *PATH* and make them executable
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 $ find . \( -name '*.pl' -o -name '*.sh' -o -name '*.fas' -o -name '*.txt' \) -exec cp {} ~/bin \;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62 $ chmod u+x ~/bin/*.pl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64 the scripts can then be run everywhere on your system. Of course you can just call them directly by prefexing `perl` to the command or a './' for bash wrappers:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66 $ perl /path/to/script/script.pl <options>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68 or
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70 $ ./script.sh <arguments>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72 **Single** scripts can be downloaded as well. For this purpose click on the folder you're interested in and then on the link of the script. There click on the **Raw** button and save this page to a file (without **Raw** you'll get an unusable html file). This is also true for other files (e.g. PDFs etc.).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74 ## Dependencies
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76 All scripts are tested with Perl v5.22.1.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78 Most of the Perl scripts include modules from [BioPerl](http://www.bioperl.org) as stated in their respective *README.md* or POD, which as a consequence has to be installed on your system. For BioPerl installation instructions see the website ([**Installation**](http://bioperl.org/INSTALL.html)).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80 Some scripts need additional Perl modules, which will be stated in the associated *README.md* or POD. If they're not installed yet on your system get them from [CPAN](http://www.cpan.org/) (installation instructions can be found on the website, see e.g. [**Getting Started...Installing Perl Modules**](http://www.cpan.org/modules/INSTALL.html) or [**FAQ**](http://www.cpan.org/misc/cpan-faq.html#How_install_Perl_modules)).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82 Furthermore, some scripts call upon statistical computing language [**R**](http://www.r-project.org/) and dependent packages for plotting purposes (again see the respective *README.md* or POD).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84 ## UNIX loops
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86 A very handy tip, if you want to run a script on all files in the current working directory you can use a **loop** in UNIX, e.g.:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88 $ for file in *.fasta; do perl script.pl "$file"; done
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90 ## Windows - UNIX linebreak problems
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92 At last, some of the scripts don't like Windows formatted line breaks, you might consider running these input files through a nifty UNIX utility called [dos2unix](http://dos2unix.sourceforge.net/):
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94 $ dos2unix input
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96 ## Citation
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97 For now cite the latest major release (tag: [***bovine_ecoli_mastitis***](https://github.com/aleimba/bac-genomics-scripts/releases)) hosted on [Zenodo](https://zenodo.org/):
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99 **Leimbach A**. 2016. bac-genomics-scripts: Bovine *E. coli* mastitis comparative genomics edition. Zenodo. <http://dx.doi.org/10.5281/zenodo.215824>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 Also, all scripts have a version number (see option **-v**), which might be included in a materials and methods section.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103 ## License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105 All scripts are licensed under GPLv3 which is contained in the file [*LICENSE*](./LICENSE).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107 ## Author - contact
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108 For help, suggestions, bugs etc. use the GitHub issues or write an email to aleimba [at] gmx [dot] de.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110 Andreas Leimbach (Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)