genomes_feature_table
=====================

`genomes_feature_table.pl` is a script to create a feature table for genomes in EMBL and GENBANK format.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
* [Output](#output)
* [Run environment](#run-environment)
* [Dependencies](#dependencies)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv

## Description

A genome feature table lists basic stats/info (e.g. genome size, GC
content, coding percentage, accession number(s)) and the numbers of
annotated primary features (e.g. CDS, genes, RNAs) of genomes. It
gives an overview of these features across different genomes, e.g.
in comparative genomics publications.

`genomes_feature_table.pl` is designed to extract (or calculate)
these basic stats and **all** annotated primary features from RichSeq
files (**EMBL** or **GENBANK** format) in a specified directory (with the
correct file extension, see option **-e**). The **default** directory
is the current working directory. The primary features are
counted and the results for each genome printed in tab-separated
format. It is a requirement that each file contains **only one**
genome (complete or draft, with or without plasmids).

The most important features are listed first: genome
description, genome size, GC content, coding percentage (calculated
based on non-pseudo CDS annotation), CDS and gene numbers, accession
number(s) (first..last in the sequence file), RNAs (rRNA, tRNA,
tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are
annotated in a sequence file, the number of plasmids is
counted and listed as well (this needs a */plasmid="plasmid_name"* tag in the
*source* primary tag, see e.g. GenBank accession number
[CP009167](http://www.ncbi.nlm.nih.gov/nuccore/CP009167)). Use option **-p**
to list plasmids as separate entries (lines) in the feature table.

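
As an illustration of the GC content column: GC content is conventionally the share of G and C bases among all bases of a sequence. The sketch below computes that share for a literal sequence string. `gc_percent` is a hypothetical helper, not part of `genomes_feature_table.pl`, and counting unresolved 'N' bases in the denominator is an assumption made here, not something this README states about the script.

```shell
#!/bin/sh
# Hypothetical helper: GC percentage of a sequence given as a string.
# Assumption: every character (including 'N') counts toward the length.
gc_percent() {
    printf '%s\n' "$1" | awk '{
        s  = toupper($0)            # treat lower-case bases like upper-case
        n  = length(s)              # total number of bases
        gc = gsub(/[GC]/, "", s)    # gsub returns the number of G/C bases
        if (n > 0) printf "%.2f\n", 100 * gc / n
        else       print "NA"
    }'
}
```

For example, `gc_percent ATGC` prints `50.00`.
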
For draft genomes the number of contigs/scaffolds is counted. All
contigs/scaffolds of draft genomes should be marked with the *WGS*
keyword (see e.g. draft NCBI GenBank entry
[JSAY00000000](http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000)). If this is
not the case for your file(s), you can add the keyword to each
sequence entry with the following Perl one-liners (they
edit the files in place). For files in **GENBANK** format, if 'KEYWORDS .' is present:

    perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file

or if 'KEYWORDS' isn't present at all:

    perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS WGS.\n";} else{ print;}' file

For files in **EMBL** format, if 'KW .' is present:

    perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file

or if 'KW' isn't present at all:

    perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW WGS.\n";} else{ print;}' file

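
Because the one-liners above edit files in place, it can help to check first which draft genome files actually lack the keyword. The following is a minimal sketch under stated assumptions: `has_wgs_keyword` and `list_missing_wgs` are hypothetical helpers (not part of the script), the keyword is expected on the first 'KEYWORDS' or 'KW' line, and keyword lists wrapped onto continuation lines are not handled.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a GENBANK ('KEYWORDS') or EMBL ('KW')
# keyword line in the file contains the stand-alone word WGS.
has_wgs_keyword() {
    grep -Eq '^(KEYWORDS|KW)[[:space:]]+(.*[^[:alnum:]])?WGS([^[:alnum:]]|$)' "$1"
}

# Hypothetical helper: print every regular file in a directory that has
# no WGS keyword line, i.e. the candidates for the one-liners above.
list_missing_wgs() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        has_wgs_keyword "$f" || printf '%s\n' "$f"
    done
}
```

Running `list_missing_wgs path/to/genome_dir` prints one file name per line; complete genomes will show up as well, since only draft genomes need the keyword.
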
## Usage

    perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv

    perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv

## Options

- -h, -help

    Help (perldoc POD)

- -e, -extensions

    File extensions to include in the analysis (EMBL or GENBANK format),
    either as a comma-separated list or as multiple occurrences of the option
    \[default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank\]

- -p, -plasmids

    Optionally list plasmids as extra entries in the feature table, if
    they are annotated with a */plasmid="plasmid_name"* tag in the
    *source* primary tag

- -v, -version

    Print version number to *STDERR*

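
To preview which files in a directory carry one of the extensions given to **-e**, a small shell sketch can help. `list_genome_files` is a hypothetical helper, and it only approximates the selection: it compares the text after the last dot case-sensitively, which is an assumption, since this README does not describe how the script itself matches extensions.

```shell
#!/bin/sh
# Hypothetical helper: list regular files in a directory whose extension
# is in a comma-separated list (defaults mirror the documented -e default).
list_genome_files() {
    dir=${1:-.}
    exts=${2:-ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # ${f##*.} is the text after the last dot in the path
        case ",$exts," in
            *",${f##*.},"*) printf '%s\n' "$f" ;;
        esac
    done
}
```

For example, `list_genome_files path/to/genome_dir gb,gbk` lists the files that the first command in the Usage section would be expected to consider, under the assumptions above.
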
## Output

- *STDOUT*

    The resulting feature table is printed to *STDOUT*. Redirect or
    pipe into another tool as needed (e.g. `cut`, `grep`, or `head`).

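
When selecting columns with `cut`, it helps to know their positions. The sketch below numbers the fields of the first line of a tab-separated file. `list_columns` is a hypothetical helper, and it assumes that the first line of the feature table is a header row, which this README does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper: print "index<TAB>name" for each tab-separated
# field on the first line of a file (assumed to be the header row).
list_columns() {
    awk -F'\t' 'NR == 1 {
        for (i = 1; i <= NF; i++) printf "%d\t%s\n", i, $i
        exit
    }' "$1"
}
```

With the positions known, a command such as `cut -f 1,3 feature_table.tsv` keeps only the chosen columns.
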
## Run environment

The Perl script runs under Windows and UNIX flavors.

## Dependencies

- [BioPerl](http://www.bioperl.org) (tested version 1.006923)

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

- v0.5 (14.09.2015)
    - changed script name to `genomes_feature_table.pl`
    - included a POD
    - options with Getopt::Long
    - included `pod2usage` with Pod::Usage
    - major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes
    - changed input options to get folder path from STDIN
        - as a consequence new option **-e|-extensions**
    - accession numbers not essential anymore, changed hash key to filename; but requires now only one genome per file
    - draft genomes should include 'WGS' keyword (warning if not)
    - option **-p|-plasmids** works now correctly with complete and draft genomes
    - count plasmids without option **-p**
- v0.4 (11.08.2013)
    - included 'use autodie;' pragma
    - included version switch
- v0.3 (05.11.2012)
    - new option **p** to report plasmid features in multi-sequence draft files separately
- v0.2 (19.09.2012)
- v0.1 (25.11.2011)
    - **original** script name: `get_genome_features.pl`