comparison COG/bac-genomics-scripts/genomes_feature_table/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
2:97e4e3e818b6 3:e42d30da7a74
1 genomes_feature_table
2 =====================
3
4 `genomes_feature_table.pl` is a script to create a feature table for genomes in EMBL and GENBANK format.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [Options](#options)
10 * [Output](#output)
11 * [Run environment](#run-environment)
12 * [Dependencies](#dependencies)
13 * [Author - contact](#author---contact)
14 * [Citation, installation, and license](#citation-installation-and-license)
15 * [Changelog](#changelog)
16
17 ## Synopsis
18
19 perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv
20
21 ## Description
22
23 A genome feature table lists basic stats/info (e.g. genome size, GC
24 content, coding percentage, accession number(s)) and the numbers of
25 annotated primary features (e.g. CDS, genes, RNAs) of genomes. It
26 provides an overview of these features across different
27 genomes, e.g. for comparative genomics publications.
28
29 `genomes_feature_table.pl` is designed to extract (or calculate)
30 these basic stats and **all** annotated primary features from RichSeq
31 files (**EMBL** or **GENBANK** format) in a specified directory (with the
32 correct file extension, see option **-e**). The **default** directory
33 is the current working directory. The primary features are
34 counted and the results for each genome printed in tab-separated
35 format. It is a requirement that each file contains **only one**
36 genome (complete or draft, with or without plasmids).
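To preview which files in a directory carry one of the accepted extensions, a small shell helper can be used. This is a minimal sketch and not part of `genomes_feature_table.pl`; the function name `list_by_extension` is hypothetical, and it assumes extensions are compared case-sensitively against the text after the last dot, which may differ from how the script itself matches them.

```shell
# Print the files in a directory whose extension is in a comma-separated list.
# Assumption: case-sensitive match on the text after the last dot of the file name.
list_by_extension() {
    dir=$1
    # Fall back to the default extension list documented for option -e.
    exts=${2:-ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue            # regular files only
        name=${f##*/}                      # strip the directory part
        case $name in
            *.*) ;;                        # has an extension, keep going
            *) continue ;;                 # no dot, no extension
        esac
        ext=${name##*.}                    # text after the last dot
        case ",$exts," in
            *,"$ext",*) printf '%s\n' "$f" ;;
        esac
    done
}
```

For example, `list_by_extension path/to/genome_dir gbf,embl` lists the files that a run restricted to those two extensions would be expected to consider, under the matching assumption stated above.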
37
38 The most important features will be listed first, like genome
39 description, genome size, GC content, coding percentage (calculated
40 based on non-pseudo CDS annotation), CDS and gene numbers, accession
41 number(s) (first..last in the sequence file), RNAs (rRNA, tRNA,
42 tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are
43 annotated in a sequence file, the number of plasmids is
44 counted and listed as well (needs a */plasmid="plasmid_name"* tag in the
45 *source* primary tag, see e.g. Genbank accession number
46 [CP009167](http://www.ncbi.nlm.nih.gov/nuccore/CP009167)). Use option **-p**
47 to list plasmids as separate entries (lines) in the feature table.
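As an illustration of the GC content statistic mentioned above, the following shell sketch computes the percentage of G and C bases in a raw nucleotide sequence read from standard input. It assumes one common definition (G+C divided by all bases, including 'N'), which is not necessarily the exact calculation used by the script; the function name `gc_content` is hypothetical, and the input must be plain sequence text without FASTA headers or annotation lines.

```shell
# Compute GC content (percent) of a raw nucleotide sequence on stdin.
# Assumption: GC% = (G + C) / all bases, with N and other IUPAC codes in the denominator.
gc_content() {
    tr -d '\r' | tr 'acgt' 'ACGT' | awk '
        {
            total += length($0)        # all bases on this line, including N
            gc += gsub(/[GC]/, "")     # gsub returns the number of G/C it replaced
        }
        END {
            if (total > 0) printf "%.2f\n", 100 * gc / total
        }'
}
```

For instance, `printf 'ATGCGGCC\n' | gc_content` prints `75.00`, because six of the eight bases are G or C.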
48
49 For draft genomes the number of contigs/scaffolds is counted. All
50 contigs/scaffolds of draft genomes should be marked with the *WGS*
51 keyword (see e.g. draft NCBI Genbank entry
52 [JSAY00000000](http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000)). If this is
53 not the case for your file(s) you can add those keywords to each
54 sequence entry with the following Perl one-liners (these
55 edit the files in place). For files in **GENBANK** format, if 'KEYWORDS    .' is present
56
57 perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file
58
59 or if 'KEYWORDS' isn't present at all
60
61 perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS WGS.\n";} else{ print;}' file
62
63 For files in **EMBL** format if 'KW   .' is present
64
65 perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file
66
67 or if 'KW' isn't present at all
68
69 perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW WGS.\n";} else{ print;}' file
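Before editing files in place, it can help to check which files actually lack the keyword. The following is a minimal shell sketch for **GENBANK** files only. It assumes a `.gbk` file extension and that *WGS* appears on the first line of each 'KEYWORDS' entry (wrapped keyword lists are not handled). The function name `list_missing_wgs` is hypothetical.

```shell
# Report GENBANK files in which not every record has a KEYWORDS line mentioning WGS.
# Assumptions: files end in .gbk; WGS sits on the first KEYWORDS line of a record.
list_missing_wgs() {
    for f in "$1"/*.gbk; do
        [ -e "$f" ] || continue                            # glob matched nothing
        # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
        records=$(grep -c '^LOCUS' "$f" || true)           # one LOCUS line per record
        tagged=$(grep -c '^KEYWORDS.*WGS' "$f" || true)    # records carrying WGS
        if [ "$records" -ne "$tagged" ]; then
            printf '%s\t%s of %s records tagged\n' "$f" "$tagged" "$records"
        fi
    done
}
```

Running `list_missing_wgs path/to/genome_dir` prints one line per file that still needs the keyword and prints nothing if all records are tagged. Note that complete genomes are not expected to carry the keyword, so they will be listed as well.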
70
71 ## Usage
72
73 perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv
74
75 perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv
76
77 ## Options
78
79 - -h, -help
80
81 Help (perldoc POD)
82
83 - -e, -extensions
84
85 File extensions to include in the analysis (EMBL or GENBANK format),
86 either comma-separated list or multiple occurrences of the option
87 [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank]
88
89 - -p, -plasmids
90
91 Optionally list plasmids as extra entries in the feature table, if
92 they are annotated with a */plasmid="plasmid_name"* tag in the
93 *source* primary tag
94
95 - -v, -version
96
97 Print version number to *STDERR*
98
99 ## Output
100
101 - *STDOUT*
102
103 The resulting feature table is printed to *STDOUT*. Redirect or
104 pipe into another tool as needed (e.g. `cut`, `grep`, or `head`).
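As a sketch of such post-processing, the following shell function selects columns of a tab-separated table by header name, which is often easier than counting column positions for `cut`. It assumes that the first line of the table is a header row; the function name `tsv_select` and the header names in the usage example are hypothetical placeholders, not the script's actual column titles.

```shell
# Print selected columns of a tab-separated table, chosen by header name.
# Assumption: the first input line is a header row.
# $1 = comma-separated header names, stdin = the table.
tsv_select() {
    awk -F '\t' -v OFS='\t' -v want="$1" '
        NR == 1 {
            # Map each requested name to its column index in the header.
            n = split(want, names, ",")
            for (i = 1; i <= n; i++)
                for (c = 1; c <= NF; c++)
                    if ($c == names[i]) idx[i] = c
        }
        {
            line = ""
            for (i = 1; i <= n; i++) {
                sep = (i > 1) ? OFS : ""
                val = (i in idx) ? $(idx[i]) : ""   # unknown names give empty cells
                line = line sep val
            }
            print line
        }'
}
```

A call could look like `perl genomes_feature_table.pl path/to/genome_dir | tsv_select 'name,gc'`, with `name` and `gc` replaced by header names that actually occur in the first line of the table.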
105
106 ## Run environment
107
108 The Perl script runs under Windows and UNIX flavors.
109
110 ## Dependencies
111
112 - [BioPerl](http://www.bioperl.org) (tested version 1.006923)
113
114 ## Author - contact
115
116 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
117
118 ## Citation, installation, and license
119
120 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
121
122 ## Changelog
123
124 - v0.5 (14.09.2015)
125 - changed script name to `genomes_feature_table.pl`
126 - included a POD
127 - options with Getopt::Long
128 - included `pod2usage` with Pod::Usage
129 - major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes
130 - changed input options to get folder path from STDIN
131 - as a consequence new option **-e|-extensions**
132 - accession numbers no longer essential, changed hash key to filename; but now requires only one genome per file
133 - draft genomes should include 'WGS' keyword (warning if not)
134 - option **-p|-plasmids** now works correctly with complete and draft genomes
135 - count plasmids without option **-p**
136 - v0.4 (11.08.2013)
137 - included 'use autodie;' pragma
138 - included version switch
139 - v0.3 (05.11.2012)
140 - new option **p** to report plasmid features in multi-sequence draft files separately
141 - v0.2 (19.09.2012)
142 - v0.1 (25.11.2011)
143 - **original** script name: `get_genome_features.pl`