annotate COG/bac-genomics-scripts/genomes_feature_table/README.md @ 14:5a5c9a6b047b draft

Uploaded
author dereeper
date Tue, 10 Dec 2024 16:20:53 +0000
parents e42d30da7a74
children

genomes_feature_table
=====================

`genomes_feature_table.pl` is a script to create a feature table for genomes in EMBL and GENBANK format.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
* [Output](#output)
* [Run environment](#run-environment)
* [Dependencies](#dependencies)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv

## Description

A genome feature table lists basic statistics (e.g. genome size, GC
content, coding percentage, accession number(s)) and the numbers of
annotated primary features (e.g. CDS, genes, RNAs) for each genome. It
gives an overview of these features across different genomes, e.g. for
comparative genomics publications.

`genomes_feature_table.pl` extracts (or calculates) these basic
statistics and counts **all** annotated primary features from RichSeq
files (**EMBL** or **GENBANK** format) in a specified directory. Only
files with a matching file extension are read (see option **-e**). The
**default** directory is the current working directory. The results
for each genome are printed in tab-separated format. Each file must
contain **only one** genome (complete or draft, with or without
plasmids).
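
To preview which files in a directory carry one of the default extensions (see option **-e**), something like the following sketch may help. `list_genome_files` is a hypothetical helper written for this illustration, not part of the script, and the script's own matching (e.g. how it treats upper-case extensions) may differ.

```shell
# Hypothetical helper (not part of genomes_feature_table.pl): list files
# in a directory whose extension is in the default -e list.
list_genome_files() {
    dir="${1:-.}"   # fall back to the current working directory
    for ext in ebl emb embl gb gbf gbff gbank gbk genbank; do
        for f in "$dir"/*."$ext"; do
            # An unmatched glob stays literal, so test that the file exists.
            [ -f "$f" ] && printf '%s\n' "$f"
        done
    done
    return 0
}

# Usage: list_genome_files path/to/genome_dir
```

Since each file must hold exactly one genome, a listing like this is also a convenient place to spot files that should be split or merged before running the script.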

The most important features are listed first: genome description,
genome size, GC content, coding percentage (calculated from the
non-pseudo CDS annotation), CDS and gene numbers, accession number(s)
(first..last in the sequence file), RNAs (rRNA, tRNA, tmRNA, ncRNA),
and unresolved bases (IUPAC code 'N'). If plasmids are annotated in a
sequence file, the number of plasmids is counted and listed as well.
This requires a */plasmid="plasmid_name"* tag in the *source* primary
tag, see e.g. Genbank accession number
[CP009167](http://www.ncbi.nlm.nih.gov/nuccore/CP009167). Use option
**-p** to list plasmids as separate entries (lines) in the feature
table.
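
To check beforehand whether a file carries plasmid annotation at all, one can count the lines with a `/plasmid=` qualifier. This is only a rough sketch: `count_plasmid_tags` is a hypothetical helper, and it counts matching lines without verifying that they belong to a *source* feature.

```shell
# Hypothetical helper: count lines containing a /plasmid="..." qualifier.
count_plasmid_tags() {
    # grep -c prints the number of matching lines (0 if none);
    # '|| true' masks grep's non-zero exit status when nothing matches.
    grep -c '/plasmid="' "$1" || true
}

# Usage: count_plasmid_tags genome.gbk
```

A count of 0 suggests that the script has no plasmid tags to work with in that file, so option **-p** would presumably have nothing to list separately.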

For draft genomes the number of contigs/scaffolds is counted. All
contigs/scaffolds of draft genomes should be marked with the *WGS*
keyword (see e.g. draft NCBI Genbank entry
[JSAY00000000](http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000)). If
this is not the case for your file(s), you can add the keyword to each
sequence entry with the following Perl one-liners (these edit the
files in place). For files in **GENBANK** format, if 'KEYWORDS    .'
is present:

    perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file

or if 'KEYWORDS' isn't present at all:

    perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS WGS.\n";} else{ print;}' file

For files in **EMBL** format, if 'KW   .' is present:

    perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file

or if 'KW' isn't present at all:

    perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW WGS.\n";} else{ print;}' file
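
Because the one-liners modify files in place, it can be worth checking first which files actually lack the keyword. The sketch below assumes standard `KEYWORDS` (GENBANK) and `KW` (EMBL) line prefixes. `files_without_wgs` is a hypothetical helper, and it only tests whether at least one keyword line in a file mentions WGS, not every sequence entry.

```shell
# Hypothetical helper: print the names of files in which no KEYWORDS
# (GENBANK) or KW (EMBL) line mentions WGS.
files_without_wgs() {
    for f in "$@"; do
        grep -Eq '^(KEYWORDS|KW)[[:space:]].*WGS' "$f" || printf '%s\n' "$f"
    done
}

# Usage: files_without_wgs path/to/genome_dir/*.gbk
#
# Perl's -i switch also accepts a backup suffix, so the original file is
# kept as file.bak, e.g.:
#   perl -i.bak -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file
```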

## Usage

    perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv

    perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv

## Options

- -h, -help

    Help (perldoc POD)

- -e, -extensions

    File extensions to include in the analysis (EMBL or GENBANK format),
    given either as a comma-separated list or as multiple occurrences of the option
    [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank]

- -p, -plasmids

    Optionally list plasmids as extra entries in the feature table, if
    they are annotated with a */plasmid="plasmid_name"* tag in the
    *source* primary tag

- -v, -version

    Print version number to *STDERR*

## Output

- *STDOUT*

    The resulting feature table is printed to *STDOUT*. Redirect or
    pipe into another tool as needed (e.g. `cut`, `grep`, or `head`).
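
When picking columns with `cut`, it helps to know their positions. The sketch below assumes that the first line of the table is a header row, which is not stated above, so check this against your own output. `list_columns` is a hypothetical helper.

```shell
# Hypothetical helper: print the column titles of a tab-separated table,
# one per line and numbered, assuming the first line is a header row.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | nl
}

# Usage:
#   list_columns feature_table.tsv
#   cut -f 1,3 feature_table.tsv    # then select columns by number
```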

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Dependencies

- [BioPerl](http://www.bioperl.org) (tested version 1.006923)
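
Whether BioPerl is installed, and in which version, can usually be checked via its `Bio::Root::Version` module. A minimal sketch, where `bioperl_version` is a hypothetical wrapper name:

```shell
# Print the installed BioPerl version, or a short notice if BioPerl
# (or perl itself) cannot be loaded.
bioperl_version() {
    perl -MBio::Root::Version -e 'print "$Bio::Root::Version::VERSION\n"' 2>/dev/null \
        || echo "BioPerl not found"
}

# Usage: bioperl_version
```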

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

- v0.5 (14.09.2015)
    - changed script name to `genomes_feature_table.pl`
    - included a POD
    - options with Getopt::Long
    - included `pod2usage` with Pod::Usage
    - major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes
    - changed input options to get folder path from STDIN
        - as a consequence new option **-e|-extensions**
    - accession numbers not essential anymore, changed hash key to filename; but requires now only one genome per file
    - draft genomes should include 'WGS' keyword (warning if not)
    - option **-p|-plasmids** works now correctly with complete and draft genomes
    - count plasmids without option **-p**
- v0.4 (11.08.2013)
    - included 'use autodie;' pragma
    - included version switch
- v0.3 (05.11.2012)
    - new option **p** to report plasmid features in multi-sequence draft files separately
- v0.2 (19.09.2012)
- v0.1 (25.11.2011)
    - **original** script name: `get_genome_features.pl`