comparison COG/bac-genomics-scripts/genomes_feature_table/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
genomes_feature_table
=====================

`genomes_feature_table.pl` is a script to create a feature table for genomes in EMBL and GENBANK format.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
* [Output](#output)
* [Run environment](#run-environment)
* [Dependencies](#dependencies)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv

## Description

A genome feature table lists basic stats/info (e.g. genome size, GC
content, coding percentage, accession number(s)) and the numbers of
annotated primary features (e.g. CDS, genes, RNAs) of genomes. It
can be used to get an overview of these features in different
genomes, e.g. in comparative genomics publications.

`genomes_feature_table.pl` is designed to extract (or calculate)
these basic stats and **all** annotated primary features from RichSeq
files (**EMBL** or **GENBANK** format) in a specified directory (with the
correct file extension, see option **-e**). The **default** directory
is the current working directory. The primary features are
counted and the results for each genome are printed in tab-separated
format. Each file must contain **only one**
genome (complete or draft, with or without plasmids).

The most important features are listed first: genome
description, genome size, GC content, coding percentage (calculated
based on non-pseudo CDS annotation), CDS and gene numbers, accession
number(s) (first..last in the sequence file), RNAs (rRNA, tRNA,
tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are
annotated in a sequence file, the number of plasmids is
counted and listed as well (this needs a */plasmid="plasmid_name"* tag in the
*source* primary tag, see e.g. Genbank accession number
[CP009167](http://www.ncbi.nlm.nih.gov/nuccore/CP009167)). Use option **-p**
to list plasmids as separate entries (lines) in the feature table.
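
To make the listed statistics concrete, here is a minimal sketch in Python (not the script's own Perl/BioPerl code) of how genome size, GC content, unresolved bases, and coding percentage could be computed. The function names, the choice to compute GC content over all bases including 'N', and the merging of overlapping CDS intervals are assumptions made for illustration; the script may calculate these values differently.

```python
from typing import Dict, Iterable, Tuple


def basic_stats(seq: str) -> Dict[str, float]:
    """Size, GC content (%), and count of unresolved 'N' bases of a sequence.

    Assumption: GC content is taken over the full length, 'N' included.
    """
    s = seq.upper()
    size = len(s)
    gc = s.count("G") + s.count("C")
    return {
        "size_bp": size,
        "gc_percent": 100.0 * gc / size if size else 0.0,
        "unresolved_n": s.count("N"),
    }


def coding_percent(cds: Iterable[Tuple[int, int]], size: int) -> float:
    """Percentage of the genome covered by CDS features.

    `cds` holds 1-based inclusive (start, end) pairs of non-pseudo CDS.
    Assumption: bases shared by overlapping CDS are counted only once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(cds):
        start = max(start, last_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / size if size else 0.0
```

For example, `basic_stats("ATGCNNGC")` reports a size of 8 bp, and `coding_percent([(1, 10), (5, 20)], 40)` counts the overlap between the two intervals once.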

For draft genomes the number of contigs/scaffolds is counted. All
contigs/scaffolds of draft genomes should be marked with the *WGS*
keyword (see e.g. draft NCBI Genbank entry
[JSAY00000000](http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000)). If this is
not the case for your file(s), you can add the keyword to each
sequence entry with the following Perl one-liners (they
edit the files in place). For files in **GENBANK** format, if 'KEYWORDS .' is present:

    perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file

or if 'KEYWORDS' isn't present at all:

    perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS WGS.\n";} else{ print;}' file

For files in **EMBL** format, if 'KW .' is present:

    perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file

or if 'KW' isn't present at all:

    perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW WGS.\n";} else{ print;}' file
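
If you prefer to preview the change before editing files in place, the two GENBANK one-liners can be approximated in Python. This is a sketch under stated assumptions, not part of the script: the function name `add_wgs_keyword_genbank` is hypothetical, the inserted line is copied from the one-liner above (adjust its spacing to match the column layout of your files), and the whole text is treated uniformly, so it assumes either every entry or no entry in the file has a 'KEYWORDS' line.

```python
import re

# Line inserted when no KEYWORDS line exists; copied from the Perl one-liner.
KEYWORDS_LINE = "KEYWORDS WGS.\n"


def add_wgs_keyword_genbank(text: str) -> str:
    """Return GENBANK text with the WGS keyword added where it is missing."""
    # Case 1: an empty keyword line ("KEYWORDS ... .") exists, so fill in WGS
    # while keeping the original whitespace between the tag and the value.
    if re.search(r"^KEYWORDS[ \t]+\.", text, flags=re.MULTILINE):
        return re.sub(
            r"^KEYWORDS([ \t]+)\.", r"KEYWORDS\1WGS.", text, flags=re.MULTILINE
        )
    # Case 2: no KEYWORDS line at all, so add one after each ACCESSION line.
    if not re.search(r"^KEYWORDS", text, flags=re.MULTILINE):
        return re.sub(
            r"^(ACCESSION.*\n)",
            lambda m: m.group(1) + KEYWORDS_LINE,
            text,
            flags=re.MULTILINE,
        )
    # Otherwise keywords are already present; leave the text untouched.
    return text
```

Read the file, pass its content through the function, and compare the result with the original before writing anything back.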

## Usage

    perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv

    perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv

## Options

- -h, -help

    Help (perldoc POD)

- -e, -extensions

    File extensions to include in the analysis (EMBL or GENBANK format),
    either as a comma-separated list or as multiple occurrences of the option
    [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank]

- -p, -plasmids

    Optionally list plasmids as extra entries in the feature table, if
    they are annotated with a */plasmid="plasmid_name"* tag in the
    *source* primary tag

- -v, -version

    Print version number to *STDERR*
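
The two accepted forms of **-e** (a comma-separated list, or the option given several times) can be illustrated with a short Python sketch. This is not the script's Getopt::Long code: `parse_extensions` and `build_parser` are hypothetical names, stripping a leading dot is an added convenience, and the double-dash `--extensions` spelling is an argparse convention rather than the script's `-extensions`.

```python
import argparse
from typing import List, Optional

# Default list as documented for option -e.
DEFAULT_EXTENSIONS = [
    "ebl", "emb", "embl", "gb", "gbf", "gbff", "gbank", "gbk", "genbank",
]


def parse_extensions(values: Optional[List[str]]) -> List[str]:
    """Flatten repeated and/or comma-separated -e values into one list."""
    if not values:
        return list(DEFAULT_EXTENSIONS)
    exts: List[str] = []
    for value in values:
        for part in value.split(","):
            part = part.strip().lstrip(".")  # tolerate ".gbk" (assumption)
            if part:
                exts.append(part)
    return exts


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="extension option sketch")
    # action="append" collects every occurrence of -e into a list.
    parser.add_argument("-e", "--extensions", action="append")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-e", "gbf", "-e", "embl"])
    print(parse_extensions(args.extensions))
```

Both `-e gb,gbk` and `-e gb -e gbk` end up as the same list, and omitting the option falls back to the documented default.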

## Output

- *STDOUT*

    The resulting feature table is printed to *STDOUT*. Redirect or
    pipe into another tool as needed (e.g. `cut`, `grep`, or `head`).
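
A redirected table can also be loaded for further processing. The following Python sketch reads any tab-separated file into a list of rows. It makes no assumption about column names or about whether the first row is a header, since neither is specified here; `read_feature_table` is a hypothetical helper name.

```python
import csv
from typing import IO, List


def read_feature_table(handle: IO[str]) -> List[List[str]]:
    """Read tab-separated rows; every field is returned as a string."""
    # QUOTE_NONE keeps quote characters in genome descriptions literal.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]


if __name__ == "__main__":
    # feature_table.tsv is the file name used in the Synopsis.
    with open("feature_table.tsv", newline="") as fh:
        for row in read_feature_table(fh):
            print(len(row), "fields:", row[0])
```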

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Dependencies

- [BioPerl](http://www.bioperl.org) (tested version 1.006923)

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

- v0.5 (14.09.2015)
    - changed script name to `genomes_feature_table.pl`
    - included a POD
    - options with Getopt::Long
    - included `pod2usage` with Pod::Usage
    - major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes
    - changed input options to get folder path from STDIN
        - as a consequence new option **-e|-extensions**
    - accession numbers not essential anymore, changed hash key to filename; but now requires only one genome per file
    - draft genomes should include 'WGS' keyword (warning if not)
    - option **-p|-plasmids** now works correctly with complete and draft genomes
    - count plasmids without option **-p**
- v0.4 (11.08.2013)
    - included 'use autodie;' pragma
    - included version switch
- v0.3 (05.11.2012)
    - new option **-p** to report plasmid features in multi-sequence draft files separately
- v0.2 (19.09.2012)
- v0.1 (25.11.2011)
    - **original** script name: `get_genome_features.pl`