tbl2tab
=======

`tbl2tab.pl` is a script to convert tbl to tab-separated format and back.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
    * [Mandatory options](#mandatory-options)
    * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl tbl2tab.pl -m tbl2tab -i feature_table.tbl -s -l locus_prefix

**or**

    perl tbl2tab.pl -m tab2tbl -i feature_table.tab -g -l locus_prefix -p "gnl|dbname|"

## Description

NCBI's feature table (**tbl**) format is needed for the submission of genomic data to GenBank with the NCBI tools [Sequin](http://www.ncbi.nlm.nih.gov/Sequin/) or [tbl2asn](http://www.ncbi.nlm.nih.gov/genbank/tbl2asn2). tbl files can be created with automatic annotation systems like [Prokka](http://www.vicbioinformatics.com/software.prokka.shtml). `tbl2tab.pl` can convert a tbl file to a tab-separated format (tab) and back to the tbl format. The tab-delimited format makes it easier to manipulate the data in spreadsheet software (e.g. LibreOffice or MS Excel). For a conversion back to tbl format, save the file in the spreadsheet software as a tab-delimited text file. The script is intended for microbial genomes, but might also be useful for eukaryotes.

Regular expressions are applied in mode '**tbl2tab**' to correct gene names and words in '/product' values to lowercase initials (with the exception of 'Rossman' and 'Willebrand'). The resulting tab file can then be used to check for possible errors.

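The lowercase correction of '/product' values can be pictured with the following Python sketch. It is not taken from `tbl2tab.pl`: the function name `lowercase_initials` and the exact regular expression are assumptions for illustration, and the script's own expressions may treat more cases (gene names, for instance) differently.

```python
import re

# Words that keep their capital initial (the two exceptions named above).
KEEP_CAPITALIZED = {"Rossman", "Willebrand"}

# A capitalized word: one uppercase letter followed only by lowercase letters.
# Acronyms ("DNA") and mixed-case names ("DnaK") do not match this pattern.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-z]+\b")


def lowercase_initials(product):
    """Lowercase the first letter of capitalized words in a product value.

    Hypothetical helper, an approximation of the behavior described above.
    """
    def fix(match):
        word = match.group(0)
        if word in KEEP_CAPITALIZED:
            return word
        return word[0].lower() + word[1:]

    return CAPITALIZED_WORD.sub(fix, product)


print(lowercase_initials("Hypothetical Protein"))
print(lowercase_initials("Von Willebrand factor type A"))
print(lowercase_initials("DNA polymerase III"))
```

With this pattern, all-uppercase tokens and single letters are left untouched, so only ordinary capitalized words are changed.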
The first four header columns of the **tab** format are mandatory: 'seq_id' for the SeqID and, for each primary tag/feature (e.g. CDS, RNAs, repeat_region etc.), 'start', 'stop', and 'primary_tag'. These mandatory columns have to be filled in every row of the tab file. All following columns will be included as tags/qualifiers (e.g. '/locus_tag', '/product', '/EC_number', '/note' etc.) in the conversion to the tbl file if a value is present.

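Before converting an edited tab file back to tbl, it can help to check the mandatory columns. The following Python sketch is independent of `tbl2tab.pl`; the function `check_tab` and the example rows are hypothetical, and it assumes the four mandatory columns appear in the order listed above.

```python
import csv
import io

MANDATORY = ["seq_id", "start", "stop", "primary_tag"]


def check_tab(handle):
    """Return a list of problems found in a tab-separated feature file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    problems = []
    if header[:4] != MANDATORY:
        problems.append(f"header must start with {MANDATORY}, found {header[:4]}")
        return problems
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip completely empty lines
        # Pad short rows so missing trailing fields count as empty.
        fields = (row + [""] * 4)[:4]
        for name, value in zip(MANDATORY, fields):
            if not value.strip():
                problems.append(f"line {line_no}: empty mandatory column '{name}'")
    return problems


# Made-up example data: the second feature lacks a 'start' value.
example = (
    "seq_id\tstart\tstop\tprimary_tag\tlocus_tag\tproduct\n"
    "contig1\t1\t300\tCDS\tEPE_0001\thypothetical protein\n"
    "contig1\t\t900\tCDS\tEPE_0002\t\n"
)
print(check_tab(io.StringIO(example)))
```

An empty 'product' field is not reported, because only the first four columns are mandatory.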
There are three special cases:

**First**, '/pseudo' will be included as a tag if *any* value (the script uses 'T' for true) is present in the **tab** format. If a primary tag is indicated as pseudo, both the primary tag and the accessory 'gene' primary tag (for CDS/RNA features with option **-g**) will include a '/pseudo' qualifier in the resulting **tbl** file. *Pseudo-genes* are indicated by 'pseudo' in the 'primary_tag' column, thus the 'pseudo' column is ignored in these cases.

**Second**, tag '/gene_desc' is reserved for the 'product' values of pseudo-genes, thus a 'gene_desc' column in a tab file will be ignored in the conversion to tbl.

**Third**, column 'protein_id' in a tab file will also be ignored in the conversion. '/protein_id' values are created from option **-p** and the locus_tag for each CDS primary feature.

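The special cases above can be summarized in a short Python sketch that decides which qualifiers one tab row contributes. It is an illustration only, not the script's code: the function `qualifiers_for_row` and the example rows are hypothetical, the simple concatenation of prefix and locus_tag is an assumption, and the sketch does not cover how rows with 'pseudo' in the 'primary_tag' column are written.

```python
# Columns that never become qualifiers: the four mandatory columns plus the
# two columns that are ignored in the conversion to tbl.
IGNORED = {"seq_id", "start", "stop", "primary_tag", "gene_desc", "protein_id"}


def qualifiers_for_row(row, protein_id_prefix="gnl|goetting|"):
    """Return (qualifier, value) pairs for one tab row given as a dict."""
    qualifiers = []
    for column, value in row.items():
        if column in IGNORED or not value:
            continue
        if column == "pseudo":
            # Any value counts as true. Rows whose primary_tag is already
            # 'pseudo' do not use this column.
            if row["primary_tag"] != "pseudo":
                qualifiers.append(("pseudo", ""))
            continue
        qualifiers.append((column, value))
    if row["primary_tag"] == "CDS" and row.get("locus_tag"):
        # Assumption: prefix and locus_tag are simply concatenated.
        qualifiers.append(("protein_id", protein_id_prefix + row["locus_tag"]))
    return qualifiers


# Made-up rows for illustration.
cds_row = {
    "seq_id": "contig1", "start": "1", "stop": "300", "primary_tag": "CDS",
    "locus_tag": "EPE_0001", "product": "hypothetical protein",
    "pseudo": "", "gene_desc": "ignored", "protein_id": "ignored",
}
trna_row = {
    "seq_id": "contig1", "start": "400", "stop": "475", "primary_tag": "tRNA",
    "locus_tag": "EPE_0002", "product": "tRNA-Ala", "pseudo": "T",
}
for qualifier, value in qualifiers_for_row(cds_row, "gnl|dbname|"):
    print(qualifier, value, sep="\t")
```

The values in the 'gene_desc' and 'protein_id' columns of `cds_row` never reach the output; the '/protein_id' qualifier is rebuilt from the prefix and the locus_tag instead.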
Furthermore, with option **-s**, G2L-style spreadsheet formulas ([Goettingen Genomics Laboratory](http://appmibio.uni-goettingen.de/)) can be included in additional columns: 'spreadsheet_locus_tag', 'position', 'distance', 'gene_number', and 'contig_order'. These columns will not be included in a conversion to the tbl format. Thus, if you want to include e.g. the locus_tags from the formula in column 'spreadsheet_locus_tag' in the resulting tbl file, copy the *values* to the column 'locus_tag'!

To illustrate the process, two example files are included in the repository, 'example.tbl' and 'example2.tab', which are interconvertible (see "[USAGE](#usage)" below).

**Warning:** be aware of possible errors introduced by automatic format conversions in spreadsheet software like MS Excel, see e.g. Zeeberg *et al.* 2004 (http://www.ncbi.nlm.nih.gov/pubmed/15214961).

For more information regarding the feature table and the submission process see NCBI's [prokaryotic annotation guide](http://www.ncbi.nlm.nih.gov/genbank/genomesubmit) and the [bacterial genome submission guide](http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation).

## Usage

### Conversion from tbl to tab format

    perl tbl2tab.pl -m tbl2tab -i example.tbl -s -l EPE

### Conversion from tab to tbl format

    perl tbl2tab.pl -m tab2tbl -i example2.tab -g -l EPE

## Options

### Mandatory options

* -m, -mode

Conversion mode, either 'tbl2tab' or 'tab2tbl' [default = 'tbl2tab']

* -i, -input

Input tbl or tab file to be converted to the other format

### Optional options

* -h, -help

Help (perldoc POD)

* -v, -version

Print version number to *STDERR*

#### Mode *tbl2tab*

* -l, -locus_prefix

Only used in combination with option **-s**, where it is mandatory: the locus_tag prefix to include in the formula for column 'spreadsheet_locus_tag'

* -c, -concat

Concatenate values of identical tags within one primary tag with '~' (e.g. several '/EC_number' or '/inference' tags)

* -e, -empty

String used for primary features without value for a tag [default = '']

* -s, -spreadsheet

Include formulas for spreadsheet editing

* -f, -formula_lang

Syntax language of the spreadsheet formulas, either 'English' or 'German'. If you still encounter problems with the formulas, set the decimal and thousands separators manually in the options of the spreadsheet software (instead of using the operating system separators). [default = 'e']

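The effect of option **-c** can be pictured with a small Python sketch that joins the values of identical tags within one feature using '~'. The function `concat_values` and the example qualifiers are hypothetical and not taken from `tbl2tab.pl`; the sketch only shows the joining step.

```python
def concat_values(pairs, sep="~"):
    """Join the values of identical tags within one feature with `sep`.

    `pairs` is a list of (tag, value) tuples in the order they occur.
    """
    columns = {}
    for tag, value in pairs:
        columns.setdefault(tag, []).append(value)
    return {tag: sep.join(values) for tag, values in columns.items()}


# Made-up qualifiers of a single CDS with two '/EC_number' tags.
feature = [
    ("locus_tag", "EPE_0001"),
    ("EC_number", "1.1.1.1"),
    ("EC_number", "1.1.1.2"),
    ("product", "example protein"),
]
print(concat_values(feature))
```

Each tag then occupies a single column in the tab file, whatever the number of times it occurs in the feature.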
#### Mode *tab2tbl*

* -l, -locus_prefix

Prefix to the SeqID if not present already in the SeqID

* -g, -gene

Include accessory 'gene' primary tags (with '/gene', '/locus_tag' and possibly '/pseudo' tags) for 'CDS/RNA' primary tags; NCBI standard

* -t, -tags_full

Only in combination with option **-g**, include '/gene' and '/locus_tag' tags additionally in primary tag, not only in accessory 'gene' primary tag

* -p, -protein_id_prefix

Prefix for '/protein_id' tags; don't forget the double quotes around the string, otherwise the shell will interpret the '|' as a pipe [default = 'gnl|goetting|']

## Output

* \*.tab|tbl

Result file in the opposite format

* (hypo_putative_genes.txt)

Created in mode **tab2tbl**, indicates if CDSs are annotated as
'hypothetical/putative/predicted protein' but still have a gene name

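The kind of check reported in *hypo_putative_genes.txt* can be approximated in Python as follows. This is a sketch under stated assumptions, not the script's implementation: the function `suspicious_cds`, the regular expression, and the example rows (given as dicts keyed by tab column names) are all hypothetical.

```python
import re

# Products that carry no functional information.
VAGUE_PRODUCT = re.compile(
    r"\b(hypothetical|putative|predicted) protein\b", re.IGNORECASE
)


def suspicious_cds(rows):
    """Yield CDS rows that have a gene name but a vague product."""
    for row in rows:
        if (
            row.get("primary_tag") == "CDS"
            and row.get("gene")
            and VAGUE_PRODUCT.search(row.get("product") or "")
        ):
            yield row


# Made-up rows for illustration.
rows = [
    {"primary_tag": "CDS", "locus_tag": "EPE_0001", "gene": "abcA",
     "product": "hypothetical protein"},
    {"primary_tag": "CDS", "locus_tag": "EPE_0002", "gene": "",
     "product": "putative protein"},
    {"primary_tag": "CDS", "locus_tag": "EPE_0003", "gene": "abcB",
     "product": "example transporter"},
]
for row in suspicious_cds(rows):
    print(row["locus_tag"], row["gene"], row["product"], sep="\t")
```

Only the first row is reported: the second has no gene name, and the third has an informative product.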
## Run environment

The Perl script runs under Windows and UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2 (29.10.2014)
    * fixed bug: the message reporting which file was created was mixed up
    * *hypo_putative_genes.txt* now also includes 'predicted protein' annotations
    * additions and syntax changes to POD and README.md
* v0.1 (24.06.2014)