annotate COG/bac-genomics-scripts/genomes_feature_table/genomes_feature_table.pl @ 13:152d7c43478b draft default tip

Uploaded
author dereeper
date Thu, 30 May 2024 20:07:55 +0000
parents e42d30da7a74
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 #!/usr/bin/perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 # POD #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 =pod
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 =head1 NAME
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 C<genomes_feature_table.pl> - create a feature table for genomes in EMBL and GENBANK format
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13 =head1 SYNOPSIS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15 C<perl genomes_feature_table.pl path/to/genome_dir E<gt> feature_table.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17 =head1 DESCRIPTION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19 A genome feature table lists basic stats/info (e.g. genome size, GC
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20 content, coding percentage, accession number(s)) and the numbers of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 annotated primary features (e.g. CDS, genes, RNAs) of genomes. It
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 can be used to have an overview of these features in different
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23 genomes, e.g. in comparative genomics publications.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 C<genomes_feature_table.pl> is designed to extract (or calculate)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 these basic stats and B<all> annotated primary features from RichSeq
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27 files (B<EMBL> or B<GENBANK> format) in a specified directory (with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 the correct file extension, see option B<-e>). The B<default> directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29 is the current working directory. The primary features are counted
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 and the results for each genome printed in tab-separated format. It
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31 is a requirement that each file contains B<only one> genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 (complete or draft, with or without plasmids).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 The most important features will be listed first, like genome
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35 description, genome size, GC content, coding percentage (calculated
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 based on non-pseudo CDS annotation), CDS and gene numbers, accession
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37 number(s) (first..last in the sequence file), RNAs (rRNA, tRNA,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38 tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39 annotated in a sequence file, the number of plasmids are counted
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40 and listed as well (needs a I</plasmid="plasmid_name"> tag in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 I<source> primary tag, see e.g. Genbank accession number
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42 L<CP009167|http://www.ncbi.nlm.nih.gov/nuccore/CP009167>). Use
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 option B<-p> to list plasmids as separate entries (lines) in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44 feature table.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46 For draft genomes the number of contigs/scaffolds are counted. All
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47 contigs/scaffolds of draft genomes should be marked with the I<WGS>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48 keyword (see e.g. draft NCBI Genbank entry
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49 L<JSAY00000000|http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50 If this is not the case for your file(s) you can add those keywords
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51 to each sequence entry with the following Perl one-liners (will edit
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52 files in place). For files in B<GENBANK> format if 'KEYWORDS .' is
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53 present
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 C<perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 or if 'KEYWORDS' isn't present at all
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59 C<perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS WGS.\n";} else{ print;}' file>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 For files in B<EMBL> format if 'KW .' is present
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63 C<perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65 or if 'KW' isn't present at all
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67 C<perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW WGS.\n";} else{ print;}' file>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69 =head1 OPTIONS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73 =item B<-h>, B<-help>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75 Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77 =item B<-e>=I<str>, B<-extensions>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79 File extensions to include in the analysis (EMBL or GENBANK format),
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80 either comma-separated list or multiple occurences of the option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81 [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83 =item B<-p>, B<-plasmids>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85 Optionally list plasmids as extra entries in the feature table, if
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86 they are annotated with a I</plasmid="plasmid_name"> tag in the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87 I<source> primary tag
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89 =item B<-v>, B<-version>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91 Print version number to C<STDERR>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95 =head1 OUTPUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99 =item C<STDOUT>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 The resulting feature table is printed to C<STDOUT>. Redirect or
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102 pipe into another tool as needed (e.g. C<cut>, C<grep>, or C<head>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106 =head1 EXAMPLES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110 =item C<perl genomes_feature_table.pl -p -e gb,gbk E<gt> feature_table_plasmids.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112 =item C<perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl E<gt> feature_table.tsv>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
114 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
115
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
116 =head1 DEPENDENCIES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
117
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
118 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
119
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
120 =item B<L<BioPerl|http://www.bioperl.org>)>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
121
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
122 Tested with BioPerl version 1.006923
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
123
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
124 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
125
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
126 =head1 VERSION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
127
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
128 0.5 update: 14-09-2015
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
129 0.1 25-11-2011
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
130
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
131 =head1 AUTHOR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
132
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
133 Andreas Leimbach aleimba[at]gmx[dot]de
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
134
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
135 =head1 LICENSE
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
136
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
137 This program is free software: you can redistribute it and/or modify
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
138 it under the terms of the GNU General Public License as published by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
139 the Free Software Foundation; either version 3 (GPLv3) of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
140 License, or (at your option) any later version.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
141
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
142 This program is distributed in the hope that it will be useful, but
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
143 WITHOUT ANY WARRANTY; without even the implied warranty of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
144 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
145 General Public License for more details.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
146
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
147 You should have received a copy of the GNU General Public License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
148 along with this program. If not, see L<http://www.gnu.org/licenses/>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
149
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
150 =cut
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
151
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
152
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
153 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
154 # MAIN #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
155 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
156
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
157 use warnings;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
158 use strict;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
159 use autodie;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
160 use Getopt::Long;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
161 use Pod::Usage;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
162 use Bio::SeqIO; # BioPerl module to handle sequence input/output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
163 use Bio::SeqFeatureI; # BioPerl module to handle features in a sequence
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
164
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
165 ### Get the options with Getopt::Long
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
166 my @Extensions; # optionally give list of file-extensions to search in (otherwise defaults will be set below); comma-separated or several occurences of the option
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
167 my $Opt_Plasmids; # optionally list plasmids in multi-seq files as extra entries in final feature table
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
168 my $VERSION = 0.5;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
169 my ($Opt_Version, $Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
170 GetOptions ('extensions=s' => \@Extensions,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
171 'plasmids' => \$Opt_Plasmids,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
172 'version' => \$Opt_Version,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
173 'help|?' => \$Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
174
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
175
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
176
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
177 ### Run perldoc on POD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
178 pod2usage(-verbose => 2) if ($Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
179 die "$0 $VERSION\n" if ($Opt_Version);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
180 warn "\n### Warning:\nToo many arguments given, only the path to the directory with the RichSeq files (EMBL or GENBANK format) is needed!\n" if (@ARGV > 1);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
181 my $Genome_Dir = '.'; # directory with RichSeq genome files in EMBL or GENBANK format; default current directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
182 $Genome_Dir = shift if (@ARGV);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
183 if (!-d $Genome_Dir) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
184 my $warning = "\n### Fatal error: Directory '$Genome_Dir' does not exist: $!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
185 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
186 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
187
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
188
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
189
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
190 ### Get optional file-extensions or set defaults
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
191 if (@Extensions) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
192 @Extensions = split(/,/,join(',', @Extensions)); # split (potential comma-separated lists) and join (potential several occurences) of file-extensions
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
193 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
194 @Extensions = qw(ebl emb embl gb gbf gbff gbank gbk genbank); # default file extensions
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
195 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
196 my $Extensions = join('|', @Extensions); # join for regex below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
197
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
198
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
199
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
200 ### Save all primary features from all seq-files (to include all possibilities) and count them for each genome/replicon individually
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
201 my $File_Count = 0; # Count number of files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
202 my %Strain_Features; # hash of hash; store all primary features of each strain, counted; additional genome/plasmid description (DEFINITION line), sequence length, gc_content, coding percentage, accession numbers, contigs/scaffolds (for drafts), unresolved bases ('n/N's), and plasmids (without option '-p')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
203
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
204 opendir (my $Dir_Fh, $Genome_Dir);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
205 while (my $file = readdir($Dir_Fh)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
206 if ($file =~ /.+\.($Extensions)$/) { # only include lines with given files extensions
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
207 $File_Count++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
208 my $Id; # filename used as unique ID (optionally with appended plasmid name) for hashes; need to declare here for checks after 'while next_seq' loop
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
209
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
210 my $seqio_obj = Bio::SeqIO->new(-file => "<$Genome_Dir/$file");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
211 die "\n### Fatal error: File '$file' is not a RichSeq file in EMBL or GENBANK format!\n" if (ref($seqio_obj) !~ /Bio\:\:SeqIO\:\:[genbank|embl]/); # check if correct file format; Bio::SeqIO::genbank or Bio::SeqIO::embl object
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
212
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
213 my %concat_seq; # store the concatenated sequences for all replicons or contigs/scaffolds of a genome with key '$Id'; only GC-content and unresolved Ns saved in '%Strain_Features', thus extra hash declared here to remove content and save memory for each file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
214 my %coding_bases; # base count to subsequently calculate coding percentage with key '$Id'; only coding percentage saved in '%Strain_Features', thus extra hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
215 my %plasmid_count; # count the number of plasmids for current genome; hash with key as plasmid name needed for draft plasmids with SEVERAL contigs for ONE plasmid (with the same plasmid name of course) in one multi seq-file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
216 my $desc; # '$seq_obj->desc' in case of only-plasmid seq-file but not option '-p' use the (last) plasmid description for feature table
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
217
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
218 my $seq_count = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
219 while (my $seq_obj = $seqio_obj->next_seq) { # for multi-seq files with several seq entries (e.g. several replicons or contigs/scaffolds)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
220 $seq_count++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
221 $Id = $file; # set unique $Id, based on filename; has to be set for every seq entry in case of plasmids and option '-p' (see below)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
222 warn "\n### Warning:\nMore than 10 sequence entries in file '$file', but no 'WGS' keyword found. Sure this is not a draft genome? Otherwise please see the help with option '-h'!\n\n" if ($seq_count == 10 && !$Strain_Features{$Id}{'contigs_scaffolds'});
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
223
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
224 my ($plasmid) = grep { $_->primary_tag eq 'source' && $_->has_tag('plasmid') } $seq_obj->get_SeqFeatures; # check if current seq entry is a plasmid, needs '/plasmid="plasmid_name"' tag in primary tag 'source'; $plasmid contains now a primary tag 'source' feat_obj
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
225 ($plasmid) = $plasmid->get_tag_values('plasmid') if ($plasmid); # get plasmid name; replace the 'feat_obj' in $plasmid with the tag value of '/plasmid'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
226 $plasmid_count{$plasmid} = 1 if ($plasmid); # count number of plasmids
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
227
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
228 $Id .= $plasmid if ($Opt_Plasmids && $plasmid); # for plasmids with option '-p' concatenate $Id with plasmid name
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
229
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
230 $desc = $seq_obj->desc;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
231 $desc =~ s/(\s*(DNA|chromosome)*, complete genome\.*|\s*(chromosome)*, complete sequence\.*|\s*complete genome, strain.*|\s*, whole genome shotgun sequence\.*|\.)$// if ($desc); # shorten strain description
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
232 $Strain_Features{$Id}{'desc'} = $desc if ((!$Opt_Plasmids && !$plasmid) || $Opt_Plasmids); # don't use a plasmid description without '-p'; with '-p' take all descs; overwrite until last seq entry
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
233
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
234 $Strain_Features{$Id}{'contigs_scaffolds'}++ if ($seq_obj->keywords =~ /WGS/); # count contigs/scaffolds and indicate that draft
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
235
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
236 if (!$concat_seq{$Id}) { # first seq entry for $Id in potential multi-seq file ('!$Strain_Features{$Id}{'length'}' or '!$Strain_Features{$Id}{'first_acc'}' can also be used)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
237 $Strain_Features{$Id}{'first_acc'} = $seq_obj->accession_number;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
238 print STDERR "\nProcessing ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
239 if ($Strain_Features{$Id}{'contigs_scaffolds'}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
240 print STDERR "draft contigs/scaffolds ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
241 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
242 print STDERR "complete replicon(s) ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
243 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
244 print STDERR "in '$file': $desc; $Strain_Features{$Id}{'first_acc'}\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
245 $Strain_Features{$Id}{'length'} = $seq_obj->length;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
246 $concat_seq{$Id} = $seq_obj->seq; # save the sequence to concatenate to potential following seq entries of same $Id in multi-seq files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
247 $coding_bases{$Id} = 0; # need to declare for addition below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
248 } else { # further contigs/scaffolds of drafts or multi-seq complete genome without option '-p'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
249 $Strain_Features{$Id}{'last_acc'} = $seq_obj->accession_number; # give the range of acc#s for multi-seq files; overwrite until last sequence
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
250 $Strain_Features{$Id}{'length'} += $seq_obj->length; # add all previous lengths to the current for '$Id'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
251 $concat_seq{$Id} .= $seq_obj->seq; # concatenate all sequences for '$Id'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
252 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
253
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
254
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
255 ($Strain_Features{$Id}{'gc'}, $Strain_Features{$Id}{'unresolved_n'}) = gc_content($concat_seq{$Id}); # subroutine to calculate the GC-content and count 'N's in seq; unfortunately has to be calculated every time (don't know what is last seq entry for each '$Id' or plasmid/chromosome etc. in file)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
256
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
257
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
258 foreach my $feat_obj ($seq_obj->get_SeqFeatures) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
259 if ($feat_obj->primary_tag eq 'source') { # exclude source primary tag, has no feature annotation
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
260 next;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
261 } elsif ($feat_obj->primary_tag eq 'gene') { # count 'gene' primary tags
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
262 $Strain_Features{$Id}{'gene'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
263 $Strain_Features{$Id}{'pseudo_gene'}++ if ($feat_obj->has_tag('pseudo'));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
264 } elsif ($feat_obj->primary_tag eq 'CDS') {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
265 $Strain_Features{$Id}{'CDS'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
266 if ($feat_obj->has_tag('pseudo')) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
267 $Strain_Features{$Id}{'pseudo_CDS'}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
268 } elsif (!$feat_obj->has_tag('pseudo')) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
269 $coding_bases{$Id} = ($feat_obj->location->end - $feat_obj->location->start) + 1 + $coding_bases{$Id}; # coding_perc only calculated based on non-pseudo CDS annotation
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
270 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
271 } else { # get ALL the other primary tags, shouldn't have 'pseudo' tags
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
272 $Strain_Features{$Id}{$feat_obj->primary_tag}++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
273 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
274 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
275
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
276 $Strain_Features{$Id}{'coding_perc'} = ($coding_bases{$Id}/$Strain_Features{$Id}{'length'})*100;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
277 $Strain_Features{$Id}{'coding_perc'} = sprintf("%.2f", $Strain_Features{$Id}{'coding_perc'}); # round percentage to two decimals
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
278 $Strain_Features{$Id}{'coding_perc'} = 100 if ($Strain_Features{$Id}{'coding_perc'} > 100); # > 100 can happen with very small replicons/contigs having overlapping CDSs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281 $Strain_Features{$Id}{'plasmids'} = keys %plasmid_count if (!$Opt_Plasmids && keys %plasmid_count > 0); # number of plasmids in the current genome, only needed without option '-p'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282 $Strain_Features{$Id}{'desc'} = $desc if (!$Strain_Features{$Id}{'desc'}); # in case of plasmid-only seq-files and not option '-p'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284 #die "\n### Fatal error:\nFile '$file' contains a complete genome with several replicons. However, more than one replicon (which should only be the chromosome) doesn't have a '/plasmid=\"plasmid_name\"' tag in primary tag 'source'.\nIf it is a complete genome include these tags, or if it is a draft genome include the 'WGS' keyword in ALL contigs/scaffolds as described in the help (option '-h')!\n" if ($seq_count - keys %plasmid_count > 1 && !$Strain_Features{$Id}{'contigs_scaffolds'}); # too strict? Also, overlaps with warn 'Warning:\nMore than 10 sequence' above!
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285 #die "\n### Fatal error:\nFile '$file' contains a draft genome with only one contig.\nIf it is a complete genome remove the 'WGS' keyword from the only present replicon (see '-h')!\n" if ($seq_count == 1 && $Strain_Features{$Id}{'contigs_scaffolds'}); # too strict? e.g. uncircularized single-contig draft seq files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288 closedir $Dir_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289 die "\n### Fatal error: No EMBL or GENBANK format files found!\n" if (!$File_Count);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293 ### Get all annotated primary features present in all files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294 my %Primary_Features;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295 foreach my $id (keys %Strain_Features) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296 foreach (keys %{$Strain_Features{$id}}) { # de-references hash of hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297 $Primary_Features{$_} = 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303 ### Print header of output with all existent primary features
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304 print "# name (last contig/scaffold for drafts)\tsize [bp]\tGC-content [%]\tcoding percentage [%]\tCDS (pseudo)\tgenes (pseudo)\trRNA\ttRNA\ttmRNA\tncRNA\taccession (first..last)\tcontigs/scaffolds\tunresolved bases (Ns)\t"; # specific print order for the most common primary features, '#' indicates comment line
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305 my $Common_Primary_Feats = 'desc|length|gc|coding_perc|CDS|pseudo_CDS|gene|pseudo_gene|rRNA|tRNA|tmRNA|ncRNA|first_acc|last_acc|contigs_scaffolds|unresolved_n'; # the common primaries for the regexs and 'split' for print out below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307 my $Common_Primary_Count = grep(/^($Common_Primary_Feats)$/, keys %Primary_Features); # count occurence of common primary features (not all present in all seq entries) for tab or newline print below
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308 my $i = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309 foreach (sort keys %Primary_Features) { # print the residual primary features
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310 if ($_ =~ /^($Common_Primary_Feats)$/) { # exclude the print order ones
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311 next;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313 $i++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314 print "$_";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
315 print "\t" if ($i < (keys %Primary_Features) - $Common_Primary_Count); # don't print tab for last element (but newline below)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
316 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
317 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
318 print "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
319
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
320 print "# NA = tag, key, or qualifier not existent\n"; # comment line
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
321
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
322
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
323
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
324 ### Print the primary features for each strain
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
325 foreach my $id (sort {$Strain_Features{$a}{'desc'} cmp $Strain_Features{$b}{'desc'}} keys %Strain_Features) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
326 foreach (split(/\|/, $Common_Primary_Feats)) { # split '$Common_Primary_Feats' string to print according to print order above
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
327 print " (" if (/^pseudo/); # 'pseudo_CDS' and 'pseudo_gene' are printed in parantheses
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
328 if ($Strain_Features{$id}{$_}) { # print value only if defined
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
329 print ".." if (/last_acc/); # for 'first_acc..last_acc'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
330 if ($Strain_Features{$id}{$_} =~ /unknown/) { # BioPerl returns 'unknown' for non-existent acc-nr
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
331 print "NA";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
332 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
333 print "$Strain_Features{$id}{$_}";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
334 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
335 } else { # feature not defined for this '$Id'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
336 next if ($_ =~ /last_acc/); # skip 'last_acc' if not defined (otherwise will print 'NA' for it)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
337 print "NA";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
338 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
339 print ")" if (/^pseudo/); # 'pseudo_CDS' and 'pseudo_gene' are printed in parantheses
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
340 print "\t" unless (($_ =~ /^(CDS|gene)$/) || ($_ =~ /first_acc/ && $Strain_Features{$id}{'last_acc'})); # CDS and gene are followed by parantheses for possible pseudos, 'first_acc' might be followed by '..last_acc'; thus no tab for these
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
341 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
342
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
343 $i = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
344 foreach (sort keys %Primary_Features) { # print the residual primary features
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
345 if ($_ =~ /^($Common_Primary_Feats)$/) { # exclude the above ones
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
346 next;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
347 } elsif ($Strain_Features{$id}{$_}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
348 $i++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
349 print "$Strain_Features{$id}{$_}";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351 $i++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352 print "NA";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
354 print "\t" if ($i < (keys %Primary_Features) - $Common_Primary_Count); # don't print tab for last element
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
355 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
356
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
357 print "\n"; # newline for each line in the feature table
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
358 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
359
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
360 print STDERR "\n$File_Count RichSeq file(s) was/were processed!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
361
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
362 exit;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
363
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
364
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
365
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
366 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
367 # Subroutines #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
368 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
369
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
370 ### Subroutine to calculate GC-content
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
371 sub gc_content {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
372 my $seq = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
373 my $a = ($seq =~ tr/[aA]//); # transliterations don't accept modifiers like case-insensitive 'i'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
374 my $c = ($seq =~ tr/[cC]//);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
375 my $g = ($seq =~ tr/[gG]//);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
376 my $t = ($seq =~ tr/[tT]//);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
377 my $n = ($seq =~ tr/[nN]//);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
378 my $gc_content = (($c + $g)/($a + $c + $g + $t))*100;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
379 $gc_content = sprintf("%.2f", $gc_content); # round percentage to two decimals
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
380 return ($gc_content, $n);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
381 }