annotate COG/bac-genomics-scripts/rod_finder/blast_rod_finder.pl @ 9:24676ef2945d draft

Uploaded
author dereeper
date Thu, 30 May 2024 16:18:04 +0000
parents e42d30da7a74
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 #!/usr/bin/perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 # POD #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 =pod
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 =head1 NAME
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 blast_rod_finder.pl 07-11-2011
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13 =head1 SYNOPSIS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15 C<perl blast_rod_finder.pl -q query.embl -r blastn.out -m 2000>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17 =head1 DESCRIPTION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19 This script is intended to identify region of differences (RODs)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20 between a nucleotide query and a nucleotide subject/reference sequence.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 In order to do so, a BLASTN (L<http://blast.ncbi.nlm.nih.gov/Blast.cgi>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 needs to be performed beforehand with the query and the subject sequences.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23 I<blast_rod_finder.pl> is mainly designed to work with bacterial genomes,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24 while a query genome can be blasted against several subject sequences to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 detect RODs over a number of references. Although the results are
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 optimized towards a complete query genome, both the reference(s) as well
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27 as the query can be used in draft form. To create artificial genomes use
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 I<cat_seq.pl> or the EMBOSS application union (L<http://emboss.sourceforge.net/>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 The BLASTN report file, the query sequence file (preferably in embl or
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31 genbank format with annotation, see below) and a minimum size for ROD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 detection have to be provided for I<blast_rod_finder.pl> to run.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33 Subsequently, RODs are summarized in a tab-separated result file, a gff3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 (usable e.g. in Artemis/DNAPlotter,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35 L<http://www.sanger.ac.uk/resources/software/artemis/>) and a BRIG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 (BLAST Ring Image Generator, L<http://brig.sourceforge.net/>) output file.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37 Nucleotide sequences of each ROD are written to a multi-fasta file.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39 The query sequence can be provided in embl or genbank format, but has to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40 correspond to the fasta file used in querying the BLAST database (the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 accession numbers have to correspond to the fasta headers). Use
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42 I<seq_format-converter.pl> to create a corresponding fasta file from
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 embl|genbank files for BLASTN if needed. With annotated query files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44 additional info is given in the result summary and the amino acid
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45 sequences of all non-pseudo CDSs, which are contained or overlap a ROD,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46 are written to a result file. Furthermore, all detected RODs are saved in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47 individual sequence files in the corresponding query sequence format.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49 Run BLASTN and the script I<blast_rod_finder.pl> manually or use the bash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50 shell scripts I<blast_rod_finder*.sh> (see examples below) to perform the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51 pipeline consecutively in one folder.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53 The Perl script runs on BioPerl (L<http://www.bioperl.org>).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 =head1 OPTIONS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 =head2 Mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59 =over 21
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 =item B<-m>=I<int>, B<-min>=I<int> Minimum size of RODs that are reported
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63 =item B<-q>=I<str>, B<-query>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65 Query sequence file [fasta, embl, or genbank format]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67 =item B<-r>=I<str>, B<-report>=I<str> BLASTN report/output file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 =head2 Optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75 =item B<-h>, B<-help> Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79 =head1 OUTPUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81 =over 17
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83 =item F<./results>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85 All output files are stored in this result folder
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87 =item F<rod_summary.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89 Summary of detected ROD regions (for embl/genbank queries includes annotation), tab-separated
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91 =item F<rod.gff>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93 GFF3 file with ROD coordinates to use in Artemis/DNAPlotter etc.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95 =item F<rod_BRIG.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97 ROD coordinates to use in BRIG (BLAST Ring Image Generator), tab-separated
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99 =item F<rod_seq.fasta>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 Nucleotide sequences of ROD regions (>ROD# size start..stop), multi-fasta
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103 =item (F<rod_aa_fasta.txt>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105 Only present if query is given with annotation, i.e. embl|genbank format
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107 Amino acid sequences of all CDSs that are contained in or overlap a ROD region
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108 in multi-fasta format (>locus_tag gene product). RODs are seperated in the file via
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109 '~~' (~~ROD# size start..stop).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111 =item (F<ROD#.embl|gbk>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113 Only present if query is given with annotation, i.e. embl|genbank format.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
114
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
115 Each identified ROD is written to an individual sequence file (in the same format
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
116 as the query).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
117
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
118 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
119
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
120 =head1 EXAMPLES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
121
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
122 =head2 Legacy BLASTN
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
123
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
124 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
125
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
126 =item C<formatdb -p F -i subject.fasta -n blast_db>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
127
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
128 =item C<blastall -p blastn -d blast_db -i query.fasta -o blastn.out -e 2e-11 -F F>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
129
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
130 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
131
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
132 =head2 BLASTN+
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
133
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
134 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
135
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
136 =item C<makeblastdb -in subject.fasta -input_type fasta -dbtype nucl -out blast_db>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
137
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
138 =item C<blastn -db blast_db -query query.fasta -out blastn.out -evalue 2e-11 -soft_masking false>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
139
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
140 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
141
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
142 =head2 blast_rod_finder.pl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
143
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
144 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
145
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
146 =item C<perl blast_rod_finder.pl -q query.embl|gbk|fasta -r blastn.out -m 5000>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
147
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
148 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
149
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
150 =head2 All-in-one with unix bash-shell scripts
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
151
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
152 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
153
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
154 =item C<./blast_rod_finder_legacy.sh subject.fasta query.fasta query.embl|gbk|fasta 5000>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
155
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
156 =item C<./blast_rod_finder_plus.sh subject.fasta query.fasta query.embl|gbk|fasta 5000>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
157
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
158 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
159
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
160 =head1 VERSION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
161
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
162 0.4 update: 13-02-2013
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
163
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
164 =head1 AUTHORS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
165
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
166 Andreas Leimbach aleimba[at]gmx[dot]de
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
167 David Studholme D[dot]J[dot]Studholme[at]exeter[dot]ac[dot]uk
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
168
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
169 =head1 LICENSE
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
170
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
171 This program is free software: you can redistribute it and/or modify
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
172 it under the terms of the GNU General Public License as published by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
173 the Free Software Foundation; either version 3 (GPLv3) of the License,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
174 or (at your option) any later version.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
175
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
176 This program is distributed in the hope that it will be useful, but
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
177 WITHOUT ANY WARRANTY; without even the implied warranty of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
178 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
179 General Public License for more details.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
180
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
181 You should have received a copy of the GNU General Public License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
182 along with this program. If not, see L<http://www.gnu.org/licenses/>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
183
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
184 =cut
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
185
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
186
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
187 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
188 # MAIN #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
189 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
190
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
191 use strict;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
192 use warnings;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
193 use Getopt::Long; # module to get options from the command line
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
194 use Bio::SearchIO; # bioperl module to handle blast reports
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
195 use Bio::SeqIO; # bioperl module to handle sequence input/output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
196 use Bio::Seq; # bioperl module to handle sequences with features
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
197 use Bio::SeqFeatureI; # bioperl module to handle features in a sequence
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
198 use Bio::SeqUtils; # bioperl module with additional methods (including features) for Bio::Seq objects (e.g. revcom, truncate, concatenate)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
199
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
200
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
201 ### Get the options with Getopt::Long, works also abbreviated and with two "--": -r, --r, -report ...
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
202 my $usage = "perldoc $0";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
203 my $blast_report = ''; # blastn report/output file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
204 my $q_file = ''; # query sequence file to get the RODs and possible including annotations
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
205 my $minimum_size = ''; # minimum size of uncovered regions (RODs) that are reported
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
206 my $help = ''; # run perldoc on the POD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
207 GetOptions ('report=s' => \$blast_report, 'query=s' => \$q_file, 'min=s' => \$minimum_size, 'help' => \$help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
208
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
209
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
210 ### Run perldoc on POD if -h|--h|--help is given as argument or arguments are not given
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
211 if (!$blast_report || !$q_file || !$minimum_size) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
212 die system($usage);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
213 } elsif ($help) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
214 die system($usage);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
215 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
216
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
217
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
218 ### Parse the blast report/output file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
219 my %q_covered; # stores for each seq position (key) of a query accession if it falls within a hsp hit (value set to 1)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
220 my @blast_q_acc; # store query accessions to test if identical in blast-parse and query-seq (seqio below)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
221 my $blast_db; # add to GFF3 print out
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
222 my $parser = new Bio::SearchIO(-file => "<$blast_report", -format => 'blast'); # Bio::SearchIO object
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
223 $|++; # turn on autoflush, forces STDOUT to flush right away
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
224 print "\nParsing blast report file."; # status display (with autoflush)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
225 while(my $result = $parser->next_result) { # several query sequences possible (result = entire analysis for a single query seq)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
226 my $query_acc = $result->query_accession;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
227 $query_acc =~ s/\.$//; # rm a '.' if present at the end of the string (for non-NCBI fasta headers)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
228 $query_acc =~ s/\.\d$//; # rm version nr. from NCBI query acc (to fit it to the acc.-nr. from Bio::SeqIO below)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
229 push(@blast_q_acc, $query_acc);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
230 $blast_db = $result->database_name;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
231 chop $blast_db; # always an empty character after the name?
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
232 while (my $hit = $result->next_hit) { # several subject sequences in the database might have hits to a single query
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
233 print '.'; # status display
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
234 while(my $hsp = $hit->next_hsp) { # each hit might have one or more hsps (the alignments shown in a blast report)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
235 my ($query_start, $query_end) = $hsp->range('query'); # query hsp start/stop coords
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
236 foreach my $i ($query_start .. $query_end) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
237 $q_covered{$query_acc}{$i}++; # changes for each hsp hit the value from 'undef' to '1'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
238 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
239 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
240 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
241 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
242
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
243
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
244 ### Read the query sequence file into RAM, for Bio::Seq::RichSeq files (e.g. embl/genbank) also read features
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
245 my %q_seqobj; # combine all seqobj (values; for multi-fasta/embl/gbk queries) to acc.-nr.s (keys )
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
246 my %features; # stores all features of a query RichSeq file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
247 my $multi = 0; # test if multi-seq query (multi-fasta/embl/genbank) >= 2; and counter
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
248 my $seqio_obj = Bio::SeqIO->new(-file => "<$q_file"); # no '-format' to leave it to bioperl guessing
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
249 print "\nParsing query sequence file and potential annotation."; # status display (with autoflush)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
250 while (my $seq_obj = $seqio_obj->next_seq) { # Bio::Seq object, query might be multi-fasta/embl/genbank
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
251 print '.'; # status display
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
252 my $query_acc;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
253 if (ref($seqio_obj) =~ m/\:\:fasta$/i) { # $seqio_obj blessed, thus ref returns package name (here= Bio::SeqIO::fasta) from bioperl guessing
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
254 $query_acc = $seq_obj->display_id; # method '->accession_number' doesn't work with fasta files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
255 $query_acc =~ s/gi\|\d+\|(emb|gb|dbj|ref)\|(.+)\|/$2/; # get acc. nr. from NCBI fasta ID lines
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
256 $query_acc =~ s/\.\d$//; # rm version nr., to fit it to blast acc. nr. above
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
257 } else { # Bio::Seq::RichSeq file (embl, genbank) inherited from Bio::SeqIO::embl/genbank
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
258 $query_acc = $seq_obj->accession_number;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
259 @{$features{$query_acc}} = $seq_obj->get_all_SeqFeatures; # store all Bio::SeqFeatureI objects in anonymous array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
260 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
261 if ($query_acc ne $blast_q_acc[$multi]) { # check if acc.s fit; SAME order of acc.s in blast-report and seq-obj
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
262 die "\n\n###Fatal error:\nBlast query accession \'$blast_q_acc[$multi]\' doesn't fit to query sequence accession \'$query_acc\'!\n\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
263 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
264 $q_seqobj{$query_acc} = $seq_obj;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
265 $multi++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
266 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
267
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
268
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
269 ### Get the contiguous unaligned regions of the query seq(s)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
270 my %uncovered_regions; # stores coords of uncovered regions (RODs)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
271 print "\nLooking for RODs."; # status display (with autoflush)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
272 foreach my $acc (keys %q_seqobj) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
273 print "."; # status display
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
274 my $start;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
275 my $seq = $q_seqobj{$acc}->seq; # Bio::SeqIO object method 'seq'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
276 my $previous_state = 1; # begin at base #1 of query seq
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
277 foreach my $i (1 .. length($seq)) { # loop through the whole query seq and look for undef positions
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
278 my $current_state = $q_covered{$acc}{$i};
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279 if (!defined $current_state and defined $previous_state) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280 # warn "We have just entered an unaligned region $acc: $i\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281 $start = $i;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282 } elsif (defined $current_state and !defined $previous_state) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283 # warn "We have just exited an unaligned region $acc: ", $i-1, "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284 my $end = $i - 1; # position before current_state was undef
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285 my $length = $end - $start + 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286 push @{$uncovered_regions{$length}{$acc}}, $start; # anonymous array in hash-in-hash data structure
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287 undef $start;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288 } elsif ($i == length($seq) and !defined $previous_state) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289 # warn "We have just exited an unaligned region at end of contig $acc: $i\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290 my $end = $i;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291 my $length = $end - $start + 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292 push @{$uncovered_regions{$length}{$acc}}, $start;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293 undef $start;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295 $previous_state = $current_state;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300 ### Create results directory, where output files are written to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301 my $out_dir = './results/';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302 if (-e $out_dir) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303 print "\n\n###Directory \'$out_dir\' already exists! Replace the directory and all its contents? [y|n] ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304 my $user_ask = <STDIN>;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305 if ($user_ask =~ /y/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306 unlink glob $out_dir."*"; # remove all files in results directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307 rmdir $out_dir; # remove the empty directory
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309 die "Script abborted!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312 mkdir $out_dir or die "Can't create directory \"$out_dir\": $!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
315 ### List the longest uncovered regions and print results in output files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
316 my $rod_result = 'rod_summary.txt'; # overview of parsed ROD results and possible ROD annotations
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
317 open(ROD, ">$out_dir"."$rod_result") or die "Failed to create $rod_result file: $!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
318 my $seq_out = 'rod_seq.fasta'; # multi-fasta nucleotide file of ROD seqs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
319 open(SEQ, ">$out_dir"."$seq_out") or die "Failed to create $seq_out file: $!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
320 my $gff_out = 'rod.gff'; # GFF3 file of ROD regions for use in artemis/DNAPlotter etc.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
321 open(GFF, ">$out_dir"."$gff_out") or die "Failed to create $gff_out file: $!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
322 my $brig_out = 'rod_BRIG.txt'; # tab-separated ROD file to load into BRIG
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
323 open (BRIG, ">$out_dir"."$brig_out") or die "Failed to create $brig_out file: $!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
324 print ROD "$q_file regions with no blast hits (RODs) in $blast_report\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
325 print ROD "Rank\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
326 if ($multi >=2) { # for multi-seq query files include acc. nr. column
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
327 print ROD "Accession\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
328 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
329 print ROD "Length\tROD position";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
330 my $cds_out; # only for RichSeq query files (embl/genbank)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
331 if (ref($seqio_obj) !~ m/\:\:fasta$/i) { # query NOT a fasta file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
332 print ROD "\tFeature primary_tag\tFeature locus_tag\tgene\tFeature product\tFeature position";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
333 $cds_out = 'rod_aa_fasta.txt'; # CDS amino acid sequence output file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
334 open(CDS, ">$out_dir"."$cds_out") or die "Failed to create $cds_out file: $!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
335 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
336 print ROD "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
337 print GFF "##gff-version 3\n"; # header, shows it's a gff version 3 file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
338 print GFF "#$q_file regions with no blast hits (RODs) in $blast_report\n"; # comment line for description
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
339 print BRIG "#Start\tStop\tLabel\n"; # column headers (commented)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
340
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
341 print "\nPreparing results."; # status display (with autoflush)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
342 my $rank = 0; # number of ROD regions
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
343 foreach my $length (sort {$b<=>$a} keys %uncovered_regions) { # sort RODs from large length to small length
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
344 if ($length >= $minimum_size) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
345 print "."; # status display
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
346 foreach my $acc (sort keys %{$uncovered_regions{$length}}) { # de-reference hash-in-hash, sort by acc. nr.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
347 foreach my $start (sort @{$uncovered_regions{$length}{$acc}}) { # de-reference anonymous array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
348 my $end = $start + $length - 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
349 my $pos = "$start..$end";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350 $rank++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351 print GFF "$acc\tBLASTN\tsequence_difference\t$start\t$end\t.\t+\t.\tName=ROD$rank;Target=$blast_db;color=2\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352 print BRIG "$start\t$end\tROD$rank\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353 print ROD "$rank\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
354 if ($multi >=2) { # for multi-seq query files include acc. nr.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
355 print ROD "$acc\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
356 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
357 print ROD "$length\t$pos";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
358 print SEQ ">ROD$rank $length $pos\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
359 print SEQ $q_seqobj{$acc}->subseq($start,$end), "\n\n"; # subseq method of Bio::Seq object
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
360
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
361 ### fill ROD feature columns in 'rod_result_summary.txt', print CDS aa seqs to 'rod_aa_fasta.txt' for RichSeq files,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
362 ### and print each ROD as a separate sequence file (with Bio::Sequtils->'trunc_with_features')
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
363 if (ref($seqio_obj) !~ m/\:\:fasta$/i) { # RichSeq query file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
364 if ($rank >1) { # blank line in front of the next ROD aa seq block, if it's not the first block
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
365 print CDS "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
366 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
367 my $truncseq = Bio::SeqUtils->trunc_with_features($q_seqobj{$acc}, $start, $end); # truncate current Bio::Seq object to coordinates
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
368 my $seqout; # variable to hold Bio::SeqIO object
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
369 if (ref($seqio_obj) =~ /genbank/) { # print the truncated ROD sequences in the same '-format' as $q_file; Bio::SeqIO::genbank object
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
370 $seqout = Bio::SeqIO->new(-file => ">$out_dir"."ROD$rank".".gbk", -format => 'genbank');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
371 } elsif (ref($seqio_obj) =~ /embl/) { # Bio::SeqIO::embl object
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
372 $seqout = Bio::SeqIO->new(-file => ">$out_dir"."ROD$rank".".embl", -format => 'embl');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
373 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
374 $seqout->write_seq($truncseq); # write truncated sequence object
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
375 my $loop = 0; # indicates that more than one feature in the ROD range
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
376 print CDS "~~ROD$rank $length $pos\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
377 foreach my $feature (@{$features{$acc}}) { # de-reference anonymous array in hash
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
378 if ($feature->location->start >= $start && $feature->location->end <= $end) { # features that are fully within ROD region
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
379 $loop = print_CDSs($feature, $loop); # subroutine 'print_CDSs'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
380 } elsif (($feature->location->start <= $start && ($feature->location->end > $start && $feature->location->end <= $end)) || (($feature->location->start >= $start && $feature->location->start < $end) && $feature->location->end > $end)) { # features that overlap the ROD region
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
381 $loop = print_CDSs($feature, $loop, 1); # 1 defines $overlap in subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
382 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
383 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
384 if ($loop == 0) { # print "\n" if no features exist in the ROD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
385 print ROD "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
386 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
387 } else { # fasta files don't contain feature information
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
388 print ROD "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
389 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
390 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
391 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
392 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
393 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
394 close ROD;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
395 close SEQ;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
396 close GFF;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
397 close BRIG;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
398 if (ref($seqio_obj) !~ m/\:\:fasta$/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
399 close CDS;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
400 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
401 $| = 0; # turn off autoflush
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
402
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
403
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
404 ### Print which files have been created!
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
405 print "\n\nThe following files were created in the \"./results\" directory:\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
406 print "- $rod_result: Tab-separated summary of the found regions of difference (RODs)\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
407 print "- $gff_out: GFF3 file with ROD coordinates to use in Artemis/DNAPlotter ...\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
408 print "- $brig_out: Tab-separated file for BRIG (BLAST Ring Image Generator)\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
409 print "- $seq_out: Nucleotide sequences of ROD regions in multi-fasta format\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
410 if (!(ref($seqio_obj) =~ m/\:\:fasta$/i)) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
411 print "- $cds_out: Amino acid sequences of CDSs in ROD regions\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
412 print "- ROD#.format: Each ROD as a separate sequence file in the query file format\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
413 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
414 print "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
415
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
416 exit;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
417
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
418
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
419 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
420 # Subroutines #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
421 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
422
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
423 ### Subroutine to print the position of features according to leading and lagging strand
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
424 sub position {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
425 my $feature = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
426 if ($feature->strand == 1) { # leading strand
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
427 print ROD "\t", $feature->location->start, "..", $feature->location->end;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
428 } elsif ($feature->strand == -1) { # lagging strand
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
429 print ROD "\t", $feature->location->end, "..", $feature->location->start;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
430 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
431 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
432 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
433
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
434
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
435 ### Subroutine that prints feature information from RichSeq files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
436 sub print_CDSs {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
437 my $feature = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
438 my $loop = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
439 my $overlap = shift; # defined if feature overlaps ROD edge
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
440 my $primary_tag = $feature->primary_tag;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
441 if (($primary_tag eq 'CDS') || ($primary_tag eq 'misc_RNA') || ($primary_tag eq 'ncRNA') || ($primary_tag eq 'rRNA') || ($primary_tag eq 'tRNA') || ($primary_tag eq 'tmRNA') || ($primary_tag eq 'misc_feature') || ($primary_tag eq 'repeat_region')) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
442 skip_columns($loop); # sub to skip to correct column in print out
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
443 print ROD "\t$primary_tag";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
444 if ($feature->has_tag('locus_tag')) { # TRUE if tag 'locus_tag' exists for this feature
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
445 print ROD "\t", $feature->get_tag_values('locus_tag');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
446 print CDS ">", $feature->get_tag_values('locus_tag') if ($primary_tag eq 'CDS' && !$feature->has_tag('pseudo')); # include only CDSs with "/translation" in file 'rod_aa_fasta.txt' (exclude pseudo-genes, RNAs ...)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
447 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
448 print ROD "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
449 print CDS ">" if ($primary_tag eq 'CDS' && !$feature->has_tag('pseudo'));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
450 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
451 if ($feature->has_tag('gene')) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
452 print ROD "\t", $feature->get_tag_values('gene');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
453 print CDS " ", $feature->get_tag_values('gene') if ($primary_tag eq 'CDS' && !$feature->has_tag('pseudo'));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
454 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
455 print ROD "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
456 print CDS " " if ($primary_tag eq 'CDS' && !$feature->has_tag('pseudo'));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
457 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
458 if ($feature->has_tag('product')) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
459 my ($product) = $feature->get_tag_values('product');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
460 print ROD "\t", $product;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
461 print CDS " ", replace_white($product) if ($primary_tag eq 'CDS' && !$feature->has_tag('pseudo')); # subroutine 'replace_white'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
462 } elsif ($feature->has_tag('rpt_family')) { # don't include feature 'rpt_family' in the aa output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
463 print ROD "\t", $feature->get_tag_values('rpt_family');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
464 } elsif ($feature->has_tag('note')) { # only include feature 'note' if 'product' doesn't exist
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
465 my ($note) = $feature->get_tag_values('note');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
466 print ROD "\t", $note;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
467 print CDS " ", replace_white($note) if ($primary_tag eq 'CDS' && !$feature->has_tag('pseudo'));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
468 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
469 position($feature); # subroutine 'position' to print coordinates of features
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
470 if (defined $overlap) { # include '(overlap)' if feature overlaps edge of ROD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
471 print ROD " (overlap)\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
472 print CDS " (overlap)\n" if ($primary_tag eq 'CDS' && !$feature->has_tag('pseudo'));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
473 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
474 print ROD "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
475 print CDS "\n" if ($primary_tag eq 'CDS' && !$feature->has_tag('pseudo'));
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
476 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
477 if ($primary_tag eq 'CDS' && $feature->has_tag('translation')) { # print translations of CDSs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
478 print CDS $feature->get_tag_values('translation'), "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
479 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
480 $loop++; # Indicates that more than one feature in the ROD range
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
481 } elsif ($primary_tag eq 'mobile_element') { # include 'mobile_element' features only in ROD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
482 skip_columns($loop);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
483 print ROD "\t$primary_tag";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
484 if ($feature->has_tag('mobile_element_type')) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
485 print ROD "\t\t\t", $feature->get_tag_values('mobile_element_type');
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
486 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
487 print ROD "\t\t\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
488 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
489 position($feature); # sub to print positions of features
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
490 if (defined $overlap) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
491 print ROD " (overlap)\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
492 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
493 print ROD "\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
494 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
495 $loop++;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
496 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
497 return $loop;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
498 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
499
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
500
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
501 ### Subroutine to replace whitespaces in a string
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
502 sub replace_white {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
503 my $string = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
504 $string =~ s/\s/_/g;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
505 return $string;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
506 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
507
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
508
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
509 ### Subroutine to skip to the correct column for features in tab-separated result file 'rod_result_summary.txt'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
510 sub skip_columns {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
511 my $loop = shift; # $multi is in the same scope ... (actually also $loop)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
512 if ($loop >= 1) { # for the features after the first ROD line skip to the 'feature primary_tag' column in the tab-separated summary output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
513 print ROD "\t\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
514 if ($multi >=2) { # multi-seq query files include acc. nr. column
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
515 print ROD "\t";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
516 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
517 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
518 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
519 }