annotate COG/bac-genomics-scripts/order_fastx/order_fastx.pl @ 13:152d7c43478b draft default tip

Uploaded
author dereeper
date Thu, 30 May 2024 20:07:55 +0000
parents e42d30da7a74
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 #!/usr/bin/perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 # POD #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 =pod
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 =head1 NAME
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 C<order_fastx.pl> - order sequences in FASTA or FASTQ files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13 =head1 SYNOPSIS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15 C<perl order_fastx.pl -i infile.fasta -l order_id_list.txt
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16 E<gt> ordered.fasta>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18 =head1 DESCRIPTION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20 Order sequence entries in FASTA or FASTQ sequence files according to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 an ID list with a given order. Beware, the IDs in the order list
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 have to be B<identical> to the entire IDs in the sequence file.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24 However, the ">" or "@" ID identifiers of FASTA or FASTQ files,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 respectively, can be omitted in the ID list.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27 The file type is detected automatically. But, you can set the file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 type manually with option B<-f>. FASTQ format assumes B<four> lines
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29 per read, if this is not the case run the FASTQ file through
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 L<C<fastx_fix.pl>|/fastx_fix> or use Heng Li's L<C<seqtk
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31 seq>|https://github.com/lh3/seqtk>:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33 C<seqtk seq -l 0 infile.fq E<gt> outfile.fq>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35 The script can also be used to pull a subset of sequences in the ID
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 list from the sequence file. Probably best to set option flag B<-s>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37 in this case, see L<"Optional options"> below. But, rather use
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38 L<C<filter_fastx.pl>|/filter_fastx>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40 =head1 OPTIONS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42 =head2 Mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46 =item B<-i>=I<str>, B<-input>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48 Input FASTA or FASTQ file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50 =item B<-l>=I<str>, B<-list>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52 List with sequence IDs in specified order
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56 =head2 Optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60 =item B<-h>, B<-help>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62 Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64 =item B<-f>=I<fasta|fastq>, B<-file_type>=I<fasta|fastq>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66 Set the file type manually
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68 =item B<-e>, B<-error_files>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70 Write missing IDs in the seq file or the order ID list without an
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 equivalent in the other to error files instead of C<STDERR> (see
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72 L<"OUTPUT"> below)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74 =item B<-s>, B<-skip_errors>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76 Skip missing ID error statements, excludes option B<-e>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78 =item B<-v>, B<-version>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80 Print version number to C<STDERR>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84 =head1 OUTPUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88 =item C<STDOUT>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90 The newly ordered sequences are printed to C<STDOUT>. Redirect or
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91 pipe into another tool as needed.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93 =item (F<order_ids_missing.txt>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95 If IDs in the order list are missing in the sequence file with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96 option B<-e>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98 =item (F<seq_ids_missing.txt>)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100 If IDs in the sequence file are missing in the order ID list with
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 option B<-e>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105 =head1 EXAMPLES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109 =item C<perl order_fastx.pl -i infile.fq -l order_id_list.txt -s -f
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110 fastq E<gt> ordered.fq>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112 =item C<perl order_fastx.pl -i infile.fasta -l order_id_list.txt -e
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113 E<gt> ordered.fasta>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
114
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
115 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
116
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
117 =head1 VERSION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
118
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
119 0.1 20-11-2014
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
120
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
121 =head1 AUTHOR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
122
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
123 Andreas Leimbach aleimba[at]gmx[dot]de
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
124
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
125 =head1 LICENSE
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
126
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
127 This program is free software: you can redistribute it and/or modify
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
128 it under the terms of the GNU General Public License as published by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
129 the Free Software Foundation; either version 3 (GPLv3) of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
130 License, or (at your option) any later version.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
131
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
132 This program is distributed in the hope that it will be useful, but
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
133 WITHOUT ANY WARRANTY; without even the implied warranty of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
134 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
135 General Public License for more details.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
136
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
137 You should have received a copy of the GNU General Public License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
138 along with this program. If not, see L<http://www.gnu.org/licenses/>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
139
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
140 =cut
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
141
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
142
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
143 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
144 # MAIN #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
145 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
146
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
147 use strict;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
148 use warnings;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
149 use autodie;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
150 use Getopt::Long;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
151 use Pod::Usage;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
152
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
153 ### Get the options with Getopt::Long
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
154 my $Seq_File; # sequence file to order sequences in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
155 my $Order_List; # order ID list for seq file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
156 my $File_Type; # set file type; otherwise detect file type by file extension
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
157 my $Opt_Error_Files; # print missing IDs not found in order list or seq file to error files instead of STDERR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
158 my $Opt_Skip_Errors; # skip missing IDs error statements/files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
159 my $VERSION = 0.1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
160 my ($Opt_Version, $Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
161 GetOptions ('input=s' => \$Seq_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
162 'list=s' => \$Order_List,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
163 'file_type=s' => \$File_Type,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
164 'error_files' => \$Opt_Error_Files,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
165 'skip_errors' => \$Opt_Skip_Errors,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
166 'version' => \$Opt_Version,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
167 'help|?' => \$Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
168
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
169
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
170
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
171 ### Run perldoc on POD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
172 pod2usage(-verbose => 2) if ($Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
173 die "$0 $VERSION\n" if ($Opt_Version);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
174 if (!$Seq_File || !$Order_List) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
175 my $warning = "\n### Fatal error: Options '-i' or '-l' or their arguments are missing!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
176 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
177 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
178
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
179
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
180
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
181 ### Enforce mandatory or optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
182 die "\n### Fatal error:\nUnknown file type '$File_Type' given with option '-f'. Please choose from either 'fasta' or 'fastq'!\n" if ($File_Type && $File_Type !~ /(fasta|fastq)/i);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
183 warn "\n### Warning:\nIgnoring option flag '-e', because option '-s' set at the same time!\n\n" if ($Opt_Error_Files && $Opt_Skip_Errors);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
184
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
185
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
186
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
187 ### Order input FASTA/FASTQ file according to a given list
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
188 open (my $Order_List_Fh, "<", "$Order_List");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
189 open (my $Input_Fh, "<", "$Seq_File"); # pipe from STDIN not working because of 'seek' on filehandle
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
190 get_file_type() if (!$File_Type); # subroutine to determine file type by file extension
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
191
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
192 my %Order_List_IDs; # store order IDs and indicate if found in seq file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
193 my %Seq_File_IDs; # store seq file IDs and indicate if present in order list
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
194
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
195 my $Next_Fasta_ID; # for multi-line FASTA input files to store next entry header/ID line while parsing in subroutine 'get_fastx_entry'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
196 my $Parse_Run = 1; # indicate FIRST parsing cycle through seq file to collect all seq IDs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
197
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
198 while (my $ord_id = <$Order_List_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
199 chomp $ord_id;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
200 next if ($ord_id =~ /^\s*$/); # skip emtpy lines in order list
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
201 $ord_id =~ s/^(>|@)//; # remove ">/@" for WHOLE string regex match -> ID in order list can be given with ">/@" or without (will be appended again in print)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
202
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
203 if ($Order_List_IDs{$ord_id}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
204 die "\n### Fatal error:\n'$ord_id' exists several times in '$Order_List' and IDs should be unique!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
205 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
206 $Order_List_IDs{$ord_id} = 1; # changes to 2 if ID was found in seq file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
207 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
208
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
209 while (<$Input_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
210 if (/^\s*$/) { # skip empty lines in input
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
211 warn "\n### Warning:\nFASTQ file includes empty lines, which is unusual. Parsing the FASTQ reads might fail so check the output file afterwards if the script didn't quit with a fatal error. However, consider running the input FASTQ file through 'fix_fastx.pl'!\n\n" if ($File_Type =~ /fastq/i);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
212 next;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
213 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
214 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
215
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
216 # FASTA file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
217 if ($File_Type =~ /fasta/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
218 $_ = get_fastx_entry($_); # subroutine to read one FASTA sequence entry (seq in multi-line or not), returns anonymous array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
219
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
220 # FASTQ file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
221 } elsif ($File_Type =~ /fastq/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
222 $_ = get_fastx_entry($_); # subroutine to read one FASTQ read composed of FOUR mandatory lines, returns reference to array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
223 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
224
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
225 if ($Seq_File_IDs{$_->[0]} && $Parse_Run == 1) { # only for first parse cycle, subsequent parsings of course will find the same IDs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
226 die "\n### Fatal error:\n'$_->[0]' exists several times in '$Seq_File' and IDs should be unique!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
227 } elsif (!$Seq_File_IDs{$_->[0]}) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
228 $Seq_File_IDs{$_->[0]} = 1; # changes to 2 if present in order list
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
229 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
230
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
231 if ($ord_id =~ /^$_->[0]$/) { # order ID hit in seq file with the WHOLE string; de-reference array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
232 $Order_List_IDs{$ord_id} = 2; # set to ID found
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
233 $Seq_File_IDs{$_->[0]} = 2;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
234
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
235 print ">$_->[0]\n$_->[1]\n\n" if ($File_Type =~ /fasta/i); # print seq entry to STDOUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
236 print "\@$_->[0]\n$_->[1]\n$_->[2]\n$_->[3]\n" if ($File_Type =~ /fastq/i);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
237
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
238 next if ($Parse_Run == 1); # parse the complete seq file once (skip 'last' below) to collect all seq IDs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
239 last; # jump out of seq file 'while'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
240 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
241 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
242
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
243 # rewind seq file for next order list ID
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
244 $Next_Fasta_ID = '';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
245 seek $Input_Fh, 0, 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
246 $. = 0; # set line number of seq file to 0 (seek doesn't do it automatically)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
247 $Parse_Run = 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
248 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
249 close $Input_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
250 close $Order_List_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
251
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
252
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
253
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
254 ### Print order and seq IDs that were not found in seq file or in order list, resp.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
255 if (!$Opt_Skip_Errors) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
256 # order IDs not found in seq file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
257 missing_IDs(\%Order_List_IDs, 'order_ids_missing.txt', 'order'); # subroutine to identify and print missing IDs
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
258
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
259 # seq file IDs not found in order list
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
260 missing_IDs(\%Seq_File_IDs, 'seq_ids_missing.txt', 'sequence'); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
261 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
262
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
263 exit;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
264
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
265
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
266 #############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
267 #Subroutines#
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
268 #############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
269
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
270 ### Test for output file existence and give warning to STDERR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
271 sub file_exist {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
272 my $file = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
273 if (-e $file) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
274 warn "\n### Warning:\nThe error file '$file' exists already, the current errors will be appended to the existing file!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
275 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
276 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
277 return 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
278 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282 ### Get sequence entries from FASTA/Q file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283 sub get_fastx_entry {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284 my $line = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286 # possible multi-line seq in FASTA
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287 if ($File_Type =~ /fasta/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288 my ($seq, $header);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289 if ($. == 1) { # first line of file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290 die "\n### Fatal error:\nNot a FASTA input file, first line of file should be a FASTA ID/header line and start with a '>':\n$line\n" if ($line !~ /^>/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291 $header = $line;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292 } elsif ($Next_Fasta_ID) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293 $header = $Next_Fasta_ID;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294 $seq = $line;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296 while (<$Input_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298 $Next_Fasta_ID = $_ if (/^>/); # store ID/header for next seq entry
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299 $header =~ s/^>//; # remove '>' for WHOLE string regex match in MAIN
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300 return [$header, $seq] if (/^>/); # return anonymous array with current header and seq
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301 $seq .= $_; # concatenate multi-line seq
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303 $header =~ s/^>//; # see above
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304 return [$header, $seq] if eof;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306 # FASTQ: FOUR lines for each FASTQ read (seq-ID, sequence, qual-ID [optional], qual)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307 } elsif ($File_Type =~ /fastq/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308 my @fastq_read;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310 # read name/ID, line 1
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311 my $seq_id = $line;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312 die "\n### Fatal error:\nThis read doesn't have a sequence identifier/read name according to FASTQ specs, it should begin with a '\@':\n$seq_id\n" if ($seq_id !~ /^@.+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313 $seq_id =~ s/^@//; # remove '@' to make comparable to $qual_id and for WHOLE string regex match in MAIN
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314 push(@fastq_read, $seq_id);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
315
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
316 # sequence, line 2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
317 chomp (my $seq = <$Input_Fh>);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
318 die "\n### Fatal error:\nRead '$seq_id' has a whitespace in its sequence, which is not allowed according to FASTQ specs:\n$seq\n" if ($seq =~ /\s+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
319 die "\n### Fatal error:\nRead '$seq_id' has a IUPAC degenerate base (except for 'N') or non-nucleotide character in its sequence, which is not allowed according to FASTQ specs:\n$seq\n" if ($seq =~ /[^acgtun]/i);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
320 push(@fastq_read, $seq);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
321
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
322 # optional quality ID, line 3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
323 chomp (my $qual_id = <$Input_Fh>);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
324 die "\n### Fatal error:\nThe optional sequence identifier/read name for the quality line of read '$seq_id' is not according to FASTQ specs, it should begin with a '+':\n$qual_id\n" if ($qual_id !~ /^\+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
325 push(@fastq_read, $qual_id);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
326 $qual_id =~ s/^\+//; # if optional ID is present check if equal to $seq_id in line 1
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
327 die "\n### Fatal error:\nThe sequence identifier/read name of read '$seq_id' doesn't fit to the optional ID in the quality line:\n$qual_id\n" if ($qual_id && $qual_id ne $seq_id);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
328
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
329 # quality, line 4
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
330 chomp (my $qual = <$Input_Fh>);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
331 die "\n### Fatal error:\nRead '$seq_id' has a whitespace in its quality values, which is not allowed according to FASTQ specs:\n$qual\n" if ($qual =~ /\s+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
332 die "\n### Fatal error:\nRead '$seq_id' has a non-ASCII character in its quality values, which is not allowed according to FASTQ specs:\n$qual\n" if ($qual =~ /[^[:ascii]]/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
333 die "\n### Fatal error:\nThe quality line of read '$seq_id' doesn't have the same number of symbols as letters in the sequence:\n$seq\n$qual\n" if (length $qual != length $seq);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
334 push(@fastq_read, $qual);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
335
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
336 return \@fastq_read; # return array-ref
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
337 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
338 return 0;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
339 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
340
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
341
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
342
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
343 ### Determine file type via file extension (FASTA or FASTQ)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
344 sub get_file_type {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
345 if ($Seq_File =~ /.+\.(fa|fas|fasta|ffn|fna|frn|fsa)$/) { # use "|fsa)(\.gz)*$" if unzip inside script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
346 $File_Type = 'fasta';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
347 } elsif ($Seq_File =~ /.+\.(fastq|fq)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
348 $File_Type = 'fastq';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
349 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351 die "\n### Fatal error:\nFile type could not be automatically detected. Sure this is a FASTA/Q file? If yes, you can force the file type by setting option '-f' to either 'fasta' or 'fastq'!\n" if (!$File_Type);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352 print STDERR "Detected file type: $File_Type\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
354 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
355
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
356
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
357
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
358 ### Identify and print IDs with no hit in order list or seq file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
359 sub missing_IDs {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
360 my ($hash_ref, $error_file, $mode) = @_;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
361
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
362 my @missed = grep ($hash_ref->{$_} == 1, keys %$hash_ref); # set to 2 if hit, 1 if "only" present
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
363
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
364 if (@missed) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
365 file_exist($error_file) if ($Opt_Error_Files); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
366 open (my $error_fh, ">>", $error_file) if ($Opt_Error_Files);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
367
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
368 print STDERR "\n### Warning:\nSome $mode IDs were not found in '";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
369 if ($mode eq 'order') {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
370 print STDERR "$Seq_File";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
371 } elsif ($mode eq 'sequence') {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
372 print STDERR "$Order_List";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
373 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
374 print STDERR "', listed ";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
375 print STDERR "below:\n" if (!$Opt_Error_Files);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
376 print STDERR "in error file '$error_file'!\n" if ($Opt_Error_Files);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
377
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
378 foreach (sort @missed) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
379 if (!$Opt_Error_Files) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
380 print STDERR "$_\t"; # separated by tab
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
381 } elsif ($Opt_Error_Files) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
382 print $error_fh "$_\n"; # separated by newline
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
383 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
384 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
385 print STDERR "\n" if (!$Opt_Error_Files); # final newline for STDERR print
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
386
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
387 close $error_fh if ($Opt_Error_Files);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
388 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
389 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
390 }