annotate COG/bac-genomics-scripts/sample_fastx-txt/sample_fastx-txt.pl @ 14:5a5c9a6b047b draft

Uploaded
author dereeper
date Tue, 10 Dec 2024 16:20:53 +0000
parents e42d30da7a74
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 #!/usr/bin/perl
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 # POD #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5 #######
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 =pod
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 =head1 NAME
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 C<sample_fastx-txt.pl> - random subsampling of FASTA, FASTQ, or TEXT files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13 =head1 SYNOPSIS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15 C<perl sample_fastx-txt.pl -i infile.fasta -n 100 E<gt> subsample.fasta>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17 B<or>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19 C<zcat reads.fastq.gz | perl sample_fastx-txt.pl -i - -n 100000
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20 E<gt> subsample.fastq>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22 =head1 DESCRIPTION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24 Randomly subsample FASTA, FASTQ, and TEXT files.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 Empty lines in the input files will be skipped and not included in
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27 sampling. Format TEXT assumes one entry per single line. FASTQ
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28 format assumes B<four> lines per read, if this is not the case run
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29 the FASTQ file through L<C<fastx_fix.pl>|/fastx_fix> or use Heng
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 Li's L<C<seqtk seq>|https://github.com/lh3/seqtk>:
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 C<seqtk seq -l 0 infile.fq E<gt> outfile.fq>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 The file type is detected automatically. However, if automatic
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35 detection fails, TEXT format is assumed. As a last resort, you can
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 set the file type manually with option B<-f>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38 This script is an implementation of the I<reservoir sampling>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39 algorithm (or I<Algorithm R (3.4.2)>) described in Donald Knuth's
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40 L<I<The Art of Computer Programming>|https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 It is designed to randomly pull a small sample size from a
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42 (potential) huge input file of indeterminate size, which
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 (potentially) doesn't fit into main memory. The beauty of reservoir
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44 sampling is that it requires only one pass through the input file.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45 The memory consumption of the algorithm is proportional to the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46 sample size, thus large sample sizes will consume lots of memory as
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47 the whole sample will be held in memory. On the other hand, the size
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48 of the initial file is irrelevant.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50 An alternative tool, which is a lot faster, is C<seqtk sample> from
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51 the L<I<seqtk toolkit>|https://github.com/lh3/seqtk>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53 =head1 OPTIONS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 =head2 Mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59 =item B<-i>=I<str>, B<-input>=I<str>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 Input FASTA/Q or TEXT file, or piped C<STDIN> (-)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63 =item B<-n>=I<int>, B<-num>=I<int>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65 Number of entries/reads to subsample
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69 =head2 Optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73 =item B<-h>, B<-help>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75 Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77 =item B<-f>=I<fasta|fastq|text>, B<-file_type>=I<fasta|fastq|text>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79 Set the file type manually
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81 =item B<-s>=I<int>, B<-seed>=I<int>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83 Set starting random seed. For B<paired-end> read data use the B<same
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84 random seed> for both FASTQ files with option B<-s> to retain
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85 pairing (see L<"EXAMPLES"> below).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87 =item B<-t>=I<int>, B<-title_skip>=I<int>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89 Skip the specified number of header lines in TEXT files before
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90 subsampling and append them again afterwards. If you want to get rid
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91 of the header as well, pipe the subsample output to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92 L<C<sed>|https://www.gnu.org/software/sed/manual/sed.html> (see
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93 C<man sed> and L<"EXAMPLES"> below).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95 =item B<-v>, B<-version>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97 Print version number to C<STDERR>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 =head1 OUTPUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103 =over 20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105 =item C<STDOUT>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107 The subsample of the input file is printed to C<STDOUT>. Redirect or
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108 pipe into another tool as needed.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112 =head1 EXAMPLES
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
114 =over
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
115
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
116 =item C<perl sample_fastx-txt.pl -i read-pair_1.fq -n 1000000 -s 123
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
117 E<gt> sub-pair_1.fq>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
118
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
119 =item C<perl sample_fastx-txt.pl -i read-pair_2.fq -n 1000000 -s 123
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
120 E<gt> sub-pair_2.fq>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
121
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
122 =item C<perl sample_fastx-txt.pl -i infile.txt -n 100 -f text -t 3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
123 E<gt> subsample.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
124
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
125 =item C<perl sample_fastx-txt.pl -i infile.txt -n 350 -t 2 | sed
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
126 '1,2d' E<gt> sub_no-header.txt>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
127
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
128 =back
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
129
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
130 =head1 VERSION
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
131
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
132 0.1 18-11-2014
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
133
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
134 =head1 AUTHOR
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
135
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
136 Andreas Leimbach aleimba[at]gmx[dot]de
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
137
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
138 =head1 ACKNOWLEDGEMENTS
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
139
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
140 I got the idea for reservoir sampling from Sean Eddy's keynote at
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
141 the Janelia meeting on L<I<High Throughput Sequencing for
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
142 Neuroscience>|http://cryptogenomicon.wordpress.com/2014/11/01/high-throughput-sequencing-for-neuroscience/>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
143 which he posted in his blog
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
144 L<I<Cryptogenomicon>|http://cryptogenomicon.wordpress.com/>. The
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
145 L<I<Wikipedia
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
146 article>|https://en.wikipedia.org/wiki/Reservoir_sampling> and the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
147 L<I<PerlMonks>|http://www.perlmonks.org/index.pl?node_id=177092>
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
148 implementation helped a lot, as well.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
149
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
150 =head1 LICENSE
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
151
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
152 This program is free software: you can redistribute it and/or modify
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
153 it under the terms of the GNU General Public License as published by
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
154 the Free Software Foundation; either version 3 (GPLv3) of the
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
155 License, or (at your option) any later version.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
156
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
157 This program is distributed in the hope that it will be useful, but
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
158 WITHOUT ANY WARRANTY; without even the implied warranty of
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
159 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
160 General Public License for more details.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
161
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
162 You should have received a copy of the GNU General Public License
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
163 along with this program. If not, see L<http://www.gnu.org/licenses/>.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
164
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
165 =cut
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
166
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
167
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
168 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
169 # MAIN #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
170 ########
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
171
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
172 use strict;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
173 use warnings;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
174 use autodie;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
175 use Getopt::Long;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
176 use Pod::Usage;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
177
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
178 ### Get options with Getopt::Long
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
179 my $Input_File; # input file to subsample from
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
180 my $Sample_Num; # number of sequence entries/reads/lines to sample
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
181 my $File_Type; # set file type; otherwise detect file type by file extension, or by looking at the first line of the file if input is piped from STDIN
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
182 my $Seed; # give starting seed number for 'rand' to make results repeatable; also needed for paired FASTA/Q files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
183 my $Title_Skip; # optionally skip given number of header lines of TEXT files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
184 my $VERSION = 0.1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
185 my ($Opt_Version, $Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
186 GetOptions ('input=s' => \$Input_File,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
187 'num=i' => \$Sample_Num,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
188 'file_type=s' => \$File_Type,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
189 'seed=i' => \$Seed,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
190 'title_skip=i' => \$Title_Skip,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
191 'version' => \$Opt_Version,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
192 'help|?' => \$Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
193
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
194
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
195
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
196 ### Run perldoc on POD
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
197 pod2usage(-verbose => 2) if ($Opt_Help);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
198 die "$0 $VERSION\n" if ($Opt_Version);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
199 if (!$Input_File || !$Sample_Num) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
200 my $warning = "\n### Fatal error: Options '-i' or '-n' or their arguments are missing!\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
201 pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
202 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
203 die "\n### Fatal error:\nUnknown file type '$File_Type' given with option '-f'. Please choose from either 'fasta', 'fastq', or 'text'!\n" if ($File_Type && $File_Type !~ /(fasta|fastq|text)/i);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
204
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
205
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
206
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
207 ### Pipe input from STDIN or open input file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
208 my $Input_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
209 if ($Input_File eq '-') { # file input via STDIN
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
210 $Input_Fh = *STDIN; # capture typeglob of STDIN
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
211 } else { # input via input file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
212 open ($Input_Fh, "<", "$Input_File");
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
213 get_file_type('ext') if (!$File_Type); # subroutine to determine file type; mode 'ext' (extension) for input files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
214 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
215
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
216
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
217
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
218 ### Reservoir sampling
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
219 my @Rsvr_Array; # sample reservoir (array for TEXT files, array-in-array for FASTA/Q files)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
220 my @Title; # store header for TEXT files to append to the file later on, see option flag '-title_skip'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
221 my $Next_Fasta_ID; # for multi-line FASTA input files to store next entry header/ID line while parsing in subroutine 'get_fastx_entry'
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
222
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
223 # 1. fill reservoir
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
224 while (<$Input_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
225 if (/^\s*$/) { # skip empty lines in input
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
226 warn "\n### FASTQ file includes empty lines, which is unusual. Sampling FASTQ reads might fail so check the output file afterwards if the script didn't quit with a fatal error. However, consider running the input FASTQ file through 'fix_fastx.pl'!\n\n" if ($File_Type =~ /fastq/i);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
227 next;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
228 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
229 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
230 get_file_type('stdin', $_) if (!$File_Type && $Input_File eq '-' && $. == 1); # subroutine to determine file type; mode 'stdin' piped input with first line of file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
231 warn "\n### Ignoring option '-t|-title_skip' as it has no effect on FASTA/Q files! Use the option only for TEXT files!\n\n" if($Title_Skip && $File_Type !~ /text/i && $. == 1);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
232
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
233 # FASTA file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
234 if ($File_Type =~ /fasta/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
235 $_ = get_fastx_entry($_); # subroutine to read one FASTA sequence entry (seq in multi-line or not), returns anonymous array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
236
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
237 # FASTQ file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
238 } elsif ($File_Type =~ /fastq/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
239 $_ = get_fastx_entry($_); # subroutine to read one FASTQ read composed of FOUR mandatory lines, returns reference to array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
240
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
241 # "single-line" TEXT file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
242 } elsif ($File_Type =~ /text/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
243 warn "\n### Sure this is a TEXT file? The first line suspiciously looks like a FASTA ID/header line:\n$_\nBut, proceeding with file type 'text' ...\n\n" if ($. == 1 && $_ =~ /^>.*/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
244 warn "\n### Sure this is a TEXT file? The first line suspiciously looks like a FASTQ sequence ID line:\n$_\nBut, proceeding with file type 'text' ...\n\n" if ($. == 1 && $_ =~ /^@.+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
245 if ($Title_Skip) { # skip possible header lines of TEXT file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
246 while ($Title_Skip) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
247 push(@Title, $_); # store header line
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
248 chomp($_ = <$Input_Fh>);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
249 $Title_Skip--;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
250 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
251 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
252 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
253
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
254 push(@Rsvr_Array, $_); # fill reservoir
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
255 last if (@Rsvr_Array == $Sample_Num); # reservoir is filled
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
256 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
257 die "\n### Fatal error:\nInsufficient records in input for sample size '$Sample_Num'!\n" if (@Rsvr_Array < $Sample_Num);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
258
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
259 # 2. randomly replace elements in the reservoir with a decreasing probability
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
260 srand($Seed) if ($Seed); # force seed value
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
261 while (<$Input_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
262 if (/^\s*$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
263 warn "\n### FASTQ file includes empty lines, which is unusual. Sampling FASTQ reads might fail so check the output file afterwards if the script didn't quit with a fatal error. However, consider running the input FASTQ file through 'fix_fastx.pl'!\n\n" if ($File_Type =~ /fastq/i);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
264 next;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
265 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
266 my $rand_num = int(rand($.)); # choose an integer between 0 and eof-1; inclusive because array zero-based
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
267
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
268 # replace elements in reservoir array
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
269 if ($rand_num < @Rsvr_Array) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
270 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
271
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
272 # FASTA file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
273 if ($File_Type =~ /fasta/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
274 $_ = get_fastx_entry($_); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
275
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
276 # FASTQ file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
277 } elsif ($File_Type =~ /fastq/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
278 $_ = get_fastx_entry($_); # subroutine
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280 $Rsvr_Array[$rand_num] = $_; # TEXT files are single-line based
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282 } elsif ($rand_num >= @Rsvr_Array) { # skip residual entry/read lines of FASTA/Q files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283 if ($File_Type =~ /fasta/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284 get_fastx_entry($_); # subroutine without storing returning anonymous array (to skip FASTA multi-line seq)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286 } elsif ($File_Type =~ /fastq/i) { # skip second, third, and fourth line of FASTQ file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287 <$Input_Fh>; <$Input_Fh>; <$Input_Fh>;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291 close $Input_Fh;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295 ### Print subsample to STDOUT
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296 if (@Title) { # put header back in TEXT output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297 foreach (@Title) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298 print "$_\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301 foreach (@Rsvr_Array) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302 if ($File_Type =~ /fast/i) { # for both FASTA/Q file types
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303 foreach (@$_) { # de-reference array-ref for iteration
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304 print "$_\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307 # single-line TEXT file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309 print "$_\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313 exit;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
315
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
316 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
317 # Subroutines #
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
318 ###############
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
319
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
320 ### Get sequence entries/reads from FASTA/Q file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
321 sub get_fastx_entry {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
322 my $line = shift;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
323
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
324 # possible multi-line seq in FASTA
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
325 if ($File_Type =~ /fasta/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
326 my ($seq, $header);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
327 if ($. == 1) { # first line of file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
328 die "\n### Fatal error:\nNot a FASTA input file, first line of file should be a FASTA ID/header line and start with a '>':\n$line\n" if ($line !~ /^>/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
329 $header = $line;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
330 } elsif ($Next_Fasta_ID) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
331 $header = $Next_Fasta_ID;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
332 $seq = $line;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
333 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
334 while (<$Input_Fh>) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
335 chomp;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
336 $Next_Fasta_ID = $_ if (/^>/); # store ID/header for next seq entry
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
337 return [$header, $seq] if (/^>/); # return anonymous array with current header and seq
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
338 $seq .= $_; # concatenate multi-line seq
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
339 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
340 return [$header, $seq] if eof;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
341
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
342 # FASTQ: FOUR lines for each FASTQ read (seq-ID, sequence, qual-ID [optional], qual)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
343 } elsif ($File_Type =~ /fastq/i) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
344 my @fastq_read;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
345
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
346 # read name/ID, line 1
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
347 my $seq_id = $line;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
348 die "\n### Fatal error:\nThis read doesn't have a sequence identifier/read name according to FASTQ specs, it should begin with a '\@':\n$seq_id\n" if ($seq_id !~ /^@.+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
349 push(@fastq_read, $seq_id);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
350 $seq_id =~ s/^@//; # remove '@' to make comparable to $qual_id
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
351
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
352 # sequence, line 2
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
353 chomp (my $seq = <$Input_Fh>);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
354 die "\n### Fatal error:\nRead '$seq_id' has a whitespace in its sequence, which is not allowed according to FASTQ specs:\n$seq\n" if ($seq =~ /\s+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
355 die "\n### Fatal error:\nRead '$seq_id' has a IUPAC degenerate base (except for 'N') or non-nucleotide character in its sequence, which is not allowed according to FASTQ specs:\n$seq\n" if ($seq =~ /[^acgtun]/i);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
356 push(@fastq_read, $seq);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
357
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
358 # optional quality ID, line 3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
359 chomp (my $qual_id = <$Input_Fh>);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
360 die "\n### Fatal error:\nThe optional sequence identifier/read name for the quality line of read '$seq_id' is not according to FASTQ specs, it should begin with a '+':\n$qual_id\n" if ($qual_id !~ /^\+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
361 push(@fastq_read, $qual_id);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
362 $qual_id =~ s/^\+//; # if optional ID is present check if equal to $seq_id in line 1
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
363 die "\n### Fatal error:\nThe sequence identifier/read name of read '$seq_id' doesn't fit to the optional ID in the quality line:\n$qual_id\n" if ($qual_id && $qual_id ne $seq_id);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
364
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
365 # quality, line 4
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
366 chomp (my $qual = <$Input_Fh>);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
367 die "\n### Fatal error:\nRead '$seq_id' has a whitespace in its quality values, which is not allowed according to FASTQ specs:\n$qual\n" if ($qual =~ /\s+/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
368 die "\n### Fatal error:\nRead '$seq_id' has a non-ASCII character in its quality values, which is not allowed according to FASTQ specs:\n$qual\n" if ($qual =~ /[^[:ascii]]/);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
369 die "\n### Fatal error:\nThe quality line of read '$seq_id' doesn't have the same number of symbols as letters in the sequence:\n$seq\n$qual\n" if (length $qual != length $seq);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
370 push(@fastq_read, $qual);
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
371
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
372 return \@fastq_read; # return array-ref
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
373 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
374 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
375
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
376
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
377
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
378 ### Determine file type (FASTA, FASTQ, or TEXT)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
379 sub get_file_type {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
380 my ($mode, $line) = @_; # mode either 'ext' or 'stdin' ('stdin' needs first line of input file)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
381
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
382 # determine file type via file extension
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
383 if ($mode eq 'ext') {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
384 if ($Input_File =~ /.+\.(fa|fas|fasta|ffn|fna|frn|fsa)$/) { # use "|fsa)(\.gz)*$" if unzip inside script
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
385 $File_Type = 'fasta';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
386 } elsif ($Input_File =~ /.+\.(fastq|fq)$/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
387 $File_Type = 'fastq';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
388 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
389 $File_Type = 'text';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
390 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
391
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
392 # determine by looking at first line of file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
393 } elsif ($mode eq 'stdin') {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
394 if ($line =~ /^>.*/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
395 $File_Type = 'fasta';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
396 } elsif ($line =~ /^@.+/) {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
397 $File_Type = 'fastq';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
398 } else {
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
399 $File_Type = 'text';
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
400 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
401 }
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
402
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
403 print STDERR "Detected file type: $File_Type\n";
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
404 return 1;
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
405 }