sample_fastx-txt
================

`sample_fastx-txt.pl` is a script to randomly subsample FASTA, FASTQ, or TEXT files.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
  * [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing)
  * [Subsample TEXT file and skip three header lines during subsampling](#subsample-text-file-and-skip-three-header-lines-during-subsampling)
  * [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Acknowledgements](#acknowledgements)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl sample_fastx-txt.pl -i infile.fasta -n 100 > subsample.fasta

**or**

    zcat reads.fastq.gz | perl sample_fastx-txt.pl -i - -n 100000 > subsample.fastq

## Description

Randomly subsample FASTA, FASTQ, and TEXT files.

Empty lines in the input files are skipped and not included in
sampling. TEXT format assumes one entry per line. FASTQ format
assumes **four** lines per read; if this is not the case, run the
FASTQ file through [`fastx_fix.pl`](/fastx_fix) or use Heng Li's
[`seqtk seq`](https://github.com/lh3/seqtk):

    seqtk seq -l 0 infile.fq > outfile.fq

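The four-lines-per-read assumption can be illustrated with a minimal Python sketch. This is not the script's Perl code; the function name `fastq_records` and the sanity checks on the `@` and `+` lines are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ reads as lists of four lines, ignoring empty lines."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # empty lines are skipped and never become part of a read
        record.append(line)
        if len(record) == 4:
            # Hypothetical sanity check: header starts with '@', separator with '+'.
            if not record[0].startswith("@") or not record[2].startswith("+"):
                raise ValueError(f"not a four-line FASTQ record: {record[0]!r}")
            yield record
            record = []
    if record:
        raise ValueError("FASTQ input ended in the middle of a record")


example = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "\n",  # an empty line between reads
    "@read2\n", "TTGA\n", "+\n", "HHHH\n",
]
reads = list(fastq_records(example))
```

A FASTQ file with sequences wrapped over several lines would fail the check in this sketch, which is the situation where unwrapping with `seqtk seq -l 0` helps.
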
The file type is detected automatically. However, if automatic
detection fails, TEXT format is assumed. As a last resort, you can
set the file type manually with option **-f**.

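How the script detects the file type is not described here. One plausible approach, shown purely as an assumption, is to look at the first character of the first non-empty line: `>` starts a FASTA header and `@` starts a FASTQ header. The function `guess_file_type` below is hypothetical.

```python
def guess_file_type(first_line: str) -> str:
    """Guess the format from the first non-empty line (illustrative heuristic only)."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    # Anything else falls back to TEXT, mirroring the documented default.
    return "text"


guesses = [guess_file_type(line) for line in (">contig_1", "@read1", "gene\tcount")]
```

A heuristic like this would misclassify a TEXT file whose first line happens to begin with `@` or `>`; that is the kind of case where setting the type manually with **-f** is useful.
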
This script is an implementation of the *reservoir sampling*
algorithm (or *Algorithm R (3.4.2)*) described in Donald Knuth's
[*The Art of Computer Programming*](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming).
It is designed to randomly pull a small sample from a (potentially)
huge input file of indeterminate size, which (potentially) doesn't
fit into main memory. The beauty of reservoir sampling is that it
requires only one pass through the input file. The memory
consumption of the algorithm is proportional to the sample size, so
large sample sizes consume a lot of memory because the whole sample
is held in memory. The size of the input file, on the other hand,
is irrelevant.

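The idea behind Algorithm R can be sketched in a few lines of Python. This is an illustration of the general technique, not a transcription of the Perl script; the names `reservoir_sample`, `entries`, `num`, and `seed` are chosen for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(entries: Iterable[T], num: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to `num` entries in a single pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for count, entry in enumerate(entries, start=1):
        if count <= num:
            # Fill the reservoir with the first `num` entries.
            reservoir.append(entry)
        else:
            # Keep the new entry with probability num/count by drawing a
            # uniform index in [0, count) and replacing only if it hits a slot.
            j = rng.randrange(count)
            if j < num:
                reservoir[j] = entry
    return reservoir


sample = reservoir_sample(range(1000), num=10, seed=1)
short_input = reservoir_sample(range(5), num=10, seed=1)
```

Only the reservoir (at most `num` entries) is held in memory, and the input is consumed once as a stream, so its total length never needs to be known in advance. If the input has fewer entries than requested, the sketch simply returns all of them.
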
An alternative tool, which is a lot faster, is `seqtk sample` from
the [*seqtk toolkit*](https://github.com/lh3/seqtk).

## Usage

### Subsample paired-end read data and retain pairing

    perl sample_fastx-txt.pl -i read-pair_1.fq -n 1000000 -s 123 > sub-pair_1.fq

    perl sample_fastx-txt.pl -i read-pair_2.fq -n 1000000 -s 123 > sub-pair_2.fq

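Why an identical seed retains pairing can be shown with a small Python sketch: a seeded reservoir sampler makes the same sequence of random decisions on both runs, so it selects the same read positions. The function `sampled_positions` is hypothetical, and the sketch assumes both files contain the same number of reads in the same order.

```python
import random
from typing import List


def sampled_positions(total: int, num: int, seed: int) -> List[int]:
    """Return the 0-based read positions a seeded reservoir sampler would keep."""
    rng = random.Random(seed)
    reservoir: List[int] = []
    for position in range(total):
        if position < num:
            reservoir.append(position)
        else:
            # Same replacement rule as reservoir sampling (Algorithm R).
            j = rng.randrange(position + 1)
            if j < num:
                reservoir[j] = position
    return reservoir


# Two runs with the same seed, one per mate file.
forward = sampled_positions(total=1000, num=50, seed=123)
reverse = sampled_positions(total=1000, num=50, seed=123)
```

Because the decisions depend only on the seed and the running read count, the two position lists are identical. In this sketch, a read missing from one of the files would shift every later position and break the pairing.
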
### Subsample TEXT file and skip three header lines during subsampling

    perl sample_fastx-txt.pl -i infile.txt -n 100 -f text -t 3 > subsample.txt

### Subsample TEXT file and remove two header lines for final output

    perl sample_fastx-txt.pl -i infile.txt -n 350 -t 2 | sed '1,2d' > sub_no-header.txt

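The header handling in the two TEXT examples can be sketched in Python as follows. This is an illustration only: it uses `random.sample` on an in-memory list instead of reservoir sampling for brevity, the function name `sample_text_with_header` is hypothetical, and it assumes the header lines are written before the sampled entries (which is what makes `sed '1,2d'` remove them).

```python
import random
from typing import List, Tuple


def sample_text_with_header(
    lines: List[str], num: int, skip: int, seed: int
) -> Tuple[List[str], List[str]]:
    """Split off `skip` header lines, then sample `num` of the remaining entries."""
    header = lines[:skip]
    # Empty lines are dropped so they can never be picked as an entry.
    entries = [line for line in lines[skip:] if line.strip()]
    rng = random.Random(seed)
    sample = rng.sample(entries, min(num, len(entries)))
    return header, sample


table = ["# run 42", "id\tvalue", "a\t1", "b\t2", "", "c\t3", "d\t4"]
header, sample = sample_text_with_header(table, num=3, skip=2, seed=7)

with_header = header + sample      # header lines first, then the sampled entries
without_header = with_header[2:]   # same effect as piping through sed '1,2d'
```

Keeping the header out of the sampling pool guarantees that it is neither lost nor counted toward the requested number of entries.
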
## Options

### Mandatory options

- -i, -input

    Input FASTA/Q or TEXT file, or piped *STDIN* (-)

- -n, -num

    Number of entries/reads to subsample

### Optional options

- -h, -help

    Help (perldoc POD)

- -f, -file_type

    Set the file type manually [fasta|fastq|text]

- -s, -seed

    Set the starting random seed. For **paired-end** read data, use the **same random seed** for both FASTQ files with option **-s** to retain pairing (see [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing) above).

- -t, -title_skip

    Skip the specified number of header lines in TEXT files before subsampling and add them back to the output afterwards. If you want to get rid of the header as well, pipe the subsample output to [`sed`](https://www.gnu.org/software/sed/manual/sed.html) (see `man sed` and [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output) above).

- -v, -version

    Print version number to *STDERR*

## Output

- *STDOUT*

    The subsample of the input file is printed to *STDOUT*. Redirect or pipe into another tool as needed.

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Acknowledgements

I got the idea for reservoir sampling from Sean Eddy's keynote at
the Janelia meeting on [*High Throughput Sequencing for Neuroscience*](http://cryptogenomicon.wordpress.com/2014/11/01/high-throughput-sequencing-for-neuroscience/),
which he posted on his blog
[*Cryptogenomicon*](http://cryptogenomicon.wordpress.com/). The [*Wikipedia article*](https://en.wikipedia.org/wiki/Reservoir_sampling) and the
[*PerlMonks*](http://www.perlmonks.org/index.pl?node_id=177092) implementation helped a lot as well.

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information, please see the repository's main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

- v0.1 (18.11.2014)