comparison COG/bac-genomics-scripts/sample_fastx-txt/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
2:97e4e3e818b6 3:e42d30da7a74
1 sample_fastx-txt
2 ================
3
4 `sample_fastx-txt.pl` is a script to randomly subsample FASTA, FASTQ, or TEXT files.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing)
10 * [Subsample TEXT file and skip three header lines during subsampling](#subsample-text-file-and-skip-three-header-lines-during-subsampling)
11 * [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output)
12 * [Options](#options)
13 * [Mandatory options](#mandatory-options)
14 * [Optional options](#optional-options)
15 * [Output](#output)
16 * [Run environment](#run-environment)
17 * [Author - contact](#author---contact)
18 * [Acknowledgements](#acknowledgements)
19 * [Citation, installation, and license](#citation-installation-and-license)
20 * [Changelog](#changelog)
21
22 ## Synopsis
23
24 perl sample_fastx-txt.pl -i infile.fasta -n 100 > subsample.fasta
25
26 **or**
27
28 zcat reads.fastq.gz | perl sample_fastx-txt.pl -i - -n 100000 > subsample.fastq
29
30 ## Description
31
32 Randomly subsample FASTA, FASTQ, and TEXT files.
33
34 Empty lines in the input files are skipped and are not included in
35 sampling. TEXT format assumes one entry per line. FASTQ
36 format assumes **four** lines per read; if this is not the case, run
37 the FASTQ file through [`fastx_fix.pl`](/fastx_fix) or use Heng
38 Li's [`seqtk seq`](https://github.com/lh3/seqtk):
39
40 seqtk seq -l 0 infile.fq > outfile.fq
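Before subsampling, it can help to confirm that a FASTQ file really has four lines per read. The following is a minimal sketch, not part of `sample_fastx-txt.pl`; the function name `check_fastq_four_line` is hypothetical, and the check only inspects the `@` and `+` marker lines and the total line count.

```shell
# Hypothetical helper (not part of sample_fastx-txt.pl).
# Exit status 0 if FILE looks like strict four-line FASTQ, 1 otherwise.
# Usage: check_fastq_four_line FILE
check_fastq_four_line() {
    awk '
        NF == 0 { next }      # ignore empty lines, which the script skips as well
        { n++ }               # n counts non-empty lines
        n % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1; exit }   # header line
        n % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1; exit }   # separator line
        END { if (bad || n % 4 != 0) exit 1 }
    ' "$1"
}
```

For example, `check_fastq_four_line infile.fq || seqtk seq -l 0 infile.fq > outfile.fq` would rewrite the file only when the check fails.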
41
42 The file type is detected automatically. However, if automatic
43 detection fails, TEXT format is assumed. As a last resort, you can
44 set the file type manually with option **-f**.
45
46 This script is an implementation of the *reservoir sampling*
47 algorithm (or *Algorithm R (3.4.2)*) described in Donald Knuth's
48 [*The Art of Computer Programming*](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming).
49 It is designed to randomly pull a small sample from a
50 (potentially) huge input file of indeterminate size, which
51 (potentially) doesn't fit into main memory. The beauty of reservoir
52 sampling is that it requires only one pass through the input file.
53 The memory consumption of the algorithm is proportional to the
54 sample size, so large samples consume a lot of memory because
55 the whole sample is held in memory. The size of the input file,
56 on the other hand, is irrelevant.
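The core idea of Algorithm R can be sketched in a few lines of `awk` for the one-entry-per-line (TEXT) case. This is an illustrative sketch under stated assumptions, not the script's actual Perl code: the function name `reservoir_sample_lines` and its arguments are hypothetical, and random numbers drawn by `awk` will not match those of `sample_fastx-txt.pl` for the same seed.

```shell
# Hypothetical helper illustrating reservoir sampling (Algorithm R).
# Usage: reservoir_sample_lines N [SEED] < infile.txt
reservoir_sample_lines() {
    awk -v n="$1" -v seed="${2:-1}" '
        BEGIN { srand(seed) }
        $0 == "" { next }                      # empty lines are not sampled
        { seen++ }                             # entries read so far
        seen <= n { pool[seen] = $0; next }    # fill the reservoir first
        {
            j = int(rand() * seen) + 1         # uniform integer in 1..seen
            if (j <= n) pool[j] = $0           # keep this entry with probability n/seen
        }
        END {
            kept = (seen < n) ? seen : n       # input may be smaller than N
            for (i = 1; i <= kept; i++) print pool[i]
        }
    '
}
```

Called as `reservoir_sample_lines 100 123 < infile.txt > subsample.txt`, this holds at most 100 lines in memory, however large the input is, and reads the input only once.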
57
58 An alternative tool, which is a lot faster, is `seqtk sample` from
59 the [*seqtk toolkit*](https://github.com/lh3/seqtk).
60
61 ## Usage
62
63 ### Subsample paired-end read data and retain pairing
64
65 perl sample_fastx-txt.pl -i read-pair_1.fq -n 1000000 -s 123 > sub-pair_1.fq
66
67 perl sample_fastx-txt.pl -i read-pair_2.fq -n 1000000 -s 123 > sub-pair_2.fq
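Using the same seed makes both runs draw the same random numbers, so the same record positions are kept in each file. This presumably retains pairing only if both input files contain the same number of reads in the same order. A quick sanity check on the two outputs can be sketched as follows; the helper names `fastq_ids` and `check_pairing` are hypothetical, and the sketch assumes four-line FASTQ records without empty lines whose IDs differ at most by a trailing `/1` or `/2`.

```shell
# Hypothetical helpers (not part of sample_fastx-txt.pl).

# Print one read ID per record: the first word of each header line,
# without the leading "@" and without a trailing /1 or /2 mate suffix.
fastq_ids() {
    awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Exit status 0 only if both files list the same read IDs in the same order.
# Usage: check_pairing sub-pair_1.fq sub-pair_2.fq
check_pairing() {
    _ids_tmp=$(mktemp) || return 1
    fastq_ids "$1" > "$_ids_tmp"
    fastq_ids "$2" | cmp -s "$_ids_tmp" -
    _rc=$?
    rm -f "$_ids_tmp"
    return "$_rc"
}
```

For example, `check_pairing sub-pair_1.fq sub-pair_2.fq && echo "pairing retained"` prints the message only when the IDs match.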
68
69 ### Subsample TEXT file and skip three header lines during subsampling
70
71 perl sample_fastx-txt.pl -i infile.txt -n 100 -f text -t 3 > subsample.txt
72
73 ### Subsample TEXT file and remove two header lines for final output
74
75 perl sample_fastx-txt.pl -i infile.txt -n 350 -t 2 | sed '1,2d' > sub_no-header.txt
76
77 ## Options
78
79 ### Mandatory options
80
81 - -i, -input
82
83 Input FASTA/Q or TEXT file, or piped *STDIN* (-)
84
85 - -n, -num
86
87 Number of entries/reads to subsample
88
89 ### Optional options
90
91 - -h, -help
92
93 Help (perldoc POD)
94
95 - -f, -file_type
96
97 Set the file type manually [fasta|fastq|text]
98
99 - -s, -seed
100
101 Set starting random seed. For **paired-end** read data use the **same random seed** for both FASTQ files with option **-s** to retain pairing (see [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing) above).
102
103 - -t, -title_skip
104
105 Skip the specified number of header lines in TEXT files before subsampling and print them again at the top of the output afterwards. If you want to remove the header as well, pipe the subsample output to [`sed`](https://www.gnu.org/software/sed/manual/sed.html) (see `man sed` and [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output) above).
106
107 - -v, -version
108
109 Print version number to *STDERR*
110
111 ## Output
112
113 - *STDOUT*
114
115 The subsample of the input file is printed to *STDOUT*. Redirect or pipe into another tool as needed.
116
117 ## Run environment
118
119 The Perl script runs under Windows and UNIX flavors.
120
121 ## Author - contact
122
123 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
124
125 ## Acknowledgements
126
127 I got the idea for reservoir sampling from Sean Eddy's keynote at
128 the Janelia meeting on [*High Throughput Sequencing for Neuroscience*](http://cryptogenomicon.wordpress.com/2014/11/01/high-throughput-sequencing-for-neuroscience/),
129 which he posted on his blog
130 [*Cryptogenomicon*](http://cryptogenomicon.wordpress.com/). The [*Wikipedia article*](https://en.wikipedia.org/wiki/Reservoir_sampling) and the
131 [*PerlMonks*](http://www.perlmonks.org/index.pl?node_id=177092) implementation also helped a lot.
132
133 ## Citation, installation, and license
134
135 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
136
137 ## Changelog
138
139 - v0.1 (18.11.2014)