annotate COG/bac-genomics-scripts/sample_fastx-txt/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
sample_fastx-txt
================

`sample_fastx-txt.pl` is a script to randomly subsample FASTA, FASTQ, or TEXT files.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
  * [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing)
  * [Subsample TEXT file and skip three header lines during subsampling](#subsample-text-file-and-skip-three-header-lines-during-subsampling)
  * [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Acknowledgements](#acknowledgements)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl sample_fastx-txt.pl -i infile.fasta -n 100 > subsample.fasta

**or**

    zcat reads.fastq.gz | perl sample_fastx-txt.pl -i - -n 100000 > subsample.fastq

## Description

Randomly subsample FASTA, FASTQ, and TEXT files.

Empty lines in the input files are skipped and not included in
sampling. TEXT format assumes one entry per line. FASTQ
format assumes **four** lines per read. If this is not the case, run
the FASTQ file through [`fastx_fix.pl`](/fastx_fix) or use Heng
Li's [`seqtk seq`](https://github.com/lh3/seqtk):

    seqtk seq -l 0 infile.fq > outfile.fq

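To illustrate what the four-lines-per-read assumption means in practice, here is a minimal Python sketch. It is not the script's own Perl code; the function name `fastq_records` and its checks are assumptions made for illustration. It groups non-empty lines into blocks of four and raises an error when a block does not look like a FASTQ record, which is what happens with line-wrapped sequences.

```python
from typing import Iterable, Iterator, Tuple


def fastq_records(lines: Iterable[str]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (header, sequence, separator, quality) from four-line FASTQ input.

    Hypothetical helper: empty lines are skipped, and every block of four
    non-empty lines is expected to be exactly one read.
    """
    buffer = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # empty lines are ignored
        buffer.append(line)
        if len(buffer) == 4:
            header, sequence, separator, quality = buffer
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"not a four-line FASTQ record: {header!r}")
            if len(sequence) != len(quality):
                raise ValueError(f"sequence/quality length mismatch: {header!r}")
            yield header, sequence, separator, quality
            buffer = []
    if buffer:
        raise ValueError("incomplete FASTQ record at end of input")
```

A file that fails this kind of check is a candidate for the `seqtk seq -l 0` conversion shown above.
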
The file type is detected automatically. If automatic
detection fails, TEXT format is assumed. As a last resort, you can
set the file type manually with option **-f**.

This script is an implementation of the *reservoir sampling*
algorithm (or *Algorithm R (3.4.2)*) described in Donald Knuth's
[*The Art of Computer Programming*](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming).
It is designed to randomly pull a small sample from a
(potentially) huge input file of indeterminate size, which
(potentially) doesn't fit into main memory. The beauty of reservoir
sampling is that it requires only one pass through the input file.
The memory consumption of the algorithm is proportional to the
sample size, so large sample sizes will consume lots of memory because
the whole sample is held in memory. The size of the input file, on the
other hand, does not affect memory use.

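The reservoir sampling idea described above can be sketched in a few lines of Python. This is a generic rendering of Algorithm R, not the script's Perl source; the function name `reservoir_sample` and its parameters are assumptions for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(entries: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k entries chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, entry in enumerate(entries):
        if index < k:
            # The first k entries fill the reservoir directly.
            reservoir.append(entry)
        else:
            # Entry number index + 1 is kept with probability k / (index + 1)
            # and then overwrites a uniformly chosen slot.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = entry
    return reservoir
```

Only the `k` reservoir slots are held in memory, and the input is consumed once, so the sketch works on a stream of unknown length such as a piped file. If the input has fewer than `k` entries, all of them are returned.
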
An alternative tool, which is a lot faster, is `seqtk sample` from
the [*seqtk toolkit*](https://github.com/lh3/seqtk).

## Usage

### Subsample paired-end read data and retain pairing

    perl sample_fastx-txt.pl -i read-pair_1.fq -n 1000000 -s 123 > sub-pair_1.fq

    perl sample_fastx-txt.pl -i read-pair_2.fq -n 1000000 -s 123 > sub-pair_2.fq

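Why an identical seed retains pairing can be shown with a small Python sketch. It assumes a generic Algorithm R sampler (the hypothetical `reservoir_sample` below, not the script's Perl code) and two made-up read lists standing in for the paired FASTQ files. With the same seed and the same number of reads in the same order, the random generator makes the same keep/replace decisions for both inputs.

```python
import random


def reservoir_sample(entries, k, seed):
    """Generic Algorithm R sampler (illustrative, not the script's code)."""
    rng = random.Random(seed)
    reservoir = []
    for index, entry in enumerate(entries):
        if index < k:
            reservoir.append(entry)
        else:
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = entry
    return reservoir


# Made-up read names standing in for the two paired-end files.
forward = [f"read{i}/1" for i in range(1, 1001)]
reverse = [f"read{i}/2" for i in range(1, 1001)]

sub_forward = reservoir_sample(forward, 5, seed=123)
sub_reverse = reservoir_sample(reverse, 5, seed=123)

# Strip the "/1" and "/2" suffixes and compare the read names position by position.
pairs_match = all(f[:-2] == r[:-2] for f, r in zip(sub_forward, sub_reverse))
```

In this sketch the pairing only holds when both inputs contain the same number of reads in the same order; a read missing from one file would shift every later decision.
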
### Subsample TEXT file and skip three header lines during subsampling

    perl sample_fastx-txt.pl -i infile.txt -n 100 -f text -t 3 > subsample.txt

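The effect of skipping header lines can be sketched in Python as follows. This is an illustration under assumptions, not the script's implementation: the function name `subsample_text` is hypothetical, and the sketch assumes the header lines are the first lines of the input and are written out before the sampled entries.

```python
import random
from typing import Iterable, List, Optional


def subsample_text(lines: Iterable[str], k: int, header_lines: int = 0,
                   seed: Optional[int] = None) -> List[str]:
    """Set header lines aside, sample k of the remaining entries, re-emit the header."""
    rng = random.Random(seed)
    header: List[str] = []
    reservoir: List[str] = []
    seen = 0  # number of sampled-from entries read so far
    for line in lines:
        line = line.rstrip("\r\n")
        if len(header) < header_lines:
            header.append(line)  # header lines never enter the sampling
            continue
        if not line.strip():
            continue  # empty lines are not sampled
        if seen < k:
            reservoir.append(line)
        else:
            slot = rng.randint(0, seen)
            if slot < k:
                reservoir[slot] = line
        seen += 1
    return header + reservoir
```

Because the header comes out first in this sketch, dropping it afterwards is a matter of removing the leading lines of the output.
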
### Subsample TEXT file and remove two header lines for final output

    perl sample_fastx-txt.pl -i infile.txt -n 350 -t 2 | sed '1,2d' > sub_no-header.txt

## Options

### Mandatory options

- -i, -input

    Input FASTA/Q or TEXT file, or piped *STDIN* (-)

- -n, -num

    Number of entries/reads to subsample

### Optional options

- -h, -help

    Help (perldoc POD)

- -f, -file_type

    Set the file type manually [fasta|fastq|text]

- -s, -seed

    Set starting random seed. For **paired-end** read data use the **same random seed** for both FASTQ files with option **-s** to retain pairing (see [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing) above).

- -t, -title_skip

    Skip the specified number of header lines in TEXT files before subsampling and append them again afterwards. If you want to get rid of the header as well, pipe the subsample output to [`sed`](https://www.gnu.org/software/sed/manual/sed.html) (see `man sed` and [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output) above).

- -v, -version

    Print version number to *STDERR*

## Output

- *STDOUT*

    The subsample of the input file is printed to *STDOUT*. Redirect or pipe into another tool as needed.

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Acknowledgements

I got the idea for reservoir sampling from Sean Eddy's keynote at
the Janelia meeting on [*High Throughput Sequencing for Neuroscience*](http://cryptogenomicon.wordpress.com/2014/11/01/high-throughput-sequencing-for-neuroscience/),
which he posted on his blog
[*Cryptogenomicon*](http://cryptogenomicon.wordpress.com/). The [*Wikipedia article*](https://en.wikipedia.org/wiki/Reservoir_sampling) and the
[*PerlMonks*](http://www.perlmonks.org/index.pl?node_id=177092) implementation helped a lot as well.

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

- v0.1 (18.11.2014)