calc_fastq-stats
================

`calc_fastq-stats.pl` is a script to calculate basic statistics for bases and reads in a FASTQ file.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Dependencies](#dependencies)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl calc_fastq-stats.pl -i reads.fastq

**or**

    gzip -dc reads.fastq.gz | perl calc_fastq-stats.pl -i -

## Description

The script calculates simple statistics for a FASTQ file: individual and
total base counts, GC content, and basic statistics for the read lengths
and read/base qualities. The GC content calculation does
not include 'N's. Stats are printed to *STDOUT* and optionally to an
output file.
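
As an illustration of a GC content calculation that leaves out 'N's, here is a minimal Python sketch. It is not taken from the Perl script: the function name `base_stats` is hypothetical, and the formula (G + C) / (A + C + G + T) is an assumption about what "does not include 'N's" means.

```python
from collections import Counter


def base_stats(sequences):
    """Count bases over all reads and compute GC content, ignoring N's.

    Assumption: GC content = (G + C) / (A + C + G + T), so N's (and any
    other symbols) are left out of the denominator.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    called = gc + at
    gc_content = gc / called if called else 0.0
    total_bases = sum(counts.values())
    return counts, total_bases, gc_content


# Two toy reads: 9 bases in total, one of them an N
counts, total, gc = base_stats(["ACGTN", "GGCC"])
print(dict(counts), total, gc)
```

With these two toy reads, six of the eight called bases are G or C, so the GC content is 0.75, although nine bases were counted in total.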

Because read quality degrades along the length of a read on all NGS
machines, it is advisable to also plot the quality for each cycle, as
implemented in tools like
[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
or the [fastx-toolkit](http://hannonlab.cshl.edu/fastx_toolkit/).

If the sequence and the quality values are interrupted by line
breaks (i.e. a read is **not** represented by four lines), please fix
this with Heng Li's [seqtk](https://github.com/lh3/seqtk):

    seqtk seq -l 0 infile.fastq > outfile.fastq
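
To decide whether a file needs this fix, the four-line layout can be checked beforehand. The following Python sketch is only a heuristic and not part of the script: the function name `is_four_line_fastq` is hypothetical, and it merely tests that each group of four lines looks like `@header`, sequence, `+`, quality with equal sequence and quality lengths.

```python
import io
from itertools import zip_longest


def is_four_line_fastq(handle):
    """Heuristically check that every record spans exactly four lines."""
    stripped = (line.rstrip("\r\n") for line in handle)
    # Read the stream in groups of four lines; short groups are padded with None
    for header, seq, plus, qual in zip_longest(*[stripped] * 4):
        if qual is None:  # incomplete final record
            return False
        if not header.startswith("@") or not plus.startswith("+"):
            return False
        if len(seq) != len(qual):
            return False
    return True


good = "@r1\nACGT\n+\nIIII\n"
wrapped = "@r1\nAC\nGT\n+\nII\nII\n"  # sequence and quality split over two lines
print(is_four_line_fastq(io.StringIO(good)))
print(is_four_line_fastq(io.StringIO(wrapped)))
```

The check streams the input, so it should also work on a pipe such as decompressed gzip output passed as a file handle.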

An alternative tool, which is a lot faster, is **fastq-stats** from
[ea-utils](https://code.google.com/p/ea-utils/).

## Usage

    zcat reads.fastq.gz | perl calc_fastq-stats.pl -i - -q 64 -c 175000000 -n 3000000

## Options

### Mandatory options

- -i, -input

    Input FASTQ file or piped STDIN (-) from a gzipped file

- -q, -qual_offset

    ASCII quality offset of the Phred (Sanger) quality values [default 33]
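
To show what the quality offset does, here is a small Python sketch (not the script's own code; `phred_scores` is a hypothetical helper). Each quality character is converted to a Phred score by subtracting the offset from its ASCII code, so the same score is written with different characters under offset 33 and offset 64.

```python
def phred_scores(quality_string, offset=33):
    """Convert an ASCII-encoded quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


# Offset 33 (the default): 'I' is ASCII 73, '5' is 53, '!' is 33
print(phred_scores("II5!"))             # [40, 40, 20, 0]

# Offset 64, as in the usage example with -q 64: 'h' is ASCII 104, 'T' is 84, 'B' is 66
print(phred_scores("hhTB", offset=64))  # [40, 40, 20, 2]
```

Decoding offset-64 data with offset 33 would inflate every score by 31, which is why the offset has to match the file.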

### Optional options

- -h, -help

    Help (perldoc POD)

- -c, -coverage_limit

    Number of bases to sample from the top of the file

- -n, -num_read

    Number of reads to sample from the top of the file

- -o, -output

    Print stats to the specified output file in addition to *STDOUT*

- -v, -version

    Print version number to *STDERR*
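
Sampling from the top of the file with `-c` and `-n` can be pictured with the Python sketch below. It is a sketch under assumptions, not the script's implementation: the function `sample_reads` and its parameter names are hypothetical, the input is assumed to be four-line FASTQ, the read that crosses the base limit is kept, and reading stops at whichever limit is reached first. How the script itself handles these two points is not stated here.

```python
import io


def sample_reads(handle, num_read=None, coverage_limit=None):
    """Yield (sequence, quality) pairs from the top of a four-line FASTQ stream.

    Assumption: stop once num_read reads or coverage_limit bases have been
    seen, whichever happens first; the read crossing the limit is included.
    """
    reads = bases = 0
    while True:
        header = handle.readline()
        if not header:  # end of input
            break
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\r\n")
        yield seq, qual
        reads += 1
        bases += len(seq)
        if num_read is not None and reads >= num_read:
            break
        if coverage_limit is not None and bases >= coverage_limit:
            break


fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nAC\n+\nII\n"
by_reads = list(sample_reads(io.StringIO(fastq), num_read=2))
by_bases = list(sample_reads(io.StringIO(fastq), coverage_limit=5))
print(len(by_reads), len(by_bases))
```

Because only the top of the file is read, a sample taken this way is fast to obtain but is not a random subsample of the reads.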

## Output

- *STDOUT*

    Calculated stats are printed to *STDOUT*

- (outfile)

    Optional outfile for stats

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Dependencies

If the following modules are not installed, get them from
[CPAN](http://www.cpan.org/):

- `Statistics::Descriptive`

    Perl module to calculate basic descriptive statistics

- `Statistics::Descriptive::Discrete`

    Perl module to calculate descriptive statistics for discrete data sets

- `Statistics::Descriptive::Weighted`

    Perl module to calculate descriptive statistics for weighted variates

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information, please see the main repository [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

- v0.1 (28.10.2014)