comparison COG/bac-genomics-scripts/calc_fastq-stats/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
2:97e4e3e818b6 3:e42d30da7a74
1 calc_fastq-stats
2 ================
3
4 `calc_fastq-stats.pl` is a script to calculate basic statistics for bases and reads in a FASTQ file.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [Options](#options)
10 * [Mandatory options](#mandatory-options)
11 * [Optional options](#optional-options)
12 * [Output](#output)
13 * [Run environment](#run-environment)
14 * [Dependencies](#dependencies)
15 * [Author - contact](#author---contact)
16 * [Citation, installation, and license](#citation-installation-and-license)
17 * [Changelog](#changelog)
18
19 ## Synopsis
20
21 perl calc_fastq-stats.pl -i reads.fastq
22
23 **or**
24
25 gzip -dc reads.fastq.gz | perl calc_fastq-stats.pl -i -
26
27 ## Description
28
29 The script calculates simple statistics for a FASTQ file: individual and
30 total base counts, GC content, and basic stats for the read lengths and
31 the read/base qualities. The GC content calculation does
32 not include 'N's. Stats are printed to *STDOUT* and optionally to an
33 output file.
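
As a rough illustration of the GC calculation described above, the following shell sketch reports GC content with 'N's left out of the denominator. It is not the script's actual code: the function name `gc_content` is hypothetical, and it assumes four-line FASTQ records and that "does not include 'N's" means GC / (A + C + G + T).

```shell
# Hypothetical helper, not part of calc_fastq-stats.pl.
# Assumes four-line FASTQ records, so the sequence is line 2 of each record.
gc_content() {
    awk 'NR % 4 == 2 {
             seq = toupper($0)
             gc += gsub(/[GC]/, "", seq)   # gsub returns the number of replacements
             at += gsub(/[AT]/, "", seq)   # N and other symbols are never counted
         }
         END {
             if (gc + at > 0) printf "%.2f\n", 100 * gc / (gc + at)
             else             print "NA"
         }' "$@"
}
```

With no file argument the function reads *STDIN*, so it can sit at the end of a `gzip -dc` pipe.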
34
35 Because the quality of a read degrades over its length with all NGS
36 machines, it is advisable to also plot the quality for each cycle as
37 implemented in tools like
38 [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
39 or the [fastx-toolkit](http://hannonlab.cshl.edu/fastx_toolkit/).
40
41 If the sequence and the quality values are interrupted by line
42 breaks (i.e. a read is **not** represented by exactly four lines), fix the file
43 with Heng Li's [seqtk](https://github.com/lh3/seqtk):
44
45 seqtk seq -l 0 infile.fastq > outfile.fastq
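
To find out whether a file needs this fix at all, a heuristic check along the following lines may help. This is only a sketch: `is_four_line_fastq` is a hypothetical name, not part of seqtk or this script. It checks lines by position because quality strings may themselves start with `@` or `+`.

```shell
# Hypothetical pre-check, not part of calc_fastq-stats.pl or seqtk.
# Exit status 0 if every record spans exactly four lines, 1 otherwise.
is_four_line_fastq() {
    awk 'NR % 4 == 1 && !/^@/            { bad = 1; exit }   # header line
         NR % 4 == 2                     { len = length($0) } # sequence line
         NR % 4 == 3 && !/^[+]/          { bad = 1; exit }   # separator line
         NR % 4 == 0 && length($0) != len { bad = 1; exit }  # quality line
         END { exit (bad || NR % 4 != 0) }' "$@"
}
```

A call such as `is_four_line_fastq infile.fastq || echo "needs seqtk seq -l 0"` would then flag wrapped files.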
46
47 An alternative tool, which is a lot faster, is **fastq-stats** from
48 [ea-utils](https://code.google.com/p/ea-utils/).
49
50 ## Usage
51
52 zcat reads.fastq.gz | perl calc_fastq-stats.pl -i - -q 64 -c 175000000 -n 3000000
53
54 ## Options
55
56 ### Mandatory options
57
58 - -i, -input
59
60 Input FASTQ file or piped STDIN (-) from a gzipped file
61
62 - -q, -qual_offset
63
64 ASCII quality offset of the Phred (Sanger) quality values [default 33]
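
The offset determines how each quality character maps to a Phred score: score = ASCII code minus offset. The sketch below illustrates that mapping only. `phred_scores` is a hypothetical helper, not the script's implementation. An offset of 33 is the Sanger convention, and 64 was used by older Illumina pipelines.

```shell
# Hypothetical helper: decode a quality string into Phred scores.
# $1 = quality string, $2 = ASCII offset (default 33).
# Note: plain POSIX sh has no local variables, so these names are global.
phred_scores() {
    qual=$1 offset=${2:-33} out=
    while [ -n "$qual" ]; do
        rest=${qual#?}                 # string without its first character
        char=${qual%"$rest"}           # the first character itself
        code=$(printf '%d' "'$char")   # leading quote makes printf emit the character code
        out="$out$((code - offset)) "
        qual=$rest
    done
    printf '%s\n' "${out% }"
}
```

For example, `I` has ASCII code 73, which is Phred 40 with offset 33, while `h` (104) is Phred 40 with offset 64. Decoding with the wrong offset shifts every score by 31.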
65
66 ### Optional options
67
68 - -h, -help
69
70 Help (perldoc POD)
71
72 - -c, -coverage_limit
73
74 Number of bases to sample from the top of the file
75
76 - -n, -num_read
77
78 Number of reads to sample from the top of the file
79
80 - -o, -output
81
82 Print stats to the specified output file in addition to *STDOUT*
83
84 - -v, -version
85
86 Print version number to *STDERR*
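
For four-line FASTQ, "the first N reads from the top of the file" corresponds to the first 4 × N lines. The sketch below shows that idea outside the script. `first_reads` is a hypothetical helper, and the script's own handling of `-n` may differ in detail.

```shell
# Hypothetical helper: emit the first N reads of a four-line FASTQ.
# $1 = number of reads; remaining arguments are files (STDIN if none).
first_reads() {
    n=$1; shift
    head -n "$((4 * n))" "$@"
}
```

A pipeline such as `gzip -dc reads.fastq.gz | first_reads 3000000 | perl calc_fastq-stats.pl -i -` should then feed the script roughly the reads that `-n 3000000` would sample.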
87
88 ## Output
89
90 - *STDOUT*
91
92 Calculated stats are printed to *STDOUT*
93
94 - (outfile)
95
96 Optional outfile for stats
97
98 ## Run environment
99
100 The Perl script runs under Windows and UNIX flavors.
101
102 ## Dependencies
103
104 If the following modules are not installed, get them from
105 [CPAN](http://www.cpan.org/):
106
107 - `Statistics::Descriptive`
108
109 Perl module to calculate basic descriptive statistics
110
111 - `Statistics::Descriptive::Discrete`
112
113 Perl module to calculate descriptive statistics for discrete data sets
114
115 - `Statistics::Descriptive::Weighted`
116
117 Perl module to calculate descriptive statistics for weighted variates
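
One way to see which of these modules are missing is to ask Perl to load each one. The sketch below is an illustration: `missing_perl_modules` is a hypothetical helper name, and it assumes `perl` is on the `PATH`.

```shell
# Hypothetical helper: print each listed Perl module that fails to load.
missing_perl_modules() {
    for mod in "$@"; do
        # "perl -MModule -e 1" exits non-zero if the module cannot be loaded
        perl -M"$mod" -e 1 2>/dev/null || printf '%s\n' "$mod"
    done
}
```

Running `missing_perl_modules Statistics::Descriptive Statistics::Descriptive::Discrete Statistics::Descriptive::Weighted` prints nothing if all three are available. Any module it lists can usually be installed with the `cpan` client, e.g. `cpan Statistics::Descriptive`.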
118
119 ## Author - contact
120
121 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
122
123 ## Citation, installation, and license
124
125 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
126
127 ## Changelog
128
129 - v0.1 (28.10.2014)