Mercurial > repos > dereeper > pangenome_explorer
diff COG/bac-genomics-scripts/calc_fastq-stats/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COG/bac-genomics-scripts/calc_fastq-stats/README.md Thu May 30 11:52:25 2024 +0000 @@ -0,0 +1,129 @@ +calc_fastq-stats +================ + +`calc_fastq-stats.pl` is a script to calculate basic statistics for bases and reads in a FASTQ file. + +* [Synopsis](#synopsis) +* [Description](#description) +* [Usage](#usage) +* [Options](#options) + * [Mandatory options](#mandatory-options) + * [Optional options](#optional-options) +* [Output](#output) +* [Run environment](#run-environment) +* [Dependencies](#dependencies) +* [Author - contact](#author---contact) +* [Citation, installation, and license](#citation-installation-and-license) +* [Changelog](#changelog) + +## Synopsis + + perl calc_fastq-stats.pl -i reads.fastq + +**or** + + gzip -dc reads.fastq.gz | perl calc_fastq-stats.pl -i - + +## Description + +The script calculates some simple statistics, like individual and total base +counts, GC content, and basic stats for the read lengths, and +read/base qualities in a FASTQ file. The GC content calculation does +not include 'N's. Stats are printed to *STDOUT* and optionally to an +output file. + +Because the quality of a read degrades over its length with all NGS +machines, it is advisable to also plot the quality for each cycle as +implemented in tools like +[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) +or the [fastx-toolkit](http://hannonlab.cshl.edu/fastx_toolkit/). + +If the sequence and the quality values are interrupted by line +breaks (i.e. a read is **not** represented by four lines), please fix +with Heng Li's [seqtk](https://github.com/lh3/seqtk): + + seqtk seq -l 0 infile.fastq > outfile.fastq + +An alternative tool, which is a lot faster, is **fastq-stats** from +[ea-utils](https://code.google.com/p/ea-utils/). + +## Usage + + zcat reads.fastq.gz | perl calc_fastq-stats.pl -i - -q 64 -c 175000000 -n 3000000 + +## Options + +### Mandatory options + +- -i, -input + +Input FASTQ file or piped STDIN (-) from a gzipped file + +- -q, -qual_offset + +ASCII quality offset of the Phred (Sanger) quality values [default 33] + +### Optional options + +- -h, -help: + +Help (perldoc POD) + +- -c, -coverage_limit + +Number of bases to sample from the top of the file + +- -n, -num_read + +Number of reads to sample from the top of the file + +- -o, -output + +Print stats in addition to *STDOUT* to the specified output file + +- -v, -version + +Print version number to *STDERR* + +## Output + +- *STDOUT* + +Calculated stats are printed to *STDOUT* + +- (outfile) + +Optional outfile for stats + +## Run environment + +The Perl script runs under Windows and UNIX flavors. + +## Dependencies + +If the following modules are not installed get them from +[CPAN](http://www.cpan.org/): + +- `Statistics::Descriptive` + +Perl module to calculate basic descriptive statistics + +- `Statistics::Descriptive::Discrete` + +Perl module to calculate descriptive statistics for discrete data sets + +- `Statistics::Descriptive::Weighted` + +Perl module to calculate descriptive statistics for weighted variates + +## Author - contact + +Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) + +## Citation, installation, and license + +For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). + +## Changelog + +- v0.1 (28.10.2014)