comparison COG/bac-genomics-scripts/sam_insert-size/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
parents
children
comparison
equal deleted inserted replaced
2:97e4e3e818b6 3:e42d30da7a74
1 sam_insert-size
2 ===============
3
4 `sam_insert-size.pl` is a script to calculate insert size and read length statistics for paired-end reads in SAM/BAM format.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [Options](#options)
10 * [Mandatory options](#mandatory-options)
11 * [Optional options](#optional-options)
12 * [Output](#output)
13 * [Run environment](#run-environment)
14 * [Dependencies](#dependencies)
15 * [Author - contact](#author---contact)
16 * [Acknowledgements](#acknowledgements)
17 * [Citation, installation, and license](#citation-installation-and-license)
18 * [Changelog](#changelog)
19
20 ## Synopsis
21
22 perl sam_insert-size.pl -i file.sam
23
24 **or**
25
26 samtools view -h file.bam | perl sam_insert-size.pl -i -
27
28 ## Description
29
30 Calculate insert size and read length statistics for paired-end reads
31 in SAM/BAM alignment format. The program gives the arithmetic mean,
32 median, and standard deviation (stdev) among other statistical values.
33
34 Insert size is defined as the total length of the original fragment
35 put into sequencing, i.e. the sequenced DNA fragment between the
36 adaptors. The 16-bit FLAG of the SAM/BAM file is used to filter reads
37 (see the [SAM specifications](http://samtools.sourceforge.net/SAM1.pdf)).
38
39 **Read length** statistics are calculated for all mapped reads
40 (irrespective of their pairing).
41
42 **Insert size** statistics are calculated only for **paired reads**.
43 Typically, the insert size is perturbed by artifacts, like chimeras,
44 structural re-arrangements or alignment errors, which result in a
45 very high maximum insert size measure. As a consequence the mean and
46 stdev can be strongly misleading regarding the real distribution. To
47 avoid this, two methods are implemented that first trim the insert
48 size distribution to a 'core' to calculate the respective statistics.
49 Additionally, secondary alignments for multiple mapping reads and
50 supplementary alignments for chimeric reads, as well as insert sizes
51 of zero are not considered (option **-min_ins_cutoff** is set to
52 **one** by default).
53
54 The **-a|-align** method includes only proper/concordant paired reads
55 in the statistical calculations (as determined by the mapper and the
56 options for insert size minimum and maximum used for mapping). This
57 is the **default** method.
58
59 The **-p|-percentile** method first calculates insert size statistics
60 for all read pairs, where the read and the mate are mapped ('raw
61 data'). Subsequently, the 10th and the 90th percentile are discarded
62 to calculate the 10% truncated mean and stdev. Discarding the lowest
63 and highest 10% of insert sizes gives the advantage of robustness
64 (insensitivity to outliers) and higher efficiency in heavy-tailed
65 distributions.
66
67 Alternative tools, which are a lot faster, are [`CollectInsertSizeMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectInsertSizeMetrics)
68 from [Picard Tools](https://broadinstitute.github.io/picard/) and
69 [`sam-stats`](https://code.google.com/p/ea-utils/wiki/SamStats) from
70 [ea-utils](https://code.google.com/p/ea-utils/).
71
72 ## Usage
73
74 samtools view -h file.bam | perl sam_insert-size.pl -i - -p -d -f -min 50 -max 500 -n 2000000 -xlim_i 350 -xlim_r 200
75
76 ## Options
77
78 ### Mandatory options
79
80 - -i, -input
81
82 Input SAM file or piped *STDIN* (-) from a BAM file e.g. with [`samtools view`](http://www.htslib.org/doc/samtools-1.1.html) from [Samtools](http://www.htslib.org/)
83
84 - -a, -align
85
86 **Default method:** Align method to calculate insert size statistics, includes only reads which are mapped in a proper/concordant pair (as determined by the mapper). Excludes option **-p**.
87
88 **or**
89
90 - -p, -percentile
91
92 Percentile method to calculate insert size statistics, includes only read pairs with an insert size within the 10th and the 90th percentile range of all mapped read pairs. However, the frequency distribution as well as the histogram will be plotted with the 'raw' insert size data before percentile filtering. Excludes option **-a**.
93
94 ### Optional options
95
96 - -h, -help
97
98 Help (perldoc POD)
99
100 - -d, -distro
101
102 Create distribution histograms for the insert sizes and read lengths with [R](http://www.r-project.org/). The calculated median and mean (that are printed to *STDOUT*) are plotted as vertical lines into the histograms. Use it to control the correctness of the statistical calculations.
103
104 - -f, -frequencies
105
106 Print the frequencies of the insert sizes and read lengths to tab-delimited files 'ins_frequency.txt' and 'read_frequency.txt', respectively.
107
108 - -max, -max_ins_cutoff
109
110 Set a maximal insert size cutoff, all insert sizes above this cutoff will be discarded (doesn't affect read length). With **-min** and **-max** you can basically run both methods, by first running the script with **-p** and then using the 10th and 90th percentile of the 'raw data' as **-min** and **-max** for option **-a**.
111
112 - -min, -min_ins_cutoff
113
114 Set a minimal insert size cutoff [default = 1]
115
116 - -n, -num_read
117
118 Number of reads to sample for the calculations from the start of the SAM/BAM file. Significant statistics can usually be calculated from a fraction of the total SAM/BAM alignment file.
119
120 - -xlim_i, -xlim_ins
121
122 Set an upper limit for the x-axis of the **'R' insert size** histogram, overriding automatic truncation of the histogram tail. The default cutoff is one and a half times the third quartile Q3 (75th percentile) value. The minimal cutoff is set to the lowest insert size automatically. Forces option **-d**.
123
124 - -xlim_r, -xlim_read
125
126 Set an upper limit for the x-axis of the optional **'R' read length** histogram. Default value is as in **-xlim_i**. Forces option **-d**.
127
128 - -v, -version
129
130 Print version number to *STDERR*
131
132 ## Output
133
134 - *STDOUT*
135
136 Calculated stats are printed to *STDOUT*
137
138 - ./results
139
140 All **optional** output files are stored in this results folder
141
142 - (./results/ins_frequency.txt)
143
144 Frequencies of insert size 'raw data', tab-delimited
145
146 - (./results/ins_histo.pdf)
147
148 Distribution histogram for the insert size 'raw data'
149
150 - (./results/read_frequency.txt)
151
152 Frequencies of read lengths, tab-delimited
153
154 - (./results/read_histo.pdf)
155
156 Distribution histogram for the read lengths. Not informative if there's no variation in the read lengths.
157
158 ## Run environment
159
160 The Perl script runs under Windows and UNIX flavors.
161
162 ## Dependencies
163
164 - `Statistics::Descriptive`
165
166 Perl module to calculate descriptive statistics, if not installed already get it from [CPAN](http://www.cpan.org/)
167
168 - Statistical computing language [R](http://www.r-project.org/)
169
170 `Rscript` is needed to plot the histograms with option **-d**
171
172 ## Author - contact
173
174 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
175
176 ## Acknowledgements
177
178 References/thanks go to:
179
180 - Tobias Rausch's online courses/workshops (EMBL Heidelberg) on the introduction to SAM files and flags (http://www.embl.de/~rausch/)
181
182 - The CBS NGS Analysis course for the percentile filtering idea (http://www.cbs.dtu.dk/courses/27626/programme.php)
183
184 ## Citation, installation, and license
185
186 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
187
188 ## Changelog
189
190 - v0.2 (29.10.2014)
191 - Fixed bug for options '-min_ins_size' and '-max_ins_size'
192 - warn if result files already exist
193 - simplify prints to R script with Perl function 'select'
194 - minor Perl syntax changes so all Perl scripts conform to the same syntax
195 - minor changes to POD
196 - finally included README.md
197 - v0.1 (27.11.2013)