annotate COG/bac-genomics-scripts/sam_insert-size/README.md @ 13:152d7c43478b draft default tip

Uploaded
author dereeper
date Thu, 30 May 2024 20:07:55 +0000
parents e42d30da7a74
children

sam_insert-size
===============

`sam_insert-size.pl` is a script to calculate insert size and read length statistics for paired-end reads in SAM/BAM format.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Dependencies](#dependencies)
* [Author - contact](#author---contact)
* [Acknowledgements](#acknowledgements)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl sam_insert-size.pl -i file.sam

**or**

    samtools view -h file.bam | perl sam_insert-size.pl -i -

## Description

Calculate insert size and read length statistics for paired-end reads
in SAM/BAM alignment format. The program gives the arithmetic mean,
median, and standard deviation (stdev) among other statistical values.

Insert size is defined as the total length of the original fragment
put into sequencing, i.e. the sequenced DNA fragment between the
adaptors. The 16-bit FLAG of the SAM/BAM file is used to filter reads
(see the [SAM specifications](http://samtools.sourceforge.net/SAM1.pdf)).

**Read length** statistics are calculated for all mapped reads
(irrespective of their pairing).

**Insert size** statistics are calculated only for **paired reads**.
Typically, the insert size is perturbed by artifacts, like chimeras,
structural re-arrangements or alignment errors, which result in a
very high maximum insert size measure. As a consequence, the mean and
stdev can be strongly misleading regarding the real distribution. To
avoid this, two methods are implemented that first trim the insert
size distribution to a 'core' to calculate the respective statistics.
Additionally, secondary alignments for multiple mapping reads and
supplementary alignments for chimeric reads, as well as insert sizes
of zero, are not considered (option **-min_ins_cutoff** is set to
**one** by default).
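
The FLAG-based filtering described above can be sketched in Python. This is only an illustration of the bit tests, not the script's actual Perl code. The bit values are taken from the SAM specification; the function names, the `proper_only` switch, and the use of the absolute TLEN column value as the insert size are assumptions made for this example.

```python
# SAM FLAG bits as defined in the SAM specification
PAIRED = 0x1           # read is paired in sequencing
PROPER_PAIR = 0x2      # mapper marked the pair as proper/concordant
UNMAPPED = 0x4         # read itself is unmapped
MATE_UNMAPPED = 0x8    # mate is unmapped
SECONDARY = 0x100      # secondary alignment (multiple mapping read)
SUPPLEMENTARY = 0x800  # supplementary alignment (chimeric read)


def counts_for_read_length(flag):
    """Hypothetical helper: any mapped read counts, irrespective of pairing."""
    return not flag & UNMAPPED


def counts_for_insert_size(flag, tlen, proper_only=True, min_ins=1, max_ins=None):
    """Hypothetical helper: decide whether one SAM record contributes an insert size.

    proper_only=True mimics the idea behind -a (proper pairs only);
    proper_only=False mimics the 'raw data' of -p (read and mate mapped).
    """
    if not flag & PAIRED:
        return False
    if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    if proper_only and not flag & PROPER_PAIR:
        return False
    size = abs(tlen)  # assumption: insert size taken from the TLEN column
    if size < min_ins:  # the default of 1 drops insert sizes of zero
        return False
    if max_ins is not None and size > max_ins:
        return False
    return True


if __name__ == "__main__":
    # FLAG 99 = paired + proper pair + mate reverse strand + first in pair
    print(counts_for_insert_size(99, 250))
    # FLAG 73 = paired + mate unmapped + first in pair
    print(counts_for_insert_size(73, 0))
```

The sketch judges each record on its own; how both records of a pair are reconciled into a single insert size is not covered here.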

The **-a|-align** method includes only proper/concordant paired reads
in the statistical calculations (as determined by the mapper and the
options for insert size minimum and maximum used for mapping). This
is the **default** method.

The **-p|-percentile** method first calculates insert size statistics
for all read pairs, where the read and the mate are mapped ('raw
data'). Subsequently, insert sizes below the 10th and above the 90th
percentile are discarded to calculate the 10% truncated mean and
stdev. Discarding the lowest and highest 10% of insert sizes gives the
advantage of robustness (insensitivity to outliers) and higher
efficiency in heavy-tailed distributions.
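
The idea behind the percentile method can be illustrated with a short Python sketch. It is not the script's Perl implementation: the nearest-rank percentile definition and the function names `percentile` and `truncated_stats` are assumptions chosen for this example, and the script's own percentile calculation may differ in detail.

```python
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile (an assumed definition) of a sorted list.

    p is an integer percentage; integer arithmetic avoids float rounding.
    """
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def truncated_stats(insert_sizes, lower=10, upper=90):
    """Mean and stdev of the insert sizes within [lower, upper] percentiles."""
    values = sorted(insert_sizes)
    low = percentile(values, lower)
    high = percentile(values, upper)
    core = [v for v in values if low <= v <= high]
    return statistics.mean(core), statistics.stdev(core)


if __name__ == "__main__":
    # Made-up insert sizes with one chimera-like outlier
    sizes = [150, 190, 195, 200, 200, 205, 210, 215, 220, 9000]
    raw_mean = statistics.mean(sizes)
    core_mean, core_stdev = truncated_stats(sizes)
    print(f"raw mean: {raw_mean:.1f}")
    print(f"truncated mean: {core_mean:.1f}, truncated stdev: {core_stdev:.1f}")
```

With the made-up data above, the single outlier pulls the raw mean far away from the bulk of the values, while the truncated mean stays close to it.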

Alternative tools, which are a lot faster, are [`CollectInsertSizeMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectInsertSizeMetrics)
from [Picard Tools](https://broadinstitute.github.io/picard/) and
[`sam-stats`](https://code.google.com/p/ea-utils/wiki/SamStats) from
[ea-utils](https://code.google.com/p/ea-utils/).

## Usage

    samtools view -h file.bam | perl sam_insert-size.pl -i - -p -d -f -min 50 -max 500 -n 2000000 -xlim_i 350 -xlim_r 200

## Options

### Mandatory options

- -i, -input

    Input SAM file or piped *STDIN* (-) from a BAM file, e.g. with [`samtools view`](http://www.htslib.org/doc/samtools-1.1.html) from [Samtools](http://www.htslib.org/)

- -a, -align

    **Default method:** Align method to calculate insert size statistics; includes only reads which are mapped in a proper/concordant pair (as determined by the mapper). Excludes option **-p**.

**or**

- -p, -percentile

    Percentile method to calculate insert size statistics; includes only read pairs with an insert size within the 10th and the 90th percentile range of all mapped read pairs. However, the frequency distribution as well as the histogram will be plotted with the 'raw' insert size data before percentile filtering. Excludes option **-a**.

### Optional options

- -h, -help

    Help (perldoc POD)

- -d, -distro

    Create distribution histograms for the insert sizes and read lengths with [R](http://www.r-project.org/). The calculated median and mean (that are printed to *STDOUT*) are plotted as vertical lines into the histograms. Use it to check the correctness of the statistical calculations.

- -f, -frequencies

    Print the frequencies of the insert sizes and read lengths to tab-delimited files 'ins_frequency.txt' and 'read_frequency.txt', respectively.

- -max, -max_ins_cutoff

    Set a maximal insert size cutoff; all insert sizes above this cutoff will be discarded (doesn't affect read length). With **-min** and **-max** you can basically run both methods, by first running the script with **-p** and then using the 10th and 90th percentile of the 'raw data' as **-min** and **-max** for option **-a**.

- -min, -min_ins_cutoff

    Set a minimal insert size cutoff [default = 1]

- -n, -num_read

    Number of reads to sample for the calculations from the start of the SAM/BAM file. Significant statistics can usually be calculated from a fraction of the total SAM/BAM alignment file.

- -xlim_i, -xlim_ins

    Set an upper limit for the x-axis of the **'R' insert size** histogram, overriding automatic truncation of the histogram tail. The default cutoff is one and a half times the third quartile Q3 (75th percentile) value. The minimal cutoff is set to the lowest insert size automatically. Forces option **-d**.

- -xlim_r, -xlim_read

    Set an upper limit for the x-axis of the optional **'R' read length** histogram. Default value is as in **-xlim_i**. Forces option **-d**.

- -v, -version

    Print version number to *STDERR*
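
The default x-axis cutoff described for **-xlim_i** and **-xlim_r** (one and a half times the third quartile) can be illustrated with a small Python sketch. The function name `default_xlim` is hypothetical, and the quartile is computed with the default method of Python's `statistics.quantiles`, which is an assumption and not necessarily how the script or R derives Q3.

```python
import statistics


def default_xlim(values, user_limit=None):
    """Upper x-axis limit: the user's value if given, else 1.5 * Q3."""
    if user_limit is not None:
        return user_limit  # corresponds to an explicit -xlim_i / -xlim_r value
    q3 = statistics.quantiles(values, n=4)[2]  # third of the three quartile cut points
    return 1.5 * q3


if __name__ == "__main__":
    sizes = [180, 190, 200, 205, 210, 220, 230, 9000]  # made-up insert sizes
    print(default_xlim(sizes))        # automatic truncation of the long tail
    print(default_xlim(sizes, 350))   # explicit upper limit
```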

## Output

- *STDOUT*

    Calculated stats are printed to *STDOUT*

- ./results

    All **optional** output files are stored in this results folder

- (./results/ins_frequency.txt)

    Frequencies of insert size 'raw data', tab-delimited

- (./results/ins_histo.pdf)

    Distribution histogram for the insert size 'raw data'

- (./results/read_frequency.txt)

    Frequencies of read lengths, tab-delimited

- (./results/read_histo.pdf)

    Distribution histogram for the read lengths. Not informative if there's no variation in the read lengths.
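
A tab-delimited frequency table of the kind listed above can be pictured with the following Python sketch. The column order (value, then count), the absence of a header line, and the function name `frequency_table` are assumptions; the exact layout of the script's files is not documented in this README.

```python
from collections import Counter


def frequency_table(values):
    """Return 'value<TAB>count' lines, sorted by value."""
    counts = Counter(values)
    return [f"{value}\t{count}" for value, count in sorted(counts.items())]


if __name__ == "__main__":
    insert_sizes = [250, 300, 250, 251]  # made-up insert sizes
    print("\n".join(frequency_table(insert_sizes)))
```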

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Dependencies

- `Statistics::Descriptive`

    Perl module to calculate descriptive statistics; if not already installed, get it from [CPAN](http://www.cpan.org/)

- Statistical computing language [R](http://www.r-project.org/)

    `Rscript` is needed to plot the histograms with option **-d**

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Acknowledgements

References/thanks go to:

- Tobias Rausch's online courses/workshops (EMBL Heidelberg) on the introduction to SAM files and flags (http://www.embl.de/~rausch/)

- The CBS NGS Analysis course for the percentile filtering idea (http://www.cbs.dtu.dk/courses/27626/programme.php)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

- v0.2 (29.10.2014)
    - Fixed bug for options '-min_ins_size' and '-max_ins_size'
    - warn if result files already exist
    - simplify prints to R script with Perl function 'select'
    - minor Perl syntax changes so all Perl scripts conform to the same syntax
    - minor changes to POD
    - finally included README.md
- v0.1 (27.11.2013)