0
|
1 .TH samtools 1 "21 April 2011" "samtools-0.1.16" "Bioinformatics tools"
|
|
2 .SH NAME
|
|
3 .PP
|
|
4 samtools - Utilities for the Sequence Alignment/Map (SAM) format
|
|
5 .SH SYNOPSIS
|
|
6 .PP
|
|
7 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
|
|
8 .PP
|
|
9 samtools sort aln.bam aln.sorted
|
|
10 .PP
|
|
11 samtools index aln.sorted.bam
|
|
12 .PP
|
|
13 samtools idxstats aln.sorted.bam
|
|
14 .PP
|
|
15 samtools view aln.sorted.bam chr2:20,100,000-20,200,000
|
|
16 .PP
|
|
17 samtools merge out.bam in1.bam in2.bam in3.bam
|
|
18 .PP
|
|
19 samtools faidx ref.fasta
|
|
20 .PP
|
|
21 samtools pileup -vcf ref.fasta aln.sorted.bam
|
|
22 .PP
|
|
23 samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
|
|
24 .PP
|
|
25 samtools tview aln.sorted.bam ref.fasta
|
|
26
|
|
27 .SH DESCRIPTION
|
|
28 .PP
|
|
29 Samtools is a set of utilities that manipulate alignments in the BAM
|
|
30 format. It imports from and exports to the SAM (Sequence Alignment/Map)
|
|
31 format, does sorting, merging and indexing, and allows to retrieve reads
|
|
32 in any regions swiftly.
|
|
33
|
|
34 Samtools is designed to work on a stream. It regards an input file `-'
|
|
35 as the standard input (stdin) and an output file `-' as the standard
|
|
36 output (stdout). Several commands can thus be combined with Unix
|
|
37 pipes. Samtools always output warning and error messages to the standard
|
|
38 error output (stderr).
|
|
39
|
|
40 Samtools is also able to open a BAM (not SAM) file on a remote FTP or
|
|
41 HTTP server if the BAM file name starts with `ftp://' or `http://'.
|
|
42 Samtools checks the current working directory for the index file and
|
|
43 will download the index upon absence. Samtools does not retrieve the
|
|
44 entire alignment file unless it is asked to do so.
|
|
45
|
|
46 .SH COMMANDS AND OPTIONS
|
|
47
|
|
48 .TP 10
|
|
49 .B view
|
|
50 samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
|
|
51 skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
|
|
52
|
|
53 Extract/print all or sub alignments in SAM or BAM format. If no region
|
|
54 is specified, all the alignments will be printed; otherwise only
|
|
55 alignments overlapping the specified regions will be output. An
|
|
56 alignment may be given multiple times if it is overlapping several
|
|
57 regions. A region can be presented, for example, in the following
|
|
58 format: `chr2' (the whole chr2), `chr2:1000000' (region starting from
|
|
59 1,000,000bp) or `chr2:1,000,000-2,000,000' (region between 1,000,000 and
|
|
60 2,000,000bp including the end points). The coordinate is 1-based.
|
|
61
|
|
62 .B OPTIONS:
|
|
63 .RS
|
|
64 .TP 8
|
|
65 .B -b
|
|
66 Output in the BAM format.
|
|
67 .TP
|
|
68 .BI -f \ INT
|
|
69 Only output alignments with all bits in INT present in the FLAG
|
|
70 field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
|
|
71 .TP
|
|
72 .BI -F \ INT
|
|
73 Skip alignments with bits present in INT [0]
|
|
74 .TP
|
|
75 .B -h
|
|
76 Include the header in the output.
|
|
77 .TP
|
|
78 .B -H
|
|
79 Output the header only.
|
|
80 .TP
|
|
81 .BI -l \ STR
|
|
82 Only output reads in library STR [null]
|
|
83 .TP
|
|
84 .BI -o \ FILE
|
|
85 Output file [stdout]
|
|
86 .TP
|
|
87 .BI -q \ INT
|
|
88 Skip alignments with MAPQ smaller than INT [0]
|
|
89 .TP
|
|
90 .BI -r \ STR
|
|
91 Only output reads in read group STR [null]
|
|
92 .TP
|
|
93 .BI -R \ FILE
|
|
94 Output reads in read groups listed in
|
|
95 .I FILE
|
|
96 [null]
|
|
97 .TP
|
|
98 .B -S
|
|
99 Input is in SAM. If @SQ header lines are absent, the
|
|
100 .B `-t'
|
|
101 option is required.
|
|
102 .TP
|
|
103 .B -c
|
|
104 Instead of printing the alignments, only count them and print the
|
|
105 total number. All filter options, such as
|
|
106 .B `-f',
|
|
107 .B `-F'
|
|
108 and
|
|
109 .B `-q'
|
|
110 , are taken into account.
|
|
111 .TP
|
|
112 .BI -t \ FILE
|
|
113 This file is TAB-delimited. Each line must contain the reference name
|
|
114 and the length of the reference, one line for each distinct reference;
|
|
115 additional fields are ignored. This file also defines the order of the
|
|
116 reference sequences in sorting. If you run `samtools faidx <ref.fa>',
|
|
117 the resultant index file
|
|
118 .I <ref.fa>.fai
|
|
119 can be used as this
|
|
120 .I <in.ref_list>
|
|
121 file.
|
|
122 .TP
|
|
123 .B -u
|
|
124 Output uncompressed BAM. This option saves time spent on
|
|
125 compression/decomprssion and is thus preferred when the output is piped
|
|
126 to another samtools command.
|
|
127 .RE
|
|
128
|
|
129 .TP
|
|
130 .B tview
|
|
131 samtools tview <in.sorted.bam> [ref.fasta]
|
|
132
|
|
133 Text alignment viewer (based on the ncurses library). In the viewer,
|
|
134 press `?' for help and press `g' to check the alignment start from a
|
|
135 region in the format like `chr10:10,000,000' or `=10,000,000' when
|
|
136 viewing the same reference sequence.
|
|
137
|
|
138 .TP
|
|
139 .B mpileup
|
|
140 samtools mpileup [-EBug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
|
|
141
|
|
142 Generate BCF or pileup for one or multiple BAM files. Alignment records
|
|
143 are grouped by sample identifiers in @RG header lines. If sample
|
|
144 identifiers are absent, each input file is regarded as one sample.
|
|
145
|
|
146 .B OPTIONS:
|
|
147 .RS
|
|
148 .TP 10
|
|
149 .B -A
|
|
150 Do not skip anomalous read pairs in variant calling.
|
|
151 .TP
|
|
152 .B -B
|
|
153 Disable probabilistic realignment for the computation of base alignment
|
|
154 quality (BAQ). BAQ is the Phred-scaled probability of a read base being
|
|
155 misaligned. Applying this option greatly helps to reduce false SNPs
|
|
156 caused by misalignments.
|
|
157 .TP
|
|
158 .BI -C \ INT
|
|
159 Coefficient for downgrading mapping quality for reads containing
|
|
160 excessive mismatches. Given a read with a phred-scaled probability q of
|
|
161 being generated from the mapped position, the new mapping quality is
|
|
162 about sqrt((INT-q)/INT)*INT. A zero value disables this
|
|
163 functionality; if enabled, the recommended value for BWA is 50. [0]
|
|
164 .TP
|
|
165 .BI -d \ INT
|
|
166 At a position, read maximally
|
|
167 .I INT
|
|
168 reads per input BAM. [250]
|
|
169 .TP
|
|
170 .B -D
|
|
171 Output per-sample read depth
|
|
172 .TP
|
|
173 .BI -e \ INT
|
|
174 Phred-scaled gap extension sequencing error probability. Reducing
|
|
175 .I INT
|
|
176 leads to longer indels. [20]
|
|
177 .TP
|
|
178 .B -E
|
|
179 Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt
|
|
180 specificity a little bit.
|
|
181 .TP
|
|
182 .BI -f \ FILE
|
|
183 The reference file [null]
|
|
184 .TP
|
|
185 .B -g
|
|
186 Compute genotype likelihoods and output them in the binary call format (BCF).
|
|
187 .TP
|
|
188 .BI -h \ INT
|
|
189 Coefficient for modeling homopolymer errors. Given an
|
|
190 .IR l -long
|
|
191 homopolymer
|
|
192 run, the sequencing error of an indel of size
|
|
193 .I s
|
|
194 is modeled as
|
|
195 .IR INT * s / l .
|
|
196 [100]
|
|
197 .TP
|
|
198 .B -I
|
|
199 Do not perform INDEL calling
|
|
200 .TP
|
|
201 .BI -l \ FILE
|
|
202 File containing a list of sites where pileup or BCF is outputted [null]
|
|
203 .TP
|
|
204 .BI -L \ INT
|
|
205 Skip INDEL calling if the average per-sample depth is above
|
|
206 .IR INT .
|
|
207 [250]
|
|
208 .TP
|
|
209 .BI -o \ INT
|
|
210 Phred-scaled gap open sequencing error probability. Reducing
|
|
211 .I INT
|
|
212 leads to more indel calls. [40]
|
|
213 .TP
|
|
214 .BI -P \ STR
|
|
215 Comma dilimited list of platforms (determined by
|
|
216 .BR @RG-PL )
|
|
217 from which indel candidates are obtained. It is recommended to collect
|
|
218 indel candidates from sequencing technologies that have low indel error
|
|
219 rate such as ILLUMINA. [all]
|
|
220 .TP
|
|
221 .BI -q \ INT
|
|
222 Minimum mapping quality for an alignment to be used [0]
|
|
223 .TP
|
|
224 .BI -Q \ INT
|
|
225 Minimum base quality for a base to be considered [13]
|
|
226 .TP
|
|
227 .BI -r \ STR
|
|
228 Only generate pileup in region
|
|
229 .I STR
|
|
230 [all sites]
|
|
231 .TP
|
|
232 .B -S
|
|
233 Output per-sample Phred-scaled strand bias P-value
|
|
234 .TP
|
|
235 .B -u
|
|
236 Similar to
|
|
237 .B -g
|
|
238 except that the output is uncompressed BCF, which is preferred for piping.
|
|
239 .RE
|
|
240
|
|
241 .TP
|
|
242 .B reheader
|
|
243 samtools reheader <in.header.sam> <in.bam>
|
|
244
|
|
245 Replace the header in
|
|
246 .I in.bam
|
|
247 with the header in
|
|
248 .I in.header.sam.
|
|
249 This command is much faster than replacing the header with a
|
|
250 BAM->SAM->BAM conversion.
|
|
251
|
|
252 .TP
|
|
253 .B cat
|
|
254 samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ]
|
|
255
|
|
256 Concatenate BAMs. The sequence dictionary of each input BAM must be identical,
|
|
257 although this command does not check this. This command uses a similar trick
|
|
258 to
|
|
259 .B reheader
|
|
260 which enables fast BAM concatenation.
|
|
261
|
|
262 .TP
|
|
263 .B sort
|
|
264 samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
|
|
265
|
|
266 Sort alignments by leftmost coordinates. File
|
|
267 .I <out.prefix>.bam
|
|
268 will be created. This command may also create temporary files
|
|
269 .I <out.prefix>.%d.bam
|
|
270 when the whole alignment cannot be fitted into memory (controlled by
|
|
271 option -m).
|
|
272
|
|
273 .B OPTIONS:
|
|
274 .RS
|
|
275 .TP 8
|
|
276 .B -o
|
|
277 Output the final alignment to the standard output.
|
|
278 .TP
|
|
279 .B -n
|
|
280 Sort by read names rather than by chromosomal coordinates
|
|
281 .TP
|
|
282 .BI -m \ INT
|
|
283 Approximately the maximum required memory. [500000000]
|
|
284 .RE
|
|
285
|
|
286 .TP
|
|
287 .B merge
|
|
288 samtools merge [-nur1f] [-h inh.sam] [-R reg] <out.bam> <in1.bam> <in2.bam> [...]
|
|
289
|
|
290 Merge multiple sorted alignments.
|
|
291 The header reference lists of all the input BAM files, and the @SQ headers of
|
|
292 .IR inh.sam ,
|
|
293 if any, must all refer to the same set of reference sequences.
|
|
294 The header reference list and (unless overridden by
|
|
295 .BR -h )
|
|
296 `@' headers of
|
|
297 .I in1.bam
|
|
298 will be copied to
|
|
299 .IR out.bam ,
|
|
300 and the headers of other files will be ignored.
|
|
301
|
|
302 .B OPTIONS:
|
|
303 .RS
|
|
304 .TP 8
|
|
305 .B -1
|
|
306 Use zlib compression level 1 to comrpess the output
|
|
307 .TP
|
|
308 .B -f
|
|
309 Force to overwrite the output file if present.
|
|
310 .TP 8
|
|
311 .BI -h \ FILE
|
|
312 Use the lines of
|
|
313 .I FILE
|
|
314 as `@' headers to be copied to
|
|
315 .IR out.bam ,
|
|
316 replacing any header lines that would otherwise be copied from
|
|
317 .IR in1.bam .
|
|
318 .RI ( FILE
|
|
319 is actually in SAM format, though any alignment records it may contain
|
|
320 are ignored.)
|
|
321 .TP
|
|
322 .B -n
|
|
323 The input alignments are sorted by read names rather than by chromosomal
|
|
324 coordinates
|
|
325 .TP
|
|
326 .BI -R \ STR
|
|
327 Merge files in the specified region indicated by
|
|
328 .I STR
|
|
329 [null]
|
|
330 .TP
|
|
331 .B -r
|
|
332 Attach an RG tag to each alignment. The tag value is inferred from file names.
|
|
333 .TP
|
|
334 .B -u
|
|
335 Uncompressed BAM output
|
|
336 .RE
|
|
337
|
|
338 .TP
|
|
339 .B index
|
|
340 samtools index <aln.bam>
|
|
341
|
|
342 Index sorted alignment for fast random access. Index file
|
|
343 .I <aln.bam>.bai
|
|
344 will be created.
|
|
345
|
|
346 .TP
|
|
347 .B idxstats
|
|
348 samtools idxstats <aln.bam>
|
|
349
|
|
350 Retrieve and print stats in the index file. The output is TAB delimited
|
|
351 with each line consisting of reference sequence name, sequence length, #
|
|
352 mapped reads and # unmapped reads.
|
|
353
|
|
354 .TP
|
|
355 .B faidx
|
|
356 samtools faidx <ref.fasta> [region1 [...]]
|
|
357
|
|
358 Index reference sequence in the FASTA format or extract subsequence from
|
|
359 indexed reference sequence. If no region is specified,
|
|
360 .B faidx
|
|
361 will index the file and create
|
|
362 .I <ref.fasta>.fai
|
|
363 on the disk. If regions are speficified, the subsequences will be
|
|
364 retrieved and printed to stdout in the FASTA format. The input file can
|
|
365 be compressed in the
|
|
366 .B RAZF
|
|
367 format.
|
|
368
|
|
369 .TP
|
|
370 .B fixmate
|
|
371 samtools fixmate <in.nameSrt.bam> <out.bam>
|
|
372
|
|
373 Fill in mate coordinates, ISIZE and mate related flags from a
|
|
374 name-sorted alignment.
|
|
375
|
|
376 .TP
|
|
377 .B rmdup
|
|
378 samtools rmdup [-sS] <input.srt.bam> <out.bam>
|
|
379
|
|
380 Remove potential PCR duplicates: if multiple read pairs have identical
|
|
381 external coordinates, only retain the pair with highest mapping quality.
|
|
382 In the paired-end mode, this command
|
|
383 .B ONLY
|
|
384 works with FR orientation and requires ISIZE is correctly set. It does
|
|
385 not work for unpaired reads (e.g. two ends mapped to different
|
|
386 chromosomes or orphan reads).
|
|
387
|
|
388 .B OPTIONS:
|
|
389 .RS
|
|
390 .TP 8
|
|
391 .B -s
|
|
392 Remove duplicate for single-end reads. By default, the command works for
|
|
393 paired-end reads only.
|
|
394 .TP 8
|
|
395 .B -S
|
|
396 Treat paired-end reads and single-end reads.
|
|
397 .RE
|
|
398
|
|
399 .TP
|
|
400 .B calmd
|
|
401 samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>
|
|
402
|
|
403 Generate the MD tag. If the MD tag is already present, this command will
|
|
404 give a warning if the MD tag generated is different from the existing
|
|
405 tag. Output SAM by default.
|
|
406
|
|
407 .B OPTIONS:
|
|
408 .RS
|
|
409 .TP 8
|
|
410 .B -A
|
|
411 When used jointly with
|
|
412 .B -r
|
|
413 this option overwrites the original base quality.
|
|
414 .TP 8
|
|
415 .B -e
|
|
416 Convert a the read base to = if it is identical to the aligned reference
|
|
417 base. Indel caller does not support the = bases at the moment.
|
|
418 .TP
|
|
419 .B -u
|
|
420 Output uncompressed BAM
|
|
421 .TP
|
|
422 .B -b
|
|
423 Output compressed BAM
|
|
424 .TP
|
|
425 .B -S
|
|
426 The input is SAM with header lines
|
|
427 .TP
|
|
428 .BI -C \ INT
|
|
429 Coefficient to cap mapping quality of poorly mapped reads. See the
|
|
430 .B pileup
|
|
431 command for details. [0]
|
|
432 .TP
|
|
433 .B -r
|
|
434 Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).
|
|
435 .TP
|
|
436 .B -E
|
|
437 Extended BAQ calculation. This option trades specificity for sensitivity, though the
|
|
438 effect is minor.
|
|
439 .RE
|
|
440
|
|
441 .TP
|
|
442 .B targetcut
|
|
443 samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>
|
|
444
|
|
445 This command identifies target regions by examining the continuity of read depth, computes
|
|
446 haploid consensus sequences of targets and outputs a SAM with each sequence corresponding
|
|
447 to a target. When option
|
|
448 .B -f
|
|
449 is in use, BAQ will be applied. This command is
|
|
450 .B only
|
|
451 designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].
|
|
452 .RE
|
|
453
|
|
454 .TP
|
|
455 .B phase
|
|
456 samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam>
|
|
457
|
|
458 Call and phase heterozygous SNPs.
|
|
459 .B OPTIONS:
|
|
460 .RS
|
|
461 .TP 8
|
|
462 .B -A
|
|
463 Drop reads with ambiguous phase.
|
|
464 .TP 8
|
|
465 .BI -b \ STR
|
|
466 Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file
|
|
467 .BR STR .0.bam
|
|
468 and phase-1 reads in
|
|
469 .BR STR .1.bam.
|
|
470 Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads
|
|
471 with switch errors will be saved in
|
|
472 .BR STR .chimeric.bam.
|
|
473 [null]
|
|
474 .TP
|
|
475 .B -F
|
|
476 Do not attempt to fix chimeric reads.
|
|
477 .TP
|
|
478 .BI -k \ INT
|
|
479 Maximum length for local phasing. [13]
|
|
480 .TP
|
|
481 .BI -q \ INT
|
|
482 Minimum Phred-scaled LOD to call a heterozygote. [40]
|
|
483 .TP
|
|
484 .BI -Q \ INT
|
|
485 Minimum base quality to be used in het calling. [13]
|
|
486 .RE
|
|
487
|
|
488 .TP
|
|
489 .B pileup
|
|
490 samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l
|
|
491 in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r
|
|
492 pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior]
|
|
493 <in.bam>|<in.sam>
|
|
494
|
|
495 Print the alignment in the pileup format. In the pileup format, each
|
|
496 line represents a genomic position, consisting of chromosome name,
|
|
497 coordinate, reference base, read bases, read qualities and alignment
|
|
498 mapping qualities. Information on match, mismatch, indel, strand,
|
|
499 mapping quality and start and end of a read are all encoded at the read
|
|
500 base column. At this column, a dot stands for a match to the reference
|
|
501 base on the forward strand, a comma for a match on the reverse strand,
|
|
502 a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
|
|
503 strand and `acgtn' for a mismatch on the reverse strand. A pattern
|
|
504 `\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
|
|
505 reference position and the next reference position. The length of the
|
|
506 insertion is given by the integer in the pattern, followed by the
|
|
507 inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
|
|
508 represents a deletion from the reference. The deleted bases will be
|
|
509 presented as `*' in the following lines. Also at the read base column, a
|
|
510 symbol `^' marks the start of a read. The ASCII of the character
|
|
511 following `^' minus 33 gives the mapping quality. A symbol `$' marks the
|
|
512 end of a read segment.
|
|
513
|
|
514 If option
|
|
515 .B -c
|
|
516 is applied, the consensus base, Phred-scaled consensus quality, SNP
|
|
517 quality (i.e. the Phred-scaled probability of the consensus being
|
|
518 identical to the reference) and root mean square (RMS) mapping quality
|
|
519 of the reads covering the site will be inserted between the `reference
|
|
520 base' and the `read bases' columns. An indel occupies an additional
|
|
521 line. Each indel line consists of chromosome name, coordinate, a star,
|
|
522 the genotype, consensus quality, SNP quality, RMS mapping quality, #
|
|
523 covering reads, the first alllele, the second allele, # reads supporting
|
|
524 the first allele, # reads supporting the second allele and # reads
|
|
525 containing indels different from the top two alleles.
|
|
526
|
|
527 .B NOTE:
|
|
528 Since 0.1.10, the `pileup' command is deprecated by `mpileup'.
|
|
529
|
|
530 .B OPTIONS:
|
|
531 .RS
|
|
532 .TP 10
|
|
533 .B -B
|
|
534 Disable the BAQ computation. See the
|
|
535 .B mpileup
|
|
536 command for details.
|
|
537 .TP
|
|
538 .B -c
|
|
539 Call the consensus sequence. Options
|
|
540 .BR -T ", " -N ", " -I " and " -r
|
|
541 are only effective when
|
|
542 .BR -c " or " -g
|
|
543 is in use.
|
|
544 .TP
|
|
545 .BI -C \ INT
|
|
546 Coefficient for downgrading the mapping quality of poorly mapped
|
|
547 reads. See the
|
|
548 .B mpileup
|
|
549 command for details. [0]
|
|
550 .TP
|
|
551 .BI -d \ INT
|
|
552 Use the first
|
|
553 .I NUM
|
|
554 reads in the pileup for indel calling for speed up. Zero for unlimited. [1024]
|
|
555 .TP
|
|
556 .BI -f \ FILE
|
|
557 The reference sequence in the FASTA format. Index file
|
|
558 .I FILE.fai
|
|
559 will be created if
|
|
560 absent.
|
|
561 .TP
|
|
562 .B -g
|
|
563 Generate genotype likelihood in the binary GLFv3 format. This option
|
|
564 suppresses -c, -i and -s. This option is deprecated by the
|
|
565 .B mpileup
|
|
566 command.
|
|
567 .TP
|
|
568 .B -i
|
|
569 Only output pileup lines containing indels.
|
|
570 .TP
|
|
571 .BI -I \ INT
|
|
572 Phred probability of an indel in sequencing/prep. [40]
|
|
573 .TP
|
|
574 .BI -l \ FILE
|
|
575 List of sites at which pileup is output. This file is space
|
|
576 delimited. The first two columns are required to be chromosome and
|
|
577 1-based coordinate. Additional columns are ignored. It is
|
|
578 recommended to use option
|
|
579 .TP
|
|
580 .BI -m \ INT
|
|
581 Filter reads with flag containing bits in
|
|
582 .I INT
|
|
583 [1796]
|
|
584 .TP
|
|
585 .BI -M \ INT
|
|
586 Cap mapping quality at INT [60]
|
|
587 .TP
|
|
588 .BI -N \ INT
|
|
589 Number of haplotypes in the sample (>=2) [2]
|
|
590 .TP
|
|
591 .BI -r \ FLOAT
|
|
592 Expected fraction of differences between a pair of haplotypes [0.001]
|
|
593 .TP
|
|
594 .B -s
|
|
595 Print the mapping quality as the last column. This option makes the
|
|
596 output easier to parse, although this format is not space efficient.
|
|
597 .TP
|
|
598 .B -S
|
|
599 The input file is in SAM.
|
|
600 .TP
|
|
601 .BI -t \ FILE
|
|
602 List of reference names ane sequence lengths, in the format described
|
|
603 for the
|
|
604 .B import
|
|
605 command. If this option is present, samtools assumes the input
|
|
606 .I <in.alignment>
|
|
607 is in SAM format; otherwise it assumes in BAM format.
|
|
608 .B -s
|
|
609 together with
|
|
610 .B -l
|
|
611 as in the default format we may not know the mapping quality.
|
|
612 .TP
|
|
613 .BI -T \ FLOAT
|
|
614 The theta parameter (error dependency coefficient) in the maq consensus
|
|
615 calling model [0.85]
|
|
616 .RE
|
|
617
|
|
618 .SH SAM FORMAT
|
|
619
|
|
620 SAM is TAB-delimited. Apart from the header lines, which are started
|
|
621 with the `@' symbol, each alignment line consists of:
|
|
622
|
|
623 .TS
|
|
624 center box;
|
|
625 cb | cb | cb
|
|
626 n | l | l .
|
|
627 Col Field Description
|
|
628 _
|
|
629 1 QNAME Query (pair) NAME
|
|
630 2 FLAG bitwise FLAG
|
|
631 3 RNAME Reference sequence NAME
|
|
632 4 POS 1-based leftmost POSition/coordinate of clipped sequence
|
|
633 5 MAPQ MAPping Quality (Phred-scaled)
|
|
634 6 CIAGR extended CIGAR string
|
|
635 7 MRNM Mate Reference sequence NaMe (`=' if same as RNAME)
|
|
636 8 MPOS 1-based Mate POSistion
|
|
637 9 ISIZE Inferred insert SIZE
|
|
638 10 SEQ query SEQuence on the same strand as the reference
|
|
639 11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
|
|
640 12 OPT variable OPTional fields in the format TAG:VTYPE:VALUE
|
|
641 .TE
|
|
642
|
|
643 .PP
|
|
644 Each bit in the FLAG field is defined as:
|
|
645
|
|
646 .TS
|
|
647 center box;
|
|
648 cb | cb | cb
|
|
649 l | c | l .
|
|
650 Flag Chr Description
|
|
651 _
|
|
652 0x0001 p the read is paired in sequencing
|
|
653 0x0002 P the read is mapped in a proper pair
|
|
654 0x0004 u the query sequence itself is unmapped
|
|
655 0x0008 U the mate is unmapped
|
|
656 0x0010 r strand of the query (1 for reverse)
|
|
657 0x0020 R strand of the mate
|
|
658 0x0040 1 the read is the first read in a pair
|
|
659 0x0080 2 the read is the second read in a pair
|
|
660 0x0100 s the alignment is not primary
|
|
661 0x0200 f the read fails platform/vendor quality checks
|
|
662 0x0400 d the read is either a PCR or an optical duplicate
|
|
663 .TE
|
|
664
|
|
665 .SH EXAMPLES
|
|
666 .IP o 2
|
|
667 Import SAM to BAM when
|
|
668 .B @SQ
|
|
669 lines are present in the header:
|
|
670
|
|
671 samtools view -bS aln.sam > aln.bam
|
|
672
|
|
673 If
|
|
674 .B @SQ
|
|
675 lines are absent:
|
|
676
|
|
677 samtools faidx ref.fa
|
|
678 samtools view -bt ref.fa.fai aln.sam > aln.bam
|
|
679
|
|
680 where
|
|
681 .I ref.fa.fai
|
|
682 is generated automatically by the
|
|
683 .B faidx
|
|
684 command.
|
|
685
|
|
686 .IP o 2
|
|
687 Attach the
|
|
688 .B RG
|
|
689 tag while merging sorted alignments:
|
|
690
|
|
691 perl -e 'print "@RG\\tID:ga\\tSM:hs\\tLB:ga\\tPL:Illumina\\n@RG\\tID:454\\tSM:hs\\tLB:454\\tPL:454\\n"' > rg.txt
|
|
692 samtools merge -rh rg.txt merged.bam ga.bam 454.bam
|
|
693
|
|
694 The value in a
|
|
695 .B RG
|
|
696 tag is determined by the file name the read is coming from. In this
|
|
697 example, in the
|
|
698 .IR merged.bam ,
|
|
699 reads from
|
|
700 .I ga.bam
|
|
701 will be attached
|
|
702 .IR RG:Z:ga ,
|
|
703 while reads from
|
|
704 .I 454.bam
|
|
705 will be attached
|
|
706 .IR RG:Z:454 .
|
|
707
|
|
708 .IP o 2
|
|
709 Call SNPs and short indels for one diploid individual:
|
|
710
|
|
711 samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
|
|
712 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf
|
|
713
|
|
714 The
|
|
715 .B -D
|
|
716 option of varFilter controls the maximum read depth, which should be
|
|
717 adjusted to about twice the average read depth. One may consider to add
|
|
718 .B -C50
|
|
719 to
|
|
720 .B mpileup
|
|
721 if mapping quality is overestimated for reads containing excessive
|
|
722 mismatches. Applying this option usually helps
|
|
723 .B BWA-short
|
|
724 but may not other mappers.
|
|
725
|
|
726 .IP o 2
|
|
727 Generate the consensus sequence for one diploid individual:
|
|
728
|
|
729 samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
|
|
730
|
|
731 .IP o 2
|
|
732 Phase one individual:
|
|
733
|
|
734 samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
|
|
735
|
|
736 The
|
|
737 .B calmd
|
|
738 command is used to reduce false heterozygotes around INDELs.
|
|
739
|
|
740 .IP o 2
|
|
741 Call SNPs and short indels for multiple diploid individuals:
|
|
742
|
|
743 samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf
|
|
744 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf
|
|
745
|
|
746 Individuals are identified from the
|
|
747 .B SM
|
|
748 tags in the
|
|
749 .B @RG
|
|
750 header lines. Individuals can be pooled in one alignment file; one
|
|
751 individual can also be separated into multiple files. The
|
|
752 .B -P
|
|
753 option specifies that indel candidates should be collected only from
|
|
754 read groups with the
|
|
755 .B @RG-PL
|
|
756 tag set to
|
|
757 .IR ILLUMINA .
|
|
758 Collecting indel candidates from reads sequenced by an indel-prone
|
|
759 technology may affect the performance of indel calling.
|
|
760
|
|
761 .IP o 2
|
|
762 Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals:
|
|
763
|
|
764 samtools mpileup -Igf ref.fa *.bam > all.bcf
|
|
765 bcftools view -bl sites.list all.bcf > sites.bcf
|
|
766 bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
|
|
767 bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
|
|
768 bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
|
|
769 ......
|
|
770
|
|
771 where
|
|
772 .I sites.list
|
|
773 contains the list of sites with each line consisting of the reference
|
|
774 sequence name and position. The following
|
|
775 .B bcftools
|
|
776 commands estimate AFS by EM.
|
|
777
|
|
778 .IP o 2
|
|
779 Dump BAQ applied alignment for other SNP callers:
|
|
780
|
|
781 samtools calmd -bAr aln.bam > aln.baq.bam
|
|
782
|
|
783 It adds and corrects the
|
|
784 .B NM
|
|
785 and
|
|
786 .B MD
|
|
787 tags at the same time. The
|
|
788 .B calmd
|
|
789 command also comes with the
|
|
790 .B -C
|
|
791 option, the same as the one in
|
|
792 .B pileup
|
|
793 and
|
|
794 .BR mpileup .
|
|
795 Apply if it helps.
|
|
796
|
|
797 .SH LIMITATIONS
|
|
798 .PP
|
|
799 .IP o 2
|
|
800 Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.
|
|
801 .IP o 2
|
|
802 In merging, the input files are required to have the same number of
|
|
803 reference sequences. The requirement can be relaxed. In addition,
|
|
804 merging does not reconstruct the header dictionaries
|
|
805 automatically. Endusers have to provide the correct header. Picard is
|
|
806 better at merging.
|
|
807 .IP o 2
|
|
808 Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan
|
|
809 reads or ends mapped to different chromosomes). If this is a concern,
|
|
810 please use Picard's MarkDuplicate which correctly handles these cases,
|
|
811 although a little slower.
|
|
812
|
|
813 .SH AUTHOR
|
|
814 .PP
|
|
815 Heng Li from the Sanger Institute wrote the C version of samtools. Bob
|
|
816 Handsaker from the Broad Institute implemented the BGZF library and Jue
|
|
817 Ruan from Beijing Genomics Institute wrote the RAZF library. John
|
|
818 Marshall and Petr Danecek contribute to the source code and various
|
|
819 people from the 1000 Genomes Project have contributed to the SAM format
|
|
820 specification.
|
|
821
|
|
822 .SH SEE ALSO
|
|
823 .PP
|
|
824 Samtools website: <http://samtools.sourceforge.net>
|