.TH samtools 1 "2 September 2009" "samtools-0.1.6" "Bioinformatics tools"
.SH NAME
.PP
samtools - Utilities for the Sequence Alignment/Map (SAM) format
.SH SYNOPSIS
.PP
samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
.PP
samtools sort aln.bam aln.sorted
.PP
samtools index aln.sorted.bam
.PP
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
.PP
samtools merge out.bam in1.bam in2.bam in3.bam
.PP
samtools faidx ref.fasta
.PP
samtools pileup -f ref.fasta aln.sorted.bam
.PP
samtools tview aln.sorted.bam ref.fasta

.SH DESCRIPTION
.PP
Samtools is a set of utilities that manipulate alignments in the BAM
format. It imports from and exports to the SAM (Sequence Alignment/Map)
format, performs sorting, merging and indexing, and allows reads in any
region to be retrieved swiftly.

Samtools is designed to work on a stream. It regards an input file `-'
as the standard input (stdin) and an output file `-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always outputs warning and error messages to the standard
error output (stderr).

Samtools is also able to open a BAM (not SAM) file on a remote FTP or
HTTP server if the BAM file name starts with `ftp://' or `http://'.
Samtools checks the current working directory for the index file and
downloads the index if it is absent. Samtools does not retrieve the
entire alignment file unless it is asked to do so.

.SH COMMANDS AND OPTIONS

.TP 10
.B import
samtools import <in.ref_list> <in.sam> <out.bam>

Since 0.1.4, this command is an alias of:

samtools view -bt <in.ref_list> -o <out.bam> <in.sam>

.TP
.B sort
samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

Sort alignments by leftmost coordinates. File
.I <out.prefix>.bam
will be created. This command may also create temporary files
.I <out.prefix>.%d.bam
when the whole alignment cannot fit into memory (controlled by
option -m).

.B OPTIONS:
.RS
.TP 8
.B -n
Sort by read names rather than by chromosomal coordinates
.TP
.B -m INT
Approximate maximum memory required [500000000]
.RE

.TP
.B merge
samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam> <in2.bam> [...]

Merge multiple sorted alignments.
The header reference lists of all the input BAM files, and the @SQ headers of
.IR inh.sam ,
if any, must all refer to the same set of reference sequences.
The header reference list and (unless overridden by
.BR -h )
`@' headers of
.I in1.bam
will be copied to
.IR out.bam ,
and the headers of other files will be ignored.

.B OPTIONS:
.RS
.TP 8
.B -h FILE
Use the lines of
.I FILE
as `@' headers to be copied to
.IR out.bam ,
replacing any header lines that would otherwise be copied from
.IR in1.bam .
.RI ( FILE
is actually in SAM format, though any alignment records it may contain
are ignored.)
.TP
.B -n
The input alignments are sorted by read names rather than by chromosomal
coordinates
.RE

.TP
.B index
samtools index <aln.bam>

Index sorted alignment for fast random access. Index file
.I <aln.bam>.bai
will be created.

.TP
.B view
samtools view [-bhuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
skipFlag] [-q minMapQ] [-l library] [-r readGroup] <in.bam>|<in.sam> [region1 [...]]

Extract or print all alignments, or a subset of them, in SAM or BAM
format. If no region is specified, all the alignments will be printed;
otherwise only alignments overlapping the specified regions will be
output. An alignment may be output multiple times if it overlaps several
regions. A region can be specified, for example, in one of the following
formats: `chr2', `chr2:1000000' or `chr2:1,000,000-2,000,000'.
Coordinates are 1-based.

.B OPTIONS:
.RS
.TP 8
.B -b
Output in the BAM format.
.TP
.B -u
Output uncompressed BAM. This option saves time spent on
compression/decompression and is thus preferred when the output is piped
to another samtools command.
.TP
.B -h
Include the header in the output.
.TP
.B -H
Output the header only.
.TP
.B -S
Input is in SAM. If @SQ header lines are absent, the
.B `-t'
option is required.
.TP
.B -t FILE
This file is TAB-delimited. Each line must contain the reference name
and the length of the reference, one line for each distinct reference;
additional fields are ignored. This file also defines the order of the
reference sequences in sorting. If you run `samtools faidx <ref.fa>',
the resultant index file
.I <ref.fa>.fai
can be used as this
.I <in.ref_list>
file.
.TP
.B -o FILE
Output file [stdout]
.TP
.B -f INT
Only output alignments with all bits in INT present in the FLAG
field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
.TP
.B -F INT
Skip alignments with any bit in INT present in the FLAG field [0]
.TP
.B -q INT
Skip alignments with MAPQ smaller than INT [0]
.TP
.B -l STR
Only output reads in library STR [null]
.TP
.B -r STR
Only output reads in read group STR [null]
.RE
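.IP
The region formats accepted by
.B view
can be illustrated with a small parser. This is a minimal Python sketch
and not the samtools implementation: the function name parse_region is
hypothetical, and it assumes that reference names contain no colon, that
a bare name means the whole reference, and that a missing end coordinate
means "up to the end of the reference" (returned here as None).
.nf
```python
def parse_region(region):
    """Split a region string into (name, start, end), 1-based coordinates.

    Assumptions of this sketch: start defaults to 1 when only a name is
    given, and end is None when the region has no end coordinate.
    """
    name, colon, span = region.partition(":")
    if not colon:
        return name, 1, None
    span = span.replace(",", "")  # drop thousands separators
    start_text, dash, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    if start < 1 or (end is not None and end < start):
        raise ValueError("bad region: " + region)
    return name, start, end


for text in ["chr2", "chr2:1000000", "chr2:1,000,000-2,000,000"]:
    print(text, parse_region(text))
```
.fi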

.TP
.B faidx
samtools faidx <ref.fasta> [region1 [...]]

Index a reference sequence in the FASTA format or extract subsequences
from an indexed reference sequence. If no region is specified,
.B faidx
will index the file and create
.I <ref.fasta>.fai
on the disk. If regions are specified, the subsequences will be
retrieved and printed to stdout in the FASTA format. The input file can
be compressed in the
.B RAZF
format.
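.IP
The index written by
.B faidx
can serve as the reference list for the
.B -t
option of
.BR view ,
which uses the reference name and length from the first two
TAB-delimited fields of each line and ignores the rest. A minimal Python
sketch of reading such a list follows. The function name read_ref_list
is hypothetical and the code is an illustration, not part of samtools.
.nf
```python
TAB = chr(9)  # the TAB character, written without an escape sequence


def read_ref_list(lines):
    """Return [(name, length), ...] in input order.

    Only the first two TAB-delimited fields are used; additional
    fields are ignored. Blank lines are skipped (an assumption).
    """
    refs = []
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip().split(TAB)
        if len(fields) < 2:
            raise ValueError("expected NAME<TAB>LENGTH, got: " + line)
        refs.append((fields[0], int(fields[1])))
    return refs


example = [TAB.join(["chr1", "1000", "6", "60", "61"]),
           TAB.join(["chr2", "500"])]
print(read_ref_list(example))
```
.fi
The order of the returned list matters, because the reference list also
defines the order of the reference sequences in sorting.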

.TP
.B pileup
samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l in.site_list]
[-iscgS2] [-T theta] [-N nHap] [-r pairDiffRate] <in.bam>|<in.sam>

Print the alignment in the pileup format. In the pileup format, each
line represents a genomic position, consisting of chromosome name,
coordinate, reference base, read bases, read qualities and alignment
mapping qualities. Information on match, mismatch, indel, strand,
mapping quality and the start and end of a read is all encoded in the
read base column. In this column, a dot stands for a match to the reference
base on the forward strand, a comma for a match on the reverse strand,
`ACGTN' for a mismatch on the forward strand and `acgtn' for a mismatch
on the reverse strand. A pattern `\\+[0-9]+[ACGTNacgtn]+' indicates
there is an insertion between this reference position and the next
reference position. The length of the insertion is given by the integer
in the pattern, followed by the inserted sequence. Similarly, a pattern
`-[0-9]+[ACGTNacgtn]+' represents a deletion from the reference. The
deleted bases will be presented as `*' in the following lines. Also in
the read base column, a symbol `^' marks the start of a read segment,
which is a contiguous subsequence on the read separated by `N/S/H' CIGAR
operations. The ASCII code of the character following `^', minus 33,
gives the mapping quality. A symbol `$' marks the end of a read segment.
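.IP
The encoding of the read base column described above can be made
concrete with a small tokenizer. The following Python sketch is an
illustration written from that description, not code taken from
samtools. The function name summarize_read_bases and the category names
in the returned dictionary are hypothetical.
.nf
```python
def summarize_read_bases(column):
    """Count the events encoded in one pileup read base column.

    Returns (counts, start_mapqs), where start_mapqs holds the mapping
    quality decoded after each start-of-segment marker.
    """
    counts = {"match": 0, "mismatch": 0, "deleted": 0,
              "insertion": 0, "deletion": 0, "start": 0, "end": 0}
    start_mapqs = []
    i = 0
    while i < len(column):
        c = column[i]
        if c == "^":
            # The next character encodes mapping quality as ASCII - 33.
            start_mapqs.append(ord(column[i + 1]) - 33)
            counts["start"] += 1
            i += 2
            continue
        if c in "+-":
            # Indel: sign, decimal length, then that many bases.
            j = i + 1
            while j < len(column) and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            counts["insertion" if c == "+" else "deletion"] += 1
            i = j + length
            continue
        if c == "$":
            counts["end"] += 1
        elif c in ".,":
            counts["match"] += 1
        elif c in "ACGTNacgtn":
            counts["mismatch"] += 1
        elif c == "*":
            counts["deleted"] += 1
        else:
            raise ValueError("unexpected character: " + c)
        i += 1
    return counts, start_mapqs


print(summarize_read_bases(".$,^I.A+2ag,-1c*"))
```
.fi
The character after `^' is consumed together with the marker, so a
mapping quality character that happens to be `+', `-' or a base letter
is not mistaken for an indel or a mismatch.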
.IP
If option
.B -c
is applied, the consensus base, consensus quality, SNP quality and RMS
mapping quality of the reads covering the site will be inserted between
the `reference base' and the `read bases' columns. An indel occupies an
additional line. Each indel line consists of chromosome name,
coordinate, a star, the genotype, consensus quality, SNP quality, RMS
mapping quality, # covering reads, the first allele, the second allele,
# reads supporting the first allele, # reads supporting the second
allele and # reads containing indels different from the top two alleles.

.B OPTIONS:
.RS

.TP 10
.B -s
Print the mapping quality as the last column. This option makes the
output easier to parse, although this format is not space efficient.

.TP
.B -S
The input file is in SAM.

.TP
.B -i
Only output pileup lines containing indels.

.TP
.B -f FILE
The reference sequence in the FASTA format. Index file
.I FILE.fai
will be created if
absent.

.TP
.B -M INT
Cap mapping quality at INT [60]

.TP
.B -t FILE
List of reference names and sequence lengths, in the format described
for the
.B -t
option of the
.B view
command. If this option is present, samtools assumes the input
.I <in.alignment>
is in SAM format; otherwise it assumes BAM format.

.TP
.B -l FILE
List of sites at which pileup is output. This file is space
delimited. The first two columns are required to be chromosome and
1-based coordinate. Additional columns are ignored. It is
recommended to use option
.B -s
together with
.B -l
because in the default format the mapping quality may not be known.

.TP
.B -c
Call the consensus sequence using the MAQ consensus model. Options
.B -T,
.B -N,
.B -I
and
.B -r
are only effective when
.B -c
or
.B -g
is in use.

.TP
.B -g
Generate genotype likelihood in the binary GLFv3 format. This option
suppresses -c, -i and -s.

.TP
.B -T FLOAT
The theta parameter (error dependency coefficient) in the MAQ consensus
calling model [0.85]

.TP
.B -N INT
Number of haplotypes in the sample (>=2) [2]

.TP
.B -r FLOAT
Expected fraction of differences between a pair of haplotypes [0.001]

.TP
.B -I INT
Phred-scaled probability of an indel in sequencing/prep [40]

.RE

.TP
.B tview
samtools tview <in.sorted.bam> [ref.fasta]

Text alignment viewer (based on the ncurses library). In the viewer,
press `?' for help and press `g' to view the alignment starting from a
region given in a format like `chr10:10,000,000'.

.TP
.B fixmate
samtools fixmate <in.nameSrt.bam> <out.bam>

Fill in mate coordinates, ISIZE and mate related flags from a
name-sorted alignment.

.TP
.B rmdup
samtools rmdup <input.srt.bam> <out.bam>

Remove potential PCR duplicates: if multiple read pairs have identical
external coordinates, only retain the pair with highest mapping quality.
This command
.B ONLY
works with FR orientation and requires that ISIZE be correctly set.

.TP
.B rmdupse
samtools rmdupse <input.srt.bam> <out.bam>

Remove potential duplicates for single-ended reads. This command will
treat all reads as single-ended even if they are in fact paired.

.TP
.B fillmd
samtools fillmd [-e] <aln.bam> <ref.fasta>

Generate the MD tag. If the MD tag is already present, this command will
give a warning if the MD tag generated is different from the existing
tag.

.B OPTIONS:
.RS
.TP 8
.B -e
Convert a read base to = if it is identical to the aligned reference
base. The indel caller does not support the = bases at the moment.

.RE

.SH SAM FORMAT

SAM is TAB-delimited. Apart from the header lines, which start
with the `@' symbol, each alignment line consists of:

.TS
center box tab(@);
cb | cb | cb
n | l | l .
Col@Field@Description
_
1@QNAME@Query (pair) NAME
2@FLAG@bitwise FLAG
3@RNAME@Reference sequence NAME
4@POS@1-based leftmost POSition/coordinate of clipped sequence
5@MAPQ@MAPping Quality (Phred-scaled)
6@CIGAR@extended CIGAR string
7@MRNM@Mate Reference sequence NaMe (`=' if same as RNAME)
8@MPOS@1-based Mate POSition
9@ISIZE@Inferred insert SIZE
10@SEQ@query SEQuence on the same strand as the reference
11@QUAL@query QUALity (ASCII-33 gives the Phred base quality)
12@OPT@variable OPTional fields in the format TAG:VTYPE:VALUE
.TE
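.PP
The column layout above maps directly onto a record type. The following
Python sketch splits one alignment line into the eleven mandatory fields
plus the optional TAG:VTYPE:VALUE fields, and decodes QUAL into Phred
base qualities. The names Alignment, parse_alignment and base_qualities
are hypothetical; this is an illustration, not the samtools parser. It
also assumes the SAM convention that a QUAL of `*' means no qualities
are stored.
.nf
```python
from collections import namedtuple

TAB = chr(9)  # the TAB character, written without an escape sequence
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}
Alignment = namedtuple("Alignment", FIELDS + ["OPT"])


def parse_alignment(line):
    """Parse one non-header SAM line into an Alignment tuple."""
    cols = line.rstrip().split(TAB)
    if len(cols) < len(FIELDS):
        raise ValueError("expected at least 11 TAB-delimited fields")
    values = [int(text) if name in INT_FIELDS else text
              for name, text in zip(FIELDS, cols)]
    # Optional fields: split TAG:VTYPE:VALUE at the first two colons only.
    opt = [tuple(field.split(":", 2)) for field in cols[len(FIELDS):]]
    return Alignment(*(values + [opt]))


def base_qualities(qual):
    """Decode QUAL: ASCII code minus 33 gives the Phred base quality."""
    if qual == "*":
        return []
    return [ord(c) - 33 for c in qual]


line = TAB.join(["r1", "99", "chr2", "100", "60", "4M", "=", "150",
                 "54", "ACGT", "IIII", "NM:i:0"])
aln = parse_alignment(line)
print(aln.QNAME, aln.POS, aln.OPT, base_qualities(aln.QUAL))
```
.fi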

.PP
Each bit in the FLAG field is defined as:

.TS
center box tab(@);
cb | cb
l | l .
Flag@Description
_
0x0001@the read is paired in sequencing
0x0002@the read is mapped in a proper pair
0x0004@the query sequence itself is unmapped
0x0008@the mate is unmapped
0x0010@strand of the query (1 for reverse)
0x0020@strand of the mate
0x0040@the read is the first read in a pair
0x0080@the read is the second read in a pair
0x0100@the alignment is not primary
0x0200@the read fails platform/vendor quality checks
0x0400@the read is either a PCR or an optical duplicate
.TE
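.PP
Because FLAG is a bit field, a value such as 99 is read by testing each
bit in the table above. The Python sketch below decodes a FLAG value and
mimics the
.B -f
and
.B -F
filters of the
.B view
command as they are described in this page: all bits of the required
mask must be set, and an alignment is skipped when any bit of the skip
mask is set. The names describe_flag and passes_filter are hypothetical,
and the code is an illustration rather than the samtools source.
.nf
```python
FLAG_BITS = [
    (0x0001, "paired in sequencing"),
    (0x0002, "mapped in a proper pair"),
    (0x0004, "query unmapped"),
    (0x0008, "mate unmapped"),
    (0x0010, "query on the reverse strand"),
    (0x0020, "mate on the reverse strand"),
    (0x0040, "first read in a pair"),
    (0x0080, "second read in a pair"),
    (0x0100, "not primary"),
    (0x0200, "fails platform/vendor quality checks"),
    (0x0400, "PCR or optical duplicate"),
]


def describe_flag(flag):
    """List the descriptions of all bits set in flag."""
    return [text for bit, text in FLAG_BITS if flag & bit]


def passes_filter(flag, required=0, skipped=0):
    """True if every bit of required is set and no bit of skipped is."""
    return (flag & required) == required and (flag & skipped) == 0


print(describe_flag(99))
print(passes_filter(99, required=0x2, skipped=0x4))
```
.fi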

.SH LIMITATIONS
.PP
.IP o 2
Unaligned words are used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.
.IP o 2
CIGAR operation P is not properly handled at the moment.
.IP o 2
In merging, the input files are required to have the same number of
reference sequences. The requirement can be relaxed. In addition,
merging does not reconstruct the header dictionaries
automatically. End users have to provide the correct header. Picard is
better at merging.
.IP o 2
Samtools' rmdup does not work for single-end data and does not remove
duplicates across chromosomes. Picard is better.

.SH AUTHOR
.PP
Heng Li from the Sanger Institute wrote the C version of samtools. Bob
Handsaker from the Broad Institute implemented the BGZF library and Jue
Ruan from Beijing Genomics Institute wrote the RAZF library. Various
people in the 1000 Genomes Project contributed to the SAM format
specification.

.SH SEE ALSO
.PP
Samtools website: <http://samtools.sourceforge.net>