comparison SNV/SNVMix2_source/SNVMix2-v0.12.1-rc1/samtools-0.1.6/samtools.txt @ 0:74f5ea818cea

Uploaded
author ryanmorin
date Wed, 12 Oct 2011 19:50:38 -0400
1 samtools(1) Bioinformatics tools samtools(1)
2
3
4
5 NAME
6 samtools - Utilities for the Sequence Alignment/Map (SAM) format
7
8 SYNOPSIS
9 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
10
11 samtools sort aln.bam aln.sorted
12
13 samtools index aln.sorted.bam
14
15 samtools view aln.sorted.bam chr2:20,100,000-20,200,000
16
17 samtools merge out.bam in1.bam in2.bam in3.bam
18
19 samtools faidx ref.fasta
20
21 samtools pileup -f ref.fasta aln.sorted.bam
22
23 samtools tview aln.sorted.bam ref.fasta
24
25
26 DESCRIPTION
27 Samtools is a set of utilities that manipulate alignments in the BAM
28 format. It imports from and exports to the SAM (Sequence Alignment/Map)
29 format, does sorting, merging and indexing, and allows reads in any
30 region to be retrieved swiftly.
31
32 Samtools is designed to work on a stream. It regards an input file `-'
33 as the standard input (stdin) and an output file `-' as the standard
34 output (stdout). Several commands can thus be combined with Unix pipes.
35 Samtools always outputs warning and error messages to the standard error
36 output (stderr).
37
38 Samtools is also able to open a BAM (not SAM) file on a remote FTP or
39 HTTP server if the BAM file name starts with `ftp://' or `http://'.
40 Samtools checks the current working directory for the index file and
41 will download the index upon absence. Samtools does not retrieve the
42 entire alignment file unless it is asked to do so.
43
44
45 COMMANDS AND OPTIONS
46 import samtools import <in.ref_list> <in.sam> <out.bam>
47
48 Since 0.1.4, this command is an alias of:
49
50 samtools view -bt <in.ref_list> -o <out.bam> <in.sam>
51
52
53 sort samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>
54
55 Sort alignments by leftmost coordinates. File <out.pre-
56 fix>.bam will be created. This command may also create tempo-
57 rary files <out.prefix>.%d.bam when the whole alignment can-
58 not be fitted into memory (controlled by option -m).
59
60 OPTIONS:
61
62 -n Sort by read names rather than by chromosomal coordi-
63 nates
64
65 -m INT Approximately the maximum required memory.
66 [500000000]
67
68
69 merge samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam>
70 <in2.bam> [...]
71
72 Merge multiple sorted alignments. The header reference lists
73 of all the input BAM files, and the @SQ headers of inh.sam,
74 if any, must all refer to the same set of reference
75 sequences. The header reference list and (unless overridden
76 by -h) `@' headers of in1.bam will be copied to out.bam, and
77 the headers of other files will be ignored.
78
79 OPTIONS:
80
81 -h FILE Use the lines of FILE as `@' headers to be copied to
82 out.bam, replacing any header lines that would other-
83 wise be copied from in1.bam. (FILE is actually in
84 SAM format, though any alignment records it may con-
85 tain are ignored.)
86
87 -n The input alignments are sorted by read names rather
88 than by chromosomal coordinates
89
90
91 index samtools index <aln.bam>
92
93 Index sorted alignment for fast random access. Index file
94 <aln.bam>.bai will be created.
95
96
97 view samtools view [-bhuHS] [-t in.refList] [-o output] [-f
98 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
99 Group] <in.bam>|<in.sam> [region1 [...]]
100
101 Extract/print all or sub alignments in SAM or BAM format. If
102 no region is specified, all the alignments will be printed;
103 otherwise only alignments overlapping the specified regions
104 will be output. An alignment may be given multiple times if
105 it is overlapping several regions. A region can be presented,
106 for example, in the following format: `chr2', `chr2:1000000'
107 or `chr2:1,000,000-2,000,000'. The coordinate is 1-based.
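The three region forms above can be split mechanically. The following is a minimal Python sketch, not samtools code: `parse_region` is a hypothetical helper, and because the text above does not say how a region without an end coordinate is interpreted, the sketch simply reports the missing part as `None`.

```python
import re

# Hypothetical helper (not part of samtools): name, optional start, optional end.
# Commas inside the coordinates are allowed, as in `chr2:1,000,000-2,000,000'.
_REGION = re.compile(r"^([^:]+)(?::([0-9,]+)(?:-([0-9,]+))?)?$")


def parse_region(text):
    """Return (name, start, end); start and end are 1-based ints or None."""
    match = _REGION.match(text)
    if match is None:
        raise ValueError("unrecognised region: %r" % text)
    name, start, end = match.groups()

    def to_int(value):
        # Strip the thousands separators before converting.
        return int(value.replace(",", "")) if value is not None else None

    return name, to_int(start), to_int(end)


for example in ("chr2", "chr2:1000000", "chr2:1,000,000-2,000,000"):
    print(example, "->", parse_region(example))
```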
108
109 OPTIONS:
110
111 -b Output in the BAM format.
112
113 -u Output uncompressed BAM. This option saves time spent
114 on compression/decompression and is thus preferred
115 when the output is piped to another samtools command.
116
117 -h Include the header in the output.
118
119 -H Output the header only.
120
121 -S Input is in SAM. If @SQ header lines are absent, the
122 `-t' option is required.
123
124 -t FILE This file is TAB-delimited. Each line must contain
125 the reference name and the length of the reference,
126 one line for each distinct reference; additional
127 fields are ignored. This file also defines the order
128 of the reference sequences in sorting. If you run
129 `samtools faidx <ref.fa>', the resultant index file
130 <ref.fa>.fai can be used as this <in.ref_list> file.
131
132 -o FILE Output file [stdout]
133
134 -f INT Only output alignments with all bits in INT present
135 in the FLAG field. INT can be in hex in the format of
136 /^0x[0-9A-F]+/ [0]
137
138 -F INT Skip alignments with bits present in INT [0]
139
140 -q INT Skip alignments with MAPQ smaller than INT [0]
141
142 -l STR Only output reads in library STR [null]
143
144 -r STR Only output reads in read group STR [null]
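The combined effect of -f, -F and -q can be pictured as one predicate per alignment. This Python sketch is an illustration only: the function names are hypothetical, and it assumes that -F skips an alignment when any of the given bits is set, which the option text above does not spell out.

```python
def parse_flag_value(text):
    """Accept a decimal INT or a hex INT of the form 0x..., as -f and -F do."""
    return int(text, 16) if text.startswith("0x") else int(text)


def passes_filters(flag, mapq, req_flag=0, skip_flag=0, min_mapq=0):
    """Return True if an alignment would be kept under -f, -F and -q."""
    if (flag & req_flag) != req_flag:
        return False  # -f: every requested bit must be present
    if flag & skip_flag:
        return False  # -F: assumed to skip when any listed bit is present
    return mapq >= min_mapq  # -q: skip MAPQ smaller than the threshold


# Made-up alignment: FLAG 0x53 (paired, proper pair, reverse, first), MAPQ 30.
flag, mapq = parse_flag_value("0x53"), 30
print(passes_filters(flag, mapq, req_flag=parse_flag_value("0x2")))
print(passes_filters(flag, mapq, skip_flag=parse_flag_value("16")))
```

With the default value 0 for all three options, every alignment passes, which matches the defaults listed above.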
145
146
147 faidx samtools faidx <ref.fasta> [region1 [...]]
148
149 Index reference sequence in the FASTA format or extract sub-
150 sequence from indexed reference sequence. If no region is
151 specified, faidx will index the file and create
152 <ref.fasta>.fai on the disk. If regions are specified, the
153 subsequences will be retrieved and printed to stdout in the
154 FASTA format. The input file can be compressed in the RAZF
155 format.
156
157
158 pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
159 in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r
160 pairDiffRate] <in.bam>|<in.sam>
161
162 Print the alignment in the pileup format. In the pileup for-
163 mat, each line represents a genomic position, consisting of
164 chromosome name, coordinate, reference base, read bases, read
165 qualities and alignment mapping qualities. Information on
166 match, mismatch, indel, strand, mapping quality and start and
167 end of a read are all encoded at the read base column. At
168 this column, a dot stands for a match to the reference base
169 on the forward strand, a comma for a match on the reverse
170 strand, `ACGTN' for a mismatch on the forward strand and
171 `acgtn' for a mismatch on the reverse strand. A pattern
172 `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion
173 between this reference position and the next reference posi-
174 tion. The length of the insertion is given by the integer in
175 the pattern, followed by the inserted sequence. Similarly, a
176 pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
177 reference. The deleted bases will be presented as `*' in the
178 following lines. Also at the read base column, a symbol `^'
179 marks the start of a read segment which is a contiguous sub-
180 sequence on the read separated by `N/S/H' CIGAR operations.
181 The ASCII of the character following `^' minus 33 gives the
182 mapping quality. A symbol `$' marks the end of a read seg-
183 ment.
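The read base column described above can be tokenised with a small scanner. The Python sketch below is an assumption-laden illustration rather than samtools code: `parse_read_bases` and the event names are hypothetical, and the sample column is made up.

```python
def parse_read_bases(column):
    """Split a pileup read base column into (kind, detail) events."""
    events = []
    i = 0
    while i < len(column):
        ch = column[i]
        if ch == "^":
            # The character after `^' encodes the mapping quality (ASCII - 33).
            events.append(("read_start", ord(column[i + 1]) - 33))
            i += 2
        elif ch == "$":
            events.append(("read_end", None))
            i += 1
        elif ch in "+-":
            # An integer length follows, then that many inserted/deleted bases.
            j = i + 1
            while j < len(column) and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            kind = "insertion" if ch == "+" else "deletion"
            events.append((kind, column[j:j + length]))
            i = j + length
        elif ch == ".":
            events.append(("match", "+"))  # match on the forward strand
            i += 1
        elif ch == ",":
            events.append(("match", "-"))  # match on the reverse strand
            i += 1
        elif ch == "*":
            events.append(("deleted_base", None))
            i += 1
        elif ch in "ACGTNacgtn":
            # Upper case: forward-strand mismatch; lower case: reverse strand.
            events.append(("mismatch", ch))
            i += 1
        else:
            raise ValueError("unexpected character %r at offset %d" % (ch, i))
    return events


for event in parse_read_bases("^F.,$A+2AGc-1t*"):
    print(event)
```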
184
185 If option -c is applied, the consensus base, consensus qual-
186 ity, SNP quality and RMS mapping quality of the reads cover-
187 ing the site will be inserted between the `reference base'
188 and the `read bases' columns. An indel occupies an additional
189 line. Each indel line consists of chromosome name, coordi-
190 nate, a star, the genotype, consensus quality, SNP quality,
191 RMS mapping quality, # covering reads, the first allele, the
192 second allele, # reads supporting the first allele, # reads
193 supporting the second allele and # reads containing indels
194 different from the top two alleles.
195
196 OPTIONS:
197
198
199 -s Print the mapping quality as the last column. This
200 option makes the output easier to parse, although
201 this format is not space efficient.
202
203
204 -S The input file is in SAM.
205
206
207 -i Only output pileup lines containing indels.
208
209
210 -f FILE The reference sequence in the FASTA format. Index
211 file FILE.fai will be created if absent.
212
213
214 -M INT Cap mapping quality at INT [60]
215
216
217 -t FILE List of reference names and sequence lengths, in
218 the format described for the import command. If
219 this option is present, samtools assumes the input
220 <in.alignment> is in SAM format; otherwise it
221 assumes BAM format.
222
223
224 -l FILE List of sites at which pileup is output. This file
225 is space delimited. The first two columns are
226 required to be chromosome and 1-based coordinate.
227 Additional columns are ignored. It is recommended
228 to use option -s together with -l as in the default
229 format we may not know the mapping quality.
230
231
232 -c Call the consensus sequence using MAQ consensus
233 model. Options -T, -N, -I and -r are only effective
234 when -c or -g is in use.
235
236
237 -g Generate genotype likelihood in the binary GLFv3
238 format. This option suppresses -c, -i and -s.
239
240
241 -T FLOAT The theta parameter (error dependency coefficient)
242 in the maq consensus calling model [0.85]
243
244
245 -N INT Number of haplotypes in the sample (>=2) [2]
246
247
248 -r FLOAT Expected fraction of differences between a pair of
249 haplotypes [0.001]
250
251
252 -I INT Phred probability of an indel in sequencing/prep.
253 [40]
254
255
256
257 tview samtools tview <in.sorted.bam> [ref.fasta]
258
259 Text alignment viewer (based on the ncurses library). In the
260 viewer, press `?' for help and press `g' to view the alignment
261 starting from a region given in a format like
262 `chr10:10,000,000'.
263
264
265
266 fixmate samtools fixmate <in.nameSrt.bam> <out.bam>
267
268 Fill in mate coordinates, ISIZE and mate related flags from a
269 name-sorted alignment.
270
271
272 rmdup samtools rmdup <input.srt.bam> <out.bam>
273
274 Remove potential PCR duplicates: if multiple read pairs have
275 identical external coordinates, only retain the pair with
276 highest mapping quality. This command ONLY works with FR
277 orientation and requires that ISIZE be correctly set.
278
279
280
281 rmdupse samtools rmdupse <input.srt.bam> <out.bam>
282
283 Remove potential duplicates for single-ended reads. This com-
284 mand will treat all reads as single-ended even if they are
285 paired in fact.
286
287
288
289 fillmd samtools fillmd [-e] <aln.bam> <ref.fasta>
290
291 Generate the MD tag. If the MD tag is already present, this
292 command will give a warning if the MD tag generated is dif-
293 ferent from the existing tag.
294
295 OPTIONS:
296
297 -e Convert a read base to = if it is identical to
298 the aligned reference base. The indel caller does
299 not support the = bases at the moment.
300
301
302
303 SAM FORMAT
304 SAM is TAB-delimited. Apart from the header lines, which start
305 with the `@' symbol, each alignment line consists of:
306
307
308 +----+-------+----------------------------------------------------------+
309 |Col | Field | Description                                              |
310 +----+-------+----------------------------------------------------------+
311 | 1  | QNAME | Query (pair) NAME                                        |
312 | 2  | FLAG  | bitwise FLAG                                             |
313 | 3  | RNAME | Reference sequence NAME                                  |
314 | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
315 | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
316 | 6  | CIGAR | extended CIGAR string                                    |
317 | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
318 | 8  | MPOS  | 1-based Mate POSition                                    |
319 | 9  | ISIZE | Inferred insert SIZE                                     |
320 |10  | SEQ   | query SEQuence on the same strand as the reference       |
321 |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
322 |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
323 +----+-------+----------------------------------------------------------+
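As a rough illustration of the column layout above, the Python sketch below splits one alignment line into named fields and decodes QUAL with the ASCII-33 rule. It is not samtools code: the function names are hypothetical and the sample record is made up for the example.

```python
# Names of the eleven mandatory columns, in the order given in the table.
FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_alignment_line(line):
    """Split a TAB-delimited alignment line into a dict keyed by field name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(FIELDS):
        raise ValueError("expected at least %d columns" % len(FIELDS))
    record = {}
    for name, value in zip(FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Column 12 onwards: optional TAG:VTYPE:VALUE fields, kept as raw strings.
    record["OPT"] = columns[len(FIELDS):]
    return record


def base_qualities(qual):
    """Phred base qualities: ASCII code of each character minus 33."""
    return [ord(char) - 33 for char in qual]


# Made-up record used only to exercise the parser.
line = "read1\t83\tchr2\t1000\t30\t4M\t=\t900\t-104\tACGT\tII5+\tNM:i:0"
record = parse_alignment_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["OPT"])
print(base_qualities(record["QUAL"]))
```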
324
325 Each bit in the FLAG field is defined as:
326
327
328 +-------+--------------------------------------------------+
329 | Flag  | Description                                      |
330 +-------+--------------------------------------------------+
331 |0x0001 | the read is paired in sequencing                 |
332 |0x0002 | the read is mapped in a proper pair              |
333 |0x0004 | the query sequence itself is unmapped            |
334 |0x0008 | the mate is unmapped                             |
335 |0x0010 | strand of the query (1 for reverse)              |
336 |0x0020 | strand of the mate                               |
337 |0x0040 | the read is the first read in a pair             |
338 |0x0080 | the read is the second read in a pair            |
339 |0x0100 | the alignment is not primary                     |
340 |0x0200 | the read fails platform/vendor quality checks    |
341 |0x0400 | the read is either a PCR or an optical duplicate |
342 +-------+--------------------------------------------------+
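Because FLAG is a bit field, a single integer can carry several of the properties listed above at once. The Python sketch below tests each bit in turn; the short labels are hypothetical names chosen for this example and do not come from samtools.

```python
# (bit, label) pairs following the table above; the labels are illustrative.
FLAG_BITS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "query_unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "query_reverse"),
    (0x0020, "mate_reverse"),
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "qc_fail"),
    (0x0400, "duplicate"),
]


def describe_flag(flag):
    """Return the labels of all bits that are set in FLAG."""
    return [label for bit, label in FLAG_BITS if flag & bit]


print(describe_flag(83))      # 83 == 0x0053
print(describe_flag(0x0404))
```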
343
344 LIMITATIONS
345 o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
346 bam_aux.c.
347
348 o CIGAR operation P is not properly handled at the moment.
349
350 o In merging, the input files are required to have the same number of
351 reference sequences. The requirement can be relaxed. In addition,
352 merging does not reconstruct the header dictionaries automatically.
353 End users have to provide the correct header. Picard is better at
354 merging.
355
356 o Samtools' rmdup does not work for single-end data and does not remove
357 duplicates across chromosomes. Picard is better.
358
359
360 AUTHOR
361 Heng Li from the Sanger Institute wrote the C version of samtools. Bob
362 Handsaker from the Broad Institute implemented the BGZF library and Jue
363 Ruan from Beijing Genomics Institute wrote the RAZF library. Various
364 people in the 1000Genomes Project contributed to the SAM format speci-
365 fication.
366
367
368 SEE ALSO
369 Samtools website: <http://samtools.sourceforge.net>
370
371
372
373 samtools-0.1.6 2 September 2009 samtools(1)