comparison SNV/SNVMix2_source/SNVMix2-v0.12.1-rc1/samtools-0.1.6/samtools.txt @ 0:74f5ea818cea
Uploaded
| author | ryanmorin |
|---|---|
| date | Wed, 12 Oct 2011 19:50:38 -0400 |
| parents | |
| children |
comparison
| -1:000000000000 | 0:74f5ea818cea |
|---|---|
| 1 samtools(1) Bioinformatics tools samtools(1) | |
| 2 | |
| 3 | |
| 4 | |
| 5 NAME | |
| 6 samtools - Utilities for the Sequence Alignment/Map (SAM) format | |
| 7 | |
| 8 SYNOPSIS | |
| 9 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz | |
| 10 | |
| 11 samtools sort aln.bam aln.sorted | |
| 12 | |
| 13 samtools index aln.sorted.bam | |
| 14 | |
| 15 samtools view aln.sorted.bam chr2:20,100,000-20,200,000 | |
| 16 | |
| 17 samtools merge out.bam in1.bam in2.bam in3.bam | |
| 18 | |
| 19 samtools faidx ref.fasta | |
| 20 | |
| 21 samtools pileup -f ref.fasta aln.sorted.bam | |
| 22 | |
| 23 samtools tview aln.sorted.bam ref.fasta | |
| 24 | |
| 25 | |
| 26 DESCRIPTION | |
| 27 Samtools is a set of utilities that manipulate alignments in the BAM | |
| 28 format. It imports from and exports to the SAM (Sequence Alignment/Map) | |
| 29 format, does sorting, merging and indexing, and allows reads in any | |
| 30 region to be retrieved swiftly. | |
| 31 | |
| 32 Samtools is designed to work on a stream. It regards an input file `-' | |
| 33 as the standard input (stdin) and an output file `-' as the standard | |
| 34 output (stdout). Several commands can thus be combined with Unix pipes. | |
| 35 Samtools always outputs warning and error messages to the standard error | |
| 36 output (stderr). | |
| 37 | |
| 38 Samtools is also able to open a BAM (not SAM) file on a remote FTP or | |
| 39 HTTP server if the BAM file name starts with `ftp://' or `http://'. | |
| 40 Samtools checks the current working directory for the index file and | |
| 41 will download the index if it is absent. Samtools does not retrieve the | |
| 42 entire alignment file unless it is asked to do so. | |
| 43 | |
| 44 | |
| 45 COMMANDS AND OPTIONS | |
| 46 import samtools import <in.ref_list> <in.sam> <out.bam> | |
| 47 | |
| 48 Since 0.1.4, this command is an alias of: | |
| 49 | |
| 50 samtools view -bt <in.ref_list> -o <out.bam> <in.sam> | |
| 51 | |
| 52 | |
| 53 sort samtools sort [-n] [-m maxMem] <in.bam> <out.prefix> | |
| 54 | |
| 55 Sort alignments by leftmost coordinates. File <out.pre- | |
| 56 fix>.bam will be created. This command may also create tempo- | |
| 57 rary files <out.prefix>.%d.bam when the whole alignment can- | |
| 58 not be fitted into memory (controlled by option -m). | |
| 59 | |
| 60 OPTIONS: | |
| 61 | |
| 62 -n Sort by read names rather than by chromosomal coordi- | |
| 63 nates | |
| 64 | |
| 65 -m INT Approximately the maximum required memory. | |
| 66 [500000000] | |
| 67 | |
| 68 | |
| 69 merge samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam> | |
| 70 <in2.bam> [...] | |
| 71 | |
| 72 Merge multiple sorted alignments. The header reference lists | |
| 73 of all the input BAM files, and the @SQ headers of inh.sam, | |
| 74 if any, must all refer to the same set of reference | |
| 75 sequences. The header reference list and (unless overridden | |
| 76 by -h) `@' headers of in1.bam will be copied to out.bam, and | |
| 77 the headers of other files will be ignored. | |
| 78 | |
| 79 OPTIONS: | |
| 80 | |
| 81 -h FILE Use the lines of FILE as `@' headers to be copied to | |
| 82 out.bam, replacing any header lines that would other- | |
| 83 wise be copied from in1.bam. (FILE is actually in | |
| 84 SAM format, though any alignment records it may con- | |
| 85 tain are ignored.) | |
| 86 | |
| 87 -n The input alignments are sorted by read names rather | |
| 88 than by chromosomal coordinates | |
| 89 | |
| 90 | |
| 91 index samtools index <aln.bam> | |
| 92 | |
| 93 Index sorted alignment for fast random access. Index file | |
| 94 <aln.bam>.bai will be created. | |
| 95 | |
| 96 | |
| 97 view samtools view [-bhuHS] [-t in.refList] [-o output] [-f | |
| 98 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read- | |
| 99 Group] <in.bam>|<in.sam> [region1 [...]] | |
| 100 | |
| 101 Extract/print all or sub alignments in SAM or BAM format. If | |
| 102 no region is specified, all the alignments will be printed; | |
| 103 otherwise only alignments overlapping the specified regions | |
| 104 will be output. An alignment may be given multiple times if | |
| 105 it is overlapping several regions. A region can be presented, | |
| 106 for example, in the following format: `chr2', `chr2:1000000' | |
| 107 or `chr2:1,000,000-2,000,000'. The coordinate is 1-based. | |
| 108 | |
| 109 OPTIONS: | |
| 110 | |
| 111 -b Output in the BAM format. | |
| 112 | |
| 113 -u Output uncompressed BAM. This option saves time spent | |
| 114 on compression/decompression and is thus preferred | |
| 115 when the output is piped to another samtools command. | |
| 116 | |
| 117 -h Include the header in the output. | |
| 118 | |
| 119 -H Output the header only. | |
| 120 | |
| 121 -S Input is in SAM. If @SQ header lines are absent, the | |
| 122 `-t' option is required. | |
| 123 | |
| 124 -t FILE This file is TAB-delimited. Each line must contain | |
| 125 the reference name and the length of the reference, | |
| 126 one line for each distinct reference; additional | |
| 127 fields are ignored. This file also defines the order | |
| 128 of the reference sequences in sorting. If you run | |
| 129 `samtools faidx <ref.fa>', the resultant index file | |
| 130 <ref.fa>.fai can be used as this <in.ref_list> file. | |
| 131 | |
| 132 -o FILE Output file [stdout] | |
| 133 | |
| 134 -f INT Only output alignments with all bits in INT present | |
| 135 in the FLAG field. INT can be in hex in the format of | |
| 136 /^0x[0-9A-F]+/ [0] | |
| 137 | |
| 138 -F INT Skip alignments with bits present in INT [0] | |
| 139 | |
| 140 -q INT Skip alignments with MAPQ smaller than INT [0] | |
| 141 | |
| 142 -l STR Only output reads in library STR [null] | |
| 143 | |
| 144 -r STR Only output reads in read group STR [null] | |
| 145 | |
| 146 | |
| 147 faidx samtools faidx <ref.fasta> [region1 [...]] | |
| 148 | |
| 149 Index reference sequence in the FASTA format or extract sub- | |
| 150 sequence from indexed reference sequence. If no region is | |
| 151 specified, faidx will index the file and create | |
| 152 <ref.fasta>.fai on the disk. If regions are specified, the | |
| 153 subsequences will be retrieved and printed to stdout in the | |
| 154 FASTA format. The input file can be compressed in the RAZF | |
| 155 format. | |
| 156 | |
| 157 | |
| 158 pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l | |
| 159 in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r | |
| 160 pairDiffRate] <in.bam>|<in.sam> | |
| 161 | |
| 162 Print the alignment in the pileup format. In the pileup for- | |
| 163 mat, each line represents a genomic position, consisting of | |
| 164 chromosome name, coordinate, reference base, read bases, read | |
| 165 qualities and alignment mapping qualities. Information on | |
| 166 match, mismatch, indel, strand, mapping quality and start and | |
| 167 end of a read are all encoded at the read base column. At | |
| 168 this column, a dot stands for a match to the reference base | |
| 169 on the forward strand, a comma for a match on the reverse | |
| 170 strand, `ACGTN' for a mismatch on the forward strand and | |
| 171 `acgtn' for a mismatch on the reverse strand. A pattern | |
| 172 `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion | |
| 173 between this reference position and the next reference posi- | |
| 174 tion. The length of the insertion is given by the integer in | |
| 175 the pattern, followed by the inserted sequence. Similarly, a | |
| 176 pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the | |
| 177 reference. The deleted bases will be presented as `*' in the | |
| 178 following lines. Also at the read base column, a symbol `^' | |
| 179 marks the start of a read segment which is a contiguous sub- | |
| 180 sequence on the read separated by `N/S/H' CIGAR operations. | |
| 181 The ASCII of the character following `^' minus 33 gives the | |
| 182 mapping quality. A symbol `$' marks the end of a read seg- | |
| 183 ment. | |
| 184 | |
| 185 If option -c is applied, the consensus base, consensus qual- | |
| 186 ity, SNP quality and RMS mapping quality of the reads cover- | |
| 187 ing the site will be inserted between the `reference base' | |
| 188 and the `read bases' columns. An indel occupies an additional | |
| 189 line. Each indel line consists of chromosome name, coordi- | |
| 190 nate, a star, the genotype, consensus quality, SNP quality, | |
| 191 RMS mapping quality, # covering reads, the first allele, the | |
| 192 second allele, # reads supporting the first allele, # reads | |
| 193 supporting the second allele and # reads containing indels | |
| 194 different from the top two alleles. | |
| 195 | |
| 196 OPTIONS: | |
| 197 | |
| 198 | |
| 199 -s Print the mapping quality as the last column. This | |
| 200 option makes the output easier to parse, although | |
| 201 this format is not space efficient. | |
| 202 | |
| 203 | |
| 204 -S The input file is in SAM. | |
| 205 | |
| 206 | |
| 207 -i Only output pileup lines containing indels. | |
| 208 | |
| 209 | |
| 210 -f FILE The reference sequence in the FASTA format. Index | |
| 211 file FILE.fai will be created if absent. | |
| 212 | |
| 213 | |
| 214 -M INT Cap mapping quality at INT [60] | |
| 215 | |
| 216 | |
| 217 -t FILE List of reference names and sequence lengths, in | |
| 218 the format described for the import command. If | |
| 219 this option is present, samtools assumes the input | |
| 220 <in.alignment> is in SAM format; otherwise it | |
| 221 assumes BAM format. | |
| 222 | |
| 223 | |
| 224 -l FILE List of sites at which pileup is output. This file | |
| 225 is space delimited. The first two columns are | |
| 226 required to be chromosome and 1-based coordinate. | |
| 227 Additional columns are ignored. It is recommended | |
| 228 to use option -s together with -l as in the default | |
| 229 format we may not know the mapping quality. | |
| 230 | |
| 231 | |
| 232 -c Call the consensus sequence using MAQ consensus | |
| 233 model. Options -T, -N, -I and -r are only effective | |
| 234 when -c or -g is in use. | |
| 235 | |
| 236 | |
| 237 -g Generate genotype likelihood in the binary GLFv3 | |
| 238 format. This option suppresses -c, -i and -s. | |
| 239 | |
| 240 | |
| 241 -T FLOAT The theta parameter (error dependency coefficient) | |
| 242 in the maq consensus calling model [0.85] | |
| 243 | |
| 244 | |
| 245 -N INT Number of haplotypes in the sample (>=2) [2] | |
| 246 | |
| 247 | |
| 248 -r FLOAT Expected fraction of differences between a pair of | |
| 249 haplotypes [0.001] | |
| 250 | |
| 251 | |
| 252 -I INT Phred probability of an indel in sequencing/prep. | |
| 253 [40] | |
| 254 | |
| 255 | |
| 256 | |
| 257 tview samtools tview <in.sorted.bam> [ref.fasta] | |
| 258 | |
| 259 Text alignment viewer (based on the ncurses library). In the | |
| 260 viewer, press `?' for help and press `g' to view the | |
| 261 alignment starting from a region in a format like | |
| 262 `chr10:10,000,000'. | |
| 263 | |
| 264 | |
| 265 | |
| 266 fixmate samtools fixmate <in.nameSrt.bam> <out.bam> | |
| 267 | |
| 268 Fill in mate coordinates, ISIZE and mate related flags from a | |
| 269 name-sorted alignment. | |
| 270 | |
| 271 | |
| 272 rmdup samtools rmdup <input.srt.bam> <out.bam> | |
| 273 | |
| 274 Remove potential PCR duplicates: if multiple read pairs have | |
| 275 identical external coordinates, only retain the pair with | |
| 276 highest mapping quality. This command ONLY works with FR | |
| 277 orientation and requires that ISIZE be correctly set. | |
| 278 | |
| 279 | |
| 280 | |
| 281 rmdupse samtools rmdupse <input.srt.bam> <out.bam> | |
| 282 | |
| 283 Remove potential duplicates for single-ended reads. This com- | |
| 284 mand will treat all reads as single-ended even if they are | |
| 285 paired in fact. | |
| 286 | |
| 287 | |
| 288 | |
| 289 fillmd samtools fillmd [-e] <aln.bam> <ref.fasta> | |
| 290 | |
| 291 Generate the MD tag. If the MD tag is already present, this | |
| 292 command will give a warning if the MD tag generated is dif- | |
| 293 ferent from the existing tag. | |
| 294 | |
| 295 OPTIONS: | |
| 296 | |
| 297 -e Convert a read base to = if it is identical to | |
| 298 the aligned reference base. The indel caller does not | |
| 299 support the = bases at the moment. | |
| 300 | |
| 301 | |
| 302 | |
| 303 SAM FORMAT | |
| 304 SAM is TAB-delimited. Apart from the header lines, which start | |
| 305 with the `@' symbol, each alignment line consists of: | |
| 306 | |
| 307 | |
| 308 +----+-------+----------------------------------------------------------+ | |
| 309 |Col | Field | Description | | |
| 310 +----+-------+----------------------------------------------------------+ | |
| 311 | 1 | QNAME | Query (pair) NAME | | |
| 312 | 2 | FLAG | bitwise FLAG | | |
| 313 | 3 | RNAME | Reference sequence NAME | | |
| 314 | 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence | | |
| 315 | 5 | MAPQ | MAPping Quality (Phred-scaled) | | |
| 316 | 6 | CIGAR | extended CIGAR string | |
| 317 | 7 | MRNM | Mate Reference sequence NaMe (`=' if same as RNAME) | | |
| 318 | 8 | MPOS | 1-based Mate POSition | |
| 319 | 9 | ISIZE | Inferred insert SIZE | | |
| 320 |10 | SEQ | query SEQuence on the same strand as the reference | | |
| 321 |11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) | | |
| 322 |12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE | | |
| 323 +----+-------+----------------------------------------------------------+ | |
| 324 | |
| 325 Each bit in the FLAG field is defined as: | |
| 326 | |
| 327 | |
| 328 +-------+--------------------------------------------------+ | |
| 329 | Flag | Description | | |
| 330 +-------+--------------------------------------------------+ | |
| 331 |0x0001 | the read is paired in sequencing | | |
| 332 |0x0002 | the read is mapped in a proper pair | | |
| 333 |0x0004 | the query sequence itself is unmapped | | |
| 334 |0x0008 | the mate is unmapped | | |
| 335 |0x0010 | strand of the query (1 for reverse) | | |
| 336 |0x0020 | strand of the mate | | |
| 337 |0x0040 | the read is the first read in a pair | | |
| 338 |0x0080 | the read is the second read in a pair | | |
| 339 |0x0100 | the alignment is not primary | | |
| 340 |0x0200 | the read fails platform/vendor quality checks | | |
| 341 |0x0400 | the read is either a PCR or an optical duplicate | | |
| 342 +-------+--------------------------------------------------+ | |
| 343 | |
| 344 LIMITATIONS | |
| 345 o Unaligned words are used in bam_import.c, bam_endian.h, bam.c and | |
| 346 bam_aux.c. | |
| 347 | |
| 348 o CIGAR operation P is not properly handled at the moment. | |
| 349 | |
| 350 o In merging, the input files are required to have the same number of | |
| 351 reference sequences. The requirement can be relaxed. In addition, | |
| 352 merging does not reconstruct the header dictionaries automatically. | |
| 353 End users have to provide the correct header. Picard is better at | |
| 354 merging. | |
| 355 | |
| 356 o Samtools' rmdup does not work for single-end data and does not remove | |
| 357 duplicates across chromosomes. Picard is better. | |
| 358 | |
| 359 | |
| 360 AUTHOR | |
| 361 Heng Li from the Sanger Institute wrote the C version of samtools. Bob | |
| 362 Handsaker from the Broad Institute implemented the BGZF library and Jue | |
| 363 Ruan from Beijing Genomics Institute wrote the RAZF library. Various | |
| 364 people in the 1000Genomes Project contributed to the SAM format speci- | |
| 365 fication. | |
| 366 | |
| 367 | |
| 368 SEE ALSO | |
| 369 Samtools website: <http://samtools.sourceforge.net> | |
| 370 | |
| 371 | |
| 372 | |
| 373 samtools-0.1.6 2 September 2009 samtools(1) |
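
The read-base column described under the pileup command packs several kinds of events into one string, which can be hard to follow from prose alone. The following is a minimal Python sketch of a tokenizer for that column, written only from the description above. The function name `parse_read_bases` and the event labels are illustrative assumptions, not part of samtools, and the sketch has not been checked against samtools' own parser.

```python
import re

_DIGITS = re.compile(r"[0-9]+")


def parse_read_bases(bases):
    """Split a pileup read-base column into a list of (event, value) tuples.

    Hypothetical helper, based only on the man page text: '.' and ',' are
    matches on the forward and reverse strand, 'ACGTN' and 'acgtn' are
    mismatches, '+N<seq>' and '-N<seq>' are indels of length N, '*' is a
    base deleted by an earlier '-' pattern, '^' opens a read segment and is
    followed by a mapping quality character, and '$' closes a segment.
    """
    events = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # The character after '^' encodes the mapping quality as ASCII - 33.
            events.append(("start", ord(bases[i + 1]) - 33))
            i += 2
        elif c == "$":
            events.append(("end", None))
            i += 1
        elif c == "+" or c == "-":
            m = _DIGITS.match(bases, i + 1)
            if m is None:
                raise ValueError("indel marker without a length at offset %d" % i)
            length = int(m.group())
            # The integer, not a greedy match, decides how many bases belong
            # to the indel; whatever follows is an ordinary read base.
            seq = bases[m.end():m.end() + length]
            events.append(("insertion" if c == "+" else "deletion", seq))
            i = m.end() + length
        elif c == ".":
            events.append(("match", "+"))
            i += 1
        elif c == ",":
            events.append(("match", "-"))
            i += 1
        elif c == "*":
            events.append(("deleted", None))
            i += 1
        elif c in "ACGTN":
            events.append(("mismatch", (c, "+")))
            i += 1
        elif c in "acgtn":
            events.append(("mismatch", (c.upper(), "-")))
            i += 1
        else:
            raise ValueError("unexpected character %r at offset %d" % (c, i))
    return events
```

For example, `parse_read_bases("^F.,$")` yields a segment start with mapping quality 37 (ASCII 70 minus 33), a forward-strand match, a reverse-strand match and a segment end. Taking the indel length from the integer rather than from the `[ACGTNacgtn]+` part of the pattern keeps a mismatch base that directly follows an indel from being absorbed into it.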
