
What it does

Samtools view can:

convert between alignment formats (SAM, BAM, CRAM)
filter and subsample alignments according to user-specified criteria
count the reads in the input dataset or those retained after filtering and subsampling
obtain just the header of the input in any supported format

In addition, the tool has (limited) options to modify read records during conversion and/or filtering by:

stripping them of user-specified tags
collapsing backward CIGAR operations if they are specified in their CIGAR fields

With default settings, the tool generates a BAM dataset with the header and reads found in the input dataset (which can be in SAM, BAM, or CRAM format).

Alignment format conversion

Inputs in SAM, BAM, or CRAM format are accepted and can be converted to any of these formats by selecting the appropriate "Output type". Alternatively, the tool can compute alignment counts instead of writing alignments.
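For readers who want to relate the output choice to the underlying command line, the following is a minimal sketch that assembles (but does not run) a `samtools view` call. The flags `-b`, `-C`, `-h`, `-T`, and `-o` are standard `samtools view` options; the function name `build_view_command` and its parameters are hypothetical, and how the Galaxy wrapper actually builds its command is not shown here.

```python
from typing import List, Optional

# Output-format flags of `samtools view`; SAM is the default when neither is given.
FORMAT_FLAGS = {"sam": [], "bam": ["-b"], "cram": ["-C"]}


def build_view_command(
    input_path: str,
    output_path: str,
    output_format: str = "bam",
    reference_fasta: Optional[str] = None,
) -> List[str]:
    """Assemble a `samtools view` argument list (hypothetical helper)."""
    fmt = output_format.lower()
    if fmt not in FORMAT_FLAGS:
        raise ValueError(f"unsupported output format: {output_format}")
    cmd = ["samtools", "view"] + FORMAT_FLAGS[fmt]
    if fmt == "sam":
        # SAM output omits the header unless -h is given.
        cmd.append("-h")
    if reference_fasta is not None:
        # Reference FASTA, e.g. for CRAM or for SAM input lacking @SQ lines.
        cmd += ["-T", reference_fasta]
    cmd += ["-o", output_path, input_path]
    return cmd
```

The resulting list could be passed to `subprocess.run` on a system where samtools is installed.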

The tool allows you to specify a reference sequence. This is required for SAM input with missing @SQ headers (which include sequence names, lengths, md5 checksums, etc.) and is useful (and sometimes necessary) for CRAM input and output. The rest of this section details how the reference sequence is used in the CRAM format. CRAM is (primarily) a reference-based compressed format, i.e., only the sequence differences between aligned reads and the reference are stored. As a consequence, the reference that was used during read mapping is needed to interpret the alignment records (a checksum stored in the CRAM file is used to verify that only the correct reference sequence can be used). This allows for more space-efficient storage than the BAM format, but such a CRAM file is not usable without its reference. It is also possible to use CRAM without a reference, with the disadvantage that the read sequences are then stored explicitly (as in SAM and BAM).
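The idea of reference-based storage can be illustrated with a toy sketch. This is not the CRAM encoding itself: it assumes reads that lie fully within the reference and differ only by substitutions, and the names `encode_read`, `decode_read`, and `md5_of` are made up for the illustration.

```python
import hashlib
from typing import List, Tuple

Diff = Tuple[int, str]  # (offset within the read, substituted base)


def md5_of(sequence: str) -> str:
    """Checksum used to recognise the reference (toy stand-in)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def encode_read(reference: str, start: int, read: str) -> List[Diff]:
    """Keep only the bases that differ from the reference (0-based start)."""
    window = reference[start:start + len(read)]
    return [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]


def decode_read(
    reference: str, expected_md5: str, start: int, length: int, diffs: List[Diff]
) -> str:
    """Rebuild the read; refuse to decode against the wrong reference."""
    if md5_of(reference) != expected_md5:
        raise ValueError("reference does not match the stored checksum")
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)
```

A read that matches the reference exactly encodes to an empty list, which is where the space saving comes from; without the matching reference the stored differences cannot be turned back into a sequence.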

The Galaxy tool currently generates only CRAM without a reference sequence.

For reference-based CRAM input, the correct reference sequence needs to be specified.

Filtering alignments

If you ask for "A filtered/subsampled selection of reads", the tool will allow you to specify filter conditions and/or to choose a subsampling strategy, and the output will contain one of the following depending on your choice under "What would you like to have reported?":

All reads retained after filtering and subsampling
Reads dropped during filtering and subsampling

If instead you want to split the input reads based on your criteria and obtain two datasets, one with the retained and one with the dropped reads, check the "Produce extra dataset with dropped/retained reads?" option.
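Conceptually, splitting the input amounts to partitioning the reads with a single predicate. The following minimal sketch shows that idea; `split_reads` and the `keep` callback are hypothetical names, and the tool's actual filter and subsampling logic is not reproduced here.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Read = TypeVar("Read")


def split_reads(
    reads: Iterable[Read], keep: Callable[[Read], bool]
) -> Tuple[List[Read], List[Read]]:
    """Return (retained, dropped); every input read lands in exactly one list."""
    retained: List[Read] = []
    dropped: List[Read] = []
    for read in reads:
        (retained if keep(read) else dropped).append(read)
    return retained, dropped
```

Because each read goes to exactly one of the two lists, the two datasets together always contain the same reads as the input.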

Filtering by regions

You may specify one or more space-separated region specifications after the input filename to restrict output to only those alignments which overlap the specified region(s). Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).

Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and all position coordinates are 1-based.

When multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.

Examples of region specifications:

chr1: Output all alignments mapped to the reference sequence named 'chr1' (i.e. @SQ SN:chr1).
chr2:1000000: The region on chr2 beginning at base position 1,000,000 and ending at the end of the chromosome.
chr3:1000-2000: The 1001bp region on chr3 beginning at base position 1,000 and ending at base position 2,000 (including both end positions).
*: Output the unmapped reads at the end of the file. (This does not include any unmapped reads placed on a reference sequence alongside their mapped mates.)
.: Output all alignments. (Mostly unnecessary as not specifying a region at all has the same effect.)
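The region syntax and the overlap rule described above can be sketched as follows. This is an illustrative parser, not the one used by samtools: it assumes reference names that contain no colon, and it does not give the special values `*` and `.` any special treatment. The names `Region`, `parse_region`, and `overlaps` are hypothetical.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str
    start: Optional[int]  # 1-based, inclusive; None means "from the first base"
    end: Optional[int]    # 1-based, inclusive; None means "to the end of the sequence"


def parse_region(spec: str) -> Region:
    """Parse RNAME[:STARTPOS[-ENDPOS]] (assumes RNAME contains no ':')."""
    name, colon, coords = spec.partition(":")
    if not name:
        raise ValueError("empty reference name")
    if not colon:
        return Region(name, None, None)
    start_text, dash, end_text = coords.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    if start < 1 or (end is not None and end < start):
        raise ValueError(f"invalid coordinates in region: {spec}")
    return Region(name, start, end)


def overlaps(region: Region, ref_name: str, aln_start: int, aln_end: int) -> bool:
    """True if an alignment spanning [aln_start, aln_end] (1-based) touches the region."""
    if region.name != ref_name:
        return False
    if region.start is not None and aln_end < region.start:
        return False
    if region.end is not None and aln_start > region.end:
        return False
    return True
```

An alignment that satisfies `overlaps` for two of the requested regions is exactly the case in which it may appear more than once in the output.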

Filtering by quality

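This filter uses the MAPQ column of the SAM format, which gives an estimate of how likely it is that the alignment is placed correctly. Note that aligners do not follow a consistent definition of MAPQ, so the same threshold can have a different meaning for output from different aligners.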
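As a minimal sketch of what a MAPQ threshold does, the following function works on SAM text lines, where MAPQ is the fifth tab-separated column and header lines start with `@`. The function name `filter_by_mapq` is hypothetical, and keeping reads with MAPQ greater than or equal to the threshold mirrors the behaviour of the `-q` option of `samtools view`.

```python
from typing import Iterable, Iterator


def filter_by_mapq(sam_lines: Iterable[str], min_mapq: int) -> Iterator[str]:
    """Yield header lines unchanged and alignments with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines carry no MAPQ and are passed through
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            yield line
```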

Filtering by tag

This filter selects reads based on tool- or user-specific tags, e.g., XS:i:-18, the alignment score tag of bowtie. To filter for a specific value of the tag, use the format STR1:STR2, e.g., XS:-18 to select reads with an alignment score of -18. You can also write just STR1 without the value STR2, in which case the filter selects all reads that carry the tag STR1, e.g., XS.
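The matching rule for STR1 and STR1:STR2 can be sketched on SAM text lines, where optional fields start at column 12 and have the form TAG:TYPE:VALUE. This is an illustration only: the names `parse_tag_filter` and `has_tag` are hypothetical, and values are compared as plain strings, which is an assumption rather than a description of how samtools compares them.

```python
from typing import Optional, Tuple


def parse_tag_filter(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'STR1' or 'STR1:STR2' into (tag, value or None)."""
    tag, colon, value = spec.partition(":")
    return tag, (value if colon else None)


def has_tag(sam_line: str, spec: str) -> bool:
    """True if the record carries the tag (and, if given, the exact value)."""
    wanted_tag, wanted_value = parse_tag_filter(spec)
    # Columns 1-11 are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        tag, _type, value = field.split(":", 2)
        if tag == wanted_tag and (wanted_value is None or value == wanted_value):
            return True
    return False
```

Note that the type letter (the `i` in XS:i:-18) is not part of the filter expression, which is why XS:-18 rather than XS:i:-18 is used to select a value.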