umi_tools group - Group reads based on their UMI

Purpose

The purpose of this command is to identify groups of reads based on their genomic coordinate and UMI.

The group command can create two types of output file: a tagged BAM file or a flatfile describing the read groups.

To generate the tagged BAM file, use the option --output-bam and provide a filename with the -S option. If you do not provide a filename, the BAM file is written to stdout. If you have used the --log/-L option to send the logging output elsewhere, you can pipe the output from the group command directly to, e.g., samtools sort, like so:

umi_tools group -I inf.bam --group-out=grouped.tsv --output-bam --log=group.log --paired | samtools sort - -o grouped_sorted.bam

The tagged BAM file will have two tags per read:

UG = Unique_id.

0-indexed unique id number for each group of reads that share a genomic position and either the same UMI or UMIs inferred to derive from the same true UMI through errors.

BX = Final UMI.

The inferred true UMI for the group

To generate the flatfile describing the read groups, include the --group-out=<filename> option. The columns of the read groups file are listed below. The first five columns relate to the read. The final three columns relate to the group.

read_id

read identifier

contig

alignment contig

position

Alignment position. Note that this is not the start position of the read as recorded in the BAM file, but the start of the read after taking the read strand and CIGAR string into account.

umi

The read UMI

umi_count

The number of times this UMI is observed for reads at the same position

final_umi

The inferred true UMI for the group

final_umi_count

The total number of reads within the group

unique_id

The unique id for the group
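
As a minimal sketch of how the flatfile might be consumed, the Python snippet below tallies the number of reads in each group. It assumes the file is tab-separated with a header row that uses the column names listed above. The sample rows are made up for illustration, and reads_per_group is a hypothetical helper, not part of umi_tools.

```python
import csv
import io
from collections import Counter

# Made-up example rows; a real file would come from --group-out=<filename>.
# Assumption: tab-separated, with a header row matching the column names above.
EXAMPLE_TSV = (
    "read_id\tcontig\tposition\tumi\tumi_count\tfinal_umi\tfinal_umi_count\tunique_id\n"
    "r1\tchr1\t498\tAATT\t2\tAATT\t3\t0\n"
    "r2\tchr1\t498\tAATT\t2\tAATT\t3\t0\n"
    "r3\tchr1\t498\tAATA\t1\tAATT\t3\t0\n"
    "r4\tchr1\t900\tCCGG\t1\tCCGG\t1\t1\n"
)


def reads_per_group(handle):
    """Count rows (reads) per unique_id in a read-groups flatfile."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(int(row["unique_id"]) for row in reader)


group_sizes = reads_per_group(io.StringIO(EXAMPLE_TSV))
for unique_id, n_reads in sorted(group_sizes.items()):
    print(unique_id, n_reads)
```

For a consistent file, the tally for each unique_id should agree with the final_umi_count reported on that group's rows.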

Extracting barcodes

It is assumed that the FASTQ files were processed with umi_tools extract before mapping and thus the UMI is the last word of the read name. For example:

@HISEQ:87:00000000_AATT

where AATT is the UMI sequence.

If you have used an alternative method which does not separate the read id and UMI with a "_", such as bcl2fastq which uses ":", you can specify the separator with the option --umi-separator=<sep>, replacing <sep> with, e.g., ":".
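
To illustrate what "the last word of the read name" means, here is a minimal Python sketch that takes everything after the final separator. The function umi_from_read_name is a hypothetical helper and not necessarily how umi_tools parses read names; the second read name is a made-up example of a ":"-separated UMI.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the text after the last occurrence of separator in the read name."""
    # rsplit with maxsplit=1 splits only at the right-most separator.
    return read_name.rsplit(separator, 1)[-1]


print(umi_from_read_name("HISEQ:87:00000000_AATT"))
print(umi_from_read_name("HISEQ:87:00000000:AATT", separator=":"))
```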

Alternatively, if your UMIs are encoded in a tag, you can specify this by setting the option --extract-umi-method=tag and set the tag name with the --umi-tag option. For example, if your UMIs are encoded in the 'UM' tag, provide the following options: --extract-umi-method=tag --umi-tag=UM

Finally, if you have used umis to extract the UMI +/- cell barcode, you can specify --extract-umi-method=umis

The start position of a read is considered to be the start of its alignment minus any soft-clipped bases. For example, a read aligned at position 500 with CIGAR 2S98M is assumed to start at position 498.
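
The soft-clip adjustment can be sketched in Python as follows. This covers only a forward-strand read, where the clipped bases sit at the start of the CIGAR string; handling of reverse-strand reads is not shown. The name forward_read_start is hypothetical, and allowing a leading hard clip before the soft clip is an assumption of this sketch.

```python
import re

# Optional leading hard clip (e.g. 5H), then the leading soft clip (e.g. 2S).
LEADING_SOFT_CLIP = re.compile(r"^(?:\d+H)?(\d+)S")


def forward_read_start(alignment_pos, cigar):
    """Alignment start minus leading soft-clipped bases (forward strand only)."""
    match = LEADING_SOFT_CLIP.match(cigar)
    clipped = int(match.group(1)) if match else 0
    return alignment_pos - clipped


print(forward_read_start(500, "2S98M"))
```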

UMI grouping options

Grouping Method

Which method to use to identify groups of reads with the same (or similar) UMI(s)?

All methods start by identifying the reads with the same mapping position.

The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene.

unique

Reads in a group share the exact same UMI.

percentile

Reads in a group share the exact same UMI. UMIs with counts < 1% of the median counts for UMIs at the same position are ignored.

cluster

Identify clusters of connected UMIs (based on the Hamming distance threshold). Each network is a read group.

adjacency

Cluster UMIs as above. For each cluster, select the node (UMI) with the highest counts and visit all nodes one edge away. If all nodes in the cluster have been visited, stop. Otherwise, repeat with the remaining nodes until all nodes have been visited. Each step defines a read group.

directional (default)

Identify clusters of connected UMIs, where UMI A is connected to UMI B only if they are within the Hamming distance threshold and umi A counts >= (2 * umi B counts) - 1. Each network is a read group.
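
The directional rule can be illustrated with a short Python sketch for the UMIs observed at a single position. This is not the umi_tools implementation: directional_groups and hamming are hypothetical names, and the traversal order (most abundant UMI first, ties broken alphabetically, each UMI assigned to the first group that reaches it) is an assumption made here so that the output is deterministic.

```python
from collections import deque
from itertools import permutations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts, threshold=1):
    """Group UMIs at one position; counts maps each UMI to its read count."""
    # Directed edge A -> B when the UMIs are within the threshold and
    # A is sufficiently more abundant: count(A) >= 2 * count(B) - 1.
    edges = {umi: [] for umi in counts}
    for a, b in permutations(counts, 2):
        if hamming(a, b) <= threshold and counts[a] >= 2 * counts[b] - 1:
            edges[a].append(b)

    groups = []
    assigned = set()
    # Assumption: seed each group from the most abundant unassigned UMI.
    for seed in sorted(counts, key=lambda u: (-counts[u], u)):
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        queue = deque([seed])
        # Breadth-first walk along directed edges from the seed.
        while queue:
            node = queue.popleft()
            for neighbour in edges[node]:
                if neighbour not in assigned:
                    assigned.add(neighbour)
                    group.append(neighbour)
                    queue.append(neighbour)
        groups.append(group)
    return groups


# Made-up counts for four UMIs at the same position.
print(directional_groups({"AAAA": 100, "AAAT": 10, "AATT": 3, "CCCC": 5}))
```

In this made-up example AAAA, AAAT and AATT end up in one group, because each step along the chain satisfies both the distance and the count condition, while CCCC forms its own group. Two UMIs one mismatch apart with counts of 10 and 8 would stay separate, since 10 < (2 * 8) - 1.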