The purpose of this command is to identify groups of reads based on their genomic coordinate and UMI.
The group command can be used to create two types of outfile: a tagged BAM or a flatfile describing the read groups.
To generate the tagged-BAM file, use the option --output-bam and provide a filename with the -S option. If you do not provide a filename, the BAM file is written to stdout. If you have provided the --log/-L option to send the logging output elsewhere, you can pipe the output from the group command directly to e.g. samtools sort like so:
umi_tools group -I inf.bam --group-out=grouped.tsv --output-bam --log=group.log --paired | samtools sort - -o grouped_sorted.bam
The tagged-BAM file will have two tags per read:
- UG = Unique_id.
- 0-indexed unique id number for each group of reads with the same genomic position and either the same UMI or UMIs inferred to derive from the same true UMI plus errors
- BX = Final UMI.
- The inferred true UMI for the group
To generate the flatfile describing the read groups, include the --group-out=<filename> option. The columns of the read groups file are listed below. The first five columns relate to the read. The final three columns relate to the group.
- read_id
- read identifier
- contig
- alignment contig
- position
- Alignment position. Note that this position is not the start position of the read in the BAM file but the start of the read taking into account the read strand and CIGAR.
- umi
- The read UMI
- umi_count
- The number of times this UMI is observed for reads at the same position
- final_umi
- The inferred true UMI for the group
- final_umi_count
- The total number of reads within the group
- unique_id
- The unique id for the group
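The columns above can be read back with a few lines of Python. This is a minimal sketch, assuming the flatfile is tab-separated and starts with a header row that uses the column names listed above; the sample rows and the function name reads_per_group are hypothetical and only for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration; a real file is written by --group-out.
# Assumption: tab-separated, with a header row matching the documented columns.
SAMPLE = (
    "read_id\tcontig\tposition\tumi\tumi_count\tfinal_umi\tfinal_umi_count\tunique_id\n"
    "r1\tchr1\t498\tAATT\t2\tAATT\t3\t0\n"
    "r2\tchr1\t498\tAATT\t2\tAATT\t3\t0\n"
    "r3\tchr1\t498\tAATA\t1\tAATT\t3\t0\n"
    "r4\tchr1\t900\tGGCC\t1\tGGCC\t1\t1\n"
)


def reads_per_group(handle):
    """Map each unique_id to the list of read_ids assigned to that group."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[int(row["unique_id"])].append(row["read_id"])
    return dict(groups)


groups = reads_per_group(io.StringIO(SAMPLE))
```

To run this on a real file, replace io.StringIO(SAMPLE) with an open file handle, e.g. open("grouped.tsv", newline="").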
It is assumed that the FASTQ files were processed with umi_tools extract before mapping and thus the UMI is the last word of the read name. e.g:
@HISEQ:87:00000000_AATT
where AATT is the UMI sequence.
If you have used an alternative method which does not separate the read id and UMI with a "_", such as bcl2fastq which uses ":", you can specify the separator with the option --umi-separator=<sep>, replacing <sep> with e.g. ":".
Alternatively, if your UMIs are encoded in a tag, you can specify this by setting the option --extract-umi-method=tag and set the tag name with the --umi-tag option. For example, if your UMIs are encoded in the 'UM' tag, provide the following options: --extract-umi-method=tag --umi-tag=UM
Finally, if you have used the umis tool to extract the UMI (with or without a cell barcode), you can specify --extract-umi-method=umis
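For the read-name case, where the UMI is the last word of the read name, the lookup amounts to splitting on the final separator. The following is an illustrative sketch only, not the tool's own code; the function name umi_from_read_name is hypothetical.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the text after the last separator in a read name.

    Hypothetical helper for illustration; it mirrors the idea behind
    --umi-separator but is not taken from the tool itself.
    """
    if separator not in read_name:
        raise ValueError(f"separator {separator!r} not found in {read_name!r}")
    # rsplit with maxsplit=1 splits only at the final separator.
    return read_name.rsplit(separator, 1)[-1]


default_umi = umi_from_read_name("@HISEQ:87:00000000_AATT")
colon_umi = umi_from_read_name("HISEQ:87:00000000:AATT", separator=":")
```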
The start position of a read is considered to be the start of its alignment minus any soft-clipped bases. A read aligned at position 500 with CIGAR 2S98M is treated as starting at position 498.
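The arithmetic in this example can be sketched as follows. This covers only the case described above, where leading soft-clipped bases are subtracted from the alignment start; how reads on the reverse strand are handled is not shown. The function names are hypothetical.

```python
import re

# Leading soft clip, optionally preceded by a hard clip (e.g. "5H2S93M").
_LEADING_SOFT_CLIP = re.compile(r"^(?:\d+H)?(\d+)S")


def leading_soft_clip(cigar):
    """Number of soft-clipped bases at the start of a CIGAR string."""
    match = _LEADING_SOFT_CLIP.match(cigar)
    return int(match.group(1)) if match else 0


def read_start(alignment_start, cigar):
    """Alignment start minus leading soft-clipped bases (sketch only)."""
    return alignment_start - leading_soft_clip(cigar)


clipped_start = read_start(500, "2S98M")
unclipped_start = read_start(500, "100M")
```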
Which method should be used to identify groups of reads with the same (or similar) UMI(s)?
All methods start by identifying the reads with the same mapping position.
The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene.
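The network construction can be illustrated with a small sketch. It makes two simplifying assumptions that are not stated above: Hamming distance between equal-length UMIs stands in for edit distance, and every connected component of the network is treated as one group, which is the simplest way to resolve a network. The other network-based methods resolve networks differently and are not shown. The function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def connected_umi_groups(umis, threshold=1):
    """Group UMIs at one position by connected components of the UMI network.

    Nodes are UMIs; an edge joins two UMIs whose Hamming distance is
    <= threshold. Assumption: each connected component is one group.
    """
    nodes = sorted(set(umis))
    neighbours = {umi: set() for umi in nodes}
    for a, b in combinations(nodes, 2):
        if len(a) == len(b) and hamming(a, b) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for umi in nodes:
        if umi in seen:
            continue
        # Depth-first traversal collects everything reachable from this UMI.
        stack, component = [umi], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


example_groups = connected_umi_groups(["AATT", "AATA", "AAAA", "GGCC"])
```

In this example AAAA and AATT differ at two positions but still land in the same group, because both are one mismatch away from AATA.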