The purpose of this command is to deduplicate BAM files based on the first mapping co-ordinate and the UMI attached to the read.
It is assumed that the FASTQ files were processed with umi_tools extract before mapping and thus the UMI is the last word of the read name. e.g:
where AATT is the UMI sequence.
If you have used an alternative method which does not separate the read id and UMI with a "_", such as bcl2fastq which uses ":", you can specify the separator with the option --umi-separator=<sep>, replacing <sep> with e.g ":".
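As a minimal sketch of what the separator does, the UMI can be read as the text after the last separator in the read name. The function name and the read names below are hypothetical and chosen for illustration only; this is not the umi_tools implementation.

```python
def umi_from_read_name(read_name, sep="_"):
    """Return the text after the last `sep` in the read name (the UMI)."""
    # Drop anything after the first whitespace, e.g. a FASTQ header comment.
    name = read_name.split()[0]
    if sep not in name:
        raise ValueError(f"separator {sep!r} not found in {name!r}")
    return name.rsplit(sep, 1)[1]


# Hypothetical read names, for illustration only.
print(umi_from_read_name("read1_AATT"))           # AATT
print(umi_from_read_name("read1:AATT", sep=":"))  # AATT
```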
Alternatively, if your UMIs are encoded in a tag, you can specify this by setting the option --extract-umi-method=tag and set the tag name with the --umi-tag option. For example, if your UMIs are encoded in the 'UM' tag, provide the following options: --extract-umi-method=tag --umi-tag=UM
Finally, if you have used umis to extract the UMI +/- cell barcode, you can specify --extract-umi-method=umis
The start position of a read is considered to be the start of its alignment minus any soft clipped bases. A read aligned at position 500 with cigar 2S98M will be assumed to start at position 498.
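The soft-clipping adjustment above can be sketched as follows. This is an illustrative helper (the name unclipped_start is an assumption) that only covers the leading soft clip described here; it does not model how reads on the reverse strand are handled.

```python
import re


def unclipped_start(pos, cigar):
    """Alignment start minus any leading soft-clipped bases."""
    # A leading soft clip appears as "<n>S" at the start of the CIGAR string,
    # optionally preceded by a hard clip "<n>H".
    match = re.match(r"^(?:\d+H)?(\d+)S", cigar)
    clipped = int(match.group(1)) if match else 0
    return pos - clipped


print(unclipped_start(500, "2S98M"))  # 498
print(unclipped_start(500, "100M"))   # 500
```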
What method to use to identify group of reads with the same (or similar) UMI(s)?
All methods start by identifying the reads with the same mapping position.
The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene.
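To illustrate the network construction, the sketch below builds a network of the UMIs seen at one position and splits it into connected components. Two assumptions are made loudly here: edit distance is approximated by Hamming distance (equal-length UMIs, substitutions only), and connected components are used as one simple way to define groups. The method-specific rules of cluster, adjacency and directional are not reproduced.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def build_network(umis, threshold=1):
    """Adjacency map: UMIs are nodes, edges join UMIs within `threshold`."""
    graph = {umi: set() for umi in umis}
    for a, b in combinations(graph, 2):
        if hamming(a, b) <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes reachable from one another (iterative depth-first search)."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        groups.append(component)
    return groups


# Hypothetical UMIs observed at a single mapping position.
umis_at_position = ["AATT", "AATA", "CCGG", "CCGT", "TTTT"]
groups = connected_components(build_network(umis_at_position))
print(sorted(sorted(g) for g in groups))
# [['AATA', 'AATT'], ['CCGG', 'CCGT'], ['TTTT']]
```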
For every group of duplicate reads, a single representative read is retained. The following criteria are applied to select the read that will be retained from a group of duplicated reads:
1. The read with the lowest number of mapping coordinates (see --multimapping-detection-method option)
2. The read with the highest mapping quality. Note that this is not the read sequencing quality and that if two reads have the same mapping quality then one will be picked at random regardless of the read quality.
Otherwise a read is chosen at random.
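The selection criteria above can be sketched as a pair of filters followed by a random choice. The dictionary keys n_coords and mapq and the function name are hypothetical; this is a sketch of the stated rules, not the tool's own code.

```python
import random


def pick_representative(reads, rng=random):
    """Pick one read from a group of duplicates.

    Each read is a dict with hypothetical keys:
      "n_coords": number of mapping coordinates, "mapq": mapping quality.
    """
    # 1. Keep the reads with the lowest number of mapping coordinates.
    fewest = min(r["n_coords"] for r in reads)
    candidates = [r for r in reads if r["n_coords"] == fewest]
    # 2. Among those, keep the reads with the highest mapping quality.
    best = max(r["mapq"] for r in candidates)
    candidates = [r for r in candidates if r["mapq"] == best]
    # Otherwise (still tied), a read is chosen at random.
    return rng.choice(candidates)


group = [
    {"name": "r1", "n_coords": 2, "mapq": 60},
    {"name": "r2", "n_coords": 1, "mapq": 30},
    {"name": "r3", "n_coords": 1, "mapq": 40},
]
print(pick_representative(group)["name"])  # r3
```

Note that r1 loses despite its higher mapping quality, because the number of mapping coordinates is compared first.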
One can use the edit distance between UMIs at the same position as a quality control for the deduplication process by comparing with a null expectation of random sampling. For the random sampling, the observed frequency of UMIs is used to more reasonably model the null expectation.
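A rough sketch of this comparison is given below, under several assumptions that are not taken from the tool: the statistic is the mean pairwise distance between UMIs at a position, edit distance is approximated by Hamming distance, and the null draws the same number of UMIs with replacement, weighted by their observed frequency. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_distance(umis):
    """Average distance over all pairs of UMIs; None if fewer than 2 UMIs."""
    pairs = list(combinations(umis, 2))
    if not pairs:
        return None
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def null_mean_distance(umi_counts, n, rng):
    """Same statistic for n UMIs sampled in proportion to observed frequency."""
    umis = list(umi_counts)
    weights = [umi_counts[u] for u in umis]
    sample = rng.choices(umis, weights=weights, k=n)  # with replacement
    return mean_pairwise_distance(sample)


# Hypothetical UMIs per position.
observed = {"POSa": ["AATT", "AATA"], "POSb": ["CCGG", "TTTT"]}
umi_counts = Counter(u for umis in observed.values() for u in umis)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
for pos, umis in observed.items():
    print(pos, mean_pairwise_distance(umis),
          null_mean_distance(umi_counts, len(umis), rng))
```

Positions whose observed distances are much smaller than the null suggest UMIs that differ only through sequencing or PCR errors.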
Use the option "Output UMI related statistics files?" to generate the stats outfiles.
In addition, this option will trigger reporting of further summary statistics for the UMIs which may be informative for selecting the optimal deduplication method or debugging.
Each unique UMI sequence may be observed [0-many] times at multiple positions in the BAM. The following files report the distribution for the frequencies of each UMI.
The _stats_per_umi_per_position.tsv file simply tabulates the counts for unique combinations of UMI and position. E.g if prior to deduplication, we have two positions in the BAM (POSa, POSb), at POSa we have observed 2*UMIa, 1*UMIb and at POSb: 1*UMIc, 3*UMId, then the stats file is populated thus:
If post deduplication, UMIb is grouped with UMIa such that POSa: 3*UMIa, then the instances_post column is populated thus:
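The tabulation for this example can be reproduced with a short sketch. The printed layout (count, instances before deduplication, instances after deduplication) is an assumption about how the columns are arranged, and the function name is hypothetical.

```python
from collections import Counter


def tabulate_counts(per_position):
    """Map each count value to the number of UMI/position combinations with it."""
    return Counter(
        count
        for umi_counts in per_position.values()
        for count in umi_counts.values()
    )


# The worked example from the text: counts per UMI at each position.
pre = {"POSa": {"UMIa": 2, "UMIb": 1}, "POSb": {"UMIc": 1, "UMId": 3}}
post = {"POSa": {"UMIa": 3}, "POSb": {"UMIc": 1, "UMId": 3}}

pre_table, post_table = tabulate_counts(pre), tabulate_counts(post)
for count in sorted(set(pre_table) | set(post_table)):
    print(count, pre_table[count], post_table[count])
# 1 2 1
# 2 1 0
# 3 1 2
```

After deduplication, the count of 2 disappears because UMIa absorbs UMIb at POSa and is now seen 3 times.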
The _stats_per_umi_per.tsv table provides UMI-level summary statistics. Keeping in mind that each unique UMI sequence can be observed [0-many] times across multiple positions in the BAM, the following statistics are reported:
times_observed: How many positions the UMI was observed at
total_counts: The total number of times the UMI was observed across all positions
median_counts: The median for the distribution of how often the UMI was observed at each position (excluding zeros)
Hence, whenever times_observed=1, total_counts==median_counts.
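As an illustrative sketch, the three statistics can be computed from per-position UMI counts as shown below. The input data and the function name are hypothetical.

```python
from statistics import median


def per_umi_stats(per_position):
    """Return {umi: (times_observed, total_counts, median_counts)}."""
    counts_by_umi = {}
    for umi_counts in per_position.values():
        for umi, count in umi_counts.items():
            if count > 0:  # zeros are excluded from the distribution
                counts_by_umi.setdefault(umi, []).append(count)
    return {
        umi: (len(counts), sum(counts), median(counts))
        for umi, counts in counts_by_umi.items()
    }


# Hypothetical counts: UMIa is seen at two positions, the others at one.
data = {"POSa": {"UMIa": 2, "UMIb": 1}, "POSb": {"UMIa": 4, "UMIc": 1}}
for umi, stats in sorted(per_umi_stats(data).items()):
    print(umi, *stats)
# UMIa 2 6 3.0
# UMIb 1 1 1
# UMIc 1 1 1
```

UMIb and UMIc are each observed at a single position, so their total and median counts coincide.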