This tool is designed to work only with library preparation methods where fragmentation occurs after amplification, as in most single cell RNA-Seq methods (e.g. 10x, inDrop, Drop-seq, SCRB-seq and CEL-seq2). Since the precise mapping co-ordinate is no longer informative for such library preparations, it is simplified to the gene. This is a reasonable approach provided that the number of available UMIs is sufficiently high, and the sequencing depth sufficiently low, that the probability of two reads from the same gene having the same UMI is acceptably low.
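To get a feel for when that probability is "acceptably low", the following minimal sketch treats it as a birthday problem. It assumes UMIs are drawn uniformly from all 4^L sequences of length L, which real libraries only approximate. The function name and parameters are illustrative and are not part of umi_tools.

```python
def umi_collision_probability(umi_length: int, n_molecules: int) -> float:
    """Probability that at least two of n_molecules from one gene share a UMI.

    Assumption (not from umi_tools): each molecule receives a UMI drawn
    uniformly at random from the 4 ** umi_length possible sequences.
    """
    n_umis = 4 ** umi_length
    if n_molecules > n_umis:
        # More molecules than UMIs: a collision is certain.
        return 1.0
    p_all_distinct = 1.0
    for i in range(n_molecules):
        # The (i + 1)-th molecule must avoid the i UMIs already used.
        p_all_distinct *= (n_umis - i) / n_umis
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    # Compare a short and a longer UMI for the same number of molecules.
    for length in (6, 10):
        p = umi_collision_probability(length, 100)
        print(f"UMI length {length}, 100 molecules: P(collision) = {p:.4f}")
```

Under this simplified model, longer UMIs or fewer molecules per gene lower the collision probability, which is the condition the paragraph above describes.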
If you want to count reads per gene for library preparations which fragment prior to amplification (e.g. bulk RNA-Seq), please use umi_tools dedup to remove the duplicate reads, as this will use the full information from the mapping co-ordinate. Then use a read counting tool such as featureCounts or HTSeq to count the reads per gene.
In the rare case of bulk RNA-Seq using a library preparation method with fragmentation after amplification, one can still use count but note that it has not been tested on bulk RNA-Seq.
This tool deviates from group and dedup in that the --per-gene option is hardcoded on.
It is assumed that the FASTQ files were processed with umi_tools extract before mapping and thus the UMI is the last word of the read name, e.g.:
@HISEQ:87:00000000_AATT
where AATT is the UMI sequence.
If you have used an alternative method which does not separate the read id and UMI with a "_", such as bcl2fastq which uses ":", you can specify the separator with the option --umi-separator=<sep>, replacing <sep> with e.g. ":".
Alternatively, if your UMIs are encoded in a tag, you can specify this by setting the option --extract-umi-method=tag and set the tag name with the --umi-tag option. For example, if your UMIs are encoded in the 'UM' tag, provide the following options: --extract-umi-method=tag --umi-tag=UM
Finally, if you have used umis to extract the UMI +/- cell barcode, you can specify --extract-umi-method=umis
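For the default read-name case described above, the following sketch shows how a UMI could be recovered as the last word of the read name for a given separator. It is an illustration only and not the umi_tools implementation; the function name umi_from_read_name is hypothetical.

```python
def umi_from_read_name(read_name: str, separator: str = "_") -> str:
    """Return the text after the last separator in a read name.

    Hypothetical helper for illustration; mirrors the idea behind
    --umi-separator but is not taken from umi_tools.
    """
    # Drop a leading FASTQ "@" and anything after the first whitespace.
    name = read_name.lstrip("@").split()[0]
    head, sep, umi = name.rpartition(separator)
    if not sep:
        raise ValueError(f"separator {separator!r} not found in {read_name!r}")
    return umi


if __name__ == "__main__":
    print(umi_from_read_name("@HISEQ:87:00000000_AATT"))
    print(umi_from_read_name("@HISEQ:87:00000000:AATT", separator=":"))
```

Splitting on the last occurrence of the separator matters when the separator (such as ":") also appears elsewhere in the read id.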
The start position of a read is considered to be the start of its alignment minus any soft-clipped bases. A read aligned at position 500 with CIGAR 2S98M will be assumed to start at position 498.
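The arithmetic in that example can be sketched as follows. This covers only the case given in the text, a soft clip at the beginning of the CIGAR string; strand handling is not addressed here, and the function name clipped_start is hypothetical rather than a umi_tools function.

```python
import re


def clipped_start(alignment_start: int, cigar: str) -> int:
    """Alignment start minus the number of leading soft-clipped bases.

    Illustrative only: looks at a soft clip (S) at the beginning of the
    CIGAR string, optionally preceded by a hard clip (H).
    """
    match = re.match(r"(?:\d+H)?(\d+)S", cigar)
    soft_clipped = int(match.group(1)) if match else 0
    return alignment_start - soft_clipped


if __name__ == "__main__":
    print(clipped_start(500, "2S98M"))  # the example from the text
    print(clipped_start(500, "100M"))   # no soft clipping
```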
What method to use to identify groups of reads with the same (or similar) UMI(s)?
All methods start by identifying the reads with the same mapping position.
The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene.
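The network construction described above can be sketched as follows. This is a simplified illustration under stated assumptions: Hamming distance stands in for edit distance (assuming all UMIs have the same length), and groups are taken to be the connected components of the network, which is only one simple way to derive groups. The method-specific rules of cluster, adjacency and directional are not reproduced here, and all function names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def build_network(umis, threshold=1):
    """Adjacency sets: nodes are UMIs, edges join UMIs within threshold."""
    adjacency = {umi: set() for umi in umis}
    for a, b in combinations(adjacency, 2):
        if len(a) == len(b) and hamming(a, b) <= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Group UMIs that are linked directly or through intermediate UMIs."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


if __name__ == "__main__":
    umis = ["AAAA", "AAAT", "AATT", "CCCC"]
    groups = connected_components(build_network(umis))
    # Under this simplified grouping, each group adds one to the gene's count.
    print(len(groups), [sorted(g) for g in groups])
```

In this toy input, AAAA and AATT differ at two positions but fall into the same group because both are within one mismatch of AAAT. How such chains are resolved is where the network-based methods differ.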