Extract cell barcodes and identify the most likely true barcodes using the 'knee' method.
There are two methods for extracting the UMI barcode (with or without a cell barcode). For both methods, the patterns should be provided using the --bc-pattern and --bc-pattern2 options.
- string
This should be used where the barcodes are always in the same place in the read.
- N = UMI position (required)
- C = cell barcode position (optional)
- X = sample position (optional)
Bases at N and C positions will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read. Bases at X positions will be reattached to the read.
E.g. if the pattern is NNNNCC, then the read:

    @HISEQ:87:00000000 read1
    AAGGTTGCTGATTGGATGGGCTAG
    +
    DA1AEBFGGCG01DFH00B1FF0B

will become:

    @HISEQ:87:00000000_TT_AAGG read1
    GCTGATTGGATGGGCTAG
    +
    FGGCG01DFH00B1FF0B

where 'TT' is the cell barcode and 'AAGG' is the UMI.
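The string method can be illustrated with a minimal Python sketch applied to the NNNNCC example. The function name extract_by_pattern is hypothetical and is not part of UMI-tools; the sketch also assumes that the pattern starts at the first base of the read and that X bases are placed in front of the remaining read.

```python
def extract_by_pattern(seq, qual, pattern):
    """Split a read using a string pattern: N = UMI, C = cell barcode, X = keep.

    Hypothetical helper for illustration only, not the UMI-tools implementation.
    """
    if len(seq) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    umi, cell, kept_seq, kept_qual = [], [], [], []
    for base, q, code in zip(seq, qual, pattern):
        if code == "N":
            umi.append(base)        # UMI base: moved to the read name
        elif code == "C":
            cell.append(base)       # cell barcode base: moved to the read name
        elif code == "X":
            kept_seq.append(base)   # X base: stays in the read with its quality
            kept_qual.append(q)
        else:
            raise ValueError(f"unexpected pattern character: {code!r}")
    n = len(pattern)
    # Bases beyond the pattern are untouched and follow any retained X bases.
    new_seq = "".join(kept_seq) + seq[n:]
    new_qual = "".join(kept_qual) + qual[n:]
    return "".join(umi), "".join(cell), new_seq, new_qual


umi, cell, new_seq, new_qual = extract_by_pattern(
    "AAGGTTGCTGATTGGATGGGCTAG", "DA1AEBFGGCG01DFH00B1FF0B", "NNNNCC"
)
# The cell barcode and UMI are appended to the read identifier.
new_name = f"@HISEQ:87:00000000_{cell}_{umi} read1"
```

After extraction the sequence and quality strings have the same length, because one quality character is removed for every base removed.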
- regex
This method allows for more flexible barcode extraction and should be used where the cell barcodes are variable in length. The regex option can also be used to filter out reads which do not contain an expected adapter sequence. UMI-tools uses the regex module rather than the more standard re module, since the former also enables fuzzy matching.
The regex must contain groups to define how the barcodes are encoded in the read. The expected groups in the regex are:
- umi_n = UMI positions, where n can be any value (required)
- cell_n = cell barcode positions, where n can be any value (optional)
- discard_n = positions to discard, where n can be any value (optional)
UMI positions and cell barcode positions will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read.
Discard bases and the corresponding quality scores will be removed from the read. All bases matched by other groups or components of the regex will be reattached to the read sequence.
For example, the following regex can be used to extract barcodes from reads in the Klein et al inDrop data:

    (?P<cell_1>.{8,12})(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)(?P<cell_2>.{8})(?P<umi_1>.{6})T{3}.*

With this regex, only reads with a 3' T-tail and GAGTGATTGCTTGTGACGCCTT in the correct position will be retained, yielding two cell barcodes of 8-12 bp and 8 bp respectively, and a 6 bp UMI.
You can also specify fuzzy matching to allow errors. For example, if the discard group above were specified as below, this would enable matches with up to 2 substitution errors in the discard_1 group:

    (?P<discard_1>GAGTGATTGCTTGTGACGCCTT){s<=2}

Note that all UMIs must be the same length for downstream processing with the dedup, group or count commands.
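The named-group extraction described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the UMI-tools implementation: it uses the standard re module, so fuzzy matching such as {s<=2} is not available; the helper name extract_by_regex and the test read are made up; and barcode groups are concatenated in the order they appear in the regex, which is an assumption of this sketch.

```python
import re

# inDrop-style pattern from the text (exact matching only with the re module).
PATTERN = re.compile(
    r"(?P<cell_1>.{8,12})"
    r"(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)"
    r"(?P<cell_2>.{8})"
    r"(?P<umi_1>.{6})"
    r"T{3}.*"
)


def extract_by_regex(seq, pattern=PATTERN):
    """Return (umi, cell, remaining_sequence), or None if the read does not match.

    Hypothetical helper for illustration only.
    """
    match = pattern.match(seq)
    if match is None:
        return None  # reads without the expected structure are filtered out
    umi, cell, removed = [], [], set()
    # Walk the named groups in the order they appear in the regex.
    for name, index in sorted(pattern.groupindex.items(), key=lambda kv: kv[1]):
        start, end = match.span(index)
        if name.startswith("umi_"):
            umi.append(seq[start:end])
        elif name.startswith("cell_"):
            cell.append(seq[start:end])
        elif not name.startswith("discard_"):
            continue  # any other named group stays in the read
        removed.update(range(start, end))
    # Everything not captured by umi_, cell_ or discard_ groups is reattached.
    remaining = "".join(b for i, b in enumerate(seq) if i not in removed)
    return "".join(umi), "".join(cell), remaining


# Made-up read: 10 bp cell barcode, adapter, 8 bp cell barcode, 6 bp UMI, T-tail.
read = ("ACGTACGTAC" + "GAGTGATTGCTTGTGACGCCTT" + "TTGGCCAA"
        + "ACGTTG" + "TTT" + "GGGCCCAAA")
result = extract_by_regex(read)
no_adapter = extract_by_regex("ACGT" * 15)
```

Quality strings would be handled in the same way, by dropping the characters at the removed positions.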
The whitelist is output as 4 tab-separated columns:
- Whitelisted cell barcode
- Other cell barcode(s) (comma-separated) to correct to the whitelisted barcode
- Count for whitelisted cell barcodes
- Count(s) for the other cell barcode(s) (comma-separated)
Example output:

    AAAAAA    AGAAAA           146    1
    AAAATC                     22
    AAACAT                     21
    AAACTA    AAACTN,GAACTA    27     1,1
    AAATAC                     72
    AAATCA    GAATCA           37     3
    AAATGT    AAAGGT,CAATGT    41     1,1
    AAATTG    CAATTG           36     1
    AACAAT                     18
    AACATA                     24
If --error-correct-threshold is set to 0, columns 2 and 4 will be empty.
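As a sketch of how the four-column whitelist might be consumed downstream, the following Python reads tab-separated lines into a count table and a barcode correction map. The function name parse_whitelist is hypothetical, and the sketch assumes that empty columns are written as empty tab-separated fields.

```python
def parse_whitelist(lines):
    """Parse whitelist lines into (counts, corrections).

    counts maps each whitelisted barcode to its count; corrections maps every
    observed barcode (whitelisted or error) to its whitelisted barcode.
    Hypothetical helper for illustration only.
    """
    counts, corrections = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        fields += [""] * (4 - len(fields))  # tolerate missing trailing columns
        barcode, others, count, _other_counts = fields[:4]
        counts[barcode] = int(count)
        corrections[barcode] = barcode
        # Column 2 is empty when there is nothing to correct to this barcode.
        for other in filter(None, others.split(",")):
            corrections[other] = barcode
    return counts, corrections


example = [
    "AAAAAA\tAGAAAA\t146\t1\n",
    "AAAATC\t\t22\t\n",
    "AAACTA\tAAACTN,GAACTA\t27\t1,1\n",
]
counts, corrections = parse_whitelist(example)
```

A cell barcode that is absent from corrections is neither whitelisted nor within the error-correction threshold of a whitelisted barcode.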