Galaxy | Tool Preview

UMI-tools extract (version 1.1.2+galaxy2)
Use this option to specify the format of the UMI/barcode. Use Ns to represent the random (UMI) positions and Xs to indicate the sample barcode positions. Bases marked with N will be extracted and added to the read name. The remaining bases, marked with X, will be reattached to the read.
If bracketed expressions are used in the above barcode pattern, then set this to 'regex'. Otherwise leave as 'string'.
By default the barcode is assumed to be on the 5' end of the read, but use this option to specify that it is on the 3' end instead. This option only works with ``--extract-method=string``, since 3' encoding can be specified explicitly with a regex, e.g. ``.*(?P<umi_1>.{5})$``
This only applies if your barcode file has two columns, as output by the umi_tools whitelist command.
Choose whether to generate a text file containing logging information.

extract - Extract UMI from fastq

Extract the UMI barcode from a read and add it to the read name, leaving any sample barcode in place.

Can deal with paired-end reads and UMIs split across the paired ends. Can also optionally extract cell barcodes and append these to the read name. See the section below for an explanation of how to encode the barcode pattern(s) to specify the position of the UMI +/- cell barcode.

Filtering and correcting cell barcodes

umi_tools extract can optionally filter cell barcodes against a user-supplied whitelist (--whitelist). If a whitelist is not available for your data, e.g. if you have performed droplet-based scRNA-Seq, you can generate one with the whitelist tool.

Cell barcodes which do not match the whitelist (user-generated or automatically generated) can also be optionally corrected using the --error-correct-cell option.

The whitelist should be in the following format (tab-separated):

AAAAAA    AGAAAA
AAAATC
AAACAT
AAACTA    AAACTN,GAACTA
AAATAC
AAATCA    GAATCA
AAATGT    AAAGGT,CAATGT

Where column 1 is the whitelisted cell barcodes and column 2 is the list (comma-separated) of other cell barcodes which should be corrected to the barcode in column 1. If the --error-correct-cell option is not used, this column will be ignored. Any additional columns in the whitelist input, such as the counts columns from the output of umi_tools whitelist, will be ignored.
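
As an illustration of the whitelist format described above, the following is a minimal Python sketch of how such a file could be read and applied. It is not the UMI-tools implementation; the function names parse_whitelist and resolve_barcode are hypothetical, and the handling of blank lines is an assumption.

```python
import io


def parse_whitelist(handle):
    """Read a tab-separated whitelist (hypothetical helper).

    Returns (whitelist, corrections):
      whitelist   - set of accepted cell barcodes (column 1)
      corrections - dict mapping each column 2 barcode to its column 1 barcode
    """
    whitelist = set()
    corrections = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # assumption: blank lines are skipped
        barcode = fields[0]
        whitelist.add(barcode)
        # Column 2 is optional; any further columns (e.g. counts) are ignored.
        if len(fields) > 1 and fields[1]:
            for variant in fields[1].split(","):
                corrections[variant] = barcode
    return whitelist, corrections


def resolve_barcode(barcode, whitelist, corrections, error_correct=False):
    """Return the accepted barcode, or None if the read would be filtered out."""
    if barcode in whitelist:
        return barcode
    if error_correct:
        # Column 2 is only consulted when error correction is requested.
        return corrections.get(barcode)
    return None


example = io.StringIO(
    "AAAAAA\tAGAAAA\n"
    "AAAATC\n"
    "AAACTA\tAAACTN,GAACTA\n"
)
whitelist, corrections = parse_whitelist(example)
print(resolve_barcode("GAACTA", whitelist, corrections, error_correct=True))
print(resolve_barcode("GAACTA", whitelist, corrections))
```

With error correction enabled, GAACTA resolves to AAACTA; without it, column 2 is ignored and the barcode is rejected.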

Two methods are available for extracting the UMI barcode (+/- cell barcode). For both methods, the patterns should be provided using the --bc-pattern and --bc-pattern2 options.
  • string

    This should be used where the barcodes are always in the same place in the read.

    • N = UMI position (required)
    • C = cell barcode position (optional)
    • X = sample position (optional)

    Bases with Ns and Cs will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read. Bases with an X will be reattached to the read.

    E.g. if the pattern is NNNNCC, then the read:

    @HISEQ:87:00000000 read1
    AAGGTTGCTGATTGGATGGGCTAG
    +
    DA1AEBFGGCG01DFH00B1FF0B
    

    will become:

    @HISEQ:87:00000000_TT_AAGG read1
    GCTGATTGGATGGGCTAG
    +
    FGGCG01DFH00B1FF0B
    

    where 'TT' is the cell barcode and 'AAGG' is the UMI.

  • regex

    This method allows for more flexible barcode extraction and should be used where the cell barcodes are variable in length. Alternatively, the regex option can also be used to filter out reads which do not contain an expected adapter sequence. UMI-tools uses the regex module rather than the more standard re module, since the former also enables fuzzy matching.

    The regex must contain groups to define how the barcodes are encoded in the read. The expected groups in the regex are:

    • umi_n = UMI positions, where n can be any value (required)
    • cell_n = cell barcode positions, where n can be any value (optional)
    • discard_n = positions to discard, where n can be any value (optional)

    UMI positions and cell barcode positions will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read.

    Discard bases and the corresponding quality scores will be removed from the read. All bases matched by other groups or components of the regex will be reattached to the read sequence.

    For example, the following regex can be used to extract reads from the Klein et al inDrop data:

    (?P<cell_1>.{8,12})(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)(?P<cell_2>.{8})(?P<umi_1>.{6})T{3}.*
    

    Only reads with a 3' T-tail and GAGTGATTGCTTGTGACGCCTT in the correct position will be retained, yielding two cell barcodes of 8-12 bp and 8 bp respectively, and a 6 bp UMI.

    You can also specify fuzzy matching to allow errors. For example, if the discard group above were specified as below, matches with up to 2 substitutions in the discard_1 group would be allowed.

    (?P<discard_1>GAGTGATTGCTTGTGACGCCTT){s<=2}
    

    Note that all UMIs must be the same length for downstream processing with the dedup, group or count commands.
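
The two extraction methods described above can be sketched in Python as follows. This is a simplified illustration, not the UMI-tools code: the function names extract_string and extract_regex are hypothetical, the sketch handles a single 5' read only, and it uses the standard re module, so fuzzy matching such as {s<=2} (which needs the third-party regex module) is not covered. Concatenating groups in sorted name order is also an assumption.

```python
import re


def extract_string(seq, qual, pattern):
    """Fixed-position extraction: N = UMI, C = cell barcode, X = kept base."""
    if len(seq) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    umi, cell, kept_seq, kept_qual = [], [], [], []
    for base, q, code in zip(seq, qual, pattern):
        if code == "N":
            umi.append(base)
        elif code == "C":
            cell.append(base)
        elif code == "X":
            # X bases (and their qualities) are reattached to the read.
            kept_seq.append(base)
            kept_qual.append(q)
        else:
            raise ValueError(f"unexpected pattern character: {code!r}")
    n = len(pattern)
    return ("".join(umi), "".join(cell),
            "".join(kept_seq) + seq[n:], "".join(kept_qual) + qual[n:])


def extract_regex(seq, qual, compiled):
    """Named-group extraction; returns None if the read does not match."""
    match = compiled.match(seq)
    if match is None:
        return None
    umi, cell = "", ""
    drop = set()
    # Assumption: groups are concatenated in sorted name order.
    for name in sorted(match.groupdict()):
        start, end = match.span(name)
        if start == -1:
            continue  # group did not take part in the match
        if name.startswith("umi_"):
            umi += match.group(name)
        elif name.startswith("cell_"):
            cell += match.group(name)
        elif not name.startswith("discard_"):
            continue  # any other named group stays in the read
        # umi_, cell_ and discard_ positions are removed from the read.
        drop.update(range(start, end))
    keep = [i for i in range(len(seq)) if i not in drop]
    return (umi, cell,
            "".join(seq[i] for i in keep), "".join(qual[i] for i in keep))


# String method, using the read from the example above.
umi, cell, seq, qual = extract_string(
    "AAGGTTGCTGATTGGATGGGCTAG", "DA1AEBFGGCG01DFH00B1FF0B", "NNNNCC")
print(f"@HISEQ:87:00000000_{cell}_{umi} read1")
print(seq)
print(qual)

# Regex method, using the inDrop pattern on a made-up read.
INDROP = re.compile(
    r"(?P<cell_1>.{8,12})(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)"
    r"(?P<cell_2>.{8})(?P<umi_1>.{6})T{3}.*"
)
read = "ACGTACGT" + "GAGTGATTGCTTGTGACGCCTT" + "TTGGCCAA" + "ACGTAC" + "TTT" + "GGGCCC"
print(extract_regex(read, "I" * len(read), INDROP))
```

In the regex case, the bases matched by T{3} and .* are not in a umi_, cell_ or discard_ group, so they remain in the read.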