
UMI-tools whitelist (version 1.1.2+galaxy2)
Use this option to specify the format of the UMI/barcode. Use Ns to represent the random positions and Xs to indicate the bc positions. Bases with Ns will be extracted and added to the read name. Remaining bases, marked with an X will be reattached to the read
If bracketed expressions are used in the above barcode pattern, then set this to 'regex'. Otherwise leave as 'string'
By default the barcode is assumed to be on the 5' end of the read, but use this option to specify that it is on the 3' end instead. This option only works with ``--extract-method=string`` since 3' encoding can be specified explicitly with a regex, e.g. ``.*(?P<umi_1>.{5})$``
Many published protocols rank CBs by the number of reads the CBs appear in. However, you could also use the number of unique UMIs a CB is associated with. Note that this is still an approximation to the number of transcripts captured, because the same UMI could be associated with two different transcripts and be counted as independent.
Will still output the plots if requested
which may be sequence errors from another CB
Choose if you want to generate a text file containing logging information

UMI-tools whitelist - Extract barcodes from fastq

Purpose

Extract cell barcodes and identify the most likely true barcodes using the 'knee' method.

There are two methods enabled to extract the UMI barcode (+/- cell barcode). For both methods, the patterns should be provided using the --bc-pattern and --bc-pattern2 options.
  • string

    This should be used where the barcodes are always in the same place in the read.

    • N = UMI position (required)
    • C = cell barcode position (optional)
    • X = sample position (optional)

    Bases with Ns and Cs will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read. Bases with an X will be reattached to the read.

    E.g. if the pattern is NNNNCC, then the read:

    @HISEQ:87:00000000 read1
    AAGGTTGCTGATTGGATGGGCTAG
    +
    DA1AEBFGGCG01DFH00B1FF0B
    

    will become:

    @HISEQ:87:00000000_TT_AAGG read1
    GCTGATTGGATGGGCTAG
    +
    FGGCG01DFH00B1FF0B
    

    where 'TT' is the cell barcode and 'AAGG' is the UMI.
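
The string method described above can be sketched in a few lines of Python. This is a minimal illustration, not the UMI-tools implementation: the function name `extract_with_string_pattern` is hypothetical, and it assumes the barcode is on the 5' end of the read and that the cell barcode is placed before the UMI in the read name, as in the example.

```python
def extract_with_string_pattern(sequence, quality, pattern):
    """Split a read using a string pattern (N = UMI, C = cell barcode, X = reattached).

    Hypothetical helper for illustration; assumes a 5' barcode.
    """
    if len(sequence) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")

    umi, cell, kept_seq, kept_qual = [], [], [], []
    for base, qual, symbol in zip(sequence, quality, pattern):
        if symbol == "N":
            umi.append(base)        # UMI base: moved to the read name
        elif symbol == "C":
            cell.append(base)       # cell barcode base: moved to the read name
        elif symbol == "X":
            kept_seq.append(base)   # reattached to the read with its quality
            kept_qual.append(qual)
        else:
            raise ValueError(f"unexpected pattern character: {symbol!r}")

    n = len(pattern)
    new_sequence = "".join(kept_seq) + sequence[n:]
    new_quality = "".join(kept_qual) + quality[n:]
    return "".join(umi), "".join(cell), new_sequence, new_quality


umi, cell, seq, qual = extract_with_string_pattern(
    "AAGGTTGCTGATTGGATGGGCTAG", "DA1AEBFGGCG01DFH00B1FF0B", "NNNNCC"
)
read_name = f"@HISEQ:87:00000000_{cell}_{umi} read1"
print(read_name)
print(seq)
print(qual)
```

The sequence and quality strings returned always have the same length, because each extracted base is dropped together with its quality score.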

  • regex

    This method allows for more flexible barcode extraction and should be used where the cell barcodes are variable in length. Alternatively, the regex option can also be used to filter out reads which do not contain an expected adapter sequence. UMI-tools uses the regex module rather than the more standard re module, since the former also enables fuzzy matching.

    The regex must contain groups to define how the barcodes are encoded in the read. The expected groups in the regex are:

    • umi_n = UMI positions, where n can be any value (required)
    • cell_n = cell barcode positions, where n can be any value (optional)
    • discard_n = positions to discard, where n can be any value (optional)

    UMI positions and cell barcode positions will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read.

    Discard bases and the corresponding quality scores will be removed from the read. All bases matched by other groups or components of the regex will be reattached to the read sequence.

    For example, the following regex can be used to extract reads from the Klein et al inDrop data:

    (?P<cell_1>.{8,12})(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)(?P<cell_2>.{8})(?P<umi_1>.{6})T{3}.*
    

    With this regex, only reads that have a 3' T-tail and GAGTGATTGCTTGTGACGCCTT in the correct position are retained, yielding two cell barcodes of 8-12 bp and 8 bp respectively, and a 6 bp UMI.

    You can also specify fuzzy matching to allow errors. For example if the discard group above was specified as below this would enable matches with up to 2 errors in the discard_1 group.

    (?P<discard_1>GAGTGATTGCTTGTGACGCCTT){s<=2}
    

    Note that all UMIs must be the same length for downstream processing with the dedup, group or count commands.
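
To see how the named groups behave, the inDrop regex above can be tried on a made-up read. This is a rough sketch only: it uses the standard re module, which handles the named groups but does not support fuzzy syntax such as {s<=2} (that requires the third-party regex module). The read sequence and the helper name `extract_barcodes` are hypothetical, and joining the cell_n groups in numeric order is an assumption made for illustration.

```python
import re

# inDrop pattern from the text: two cell barcode parts, an adapter to discard,
# a 6 bp UMI and a T-tail.
INDROP_PATTERN = re.compile(
    r"(?P<cell_1>.{8,12})"
    r"(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)"
    r"(?P<cell_2>.{8})"
    r"(?P<umi_1>.{6})"
    r"T{3}.*"
)


def extract_barcodes(sequence):
    """Return (cell_barcode, umi) for a matching read, or None if it does not match."""
    match = INDROP_PATTERN.match(sequence)
    if match is None:
        return None  # read lacks the expected adapter/T-tail and would be filtered out
    groups = match.groupdict()
    # Assumption: group parts are concatenated in order of their names.
    cell = "".join(groups[name] for name in sorted(groups) if name.startswith("cell_"))
    umi = "".join(groups[name] for name in sorted(groups) if name.startswith("umi_"))
    return cell, umi


# Hypothetical read: 10 bp cell_1 + adapter + 8 bp cell_2 + 6 bp UMI + T-tail + cDNA
read = "ACGTACGTAC" + "GAGTGATTGCTTGTGACGCCTT" + "TTGGCCAA" + "CATGCA" + "TTT" + "GGGAAACCC"
print(extract_barcodes(read))
print(extract_barcodes("ACGT" * 10))  # no adapter present
```

The second call returns None, which mirrors the use of a regex to filter out reads that lack the expected adapter sequence.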

Output:

The whitelist is written as 4 tab-separated columns:

  1. whitelisted cell barcode
  2. Other cell barcode(s) (comma-separated) to correct to the whitelisted barcode
  3. Count for whitelisted cell barcodes
  4. Count(s) for the other cell barcode(s) (comma-separated)

example output:

AAAAAA    AGAAAA           146    1
AAAATC                     22
AAACAT                     21
AAACTA    AAACTN,GAACTA    27     1,1
AAATAC                     72
AAATCA    GAATCA           37     3
AAATGT    AAAGGT,CAATGT    41     1,1
AAATTG    CAATTG           36     1
AACAAT                     18
AACATA                     24

If --error-correct-threshold is set to 0, columns 2 and 4 will be empty.
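
As an illustration of how the four columns can be consumed downstream, the sketch below reads a whitelist and builds a lookup from each error barcode to its whitelisted barcode. The function name `load_whitelist` is hypothetical, and the input is a small in-memory excerpt of the example output above rather than a real file.

```python
import csv
import io


def load_whitelist(handle):
    """Parse whitelist rows into (counts per whitelisted barcode, correction map)."""
    counts = {}
    corrections = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        row += [""] * (4 - len(row))  # tolerate rows with missing trailing columns
        barcode, others, count, _other_counts = row[:4]
        counts[barcode] = int(count)
        # Column 2 is empty when no other barcodes are corrected to this one
        for other in filter(None, others.split(",")):
            corrections[other] = barcode
    return counts, corrections


example = io.StringIO(
    "AAAAAA\tAGAAAA\t146\t1\n"
    "AAAATC\t\t22\t\n"
    "AAACTA\tAAACTN,GAACTA\t27\t1,1\n"
)
counts, corrections = load_whitelist(example)
print(counts)
print(corrections)
```

With a real whitelist, the same function can be passed an open file handle instead of the io.StringIO object.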