
What it does
Je MarkDupes: Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules, taking into account the molecular barcodes (Unique Molecular Identifiers, or UMIs) found in the read header. All records are then written to the output file with the duplicate records flagged, or the duplicate records are discarded.
The input file is a BAM file.
Author: Charles Girardot (charles.girardot@embl.de).
Wrapper by: Jelle Scholtalbers (jelle.scholtalbers@embl.de).
Know what you are doing
You will want to read the documentation.
Parameter list
This is an exhaustive list of options:
INPUT=String
I=String

  One or more input SAM or BAM files to analyze. Must be coordinate sorted.

  Default value: null. This option may be specified 0 or more times.

OUTPUT=File
O=File

  The output file to write marked records to

  Required.

MISMATCHES=Integer
MM=Integer

  Number of mismatches (inclusive) up to which two Unique Molecular Identifiers (UMIs)
  are still considered the same, i.e. this option buffers for sequencing errors.
  Without it, a single sequencing error in a UMI would prevent two duplicate reads from
  being recognized as duplicates.
  Note that Ns are not counted as mismatches during comparison, i.e. ATTNGG and NTTANG are
  seen as the same barcode and these two reads would be flagged as duplicates.
  This option takes a single value even when several barcodes are present (see SLOTS).
  Note that when declaring several barcodes (see SLOTS) AND providing a predefined set
  of barcodes (see BC option), the MM value applies to each lookup. When a predefined
  set of barcodes is NOT given, the different barcodes (SLOTS) are concatenated first and
  the MM value is therefore considered *overall*, as the concatenated code is treated as a
  single code.
  MM=null is equivalent to MM=0.
  Use the minimum Hamming distance of the original barcode set (if applicable).

  Required.
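
  The comparison rule described above can be sketched in Python. This is only an
  illustration of the rule as documented here, not Je's actual implementation; the
  function names umi_mismatches and same_umi are made up for this example.

```python
def umi_mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length UMIs.

    Positions where either UMI carries an N are skipped, since the help
    text states that N is not considered a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and "N" not in (x, y)
    )


def same_umi(a: str, b: str, mismatches: int = 0) -> bool:
    """True when the two UMIs differ by at most `mismatches` positions (inclusive)."""
    return umi_mismatches(a, b) <= mismatches


if __name__ == "__main__":
    # The example from the help text: Ns never count as mismatches.
    print(same_umi("ATTNGG", "NTTANG", mismatches=0))
    # One real mismatch: accepted with MM=1, rejected with MM=0.
    print(same_umi("ATTCGG", "ATTAGG", mismatches=1))
    print(same_umi("ATTCGG", "ATTAGG", mismatches=0))
```

  When no predefined barcode list is given, the same check would be applied once to the
  concatenation of all SLOTS values rather than to each barcode separately.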

MAX_NUMBER_OF_N=Integer
MAX_N=Integer

  Maximum number of Ns a molecular code can contain (inclusive). Above this value, reads
  are placed in an UNDEF group.
  More precisely, these 'too degenerate' codes will not:
       * be compared to the list of predefined codes [predefined code list situation, i.e.
  BC option given] nor
       * be considered as a potential independent code [no predefined code list situation,
  i.e. BC option not given]
  Default value is the MISMATCHES number.
  Note that when declaring several barcodes (see SLOTS) AND providing a predefined set
  of barcodes (see BC option), the MAX_N value applies to each barcode. When a predefined
  set of barcodes is NOT given, the different barcodes (SLOTS) are concatenated first and
  the MAX_N value is therefore considered *overall*.

  Default value: null.
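
  A minimal sketch of the MAX_N filter as described above, assuming the check is a simple
  count of N characters. The function name is_too_degenerate is hypothetical and the code
  is not taken from Je itself.

```python
from typing import Optional


def is_too_degenerate(umi: str, mismatches: int, max_n: Optional[int] = None) -> bool:
    """True when the UMI contains more Ns than MAX_N allows.

    MAX_N is inclusive, and falls back to the MISMATCHES value when it is
    not set, as stated in the help text.
    """
    limit = mismatches if max_n is None else max_n
    return umi.upper().count("N") > limit


if __name__ == "__main__":
    # Two Ns with MAX_N left unset and MM=1: the read would go to the UNDEF group.
    print(is_too_degenerate("ANNT", mismatches=1))
    # Two Ns with MAX_N=2: still acceptable because the limit is inclusive.
    print(is_too_degenerate("ANNT", mismatches=1, max_n=2))
```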


SLOTS=Integer
SLOTS=Integer

  Where to find the UMIs (and only the UMIs) in the read name once read name has been
  tokenized using the SPLIT character (e.g. ':').
  By default, the UMI is considered to be found at the end of the read header, i.e. after
  the last ':'. Use this option to indicate other or additional UMI positions (e.g.
  multiple UMIs present in the read header).
  IMPORTANT: counting starts at 1 and negative numbers can be used to start counting from
  the end.
  For example, consider the following read name that lists 3 different barcodes in the end:
    HISEQ:44:C6KC0ANXX:8:2112:20670:79594:CGATGTTT:GATCCTAG:AAGGTACG
  to indicate that the three barcodes are molecular codes, use
    SLOTS=-1 SLOTS=-2 SLOTS=-3
  if only the 2 last ones should be considered (the third one being a sample encoding
  barcode), use
    SLOTS=-1 SLOTS=-2

  Default value: null. This option may be specified 0 or more times.
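
  The slot numbering can be illustrated with a short Python sketch. It only mirrors the
  counting convention described above (1-based, negative values counted from the end);
  the function name extract_umis is hypothetical, and the order in which Je combines the
  extracted codes is not specified here.

```python
from typing import List, Sequence


def extract_umis(read_name: str, slots: Sequence[int] = (-1,), split_char: str = ":") -> List[str]:
    """Return the read-name tokens found at the given SLOTS positions.

    Positive slots count from 1 at the start of the name; negative slots
    count from the end, so -1 is the last token.
    """
    tokens = read_name.split(split_char)
    umis = []
    for slot in slots:
        if slot == 0:
            raise ValueError("slot numbering starts at 1 (or -1 from the end)")
        index = slot - 1 if slot > 0 else slot
        umis.append(tokens[index])
    return umis


if __name__ == "__main__":
    name = "HISEQ:44:C6KC0ANXX:8:2112:20670:79594:CGATGTTT:GATCCTAG:AAGGTACG"
    print(extract_umis(name))                # default: last token only
    print(extract_umis(name, (-1, -2, -3)))  # all three barcodes are UMIs
    print(extract_umis(name, (-1, -2)))      # third-from-last is a sample barcode
```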

BARCODE_FILE=File
BC=File

  Pre-defined list of UMIs that can be expected. Format: one column text file, one barcode
  per line. All UMIs MUST have the same length.

  Default value: null.

TRIM_HEADERS=Boolean
T=Boolean

  Should barcode information be removed from read names in the output BAM?

  Default value: false. This option can be set to 'null' to clear the default value.
  Possible values: {true, false}

TSLOTS=Integer
TSLOTS=Integer

  Where to find *all* barcode(s) (i.e. sample encoding and UMIs) in the read name once it
  has been tokenized using the SPLIT character (e.g. ':').
  This option is only considered when TRIM_HEADERS=true. When TSLOTS is omitted while
  TRIM_HEADERS=true, the values of SLOTS apply.
  IMPORTANT: counting starts at 1 and negative numbers can be used to start counting from
  the end.
  See SLOTS help for examples.

  Default value: null. This option may be specified 0 or more times.
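
  As a rough illustration of what header trimming amounts to, the sketch below drops the
  tokens at the TSLOTS positions and rejoins the remainder. How Je actually rewrites the
  read name is an assumption here, and trim_header is a hypothetical name.

```python
from typing import Sequence


def trim_header(read_name: str, tslots: Sequence[int], split_char: str = ":") -> str:
    """Remove the tokens at the given 1-based (or negative) positions from a read name."""
    tokens = read_name.split(split_char)
    n = len(tokens)
    # Convert every slot to a 0-based index so positive and negative slots
    # that point at the same token are only removed once.
    drop = {(slot - 1) if slot > 0 else n + slot for slot in tslots}
    return split_char.join(tok for i, tok in enumerate(tokens) if i not in drop)


if __name__ == "__main__":
    name = "HISEQ:44:C6KC0ANXX:8:2112:20670:79594:CGATGTTT:GATCCTAG:AAGGTACG"
    # Strip the sample barcode and both UMIs from the name.
    print(trim_header(name, (-1, -2, -3)))
```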

SPLIT_CHAR=String
SPLIT=String

  Character to use to split up the read header line, default is ':'.

  Default value: ':'. This option can be set to 'null' to clear the default value.

METRICS_FILE=File
M=File

  File to write duplication metrics to.

  Required.

COMMENT=String
CO=String

  Comment(s) to include in the output file's header.

  Default value: null. This option may be specified 0 or more times.

REMOVE_DUPLICATES=Boolean

  If true, duplicates are not written to the output file. If false, they are written
  with the appropriate flags set.

  Default value: false. This option can be set to 'null' to clear
  the default value.
  Possible values: {true, false}

ASSUME_SORTED=Boolean
AS=Boolean

  If true, assume that the input file is coordinate sorted even if the header says
  otherwise.

  Default value: false. This option can be set to 'null' to clear the default
  value.
  Possible values: {true, false}

DUPLICATE_SCORING_STRATEGY=ScoringStrategy
DS=ScoringStrategy

  The scoring strategy for choosing the non-duplicate among candidates.

  Default value: SUM_OF_BASE_QUALITIES. This option can be set to 'null' to clear the default value.
  Possible values: {SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH}

READ_NAME_REGEX=String

  Regular expression that can be used to parse read names in the incoming SAM file. Read
  names are parsed to extract three variables: tile/region, x coordinate and y coordinate.
  These values are used to estimate the rate of optical duplication in order to give a more
  accurate estimated library size. Set this option to null to disable optical duplicate
  detection. The regular expression should contain three capture groups for the three
  variables, in order. It must match the entire read name. Note that if the default regex
  is specified, a regex match is not actually done, but instead the read name is split on
  the colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be
  tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements
  are assumed to be tile, x and y values.

  Default value:
  [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to
  clear the default value.
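
  The two parsing routes described above can be sketched in Python. The regular expression
  is the default shown above; the function names and the example read names are made up
  for illustration, and this is not the tool's own code.

```python
import re
from typing import Optional, Tuple

# Default READ_NAME_REGEX from this help page; the three groups are tile, x and y.
DEFAULT_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def parse_with_regex(read_name: str, regex: str = DEFAULT_REGEX) -> Optional[Tuple[int, int, int]]:
    """Extract (tile, x, y) with a regex that must match the entire read name."""
    match = re.fullmatch(regex, read_name)
    if match is None:
        return None
    tile, x, y = match.groups()
    return int(tile), int(x), int(y)


def parse_by_split(read_name: str) -> Optional[Tuple[int, int, int]]:
    """Extract (tile, x, y) by splitting on ':' as described for the default case."""
    fields = read_name.split(":")
    if len(fields) == 5:
        picked = fields[2:5]   # 3rd, 4th and 5th elements
    elif len(fields) == 7:
        picked = fields[4:7]   # 5th, 6th and 7th elements (CASAVA 1.8)
    else:
        return None
    if not all(f.isdigit() for f in picked):
        return None
    tile, x, y = (int(f) for f in picked)
    return tile, x, y


if __name__ == "__main__":
    print(parse_with_regex("HWUSI:6:73:941:1973"))
    print(parse_by_split("HISEQ:44:C6KC0ANXX:8:2112:20670:79594"))
```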

OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer

  The maximum offset between two duplicate clusters in order to consider them optical
  duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels)
  unless using later versions of the Illumina pipeline that multiply pixel values by 10, in
  which case 50-100 is more normal.

  Default value: 100. This option can be set to 'null'
  to clear the default value.
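
  A minimal sketch of how such a pixel-distance test might look, assuming the offset is
  checked independently on the x and y axes and only for clusters on the same tile. Both
  the per-axis comparison and the name is_optical_duplicate are assumptions made for this
  example, not a description of the tool's internals.

```python
from typing import Tuple

Cluster = Tuple[int, int, int]  # (tile, x, y) as parsed from the read name


def is_optical_duplicate(a: Cluster, b: Cluster, max_distance: int = 100) -> bool:
    """True when two clusters sit on the same tile within max_distance pixels on both axes."""
    tile_a, x_a, y_a = a
    tile_b, x_b, y_b = b
    return (
        tile_a == tile_b
        and abs(x_a - x_b) <= max_distance
        and abs(y_a - y_b) <= max_distance
    )


if __name__ == "__main__":
    # Same tile, 40 and 60 pixels apart: within the default distance of 100.
    print(is_optical_duplicate((2112, 20670, 79594), (2112, 20710, 79654)))
    # Different tiles are never treated as optical duplicates in this sketch.
    print(is_optical_duplicate((2112, 20670, 79594), (2113, 20670, 79594)))
```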