Galaxy | Tool Preview

MarkDuplicates (version 3.1.1.0)
REMOVE_DUPLICATES; default=False
ASSUME_SORTED; default=True
DUPLICATE_SCORING_STRATEGY; default=SUM_OF_BASE_QUALITIES
READ_NAME_REGEX; Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. See help below for more info; default='' (uses : separation)
OPTICAL_DUPLICATE_PIXEL_DISTANCE; default=100
Barcode SAM tag. This tag can be utilized when you have data from an assay that includes Unique Molecular Indices. Typically 'RX'
Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

Purpose

Examines aligned records in the supplied SAM or BAM dataset to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged.
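
As an illustration of what "flagged" means downstream: the SAM format reserves FLAG bit 0x400 (decimal 1024) for "PCR or optical duplicate". The sketch below tests that bit on plain-text SAM lines. The two records are made up for the example, and the helper name is hypothetical, not part of Picard or Galaxy.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def is_duplicate(sam_line: str) -> bool:
    """Return True if a SAM alignment line has the duplicate bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second mandatory SAM column
    return bool(flag & DUPLICATE_FLAG)


# Two made-up records; the second has FLAG 99 + 1024 = 1123.
records = [
    "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "read2\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
]
duplicate_names = [r.split("\t")[0] for r in records if is_duplicate(r)]
```

With REMOVE_DUPLICATES left at false, records such as read2 stay in the output and can be filtered later by this bit.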


Dataset collections - processing large numbers of datasets at once

This will be added shortly


Inputs, outputs, and parameters

Either a SAM file or a BAM file must be supplied. Galaxy automatically coordinate-sorts all uploaded BAM files.

From the Picard documentation (http://broadinstitute.github.io/picard/):

COMMENT=String
CO=String                     Comment(s) to include in the output file's header.  This option may be specified 0 or
                              more times.

REMOVE_DUPLICATES=Boolean     If true, do not write duplicates to the output file instead of writing them with
                              appropriate flags set.  Default value: false.

READ_NAME_REGEX=String        This option is only needed if your read names do not follow the standard Illumina convention
                              of colon separation but do contain tile, x, and y coordinates (unusual).
                              A regular expression that can be used to parse read names in the incoming SAM file. Read
                              names are parsed to extract three variables: tile/region, x coordinate and y coordinate.
                              These values are used to estimate the rate of optical duplication in order to give a more
                              accurate estimated library size. Set this option to null to disable optical duplicate
                              detection. The regular expression should contain three capture groups for the three
                              variables, in order. It must match the entire read name. Note that if the default regex
                              is specified, a regex match is not actually done, but instead the read name is split on
                              the colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be
                              tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements
                              are assumed to be tile, x and y values.  Default value: ''
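
The default colon-splitting rule and the custom-regex alternative described above can be sketched in Python as follows. This is a simplified illustration, not Picard's implementation: the function names and the example read names are hypothetical, Python's `re` module stands in for Java regular expressions, and suffixes such as `#0/1` that some read names carry are not handled.

```python
import re
from typing import Optional, Tuple

Location = Tuple[int, int, int]  # (tile, x, y)


def parse_default(name: str) -> Optional[Location]:
    """Split on ':' and take the last three fields of 5- or 7-element names."""
    parts = name.split(":")
    if len(parts) == 5:
        fields = parts[2:5]   # 3rd, 4th and 5th elements
    elif len(parts) == 7:
        fields = parts[4:7]   # 5th, 6th and 7th elements (CASAVA 1.8)
    else:
        return None
    try:
        tile, x, y = (int(f) for f in fields)
    except ValueError:
        return None
    return tile, x, y


def parse_with_regex(name: str, pattern: str) -> Optional[Location]:
    """Use a regex with three capture groups that must match the whole name."""
    match = re.fullmatch(pattern, name)
    if match is None:
        return None
    tile, x, y = (int(g) for g in match.groups())
    return tile, x, y


five = parse_default("MACHINE:1:1101:1500:2000")
seven = parse_default("INSTR:42:FLOWCELL:1:1101:1500:2000")
# Hypothetical underscore-separated naming scheme and a matching pattern.
custom = parse_with_regex("run7_1101_1500_2000", r"[^_]+_(\d+)_(\d+)_(\d+)")
```

All three calls yield the same (tile, x, y) triple, which is what the optical-duplicate estimate needs from each read name.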


DUPLICATE_SCORING_STRATEGY=ScoringStrategy
DS=ScoringStrategy            The scoring strategy for choosing the non-duplicate among candidates.  Default value:
                              SUM_OF_BASE_QUALITIES. Possible values: {SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH}
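
To give a feel for SUM_OF_BASE_QUALITIES, the sketch below scores each candidate by summing its Phred base qualities and keeps the highest-scoring one as the non-duplicate. Several details are assumptions made for illustration: qualities are decoded from a Phred+33 QUAL string, only qualities at or above a cutoff (15 here) are counted, and single reads are scored rather than read pairs. The read names and quality strings are invented.

```python
def sum_of_base_qualities(qual: str, min_quality: int = 15) -> int:
    """Sum Phred+33 qualities, ignoring bases below min_quality (assumed cutoff)."""
    scores = (ord(ch) - 33 for ch in qual)
    return sum(q for q in scores if q >= min_quality)


# Invented candidates that map to the same position.
candidates = {
    "readA": "IIII#",  # four bases at Q40, one at Q2
    "readB": "IIIII",  # five bases at Q40
}

# The candidate with the highest score is kept as the non-duplicate.
best = max(candidates, key=lambda name: sum_of_base_qualities(candidates[name]))
```

TOTAL_MAPPED_REFERENCE_LENGTH would swap the scoring function for one based on aligned length; the selection step stays the same.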

OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer
                              The maximum offset between two duplicate clusters in order to consider them optical
                              duplicates. This should be set to 100 for (circa 2011+) read names and typical flowcells.
                              Structured flow cells (NovaSeq, HiSeq 4000, X) should use ~2500.
                              For older conventions, distances could be set to some fairly small number (e.g. 5-10 pixels).
                              Default value: 100.
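
A minimal sketch of how the pixel distance might be applied to two duplicate clusters follows. It assumes that both clusters must lie on the same tile and that the offset is checked separately on the x and y axes; that reading of "maximum offset" is an assumption, and the function name and coordinates are hypothetical.

```python
from typing import Tuple

Location = Tuple[int, int, int]  # (tile, x, y) parsed from the read name


def is_optical_pair(a: Location, b: Location, pixel_distance: int = 100) -> bool:
    """True if two clusters share a tile and are within pixel_distance on both axes."""
    tile_a, x_a, y_a = a
    tile_b, x_b, y_b = b
    return (
        tile_a == tile_b
        and abs(x_a - x_b) <= pixel_distance
        and abs(y_a - y_b) <= pixel_distance
    )


near = is_optical_pair((1101, 1500, 2000), (1101, 1560, 2050))
far_default = is_optical_pair((1101, 1500, 2000), (1101, 1500, 2300))
far_patterned = is_optical_pair((1101, 1500, 2000), (1101, 1500, 2300), 2500)
other_tile = is_optical_pair((1101, 1500, 2000), (1102, 1500, 2000))
```

The third call shows why the larger value suggested for structured flow cells changes the outcome: a 300-pixel offset is rejected at the default of 100 but accepted at 2500.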

BARCODE_TAG=String            Barcode SAM tag (e.g. BC for 10X Genomics).  Default value: null.

Additional information

Additional information about Picard tools is available from the Picard web site at http://broadinstitute.github.io/picard/.