view hicCorrectMatrix.xml @ 9:ac80bd0a96ca draft

planemo upload for repository https://github.com/maxplanck-ie/HiCExplorer/tree/master/galaxy/wrapper/ commit eec0a4d5a7c5ba4ec0fbd2ead8280c3d143bb9d8
author iuc
date Fri, 27 Apr 2018 03:29:59 -0400
parents f7d344dacfeb
children bfa1c014f64a
line wrap: on
line source

<tool id="hicexplorer_hiccorrectmatrix" name="@BINARY@" version="@WRAPPER_VERSION@.0">
    <description>run Imakaev's iterative correction over a Hi-C contact matrix.</description>
    <macros>
        <token name="@BINARY@">hicCorrectMatrix</token>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <command detect_errors="exit_code"><![CDATA[

        hicCorrectMatrix
            $mode.mode_selector
            --matrix '$matrix_h5_cooler'

            ## special: --chromosomes is optional, but if given needs at least one argument
            #set chroms = '" "'.join([ str($var.chromosome) for $var in $chromosomes ])
            #if chroms
                --chromosomes "$chroms"
            #end if

            #if $mode.mode_selector == 'correct':

                --iterNum $mode.iterNum
                --outFileName matrix.$mode.outputFormat

                #if $mode.filterThreshold_low and $mode.filterThreshold_large:
                    --filterThreshold $mode.filterThreshold_low $mode.filterThreshold_large
                #end if

                #if $mode.inflationCutoff:
                    --inflationCutoff $mode.inflationCutoff
                #end if

                #if $mode.transCutoff:
                    --transCutoff $mode.transCutoff
                #end if

                #if $mode.sequencedCountCutoff:
                    --sequencedCountCutoff $mode.sequencedCountCutoff
                #end if

                $mode.skipDiagonal
                $mode.perchr

            #elif $mode.mode_selector == 'merge_failed':
                --plotName diagnostic_plot.png
                --outMatrixFile corrected_matrix.npz.h5
                #if $mode.xMax:
                    --xMax $mode.xMax
                #end if
                #if $mode.filterThreshold_low and $mode.filterThreshold_large:
                    --filterThreshold '$mode.filterThreshold_low' '$mode.filterThreshold_large'
                #end if
            #else:
                --plotName diagnostic_plot.png
                #if $mode.xMax:
                    --xMax $mode.xMax
                #end if
            #end if
        #if $mode.mode_selector == 'correct':
            && mv matrix.$mode.outputFormat matrix
        #end if
]]>
    </command>
    <inputs>
        <expand macro='matrix_h5_cooler_macro' />
        <conditional name="mode">
            <param name="mode_selector" type="select" label="Mode">
                <option value="diagnostic_plot">Diagnostic plot</option>
                <option value="correct">Correct matrix</option>
            </param>
            <when value="diagnostic_plot">
                <expand macro="xMax" />
            </when>
            <when value="correct">
                <param argument="--iterNum" name="iterNum" type="integer" optional="true" value="500"
                    label="Number of iterations" />

                <param argument="--inflationCutoff" name="inflationCutoff" type="float" optional="true"
                    label="Inflation cutoff" value=""
                    help="Value corresponding to the maximum number of times a bin can be scaled up during the iterative correction.
                    For example, a inflationCutoff of 3 will filter out all bins that were expanded 3 times or more during the iterative correction."/>

                <param argument="--transCutoff" name="transCutoff" type="float" optional="true"
                    label="Trans region cutoff" value=""
                    help="Clip high counts in the top -transcut trans regions (i.e. between chromosomes). A usual value is 0.05."/>

                <param argument="--sequencedCountCutoff" name="sequencedCountCutoff" optional="true" type="float"
                    label="Sequenced count cutoff"
                    help="Each bin receives a value indicating the fraction that is covered by reads.
                        A cutoff of 0.5 will discard all those bins that have less than half of the bin covered."/>

                <param argument="--skipDiagonal" name="skipDiagonal" type="boolean" truevalue="--skipDiagonal" falsevalue="" checked="false"
                    label="Skip diagonal counts"/>

                <param argument="--perchr" name="perchr" type="boolean" truevalue="--perchr" falsevalue="" checked="false"
                    label="Normalize each chromosome separately" />
                <expand macro="filterThreshold" />
                <param name='outputFormat' type='select' label="Output file format">
                    <option value='h5'>HiCExplorer format</option>
                    <option value="cool">cool</option>
                </param>
            </when>
        </conditional>

        <repeat name="chromosomes" min="0"
            title="Include chromosomes" help="List of chromosomes to be included in the iterative correction.
                    The order of the given chromosomes will be kept for the resulting corrected matrix">
            <param name="chromosome" type="text" value="" label='chromosome (one per field)'>
                <validator type="empty_field" />
            </param>
        </repeat>

    </inputs>
    <outputs>
        <data name="outFileName" from_work_dir="matrix" format="h5">
            <change_format>
                <when input="mode.outputFormat" value="cool" format="cool" />
            </change_format>
            <filter>mode['mode_selector'] == "correct"</filter>

        </data>


        <data name="diagnostic_plot" from_work_dir="diagnostic_plot.png" format="png">
            <filter>mode['mode_selector'] == "diagnostic_plot"</filter>
        </data>
    </outputs>
    <tests>
        <test>
            <param name="matrix_h5_cooler" value="small_test_matrix.h5"/>

            <param name="mode_selector" value="correct"/>
            <repeat name="chromosomes">
                <param name="chromosome" value="chrUextra"/>
            </repeat>
            <repeat name="chromosomes">
                <param name="chromosome" value="chr3LHet"/>
            </repeat>
            <param name='outputFormat' value='h5'/>
            <param name='filterThreshold_low' value='-2.0' />
            <param name='filterThreshold_large' value='4' />
            <output name="outFileName" file="hicCorrectMatrix_result1.npz.h5" ftype="h5" compare="sim_size"/>
        </test>
        <test>
            <param name="matrix_h5_cooler" value="small_test_matrix.h5"/>
            <param name="mode_selector" value="diagnostic_plot"/>
            <repeat name="chromosomes">
                <param name="chromosome" value="chrUextra"/>
            </repeat>
            <repeat name="chromosomes">
                <param name="chromosome" value="chr3LHet"/>
            </repeat>
            <output name="diagnostic_plot" file="diagnostic_plot.png" ftype="png" compare="sim_size"/>
        </test>
    </tests>
    <help><![CDATA[

Hi-C contact matrix correction
==============================

**hicCorrectMatrix** runs Imakaev's iterative correction, described in `Imakaev et al. (2012)`_, over a Hi-C matrix. For the matrix correction to be efficient,
it is important to remove the unassembled scaffolds (e.g. `NT_`), mitochondrial DNA and Y chromosome and keep only full length
chromosomes, as scaffolds create problems with matrix correction. Therefore we use the chromosome names (1-19, X, Y) here.

**Important**: Use ‘chr1 chr2 chr3 etc.’ if your genome index uses chromosome names with the ‘chr’ prefix.

Also, for the method to work correctly, bins with zero reads assigned to them should be removed as they can not be corrected. Also, bins with low number of reads should be removed, otherwise, during the correction step, the counts associated with those bins will be amplified (usually, zero and low coverage bins tend contain repetitive regions). Bins with extremely high number of reads can also be removed from the correction as they may represent copy number variations.

To aid in the identification of bins with low and high read coverage, the ``diagnostic plot`` function of **hicCorrectMatrix** must be used.

Indeed, **hicCorrectMatrix** works in two steps:

  - **Diagnostic plot**: First a histogram containing the sum of contact per bin (row sum) is produced. This plot needs to be inspected to decide the best threshold for removing bins with lower number of reads.

  - **Correct**: The second step removes the bins outside of the defined thresholds and perfroms the iterative correction.

_________________

Usage
-----

This tool must be used on uncorrected matrices at restriction enzyme resolution or with merged bins (``hicMergeMatrixBins``).

_________________

Output
------

Diagnostic plot
_______________

The diagnostic plot consists of a bar plot of the contacts coverage per bins size together with the
modified z-score based on the Median Absolute Deviation (MAD) method.

See Boris Iglewicz and David Hoaglin 1993, Volume 16:
How to Detect and Handle Outliers The ASQC Basic References in Quality Control: Statistical Techniques,
Edward F. Mykytka, Ph.D., Editor.

Using this diagnostic plot, a user can decide if values
with a too low (and/or too high) number of contacts in respect to their genomic distance should
be removed from the data before the correction applies.

Moreover, the shown distribution should be a Gaussian bell. If it doesn’t follow a Gaussian distribution
this is an indicator that the used data is of bad quality or that the used contact matrix
is maybe not the one that should be used. It can happen that users select for example a merge
matrix with a lower resolution that was previously needed for plotting. In such cases the
diagnostic plot helps to detect this and prevent the user from running the analysis on a wrong dataset.


.. image:: $PATH_TO_IMAGES/diagnostic_plot.png
    :width: 50%

On the example plot above, a user can then use the lower threshold defined by the MAD method (black bold bar), or define its own threshold based on the contacts distribution.

Correct
_______

Run the iterative correction and outputs the corrected matrix. This matrix can then be used with all downstream analysis tools such as ``hicPlotMatrix``, ``hicPlotTADs``, ``hicPlotViewpoint``, ``hicAggregateContacts`` for **visualization of Hi-C data**, ``hicCorrelate``, ``hicPlotDistVsCounts``, ``hicTransform``, ``hicFindTADs``, ``hicPCA`` **for data and scores computation on Hi-C data**.

It is noteworthy that ``hicSumMatrices`` and ``hicMergeMatrixBins`` **must be performed on uncorrected matrices**.

_________________

| For more information about HiCExplorer please consider our documentation on readthedocs.io_

.. _readthedocs.io: http://hicexplorer.readthedocs.io/en/latest/index.html
.. _`Imakaev et al. (2012)`: http://doi.org/doi:10.1038/nmeth.2148
]]></help>
    <expand macro="citations" />
</tool>