diff hicCorrectMatrix.xml @ 9:ac80bd0a96ca draft

planemo upload for repository https://github.com/maxplanck-ie/HiCExplorer/tree/master/galaxy/wrapper/ commit eec0a4d5a7c5ba4ec0fbd2ead8280c3d143bb9d8
author iuc
date Fri, 27 Apr 2018 03:29:59 -0400
parents f7d344dacfeb
children bfa1c014f64a
--- a/hicCorrectMatrix.xml	Wed Mar 07 03:24:24 2018 -0500
+++ b/hicCorrectMatrix.xml	Fri Apr 27 03:29:59 2018 -0400
@@ -1,5 +1,5 @@
 <tool id="hicexplorer_hiccorrectmatrix" name="@BINARY@" version="@WRAPPER_VERSION@.0">
-    <description>Runs Dekker's iterative correction over a hic matrix.</description>
+    <description>run Imakaev's iterative correction over a Hi-C contact matrix.</description>
     <macros>
         <token name="@BINARY@">hicCorrectMatrix</token>
         <import>macros.xml</import>
@@ -64,7 +64,7 @@
     <inputs>
         <expand macro='matrix_h5_cooler_macro' />
         <conditional name="mode">
-            <param name="mode_selector" type="select" label="Range restriction (in bp)" argument="--range">
+            <param name="mode_selector" type="select" label="Mode">
                 <option value="diagnostic_plot">Diagnostic plot</option>
                 <option value="correct">Correct matrix</option>
             </param>
@@ -105,7 +105,7 @@
         <repeat name="chromosomes" min="0"
             title="Include chromosomes" help="List of chromosomes to be included in the iterative correction.
                     The order of the given chromosomes will be kept for the resulting corrected matrix">
-            <param name="chromosome" type="text" value="" >
+            <param name="chromosome" type="text" value="" label='chromosome (one per field)'>
                 <validator type="empty_field" />
             </param>
         </repeat>
@@ -117,10 +117,10 @@
                 <when input="mode.outputFormat" value="cool" format="cool" />
             </change_format>
             <filter>mode['mode_selector'] == "correct"</filter>
-            
+
         </data>
-        
-       
+
+
         <data name="diagnostic_plot" from_work_dir="diagnostic_plot.png" format="png">
             <filter>mode['mode_selector'] == "diagnostic_plot"</filter>
         </data>
@@ -128,16 +128,17 @@
     <tests>
         <test>
             <param name="matrix_h5_cooler" value="small_test_matrix.h5"/>
-            
+
             <param name="mode_selector" value="correct"/>
             <repeat name="chromosomes">
                 <param name="chromosome" value="chrUextra"/>
             </repeat>
             <repeat name="chromosomes">
                 <param name="chromosome" value="chr3LHet"/>
-            </repeat> 
+            </repeat>
             <param name='outputFormat' value='h5'/>
-
+            <param name='filterThreshold_low' value='-2.0' />
+            <param name='filterThreshold_large' value='4' />
             <output name="outFileName" file="hicCorrectMatrix_result1.npz.h5" ftype="h5" compare="sim_size"/>
         </test>
         <test>
@@ -148,79 +149,82 @@
             </repeat>
             <repeat name="chromosomes">
                 <param name="chromosome" value="chr3LHet"/>
-            </repeat> 
+            </repeat>
             <output name="diagnostic_plot" file="diagnostic_plot.png" ftype="png" compare="sim_size"/>
         </test>
     </tests>
     <help><![CDATA[
 
-Matrix correction
-==================
+Hi-C contact matrix correction
+==============================
 
-``hicCorrectMatrix`` runs Dekker's iterative correction over a Hi-C matrix (`Imakaev 2012`_.). For correcting the matrix, 
-it is important to remove the unassembled scaffolds (e.g. `NT_`), mitochondrial DNA and Y chromosome and keep only 
-chromosomes, as scaffolds create problems with matrix correction. Therefore 
-we use the chromosome names (1-19, X, Y) here. 
- 
+**hicCorrectMatrix** runs Imakaev's iterative correction, described in `Imakaev et al. (2012)`_, over a Hi-C matrix. For the matrix correction to be efficient,
+it is important to remove the unassembled scaffolds (e.g. `NT_`), mitochondrial DNA and the Y chromosome, and keep only full-length
+chromosomes, as scaffolds create problems with matrix correction. Therefore we use the chromosome names (1-19, X, Y) here.
+
 **Important**: Use ‘chr1 chr2 chr3 etc.’ if your genome index uses chromosome names with the ‘chr’ prefix.
 
-Matrix correction works in two steps: first a histogram containing the sum of contact  per bin (row sum) is produced. This plot needs to be inspected to decide the best threshold for removing bins with lower number of reads. The second steps removes the low scoring bins and does the correction.
+For the method to work correctly, bins with zero reads assigned to them should be removed, as they cannot be corrected. Bins with a low number of reads should also be removed; otherwise, during the correction step, the counts associated with those bins will be amplified (zero and low coverage bins usually tend to contain repetitive regions). Bins with an extremely high number of reads can also be removed from the correction, as they may represent copy number variations.
+
+To aid in the identification of bins with low and high read coverage, the ``diagnostic plot`` function of **hicCorrectMatrix** must be used.
+
+Indeed, **hicCorrectMatrix** works in two steps:
 
-Input
+  - **Diagnostic plot**: First, a histogram containing the sum of contacts per bin (row sum) is produced. This plot needs to be inspected to decide the best threshold for removing bins with a low number of reads.
+
+  - **Correct**: The second step removes the bins outside of the defined thresholds and performs the iterative correction.
+
+_________________
+
+Usage
 -----
 
-
-Diagnostic plot
-~~~~~~~~~~~~~~~~
-Plots a histogram of the coverage per bin together with the
-modified z-score based on the median absolute deviation
-method.
-
-See Boris Iglewicz and David Hoaglin 1993, Volume 16: 
-How to Detect and Handle Outliers The ASQC Basic References in Quality Control: Statistical Techniques,
-Edward F. Mykytka, Ph.D., Editor.
-
-Parameters
-__________
-- the contact matrix
-- Max value for the x-axis in counts per bin
-- include chromosomes
+This tool must be used on uncorrected matrices at restriction enzyme resolution or with merged bins (``hicMergeMatrixBins``).
 
-
-Correct
-~~~~~~~
-
-Run the iterative correction. 
-
-Parameters
-__________
-- number of iterations
-- inflation cutoff
-- trans region cutoff
-- sequenced count cutoff
-- skip diagonal counts
-- normalize each chromosome separately
-- remove bins of low coverage
-- remove bins of large coverage
-- include chromosomes
+_________________
 
 Output
 ------
 
-    Diagnostic plot:
+Diagnostic plot
+_______________
+
+The diagnostic plot consists of a bar plot of the contact coverage per bin together with the
+modified z-score based on the Median Absolute Deviation (MAD) method.
+
+See Boris Iglewicz and David Hoaglin 1993, Volume 16:
+How to Detect and Handle Outliers The ASQC Basic References in Quality Control: Statistical Techniques,
+Edward F. Mykytka, Ph.D., Editor.
+
+Using this diagnostic plot, a user can decide whether values
+with too low (and/or too high) a number of contacts with respect to their genomic distance should
+be removed from the data before the correction is applied.
 
-    .. image:: $PATH_TO_IMAGES/diagnostic_plot.png
-        :width: 70%
-    
-    Correct:
-    - the corrected contact matrix
+Moreover, the distribution shown should be bell-shaped (Gaussian). If it does not follow a Gaussian distribution,
+this indicates that the data is of poor quality or that the selected contact matrix
+is not the one that should be used. For example, users may select a merged
+matrix with a lower resolution that was previously needed for plotting. In such cases the
+diagnostic plot helps to detect this and prevents the user from running the analysis on the wrong dataset.
+
+
+.. image:: $PATH_TO_IMAGES/diagnostic_plot.png
+    :width: 50%
+
+On the example plot above, a user can then use the lower threshold defined by the MAD method (black bold bar), or define their own threshold based on the contact distribution.
+
+Correct
+_______
+
+Runs the iterative correction and outputs the corrected matrix. This matrix can then be used with all downstream analysis tools, such as ``hicPlotMatrix``, ``hicPlotTADs``, ``hicPlotViewpoint`` and ``hicAggregateContacts`` for **visualization of Hi-C data**, and ``hicCorrelate``, ``hicPlotDistVsCounts``, ``hicTransform``, ``hicFindTADs`` and ``hicPCA`` for **data and score computation on Hi-C data**.
+
+It is noteworthy that ``hicSumMatrices`` and ``hicMergeMatrixBins`` **must be performed on uncorrected matrices**.
+
+_________________
 
 | For more information about HiCExplorer please consider our documentation on readthedocs.io_
 
 .. _readthedocs.io: http://hicexplorer.readthedocs.io/en/latest/index.html
-
-.. _`Imakaev 2012`: http://doi.org/doi:10.1038/nmeth.2148
+.. _`Imakaev et al. (2012)`: http://doi.org/doi:10.1038/nmeth.2148
 ]]></help>
     <expand macro="citations" />
 </tool>
-
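
Note on the help text added in this changeset: it describes the diagnostic plot as showing the modified z-score based on the Median Absolute Deviation (Iglewicz and Hoaglin), and the updated test passes ``filterThreshold_low`` = -2.0 and ``filterThreshold_large`` = 4. The following is a minimal sketch of how such a score and a threshold filter could be computed on per-bin row sums. It is not HiCExplorer's implementation: the function names ``modified_z_scores`` and ``bins_to_keep`` are hypothetical, the input is a plain list of row sums rather than a matrix file, and treating the two thresholds as bounds on the modified z-score is an assumption made for illustration.

```python
from statistics import median


def modified_z_scores(row_sums):
    """Modified z-score (Iglewicz and Hoaglin): 0.6745 * (x - median) / MAD.

    MAD is the median of the absolute deviations from the median.
    """
    med = median(row_sums)
    mad = median(abs(x - med) for x in row_sums)
    if mad == 0:
        # More than half of the values equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified z-score is undefined")
    return [0.6745 * (x - med) / mad for x in row_sums]


def bins_to_keep(row_sums, low=-2.0, high=4.0):
    """Return indices of bins that are non-empty and inside [low, high].

    The default bounds mirror the values used in the test case of this
    changeset; interpreting them as z-score bounds is an assumption.
    """
    scores = modified_z_scores(row_sums)
    return [
        i
        for i, (x, z) in enumerate(zip(row_sums, scores))
        if x > 0 and low <= z <= high
    ]


if __name__ == "__main__":
    # Hypothetical row sums: one empty bin and one very high coverage bin.
    example = [0, 10, 11, 12, 13, 14, 100]
    print(bins_to_keep(example))
```

In this sketch the empty bin and the extreme high-coverage bin both fall outside the bounds, which matches the help text's advice to drop zero, low and extremely high coverage bins before running the correction.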