comparison hicBuildMatrix.xml @ 9:495ae38f6e0d draft

planemo upload for repository https://github.com/maxplanck-ie/HiCExplorer/tree/master/galaxy/wrapper/ commit eec0a4d5a7c5ba4ec0fbd2ead8280c3d143bb9d8
author iuc
date Fri, 27 Apr 2018 03:35:56 -0400
parents 707f691c974c
children 8bf84c4cb1cb
8:707f691c974c 9:495ae38f6e0d
1 <tool id="hicexplorer_hicbuildmatrix" name="@BINARY@" version="@WRAPPER_VERSION@.0"> 1 <tool id="hicexplorer_hicbuildmatrix" name="@BINARY@" version="@WRAPPER_VERSION@.0">
2 <description>creates a contact matrix</description> 2 <description>create a contact matrix</description>
3 <macros> 3 <macros>
4 <token name="@BINARY@">hicBuildMatrix</token> 4 <token name="@BINARY@">hicBuildMatrix</token>
5 <import>macros.xml</import> 5 <import>macros.xml</import>
6 </macros> 6 </macros>
7 <expand macro="requirements" > 7 <expand macro="requirements" >
36 #if $region: 36 #if $region:
37 --region '$region' 37 --region '$region'
38 #end if 38 #end if
39 39
40 --outFileName matrix.$outputFormat 40 --outFileName matrix.$outputFormat
41 41
42 --outBam ./unsorted.bam 42 #if $outBam_Boolean:
43 $outBam_Boolean ./unsorted.bam
44 #end if
43 45
44 $keepSelfCircles 46 $keepSelfCircles
45 47
46 #if $minMappingQuality and $minMappingQuality is not None: 48 #if $minMappingQuality and $minMappingQuality is not None:
47 --minMappingQuality $minMappingQuality 49 --minMappingQuality $minMappingQuality
50 #if $danglingSequence: 52 #if $danglingSequence:
51 --danglingSequence '$danglingSequence' 53 --danglingSequence '$danglingSequence'
52 #end if 54 #end if
53 55
54 --threads @THREADS@ 56 --threads @THREADS@
55 57
56 --QCfolder ./QCfolder 58 --QCfolder ./QCfolder
57 && 59 &&
58 mv ./QCfolder/* $qc.files_path/ 60 mv ./QCfolder/* $qc.files_path/
59 && 61 &&
60 mv $qc.files_path/hicQC.html $qc 62 mv $qc.files_path/hicQC.html $qc
61 && mv $qc.files_path/*.log raw_qc 63 && mv $qc.files_path/*.log raw_qc
62 && mv matrix.$outputFormat matrix 64 && mv matrix.$outputFormat matrix
63 && samtools sort ./unsorted.bam -o sorted.bam 65 #if $outBam_Boolean:
64 66 && samtools sort ./unsorted.bam -o sorted.bam
67 #end if
68
65 ]]> 69 ]]>
66 </command> 70 </command>
67 <inputs> 71 <inputs>
68 <repeat max="2" min="2" name="samFiles" title="Sam/Bam files to process"> 72 <repeat max="2" min="2" name="samFiles" title="Sam/Bam files to process (forward/reverse)"
69 <param name="samFile" type="data" format="sam,bam"/> 73 help="Please use the special BAM datatype: qname_input_sorted.bam and use for 'bowtie2' the '--reorder' option to create a BAM file.">
74 <param name="samFile" type="data" format="sam,qname_input_sorted.bam"/>
70 </repeat> 75 </repeat>
71 <conditional name="restrictionCutFileBinSize_conditional"> 76 <conditional name="restrictionCutFileBinSize_conditional">
72 <param name="restrictionCutFileBinSize_selector" type="select" label="Choose to use a restriction cut file or a bin size"> 77 <param name="restrictionCutFileBinSize_selector" type="select" label="Choose to use a restriction cut file or a bin size">
73 <option value="optionRestrictionCutFile">Restriction cut file</option> 78 <option value="optionRestrictionCutFile">Restriction cut file</option>
74 <option value="optionBinSize" selected="True">Bin size</option> 79 <option value="optionBinSize" selected="True">Bin size</option>
104 label="Keep self circles" 109 label="Keep self circles"
105 help="If set, outward facing reads without any restriction fragment (self circles) are kept. They will be counted and shown in the QC plots." /> 110 help="If set, outward facing reads without any restriction fragment (self circles) are kept. They will be counted and shown in the QC plots." />
106 111
107 <expand macro="minMappingQuality" /> 112 <expand macro="minMappingQuality" />
108 113
109 <param argument="--danglingSequence" type="text" optional="true" label="The dangling sequence" 114 <param argument="--danglingSequence" type="text" optional="true" label="Dangling sequence"
110 help="Dangling end sequence left by the restriction enzyme. For DpnII for example, 115 help="Sequence left by the restriction enzyme after cutting.
111 the dangling end is the same restriction sequence. This is used 116 Each restriction enzyme recognizes a different DNA sequence and,
112 to discard reads that end/start with such sequence 117 after cutting, they leave behind a specific ‘sticky’ end or dangling end sequence.
113 and that are considered un-ligated fragments or 118 For example, for HindIII the restriction site is AAGCTT and the dangling end is AGCT.
114 'dangling-ends'. If not given, such statistics will 119 For DpnII, the restriction site and dangling end sequence are the same: GATC.
115 not be available."/> 120 This information is easily found on the description of the restriction enzyme.
121 The dangling sequence is used to classify and report reads whose 5’ end starts with such sequence as dangling-end reads.
122 A significant portion of dangling-end reads in a sample is indicative of a problem with the re-ligation step of the protocol. "/>
123
124 <param name='outBam_Boolean' type='boolean' truevalue='--outBam' falsevalue="" checked="false" label="Save valid Hi-C reads in BAM file"
125 help="A bam
126 file containing all valid Hi-C reads can be created
127 using this option. This bam file could be useful to
128 inspect the distribution of valid Hi-C read pairs or
129 for other downstream analyses, but is not used by any
130 HiCExplorer tool. Computation will be significantly
131 longer if this option is set."/>
132
116 <param name='outputFormat' type='select' label="Output file format"> 133 <param name='outputFormat' type='select' label="Output file format">
117 <option value='h5'>HiCExplorer format</option> 134 <option value='h5'>HiCExplorer format</option>
118 <option value="cool">cool</option> 135 <option value="cool">cool</option>
119 </param> 136 </param>
120 </inputs> 137 </inputs>
121 <outputs> 138 <outputs>
122 <data name="outBam" from_work_dir="sorted.bam" format="bam" label="${tool.name} BAM file on ${on_string}"/> 139 <data name="outBam" from_work_dir="sorted.bam" format="bam" label="${tool.name} BAM file on ${on_string}">
140 <filter>outBam_Boolean</filter>
141 </data>
123 <data name="outFileName" from_work_dir="matrix" format="h5" label="${tool.name} MATRIX on ${on_string}"> 142 <data name="outFileName" from_work_dir="matrix" format="h5" label="${tool.name} MATRIX on ${on_string}">
124 <change_format> 143 <change_format>
125 <when input="outputFormat" value="cool" format="cool" /> 144 <when input="outputFormat" value="cool" format="cool" />
126 </change_format> 145 </change_format>
127 </data> 146 </data>
128 <data name="qc" format="html" label="${tool.name} QC"/> 147 <data name="qc" format="html" label="${tool.name} QC on ${on_string}"/>
129 148
130 <data name="raw_qc" from_work_dir='raw_qc' format='txt' label="${tool.name} raw QC" /> 149 <data name="raw_qc" from_work_dir='raw_qc' format='txt' label="${tool.name} raw QC on ${on_string}" />
131 </outputs> 150 </outputs>
132 <tests> 151 <tests>
133 <test> 152 <test>
134 <repeat name="samFiles"> 153 <repeat name="samFiles">
135 <param name="samFile" value="small_test_R1_unsorted.sam"/> 154 <param name="samFile" value="small_test_R1_unsorted.sam"/>
140 <conditional name="restrictionCutFileBinSize_conditional"> 159 <conditional name="restrictionCutFileBinSize_conditional">
141 <param name="restrictionCutFileBinSize_selector" value="optionBinSize"/> 160 <param name="restrictionCutFileBinSize_selector" value="optionBinSize"/>
142 <param name="binSize" value="5000"/> 161 <param name="binSize" value="5000"/>
143 </conditional> 162 </conditional>
144 <param name='outputFormat' value='h5'/> 163 <param name='outputFormat' value='h5'/>
145 164 <param name='outBam_Boolean' value="True" />
146 <output name="outBam" file="small_test_matrix_result_sorted.bam" ftype="bam"/> 165 <output name="outBam" file="small_test_matrix_result_sorted.bam" ftype="bam"/>
147 <output name="outFileName" file="small_test_matrix_2.h5" ftype="h5" compare="sim_size"/> 166 <output name="outFileName" file="small_test_matrix_2.h5" ftype="h5" compare="sim_size"/>
148 <output name="raw_qc" file='raw_qc_report' compare='diff' lines_diff='2'/> 167 <output name="raw_qc" file='raw_qc_report' compare='diff' lines_diff='2'/>
149 </test> 168 </test>
150 </tests> 169 </tests>
151 <help><![CDATA[ 170 <help><![CDATA[
152 171
153 Creation of the contact matrix 172 Creation of the contact matrix
154 =============================== 173 ===============================
155 174
156 ``hicBuildMatrix`` creates a contact matrix based on Hi-C read pairs. It requires two sam or bam files 175
157 corresponding to the first and second mates of the paired-end H-C reads. The sam and bam files should 176 **hicBuildMatrix** creates a contact matrix based on Hi-C read pairs. It requires two sam or bam files
158 not be sorted by position. There are two main options to create the Hi-C contact matrix, either by 177 corresponding to the first and second mates of the paired-end Hi-C reads mapped on the reference genome.
159 fixed bin size (eg. 10.000 bp) or by bins of variable restriction fragment size length. 178 The sam and bam files should not be sorted by position. There are two main options to create the Hi-C contact matrix,
160 ``hicBuildMatrix`` generates a quality control output that can be used to analyze the quality of the Hi-C reads. 179 either by fixed bin size (eg. 10000 bp) or by bins of variable length following restriction enzyme sites location in the genome (restriction enzyme resolution).
161 180 **hicBuildMatrix** generates a quality control output that can be used to analyze the quality of the Hi-C reads to assess if the experiment and sequencing were successful.
162 Input 181
182 _________________
183
184
185 Usage
163 ----- 186 -----
164 187
165 `hicBuildMatrix` is having the following parameters: 188
166 189 This tool must be used on paired sam / bam files produced with a program that supports local alignment (e.g. Bowtie2) where both PE reads are mapped using the --local option.
167 Parameters 190
168 __________ 191 _________________
169
170
171 - two input BAM/SAM files
172 - a bin size
173 - a restriction cut file as an alternative to the bin size
174 - restriction sequence: e.g. HindIII: GATC
175
176 192
177 193
178 Output 194 Output
179 ------ 195 ------
180 196
181 `hicBuildMatrix` creates as an output: 197 **hicBuildMatrix** creates multiple outputs:
182 - the contact matrix 198
183 - a bam file with the accepted alignments 199 - The contact matrix used by HiCExplorer for all downstream analyses.
184 - a quality report. 200 - A bam file with the accepted alignments, which can be useful to inspect the distribution of valid Hi-C read pairs, notably around restriction enzyme sites, or for other downstream analyses. This file is not used by any HiCExplorer tool.
201 - A quality control report to assess if the Hi-C protocol and library construction were successful.
185 202
186 Example plot 203 Example plot
187 ----------------------------------------------------------------- 204 ++++++++++++
188 .. image:: $PATH_TO_IMAGES/SRR027956.svg 205
189 :width: 70% 206 .. image:: hicPlotMatrix.png
190 207 :width: 50%
191 Contact matrix created with `hicPlotMatrix`. 208
209 Contact matrix of *Drosophila melanogaster* embryos built with **hicBuildMatrix** and visualized using ``hicPlotMatrix``. Hi-C matrix bins were merged to a 25 kb bin size before plotting using ``hicMergeMatrixBins``.
210
211
212
192 213
193 Quality report 214 Quality report
194 -------------- 215 ++++++++++++++
195 216
196 The quality report gives you information about: 217 A quality report is produced alongside the contact matrix.
197 218
198 - how many pairs were used to build the contact matrix 219 .. image:: $PATH_TO_IMAGES/hicQC.png
199 - dangling end pairs: These are reads that start with the restriction site and constitute reads that were digested but no ligated. 220 :width: 40%
200 - same fragment pairs: These are read mates, facing inward, separated by up to 800 bp that do not have a restriction enzyme in between. These read pairs are not valid Hi-C pairs. 221
201 - self circles: Self circles are defined as pairs within 25kb with 'outward' read orientation 222 Several plots, that are described in details below, are comprised inside this report.
202 - self ligations: These are read pairs with a restriction site in between that are within 800 bp. 223
203 224 .. image:: $PATH_TO_IMAGES/hicQC_pairs_sequenced.png
204 Contact distance: 225 :width: 40%
226
227 On the plot above, we can see how many reads were sequenced per sample (pairs considered), how many reads were mappable, unique and of high quality and how many reads passed all quality controls and are thus useful for further analysis (pairs used). All quality controls used for read filtering are explained below.
228
229 .. image:: $PATH_TO_IMAGES/hicQC_unmappable_and_non_unique.png
230 :width: 40%
231
232 The figure above contains the fraction of reads with respect to the total number of reads that did not map, that have a low quality score or that didn't map uniquely to the genome. In our example we can see that Sample 3 has the highest fraction of pairs used. We explain the differences between the three samples on the plot below.
233
234 .. image:: $PATH_TO_IMAGES/hicQC_pairs_discarded.png
235 :width: 40%
236
237 This figure contains the fraction of read pairs (with respect to mappable and unique reads) that were discarded when building the Hi-C matrix. You can find the description of each category below:
238
239 - **Dangling ends:** reads that start with the restriction site and constitute reads that were digested but not ligated. Sample 1 in our example has a high fraction of dangling ends (and thus a low proportion of pairs used). Reasons for this can be inefficient ligation or insufficient removal of dangling ends during sample preparation.
240
241 - **Duplicated pairs:** reads that have the same sequence due to PCR amplification. For example, Sample 2 was amplified too much and thus has a very high fraction of duplicated pairs.
242
243 - **Same fragment:** read mates facing inward, separated by up to 800bp that do not have a restriction enzyme site in between. These read pairs are not valid Hi-C pairs and are thus discarded from further analyses.
244
245 - **Self circle:** read pairs within 25kb with 'outward' read orientation.
246
247 - **Self ligation:** read pairs with a restriction site in between that are within 800bp.
248
249 .. image:: $PATH_TO_IMAGES/hicQC_distance.png
250 :width: 40%
251
252 The figure above shows the fraction of read pairs (with respect to mappable reads) that form inter chromosomal, short range (< 20kb) or long range contacts. The fraction of inter chromosomal reads in a wild-type sample is expected to be low. Trans-chromosomal contacts can be primarily considered as random ligation events. These would be expected to contribute to technical noise that may obscure some of the finer features in the Hi-C datasets (Nagano *et al.* 2015, Comparison of Hi-C results using in-solution versus in-nucleus ligation, doi: https://doi.org/10.1186/s13059-015-0753-7). As such, a high fraction of inter chromosomal reads is an indicator of low sample quality, but it can also be associated with cell cycle changes (Nagano *et al.* 2017, Cell-cycle dynamics of chromosomal organisation at single-cell resolution, doi: https://doi.org/10.1038/nature23001).
253
254 The proportions of short range and long range contacts can be associated with how the fixation is performed during Hi-C sample preparation. These two proportions also directly impact the Hi-C corrected counts versus genomic distance plots generated by hicPlotDistVsCounts.
255
256 .. image:: $PATH_TO_IMAGES/hicQC_read_orientation.png
257 :width: 40%
258
259 The last figure shows the fractions of inward, outward, left or right read pairs (with respect to mappable reads). Deviations from an equal distribution indicate problems during sample preparation.
260
205 _________________ 261 _________________
206 - inter chromosomal
207 - short range < 20 kb
208 - long range
209
210 Read orientation:
211 _________________
212 - inward pairs
213 - outward pairs
214 - left pairs
215 - right pairs
216
217 .. image:: $PATH_TO_IMAGES/hicQC.png
218 :width: 70 %
219
220 262
221 | For more information about HiCExplorer please consider our documentation on readthedocs.io_. 263 | For more information about HiCExplorer please consider our documentation on readthedocs.io_.
222 264
223 .. _readthedocs.io: http://hicexplorer.readthedocs.io/en/latest/index.html 265 .. _readthedocs.io: http://hicexplorer.readthedocs.io/en/latest/index.html
224
225
226 ]]></help> 266 ]]></help>
227 <expand macro="citations" /> 267 <expand macro="citations" />
228 </tool> 268 </tool>
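
The main behavioural change in this revision is that the BAM output becomes optional: the template emits `--outBam ./unsorted.bam` and the trailing `samtools sort` step only when `outBam_Boolean` is checked. The Python sketch below mirrors that conditional logic so it is easier to follow outside the Cheetah template. It is an illustration only, not the wrapper's code: the function `build_command` and its parameters are hypothetical names, and the command is reduced to the options relevant to this change (input files, bin size and the QC file moves are omitted).

```python
import shlex


def build_command(output_format, out_bam, threads=1):
    """Assemble a simplified shell command string (hypothetical helper).

    Mirrors the revised template: --outBam and the samtools sort step
    appear only when out_bam is true.
    """
    parts = ["hicBuildMatrix", "--outFileName", f"matrix.{output_format}"]
    if out_bam:
        # Corresponds to: #if $outBam_Boolean ... #end if
        parts += ["--outBam", "./unsorted.bam"]
    parts += ["--threads", str(threads), "--QCfolder", "./QCfolder"]

    # Quote each argument, then chain the follow-up steps with &&.
    steps = [" ".join(shlex.quote(p) for p in parts)]
    steps.append(f"mv matrix.{output_format} matrix")
    if out_bam:
        # Sorting is skipped when no BAM file was requested.
        steps.append("samtools sort ./unsorted.bam -o sorted.bam")
    return " && ".join(steps)


if __name__ == "__main__":
    print(build_command("h5", out_bam=False))
    print(build_command("cool", out_bam=True, threads=4))
```

In the previous revision both the `--outBam` argument and the sort step were unconditional, which is why the `outBam` output in this revision also needs the `<filter>outBam_Boolean</filter>` element.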