comparison computeGCBias.xml @ 1:e74853730716 draft

planemo upload for repository https://github.com/fidelram/deepTools/tree/master/galaxy/wrapper/ commit fef8b344925620444d93d8159c0b2731a5777920
author bgruening
date Mon, 15 Feb 2016 10:29:47 -0500
parents 4409903dcb88
children 12a3082cf023
0:4409903dcb88 1:e74853730716
1 <tool id="deeptools_compute_gc_bias" name="computeGCBias" version="@WRAPPER_VERSION@.0"> 1 <tool id="deeptools_compute_gc_bias" name="computeGCBias" version="@WRAPPER_VERSION@.0">
2 <description>to see whether your samples should be normalized for GC bias</description> 2 <description>Determine the GC bias of your sequenced reads</description>
3 <macros> 3 <macros>
4 <token name="@BINARY@">computeGCBias</token> 4 <token name="@BINARY@">computeGCBias</token>
5 <import>deepTools_macros.xml</import> 5 <import>deepTools_macros.xml</import>
6 </macros> 6 </macros>
7 <expand macro="requirements" /> 7 <expand macro="requirements" />
124 <output name="outFileName" file="computeGCBias_result1.tabular" ftype="tabular" /> 124 <output name="outFileName" file="computeGCBias_result1.tabular" ftype="tabular" />
125 </test> 125 </test>
126 </tests> 126 </tests>
127 <help> 127 <help>
128 <![CDATA[ 128 <![CDATA[
129 **What it does** 129 What it does
130 ------------------
130 131
131 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details). 132 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details).
132 The output is used to plot the bias and can also be used later on to correct the bias with the tool correctGCbias. 133 The output is used to plot the results and can also be used later on to correct the bias with the tool ``correctGCbias``.
133 There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot 134 There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot depicting the ratio of observed/expected reads per GC-content bin.
134 depicting the ratio of observed/expected reads per GC-content bin.
135 135
136 ----- 136 Output files
137 --------------
137 138
138 **Summary of the method used** 139 - Diagnostic plots:
140 - box plot of absolute read numbers per GC-content bin
141 - x-y plot of observed/expected read ratios per GC-content bin
139 142
140 In order to estimate how many reads with what kind of GC content one should have sequenced, we first need to determine how many regions the 143 - Tabular file: to be used for GC correction with ``correctGCbias``
141 reference genome contains with each percentage of GC content, i.e. how many regions in the genome have 50% GC (or 10% GC or 90% GC or...). 144
142 We then sample a large number of equally sized genomic bins and count how many times we see a bin with 50% GC (or 10% GC or 90% or...). These EXPECTED values are independent of any 145 .. image:: $PATH_TO_IMAGES/computeGCBias_output.png
143 sequencing bias and is purely dependent on the underlying genome (i.e. it will most likely vary between mouse and fruit fly due to their genome's different GC contents). 146 :width: 600
144 The OBSERVED values are based on the reads from the sequenced sample. Instead of noting how many genomic regions there are per GC content, we now count the reads per GC content. 147 :height: 455
145 In an ideal sample without GC bias, the ratio of OBSERVED/EXPECTED values should be close to 1 regardless of the GC content. Due to PCR (over)amplifications, the majority of ChIP samples 148
146 usually shows a significant bias towards reads with high GC content (>50%) 149 ---------------------------------------------
150
151 Background
152 -------------
153
154 ``computeGCBias`` is based on a paper by `Benjamini and Speed <http://nar.oxfordjournals.org/content/40/10/e72>`_.
155 The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition.
156 In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference.
157
158 ``computeGCBias`` will first calculate the **expected GC profile** by counting the number of DNA fragments of a fixed size per GC fraction, where GC fraction is defined as the number of G's or C's in a genome region of a given length divided by that length.
159 The result is basically a histogram depicting the frequency of DNA fragments for each type of genome region with a GC fraction between 0 and 100 percent. This will be different for each reference genome, but is independent of the actual sequencing experiment.
160
161 The profile of the expected DNA fragment distribution is then compared to the **observed GC profile**, which is generated by counting the number of sequenced reads per GC fraction.
162
163 In an ideal experiment, the observed GC profile would, of course, look like the expected profile.
164 This is indeed the case when applying ``computeGCBias`` to simulated reads.
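
The comparison of expected and observed GC profiles described above can be sketched in a few lines of Python. This is only a toy illustration under simplifying assumptions, not the deepTools implementation: the function names (``gc_count``, ``gc_profile``, ``observed_expected_ratio``), the ``n_bins`` parameter, and the use of plain strings and explicit start positions are all hypothetical and chosen for readability. Both profiles are normalized to fractions before the ratio is taken, so that different totals of sampled fragments and sequenced reads do not affect the result.

```python
def gc_count(seq):
    """Number of G or C bases in seq (case-insensitive)."""
    seq = seq.upper()
    return seq.count("G") + seq.count("C")


def gc_profile(genome, starts, fragment_length, n_bins=10):
    """Count fixed-size fragments per GC-content bin.

    genome is a plain string, starts is an iterable of 0-based start
    positions. Passing sampled genome positions gives an expected profile;
    passing read start positions gives an observed profile.
    """
    counts = [0] * n_bins
    for start in starts:
        fragment = genome[start:start + fragment_length]
        if len(fragment) < fragment_length:
            continue  # skip fragments that run off the end of the sequence
        # Integer arithmetic avoids floating-point edge cases when binning;
        # a fragment with 100% GC is placed in the last bin.
        bin_index = min(gc_count(fragment) * n_bins // fragment_length,
                        n_bins - 1)
        counts[bin_index] += 1
    return counts


def observed_expected_ratio(observed, expected):
    """Per-bin ratio of normalized observed to normalized expected counts.

    Bins without expected fragments (or an empty observed profile) give
    None, because no meaningful ratio can be computed there.
    """
    total_obs = sum(observed)
    total_exp = sum(expected)
    ratios = []
    for obs, exp in zip(observed, expected):
        if exp == 0 or total_obs == 0:
            ratios.append(None)
        else:
            ratios.append((obs / total_obs) / (exp / total_exp))
    return ratios
```

With these helpers, reads that cover the genome uniformly give a ratio of 1 in every populated bin, whereas reads that over-represent GC-rich fragments push the ratio above 1 in the high-GC bins and below 1 elsewhere.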
165
166 .. _computeGCBias_example_image:
167
168 .. image:: $PATH_TO_IMAGES/GC_bias_simulated_reads_2L.png
169
170 As you can see, neither plot based on **simulated reads** shows enrichments or depletions for specific GC content bins; there is an almost flat line around the log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations at the ends of the x axis arise because only very few regions in the *Drosophila* genome have such extreme GC fractions, so the number of fragments picked up in the random sampling can vary.
171
172 Now, let's have a look at **real-life data** from genomic DNA sequencing. Panels A and B can be clearly distinguished. The major difference between the underlying experiments was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase, whereas the samples in panel B were subjected to very few rounds of amplification using a high-fidelity DNA polymerase.
147 173
148 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png 174 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png
175 :width: 600
176 :height: 452
149 177
178 **Note:** The expected GC profile depends on the reference genome, as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% and 60% in mouse samples (average GC content of the mouse genome: 45%) than in genome fragments from, for example, *Plasmodium falciparum* (average genome GC content of *P. falciparum*: 20%).
150 179
151 You can find more details on the computeGCBias doc page: https://deeptools.readthedocs.org/en/master/content/tools/computeGCBias.html 180 For more details, for example about when to exclude regions from the read distribution calculation, go `here <http://deeptools.readthedocs.org/en/latest/content/tools/computeGCBias.html#excluding-regions-from-the-read-distribution-calculation>`_
152
153
154 **Output files**:
155
156 - Diagnostic plot
157
158 - box plot of absolute read numbers per GC-content bin
159 - x-y plot of observed/expected read ratios per GC-content bin
160
161 - Data matrix
162
163 - to be used for GC correction with correctGCbias
164 181
165 182
166 ----- 183 -----
167 184
168 @REFERENCES@ 185 @REFERENCES@