comparison computeGCBias.xml @ 1:e74853730716 draft
planemo upload for repository https://github.com/fidelram/deepTools/tree/master/galaxy/wrapper/ commit fef8b344925620444d93d8159c0b2731a5777920
author | bgruening |
---|---|
date | Mon, 15 Feb 2016 10:29:47 -0500 |
parents | 4409903dcb88 |
children | 12a3082cf023 |
0:4409903dcb88 | 1:e74853730716 |
---|---|
1 <tool id="deeptools_compute_gc_bias" name="computeGCBias" version="@WRAPPER_VERSION@.0"> | 1 <tool id="deeptools_compute_gc_bias" name="computeGCBias" version="@WRAPPER_VERSION@.0"> |
2 <description>to see whether your samples should be normalized for GC bias</description> | 2 <description>Determine the GC bias of your sequenced reads</description> |
3 <macros> | 3 <macros> |
4 <token name="@BINARY@">computeGCBias</token> | 4 <token name="@BINARY@">computeGCBias</token> |
5 <import>deepTools_macros.xml</import> | 5 <import>deepTools_macros.xml</import> |
6 </macros> | 6 </macros> |
7 <expand macro="requirements" /> | 7 <expand macro="requirements" /> |
124 <output name="outFileName" file="computeGCBias_result1.tabular" ftype="tabular" /> | 124 <output name="outFileName" file="computeGCBias_result1.tabular" ftype="tabular" /> |
125 </test> | 125 </test> |
126 </tests> | 126 </tests> |
127 <help> | 127 <help> |
128 <![CDATA[ | 128 <![CDATA[ |
129 **What it does** | 129 What it does |
130 ------------------ | |
130 | 131 |
131 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details). | 132 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details). |
132 The output is used to plot the bias and can also be used later on to correct the bias with the tool correctGCbias. | 133 The output is used to plot the results and can also be used later on to correct the bias with the tool ``correctGCbias``. |
133 There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot | 134 There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot depicting the ratio of observed/expected reads per GC-content bin. |
134 depicting the ratio of observed/expected reads per GC-content bin. | |
135 | 135 |
136 ----- | 136 Output files |
137 -------------- | |
137 | 138 |
138 **Summary of the method used** | 139 - Diagnostic plots: |
140 - box plot of absolute read numbers per GC-content bin | |
141 - x-y plot of observed/expected read ratios per GC-content bin | |
139 | 142 |
140 In order to estimate how many reads with what kind of GC content one should have sequenced, we first need to determine how many regions the | 143 - Tabular file: to be used for GC correction with ``correctGCbias`` |
141 reference genome contains with each percentage of GC content, i.e. how many regions in the genome have 50% GC (or 10% GC or 90% GC or...). | 144 |
142 We then sample a large number of equally sized genomic bins and count how many times we see a bin with 50% GC (or 10% GC or 90% or...). These EXPECTED values are independent of any | 145 .. image:: $PATH_TO_IMAGES/computeGCBias_output.png |
143 sequencing bias and is purely dependent on the underlying genome (i.e. it will most likely vary between mouse and fruit fly due to their genome's different GC contents). | 146 :width: 600 |
144 The OBSERVED values are based on the reads from the sequenced sample. Instead of noting how many genomic regions there are per GC content, we now count the reads per GC content. | 147 :height: 455 |
145 In an ideal sample without GC bias, the ratio of OBSERVED/EXPECTED values should be close to 1 regardless of the GC content. Due to PCR (over)amplifications, the majority of ChIP samples | 148 |
146 usually shows a significant bias towards reads with high GC content (>50%) | 149 --------------------------------------------- |
150 | |
151 Background | |
152 ------------- | |
153 | |
154 ``computeGCBias`` is based on a paper by `Benjamini and Speed <http://nar.oxfordjournals.org/content/40/10/e72>`_. | |
155 The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition. | |
156 In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference. | |
157 | |
158 ``computeGCbias`` will first calculate the **expected GC profile** by counting the number of DNA fragments of a fixed size per GC fraction where GC fraction is defined as the number of G's or C's in a genome region of a given length. | |
159 The result is basically a histogram depicting the frequency of DNA fragments for each type of genome region with a GC fraction between 0 to 100 percent. This will be different for each reference genome, but is independent of the actual sequencing experiment. | |
160 | |
161 The profile of the expected DNA fragment distribution is then compared to the **observed GC profile**, which is generated by counting the number of sequenced reads per GC fraction. | |
162 | |
163 In an ideal experiment, the observed GC profile would, of course, look like the expected profile. | |
164 This is indeed the case when applying ``computeGCBias`` to simulated reads. | |
165 | |
166 .. _computeGCBias_example_image: | |
167 | |
168 .. image:: $PATH_TO_IMAGES/GC_bias_simulated_reads_2L.png | |
169 | |
170 As you can see, both plots based on **simulated reads** do not show enrichments or depletions for specific GC content bins, there is an almost flat line around the log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations on the ends of the x axis are due to the fact that only very, very few regions in the *Drosophila* genome have such extreme GC fractions so that the number of fragments that are picked up in the random sampling can vary. | |
171 | |
172 Now, let's have a look at **real-life data** from genomic DNA sequencing. Panels A and B can be clearly distinguished and the major change that took place between the experiments underlying the plots was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase whereas the samples of panel B were subjected to very few rounds of amplification using a high fidelity DNA polymerase. | |
147 | 173 |
148 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png | 174 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png |
175 :width: 600 | |
176 :height: 452 | |
149 | 177 |
178 **Note:** The expected GC profile depends on the reference genome as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% to 60% in mouse samples (average GC content of the mouse genome: 45 %) than for genome fragments from, for example, *Plasmodium falciparum* (average genome GC content *P. falciparum*: 20%). | |
150 | 179 |
151 You can find more details on the computeGCBias doc page: https://deeptools.readthedocs.org/en/master/content/tools/computeGCBias.html | 180 For more details, for example about when to exclude regions from the read distribution calculation, go `here <http://deeptools.readthedocs.org/en/latest/content/tools/computeGCBias.html#excluding-regions-from-the-read-distribution-calculation>`_ |
152 | |
153 | |
154 **Output files**: | |
155 | |
156 - Diagnostic plot | |
157 | |
158 - box plot of absolute read numbers per GC-content bin | |
159 - x-y plot of observed/expected read ratios per GC-content bin | |
160 | |
161 - Data matrix | |
162 | |
163 - to be used for GC correction with correctGCbias | |
164 | 181 |
165 | 182 |
166 ----- | 183 ----- |
167 | 184 |
168 @REFERENCES@ | 185 @REFERENCES@ |
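
The expected-versus-observed comparison described in the revised help text can be illustrated with a small Python sketch. It is not the deepTools implementation: the function names (`gc_fraction`, `gc_profile`, `log2_obs_exp`), the choice of 10 bins, and the toy sequences are assumptions made purely for illustration. The real tool samples fragments from the reference genome and counts reads from a BAM file.

```python
from collections import Counter
import math


def gc_fraction(seq):
    """Return the fraction of G or C bases in a sequence (hypothetical helper)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_profile(fragments, n_bins=10):
    """Count fragments per GC-fraction bin; a fraction of 1.0 falls in the last bin."""
    counts = Counter()
    for frag in fragments:
        bin_index = min(int(gc_fraction(frag) * n_bins), n_bins - 1)
        counts[bin_index] += 1
    return counts


def log2_obs_exp(observed, expected, n_bins=10):
    """log2 of (observed share / expected share) per bin.

    Bins that are empty in either profile are skipped, since the ratio
    is undefined there.
    """
    total_obs = sum(observed.values())
    total_exp = sum(expected.values())
    ratios = {}
    for b in range(n_bins):
        if observed[b] == 0 or expected[b] == 0:
            continue
        obs_share = observed[b] / total_obs
        exp_share = expected[b] / total_exp
        ratios[b] = math.log2(obs_share / exp_share)
    return ratios


# Toy data: "genome" fragments give the expected profile,
# "reads" give the observed profile (skewed towards GC-rich sequence).
genome_fragments = ["AAAA", "AACG", "GGCC", "ATGC"]
reads = ["GGCC", "GGCC", "AAAA", "ATGC"]

expected = gc_profile(genome_fragments)
observed = gc_profile(reads)
bias = log2_obs_exp(observed, expected)
```

In this toy case a positive value for the highest GC bin indicates over-representation of GC-rich reads, while an unbiased sample would give values near 0 in every bin, matching the flat line the help text describes for simulated reads.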