diff computeGCBias.xml @ 1:e74853730716 draft

planemo upload for repository https://github.com/fidelram/deepTools/tree/master/galaxy/wrapper/ commit fef8b344925620444d93d8159c0b2731a5777920
author bgruening
date Mon, 15 Feb 2016 10:29:47 -0500
parents 4409903dcb88
children 12a3082cf023
--- a/computeGCBias.xml	Mon Jan 25 20:20:49 2016 -0500
+++ b/computeGCBias.xml	Mon Feb 15 10:29:47 2016 -0500
@@ -1,5 +1,5 @@
 <tool id="deeptools_compute_gc_bias" name="computeGCBias" version="@WRAPPER_VERSION@.0">
-    <description>to see whether your samples should be normalized for GC bias</description>
+    <description>Determine the GC bias of your sequenced reads</description>
     <macros>
         <token name="@BINARY@">computeGCBias</token>
         <import>deepTools_macros.xml</import>
@@ -126,41 +126,58 @@
     </tests>
     <help>
 <![CDATA[
-**What it does**
+What it does
+------------------
 
 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details).
-The output is used to plot the bias and can also be used later on to correct the bias with the tool correctGCbias.
-There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot
-depicting the ratio of observed/expected reads per GC-content bin.
+The output is used to plot the results and can also be used later on to correct the bias with the tool ``correctGCBias``.
+The tool produces two plots: a box plot showing the absolute read numbers per GC-content bin and an x-y plot depicting the ratio of observed/expected reads per GC-content bin.
+
+Output files
+--------------
 
------
+- Diagnostic plots:
+      - box plot of absolute read numbers per GC-content bin
+      - x-y plot of observed/expected read ratios per GC-content bin
+
+- Tabular file: to be used for GC correction with ``correctGCBias``
 
-**Summary of the method used**
+.. image:: $PATH_TO_IMAGES/computeGCBias_output.png
+   :width: 600
+   :height: 455
+
+---------------------------------------------
 
-In order to estimate how many reads with what kind of GC content one should have sequenced, we first need to determine how many regions the 
-reference genome contains with each percentage of GC content, i.e. how many regions in the genome have 50% GC (or 10% GC or 90% GC or...).
-We then sample a large number of equally sized genomic bins and count how many times we see a bin with 50% GC (or 10% GC or 90% or...). These EXPECTED values are independent of any
-sequencing bias and is purely dependent on the underlying genome (i.e. it will most likely vary between mouse and fruit fly due to their genome's different GC contents).
-The OBSERVED values are based on the reads from the sequenced sample. Instead of noting how many genomic regions there are per GC content, we now count the reads per GC content.
-In an ideal sample without GC bias, the ratio of OBSERVED/EXPECTED values should be close to 1 regardless of the GC content. Due to PCR (over)amplifications, the majority of ChIP samples
-usually shows a significant bias towards reads with high GC content (>50%)
+Background
+-------------
+
+``computeGCBias`` is based on a paper by `Benjamini and Speed <http://nar.oxfordjournals.org/content/40/10/e72>`_.
+The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition.
+In reality, the DNA polymerases used for PCR-based amplification during library preparation prefer GC-rich regions. This influences the outcome of the sequencing: there will be more reads for GC-rich regions simply because of the DNA polymerase's preference.
+
+``computeGCBias`` will first calculate the **expected GC profile** by counting the number of DNA fragments of a fixed size per GC fraction, where the GC fraction is defined as the proportion of G's and C's in a genome region of a given length.
+The result is basically a histogram depicting the frequency of DNA fragments for each GC fraction between 0 and 100 percent. This will be different for each reference genome, but is independent of the actual sequencing experiment.
+
+The profile of the expected DNA fragment distribution is then compared to the **observed GC profile**, which is generated by counting the number of sequenced reads per GC fraction.
+
+In an ideal experiment, the observed GC profile would, of course, look like the expected profile.
+This is indeed the case when applying ``computeGCBias`` to simulated reads.
+
+.. _computeGCBias_example_image:
+
+.. image:: $PATH_TO_IMAGES/GC_bias_simulated_reads_2L.png
+
+As you can see, both plots based on **simulated reads** do not show enrichments or depletions for specific GC content bins; there is an almost flat line around the log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations at the ends of the x axis arise because only very few regions in the *Drosophila* genome have such extreme GC fractions, so the number of fragments picked up in the random sampling can vary.
+
+Now, let's have a look at **real-life data** from genomic DNA sequencing. Panels A and B can be clearly distinguished. The major difference between the experiments underlying the plots is that the samples in panel A were prepared with too many PCR cycles and a standard polymerase, whereas the samples in panel B were subjected to very few rounds of amplification using a high-fidelity DNA polymerase.
 
 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png
-
-
-You can find more details on the computeGCBias doc page: https://deeptools.readthedocs.org/en/master/content/tools/computeGCBias.html
-
-
-**Output files**:
+   :width: 600
+   :height: 452
 
-- Diagnostic plot
+**Note:** The expected GC profile depends on the reference genome, as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% and 60% in mouse samples (average GC content of the mouse genome: 45%) than for genome fragments from, for example, *Plasmodium falciparum* (average GC content of the *P. falciparum* genome: 20%).
 
-  - box plot of absolute read numbers per GC-content bin
-  - x-y plot of observed/expected read ratios per GC-content bin
-
-- Data matrix
-
-  - to be used for GC correction with correctGCbias
+For more details, for example about when to exclude regions from the read distribution calculation, see the `computeGCBias documentation <http://deeptools.readthedocs.org/en/latest/content/tools/computeGCBias.html#excluding-regions-from-the-read-distribution-calculation>`_.
 
 
 -----
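Reviewer note (not part of the changeset): the comparison of expected and observed GC profiles described in the new Background text can be illustrated with a minimal Python sketch. This is not the deepTools implementation. The function names (`gc_count`, `gc_profile`, `log2_ratios`), the toy genome string, and the toy read start positions are all hypothetical and chosen only for illustration. Profiles are keyed by the number of G/C bases per fragment; dividing by the fragment length gives the GC fraction.

```python
import math
import random
from collections import Counter


def gc_count(seq):
    """Return the number of G or C bases in seq (case-insensitive)."""
    s = seq.upper()
    return s.count("G") + s.count("C")


def gc_profile(genome, starts, fragment_length):
    """Tally fragments of fragment_length beginning at each start, per GC count.

    Starts whose fragment would run past the end of the genome are skipped.
    """
    counts = Counter()
    for start in starts:
        fragment = genome[start:start + fragment_length]
        if len(fragment) == fragment_length:
            counts[gc_count(fragment)] += 1
    return counts


def log2_ratios(observed, expected):
    """log2 of (observed share / expected share) for each GC count.

    GC counts that are missing from either profile are left out.
    """
    obs_total = sum(observed.values())
    exp_total = sum(expected.values())
    ratios = {}
    for gc, exp_n in expected.items():
        obs_n = observed.get(gc, 0)
        if obs_n == 0 or exp_n == 0:
            continue
        ratios[gc] = math.log2((obs_n / obs_total) / (exp_n / exp_total))
    return ratios


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    # Hypothetical toy genome: an AT-rich half followed by a GC-rich half.
    genome = "".join(rng.choice("AATTGC") for _ in range(5000)) + \
             "".join(rng.choice("GGCCAT") for _ in range(5000))
    fragment_length = 50
    last_start = len(genome) - fragment_length

    # Expected profile: fragments sampled uniformly from the genome.
    sampled = [rng.randint(0, last_start) for _ in range(20000)]
    expected = gc_profile(genome, sampled, fragment_length)

    # Observed profile: toy "reads" drawn mostly from the GC-rich half.
    read_starts = [rng.randint(4000, last_start) for _ in range(20000)]
    observed = gc_profile(genome, read_starts, fragment_length)

    for gc, ratio in sorted(log2_ratios(observed, expected).items()):
        print(f"GC fraction {gc / fragment_length:.2f}: log2 ratio {ratio:+.2f}")
```

In an unbiased sample the log2 ratios would stay close to 0 across GC fractions, which corresponds to the flat line described for simulated reads.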