comparison computeGCBias.xml @ 3:12a3082cf023 draft

planemo upload for repository https://github.com/fidelram/deepTools/tree/master/galaxy/wrapper/ commit 2e8510e4f4015f51f7726de5697ba2de9b4e2f4c
author bgruening
date Wed, 09 Mar 2016 18:13:26 -0500
parents e74853730716
children 1c9d626635b4
2:fc29c04c3605 3:12a3082cf023
47 #end if 47 #end if
48 ]]> 48 ]]>
49 </command> 49 </command>
50 <inputs> 50 <inputs>
51 <param name="bamInput" format="bam" type="data" label="BAM file" 51 <param name="bamInput" format="bam" type="data" label="BAM file"
52 help="The BAM file must be sorted."/> 52 help=""/>
53 53
54 <expand macro="reference_genome_source" /> 54 <expand macro="reference_genome_source" />
55 <expand macro="effectiveGenomeSize" /> 55 <expand macro="effectiveGenomeSize" />
56 <expand macro="fragmentLength" /> 56 <expand macro="fragmentLength" />
57 <expand macro="region_limit_operation" /> 57 <expand macro="region_limit_operation" />
125 </test> 125 </test>
126 </tests> 126 </tests>
127 <help> 127 <help>
128 <![CDATA[ 128 <![CDATA[
129 What it does 129 What it does
130 ------------------ 130 ------------
131 131
132 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details). 132 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details).
133 The output is used to plot the results and can also be used later on to correct the bias with the tool ``correctGCbias``. 133 The output is used to plot the results and can also be used later on to correct the bias with the tool ``correctGCbias``.
134 There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot depicting the ratio of observed/expected reads per GC-content bin. 134 There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot depicting the ratio of observed/expected reads per GC-content bin.
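The observed/expected ratio per GC-content bin described above can be illustrated with a minimal sketch. This is not the deepTools implementation; the function name and the dictionary inputs (GC bin mapped to a count) are hypothetical and chosen only for illustration. Both sides are normalised to fractions first, so a log2 ratio of 0 means a bin holds exactly its expected share of reads.

```python
import math


def log2_observed_expected(observed, expected):
    """Return log2(observed/expected) per GC bin.

    observed, expected: dicts mapping a GC-bin label to a count
    (hypothetical inputs, not the tool's real data structures).
    Bins with a zero count on either side are skipped, because the
    log ratio is undefined there.
    """
    obs_total = sum(observed.values())
    exp_total = sum(expected.values())
    result = {}
    for gc_bin, exp_count in expected.items():
        obs_count = observed.get(gc_bin, 0)
        if obs_count == 0 or exp_count == 0:
            continue
        # Compare the share of reads in this bin, not the raw counts.
        ratio = (obs_count / obs_total) / (exp_count / exp_total)
        result[gc_bin] = math.log2(ratio)
    return result
```

With this sketch, an unbiased sample yields values close to 0 in every bin, while a sample with a preference for GC-rich fragments yields positive values in the high-GC bins and negative values in the low-GC bins.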
135 135
136 Output files 136 Output files
137 -------------- 137 ------------
138 138
139 - Diagnostic plots: 139 - Diagnostic plots:
140 - box plot of absolute read numbers per GC-content bin 140 - box plot of absolute read numbers per GC-content bin
141 - x-y plot of observed/expected read ratios per GC-content bin 141 - x-y plot of observed/expected read ratios per GC-content bin
142 142
143 - Tabular file: to be used for GC correction with ``correctGCbias`` 143 - Tabular file: to be used for GC correction with ``correctGCbias``
144 144
145 .. image:: $PATH_TO_IMAGES/computeGCBias_output.png 145 .. image:: $PATH_TO_IMAGES/computeGCBias_output.png
146 :width: 600 146 :width: 600
147 :height: 455 147 :height: 455
148 148
149 --------------------------------------------- 149 -----
150 150
151 Background 151 Theoretical Background
152 ------------- 152 ----------------------
153 153
154 ``computeGCBias`` is based on a paper by `Benjamini and Speed <http://nar.oxfordjournals.org/content/40/10/e72>`_. 154 ``computeGCBias`` is based on a paper by `Benjamini and Speed <http://nar.oxfordjournals.org/content/40/10/e72>`_.
155 The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition. 155 The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition.
156 In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference. 156 In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference.
157 157
170 As you can see, both plots based on **simulated reads** do not show enrichments or depletions for specific GC content bins, there is an almost flat line around the log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations on the ends of the x axis are due to the fact that only very, very few regions in the *Drosophila* genome have such extreme GC fractions so that the number of fragments that are picked up in the random sampling can vary. 170 As you can see, both plots based on **simulated reads** do not show enrichments or depletions for specific GC content bins, there is an almost flat line around the log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations on the ends of the x axis are due to the fact that only very, very few regions in the *Drosophila* genome have such extreme GC fractions so that the number of fragments that are picked up in the random sampling can vary.
171 171
172 Now, let's have a look at **real-life data** from genomic DNA sequencing. Panels A and B can be clearly distinguished and the major change that took place between the experiments underlying the plots was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase whereas the samples of panel B were subjected to very few rounds of amplification using a high fidelity DNA polymerase. 172 Now, let's have a look at **real-life data** from genomic DNA sequencing. Panels A and B can be clearly distinguished and the major change that took place between the experiments underlying the plots was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase whereas the samples of panel B were subjected to very few rounds of amplification using a high fidelity DNA polymerase.
173 173
174 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png 174 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png
175 :width: 600 175 :width: 600
176 :height: 452 176 :height: 452
177 177
178 **Note:** The expected GC profile depends on the reference genome as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% to 60% in mouse samples (average GC content of the mouse genome: 45 %) than for genome fragments from, for example, *Plasmodium falciparum* (average genome GC content *P. falciparum*: 20%). 178 **Note:** The expected GC profile depends on the reference genome as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% to 60% in mouse samples (average GC content of the mouse genome: 45 %) than for genome fragments from, for example, *Plasmodium falciparum* (average genome GC content *P. falciparum*: 20%).
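To make the dependence on the reference genome concrete, the sketch below tallies the GC fraction of fixed-length fragments along a reference sequence. It is a simplified illustration, not what ``computeGCBias`` does internally: it walks the sequence in regular steps instead of sampling positions at random, and the function names and the ``step`` parameter are assumptions introduced here.

```python
from collections import Counter


def gc_fraction(seq):
    """Return the GC fraction of seq, ignoring non-ACGT characters.

    Returns None if the sequence contains no A, C, G or T at all
    (for example a stretch of N).
    """
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def expected_gc_histogram(reference, fragment_length, step=1):
    """Count fragments per GC fraction (rounded to two decimals).

    reference: the reference sequence as a plain string (hypothetical input).
    step: distance between fragment start positions; an assumption of this
    sketch, standing in for random sampling.
    """
    counts = Counter()
    last_start = len(reference) - fragment_length
    for start in range(0, last_start + 1, step):
        frac = gc_fraction(reference[start:start + fragment_length])
        if frac is not None:
            counts[round(frac, 2)] += 1
    return counts
```

Running such a tally on an AT-rich reference puts most fragments in the low-GC bins, so the expected profile differs from that of a genome with a higher average GC content.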
179 179
180 For more details, for example about when to exclude regions from the read distribution calculation, go `here <http://deeptools.readthedocs.org/en/latest/content/tools/computeGCBias.html#excluding-regions-from-the-read-distribution-calculation>`_ 180 For more details, for example about when to exclude regions from the read distribution calculation, go `here <http://deeptools.readthedocs.org/en/latest/content/tools/computeGCBias.html#excluding-regions-from-the-read-distribution-calculation>`_
181 181