comparison computeGCBias.xml @ 3:12a3082cf023 draft
planemo upload for repository https://github.com/fidelram/deepTools/tree/master/galaxy/wrapper/ commit 2e8510e4f4015f51f7726de5697ba2de9b4e2f4c
author | bgruening |
---|---|
date | Wed, 09 Mar 2016 18:13:26 -0500 |
parents | e74853730716 |
children | 1c9d626635b4 |
2:fc29c04c3605 | 3:12a3082cf023 |
---|---|
47 #end if | 47 #end if |
48 ]]> | 48 ]]> |
49 </command> | 49 </command> |
50 <inputs> | 50 <inputs> |
51 <param name="bamInput" format="bam" type="data" label="BAM file" | 51 <param name="bamInput" format="bam" type="data" label="BAM file" |
52 help="The BAM file must be sorted."/> | 52 help=""/> |
53 | 53 |
54 <expand macro="reference_genome_source" /> | 54 <expand macro="reference_genome_source" /> |
55 <expand macro="effectiveGenomeSize" /> | 55 <expand macro="effectiveGenomeSize" /> |
56 <expand macro="fragmentLength" /> | 56 <expand macro="fragmentLength" /> |
57 <expand macro="region_limit_operation" /> | 57 <expand macro="region_limit_operation" /> |
125 </test> | 125 </test> |
126 </tests> | 126 </tests> |
127 <help> | 127 <help> |
128 <![CDATA[ | 128 <![CDATA[ |
129 What it does | 129 What it does |
130 ------------------ | 130 ------------ |
131 | 131 |
132 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details). | 132 This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details). |
133 The output is used to plot the results and can also be used later on to correct the bias with the tool ``correctGCbias``. | 133 The output is used to plot the results and can also be used later on to correct the bias with the tool ``correctGCbias``. |
134 There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot depicting the ratio of observed/expected reads per GC-content bin. | 134 There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot depicting the ratio of observed/expected reads per GC-content bin. |
135 | 135 |
136 Output files | 136 Output files |
137 -------------- | 137 ------------ |
138 | 138 |
139 - Diagnostic plots: | 139 - Diagnostic plots: |
140 - box plot of absolute read numbers per GC-content bin | 140 - box plot of absolute read numbers per GC-content bin |
141 - x-y plot of observed/expected read ratios per GC-content bin | 141 - x-y plot of observed/expected read ratios per GC-content bin |
142 | 142 |
143 - Tabular file: to be used for GC correction with ``correctGCbias`` | 143 - Tabular file: to be used for GC correction with ``correctGCbias`` |
144 | 144 |
145 .. image:: $PATH_TO_IMAGES/computeGCBias_output.png | 145 .. image:: $PATH_TO_IMAGES/computeGCBias_output.png |
146 :width: 600 | 146 :width: 600 |
147 :height: 455 | 147 :height: 455 |
148 | 148 |
149 --------------------------------------------- | 149 ----- |
150 | 150 |
151 Background | 151 Theoretical Background |
152 ------------- | 152 ---------------------- |
153 | 153 |
154 ``computeGCBias`` is based on a paper by `Benjamini and Speed <http://nar.oxfordjournals.org/content/40/10/e72>`_. | 154 ``computeGCBias`` is based on a paper by `Benjamini and Speed <http://nar.oxfordjournals.org/content/40/10/e72>`_. |
155 The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition. | 155 The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition. |
156 In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference. | 156 In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference. |
157 | 157 |
170 As you can see, both plots based on **simulated reads** do not show enrichments or depletions for specific GC content bins, there is an almost flat line around the log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations on the ends of the x axis are due to the fact that only very, very few regions in the *Drosophila* genome have such extreme GC fractions so that the number of fragments that are picked up in the random sampling can vary. | 170 As you can see, both plots based on **simulated reads** do not show enrichments or depletions for specific GC content bins, there is an almost flat line around the log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations on the ends of the x axis are due to the fact that only very, very few regions in the *Drosophila* genome have such extreme GC fractions so that the number of fragments that are picked up in the random sampling can vary. |
171 | 171 |
172 Now, let's have a look at **real-life data** from genomic DNA sequencing. Panels A and B can be clearly distinguished and the major change that took place between the experiments underlying the plots was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase whereas the samples of panel B were subjected to very few rounds of amplification using a high fidelity DNA polymerase. | 172 Now, let's have a look at **real-life data** from genomic DNA sequencing. Panels A and B can be clearly distinguished and the major change that took place between the experiments underlying the plots was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase whereas the samples of panel B were subjected to very few rounds of amplification using a high fidelity DNA polymerase. |
173 | 173 |
174 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png | 174 .. image:: $PATH_TO_IMAGES/QC_GCplots_input.png |
175 :width: 600 | 175 :width: 600 |
176 :height: 452 | 176 :height: 452 |
177 | 177 |
178 **Note:** The expected GC profile depends on the reference genome as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% to 60% in mouse samples (average GC content of the mouse genome: 45 %) than for genome fragments from, for example, *Plasmodium falciparum* (average genome GC content *P. falciparum*: 20%). | 178 **Note:** The expected GC profile depends on the reference genome as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% to 60% in mouse samples (average GC content of the mouse genome: 45 %) than for genome fragments from, for example, *Plasmodium falciparum* (average genome GC content *P. falciparum*: 20%). |
179 | 179 |
180 For more details, for example about when to exclude regions from the read distribution calculation, go `here <http://deeptools.readthedocs.org/en/latest/content/tools/computeGCBias.html#excluding-regions-from-the-read-distribution-calculation>`_ | 180 For more details, for example about when to exclude regions from the read distribution calculation, go `here <http://deeptools.readthedocs.org/en/latest/content/tools/computeGCBias.html#excluding-regions-from-the-read-distribution-calculation>`_ |
181 | 181 |
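The help text above describes the x-y plot as the ratio of observed to expected reads per GC-content bin, with a log2 ratio of 0 meaning no bias. The sketch below illustrates that quantity on toy sequences. It is not the deepTools implementation: the function names (`gc_fraction`, `gc_bin_counts`, `log2_obs_exp`), the choice of 10 bins, and the use of plain strings as fragments are all assumptions made for illustration.

```python
import math
from collections import Counter


def gc_fraction(seq):
    """Return the fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin_counts(fragments, n_bins=10):
    """Count fragments per GC-content bin (hypothetical helper)."""
    counts = Counter()
    for frag in fragments:
        # A GC fraction of exactly 1.0 is placed in the last bin.
        b = min(int(gc_fraction(frag) * n_bins), n_bins - 1)
        counts[b] += 1
    return counts


def log2_obs_exp(observed, expected, n_bins=10):
    """Return log2(observed/expected) per bin, using each bin's share
    of the total so that different sampling depths are comparable.
    Bins that are empty in either set are skipped."""
    obs_total = sum(observed.values())
    exp_total = sum(expected.values())
    ratios = {}
    for b in range(n_bins):
        if observed[b] == 0 or expected[b] == 0:
            continue
        obs_share = observed[b] / obs_total
        exp_share = expected[b] / exp_total
        ratios[b] = math.log2(obs_share / exp_share)
    return ratios


# Toy data: "expected" stands in for fragments sampled from the genome,
# "observed" for sequenced fragments with an excess of GC-rich ones.
expected = gc_bin_counts(["AAAA", "GGGG", "AAGG", "AAGG"])
observed = gc_bin_counts(["GGGG", "GGGG", "AAGG", "AAAA"])
bias = log2_obs_exp(observed, expected)
```

In this toy case the GC-rich bin has a positive log2 ratio (over-represented), the mid-GC bin a negative one, and the AT-rich bin a ratio of 0, which is the flat-line behaviour described for unbiased data.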