Mercurial > repos > bgruening > deeptools_compute_gc_bias

<tool id="deeptools_compute_gc_bias" name="computeGCBias" version="@WRAPPER_VERSION@.0">
    <description>Determine the GC bias of your sequenced reads</description>
    <macros>
        <token name="@BINARY@">computeGCBias</token>
        <import>deepTools_macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <command>
<![CDATA[
        ln -s "$bamInput" "local_bamInput.bam" &&
        ln -s "$bamInput.metadata.bam_index" local_bamInput.bam.bai &&

        @BINARY@
            @THREADS@
            --bamfile local_bamInput.bam
            --GCbiasFrequenciesFile $outFileName
            --fragmentLength $fragmentLength

            @reference_genome_source@

            #if $effectiveGenomeSize.effectiveGenomeSize_opt == "specific":
                --effectiveGenomeSize $effectiveGenomeSize.effectiveGenomeSize
            #else:
                --effectiveGenomeSize $effectiveGenomeSize.effectiveGenomeSize_opt
            #end if

            #if str($region).strip() != '':
                --region '$region'
            #end if

            #if $advancedOpt.showAdvancedOpt == "yes":
                --sampleSize '$advancedOpt.sampleSize'
                --regionSize '$advancedOpt.regionSize'

                #if $advancedOpt.extraSampling:
                    --extraSampling $advancedOpt.extraSampling
                #end if

                @blacklist@
            #end if

            #if str($image_format) != 'none':
                --biasPlot $outImageName
                --plotFileFormat $image_format
            #end if
]]>
    </command>
    <inputs>
        <param name="bamInput" format="bam" type="data" label="BAM file"
            help=""/>

        <expand macro="reference_genome_source" />
        <expand macro="effectiveGenomeSize" />
        <expand macro="fragmentLength" />
        <expand macro="region_limit_operation" />

        <conditional name="advancedOpt">
            <param name="showAdvancedOpt" type="select" label="Show advanced options" >
                <option value="no" selected="true">no</option>
                <option value="yes">yes</option>
            </param>
            <when value="no" />
            <when value="yes">
                <param name="sampleSize" type="integer" value="50000000" min="1"
                    label="Number of sampling points to consider" help="(--sampleSize)" />
                <param name="regionSize" type="integer" value="300" min="1"
                    label="Region size"
                    help ="To plot the reads per GC over a region, the size of the region is
                    required (see below for more details about the method). By default, the bin size
                    is set to 300 bases, which is close to the standard fragment size of many sequencing
                    applications. However, if the depth of sequencing is low, a larger bin size will
                    be required, otherwise many bins will not overlap with any read. (--regionSize)"/>
                <param name="extraSampling" type="data" format="bed" optional="true"
                    label="BED file containing genomic regions for which extra sampling is required because they are underrepresented in the genome"
                    help="(--extraSampling)" />
                <expand macro="blacklist" />
            </when>
        </conditional>
        <param name="image_format" type="select"
            label="GC bias plot"
            help="If given, a diagnostic image summarizing the GC bias found on the sample will be created. (--plotFileFormat)">
            <option value="none">No image</option>
            <option value="png" selected="true">Image in png format</option>
            <option value="pdf">Image in pdf format</option>
            <option value="svg">Image in svg format</option>
            <option value="eps">Image in eps format</option>
        </param>
    </inputs>
    <outputs>
        <data name="outFileName" format="tabular" />
        <data name="outImageName" format="png" label="${tool.name} GC-bias Plot">
            <filter>
            ((
                image_format != 'none'
            ))
            </filter>
            <change_format>
                <when input="image_format" value="pdf" format="pdf" />
                <when input="image_format" value="svg" format="svg" />
                <when input="image_format" value="eps" format="eps" />
            </change_format>
        </data>
    </outputs>
    <tests>
        <test>
            <param name="bamInput" value="paired_chr2L.bam" ftype="bam" />
            <param name="image_format" value="png" />
            <param name="showAdvancedOpt" value="yes" />
            <param name="regionSize" value="1" />
            <param name="ref_source" value="history" />
            <param name="input1" value="sequence.2bit" />
            <param name="sampleSize" value="10" />
            <param name="effectiveGenomeSize_opt" value="specific" />
            <param name="effectiveGenomeSize" value="10050" />
            <param name="region" value="chr2L" />
            <param name="image_format" value="none" />
            <output name="outFileName" file="computeGCBias_result1.tabular" ftype="tabular" />
        </test>
    </tests>
    <help>
<![CDATA[
What it does
------------

This tool computes the GC bias using the method proposed in Benjamini and Speed (2012) Nucleic Acids Res. (see below for further details).
The output is used to plot the results and can also be used later on to correct the bias with the tool ``correctGCbias``.
There are two plots produced by the tool: a boxplot showing the absolute read numbers per GC-content bin and an x-y plot depicting the ratio of observed/expected reads per GC-content bin.

Output files
------------

- Diagnostic plots:
    - box plot of absolute read numbers per GC-content bin
    - x-y plot of observed/expected read ratios per GC-content bin

- Tabular file: to be used for GC correction with ``correctGCbias``

.. image:: $PATH_TO_IMAGES/computeGCBias_output.png
    :width: 600
    :height: 455

-----

Theoretical Background
----------------------

``computeGCBias`` is based on a paper by `Benjamini and Speed <http://nar.oxfordjournals.org/content/40/10/e72>`_.
The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition.
In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference.

``computeGCbias`` will first calculate the **expected GC profile** by counting the number of DNA fragments of a fixed size per GC fraction where GC fraction is defined as the number of G's or C's in a genome region of a given length.
The result is basically a histogram depicting the frequency of DNA fragments for each type of genome region with a GC fraction between 0 to 100 percent. This will be different for each reference genome, but is independent of the actual sequencing experiment.

The profile of the expected DNA fragment distribution is then compared to the **observed GC profile**, which is generated by counting the number of sequenced reads per GC fraction.

In an ideal experiment, the observed GC profile would, of course, look like the expected profile.
This is indeed the case when applying ``computeGCBias`` to simulated reads.

.. _computeGCBias_example_image:

.. image:: $PATH_TO_IMAGES/GC_bias_simulated_reads_2L.png

As you can see, both plots based on **simulated reads** do not show enrichments or depletions for specific GC content bins, there is an almost flat line around the log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations on the ends of the x axis are due to the fact that only very, very few regions in the *Drosophila* genome have such extreme GC fractions so that the number of fragments that are picked up in the random sampling can vary.

Now, let's have a look at **real-life data** from genomic DNA sequencing. Panels A and B can be clearly distinguished and the major change that took place between the experiments underlying the plots was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase whereas the samples of panel B were subjected to very few rounds of amplification using a high fidelity DNA polymerase.

.. image:: $PATH_TO_IMAGES/QC_GCplots_input.png
    :width: 600
    :height: 452

**Note:** The expected GC profile depends on the reference genome as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% to 60% in mouse samples (average GC content of the mouse genome: 45 %) than for genome fragments from, for example, *Plasmodium falciparum* (average genome GC content *P. falciparum*: 20%).

For more details, for example about when to exclude regions from the read distribution calculation, go `here <http://deeptools.readthedocs.org/en/latest/content/tools/computeGCBias.html#excluding-regions-from-the-read-distribution-calculation>`_


-----

@REFERENCES@
]]>
    </help>
    <expand macro="citations" />
</tool>
author	bgruening
date	Mon, 05 Dec 2016 08:09:41 -0500
parents	1c9d626635b4
children	1795fdfc23b8