view tag2collapse.xml @ 0:0475e4175855 draft default tip

planemo upload commit 81ece2551cea27cbd0e718ef5b7a2fe8d4abd071-dirty
author yqiancolumbia
date Mon, 30 Apr 2018 05:25:11 -0400

<tool id="tag2collapse" name="Collapse PCR duplicates">
	<description>using coordinates</description>

	<command interpreter="perl">
		/home/galaxy/tools/CTK/tag2collapse.pl $bigFile -v --weight-in-name --keep-max-score --keep-tag-name

		#if $randomBarcode.hasRandomBarcode == "yes":
		--random-barcode -EM $randomBarcode.confidence --seq-error-model $randomBarcode.seqErrorModel
		#end if

		$weightFlag $input $output
	</command>
	<inputs>
		<param type="data" format="bed" name="input" label="Input BED file"/>
		<param name="bigFile" type="boolean" truevalue="-big" falsevalue="" checked="yes" label="Big file (over 6M lines)" />
		<param name="weightFlag" type="boolean" truevalue="-weight" falsevalue="" checked="yes" label="Consider the weight of each tag - each read has a weight representing its exact copy number in the raw data (see help below for more information)" />


		<conditional name="randomBarcode">
			<param name="hasRandomBarcode" type="select" label="Is a degenerate barcode (i.e., UMI) attached to the read ID? (Tags with different barcodes are not collapsed; see help below for more information)">
				<option value="yes">Yes</option>
				<option value="no">No</option>
			</param>
			<when value="yes">
				<param name="seqErrorModel" type="select" label="How should the sequencing error be estimated?">
					<option value="alignment" selected="yes">From mismatches in alignment</option>
					<option value="em-local">From degenerate barcode using an EM algorithm</option>
				</param>
				<param name="confidence" type="integer" value="30" label="Confidence score for the EM algorithm" />
			</when>
			<when value="no">
			</when>
		</conditional>
	</inputs>

	<outputs>
		<data name="output" format="bed" label="Collapse PCR duplicates on ${on_string}"/>
	</outputs>

	<help>

.. class:: infomark

**What this tool does**

This tool collapses tags according to their start position.

It takes a BED file of tags as input and outputs a BED file of unique tags, eliminating potential PCR duplicates.
 
It can run in two modes:

1. No degenerate barcode.  In this mode, tags with the same starting coordinates are collapsed and only one is kept.
2. With degenerate barcode.  In this mode, tags that map to the same position but carry different degenerate barcodes can be kept as separate tags (see below for details).
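
As a rough illustration of mode 1, the Python sketch below groups BED lines by chromosome, strand, and start coordinate and keeps one tag per group. The function name ``collapse_by_start``, the choice of the first tag seen as the representative, and the use of the chromStart column for both strands are assumptions made for this example; they are not taken from tag2collapse.pl.

```python
def collapse_by_start(bed_lines):
    """Keep one tag per (chrom, strand, start) group.

    Hypothetical helper for illustration only; tag2collapse.pl may choose
    the representative tag and define the start position differently.
    """
    kept = {}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        key = (chrom, strand, start)
        # Assumption: the first tag seen at a position is the one kept.
        kept.setdefault(key, fields)
    return ["\t".join(fields) for fields in kept.values()]


tags = [
    "chr1\t100\t130\ttagA\t0\t+",
    "chr1\t100\t128\ttagB\t0\t+",
    "chr1\t100\t130\ttagC\t0\t-",
    "chr1\t250\t280\ttagD\t0\t+",
]
for line in collapse_by_start(tags):
    print(line)
```

Here tagA and tagB share a chromosome, strand, and start, so only one of them survives; tagC is on the opposite strand and tagD starts elsewhere, so both are kept.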

-----

**Consider the weight of each tag if you collapsed exact duplicates (reads with exactly the same sequences)**

The tool will then consider a "weight" that represents the copy number of each tag.  This weight can be given in the score (5th) column, or attached to the tag NAME before the degenerate barcode sequence (e.g., READ1#10#ACGTA).

-----

.. class:: warningmark

**Input data format (important)**

The input is a BED file of unambiguously mapped tags.  However, the tool may need extra information embedded in the BED file, depending on the parameters you use. This extra information is already present if you used the alignment tool provided.

First, the tool tries to keep track of the copy number of each tag if you collapsed exact duplicates before alignment (which is always recommended).  By default, the copy number is attached to the sequence name (4th column).  Therefore, a sequence id might read like this::

	tag1#3
	
which means tag1 has 3 exact copies and tag1 is the representative of the three. If your sample has a degenerate barcode, the barcode sequence is also attached to the sequence id, after the copy number, so that a sequence id might read like this::

	    tag1#3#AAAGG

If you check "has weight" but do not check "Weight in name", the tool uses the number in the score (5th) column as the copy number. When your data do not have a degenerate barcode, the tool collapses all reads with the same genomic starting coordinates, sums the copy numbers of the tags at each position, and saves the sum in the score column of the output BED file of unique tags.  In this case, whether the copy number was provided correctly does not affect the number or identity of the unique tags reported, so it matters little if you do not care about the total copy number.
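
The naming convention described above can be illustrated with a short Python sketch. It assumes that the fields are separated by ``#``, that the tag id itself contains no ``#``, and that the copy number, when present, comes before the barcode. ``parse_tag_name`` is a hypothetical helper written for this example and is not part of the tool.

```python
def parse_tag_name(name, has_barcode):
    """Split a tag name such as tag1#3#AAAGG into its parts.

    Hypothetical helper; assumes '#' separates the fields and that the
    tag id itself contains no '#'.
    """
    parts = name.split("#")
    if has_barcode and len(parts) == 1:
        raise ValueError("no barcode found in tag name: " + name)
    tag_id = parts[0]
    # The barcode, if any, is always the last field.
    barcode = parts.pop() if has_barcode else None
    # Whatever remains after the id is taken as the copy number (default 1).
    copy_number = int(parts[1]) if len(parts) == 2 else 1
    return tag_id, copy_number, barcode


print(parse_tag_name("tag1#3", has_barcode=False))
print(parse_tag_name("tag1#3#AAAGG", has_barcode=True))
print(parse_tag_name("READ1#ACGTA", has_barcode=True))
```

The ``has_barcode`` argument is needed because a name with two fields is ambiguous on its own: the second field could be either a copy number or a barcode.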

However, if your sample has a degenerate barcode, it is CRITICAL to specify the information correctly.


.. class:: warningmark

Update of input format in the new version (11/22/2010):

A new method has been introduced to give a more accurate estimate of the sequencing error, which is important for samples with a degenerate barcode. There are now two options for estimating sequencing errors: from the degenerate barcode, iteratively in the EM algorithm (the original method), or from mismatches detected during alignment (the new method). To use the new method, the number of mismatches must be provided in the score column of the BED file (it is already there if you used the alignment tool provided).  In this case, if you collapsed exact duplicates before alignment, the copy number must be attached to the sequence id (so you need to check both "has weight" and "has weight in name").

The results from the two methods are not dramatically different.  For new analyses, the new method (estimating the sequencing error from alignment) is recommended, and a confidence score of 30 should be suitable in most cases. However, if you want to keep your analysis consistent with previous data, you can still use the original method.

The new method became the default on April 8, 2011.

-----

**How this tool determines unique reads with degenerate barcode**

If the raw reads have random barcodes attached to the 5' end of each read, the barcode has to be stripped before alignment.  The barcode is attached to the NAME of each tag (e.g., READ1 becomes READ1#ACGTA) and is used here to identify tags that have the same starting coordinates but "sufficiently" distinct barcodes. This task is not trivial: some tags can have thousands of copies in some CLIP experiments, so sequencing errors are not negligible.

To deal with the problem, the program either uses an iterative Expectation-Maximization algorithm that estimates sequencing errors from the degenerate barcodes, or uses the sequencing error estimated from the mismatches detected during alignment (new, 11/22/2010). It models the copy number and identity of each barcode sequence, and infers which tags were generated by sequencing errors that produce apparently different barcodes.  The confidence score reflects the probability P that a tag with the observed barcode was generated by sequencing error rather than being "bona fide" in the CLIP library. To be consistent with other sequence alignment programs, -log10(P) * 10 is reported, so a score of 50 means the tag has a chance of 10^(-5) of having been generated by sequencing error.  The confidence score is in the score (5th) column of the output. If all tags mapped to a position have the same barcode (a single-copy tag is a special case), an arbitrary score of 100 is given. Therefore, a confidence score threshold above 100 should never be used.
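
The relation between the confidence score and the error probability can be written out in a few lines of Python. The function names are chosen for this example; only the formula, score = -log10(P) * 10, comes from the description above.

```python
import math


def error_probability_to_score(p):
    """Phred-style score: -10 * log10(P)."""
    return -10.0 * math.log10(p)


def score_to_error_probability(score):
    """Inverse of the above: P = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


print(score_to_error_probability(30))
print(error_probability_to_score(1e-5))
```

Under this formula, a score of 30 (the default value in this wrapper) corresponds to an error probability of 10^(-3), and an error probability of 10^(-5) corresponds to a score of 50.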

	</help>

</tool>