annotate tag2collapse.xml @ 0:0475e4175855 draft default tip

planemo upload commit 81ece2551cea27cbd0e718ef5b7a2fe8d4abd071-dirty
author yqiancolumbia
date Mon, 30 Apr 2018 05:25:11 -0400
parents
children
<tool id="tag2collapse" name="Collapse PCR duplicates">
  <description>using coordinates</description>

  <command interpreter="perl">
    /home/galaxy/tools/CTK/tag2collapse.pl $bigFile -v --weight-in-name --keep-max-score --keep-tag-name

    #if $randomBarcode.hasRandomBarcode == "yes":
      --random-barcode -EM $randomBarcode.confidence --seq-error-model $randomBarcode.seqErrorModel
    #end if

    $weightFlag $input $output
  </command>
  <inputs>
    <param type="data" format="bed" name="input" label="Input BED file"/>
    <param name="bigFile" type="boolean" truevalue="-big" falsevalue="" checked="yes" label="Big file (over 6M lines)" />
    <param name="weightFlag" type="boolean" truevalue="-weight" falsevalue="" checked="yes" label="Consider the weight of each tag - each read has a weight representing its exact copy number in the raw data (see help below for more information)" />

    <conditional name="randomBarcode">
      <param name="hasRandomBarcode" type="select" label="Is a degenerate barcode (i.e., UMI) attached to the read ID? (Tags with different barcodes are not collapsed; see help below for more information)">
        <option value="yes">Yes</option>
        <option value="no">No</option>
      </param>
      <when value="yes">
        <param name="seqErrorModel" type="select" label="How should the sequencing error be estimated?">
          <option value="alignment" selected="yes">From mismatches in alignment</option>
          <option value="em-local">From degenerate barcode using an EM algorithm</option>
        </param>
        <param name="confidence" type="integer" value="30" label="Confidence score for the EM algorithm" />
      </when>
      <when value="no">
      </when>
    </conditional>
  </inputs>

  <outputs>
    <data name="output" format="bed" label="Collapse PCR duplicates on ${on_string}"/>
  </outputs>

  <help>

.. class:: infomark

**What this tool does**

This tool collapses tags according to their start position.

It takes a BED file of tags as input and outputs a BED file of unique tags, eliminating potential PCR duplicates.

It can run in two modes:

1. No degenerate barcode. In this mode, tags with the same starting coordinates are collapsed and only one is kept.
2. With degenerate barcode. In this mode, tags that are mapped to the same position but carry different degenerate barcodes still have a chance to be kept (see below for details).

-----

**Consider the weight of each tag if you collapsed exact duplicates (reads with exactly the same sequences)**

The tool will then consider a "weight" that represents the copy number of each tag. This weight can be given in the score (5th) column, or attached to the tag NAME before the degenerate barcode sequence (e.g., READ1#10#ACGTA).

-----

.. class:: warningmark

**Input data format (important)**

The input file contains the unambiguously mappable tags in BED format. However, the tool may need extra information embedded in the BED file, depending on the parameters you use. This extra information is already present if you use the alignment tool provided.

First, the tool tries to keep track of the copy number of each tag if you collapsed exact duplicates before alignment (which is always recommended). By default, the copy number is attached to the sequence name (4th column). Therefore, a sequence id might read like this::

    tag1#3

which means tag1 has 3 exact copies and tag1 is the representative of the three. If your sample has a degenerate barcode, the barcode sequence is also attached to the sequence id, after the copy number, so that a sequence id might read like this::

    tag1#3#AAAGG
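
As a rough illustration of this naming convention, the following Python sketch splits a sequence id into its name, copy number and optional barcode. The function ``parse_tag_id`` is hypothetical and is not part of the tool; it assumes that the name itself contains no ``#`` character and that the copy number always precedes the barcode.

```python
def parse_tag_id(tag_id):
    """Split 'name#copies' or 'name#copies#barcode' into its parts.

    Hypothetical helper for illustration only; it assumes the name itself
    contains no '#' and that the copy number comes before the barcode.
    """
    fields = tag_id.split("#")
    name = fields[0]
    # A missing copy number is treated as a single copy (an assumption).
    copies = int(fields[1]) if len(fields) > 1 else 1
    # The barcode is only present for samples with a degenerate barcode.
    barcode = fields[2] if len(fields) > 2 else None
    return name, copies, barcode


print(parse_tag_id("tag1#3"))        # ('tag1', 3, None)
print(parse_tag_id("tag1#3#AAAGG"))  # ('tag1', 3, 'AAAGG')
```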

If you check "has weight" but do not check "Weight in name", the tool will use the number in the score (5th) column as the copy number. When your data do not have a degenerate barcode, the tool collapses all reads with the same genomic starting coordinates, sums up the copy numbers of the tags at each position, and saves the total in the score column of the output BED file of unique tags. In this case, whether or not the copy number information is provided correctly does not affect the number and identity of the unique tags reported, so it matters little if you do not care about the total copy number.
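
The no-barcode case can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's implementation: the function ``collapse_by_start`` is hypothetical, the copy number is read from the score column, the first tag seen at each position is kept as the representative, and "starting coordinates" is assumed to mean chromosome, strand and 5' end (the end coordinate for tags on the minus strand).

```python
def collapse_by_start(tags):
    """Collapse BED tags that share a start position, summing copy numbers.

    Each tag is (chrom, start, end, name, score, strand), with the copy
    number in the score column. Keeping the first tag seen at each position
    is an arbitrary choice made for this sketch.
    """
    groups = {}
    for chrom, start, end, name, score, strand in tags:
        # Assumption: the 5' end defines the start (end for '-' strand).
        five_prime = start if strand == "+" else end
        key = (chrom, strand, five_prime)
        if key in groups:
            groups[key][4] += score  # add this tag's copy number
        else:
            groups[key] = [chrom, start, end, name, score, strand]
    return [tuple(tag) for tag in groups.values()]


tags = [
    ("chr1", 100, 136, "tag1", 3, "+"),
    ("chr1", 100, 130, "tag2", 2, "+"),
    ("chr1", 250, 286, "tag3", 1, "+"),
]
for tag in collapse_by_start(tags):
    print("\t".join(str(field) for field in tag))
```

In this toy input, tag1 and tag2 share a start position, so only tag1 is reported and its score becomes 5; tag3 is reported unchanged.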

However, if your sample has a degenerate barcode, it is CRITICAL to specify the information correctly.


.. class:: warningmark

Update of input format in the new version (11/22/2010):

A new method has been introduced to give a more accurate estimate of sequencing error, which is important for samples with a degenerate barcode. There are now two options for estimating sequencing errors: either from the degenerate barcode, iteratively in the EM algorithm (the original method), or from mismatches detected during alignment (the new method). To use the new method, the number of mismatches has to be provided in the score column of the BED file (which is already there if you used the alignment tool provided). In this case, if you collapsed exact duplicates before alignment, the copy number must be attached to the sequence id (so you need to check both "has weight" and "has weight in name").

The results from the two methods are not dramatically different. For new analyses, the new method (estimating sequencing error from alignment) is recommended, and a confidence score of 30 should be suitable in most cases. However, if you want to keep your analysis consistent with previous data, you can still use the original method.

The new method has been the default since April 8, 2011.

-----

**How this tool determines unique reads with degenerate barcode**

If the raw reads have random barcodes attached to the 5' end of each read, the barcode has to be stripped before alignment. The barcode is attached to the NAME of each tag (e.g., READ1 will become READ1#ACGTA), which is used here to identify tags that have the same starting coordinates but "sufficiently" distinct barcodes. This task is not trivial, because some tags can have thousands of copies in some CLIP experiments, so sequencing errors are not negligible.

To deal with this problem, the program either uses an iterative Expectation-Maximization algorithm that estimates sequencing errors from the degenerate barcode, or uses the sequencing error estimated from the mismatches detected during alignment (new, 11/22/2010). It models the copy number and identity of each barcode sequence, and infers which tags were generated by sequencing errors that lead to apparently different barcodes. The confidence score measures whether each tag with the observed barcode is "bona fide" in the CLIP library or was generated by sequencing error. To be consistent with other sequence alignment programs, -log10(P) * 10 is reported, so a score of 50 means that the tag has a chance of 10^(-5) of having been generated by sequencing error. The confidence score is in the score (5th) column of the output. If all tags mapped to a position have the same barcode (a single-copy tag is a special case), an arbitrary score of 100 is given. Therefore, a confidence score threshold > 100 should never be used.
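
The relationship between the probability P and the reported score can be checked with a few lines of Python. This is only a sketch of the -log10(P) * 10 conversion stated above; the function names are hypothetical, and the estimation of P itself (by EM or from alignment mismatches) is not reproduced here.

```python
import math


def confidence_score(p_error):
    """Convert the probability of arising from sequencing error into a score."""
    return -10.0 * math.log10(p_error)


def error_probability(score):
    """Invert the conversion: recover the probability from a score."""
    return 10.0 ** (-score / 10.0)


print(round(confidence_score(1e-5)))  # 50
print(error_probability(30))
```

Under this conversion, the default confidence of 30 corresponds to a probability of 10^(-3) that a tag was generated by sequencing error.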

  </help>

</tool>