comparison mmquant.xml @ 0:60abb6540004 draft

planemo upload commit fb76aa0a938a2498d3206e6039bc1d9906e6c2ce-dirty
author m-zytnicki
date Thu, 11 Aug 2016 03:26:32 -0400
parents
children 87c5fa8651c1
-1:000000000000 0:60abb6540004
1 <tool id="mmquant" name="Gene quantification (mmquant)" version="0.1.0">
2 <requirements>
3 <requirement type="package" version="0.1.0">mmquant</requirement>
4 </requirements>
5 <stdio>
6 <exit_code range="1:" />
7 </stdio>
8 <command><![CDATA[
9 mmquant
10 -a "$annotation"
11 -r
12 #for $r in $reads_info
13 ${r.reads.file_name}
14 #end for
15 -f
16 #for $r in $reads_info
17 ${r.reads.ext}
18 #end for
19 -s
20 #for $r in $reads_info
21 ${r.strand}
22 #end for
23 -n
24 #for $r in $reads_info
25 "${r.name}"
26 #end for
27 -l "$overlap"
28 $gene_name
29 -c "$count"
30 -m "$merge"
31 -o "$output"
32 ]]></command>
33 <inputs>
34 <param name="annotation" type="data" label="Annotation" format="gtf" />
35 <repeat name="reads_info" title="Reads" min="1" default="1">
36 <param name="reads" type="data" label="Reads" multiple="false" format="sam,bam" />
37 <param name="name" type="text" label="Sample name" value="sample_N" />
38 <param name="strand" type="select" label="Strand" multiple="false" >
39 <option value="U" selected="yes">unknown</option>
40 <option value="FR">forward-reverse (for paired-end reads)</option>
41 <option value="RF">reverse-forward (for paired-end reads)</option>
42 <option value="F">forward (for single-end reads)</option>
43 <option value="R">reverse (for single-end reads)</option>
44 </param>
45 </repeat>
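46 <param name="overlap" type="float" value="-1" label="Overlap type" help="&lt;0: the read must be included in the gene; between 0 and 1: fraction of the read that must overlap the gene; otherwise: minimum number of overlapping nucleotides" />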
47 <param name="gene_name" type="boolean" label="Print gene name instead of IDs" truevalue="-g" falsevalue="" help="Use gene names instead of gene IDs in the output file" />
48 <param name="count" type="integer" value="0" min="0" label="Count threshold" help="Do not display genes with fewer than N reads" />
49 <param name="merge" type="float" value="0.0" min="0.0" max="1.0" label="Merge threshold" help="Merge gene aggregate count with parent aggregate if count is low" />
50 </inputs>
51 <outputs>
52 <data name="output" format="txt" label="${tool.name} on ${on_string}" />
53 </outputs>
54 <tests>
55 <test>
56 <param name="annotation" value="test_mmquant_1.gtf" />
57 <param name="reads" value="test_mmquant_1.sam" />
58 <param name="name" value="test" />
59 <param name="strand" value="U" />
60 <output name="output" file="test_mmquant_1.txt" ftype="txt" />
61 </test>
62 </tests>
63 <help>
64 **Why use this tool?**
65
66 This tool counts the number of RNA-Seq reads per gene, much like HTSeq-count_ and featureCounts_. The main difference from these tools is how multi-mapping reads are counted: if a read maps to gene A, gene B, and gene C, the tool creates a new feature, "geneA--geneB--geneC", which is counted once.
67
68 .. _HTSeq-count: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
69 .. _featureCounts: http://bioinf.wehi.edu.au/featureCounts/
70
71 **Why does it matter?**
72
73 Recently, an article_ showed that RNA-Seq quantification tools are not accurate, which leads to errors when looking for differentially expressed genes. The authors suggest the method used here, which may not identify the individual genes that are differentially expressed (something that RNA-Seq alone cannot do), but does identify the groups of genes that are differentially expressed.
74
75 .. _article: http://www.genomebiology.com/2015/16/1/177
76
77 **Strands**
78
79 Strands can be:
80
81 * for paired-end reads: ``U`` (unknown), ``FR`` (forward-reverse), ``RF`` (reverse-forward), ``FF`` (forward-forward);
82
83 * for single-end reads: ``U`` (unknown), ``F`` (forward), ``R`` (reverse);
84
85 * Default: ``U``.
86
87
88 **Annotation file**
89
90 The annotation file should be in GTF format. GFF might work too. The tool only uses the gene, transcript, and exon feature types.
91
92
93 **Reads files**
94
95 The reads should be given in SAM or BAM format and be sorted by position. The reads can be single-end or paired-end (or a mixture thereof).
96
97 You can use samtools_ to sort them. This tool uses the NH tag (which gives the number of hits for each read; see the specification_), so make sure that your mapping tool sets it correctly (TopHat2_ and STAR_ do). You should also check how your mapping tool handles multi-mapping reads (this can usually be tuned with the appropriate parameters).
98
99 .. _samtools: http://www.htslib.org/
100 .. _specification: https://samtools.github.io/hts-specs/SAMv1.pdf
101 .. _TopHat2: http://ccb.jhu.edu/software/tophat/index.shtml
102 .. _STAR: https://github.com/alexdobin/STAR/releases
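
A quick way to check that an alignment file carries the NH tag is to look at the optional fields of a few SAM records. The Python sketch below works on plain SAM text lines and is not part of mmquant; the helper name ``nh_value`` and the example record are made up for illustration.

```python
def nh_value(sam_line):
    """Return the NH (number of hits) value of a SAM record, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("NH:i:"):
            return int(field[len("NH:i:"):])
    return None


# Hypothetical record: a 50-nt read reported with three hits.
mandatory = ["read_1", "0", "chr1", "100", "255", "50M", "*", "0", "0",
             "A" * 50, "I" * 50]
record = "\t".join(mandatory + ["NH:i:3"])
record_without_tag = "\t".join(mandatory)

print(nh_value(record))              # 3
print(nh_value(record_without_tag))  # None
```

If the helper returns ``None`` for the records of a file, the mapping tool probably did not set the tag.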
103
104
105 **Output file**
106
107 The output is a tab-separated file that can be used in edgeR or DESeq, for instance. If the user provides *n* reads files, the output contains *n+1* columns:
108
109 ============== ======== ======== ===
110 Gene           sample_1 sample_2 ...
111 ============== ======== ======== ===
112 gene_A         ...      ...      ...
113 gene_B         ...      ...      ...
114 gene_B--gene_C ...      ...      ...
115 ============== ======== ======== ===
116
117 The first column contains the gene IDs.
118 If a read maps to several genes (say, gene_B and gene_C), a new feature, gene_B--gene_C, is added to the table. The reads that can be mapped to these genes are counted there (and not in the gene_B or gene_C lines).
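
As a rough illustration of this counting rule, the following Python sketch groups reads by the set of genes that their hits overlap. It is not mmquant's implementation: the function name ``count_features``, the input layout, and the alphabetical ordering of gene IDs inside a merged feature are assumptions made for the example.

```python
from collections import Counter


def count_features(genes_per_read):
    """Count each read once, under the feature built from all its genes.

    genes_per_read maps a read name to the set of gene IDs that its hits
    overlap (hypothetical input layout, for illustration only).
    """
    counts = Counter()
    for genes in genes_per_read.values():
        if not genes:
            continue  # unassigned read: none of its hits overlaps a gene
        # Assumption: gene IDs are joined in sorted order with "--".
        counts["--".join(sorted(genes))] += 1
    return counts


reads = {
    "read_1": {"gene_A"},
    "read_2": {"gene_B", "gene_C"},
    "read_3": {"gene_C", "gene_B"},
    "read_4": set(),
}
print(dict(count_features(reads)))  # {'gene_A': 1, 'gene_B--gene_C': 2}
```

Here read_2 and read_3 are counted under gene_B--gene_C and contribute nothing to the gene_B or gene_C lines.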
119
120 With the ``Print gene name instead of IDs`` option, gene names are used instead of gene IDs. If two different genes have the same name, the systematic name is appended, as in ``Mat2a (ENSMUSG00000053907)``.
121
122 Note that the gene IDs and gene names should be given in the GTF file after the ``gene_id`` and ``gene_name`` tags respectively.
123
124 **Output stats**
125
126 The output stats are written to standard error.
127
128 The general shape is::
129
130     Results for sample_A:
131     # hits: N
132     # uniquely mapped reads: N (x%)
133     # ambiguous hits: N (x%)
134     # non-uniquely mapped hits: N (x%)
135     # unassigned hits: N (x%)
136
137 These figures mainly provide stats on hits; one read may have zero, one, or several hits. An ambiguous hit is a hit that overlaps several annotation features. A non-uniquely mapped hit belongs to a read that maps to several loci in the genome.
138
139 **Overlap**
140
141 Whether a read R is mapped to a gene A depends on the overlap value *n*:
142
143 ==================== ===============================================
144 if *n* is            then R is mapped to A iff
145 ==================== ===============================================
146 a negative value     R is included in A
147 a positive integer   they have at least *n* nucleotides in common
148 a float value (0, 1) a fraction *n* of the nucleotides of R is shared with A
149 ==================== ===============================================
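
The rule in this table can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the code used by mmquant: the read and the gene are each reduced to a single inclusive ``(start, end)`` interval (real reads may be spliced), the function name ``read_matches_gene`` is hypothetical, and the fraction case is read as "at least a fraction *n*".

```python
def read_matches_gene(read, gene, n):
    """Apply the overlap rule to two inclusive (start, end) intervals."""
    read_start, read_end = read
    gene_start, gene_end = gene
    read_length = read_end - read_start + 1
    # Number of positions shared by the two intervals (0 if disjoint).
    shared = max(0, min(read_end, gene_end) - max(read_start, gene_start) + 1)
    if 0 > n:
        # Negative value: the read must be included in the gene.
        return shared == read_length
    if 1 > n:
        # Fraction: assumed to mean "at least n times the read length".
        return shared > 0 and shared >= n * read_length
    # Otherwise: at least n nucleotides in common.
    return shared >= n


read = (100, 149)   # 50 nt
gene = (120, 200)   # shares 30 nt with the read
print(read_matches_gene(read, gene, -1))   # False: the read is not included
print(read_matches_gene(read, gene, 0.5))  # True: 30 nt is at least 25 nt
print(read_matches_gene(read, gene, 31))   # False: only 30 nt are shared
```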
150
151 **Merge Threshold**
152
153 Sometimes, there are very few reads that can be mapped unambiguously to a gene A, because it is very similar to gene B.
154
155 ============== ==========
156 Gene           sample_1
157 ============== ==========
158 gene_A         *x*
159 gene_B         *y*
160 gene_A--gene_B *z*
161 ============== ==========
162
163 In the table above, suppose that *x* &lt;&lt; *z*. In this case, you can move all the reads from gene_A to gene_A--gene_B using the merge threshold *t*, a float in (0, 1): if *x* &lt; *t* × *y*, the reads are transferred.
164
165 **Count Threshold**
166
167 If the maximum number of reads for a gene is less than the count threshold (a non-negative integer), then the corresponding line is discarded.
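
A minimal Python sketch of this filter, applied to an already loaded count table, could look like the following. The table layout (a dictionary mapping each feature to its per-sample counts) and the function name ``apply_count_threshold`` are assumptions made for the example, not part of the tool.

```python
def apply_count_threshold(table, threshold):
    """Keep the rows whose largest per-sample count reaches the threshold."""
    return {gene: counts for gene, counts in table.items()
            if max(counts) >= threshold}


# Hypothetical counts for two samples.
table = {
    "gene_A": [12, 30],
    "gene_B": [0, 2],
    "gene_B--gene_C": [5, 1],
}
print(apply_count_threshold(table, 5))
# {'gene_A': [12, 30], 'gene_B--gene_C': [5, 1]}
```

With the default threshold of 0, every line is kept.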
168
169
170 **Contact**
171
172 Comments? Suggestions? Do not hesitate to send me an email_.
173
174 .. _email: mailto:matthias.zytnicki@toulouse.inra.fr
175 </help>
176 <citations>
177 <citation type="bibtex">
178 @misc{bitbucketmmquant,
179 author = {Zytnicki, M.},
180 year = {2016},
181 title = {multi-mapping-counter},
182 publisher = {BitBucket},
183 journal = {BitBucket repository},
184 url = {https://bitbucket.org/mzytnicki/multi-mapping-counter},
185 }</citation>
186 </citations>
187 </tool>