<tool id="motiffinding_weeder2" name="Weeder2" version="2.0.0">
  <description>Motif discovery in sequences from coregulated genes of a single species</description>
  <command interpreter="bash">weeder2_wrapper.sh
  $sequence_file $species_code
  $output_motifs_file $output_matrix_file
  $strands
  #if $chipseq.use_chipseq
      -chipseq -top $chipseq.top
  #end if
  #if str( $advanced_options.advanced_options_selector ) == "on"
      -maxm $advanced_options.n_motifs_report
      -b $advanced_options.n_motifs_build
      -sim $advanced_options.sim_threshold
      -em $advanced_options.em_cycles
  #end if
  </command>
  <requirements>
    <requirement type="package" version="2.0">weeder</requirement>
  </requirements>
  <inputs>
    <param name="sequence_file" type="data" format="fasta" label="Input sequence" />
    <param name="species_code" type="select" label="Species to use for background comparison">
      <!-- Hard code options for now
           See weeder's "organisms.txt" for full list
      -->
      <option value="HS">Homo sapiens (HS)</option>
      <option value="MM">Mus musculus (MM)</option>
      <option value="DM">Drosophila melanogaster (DM)</option>
      <option value="SC">Saccharomyces cerevisiae (SC)</option>
      <option value="AT">Arabidopsis thaliana (AT)</option>
    </param>
    <param name="strands" label="Use both strands of sequence" type="boolean"
           truevalue="" falsevalue="-ss" checked="True"
           help="Uncheck to run in single strand mode (-ss)" />
    <conditional name="chipseq">
      <param name="use_chipseq" type="boolean"
             label="Use the ChIP-seq heuristic"
             help="Speeds up the computation (-chipseq)"
             truevalue="yes" falsevalue="no" checked="True" />
      <when value="yes">
        <param name="top" type="integer" value="100"
               label="Number of top input sequences with oligos to scan for"
               help="Increase this value to improve the chance of finding motifs enriched only in a subset of your input sequences (-top)" />
      </when>
      <when value="no"></when>
    </conditional>
    <conditional name="advanced_options">
      <param name="advanced_options_selector" type="select"
             label="Display advanced options">
        <option value="off">Hide</option>
        <option value="on">Display</option>
      </param>
      <when value="on">
        <param name="n_motifs_report" type="integer" value="25"
               label="Number of discovered motifs to report" help="(-maxm)" />
        <param name="n_motifs_build" type="integer" value="50"
               label="Number of top scoring motifs to build occurrences matrix profiles and outputs for"
               help="(-b)" />
        <param name="sim_threshold" type="float" min="0.0" max="1.0" value="0.95"
               label="Similarity threshold for the redundancy filter"
               help="Remove motifs that are too similar, with lower values imposing a stricter filter. Must be between 0.0 and 1.0 (-sim)" />
        <param name="em_cycles" type="integer" min="0" max="100" value="1"
               label="Number of expectation maximization (EM) cycles to perform"
               help="Number of cycles must be between 0 and 100 (-em)" />
      </when>
      <when value="off">
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data name="output_motifs_file" format="txt" label="Weeder2 on ${on_string} (motifs)" />
    <data name="output_matrix_file" format="txt" label="Weeder2 on ${on_string} (matrix)" />
  </outputs>
  <tests>
    <test>
      <param name="sequence_file" value="weeder_in.fa" ftype="fasta" />
      <param name="species_code" value="MM" />
      <output name="output_motifs_file" file="weeder2_motifs.out" lines_diff="2" />
      <output name="output_matrix_file" file="weeder2_matrix.out" />
    </test>
  </tests>
  <help>

.. class:: infomark

**What it does**

Weeder2 is a program for finding novel motifs (transcription factor binding sites)
conserved in a set of regulatory regions of related genes.

-------------

.. class:: infomark

**Usage advice**

Guidelines on how to use this tool are given in Zambelli et al. 2014 (see the link
under Credits); the following is a brief guide. Note that **motifs** are models
(matrices) describing a set of sequences that may differ in base composition,
whereas **oligos** are specific sequences found within the input sequences or the
genomic background.

**Input sequence** (in FASTA format) should be short (100-200bp) and be reasonably
expected to contain one or more enriched motifs. This is generally not an issue with
transcription factor ChIP-seq derived sequences centred on the summit of binding
regions, which are expected to contain a dominant motif and possibly secondary motifs.

There is **no need to mask repetitive sequence**, as factors may legitimately bind
repetitive sequence.

**Use both strands of sequence** by default, unless there is a specific reason not
to do so.

**Species to use for background comparison** should match the genome used to
generate the **input sequence**. The background genome motif frequencies are
generated from within the promoter regions of annotated genes and have been shown
to be a good background for both promoter and other regulatory regions.

**Use the ChIP-seq heuristic** (-chipseq) when there is a large number of
input sequences (hundreds or thousands). When -chipseq is used, Weeder builds motifs
using only oligos from the top input sequences (the first 100 by default, set with
-top) and then scans all of the input sequences with them. This reduces the
computation time without much risk of losing important motifs. Although not strictly
necessary, it is advisable to order the input sequences by significance, e.g. fold
enrichment or p-value. For large data sets -top should be set to at least 10 to 20%
of the number of input sequences (as recommended by the authors); for example, at
least 200 to 400 for 2,000 input sequences.
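
As a rough illustration, a ChIP-seq run with these settings could correspond to a
command line of the following shape. This is a sketch only: it assumes that the
``weeder2`` executable is called directly with its ``-f`` (input file) and ``-O``
(species code) arguments, ``peaks.fa`` is a hypothetical file of 2,000 mouse
sequences ordered by significance, and the wrapper script used by this tool may
assemble the command differently::

    weeder2 -f peaks.fa -O MM -chipseq -top 400

In Galaxy the same choices are made by selecting *Mus musculus (MM)* as the species,
enabling the ChIP-seq heuristic and entering 400 as the number of top input sequences.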

**Number of discovered motifs to report** (-maxm) limits the number of reported
motifs, even if more than -maxm are found. **Number of top scoring motifs to build
occurrences matrix profiles and outputs for** (-b) sets the number of top
scoring motifs of length 6, 8 and 10 for which the occurrence matrix is built.
Increasing -b may result in a larger number of reported motifs, but potentially
more of them will be of low significance, and the computation time increases. If
increasing -b does not result in more motifs in your results, the additional motifs
have either been removed by the redundancy filter or the maximum number of reported
motifs set by -maxm has been reached.

**Similarity threshold for the redundancy filter** (-sim): the default setting is
recommended.

**Number of expectation maximization (EM) cycles to perform** (-em): the default is
recommended. The option is included to help "clean up" the resulting motif matrices.
In this version the number of EM steps can be increased, which can be useful for
motifs with highly redundant stretches of sequence.

-------------

.. class:: infomark

**A note on the results**

The reported matrices are obtained by scanning (by default both strands) for
oligos of length 6, 8 and 10, allowing 1, 2 and 3 substitutions respectively. The
matrices within the matrix.w2 file can be input into other tools. The recommended
next step is to use **STAMP** (http://www.benoslab.pitt.edu/stamp/), which displays
the motifs as logos and identifies matches with libraries of known DNA binding
motifs, such as TRANSFAC or JASPAR.

-------------

.. class:: infomark

**Credits**

This Galaxy tool has been developed by Peter Briggs and Ian Donaldson within the
Bioinformatics Core Facility at the University of Manchester, and runs the Weeder2
motif discovery package:

* Zambelli, F., Pesole, G. and Pavesi, G. 2014. Using Weeder, Pscan, and PscanChIP
  for the Discovery of Enriched Transcription Factor Binding Site Motifs in
  Nucleotide Sequences. Current Protocols in Bioinformatics. 47:2.11:2.11.1–2.11.31.
* http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0211s47/full

This tool is compatible with Weeder 2.0:

* http://159.149.160.51/modtools/downloads/weeder2.html

Please kindly acknowledge this Galaxy tool, the Weeder package and the utility
scripts if you use them in your work.
  </help>
  <citations>
    <!--
    See https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax#A.3Ccitations.3E_tag_set
    Can be either DOI or Bibtex
    Use http://www.bioinformatics.org/texmed/ to convert PubMed to Bibtex
    -->
    <citation type="doi">10.1002/0471250953.bi0211s47</citation>
  </citations>
</tool>