annotate kmersvm/README.txt @ 7:fd740d515502 draft default tip

Uploaded revised kmer-SVM to include modules from kmer-visual.
author cafletezbrant
date Sun, 16 Jun 2013 18:06:14 -0400
parents 7fe1103032f7
DEPENDENCIES:
*************
KmerSVM requires the following software (to be installed in this order):

Mac Users:
1. Xcode (Mac App Store)
2. Fortran compiler (http://gcc.gnu.org/wiki/GFortran/)

Everyone:
1. Swig (http://www.swig.org; needed specifically to install the python_modular package from Shogun Toolbox)
2. Numpy (numpy.scipy.org)
3. Shogun Toolbox, v0.9.3 - v1.10 (http://www.shogun-toolbox.org/)
4. Bitarray (http://pypi.python.org/pypi/bitarray/)
5. R (http://www.r-project.org)
6. ROCR R Package (available through CRAN)

KmerSVM has been tested with Python 2.6 and 2.7 on Linux and Mac OS X.
At this time KmerSVM has not been tested on Windows.

Note that binaries are provided for Mac users. However, if difficulties
in installation are encountered, it may be beneficial to compile the
Fortran compiler from source. Additionally, be sure to add the location of
your Shogun installation to the PYTHONPATH.
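
Once the Python dependencies are installed, a quick import check can confirm that the interpreter can see them. The sketch below is illustrative only: the helper name missing_modules is hypothetical, and the import names numpy, bitarray and shogun are assumptions about how the packages listed above are exposed to Python. If shogun is reported as missing after installation, the Shogun location is most likely not on the PYTHONPATH.

```python
def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names for Numpy, Bitarray and Shogun's python_modular.
    required = ["numpy", "bitarray", "shogun"]
    absent = missing_modules(required)
    if absent:
        print("Missing Python modules: " + ", ".join(absent))
    else:
        print("All Python dependencies found.")
```

Swig, R and ROCR are not Python modules, so this check does not cover them.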

REQUIRED FILES:
***************
Use the install.sh script to install many required files. Specifically:

    sh run.sh /path/to/galaxy-dist/tools

For efficient access to genome-wide data, "Generate Null Sequence" and "Sequence Profiles" rely on binary index files generated by the script nullseq_build_indices.py. Download the *.tar or *.zip file for each genome to be analyzed, then call nullseq_build_indices.py to create the indices for that genome. For example:

    python nullseq_build_indices.py mm8.zip mm8

Alternatively, we offer a handful of prepared index files, which can be downloaded from our website (www.beerlab.org/kmersvm.html) and then extracted.

Next, open the file tool-data/nullseq_indices.loc and add the path to the created indices following the instructions included in that file. For example, for the mm8, mm9, hg18 and hg19 genomes, you would add the following lines to nullseq_indices.loc:

    mm8 Mouse(mm8) /path/to/nullseq_indices_mm8
    mm9 Mouse(mm9) /path/to/nullseq_indices_mm9
    hg18 Human(hg18) /path/to/nullseq_indices_hg18
    hg19 Human(hg19) /path/to/nullseq_indices_hg19
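
When several genomes are installed, the entries can be generated rather than typed by hand. The following is a minimal sketch, not part of the KmerSVM distribution: the function names are hypothetical, the columns are assumed to be tab-separated (the usual convention for Galaxy .loc files), and the directory naming simply mirrors the example entries. The instructions inside nullseq_indices.loc take precedence if they differ.

```python
def loc_line(build, label, index_dir):
    """Format one entry: build ID, display label, index directory."""
    return "\t".join([build, label, index_dir])


def nullseq_loc_lines(builds, base_dir):
    """Build one entry per (build, species) pair under base_dir."""
    lines = []
    for build, species in builds:
        label = "%s(%s)" % (species, build)
        index_dir = "%s/nullseq_indices_%s" % (base_dir.rstrip("/"), build)
        lines.append(loc_line(build, label, index_dir))
    return lines


if __name__ == "__main__":
    builds = [("mm8", "Mouse"), ("mm9", "Mouse"),
              ("hg18", "Human"), ("hg19", "Human")]
    # Print the entries so they can be pasted into nullseq_indices.loc.
    for line in nullseq_loc_lines(builds, "/path/to"):
        print(line)
```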

To generate FASTA files for training or scoring purposes, kmer-SVM uses the built-in Galaxy tool "Fetch Sequences", which looks for genomes in *.nib or *.2bit format. Download the genomes related to your data and update the tool-data/alignseq.loc file to include their locations, following the directions in that file. FASTA files can also be provided by the user. "Fetch Sequences" should be set up as follows:

Download 2bit files from the UCSC genome browser. For example:

    http://hgdownload.cse.ucsc.edu/goldenPath/mm8/bigZips/mm8.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Add the following lines to galaxy-dist/tool-data/alignseq.loc:

    seq mm8 /path/to/mm8.2bit
    seq mm9 /path/to/mm9.2bit
    seq hg18 /path/to/hg18.2bit
    seq hg19 /path/to/hg19.2bit
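
The download and the matching alignseq.loc entry can be scripted together. The sketch below is a hedged illustration: the function names are hypothetical, the URL pattern is inferred from the four example URLs above, and the alignseq.loc columns are assumed to be tab-separated. Nothing is downloaded unless fetch_genome is called explicitly.

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

UCSC_BASE = "http://hgdownload.cse.ucsc.edu/goldenPath"


def twobit_url(build):
    """URL of the 2bit file for a UCSC build, following the examples above."""
    return "%s/%s/bigZips/%s.2bit" % (UCSC_BASE, build, build)


def alignseq_line(build, twobit_path):
    """Format one alignseq.loc entry (assumed tab-separated)."""
    return "\t".join(["seq", build, twobit_path])


def fetch_genome(build, dest_dir):
    """Download <build>.2bit into dest_dir and return its alignseq.loc entry."""
    dest = os.path.join(dest_dir, build + ".2bit")
    urlretrieve(twobit_url(build), dest)
    return alignseq_line(build, dest)
```

For example, fetch_genome("mm8", "/path/to") would save /path/to/mm8.2bit and return the line to append to alignseq.loc. Whole-genome 2bit files are large, so the download may take a while.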

TOOL_CONF.XML:
**************
Add the following lines to tool_conf.xml:

    <section name="SVM Tools" id="kmersvm">
        <tool file="kmersvm/classify.xml"/>
        <tool file="kmersvm/nullseq.xml"/>
        <tool file="kmersvm/rocprcurve.xml"/>
        <tool file="kmersvm/train.xml"/>
        <tool file="kmersvm/split_genome.xml"/>
        <tool file="kmersvm/seqprofile.xml"/>
        <tool file="kmersvm/kmertopwm.xml"/>
        <tool file="kmersvm/tomtom.xml"/>
    </section>
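
After editing tool_conf.xml, it can be useful to confirm that every kmer-SVM tool is actually registered. The following is a small sketch using Python's standard xml.etree.ElementTree module. The function name and the EXPECTED_TOOLS list are illustrative, with the list copied from the section above, and the check assumes the "kmersvm" section sits somewhere inside the file's root element.

```python
import xml.etree.ElementTree as ET

# Tool files listed in the "SVM Tools" section above.
EXPECTED_TOOLS = [
    "kmersvm/classify.xml",
    "kmersvm/nullseq.xml",
    "kmersvm/rocprcurve.xml",
    "kmersvm/train.xml",
    "kmersvm/split_genome.xml",
    "kmersvm/seqprofile.xml",
    "kmersvm/kmertopwm.xml",
    "kmersvm/tomtom.xml",
]


def missing_kmersvm_tools(tool_conf_xml, section_id="kmersvm"):
    """Return the expected tool files that the given section does not list."""
    root = ET.fromstring(tool_conf_xml)
    listed = set()
    for section in root.iter("section"):
        if section.get("id") == section_id:
            for tool in section.findall("tool"):
                listed.add(tool.get("file"))
    return [path for path in EXPECTED_TOOLS if path not in listed]
```

Passing the contents of tool_conf.xml as a string returns an empty list when all eight tools are present, and otherwise the tool files still to be added.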

Tool Tests:
***********
Galaxy tools come with functional tests to determine whether they are operating correctly. To run tests on Galaxy tools, use the script run_functional_tests.sh. We offer tests for the tools "Train SVM", "Score Sequences of Interest" and "Split Genome".

IDs for kmer-SVM tests can be found by calling run_functional_tests.sh with the '-list' flag.

Non-Galaxy-Based Usage:
***********************
The KmerSVM suite can be run without the Galaxy framework. Each tool exists as
a standalone Python script (all located in /scripts) which can be called from the
command line. Specific documentation can be found within each tool's Python file,
or by calling the script with no arguments. A general workflow is described in
'kmer-SVM: a Web-based Toolkit for the Computational Identification of Predictive
Regulatory Sequence Features in Genomic Datasets'. It can be followed by calling
each of the relevant Python scripts, with the exception that users will have to
provide the needed FASTA files themselves.

A simple workflow for the KmerSVM suite is as follows:

1. python nullseq_build_indices.py mm8.zip mm8
2. python nullseq_generate.py sample_input.bed mm8 /path/to/mm8/indices #Use this
   step when no negative data set is available. Output will need to be converted
   to FASTA. Skip if negative data is provided.
3. python kmersvm_train.py positive.fa negative.fa #Outputs will be WEIGHTS, PREDICTIONS
4. python split_genome.py input.bed #Skip if you already have a list of regions you
   want to test. Output is test_seq.bed, which will need to be converted to FASTA.
5. python kmersvm_classify.py weights.out test_seq.fa
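
The five steps above can also be driven from a small Python wrapper. This is a sketch under stated assumptions, not a script shipped with KmerSVM: the function names are hypothetical, the file names are the placeholders used in the steps above, and the script in step 2 is assumed to be named nullseq_generate.py. The BED-to-FASTA conversions required after steps 2 and 4 are not covered, so the steps are meant to be run one at a time rather than as an unattended pipeline.

```python
import subprocess


def workflow_commands(genome, archive, index_dir, python="python"):
    """Return the argument lists for workflow steps 1-5, in order."""
    return [
        [python, "nullseq_build_indices.py", archive, genome],
        [python, "nullseq_generate.py", "sample_input.bed", genome, index_dir],
        [python, "kmersvm_train.py", "positive.fa", "negative.fa"],
        [python, "split_genome.py", "input.bed"],
        [python, "kmersvm_classify.py", "weights.out", "test_seq.fa"],
    ]


def run_step(cmd, scripts_dir="."):
    """Run one step from scripts_dir; raises CalledProcessError on failure."""
    subprocess.check_call(cmd, cwd=scripts_dir)


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in workflow_commands("mm8", "mm8.zip", "/path/to/mm8/indices"):
        print(" ".join(cmd))
```

Passing the commands as argument lists rather than a single shell string avoids quoting problems when paths contain spaces.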

Additionally, for any BED file, sequence composition (in terms of length, GC content and repeat fraction) can be obtained by calling make_profile.py as follows:

    python make_profile.py input.bed mm8 /path/to/mm8/indices profile.out

Note that each tool has its own parameters, which allow the user to further customize the analysis. To learn more about a particular tool, simply call it without passing it any arguments.