view kmersvm/README.txt @ 7:fd740d515502 draft default tip

Uploaded revised kmer-SVM to include modules from kmer-visual.
author cafletezbrant
date Sun, 16 Jun 2013 18:06:14 -0400

DEPENDENCIES:
*************
KmerSVM requires the following software (to be installed in this order):

  Mac Users:
  1. Xcode (Mac App Store)
  2. Fortran compiler (http://gcc.gnu.org/wiki/GFortran/)
  
  Everyone:
  1. Swig (http://www.swig.org; needed specifically to install python_modular package from Shogun Toolbox)
  2. Numpy (numpy.scipy.org)
  3. Shogun Toolbox, v0.9.3 - v1.10 (http://www.shogun-toolbox.org/)  
  4. Bitarray (http://pypi.python.org/pypi/bitarray/)
  5. R (http://www.r-project.org)
  6. ROCR R Package (Available through CRAN)

Further, KmerSVM has been tested with Python 2.6 and 2.7 on Linux and Mac OS X.
At this time KmerSVM has not been tested on Windows.

Note that binaries are provided for Mac users.  However, if difficulties 
in installation are encountered, it may be beneficial to compile the 
Fortran compiler from source.  Additionally, be sure to add the location of 
your Shogun installation to the PYTHONPATH.
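
One way to confirm that the Python dependencies are visible on the PYTHONPATH is to try importing them. The following is a minimal sketch and not part of KmerSVM: the helper name missing_modules is hypothetical, and the module names in CANDIDATES are assumptions (in particular, the import name of Shogun's python_modular package may differ between Shogun versions, so adjust it to match your installation).

```python
# Assumed import names; adjust to your installation. The Shogun
# python_modular package may use a different module name.
CANDIDATES = ["numpy", "bitarray", "shogun"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_modules(CANDIDATES):
        print("Cannot import: " + name)
```

If a module is reported as missing even though it is installed, its location is probably not on the PYTHONPATH.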

REQUIRED FILES:
***************
Use the install.sh script to install many required files. Specifically:

sh run.sh /path/to/galaxy-dist/tools

For efficient access to genome-wide data, "Generate Null Sequence" and "Sequence Profiles" rely on binary files (indices) generated by the script nullseq_build_indices.py. Download the *.tar or *.zip files for each genome to be analyzed. To create indices for a specific genome, call nullseq_build_indices.py. For example:

python nullseq_build_indices.py mm8.zip mm8
          
Alternatively, we offer a handful of prepared index files, which can be downloaded from our website (www.beerlab.org/kmersvm.html) and then extracted.

Next, open the file tool-data/nullseq_indices.loc and add the path to the created indices following the instructions included in that file. For example, for the mm8, mm9, hg18 and hg19 genomes, you would add the following lines to nullseq_indices.loc:

mm8	Mouse(mm8)	/path/to/nullseq_indices_mm8
mm9	Mouse(mm9)	/path/to/nullseq_indices_mm9
hg18	Human(hg18)	/path/to/nullseq_indices_hg18
hg19	Human(hg19)	/path/to/nullseq_indices_hg19
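
Because the entries above are tab-separated, they are easy to get wrong when typed by hand in an editor that expands tabs to spaces. The following minimal sketch builds such lines programmatically. The helper name loc_line is hypothetical, and the three-column layout (build, display label, index directory) is inferred from the example above; the instructions inside nullseq_indices.loc remain the authority on the format.

```python
def loc_line(build, label, index_dir):
    """Build one tab-separated nullseq_indices.loc entry."""
    fields = [build, label, index_dir]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


if __name__ == "__main__":
    # Placeholder paths, as in the example above.
    for build, label in [("mm8", "Mouse(mm8)"), ("hg19", "Human(hg19)")]:
        print(loc_line(build, label, "/path/to/nullseq_indices_" + build))
```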

To generate FASTA files for training or scoring purposes, kmer-SVM uses the built-in Galaxy tool "Fetch Sequences", which looks for genomes in *.nib or *.2bit format. Download genomes related to your data and update the tool-data/alignseq.loc file to include the location of these genomes according to directions in that file. FASTA files can also be provided by the user. "Fetch Sequences" should be set up as follows:

Download 2bit files from the UCSC genome browser. For example,

http://hgdownload.cse.ucsc.edu/goldenPath/mm8/bigZips/mm8.2bit
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Add the following lines to galaxy-dist/tool-data/alignseq.loc

seq	mm8	/path/to/mm8.2bit
seq	mm9	/path/to/mm9.2bit
seq	hg18	/path/to/hg18.2bit
seq	hg19	/path/to/hg19.2bit
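
The download URLs and alignseq.loc entries above follow a regular pattern, so they can be generated for a list of genome builds. This is a minimal sketch: the helper names twobit_url and alignseq_line are hypothetical, and it assumes that other builds follow the same UCSC URL layout as the four listed above, which should be checked before downloading.

```python
# URL pattern taken from the examples above; other builds are assumed
# to follow the same layout.
UCSC_2BIT = "http://hgdownload.cse.ucsc.edu/goldenPath/{build}/bigZips/{build}.2bit"


def twobit_url(build):
    """Return the UCSC download URL for a build's 2bit file."""
    return UCSC_2BIT.format(build=build)


def alignseq_line(build, twobit_path):
    """Build one tab-separated alignseq.loc entry for a 2bit file."""
    return "\t".join(["seq", build, twobit_path])


if __name__ == "__main__":
    for build in ["mm8", "mm9", "hg18", "hg19"]:
        print(twobit_url(build))
        print(alignseq_line(build, "/path/to/" + build + ".2bit"))
```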

TOOL_CONF.XML:
**************
Add the following lines to tool_conf.xml:

  <section name="SVM Tools" id="kmersvm">
    <tool file="kmersvm/classify.xml"/>
    <tool file="kmersvm/nullseq.xml"/>
    <tool file="kmersvm/rocprcurve.xml"/>
    <tool file="kmersvm/train.xml"/>
    <tool file="kmersvm/split_genome.xml"/>
    <tool file="kmersvm/seqprofile.xml" />
    <tool file="kmersvm/kmertopwm.xml" />
    <tool file="kmersvm/tomtom.xml" />
  </section>
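
After editing tool_conf.xml, it can be useful to check that every tool file named in the section actually exists. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper names are hypothetical, and it assumes that the file attributes are relative to the Galaxy tools directory (for example /path/to/galaxy-dist/tools).

```python
import os
import xml.etree.ElementTree as ET


def section_tool_files(xml_text, section_id="kmersvm"):
    """List the tool file attributes inside the section with the given id."""
    root = ET.fromstring(xml_text)
    files = []
    # iter() also visits the root, so a bare <section> snippet works too.
    for section in root.iter("section"):
        if section.get("id") == section_id:
            files.extend(tool.get("file") for tool in section.iter("tool"))
    return files


def missing_tool_files(xml_text, tools_dir, section_id="kmersvm"):
    """Return the listed tool files that are not present under tools_dir."""
    return [name for name in section_tool_files(xml_text, section_id)
            if not os.path.isfile(os.path.join(tools_dir, name))]
```

For example, reading tool_conf.xml into a string and calling missing_tool_files(text, "/path/to/galaxy-dist/tools") returns an empty list when all of the listed files are in place.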

TOOL TESTS:
***********
Galaxy tools come with functional tests to determine if tools are operating correctly. To run tests on Galaxy tools, use the script run_functional_tests.sh. We offer tests for the tools "Train SVM", "Score Sequences of Interest" and "Split Genome".

IDs for kmer-SVM tests can be found by calling run_functional_tests.sh with the '-list' flag.

NON-GALAXY-BASED USAGE:
***********************
The KmerSVM suite can be run without using the Galaxy framework.  Each tool exists as
a standalone Python script (all located in /scripts) which can be called from the command
line.  Specific documentation can be found within each tool's Python file, or by calling
the script with no arguments.  A general workflow is described in 'kmer-SVM: a Web-based
Toolkit for the Computational Identification of Predictive Regulatory Sequence Features
in Genomic Datasets'.  It can be followed by calling each of the relevant Python scripts,
except that users will have to provide the needed FASTA files themselves.

A simple workflow for the KmerSVM suite is as follows:

	1. python nullseq_build_indices.py mm8.zip mm8
	2. python nullseq_generate.py sample_input.bed mm8 /path/to/mm8/indices #This 
           assumes no negative data set is available. Output will need to be converted to 
           FASTA. Skip if negative data is provided.
	3. python kmersvm_train.py positive.fa negative.fa #Outputs will be WEIGHTS, PREDICTIONS
	4. python split_genome.py input.bed #Skip if you already have a list of regions you 
           want to test. Output is test_seq.bed, which will need to be converted to FASTA.
	5. python kmersvm_classify.py weights.out test_seq.fa
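
The steps above can also be assembled in a small driver script. The following is a minimal sketch that only builds and prints the command lines; it does not run them, and it does not perform the BED-to-FASTA conversions required after steps 2 and 4. The function name workflow_commands and its parameter names are hypothetical, and the script names are taken from the steps above on the assumption that each carries the .py extension.

```python
def workflow_commands(genome, genome_zip, index_dir, input_bed,
                      positive_fa, negative_fa, regions_bed,
                      weights, test_fa):
    """Return the five workflow steps as argument lists, in order."""
    return [
        ["python", "nullseq_build_indices.py", genome_zip, genome],
        ["python", "nullseq_generate.py", input_bed, genome, index_dir],
        ["python", "kmersvm_train.py", positive_fa, negative_fa],
        ["python", "split_genome.py", regions_bed],
        ["python", "kmersvm_classify.py", weights, test_fa],
    ]


if __name__ == "__main__":
    steps = workflow_commands(
        "mm8", "mm8.zip", "/path/to/mm8/indices", "sample_input.bed",
        "positive.fa", "negative.fa", "input.bed",
        "weights.out", "test_seq.fa")
    for number, command in enumerate(steps, 1):
        print(str(number) + ". " + " ".join(command))
```

Each argument list can be passed to subprocess.check_call once the intermediate FASTA files exist; steps 2 and 4 can be dropped from the list when negative data or test regions are already available.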

Additionally, for any BED file, sequence composition (in terms of length, GC content and repeat fraction) can be obtained by calling make_profile.py as follows:

python make_profile.py input.bed mm8 /path/to/mm8/indices profile.out

Note that each tool has its own parameters, the manipulation of which allows the user to further customize their analysis. To learn more about a particular tool, simply call it without passing it any arguments.