Next changeset 1:dcf98c713e4a (2012-08-20) |
Commit message:
Uploaded |
added:
kmersvm/._install.sh kmersvm/.gitignore kmersvm/README.txt kmersvm/classify.xml kmersvm/install.sh kmersvm/kmersvm_output_weights.out kmersvm/nullseq.xml kmersvm/prcurve.xml kmersvm/r_wrapper.sh kmersvm/roccurve.xml kmersvm/rocprcurve.xml kmersvm/scripts/kmersvm_classify.py kmersvm/scripts/kmersvm_train.py kmersvm/scripts/libkmersvm.py kmersvm/scripts/libkmersvm.pyc kmersvm/scripts/make_profile.py kmersvm/scripts/nullseq_build_indices.py kmersvm/scripts/nullseq_generate.py kmersvm/scripts/split_genome.py kmersvm/seqprofile.xml kmersvm/split_genome.xml kmersvm/tool-data/classify_output.out kmersvm/tool-data/classify_test.fa kmersvm/tool-data/nullseq_indices.loc.sample kmersvm/tool-data/sample_roc_chen.png kmersvm/tool-data/test_negative.fa kmersvm/tool-data/test_positive.fa kmersvm/tool-data/test_weights.out kmersvm/tool-data/train_predictions.out kmersvm/train.xml |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/._install.sh |
Binary file kmersvm/._install.sh has changed |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/.gitignore --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/.gitignore Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,1 @@ +scripts/libkmersvm.pyc |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/README.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/README.txt Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,104 @@
+DEPENDENCIES:
+*************
+KmerSVM requires the following software (to be installed in this order):
+
+ Mac Users:
+ 1. Xcode (Mac App Store)
+ 2. Fortran compiler (http://gcc.gnu.org/wiki/GFortran/)
+
+ Everyone:
+ 1. Swig (http://www.swig.org; needed specifically to install python_modular package from Shogun Toolbox)
+ 2. Numpy (numpy.scipy.org)
+ 3. Shogun Toolbox, v0.9.3 - v1.10 (http://www.shogun-toolbox.org/)
+ 4. Bitarray (http://pypi.python.org/pypi/bitarray/)
+ 5. R (http://www.r-project.org)
+ 6. ROCR R Package (Available through CRAN)
+
+KmerSVM has been tested on Python 2.6 and 2.7 on Linux and Mac OS X.
+At this time KmerSVM has not been tested on Windows.
+
+Note that binaries are provided for Mac users. However, if difficulties
+in installation are encountered, it may be beneficial to compile the
+Fortran compiler from source. Additionally, be sure to add the location of
+your Shogun installation to the PYTHONPATH.
+
+REQUIRED FILES:
+***************
+Use the install.sh script to install many required files. Specifically:
+
+sh install.sh /path/to/galaxy-dist/tools
+
+For efficient access to genome-wide data, "Generate Null Sequence" and "Sequence Profiles" rely on binary files (indices) generated by the script nullseq_build_indices.py. Download the *.tar or *.zip files for each genome to be analyzed. To create indices for a specific genome, call nullseq_build_indices.py. For example:
+
+python nullseq_build_indices.py mm8.zip mm8
+
+Alternatively, we offer a handful of prepared index files, which can be downloaded from our website (www.beerlab.org/kmersvm.html) and then extracted.
+
+Next, open the file tool-data/nullseq_indices.loc and add the path to the created indices following the instructions included in that file. For the genomes listed above, you would add the following lines to nullseq_indices.loc:
+
+mm8 Mouse(mm8) /path/to/nullseq_indices_mm8
+mm9 Mouse(mm9) /path/to/nullseq_indices_mm9
+hg18 Human(hg18) /path/to/nullseq_indices_hg18
+hg19 Human(hg19) /path/to/nullseq_indices_hg19
+
+To generate FASTA files for training or scoring purposes, kmer-SVM uses the built-in Galaxy tool "Fetch Sequences", which looks for genomes in *.nib or *.2bit format. Download genomes related to your data and update the tool-data/alignseq.loc file to include the location of these genomes according to directions in that file. FASTA files can also be provided by the user. "Fetch Sequences" should be set up as follows:
+
+Download 2bit files from the UCSC genome browser. For example,
+
+http://hgdownload.cse.ucsc.edu/goldenPath/mm8/bigZips/mm8.2bit
+http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit
+http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit
+http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
+
+Add the following lines to galaxy-dist/tool-data/alignseq.loc:
+
+seq mm8 /path/to/mm8.2bit
+seq mm9 /path/to/mm9.2bit
+seq hg18 /path/to/hg18.2bit
+seq hg19 /path/to/hg19.2bit
+
+TOOL_CONF.XML:
+**************
+Add the following lines to tool_conf.xml:
+
+ <section name="SVM Tools" id="kmersvm">
+   <tool file="kmersvm/classify.xml"/>
+   <tool file="kmersvm/nullseq.xml"/>
+   <tool file="kmersvm/rocprcurve.xml"/>
+   <tool file="kmersvm/train.xml"/>
+   <tool file="kmersvm/split_genome.xml"/>
+   <tool file="kmersvm/seqprofile.xml" />
+ </section>
+
+Tool Tests:
+***********
+Galaxy tools come with functional tests to determine if tools are operating correctly. To run tests on Galaxy tools, use the script run_functional_tests.sh. We offer tests for the tools "Train SVM", "Score Sequences of Interest" and "Split Genome".
+
+IDs for kmer-SVM tests can be found by calling run_functional_tests.sh with the '-list' flag.
+
+Non-Galaxy-Based Usage:
+***********************
+The KmerSVM suite can be run without using the Galaxy framework. Each tool exists as
+a standalone Python script (all located in /scripts) which can be called from the command
+line. Specific documentation can be found within each tool's Python file, or by calling
+the script with no arguments. A general workflow can be found in 'kmer-SVM: a Web-based Toolkit for the Computational
+Identification of Predictive Regulatory Sequence Features in Genomic Datasets',
+which can be followed by calling each of the relevant Python scripts,
+with the exception that users will have to provide needed FASTA files themselves.
+
+A simple workflow for the KmerSVM suite is as follows:
+
+ 1. python nullseq_build_indices.py mm8.zip mm8
+ 2. python nullseq_generate.py sample_input.bed mm8 /path/to/mm8/indices #This
+ assumes no negative data sets. Output will need to be converted to FASTA. Skip if
+ negative data is provided.
+ 3. python kmersvm_train.py positive.fa negative.fa #Outputs will be WEIGHTS, PREDICTIONS
+ 4. python split_genome.py input.bed #Skip if you already have a list of regions you want to
+ test. Output is test_seq.bed, which will need to be converted to FASTA.
+ 5. python kmersvm_classify.py weights.out test_seq.fa
+
+Additionally, for any BED file, sequence composition (in terms of length, GC content and repeat fraction) can be obtained by calling make_profile.py as follows:
+
+python make_profile.py input.bed mm8 /path/to/mm8/indices profile.out
+
+Note that each tool has its own parameters, the manipulation of which allows the user to further customize their analysis. To learn more about a particular tool, simply call it without passing it any arguments.
diff -r 000000000000 -r 7fe1103032f7 kmersvm/classify.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/classify.xml Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,39 @@ +<tool id="kmersvm_classify" name="Score Sequences of Interest"> + <description>using SVM weights</description> + <command interpreter="python">scripts/kmersvm_classify.py -q $inputA $inputB</command> + <inputs> + <param format="tabular" name="inputA" type="data" label="SVM Weights"/> + <param format="fasta" name="inputB" type="data" label="Test Sequences"/> + </inputs> + <outputs> + <data format="tabular" name="kmersvm_scores.out" from_work_dir="kmersvm_scores.out" /> + </outputs> + <tests> + <test> + <param name="inputA" value="test_weights.out" /> + <param name="inputB" value="classify_test.fa" /> + <output name="output" file="classify_output.out" /> + </test> + </tests> + <help> + +**What it does** + +Takes as input one file of weights generated by Train SVM and one FASTA file containing sequences to be predicted. + +Returns a file containing the names of the input sequences, as well as the scores and posterior probabilities of the input sequences. + +---- + +**Example** + +Scores file:: + + #seq_id posterior_prob svm_score + mm8_chr1_3089935_3090035_+ 0.042414638227 -2.13990367846 + mm8_chr1_5031335_5031435_+ 0.351943600792 -0.478063299876 + mm8_chr1_5103742_5103842_+ 0.194625711202 -1.01493730026 + mm8_chr1_5650372_5650472_+ 0.105376843506 -1.49141463695 + + </help> +</tool> |
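The help text above says the tool reports both an SVM score and a posterior probability for each sequence, and the weights file carries `A` and `B` header parameters. A common way to turn an SVM score into a probability is a Platt-style sigmoid. The sketch below assumes that form; the exact formula used by kmersvm_classify.py is not shown in this changeset excerpt, and the function name and the parameter values in the usage lines are hypothetical. It is written for Python 3, unlike the Python 2 scripts in this repository.

```python
import math


def posterior_from_score(score, A, B):
    """Platt-style sigmoid (an assumption, not confirmed by the source):
    P(positive | score) = 1 / (1 + exp(A * score + B)).
    With A < 0, a higher SVM score gives a higher posterior probability.
    """
    return 1.0 / (1.0 + math.exp(A * score + B))


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    A, B = -1.5, 0.1
    for s in (-2.0, 0.0, 2.0):
        print(s, posterior_from_score(s, A, B))
```

Under this assumed form, a score of 0 with `B = 0` maps to a posterior of 0.5, and the sign of `A` controls whether the mapping increases or decreases with the score.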
diff -r 000000000000 -r 7fe1103032f7 kmersvm/install.sh --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/install.sh Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,12 @@
+#!/bin/bash
+cd "$1"
+cp tool-data/nullseq_indices.loc.sample ../../tool-data/nullseq_indices.loc
+cp tool-data/sample_roc_chen.png ../../tool-data
+cp tool-data/classify_output.out ../../test-data
+cp tool-data/classify_test.fa ../../test-data
+cp tool-data/kmersvm_output_weights.out ../../test-data
+cp tool-data/test_positive.fa ../../test-data
+cp tool-data/test_negative.fa ../../test-data
+cp tool-data/test_weights.out ../../test-data
+cp tool-data/train_predictions.out ../../test-data
+
diff -r 000000000000 -r 7fe1103032f7 kmersvm/kmersvm_output_weights.out --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/kmersvm_output_weights.out Mon Aug 20 18:07:22 2012 -0400 |
b'@@ -0,0 +1,2088 @@\n+#parameters:\n+#kernel=1\n+#kmerlen=6\n+#bias=4.55176901365\n+#A=-5.16773433695\n+#B=0.626781410205\n+#NOTE: k-mers with large negative weights are also important. They can be found at the bottom of the list.\n+#k-mer\trevcomp\tSVM-weight\n+AGCCCA\tTGGGCT\t0.204489955998\n+GGGTCA\tTGACCC\t0.197932576597\n+CTTTCA\tTGAAAG\t0.175623735339\n+ATTGGG\tCCCAAT\t0.151810241272\n+GGGGGA\tTCCCCC\t0.151508121036\n+CTAGTC\tGACTAG\t0.141094510944\n+AATGGC\tGCCATT\t0.134715758171\n+ATATGG\tCCATAT\t0.130371158981\n+GAATGC\tGCATTC\t0.11612142844\n+GCAGGA\tTCCTGC\t0.113158387991\n+ATGAAA\tTTTCAT\t0.107073436845\n+ATCACA\tTGTGAT\t0.106708488588\n+AATGCT\tAGCATT\t0.106604642199\n+AGTAGC\tGCTACT\t0.10403321557\n+ACTCAG\tCTGAGT\t0.101925776289\n+AATCAC\tGTGATT\t0.100561967001\n+AGTGGC\tGCCACT\t0.0987645386645\n+CCGTTC\tGAACGG\t0.0977433807672\n+CAATAC\tGTATTG\t0.0963227791486\n+GTGTTA\tTAACAC\t0.0941324907337\n+ATGCTA\tTAGCAT\t0.0891252806682\n+GCACCA\tTGGTGC\t0.0833993191799\n+CTCTTC\tGAAGAG\t0.0833356147621\n+ACTGGA\tTCCAGT\t0.0826523294856\n+ATTCAG\tCTGAAT\t0.0826469660836\n+AAACTG\tCAGTTT\t0.0824586682644\n+CTAACA\tTGTTAG\t0.0803769439042\n+TACGTA\tTACGTA\t0.0785876810566\n+GTTAAC\tGTTAAC\t0.0766029391348\n+ATCTGC\tGCAGAT\t0.0744400325764\n+CGTGTA\tTACACG\t0.0736219161694\n+ATAATC\tGATTAT\t0.0716973130728\n+GCAAAC\tGTTTGC\t0.0695793025764\n+AGTTGA\tTCAACT\t0.0691369360006\n+GTGGCA\tTGCCAC\t0.0677913523838\n+AACGGG\tCCCGTT\t0.0665866154179\n+CCGATA\tTATCGG\t0.0658721974601\n+ACGAAC\tGTTCGT\t0.063393292604\n+ACCGCT\tAGCGGT\t0.0631062404984\n+AACTGA\tTCAGTT\t0.0624442243223\n+CTCCCC\tGGGGAG\t0.0617640195469\n+AACGGA\tTCCGTT\t0.0600761019255\n+GTGTAA\tTTACAC\t0.058093173595\n+AAAGAC\tGTCTTT\t0.0549056210165\n+AAAACG\tCGTTTT\t0.0544901990994\n+ACGTAG\tCTACGT\t0.0534270793153\n+AGGATA\tTATCCT\t0.0525132986043\n+AGATGC\tGCATCT\t0.051919163559\n+GGCTAC\tGTAGCC\t0.051032183915\n+ATATCC\tGGATAT\t0.049569079902\n+GCTGAC\tGTCAGC\t0.0480503870262\n+ATAAGC\tGCTTAT\t0.0477780
006604\n+CGATAG\tCTATCG\t0.0474142135194\n+CGAACC\tGGTTCG\t0.0450679013004\n+GCGCTA\tTAGCGC\t0.0447160328786\n+CAGCAA\tTTGCTG\t0.0444668746282\n+AAGCGT\tACGCTT\t0.0443654134896\n+ATTCCA\tTGGAAT\t0.0427610195292\n+CGTATA\tTATACG\t0.040480341687\n+TCGCCA\tTGGCGA\t0.0381275185187\n+CGCCAC\tGTGGCG\t0.0381153724231\n+CACGTA\tTACGTG\t0.0367905774416\n+CGTTAC\tGTAACG\t0.0357945076752\n+CATAGG\tCCTATG\t0.0345316831939\n+ACTAGG\tCCTAGT\t0.0344571169395\n+CACGAC\tGTCGTG\t0.0335787198929\n+CAGTCA\tTGACTG\t0.0334317180933\n+GGACCA\tTGGTCC\t0.0331452341451\n+ATCGGG\tCCCGAT\t0.0319104073013\n+CATTTC\tGAAATG\t0.0316315013156\n+TGCGAA\tTTCGCA\t0.0313766408202\n+CCTTTC\tGAAAGG\t0.0305787075338\n+CGCTAA\tTTAGCG\t0.0286393053316\n+CTTGGA\tTCCAAG\t0.0283882448584\n+ACGGAG\tCTCCGT\t0.0263310387396\n+ACTATA\tTATAGT\t0.026096673755\n+AGACTA\tTAGTCT\t0.0255670282096\n+CATGCG\tCGCATG\t0.025093529092\n+CAAAGG\tCCTTTG\t0.0249105903471\n+CCCTAC\tGTAGGG\t0.0240409749126\n+AAGCAG\tCTGCTT\t0.0234080284895\n+CAGGAA\tTTCCTG\t0.0227383507001\n+GCGAAA\tTTTCGC\t0.0220965445081\n+CGGCAA\tTTGCCG\t0.0203327541402\n+AGCACG\tCGTGCT\t0.0201805945132\n+ACACCA\tTGGTGT\t0.018834321513\n+CGTCCA\tTGGACG\t0.0171607213388\n+CGCTTA\tTAAGCG\t0.016104619937\n+CATCCC\tGGGATG\t0.0153163274607\n+TAGCGA\tTCGCTA\t0.014927280137\n+ACGTAA\tTTACGT\t0.0143485678563\n+GCCAGA\tTCTGGC\t0.0135057694173\n+CCAATA\tTATTGG\t0.0132879448137\n+CGAGCA\tTGCTCG\t0.0132861898017\n+TCGGCA\tTGCCGA\t0.013107009308\n+AAACGC\tGCGTTT\t0.0130861069071\n+CGACAA\tTTGTCG\t0.0129963879068\n+CGTACG\tCGTACG\t0.0126100467086\n+AGTTTC\tGAAACT\t0.0125655312936\n+ACAGAA\tTTCTGT\t0.0124832814472\n+ACCGAT\tATCGGT\t0.0115634183099\n+ATACGC\tGCGTAT\t0.0115414242443\n+ACGTAT\tATACGT\t0.0109206646956\n+CCGTAA\tTTACGG\t0.010783715864\n+CTATAC\tGTATAG\t0.0098800882502\n+CTACTA\tTAGTAG\t0.00984599515827\n+TGGCCA\tTGGCCA\t0.00945901345338\n+ACTCGA\tTCGAGT\t0.00944831576325\n+GTGGTA\tTACCAC\t0.00924035544748\n+AGTCGG\tCCGACT\t0.00923294701999\n+AGGTCG\tCGACCT\t0.0088
0255008466\n+CAGAAG\tCTTCTG\t0.00807064752353\n+ACGGTC\tGACCGT\t0.00775345622564\n+ACCGCA\tTGCGGT\t0.00731310264273\n+GACGAA\tTTCGTC\t0.0058533367428\n+AGAGGG\tCCCTCT\t0.00516463740642\n+CAGTAC\tGTACTG\t0.00495695451148\n+CATGTC\tGACATG\t0.00424738864511\n+ACTAGT\tACTAGT\t0.0039246326772\n+CATACG\tCGTATG\t0.00363143955983\n+CAATCG\tCGATTG\t0.00335730780355\n+ACGGAA\tTTCCGT\t'..b'\t-0.464147856205\n+AGATAG\tCTATCT\t-0.4645662219\n+AGCTGG\tCCAGCT\t-0.466210061091\n+GCCTCA\tTGAGGC\t-0.466366480048\n+AAATCT\tAGATTT\t-0.466498531963\n+CTAAGA\tTCTTAG\t-0.466866257668\n+ACTGCA\tTGCAGT\t-0.467392184239\n+AGAATA\tTATTCT\t-0.467494051938\n+AGCTAG\tCTAGCT\t-0.468669137059\n+CTGAAA\tTTTCAG\t-0.469520600226\n+CTCCAC\tGTGGAG\t-0.469631389825\n+TTAAAA\tTTTTAA\t-0.470071963663\n+GAGGAA\tTTCCTC\t-0.471268004925\n+GCAGGC\tGCCTGC\t-0.472607066849\n+ACATAT\tATATGT\t-0.473122400685\n+ACTACT\tAGTAGT\t-0.475912548788\n+TGGGAA\tTTCCCA\t-0.476327805909\n+ATGTAA\tTTACAT\t-0.477301981844\n+TGAGAA\tTTCTCA\t-0.478195982128\n+AGAGGA\tTCCTCT\t-0.478236282205\n+AAGGGC\tGCCCTT\t-0.478649824889\n+ATAAAA\tTTTTAT\t-0.480366473314\n+CCTCAG\tCTGAGG\t-0.486947102648\n+ACCCCA\tTGGGGT\t-0.489044936742\n+CAGGGC\tGCCCTG\t-0.489371070076\n+AAGTGA\tTCACTT\t-0.490189148588\n+CTCAAA\tTTTGAG\t-0.491292679725\n+CATAAA\tTTTATG\t-0.492558330842\n+AGCTTC\tGAAGCT\t-0.492977867855\n+CCAGGC\tGCCTGG\t-0.494836663537\n+AGGCCA\tTGGCCT\t-0.499232766241\n+ATGATA\tTATCAT\t-0.499336314669\n+TGCACA\tTGTGCA\t-0.499622032626\n+GCTCCA\tTGGAGC\t-0.501147185261\n+AAACAC\tGTGTTT\t-0.501710637649\n+CCTGCC\tGGCAGG\t-0.504191752598\n+TATATA\tTATATA\t-0.505357168941\n+AGACAG\tCTGTCT\t-0.509216922526\n+CTCACA\tTGTGAG\t-0.509874281285\n+TACACA\tTGTGTA\t-0.510187805254\n+ATAGAA\tTTCTAT\t-0.513097605851\n+ATTTTG\tCAAAAT\t-0.517076397245\n+GCAGCC\tGGCTGC\t-0.518966157555\n+GGGAAA\tTTTCCC\t-0.519849860491\n+CTGTAA\tTTACAG\t-0.520220452943\n+AGGAGC\tGCTCCT\t-0.523931216202\n+AGTGGA\tTCCACT\t-0.525970888551\n+AAAACA\tTGTTTT\t-0.52724215949\n+CAG
GGG\tCCCCTG\t-0.527565892279\n+ATTACA\tTGTAAT\t-0.527725996886\n+AGGCAT\tATGCCT\t-0.527768805152\n+CCACAG\tCTGTGG\t-0.531331484967\n+CACATA\tTATGTG\t-0.532904518918\n+GTCTGA\tTCAGAC\t-0.533629145278\n+AGGAGG\tCCTCCT\t-0.533787326344\n+TCCTCA\tTGAGGA\t-0.534492706867\n+CCACCC\tGGGTGG\t-0.535476719082\n+AGATGG\tCCATCT\t-0.536726922644\n+AATGCC\tGGCATT\t-0.536932370356\n+GAAGGA\tTCCTTC\t-0.537501586575\n+ATTTTA\tTAAAAT\t-0.537752193064\n+ATTGTG\tCACAAT\t-0.53866469212\n+ATCATC\tGATGAT\t-0.542500395118\n+CTAGAA\tTTCTAG\t-0.543296815603\n+CCACCA\tTGGTGG\t-0.543549537686\n+ACAGAT\tATCTGT\t-0.546614586045\n+AGCAAC\tGTTGCT\t-0.54905905052\n+GATAGA\tTCTATC\t-0.552299656009\n+ATTCAA\tTTGAAT\t-0.555982433675\n+AGGGCA\tTGCCCT\t-0.556041638936\n+CACCCA\tTGGGTG\t-0.556174561485\n+AGTGAC\tGTCACT\t-0.556988157387\n+ACCAAA\tTTTGGT\t-0.557347982652\n+AGCAGC\tGCTGCT\t-0.559196748919\n+TAGATA\tTATCTA\t-0.560235158887\n+TAGAAA\tTTTCTA\t-0.568940253825\n+CAGGCA\tTGCCTG\t-0.57188370955\n+CCTCTC\tGAGAGG\t-0.574423797857\n+CTCCCA\tTGGGAG\t-0.575493680427\n+CCCGCC\tGGCGGG\t-0.578843460425\n+ACAAAC\tGTTTGT\t-0.578930662863\n+GACACA\tTGTGTC\t-0.579580526104\n+CAGCAG\tCTGCTG\t-0.581994519837\n+ATCCAT\tATGGAT\t-0.584776542951\n+CATATA\tTATATG\t-0.592857243311\n+AAACAA\tTTGTTT\t-0.594904466081\n+CTGGGA\tTCCCAG\t-0.602160214568\n+ATAGAT\tATCTAT\t-0.605460847829\n+ACACAT\tATGTGT\t-0.606534828064\n+ATTTAA\tTTAAAT\t-0.611717314678\n+ACTGGG\tCCCAGT\t-0.612469560409\n+AAAATG\tCATTTT\t-0.613223611287\n+ATATAT\tATATAT\t-0.613765305776\n+CGCCCC\tGGGGCG\t-0.620062967118\n+GAAAGA\tTCTTTC\t-0.626959807223\n+CCTCCC\tGGGAGG\t-0.637444233428\n+ACAGGA\tTCCTGT\t-0.640692486716\n+AATAAA\tTTTATT\t-0.641911867715\n+AGAAAG\tCTTTCT\t-0.643730863213\n+CCTTCC\tGGAAGG\t-0.645188770967\n+CCGCCC\tGGGCGG\t-0.655920687336\n+ATACAT\tATGTAT\t-0.656565760177\n+CCCCAG\tCTGGGG\t-0.660497743063\n+TAAATA\tTATTTA\t-0.67019051662\n+CATACA\tTGTATG\t-0.670245069654\n+AACCAA\tTTGGTT\t-0.677857110445\n+AGAAAT\tATTTCT\t-0.681479998956\n+C
CAAAG\tCTTTGG\t-0.691316214143\n+CCTGGA\tTCCAGG\t-0.692084306561\n+CAACCA\tTGGTTG\t-0.69245939213\n+ACATAC\tGTATGT\t-0.715297612044\n+ATAAAT\tATTTAT\t-0.72144447054\n+ACCCAC\tGTGGGT\t-0.722257607177\n+GAGGGA\tTCCCTC\t-0.740111037839\n+AAATAA\tTTATTT\t-0.75536652708\n+CATCCA\tTGGATG\t-0.765629307273\n+AGGGAG\tCTCCCT\t-0.770422600651\n+CTCCTC\tGAGGAG\t-0.773793736326\n+TACATA\tTATGTA\t-0.779745687886\n+CCCACC\tGGTGGG\t-0.808464992637\n+CCCTCC\tGGAGGG\t-0.825388866404\n+AAAGAA\tTTCTTT\t-0.94196675862\n+CCCTGC\tGCAGGG\t-0.970276055932\n+AAGAAA\tTTTCTT\t-0.988900449499\n+CCCCCC\tGGGGGG\t-1.23390665711\n+AGAGAG\tCTCTCT\t-1.32036067663\n+AAAAAA\tTTTTTT\t-1.44585337029\n+GAGAGA\tTCTCTC\t-1.47677790622\n+CACACA\tTGTGTG\t-1.69250039294\n+ACACAC\tGTGTGT\t-1.80322224697\n' |
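The weights file above has a simple layout: `#name=value` header lines carry parameters, other `#` lines are comments, and each data line holds a k-mer, its reverse complement, and an SVM weight separated by tabs. The following is a minimal Python 3 sketch of reading that layout, in the spirit of `read_svmwfile_wsk` in kmersvm_classify.py. The function names `parse_weights` and `reverse_complement` are illustrative and are not part of the repository.

```python
def parse_weights(lines):
    """Split weight-file lines into a parameter dict and a k-mer weight dict."""
    params, weights = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header lines containing '=' are parameters; others are comments.
            if "=" in line:
                name, value = line[1:].split("=", 1)
                params[name] = value
            continue
        kmer, _revcomp, weight = line.split()
        weights[kmer] = float(weight)
    return params, weights


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA k-mer."""
    return kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]


if __name__ == "__main__":
    sample = [
        "#parameters:",
        "#kernel=1",
        "#kmerlen=6",
        "#k-mer\trevcomp\tSVM-weight",
        "AGCCCA\tTGGGCT\t0.204489955998",
        "ACACAC\tGTGTGT\t-1.80322224697",
    ]
    params, weights = parse_weights(sample)
    print(params["kmerlen"], weights["ACACAC"], reverse_complement("AGCCCA"))
```

Parameter values are kept as strings here; the original script casts them to numeric types in a later step that falls outside this excerpt.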
diff -r 000000000000 -r 7fe1103032f7 kmersvm/nullseq.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/nullseq.xml Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,77 @@ +<tool id="kmersvm_nullseq" name="Generate Null Sequence"> + <description>using random sampling from genomic DNA</description> + <command interpreter="python">scripts/nullseq_generate.py -q + #if str($excluded) !="None": + -e $excluded + #end if + -x $fold -r $rseed -g $gc_err -t $rpt_err $input $dbkey ${indices_path.fields.path} + </command> + <inputs> + <param name="fold" type="integer" value="1" label="# of Fold-Increase" /> + <param name="gc_err" type="float" value="0.02" label="Allowable GC Error" /> + <param name="rpt_err" type="float" value="0.02" label="Allowable Repeat Error" /> + <param name="rseed" type="integer" value="1" label="Random Number Seed" /> + <param format="interval" name="input" type="data" label="BED File of Positive Regions" /> + <validator type="unspecified_build" /> + <validator type="dataset_metadata_in_file" filename="nullseq_indices.loc" metadata_name="dbkey" metadata_column="0" message="Sequences are currently unavailable for the specified build." 
/> + <param name="excluded" optional="true" format="interval" type="data" value="None" label="Excluded Regions (optional)" /> + <param name="indices_path" type="select" label="Available Datasets"> + <options from_file="nullseq_indices.loc"> + <column name="dbkey" index="0"/> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + <!--filter type="data_meta" ref="input" key="dbkey" column="0" /--> + </options> + </param> + </inputs> + <outputs> + <data format="interval" name="nullseq_output" from_work_dir="nullseq_output.bed" /> + </outputs> + <tests> + <test> + <param name="input" value="nullseq_test.bed" /> + <param name="fold" value="1" /> + <param name="gc_err" value="0.02" /> + <param name="rpt_err" value="0.02" /> + <param name="rseed" value="1" /> + <param name="indices_path" value="hg19" /> + <output name="output" file="nullseq_output.bed" /> + </test> + </tests> + <help> + +**What it does** + +Takes an input BED file and generates a set of sequences for use as negative data (null sequences) in Train SVM similar in length, GC content and repeat fraction. Uses random sampling for efficiency. + +**Parameters** + +Fold-Increase: Size of desired null sequence data set expressed as multiple of the size of the input data set. + +GC Error, Repeat Error: Acceptable difference between a positive sequence and its corresponding null sequence in terms of GC content, repeat content. + +Random Number Seed: Seed for random number generator. + +Excluded Regions: Submitted regions will be excluded from null sequence generation. + +---- + +**Example** + +Given a BED file containing:: + + chr1 10212203 10212303 + chr1 103584748 103584848 + chr1 105299130 105299230 + chr1 106367772 106367872 + +Tool will output BED file matched in length, GC content and repeat content:: + + chr1 3089935 3090035 + chr1 5031335 5031435 + chr1 5103742 5103842 + chr1 5650372 5650472 + + </help> +</tool> |
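The help text above describes null sequences as matched to each positive sequence in length and GC content within an allowable error (`gc_err`, default 0.02). The Python 3 sketch below illustrates only that acceptance test on raw sequence strings. It is not how nullseq_generate.py works: the real tool samples from precomputed genome indices and also matches repeat fraction. The function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def is_acceptable_null(positive, candidate, gc_err=0.02):
    """Accept a candidate if it has the same length as the positive sequence
    and its GC fraction differs by at most gc_err."""
    if len(positive) != len(candidate):
        return False
    return abs(gc_fraction(positive) - gc_fraction(candidate)) <= gc_err


if __name__ == "__main__":
    print(is_acceptable_null("ATGC", "GCAT"))  # same length, same GC fraction
    print(is_acceptable_null("ATGC", "AAAA"))  # same length, GC differs by 0.5
```

A random sampler would draw candidate regions repeatedly and keep the first one that passes this kind of check, which is one plausible reading of "uses random sampling for efficiency".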
diff -r 000000000000 -r 7fe1103032f7 kmersvm/prcurve.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/prcurve.xml Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,118 @@ +<tool id="PR Curve" name="Plot Precision-Recall Curve"> + <description></description> + <command interpreter="sh">r_wrapper.sh $script_file</command> + <inputs> + <param format="tabular" name="cvpred_data" type="data" label="CV Predictions"/> + </inputs> + <outputs> + <data format="png" name="PR_curve" from_work_dir="prcurve.png" /> + </outputs> + + <configfiles> + <configfile name="script_file"> + + rm(list = objects() ) + + ########## calculate auprc ######### + auPRC <- function (perf) { + rec <- perf@x.values + prec <- perf@y.values + result <- list() + for (i in 1:length(perf@x.values)) { + result[i] <- list(sum((rec[[i]][2:length(rec[[i]])] - rec[[i]][2:length(rec[[i]])-1])*prec[[i]][-1])) + } + return(result) + } + + ########## plot ROC and PR-Curve ######### + prcurve <- function(x) { + sink(NULL,type="message") + options(warn=-1) + suppressMessages(suppressWarnings(library('ROCR'))) + svmresult <- data.frame(x) + colnames(svmresult) <- c("Seqid","Pred","Label", "CV") + + linewd <- 1 + wd <- 4 + ht <- 4 + fig.nrows <- 1 + fig.ncols <- 1 + pt <- 10 + cex.general <- 1 + cex.lab <- 0.9 + cex.axis <- 0.9 + cex.main <- 1.2 + cex.legend <- 0.8 + + + png("prcurve.png", width=wd*fig.ncols, height=ht*fig.nrows, unit="in", res=100) + + par(xaxs="i", yaxs="i", mar=c(3.5,3.5,2,2)+0.1, mgp=c(2,0.8,0), mfrow=c(fig.nrows, fig.ncols)) + + CVs <- unique(svmresult[["CV"]]) + preds <- list() + labs <- list() + auc <- c() + for(i in 1:length(CVs)) { + preds[i] <- subset(svmresult, CV==(i-1), select=c(Pred)) + labs[i] <- subset(svmresult, CV==(i-1), select=c(Label)) + } + + pred <- prediction(preds, labs) + perf_prc <- performance(pred, 'prec', 'rec') + + prcs <- auPRC(perf_prc) + avgprc <- 0 + + for(j in 1:length(CVs)) { + avgprc <- avgprc + prcs[[j]] + } + + avgprc <- avgprc/length(CVs) + + plot(perf_prc, colorize=T, main="P-R curve", spread.estimate="stderror", + xlab="Recall", ylab="Precision", cex.lab=1.2, xlim=c(0,1), ylim=c(0,1)) + text(0.2, 0.1, 
paste("AUC=", format(avgprc, digits=3, nsmall=3))) + + dev.off() + } + + ############## main function ################# + d <- read.table("${cvpred_data}") + + prcurve(d) + + </configfile> + </configfiles> + + <help> + +**Note** + +This tool is based on the ROCR library. If you use this tool please cite: + +Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer. +ROCR: visualizing classifier performance in R. +Bioinformatics 21(20):3940-3941 (2005). + +---- + +**What it does** + +Takes as input cross-validation predictions and calculates the Precision-Recall (PR) curve and its area under the curve (AUC). + +---- + +**Results** + +PR Curve: Precision-Recall Curve. Compares the fraction of actual positives that are recovered (recall; same as sensitivity) to the fraction of sequences classified as positive that are true positives (precision). + +Area Under the PR Curve. + +<!-- +**Example** + +.. image:: ./static/images/sample_roc_chen.png +--> + </help> +</tool>
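The `auPRC` function in the R script above approximates the area under the precision-recall curve with a rectangle rule: each step in recall is multiplied by the precision at the right-hand end of that step, and the first precision value is dropped. The Python 3 sketch below restates that sum for a single fold. The function name `au_prc` and the sample points are illustrative only.

```python
def au_prc(recall, precision):
    """Rectangle-rule area under a PR curve.

    Mirrors sum((rec[2:n] - rec[1:(n-1)]) * prec[2:n]) from the R code:
    each recall increment is weighted by the precision at its right end.
    """
    return sum(
        (recall[i] - recall[i - 1]) * precision[i]
        for i in range(1, len(recall))
    )


if __name__ == "__main__":
    # Hypothetical curve points, ordered by increasing recall.
    recall = [0.0, 0.5, 1.0]
    precision = [1.0, 1.0, 0.5]
    print(au_prc(recall, precision))
```

The R tool computes this per cross-validation fold and then averages the fold values to get the number printed on the plot.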
diff -r 000000000000 -r 7fe1103032f7 kmersvm/r_wrapper.sh --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/r_wrapper.sh Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+### Run R providing the R script in $1 as standard input and passing
+### the remaining arguments on the command line
+
+# Function that writes a message to stderr and exits
+fail()
+{
+    echo "$@" >&2
+    exit 1
+}
+
+# Ensure R executable is found
+which R > /dev/null || fail "'R' is required by this tool but was not found on path"
+
+# Extract first argument
+infile=$1; shift
+
+# Ensure the file exists
+test -f "$infile" || fail "R input file '$infile' does not exist"
+
+# Invoke R passing file named by first argument to stdin
+R --vanilla --slave "$@" < "$infile"
diff -r 000000000000 -r 7fe1103032f7 kmersvm/roccurve.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/roccurve.xml Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,115 @@ +<tool id="ROC Curve" name="Plot ROC Curve"> + <description></description> + <command interpreter="sh">r_wrapper.sh $script_file</command> + <inputs> + <param format="tabular" name="cvpred_data" type="data" label="CV Predictions"/> + </inputs> + <outputs> + <data format="png" name="roccurve.png" from_work_dir="roccurve.png" /> + </outputs> + + <configfiles> + <configfile name="script_file"> + + rm(list = objects() ) + + ########## plot ROC and PR-Curve ######### + roccurve <- function(x) { + sink(NULL,type="message") + options(warn=-1) + suppressMessages(suppressWarnings(library('ROCR'))) + svmresult <- data.frame(x) + colnames(svmresult) <- c("Seqid","Pred","Label", "CV") + + linewd <- 1 + wd <- 4 + ht <- 4 + fig.nrows <- 1 + fig.ncols <- 1 + pt <- 10 + cex.general <- 1 + cex.lab <- 0.9 + cex.axis <- 0.9 + cex.main <- 1.2 + cex.legend <- 0.8 + + png("roccurve.png", width=wd*fig.ncols, height=ht*fig.nrows, unit="in", res=100) + + par(xaxs="i", yaxs="i", mar=c(3.5,3.5,2,2)+0.1, mgp=c(2,0.8,0), mfrow=c(fig.nrows, fig.ncols)) + + CVs <- unique(svmresult[["CV"]]) + preds <- list() + labs <- list() + auc <- c() + for(i in 1:length(CVs)) { + preds[i] <- subset(svmresult, CV==(i-1), select=c(Pred)) + labs[i] <- subset(svmresult, CV==(i-1), select=c(Label)) + } + + pred <- prediction(preds, labs) + perf_roc <- performance(pred, 'tpr', 'fpr') + perf_auc <- performance(pred, 'auc') + + avgauc <- 0 + + for(j in 1:length(CVs)) { + avgauc <- avgauc + perf_auc@y.values[[j]] + } + + avgauc <- avgauc/length(CVs) + + plot(perf_roc, colorize=T, main="ROC curve", spread.estimate="stderror", + xlab="1-Specificity", ylab="Sensitivity", cex.lab=1.2) + text(0.2, 0.1, paste("AUC=", format(avgauc, digits=3, nsmall=3))) + + dev.off() + } + + ############## main function ################# + d <- read.table("${cvpred_data}") + + roccurve(d) + + </configfile> + </configfiles> + + <help> + +**Note** + +This tool is based on the ROCR library. 
If you use this tool please cite: + +Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer. +ROCR: visualizing classifier performance in R. +Bioinformatics 21(20):3940-3941 (2005). + +---- + +**What it does** + +Takes as input cross-validation predictions and calculates the ROC curve and its area under the curve (AUC). + +---- + +**Results** + +ROC Curve: Receiver Operating Characteristic Curve. Compares true positive rate (sensitivity) to false positive rate (1 - specificity). + +Area Under the ROC Curve (AUC): Probability that, for a randomly selected positive/negative pair, the positive will be scored higher by the trained SVM than the negative. + +.. class:: infomark + +ROC curves can be inaccurate if there is a large skew in class distribution. For more information see: + +Jesse Davis, Mark Goadrich. +The Relationship Between Precision-Recall and ROC Curves. +Proceedings of the 23rd Annual International Conference on Machine Learning. +Pittsburgh, PA, 2006. + +<!-- +**Example** + +.. image:: ./static/images/sample_roc_chen.png +--> + </help> +</tool>
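The help text above interprets the ROC AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative. The Python 3 sketch below computes that quantity directly by comparing every positive/negative pair, counting ties as half. It illustrates the definition only; the tool itself obtains the AUC from ROCR's `performance(pred, 'auc')`. The function name and scores are hypothetical.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical SVM scores for two positives and two negatives.
    print(pairwise_auc([0.9, 0.4], [0.5, 0.1]))
```

The quadratic loop is fine for small cross-validation folds; a rank-based formula gives the same value more efficiently on large sets.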
diff -r 000000000000 -r 7fe1103032f7 kmersvm/rocprcurve.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/rocprcurve.xml Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,147 @@ +<tool id="ROC-PR Curve" name="ROC-PR Curve"> + <description>calculates AUC for ROC and PR curves</description> + <command interpreter="sh">r_wrapper.sh $script_file</command> + <inputs> + <param format="tabular" name="cvpred_data" type="data" label="CV Predictions"/> + </inputs> + <outputs> + <!-- + <data format="pdf" name="rocprc.pdf" from_work_dir="rocprc.pdf" label="ROC-PR Curve" /> + --> + <data format="png" name="rocprc.png" from_work_dir="rocprc.png" /> + </outputs> + + <configfiles> + <configfile name="script_file"> + + rm(list = objects() ) + + ########## calculate auprc ######### + auPRC <- function (perf) { + rec <- perf@x.values + prec <- perf@y.values + result <- list() + for (i in 1:length(perf@x.values)) { + result[i] <- list(sum((rec[[i]][2:length(rec[[i]])] - rec[[i]][2:length(rec[[i]])-1])*prec[[i]][-1])) + } + return(result) + } + + ########## plot ROC and PR-Curve ######### + rocprc <- function(x) { + sink(NULL,type="message") + options(warn=-1) + suppressMessages(suppressWarnings(library('ROCR'))) + svmresult <- data.frame(x) + colnames(svmresult) <- c("Seqid","Pred","Label", "CV") + + linewd <- 1 + wd <- 4 + ht <- 4 + fig.nrows <- 1 + fig.ncols <- 2 + pt <- 10 + cex.general <- 1 + cex.lab <- 0.9 + cex.axis <- 0.9 + cex.main <- 1.2 + cex.legend <- 0.8 + + + #pdf("rocprc.pdf", width=wd*fig.ncols, height=ht*fig.nrows) + png("rocprc.png", width=wd*fig.ncols, height=ht*fig.nrows, unit="in", res=100) + + par(xaxs="i", yaxs="i", mar=c(3.5,3.5,2,2)+0.1, mgp=c(2,0.8,0), mfrow=c(fig.nrows, fig.ncols)) + + CVs <- unique(svmresult[["CV"]]) + preds <- list() + labs <- list() + auc <- c() + for(i in 1:length(CVs)) { + preds[i] <- subset(svmresult, CV==(i-1), select=c(Pred)) + labs[i] <- subset(svmresult, CV==(i-1), select=c(Label)) + } + + pred <- prediction(preds, labs) + perf_roc <- performance(pred, 'tpr', 'fpr') + perf_prc <- performance(pred, 'prec', 'rec') + + perf_auc <- performance(pred, 'auc') + prcs <- auPRC(perf_prc) + avgauc <- 
0 + avgprc <- 0 + + for(j in 1:length(CVs)) { + avgauc <- avgauc + perf_auc@y.values[[j]] + avgprc <- avgprc + prcs[[j]] + } + + avgauc <- avgauc/length(CVs) + avgprc <- avgprc/length(CVs) + + #preds_merged <- unlist(preds) + #labs_merged <- unlist(labs) + #pred_merged <- prediction(preds_merged, labs_merged) + #perf_merged_auc <- performance(pred_merged, 'auc') + + plot(perf_roc, colorize=T, main="ROC curve", spread.estimate="stderror", + xlab="1-Specificity", ylab="Sensitivity", cex.lab=1.2) + text(0.2, 0.1, paste("AUC=", format(avgauc, digits=3, nsmall=3))) + + plot(perf_prc, colorize=T, main="P-R curve", spread.estimate="stderror", + xlab="Recall", ylab="Precision", cex.lab=1.2, xlim=c(0,1), ylim=c(0,1)) + text(0.2, 0.1, paste("AUC=", format(avgprc, digits=3, nsmall=3))) + + dev.off() + } + + ############## main function ################# + d <- read.table("${cvpred_data}") + + rocprc(d) + + </configfile> + </configfiles> + + <help> + +**Note** + +This tool is based on the ROCR library. If you use this tool please cite: + +Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer. +ROCR: visualizing classifier performance in R. +Bioinformatics 21(20):3940-3941 (2005). + +---- + +**What it does** + +Takes as input cross-validation predictions and calculates ROC Curve and its area under curve (AUC) and PR Curve and its AUC. + +---- + +**Results** + +ROC Curve: Receiver Operating Characteristic Curve. Compares true positive rate (sensitivity) to false positive rate (1 - specificity). + +PR Curve: Precision Recall Curve. Compares number of true positives (recall; same as sensitivity) to the number of true positives relative to the total number sequences classified as positive (precision). + +AUC for a given curve: Area Under the Curve: Probability that of a randomly selected positive/negative pair, the positive will be scored more highly by the trained SVM than a negative. + +.. 
class:: infomark + +Both curves measure SVM performance, but ROC curves can give an overly optimistic picture if there is a large skew in class distribution. For more information see: + +Jesse Davis, Mark Goadrich. +The Relationship Between Precision-Recall and ROC Curves. +Proceedings of the 23rd Annual International Conference on Machine Learning. +Pittsburgh, PA, 2006. + +---- + +**Example** + +.. image:: ./static/images/sample_roc_chen.png + </help> +</tool> |
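The two AUC values described in the help text above can be illustrated with a minimal Python sketch. It is not part of this changeset: the function names `roc_auc_by_ranking` and `pr_auc_step` are hypothetical, and the second function only mirrors the step-wise sum used by the R helper `auPRC` (recall increment times the precision at the right-hand point), not the ROCR internals.

```python
def roc_auc_by_ranking(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive is scored higher; tied pairs count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pr_auc_step(recall, precision):
    """Area under a PR curve as sum((rec[i] - rec[i-1]) * prec[i]) for i >= 1,
    the same step-wise rule as the auPRC function in the R script."""
    total = 0.0
    for i in range(1, len(recall)):
        total += (recall[i] - recall[i - 1]) * precision[i]
    return total
```

The pairwise loop is quadratic and is meant only to make the ranking interpretation concrete; it applies to the ROC AUC, not to the PR AUC.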
diff -r 000000000000 -r 7fe1103032f7 kmersvm/scripts/kmersvm_classify.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/scripts/kmersvm_classify.py Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,221 @@ +#!/usr/bin/python +""" + kmersvm_classify.py; classify sequences using SVM + Copyright (C) 2011 Dongwon Lee + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +""" + +import sys +import numpy +import optparse + +from libkmersvm import * + +""" +global variables +""" +g_kmer2id = {} + + +class Parameters: + def __init__(self, kernel=None, kmerlen=None, kmerlen2=None, bias=None, A=None, B=None): + self.kernel = kernel + self.kmerlen = kmerlen + self.kmerlen2 = kmerlen2 + self.bias = bias + self.A = A + self.B = B + + +def read_svmwfile_wsk(filename): + """read SVM weight file generated by kmersvm_train.py + + Arguments: + filename -- string, name of the SVM weight file + + Return: + list of SVM weights + an object of Parameters class + """ + + try: + f = open(filename, 'r') + lines = f.readlines() + f.close() + + except IOError, (errno, strerror): + print "I/O error(%d): %s" % (errno, strerror) + sys.exit(0) + + kmer_svmw_dict = {} + params = Parameters() + + for line in lines: + #header lines + if line[0] == '#': + #if this line contains '=', that should be evaluated as a parameter + if line.find('=') > 0: + name, value = line[1:].split('=') + vars(params)[name] = value + else: + s = line.split() + kmerlen = len(s[0]) + if kmerlen not in kmer_svmw_dict: + kmer_svmw_dict[kmerlen] = {} + + kmer_svmw_dict[kmerlen][s[0]] = float(s[2]) + + #type casting of parameters + 
params.kernel = int(params.kernel) + params.kmerlen = int(params.kmerlen) + if params.kernel == 1: + params.kmerlen2 = params.kmerlen + else: + params.kmerlen2 = int(params.kmerlen2) + params.bias = float(params.bias) + params.A = float(params.A) + params.B = float(params.B) + + #set global variable + global g_kmer2id + for k in range(params.kmerlen, params.kmerlen2+1): + kmers = generate_kmers(k) + rcmap = generate_rcmap_table(k, kmers) + for i in xrange(len(kmers)): + g_kmer2id[kmers[i]] = rcmap[i] + + #create numpy arrays of svm weights + svmw_list = [] + for k in range(params.kmerlen, params.kmerlen2+1): + svmw = [0]*(2**(2*k)) + + for kmer in kmer_svmw_dict[k].keys(): + svmw[g_kmer2id[kmer]] = kmer_svmw_dict[k][kmer] + + svmw_list.append(numpy.array(svmw, numpy.double)) + + return svmw_list, params + + +def score_seq(s, svmw, kmerlen): + """calculate SVM score of given sequence using single set of svm weights + + Arguments: + s -- string, DNA sequence + svmw -- numpy array, SVM weights + kmerlen -- integer, length of k-mer of SVM weight + + Return: + SVM score + """ + kmer2id = g_kmer2id + x = [0]*(2**(2*kmerlen)) + for j in xrange(len(s)-kmerlen+1): + x[ kmer2id[s[j:j+kmerlen]] ] += 1 + + x = numpy.array(x, numpy.double) + score_norm = numpy.dot(svmw, x)/numpy.sqrt(numpy.sum(x**2)) + + return score_norm + + +def score_seq_wsk(s, svmwlist, kmerlen_start, kmerlen_end): + """calculate svm score of given sequence with multiple sets of svm weights + + Arguments: + svmwlist -- list, SVM weights + kmerlen_start -- integer, minimum length of k-mer in the list of svm weights + kmerlen_end -- integer, maximum length of k-mer in the list of sv weights + + Return: + SVM score + """ + kmerlens = range(kmerlen_start, kmerlen_end+1) + nkmerlens = len(kmerlens) + + score_norm_sum = 0 + + for i in range(nkmerlens): + score_norm = score_seq(s, svmwlist[i], kmerlens[i]) + score_norm_sum += score_norm + + return score_norm_sum + + +def main(argv = sys.argv): + usage = "Usage: 
%prog [options] SVM_WEIGHTS TEST_SEQ" + desc = "1. take two files (one in FASTA format to score, the other an SVM weight file generated by kmersvm_train.py) as input, 2. score each sequence in the given FASTA file" + parser = optparse.OptionParser(usage=usage, description=desc) + parser.add_option("-o", dest="output", default="kmersvm_scores.out", \ + help="set the name of output score file (default=kmersvm_scores.out)") + + parser.add_option("-q", dest="quiet", default=False, action="store_true", \ + help="suppress messages (default=false)") + + (options, args) = parser.parse_args() + + if len(args) == 0: + parser.print_help() + sys.exit(0) + + if len(args) != 2: + parser.error("incorrect number of arguments") + sys.exit(0) + + ktype_str = ["", "Spectrum", "Weighted Spectrums"] + + svmwf = args[0] + seqf = args[1] + + seqs, sids = read_fastafile(seqf) + svmwlist, params = read_svmwfile_wsk(svmwf) + + if options.quiet == False: + sys.stderr.write('Options:\n') + sys.stderr.write(' kernel-type: ' + str(params.kernel) + "." 
+ ktype_str[params.kernel] + '\n') + sys.stderr.write(' kmerlen: ' + str(params.kmerlen) + '\n') + if params.kernel == 2: + sys.stderr.write(' kmerlen2: ' + str(params.kmerlen2) + '\n') + sys.stderr.write(' output: ' + options.output + '\n') + sys.stderr.write('\n') + + sys.stderr.write('Input args:\n') + sys.stderr.write(' SVM weights file: ' + svmwf + '\n') + sys.stderr.write(' sequence file: ' + seqf + '\n') + sys.stderr.write('\n') + + sys.stderr.write('number of sequences to score: ' + str(len(seqs)) + '\n') + sys.stderr.write('posteriorp A: ' + str(params.A) + '\n') + sys.stderr.write('posteriorp B: ' + str(params.B) + '\n') + sys.stderr.write('\n') + + f = open(options.output, 'w') + f.write("\t".join(["#seq_id", "posterior_prob", "svm_score\n"])) + + kmerlen = params.kmerlen + kmerlen2 = params.kmerlen2 + bias = params.bias + A = params.A + B = params.B + for sidx in xrange(len(seqs)): + s = seqs[sidx] + score = score_seq_wsk(s, svmwlist, kmerlen, kmerlen2) + bias + pp = 1/(1+numpy.exp(score*A+B)) + + f.write("\t".join([ sids[sidx], str(pp), str(score)]) + "\n") + + f.close() + +if __name__=='__main__': main() |
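The scoring path in kmersvm_classify.py above (a length-normalized dot product of SVM weights and k-mer counts, then a sigmoid with parameters A and B) can be sketched without numpy as follows. This is an illustrative sketch, not code from the changeset: the names `normalized_score` and `posterior_probability` are hypothetical, and the guard for an all-zero count vector is an added assumption, since the original `score_seq` divides by the norm without checking it.

```python
import math


def normalized_score(weights, counts):
    """dot(weights, counts) / ||counts||, as computed in score_seq.

    Assumption (not in the original script): an all-zero count vector,
    e.g. a sequence shorter than k, scores 0.0 instead of dividing by zero.
    """
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        return 0.0
    return sum(w * c for w, c in zip(weights, counts)) / norm


def posterior_probability(score, A, B):
    """Sigmoid mapping used in main(): 1 / (1 + exp(score * A + B))."""
    return 1.0 / (1.0 + math.exp(score * A + B))
```

With this parameterization a negative A makes larger SVM scores map to probabilities closer to 1; the bias is added to the score before the sigmoid is applied, as in the loop over sequences in `main()`.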
diff -r 000000000000 -r 7fe1103032f7 kmersvm/scripts/kmersvm_train.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/scripts/kmersvm_train.py Mon Aug 20 18:07:22 2012 -0400 |
b'@@ -0,0 +1,859 @@\n+#!/usr/bin/env python\n+"""\n+\tkmersvm_train.py; train a support vector machine using shogun toolbox\n+\tCopyright (C) 2011 Dongwon Lee\n+\n+\tThis program is free software: you can redistribute it and/or modify\n+\tit under the terms of the GNU General Public License as published by\n+\tthe Free Software Foundation, either version 3 of the License, or\n+\t(at your option) any later version.\n+\n+\tThis program is distributed in the hope that it will be useful,\n+\tbut WITHOUT ANY WARRANTY; without even the implied warranty of\n+\tMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n+\tGNU General Public License for more details.\n+\n+\tYou should have received a copy of the GNU General Public License\n+\talong with this program. If not, see <http://www.gnu.org/licenses/>.\n+\n+\n+"""\n+\n+\n+\n+import sys\n+import optparse\n+import random\n+import numpy\n+from math import log, exp\n+\n+from libkmersvm import *\n+try:\n+\tfrom shogun.PreProc import SortWordString, SortUlongString\n+except ImportError:\n+\tfrom shogun.Preprocessor import SortWordString, SortUlongString\n+from shogun.Kernel import CommWordStringKernel, CommUlongStringKernel, \\\n+\t\tCombinedKernel\n+\t\t\n+from shogun.Features import StringWordFeatures, StringUlongFeatures, \\\n+\t\tStringCharFeatures, CombinedFeatures, DNA, Labels\n+from shogun.Classifier import MSG_INFO, MSG_ERROR\n+try:\n+\tfrom shogun.Classifier import SVMLight\n+except ImportError:\n+\tfrom shogun.Classifier import LibSVM\n+\n+"""\n+global variables\n+"""\n+g_kmers = []\n+g_rcmap = []\n+\n+\n+def kmerid2kmer(kmerid, kmerlen):\n+\t"""convert integer kmerid to kmer string\n+\n+\tArguments:\n+\tkmerid -- integer, id of k-mer\n+\tkmerlen -- integer, length of k-mer\n+\n+\tReturn:\n+\tkmer string\n+\t"""\n+\n+\tnts = "ACGT"\n+\tkmernts = []\n+\tkmerid2 = kmerid\n+\n+\tfor i in xrange(kmerlen):\n+\t\tntid = kmerid2 % 4\n+\t\tkmernts.append(nts[ntid])\n+\t\tkmerid2 = int((kmerid2-ntid)/4)\n+\n+\treturn 
\'\'.join(reversed(kmernts))\n+\n+\n+def kmer2kmerid(kmer, kmerlen):\n+\t"""convert kmer string to integer kmerid\n+\n+\tArguments:\n+\tkmerid -- integer, id of k-mer\n+\tkmerlen -- integer, length of k-mer\n+\n+\tReturn:\n+\tid of k-mer\n+\t"""\n+\n+\tnt2id = {\'A\':0, \'C\':1, \'G\':2, \'T\':3}\n+\n+\treturn reduce(lambda x, y: (4*x+y), [nt2id[x] for x in kmer])\n+\n+\n+def get_rcmap(kmerid, kmerlen):\n+\t"""mapping kmerid to its reverse complement k-mer on-the-fly\n+\n+\tArguments:\n+\tkmerid -- integer, id of k-mer\n+\tkmerlen -- integer, length of k-mer\n+\n+\tReturn:\n+\tinteger kmerid after mapping to its reverse complement\n+\t"""\n+\n+\t#1. get kmer from kmerid\n+\t#2. get reverse complement kmer\n+\t#3. get kmerid from revcomp kmer\n+\trckmerid = kmer2kmerid(revcomp(kmerid2kmer(kmerid, kmerlen)), kmerlen)\n+\n+\tif rckmerid < kmerid:\n+\t\treturn rckmerid\n+\n+\treturn kmerid\n+\n+\n+def non_redundant_word_features(feats, kmerlen):\n+\t"""convert the features from Shogun toolbox to non-redundant word features (handle reverse complements)\n+\tArguments:\n+\tfeats -- StringWordFeatures\n+\tkmerlen -- integer, length of k-mer\n+\n+\tReturn:\n+\tStringWordFeatures after converting reverse complement k-mer ids\n+\t"""\n+\n+\trcmap = g_rcmap\n+\n+\tfor i in xrange(feats.get_num_vectors()):\n+\t\tnf = [rcmap[int(kmerid)] for kmerid in feats.get_feature_vector(i)]\n+\n+\t\tfeats.set_feature_vector(numpy.array(nf, numpy.dtype(\'u2\')), i)\n+\n+\tpreproc = SortWordString()\n+\tpreproc.init(feats)\n+\ttry:\n+\t\tfeats.add_preproc(preproc)\n+\t\tfeats.apply_preproc()\n+\texcept AttributeError:\n+\t\tfeats.add_preprocessor(preproc)\n+\t\tfeats.apply_preprocessor()\t\n+\n+\treturn feats\n+\n+\n+def non_redundant_ulong_features(feats, kmerlen):\n+\t"""convert the features from Shogun toolbox to non-redundant ulong features\n+\tArguments:\n+\tfeats -- StringUlongFeatures\n+\tkmerlen -- integer, length of k-mer\n+\n+\tReturn:\n+\tStringUlongFeatures after converting 
reverse complement k-mer ids\n+\t"""\n+\n+\tfor i in xrange(feats.get_num_vectors()):\n+\t\tnf = [get_rcmap(int(kmerid), kmerlen) \\\n+\t\t\t\tfor kmerid in feats.get_feature_vector(i)]\n+\n+\t\tfeats.set_feature_vector(numpy.array(nf, numpy.dtype(\'u8\')), i)\n+\n+\tpreproc = SortUlongStr'..b'erlen)\n+\t\tg_rcmap = generate_rcmap_table(options.kmerlen, g_kmers)\n+\t\n+\tposf = args[0]\n+\tnegf = args[1]\n+\t\n+\tseqs_pos, sids_pos = read_fastafile(posf)\n+\tseqs_neg, sids_neg = read_fastafile(negf)\n+\tnpos = len(seqs_pos)\n+\tnneg = len(seqs_neg)\n+\tseqs = seqs_pos + seqs_neg\n+\tsids = sids_pos + sids_neg\n+\n+\tif options.weight == 0:\n+\t\toptions.weight = 1 + log(nneg/npos)\n+\n+\tif options.quiet == False:\n+\t\tsys.stderr.write(\'SVM parameters:\\n\')\n+\t\tsys.stderr.write(\' kernel-type: \' + str(options.ktype) + "." + ktype_str[options.ktype] + \'\\n\')\n+\t\tsys.stderr.write(\' svm-C: \' + str(options.svmC) + \'\\n\')\n+\t\tsys.stderr.write(\' epsilon: \' + str(options.epsilon) + \'\\n\')\n+\t\tsys.stderr.write(\' weight: \' + str(options.weight) + \'\\n\')\n+\t\tsys.stderr.write(\'\\n\')\n+\n+\t\tsys.stderr.write(\'Other options:\\n\')\n+\t\tsys.stderr.write(\' kmerlen: \' + str(options.kmerlen) + \'\\n\')\n+\t\tif options.ktype == 2:\n+\t\t\tsys.stderr.write(\' kmerlen2: \' + str(options.kmerlen2) + \'\\n\')\n+\t\tsys.stderr.write(\' outputname: \' + options.outputname + \'\\n\')\n+\t\tsys.stderr.write(\' posteriorp: \' + str(options.posteriorp) + \'\\n\')\n+\t\tif options.ncv > 0:\n+\t\t\tsys.stderr.write(\' ncv: \' + str(options.ncv) + \'\\n\')\n+\t\t\tsys.stderr.write(\' rseed: \' + str(options.rseed) + \'\\n\')\n+\t\tsys.stderr.write(\' sorted-weight: \' + str(options.sort) + \'\\n\')\n+\n+\t\tsys.stderr.write(\'\\n\')\n+\n+\t\tsys.stderr.write(\'Input args:\\n\')\n+\t\tsys.stderr.write(\' positive sequence file: \' + posf + \'\\n\')\n+\t\tsys.stderr.write(\' negative sequence file: \' + negf + 
\'\\n\')\n+\t\tsys.stderr.write(\'\\n\')\n+\n+\t\tsys.stderr.write(\'numer of total positive seqs: \' + str(npos) + \'\\n\')\n+\t\tsys.stderr.write(\'numer of total negative seqs: \' + str(nneg) + \'\\n\')\n+\t\tsys.stderr.write(\'\\n\')\n+\n+\t#generate labels\n+\tlabels = [1]*npos + [-1]*nneg\n+\n+\tif options.ktype == 1:\n+\t\tget_features = get_spectrum_features\n+\t\tget_kernel = get_spectrum_kernel\n+\t\tget_weights = get_sksvm_weights\n+\t\tsave_weights = save_sksvm_weights\n+\t\tsvm_classify = sksvm_classify\n+\telif options.ktype == 2:\n+\t\tget_features = get_weighted_spectrum_features\n+\t\tget_kernel = get_weighted_spectrum_kernel\n+\t\tget_weights = get_wsksvm_weights\n+\t\tsave_weights = save_wsksvm_weights\n+\t\tsvm_classify = wsksvm_classify\n+\telse:\n+\t\tsys.stderr.write(\'..unknown kernel..\\n\')\n+\t\tsys.exit(0)\n+\n+\tA = B = 0\n+\tif options.ncv > 0:\n+\t\tif options.quiet == False:\n+\t\t\tsys.stderr.write(\'..Cross-validation\\n\')\n+\n+\t\tcvlist = generate_cv_list(options.ncv, npos, nneg)\n+\t\tlabels_cv = []\n+\t\tpreds_cv = []\n+\t\tsids_cv = []\n+\t\tindices_cv = []\n+\t\tfor icv in xrange(options.ncv):\n+\t\t\t#split data into training and test set\n+\t\t\tseqs_tr, seqs_te = split_cv_list(cvlist, icv, seqs) \n+\t\t\tlabs_tr, labs_te = split_cv_list(cvlist, icv, labels)\n+\t\t\tsids_tr, sids_te = split_cv_list(cvlist, icv, sids)\n+\t\t\tindices_tr, indices_te = split_cv_list(cvlist, icv, range(len(seqs)))\n+\n+\t\t\t#train SVM\n+\t\t\tfeats_tr = get_features(seqs_tr, options)\n+\t\t\tkernel_tr = get_kernel(feats_tr, options)\n+\t\t\tsvm_cv = svm_learn(kernel_tr, labs_tr, options)\n+\n+\t\t\tpreds_cv = preds_cv + svm_classify(seqs_te, svm_cv, kernel_tr, feats_tr, options)\n+\t\t\t\n+\t\t\tlabels_cv = labels_cv + labs_te\n+\t\t\tsids_cv = sids_cv + sids_te\n+\t\t\tindices_cv = indices_cv + indices_te\n+\n+\t\toutput_cvpred = options.outputname + "_cvpred.out"\n+\t\tprediction_results = sorted(zip(indices_cv, sids_cv, preds_cv, 
labels_cv), key=lambda p: p[0])\n+\t\tsave_predictions(output_cvpred, prediction_results, cvlist)\n+\n+\t\tif options.posteriorp:\n+\t\t\tA, B = LMAI(preds_cv, labels_cv, labels_cv.count(-1), labels_cv.count(1))\n+\n+\t\t\tif options.quiet == False:\n+\t\t\t\tsys.stderr.write(\'Estimated Parameters:\\n\')\n+\t\t\t\tsys.stderr.write(\' A: \' + str(A) + \'\\n\')\n+\t\t\t\tsys.stderr.write(\' B: \' + str(B) + \'\\n\')\n+\n+\tif options.quiet == False:\n+\t\tsys.stderr.write(\'..SVM weights\\n\')\n+\n+\tfeats = get_features(seqs, options)\n+\tkernel = get_kernel(feats, options)\n+\tsvm = svm_learn(kernel, labels, options)\n+\tw = get_weights(svm, feats, options)\n+\tb = svm.get_bias()\n+\n+\tsave_weights(w, b, A, B, options)\n+\n+if __name__==\'__main__\': main()\n' |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/scripts/libkmersvm.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/scripts/libkmersvm.py Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,138 @@ +""" + libkmersvm.py; common library for kmersvm_train.py and kmersvm_classify.py + Copyright (C) 2011 Dongwon Lee + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +""" + +import sys +import os +import os.path +import optparse + +from bitarray import bitarray + +def bitarray_fromfile(filename): + """ + """ + fh = open(filename, 'rb') + bits = bitarray() + bits.fromfile(fh) + + return bits, fh + +def generate_kmers(kmerlen): + """make a full list of k-mers + + Arguments: + kmerlen -- integer, length of k-mer + + Return: + a list of the full set of k-mers + """ + + nts = ['A', 'C', 'G', 'T'] + kmers = [] + kmers.append('') + l = 0 + while l < kmerlen: + imers = [] + for imer in kmers: + for nt in nts: + imers.append(imer+nt) + kmers = imers + l += 1 + + return kmers + + +def revcomp(seq): + """get reverse complement DNA sequence + + Arguments: + seq -- string, DNA sequence + + Return: + the reverse complement sequence of the given sequence + """ + rc = {'A':'T', 'G':'C', 'C':'G', 'T':'A'} + return ''.join([rc[seq[i]] for i in xrange(len(seq)-1, -1, -1)]) + + +def generate_rcmap_table(kmerlen, kmers): + """make a lookup table for reverse complement k-mer ids for speed + + Arguments: + kmerlen -- integer, length of k-mer + kmers -- list, a full set of k-mers generated by generate_kmers + + Return: + a dictionary containing the mapping table + """ + revcomp_func = revcomp + + 
kmer_id_dict = {} + for i in xrange(len(kmers)): + kmer_id_dict[kmers[i]] = i + + revcomp_mapping_table = [] + for kmerid in xrange(len(kmers)): + rc_id = kmer_id_dict[revcomp_func(kmers[kmerid])] + if rc_id < kmerid: + revcomp_mapping_table.append(rc_id) + else: + revcomp_mapping_table.append(kmerid) + + return revcomp_mapping_table + + +def read_fastafile(filename, subs=True): + """Read sequences from a file in FASTA format + + Arguments: + filename -- string, the name of the sequence file in FASTA format + subs -- bool, substitute 'N' with 'A' if set true + + Return: + list of sequences, list of sequence ids + """ + + sids = [] + seqs = [] + + try: + f = open(filename, 'r') + lines = f.readlines() + f.close() + + except IOError, (errno, strerror): + print "I/O error(%d): %s" % (errno, strerror) + sys.exit(0) + + seq = [] + for line in lines: + if line[0] == '>': + sids.append(line[1:].rstrip('\n').split()[0]) + if seq != []: seqs.append("".join(seq)) + seq = [] + else: + if subs: + seq.append(line.rstrip('\n').upper().replace('N', 'A')) + else: + seq.append(line.rstrip('\n').upper()) + + if seq != []: + seqs.append("".join(seq)) + + return seqs, sids |
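The reverse-complement lookup built by `generate_kmers` and `generate_rcmap_table` above can be summarized in a compact Python 3 sketch. It is not part of the changeset: `canonical_id_table` is a hypothetical name, and `itertools.product` is swapped in for the explicit loop because it yields k-mers in the same lexicographic A, C, G, T order.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[nt] for nt in reversed(seq))


def canonical_id_table(k):
    """Return (kmers, table) where table[i] is the smaller of the id of
    kmers[i] and the id of its reverse complement."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    kmer_id = {kmer: i for i, kmer in enumerate(kmers)}
    table = [min(i, kmer_id[revcomp(kmer)]) for i, kmer in enumerate(kmers)]
    return kmers, table
```

For k = 2 the table maps the 16 dinucleotides onto 10 distinct ids, because the four palindromic k-mers (AT, TA, CG, GC) map to themselves and the remaining twelve collapse into six pairs. Note that the table is a list indexed by k-mer id, not a dictionary.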
diff -r 000000000000 -r 7fe1103032f7 kmersvm/scripts/libkmersvm.pyc |
Binary file kmersvm/scripts/libkmersvm.pyc has changed |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/scripts/make_profile.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/scripts/make_profile.py Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,101 @@ +#!/usr/bin/env python2.7 + +import os +import os.path +import sys +import random +import optparse + +from bitarray import bitarray + +def read_bed_file(filename): + """ + """ + f = open(filename, 'r') + lines = f.readlines() + f.close() + + positions = [] + + for line in lines: + if line[0] == '#': + continue + + l = line.split() + + positions.append((l[0], int(l[1]), int(l[2]))) + + return positions + + +def bitarray_fromfile(filename): + """ + """ + fh = open(filename, 'rb') + bits = bitarray() + bits.fromfile(fh) + + return bits, fh + +def get_seqid(buildname, pos): + return '_'.join( [buildname, pos[0], str(pos[1]), str(pos[2]), '+'] ) + +def make_profile(positions, buildname, basedir): + """ + """ + chrnames = sorted(set(map(lambda p: p[0], positions))) + + profiles = {} + for chrom in chrnames: + idxf_gc = os.path.join(basedir, '.'.join([buildname, chrom, 'gc', 'out'])) + idxf_rpt = os.path.join(basedir, '.'.join([buildname, chrom, 'rpt', 'out'])) + + #if os.path.exists(idxf_gc) == False or os.path.exists(idxf_rpt) == False: + # continue + bits_gc, gcf = bitarray_fromfile(idxf_gc) + bits_rpt, rptf = bitarray_fromfile(idxf_rpt) + + for pos in positions: + if pos[0] != chrom: + continue + + seqid = get_seqid(buildname, pos) + slen = pos[2]-pos[1] + gc = bits_gc[pos[1]:pos[2]].count(True) + rpt = bits_rpt[pos[1]:pos[2]].count(True) + + profiles[seqid] = (slen, gc, rpt) + + gcf.close() + rptf.close() + + return profiles + +def main(): + parser = optparse.OptionParser() + if len(sys.argv) != 5: + print "Usage:", sys.argv[0], "BEDFILE BUILDNAME BASE_DIR OUT_FILE" + sys.exit() + parser.add_option("-o", dest="output", default="seq_profile.txt") + (options,args) = parser.parse_args() + + bedfile = sys.argv[1] + buildname = sys.argv[2] + basedir = sys.argv[3] + output = options.output + + positions = read_bed_file(bedfile) + seqids = [] + for pos in positions: + seqids.append(get_seqid(buildname, pos)) + + profiles = make_profile(positions, 
buildname, basedir) + + f = open(output, 'w') + f.write("\t".join(["#seq_id", "length", "GC content", "repeat fraction"]) + '\n') + for seqid in seqids: + prof = profiles[seqid] + f.write('\t'.join( map(str, [seqid, prof[0], prof[1]/float(prof[0]), prof[2]/float(prof[0])]) ) + '\n') + f.close() + +if __name__ == "__main__": main() |
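The per-interval profile computed by make_profile.py above (length, GC fraction, repeat fraction) can be illustrated with plain Python lists in place of the on-disk bitarray indices. This is a sketch under stated assumptions: `build_masks` and `interval_profile` are hypothetical names, the masks follow the character tests used in nullseq_build_indices.py (`'cgCG'` for GC, lower-case `'acgt'` for soft-masked repeats), and intervals are half-open BED coordinates.

```python
def build_masks(seq):
    """Per-base boolean masks: GC bases and soft-masked (lower-case) bases."""
    gc = [c in "cgCG" for c in seq]
    rpt = [c in "acgt" for c in seq]
    return gc, rpt


def interval_profile(gc, rpt, start, end):
    """Return (length, GC fraction, repeat fraction) for the interval
    [start, end), matching the three columns written to the profile file."""
    length = end - start
    gc_frac = sum(gc[start:end]) / float(length)
    rpt_frac = sum(rpt[start:end]) / float(length)
    return length, gc_frac, rpt_frac
```

The real scripts store one bit per base on disk, so a whole chromosome can be sliced and counted without loading the sequence text; the lists here trade that efficiency for readability.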
diff -r 000000000000 -r 7fe1103032f7 kmersvm/scripts/nullseq_build_indices.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/scripts/nullseq_build_indices.py Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,129 @@ +import os +import sys +import tarfile +import zipfile +import optparse + +from bitarray import bitarray + +def clear_indexes(sid, buildname): + na = '.'.join([buildname, sid, "na", "out"]) + gc = '.'.join([buildname, sid, "gc", "out"]) + rpt = '.'.join([buildname, sid, "rpt", "out"]) + + #truncate files + for fn in (na, gc, rpt): + f = open(fn, 'wb') + f.close() + + +def append_indexes(seq, sid, buildname): + na = '.'.join([buildname, sid, "na", "out"]) + gc = '.'.join([buildname, sid, "gc", "out"]) + rpt = '.'.join([buildname, sid, "rpt", "out"]) + + f = open(na, 'ab') + bitarray(map(lambda c: c in 'N', seq)).tofile(f) + f.close() + + f = open(gc, 'ab') + bitarray(map(lambda c: c in 'cgCG', seq)).tofile(f) + f.close() + + f = open(rpt, 'ab') + bitarray(map(lambda c: c in 'acgt', seq)).tofile(f) + f.close() + + +def build_indexes(fn, buildname): + save_interval = 8*32*1024 + + try: + f = open(fn, 'r') + + seq = [] + sid = '' + nlines = 0 + for line in f: + if line[0] == '>': + if sid: + append_indexes("".join(seq), sid, buildname) + seq = [] + + sid = line[1:].rstrip('\n').split()[0] + clear_indexes(sid, buildname) + else: + nlines += 1 + seq.append(line.rstrip('\n')) + + if nlines % save_interval == 0: + append_indexes("".join(seq), sid, buildname) + seq = [] + + #the last remaining sequence + append_indexes("".join(seq), sid, buildname) + + except IOError, (errno, strerror): + print "I/O error(%d): %s" % (errno, strerror) + sys.exit(0) + + +def main(argv=sys.argv): + usage = "usage: %prog [options] <Chromosome File(TARBALL gzip (tar.gz) or zip)> <Genome Build Name>" + desc = "generate bit index files for generating null sequences" + + parser = optparse.OptionParser(usage=usage, description=desc) + + parser.add_option("-q", dest="quiet", default=False, action="store_true", \ + help="supress messages (default=false)") + + (options, args) = parser.parse_args() + + if len(args) == 0: + parser.print_help() + sys.exit(0) + + if len(args) != 2: + 
 parser.error("incorrect number of arguments") + parser.print_help() + sys.exit(0) + + chrom_file = args[0] + genome = args[1] + + if zipfile.is_zipfile(chrom_file): + if options.quiet == False: + sys.stderr.write("detected file type is zip.\n") + + zipfileobj = zipfile.ZipFile(chrom_file) + + for fn in zipfileobj.namelist(): + if options.quiet == False: + sys.stderr.write(' '.join(["processing", fn, "\n"])) + + zipfileobj.extract(fn) + build_indexes(fn, genome) + os.remove(fn) + + zipfileobj.close() + + elif tarfile.is_tarfile(chrom_file): + if options.quiet == False: + sys.stderr.write("detected file type is tar.\n") + + tarfileobj = tarfile.open(chrom_file) + + for fn in tarfileobj.getnames(): + if options.quiet == False: + sys.stderr.write(' '.join(["processing", fn, "\n"])) + + tarfileobj.extract(fn) + build_indexes(fn, genome) + os.remove(fn) + + tarfileobj.close() + + else: + sys.stderr.write(' '.join(["unknown input file:", chrom_file, "\n"])) + +if __name__ == "__main__": main() |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/scripts/nullseq_generate.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/scripts/nullseq_generate.py Mon Aug 20 18:07:22 2012 -0400 |
@@ -0,0 +1,258 @@ +import os +import os.path +import sys +import random +import optparse + +from bitarray import bitarray + +def read_bed_file(filename): + """ + """ + f = open(filename, 'r') + lines = f.readlines() + f.close() + + positions = [] + + for line in lines: + if line[0] == '#': + continue + + l = line.split() + + positions.append((l[0], int(l[1]), int(l[2]))) + + return positions + + +def bitarray_fromfile(filename): + """ + """ + fh = open(filename, 'rb') + bits = bitarray() + bits.fromfile(fh) + + return bits, fh + + +def make_profile(positions, buildname, basedir): + """ + """ + chrnames = sorted(set(map(lambda p: p[0], positions))) + + profiles = [] + for chrom in chrnames: + idxf_gc = os.path.join(basedir, '.'.join([buildname, chrom, 'gc', 'out'])) + idxf_rpt = os.path.join(basedir, '.'.join([buildname, chrom, 'rpt', 'out'])) + + #if os.path.exists(idxf_gc) == False or os.path.exists(idxf_rpt) == False: + # continue + bits_gc, gcf = bitarray_fromfile(idxf_gc) + bits_rpt, rptf = bitarray_fromfile(idxf_rpt) + + for pos in positions: + if pos[0] != chrom: + continue + + seqid = pos[0] + ':' + str(pos[1]+1) + '-' + str(pos[2]) + slen = pos[2]-pos[1] + gc = bits_gc[pos[1]:pos[2]].count(True) + rpt = bits_rpt[pos[1]:pos[2]].count(True) + + profiles.append((seqid, slen, gc, rpt)) + + gcf.close() + rptf.close() + + return profiles + + +def sample_sequences(positions, buildname, basedir, options): + """ + """ + rpt_err = options.rpt_err + gc_err = options.gc_err + max_trys = options.max_trys + norpt = options.norpt + nogc = options.nogc + + chrnames = sorted(set(map(lambda p: p[0], positions))) + profiles = make_profile(positions, buildname, basedir) + + excluded = [] + if options.excludefile: + excluded = read_bed_file(options.excludefile) + + #truncate (clear) file + f = open(options.output, 'w') + f.close() + + for chrom in chrnames: + if options.quiet == False: + sys.stderr.write("sampling from " + chrom + "\n") + + idxf_na = os.path.join(basedir, 
'.'.join([buildname, chrom, 'na', 'out'])) + idxf_gc = os.path.join(basedir, '.'.join([buildname, chrom, 'gc', 'out'])) + idxf_rpt = os.path.join(basedir, '.'.join([buildname, chrom, 'rpt', 'out'])) + + #this bit array is used to mark positions that are excluded from sampling + #this will be updated as we sample more sequences in order to prevent sampled sequences from overlapping + bits_na, naf = bitarray_fromfile(idxf_na) + bits_gc, gcf = bitarray_fromfile(idxf_gc) + bits_rpt, rptf = bitarray_fromfile(idxf_rpt) + + #mark excluded regions + for pos in excluded: + if pos[0] != chrom: + continue + bits_na[pos[1]:pos[2]] = True + + npos = 0 + #mark positive regions + for pos in positions: + if pos[0] != chrom: + continue + bits_na[pos[1]:pos[2]] = True + npos+=1 + + if options.count == 0: + count = options.fold*npos + else: + count = options.count + + sampled_positions = [] + while len(sampled_positions) < count: + sampled_prof = random.choice(profiles) + sampled_len = sampled_prof[1] + sampled_gc = sampled_prof[2] + sampled_rpt = sampled_prof[3] + + rpt_err_allowed = int(rpt_err*sampled_len) + gc_err_allowed = int(gc_err*sampled_len) + trys = 0 + while trys < max_trys: + trys += 1 + + pos = random.randint(1, bits_na.length() - sampled_len) + pos_e = pos+sampled_len + + if bits_na[pos:pos_e].any(): + continue + + if not norpt: + pos_rpt = bits_rpt[pos:pos_e].count(True) + if abs(sampled_rpt - pos_rpt) > rpt_err_allowed: + continue + + if not nogc: + pos_gc = bits_gc[pos:pos_e].count(True) + if abs(sampled_gc - pos_gc) > gc_err_allowed: + continue + + #accept the sampled position + #mark the sampled regions + bits_na[pos:pos_e] = True + + sampled_positions.append((chrom, pos, pos_e)) + + #print trys, chrom, pos, pos_e, sampled_len, pos_rpt, sampled_rpt, pos_gc, sampled_gc + break + else: + if options.quiet == False: + sys.stderr.write(' '.join(["fail to sample from", \ + "len=", str(sampled_len), \ + "rpt=", str(sampled_rpt), \ + "gc=", str(sampled_gc)]) + '\n') + + 
naf.close() + gcf.close() + rptf.close() + + f = open(options.output, 'a') + for spos in sorted(sampled_positions, key=lambda s: s[1]): + f.write('\t'.join([spos[0], str(spos[1]), str(spos[2])]) + '\n') + f.close() + + +def main(argv=sys.argv): + usage = "usage: %prog [options] <Input Bed File> <Genome Build Name> <Base Directory>" + + desc = "generate null sequences" + parser = optparse.OptionParser(usage=usage, description=desc) + + parser.add_option("-x", dest="fold", type="int", \ + default = 1, help="number of sequence to sample, FOLD times of given dataset (default=1)") + + parser.add_option("-c", dest="count", type="int", \ + default=0, help="number of sequences to sample, override -x option (default=NA, obsolete)") + + parser.add_option("-r", dest="rseed", type="int", \ + default=1, help="random number seed (default=1)") + + parser.add_option("-g", dest="gc_err", type="float", \ + default=0.02, help="GC errors allowed (default=0.02)") + + parser.add_option("-t", dest="rpt_err", type="float", \ + default=0.02, help="RPT errors allowed (default=0.02)") + + parser.add_option("-e", dest="excludefile", \ + default="", help="filename that contains regions to be excluded (default=NA)") + + parser.add_option("-G", dest="nogc", action="store_true", \ + default=False, help="do not match gc-contents") + + parser.add_option("-R", dest="norpt", action="store_true", \ + default=False, help="do not match repeats") + + parser.add_option("-m", dest="max_trys", type="int", \ + default=10000, help="number of maximum trys to sample of one sequence (default=10000)") + + parser.add_option("-o", dest="output", default="nullseq_output.bed", \ + help="set the name of output file (default=nullseq_output.bed)") + + parser.add_option("-q", dest="quiet", default=False, action="store_true", \ + help="supress messages (default=false)") + + (options, args) = parser.parse_args() + + if len(args) == 0: + parser.print_help() + sys.exit(0) + + if len(args) != 3: + parser.error("incorrect 
number of arguments") + parser.print_help() + sys.exit(0) + + + posfile = args[0] + buildname = args[1] + basedir = args[2] + + random.seed(options.rseed) + + if options.quiet == False: + sys.stderr.write('Options:\n') + sys.stderr.write(' N fold: ' + str(options.fold) + '\n') + sys.stderr.write(' GC match: ' + str(not options.nogc) + '\n') + sys.stderr.write(' repeat match: ' + str(not options.norpt) + '\n') + sys.stderr.write(' GC error allowed: ' + str(options.gc_err) + '\n') + sys.stderr.write(' repeat error allowed: ' + str(options.rpt_err) + '\n') + sys.stderr.write(' random seed: ' + str(options.rseed) + '\n') + sys.stderr.write(' max trys: ' + str(options.max_trys) + '\n') + sys.stderr.write(' excluded regions: ' + options.excludefile+ '\n') + sys.stderr.write(' output: ' + options.output + '\n') + sys.stderr.write('\n') + + sys.stderr.write('Input args:\n') + sys.stderr.write(' input bed file: ' + posfile+ '\n') + sys.stderr.write(' genome build name: ' + buildname+ '\n') + sys.stderr.write(' basedir: ' + basedir + '\n') + sys.stderr.write('\n') + + positions = read_bed_file(posfile) + + sample_sequences(positions, buildname, basedir, options) + +if __name__ == "__main__": main() |
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/scripts/split_genome.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/scripts/split_genome.py Mon Aug 20 18:07:22 2012 -0400 |
b |
@@ -0,0 +1,55 @@ +import os +import os.path +import sys +import optparse +import math +import re +from libkmersvm import * + +def split(bed_file,options): + split_f = open(options.output, 'w') + incr = options.incr + size = options.size + file = open(bed_file, 'rb') + + for line in file: + (name,start,length) = line.split('\t') + start = int(start) + length = int(length) + end = size + start + + while True: + coords = "".join([name,"\t",str(start),"\t",str(end),"\n"]) + split_f.write(coords) + if end + incr >= length: + end += incr-((end+incr)-length) + start += incr + coords = "".join([name,"\t",str(start),"\t",str(end),"\n"]) + split_f.write(coords) + break + else: + start += incr + end += incr + + +def main(argv=sys.argv): + usage = "usage: %prog <bed_file>" + parser = optparse.OptionParser(usage=usage) + + parser.add_option("-s", dest="size", type="int", \ + default=1000, help="set chunk size") + parser.add_option("-i", dest="incr", type="int", \ + default=500, help="set overlap size") + parser.add_option("-o", dest="output", default="split_genome_output.bed", \ + help="output BED file (default is split_genome_output.bed)") + + (options, args) = parser.parse_args() + if len(args) == 0: + parser.print_help() + sys.exit(0) + + bed_file = args[0] + + split(bed_file, options) + +if __name__ == "__main__": main() |
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/seqprofile.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/seqprofile.xml Mon Aug 20 18:07:22 2012 -0400 |
b |
@@ -0,0 +1,41 @@ +<tool id="kmersvm_seqprofile" name="Sequence Profiles"> + <description>provide length, gc content, and repeat fraction of each sequence</description> + <command interpreter="python">scripts/make_profile.py $input $dbkey ${indices_path.fields.path} seq_profile.txt</command> + <inputs> + <param format="interval" name="input" type="data" label="BED File of Regions of interest" /> + <validator type="unspecified_build" /> + <validator type="dataset_metadata_in_file" filename="nullseq_indices.loc" metadata_name="dbkey" metadata_column="0" message="Sequences are currently unavailable for the specified build." /> + <param name="indices_path" type="select" label="Available Datasets"> + <options from_file="nullseq_indices.loc"> + <column name="dbkey" index="0"/> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + <filter type="data_meta" ref="input" key="dbkey" column="0" /> + </options> + </param> + </inputs> + <outputs> + <data format="tabular" name="seq_profile.txt" from_work_dir="seq_profile.txt" /> + </outputs> + <help> + +**What it does** + +Takes as input a BED file and the file's corresponding genome build. + +Returns length, GC and repeat content information for each interval in bed file. + +---- + +**Example** + +Profile file:: + + mm9_chr10_6238300_6238926_+ 626 0.525559105431 0.0 + mm9_chr10_7757450_7758801_+ 1351 0.451517394523 0.0384900074019 + mm9_chr10_8992150_8992551_+ 401 0.411471321696 0.0 + mm9_chr10_9265550_9266026_+ 476 0.38025210084 0.0 + + </help> +</tool> |
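The tool description above says each interval is reported with its length, GC content, and repeat fraction. As a rough illustration of the first two columns only, here is a minimal sketch that works on a plain sequence string. The function name `length_and_gc` and the example sequence are hypothetical, and this is not the code in scripts/make_profile.py, which works from prebuilt genome indices and is not shown in this changeset excerpt.

```python
def length_and_gc(sequence):
    """Return (length, GC fraction) for a nucleotide string.

    Hypothetical helper for illustration; the real tool reads
    precomputed indices rather than raw sequence text.
    """
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        return 0, 0.0
    gc_count = seq.count("G") + seq.count("C")
    return length, gc_count / float(length)


# Made-up example sequence, not taken from the test data.
length, gc = length_and_gc("ACGTGGCC")
print("%d\t%s" % (length, gc))
```

The repeat fraction column is left out here because it depends on a repeat annotation that a bare sequence string does not carry.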
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/split_genome.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/split_genome.xml Mon Aug 20 18:07:22 2012 -0400 |
b |
@@ -0,0 +1,47 @@ +<tool id="kmersvm_genome_split" name="Split Genome"> + <description>split genome into overlapping segments for feature prediction</description> + <command interpreter="python">scripts/split_genome.py -i $incr -s $size $bed_file</command> + <inputs> + <param name="size" type="integer" value="1000" label="Size of Fragment" /> + <param name="incr" type="integer" value="500" label="Size of Overlap" /> + <param format="tabular" name="bed_file" type="data" label="BED File of Regions for Prediction"/> + </inputs> + <outputs> + <data format="interval" name="split_genome_output.bed" from_work_dir="split_genome_output.bed" /> + </outputs> + <tests> + <test> + <param name="size" value="100" /> + <param name="incr" value="20" /> + <param name="bed_file" value="nullseq_test.bed" ftype="tabular"/> + <output name="output" file="split_genome_output.bed" /> + </test> + </tests> + <help> + +**What it does** + +Divides input genomic regions into regions of size N bp which overlap each other by N/2 bp. If genome-wide prediction is desired, a single BED file listing the total length of each chromosome should be provided as input. + +**Parameters** + +Size of Fragment: Size of regions into which genome will be split. + +Size of Overlap: Size of overlap between genomic regions. + +BED File of Regions for Prediction: Regions to be split according to above criteria. + +---- + +**Example** + +Given a BED file, the tool outputs a BED file of regions of length N, which overlap by N/2 bp:: + + chr1 0 1000 + chr1 500 1500 + chr1 1000 2000 + chr1 1500 2500 + + + </help> +</tool> |
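The windowing described in the help text above can be sketched as a small generator. This is a simplified illustration, not the code in scripts/split_genome.py: the name `split_region` is hypothetical, and unlike the script it also clips the first window when a region is shorter than the window size.

```python
def split_region(name, start, length, size=1000, incr=500):
    """Yield (name, start, end) windows of `size` bp, stepping by `incr` bp.

    `length` is the end coordinate of the region, as in the
    name/start/length columns the script reads.
    """
    end = start + size
    while end < length:
        yield (name, start, end)
        start += incr
        end += incr
    # The final window is clipped to the end of the region.
    yield (name, start, length)


# A region chr1:0-2500 with the default size and step.
for window in split_region("chr1", 0, 2500):
    print("\t".join(str(field) for field in window))
```

With size 1000 and step 500, this prints the same four rows as the example in the help text.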
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/tool-data/classify_output.out --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/tool-data/classify_output.out Mon Aug 20 18:07:22 2012 -0400 |
b |
b'@@ -0,0 +1,11848 @@\n+#seq_id\tposterior_prob\tsvm_score\n+chr10:3067520-3067720\t0.99394667801\t1.10838858725\n+chr10:3077830-3078030\t0.998388845249\t1.36539005616\n+chr10:3409515-3409715\t0.98338956029\t0.91099020313\n+chr10:3410770-3410970\t0.979453344413\t0.869061282769\n+chr10:3411485-3411685\t0.998067040613\t1.3300895947\n+chr10:3419500-3419700\t0.99742139272\t1.27419578541\n+chr10:3429690-3429890\t0.999696936872\t1.68895054632\n+chr10:3446120-3446320\t0.996941521387\t1.24107693984\n+chr10:4222150-4222350\t0.982317456854\t0.898675850841\n+chr10:4478080-4478280\t0.99380470211\t1.10387476395\n+chr10:4838795-4838995\t0.997042198361\t1.24757345898\n+chr10:5111120-5111320\t0.994739919377\t1.13572331486\n+chr10:5111550-5111750\t0.99307478152\t1.08217981313\n+chr10:5142480-5142680\t0.997069959966\t1.24940367302\n+chr10:5214690-5214890\t0.990263775801\t1.01570611656\n+chr10:5333335-5333535\t0.990137298446\t1.01318383749\n+chr10:5421860-5422060\t0.988777505133\t0.987924335859\n+chr10:6243415-6243615\t0.999135518642\t1.48600854294\n+chr10:6425830-6426030\t0.998030689378\t1.32647722322\n+chr10:6822900-6823100\t0.998325382839\t1.35790186419\n+chr10:6849435-6849635\t0.998444032474\t1.37214521732\n+chr10:6928895-6929095\t0.99773174648\t1.29907136489\n+chr10:7009070-7009270\t0.996667722925\t1.2244327124\n+chr10:7758410-7758610\t0.999156186484\t1.49069510527\n+chr10:7894065-7894265\t0.998242471582\t1.34853470516\n+chr10:7947875-7948075\t0.985612327004\t0.939226437646\n+chr10:7949865-7950065\t0.975147332907\t0.831390436474\n+chr10:8111115-8111315\t0.99792149358\t1.31601317091\n+chr10:8202685-8202885\t0.984967254871\t0.930612655966\n+chr10:8218850-8219050\t0.996601482952\t1.22061096771\n+chr10:8480395-8480595\t0.994993387266\t1.14532937069\n+chr10:8481590-8481790\t0.995949403025\t1.18651873239\n+chr10:8692925-8693125\t0.998211170856\t1.34511267576\n+chr10:8725420-8725620\t0.999551733676\t1.61317427125\n+chr10:8734765-8734965\t0.993349788109\t1.09007452422\n+chr10:8959825-896
0025\t0.998857366602\t1.43197334132\n+chr10:8992290-8992490\t0.999333727908\t1.53644263088\n+chr10:9182275-9182475\t0.999196233441\t1.50011175016\n+chr10:9289145-9289345\t0.979844507186\t0.872858038275\n+chr10:9357920-9358120\t0.999246599808\t1.51264388299\n+chr10:9360025-9360225\t0.998494450702\t1.37852910529\n+chr10:9361995-9362195\t0.998815694002\t1.42503353581\n+chr10:9827130-9827330\t0.999161474031\t1.49191251632\n+chr10:10669060-10669260\t0.999319439795\t1.53233397276\n+chr10:10997985-10998185\t0.989300257858\t0.997257079188\n+chr10:11826950-11827150\t0.995965189951\t1.18727745906\n+chr10:11910350-11910550\t0.968568234857\t0.784634245889\n+chr10:12059510-12059710\t0.955813691025\t0.716160930437\n+chr10:12315685-12315885\t0.982206852275\t0.897447433235\n+chr10:12459475-12459675\t0.956282753977\t0.718321054286\n+chr10:12546520-12546720\t0.997995969612\t1.32308858306\n+chr10:12577905-12578105\t0.999172792927\t1.49454458403\n+chr10:12706365-12706565\t0.999353155112\t1.54217261745\n+chr10:12940275-12940475\t0.966242823786\t0.770357626836\n+chr10:13024730-13024930\t0.985149220472\t0.933005038782\n+chr10:13106845-13107045\t0.988334062419\t0.980338493146\n+chr10:13394600-13394800\t0.988798318479\t0.988287624809\n+chr10:13395330-13395530\t0.973693834538\t0.820103107074\n+chr10:13635125-13635325\t0.999010006064\t1.45975047746\n+chr10:13739300-13739500\t0.996264558675\t1.20225380753\n+chr10:13740120-13740320\t0.998248223518\t1.34917016168\n+chr10:13805150-13805350\t0.979154223149\t0.866205362191\n+chr10:13899350-13899550\t0.99702036803\t1.24614625996\n+chr10:13913690-13913890\t0.924419160368\t0.605825300716\n+chr10:13915820-13916020\t0.989677759073\t1.00428148692\n+chr10:13919260-13919460\t0.999695552903\t1.68806861426\n+chr10:13927525-13927725\t0.970725594121\t0.798824272551\n+chr10:13928055-13928255\t0.985563831804\t0.938565772067\n+chr10:13930415-13930615\t0.999478792565\t1.58398671095\n+chr10:13936070-13936270\t0.993576217623\t1.09682207257\n+chr10:14033855-14034055\
t0.994744676477\t1.13589932414\n+chr10:14048930-14049130\t0.998201778295\t1.34409746335\n+chr10:16743840-16744040\t0.99953126386\t1.60452967813\n+chr10:16795230-16795430\t0.993760339486\t1.1024854091\n+chr10:16919990-16920190\t0'..b'164490-112164690\t0.989904313364\t1.0086202364\n+chr9:112210465-112210665\t0.939144724901\t0.650817607675\n+chr9:112350045-112350245\t0.9723324971\t0.810068864698\n+chr9:112357395-112357595\t0.996433167858\t1.21122434436\n+chr9:112358375-112358575\t0.99519652857\t1.1533841301\n+chr9:112393530-112393730\t0.969116553303\t0.788149253926\n+chr9:112561720-112561920\t0.986543981925\t0.952363753844\n+chr9:113081155-113081355\t0.999519502984\t1.59973206795\n+chr9:113086370-113086570\t0.992014297339\t1.05440136805\n+chr9:113130855-113131055\t0.998143143746\t1.33787706264\n+chr9:113158030-113158230\t0.998101788247\t1.33360656023\n+chr9:113196600-113196800\t0.976548297472\t0.842895985032\n+chr9:113209170-113209370\t0.999666458531\t1.67040149616\n+chr9:113299700-113299900\t0.998421112752\t1.36931114627\n+chr9:113320710-113320910\t0.999705219621\t1.69431436886\n+chr9:113342025-113342225\t0.998443566904\t1.37208723495\n+chr9:113417000-113417200\t0.996589526329\t1.21992904191\n+chr9:113418075-113418275\t0.997178151783\t1.25670523336\n+chr9:113427660-113427860\t0.984060093037\t0.91909572536\n+chr9:113535975-113536175\t0.998615122328\t1.39471934303\n+chr9:114933215-114933415\t0.998986336216\t1.45517372698\n+chr9:115142995-115143195\t0.99725659986\t1.26217622956\n+chr9:115335915-115336115\t0.999380414499\t1.55050955799\n+chr9:115517175-115517375\t0.991260853829\t1.03680767828\n+chr9:115539380-115539580\t0.96827422879\t0.782773874444\n+chr9:115851750-115851950\t0.99576699313\t1.1779595692\n+chr9:115852250-115852450\t0.998339739029\t1.35957071107\n+chr9:116384830-116385030\t0.996413273803\t1.21014418389\n+chr9:116712615-116712815\t0.985248871915\t0.934327464667\n+chr9:116828830-116829030\t0.999544445224\t1.6100518724\n+chr9:117101265-117101465\t0.9997002588
31\t1.6910839993\n+chr9:117115025-117115225\t0.995494000129\t1.1658127722\n+chr9:117210100-117210300\t0.99688579579\t1.23757212724\n+chr9:117281165-117281365\t0.875138107259\t0.498081883612\n+chr9:117289670-117289870\t0.997738341399\t1.29963608667\n+chr9:118567390-118567590\t0.920294159047\t0.594676834568\n+chr9:118615425-118615625\t0.994415652453\t1.12408431021\n+chr9:118925990-118926190\t0.997225964237\t1.26002134785\n+chr9:119090380-119090580\t0.992376400816\t1.06345158853\n+chr9:119420010-119420210\t0.996665955755\t1.22432977531\n+chr9:119756465-119756665\t0.97007418124\t0.794435641452\n+chr9:119761930-119762130\t0.999762618657\t1.73623218258\n+chr9:119911210-119911410\t0.997490070312\t1.27943281009\n+chr9:119921990-119922190\t0.978520049135\t0.860280814465\n+chr9:120173140-120173340\t0.996039308618\t1.19087962552\n+chr9:120330175-120330375\t0.99928124925\t1.52176133897\n+chr9:120345150-120345350\t0.989425825787\t0.999566007281\n+chr9:120353810-120354010\t0.998037418415\t1.32714086874\n+chr9:120387575-120387775\t0.99531384117\t1.15819155775\n+chr9:120392320-120392520\t0.999559950319\t1.61675574732\n+chr9:120428850-120429050\t0.998938221179\t1.44619057884\n+chr9:120586175-120586375\t0.997535514461\t1.2829773456\n+chr9:120600275-120600475\t0.99831470895\t1.35667030079\n+chr9:120910170-120910370\t0.998963173161\t1.45079717775\n+chr9:120984540-120984740\t0.99800532218\t1.32399559069\n+chr9:121021340-121021540\t0.93015344957\t0.622290170802\n+chr9:121073775-121073975\t0.996321231248\t1.20522314289\n+chr9:121258405-121258605\t0.999999129689\t2.82158382894\n+chr9:121307560-121307760\t0.999655949502\t1.66439659132\n+chr9:121308045-121308245\t0.977434968366\t0.850529738868\n+chr9:121309930-121310130\t0.996304435101\t1.20433839056\n+chr9:121414565-121414765\t0.991036129588\t1.03185070178\n+chr9:121459475-121459675\t0.992539832769\t1.0676769286\n+chr9:121509765-121509965\t0.999116489802\t1.48179158105\n+chr9:121599470-121599670\t0.99465958827\t1.13277479682\n+chr9:12191682
5-121917025\t0.988351292705\t0.980627884817\n+chr9:121949170-121949370\t0.997820499323\t1.306812341\n+chr9:122512290-122512490\t0.997235286154\t1.26067452068\n+chr9:122544610-122544810\t0.987806259045\t0.971672474555\n+chr9:123293790-123293990\t0.983499736443\t0.912299688559\n+chr9:123294500-123294700\t0.996908536037\t1.23899474435\n+chr9:123376470-123376670\t0.998740589566\t1.4131207811\n+chr9:123412475-123412675\t0.973373630292\t0.817698255028\n' |
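The classification output above is a tab-separated file with a `#seq_id`, `posterior_prob`, `svm_score` header. A minimal sketch of reading it and keeping high-confidence rows might look like the following. The function name `read_predictions` and the 0.99 cutoff are assumptions chosen for illustration; the two sample rows are copied from the file above.

```python
import csv
import io


def read_predictions(handle):
    """Parse (seq_id, posterior_prob, svm_score) rows, skipping '#' lines."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        rows.append((fields[0], float(fields[1]), float(fields[2])))
    return rows


sample = io.StringIO(
    u"#seq_id\tposterior_prob\tsvm_score\n"
    u"chr10:3067520-3067720\t0.99394667801\t1.10838858725\n"
    u"chr9:117281165-117281365\t0.875138107259\t0.498081883612\n"
)
predictions = read_predictions(sample)

# Hypothetical cutoff: keep rows with posterior probability >= 0.99.
confident = [row for row in predictions if row[1] >= 0.99]
print(confident)
```

For a real run, the `io.StringIO` sample would be replaced by an open file handle on the output file.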
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/tool-data/classify_test.fa --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/tool-data/classify_test.fa Mon Aug 20 18:07:22 2012 -0400 |
b |
b'@@ -0,0 +1,23694 @@\n+>chr10:3067520-3067720\n+CTCCTTATGCTTCTTAAACAAGCCAAATATGTTCTCCGCTCTCTTGGAATTTGCTTGCATTTTATTATTCTTCTATTTTTTTCCTACATGAAGCATGTGCCAGACGACTAAGCAGATTACTATGCGTGTTTCACACAGATTAGCAGCTGCCTTCCAGCTTCAGGAGATGGATTTGGGCAACTAAAAACAGTACTTGGGGA\n+>chr10:3077830-3078030\n+CTGGGGATACTGTACAGAAGCTTCCAAGTAGCTCTCGTCAGCTTTTCTGTGCAGGGCTAGAACATATGGTACAGTTGGCTTTAATTTCCCAAACCAACATGCTTCTAGAAGGCATGCCAGGGAGCAGATTGCCATGAAGCTCTCTGAGCGGCTGGCCAAGTCTGTTAGGATTATTGACAGTGCAAAGTCAAGTGTAAAAC\n+>chr10:3409515-3409715\n+AACTGGAGGTTCTATGTATAGGCACATTCCGGGGAAGCATAGTTTGGCATTTGAAATGATCTCAAAGGGACAGAAAACCATCTTCACACAGCCTCAGCTTGGAGCCTGGTTTCCCATGCCAGCTAGAAGTGACTTCACACTTCACCCACTCTTGAGCAGCTGCAGGCTTCACTGTTACTCTGACAATTGCCAATGCAAAT\n+>chr10:3410770-3410970\n+CATTTTTATGGTTCAGCTTTCGTGCTTAGCCCTGTCCCTCCTCTTGCAGTTTGGGTACTAGCCAGGAGTGAGTTATCAGTAAGACTCAGCTCCCAAGGCTCATTCTGATTCCCATCTTCATTGCTGACTGCATCATAACGATTCACACATTTTAAAGGTATTTTTTATGTTTTTTCTGAGCTATAATAGCCCCTGTCTTT\n+>chr10:3411485-3411685\n+AAGTGCAGTTCTCAGGCTGCTGACTCAGGGTGCTGACTTAGAGCCTTCTCCATAGCAACCCACGGGGATGACAATCTTGTTAATAACCCACAGATGGACTGAATCACACTCAGAGCATGCATGGAGATGCGAGGGGCAGGCTAATGCCTGCAGAGGCTCATTTGGCAGCGTGTATTAATTATACCCTATTTTGACTTATT\n+>chr10:3419500-3419700\n+GAGACTAAACCACTGTGCTGTGATTTTCTTCACAGAGCTGCTTCTCTGATCCATTCTTTCCTGGCACTGATTTATTATCAACTACGTGTTGGGCACAGATTCCAGAAAAAAAAAATTGGAAGTGTAAAAATGACTAAAAACAGCCTGTTTTCAAGTAGCTCACCAAAAAAGATTTAGCAGTCAATCAGCAATGGCAATAG\n+>chr10:3429690-3429890\n+TAGCAGTTGCCATATGTTAGAGAATTCTTGCTGCTGATTTTCTCCTGCTGTTGGGGATTTGGTTTCATTCCCAGAAGAGCTGAACCACATCAAGTCATCTCCATGGAAACGGCAGCCCAAACTTCCCCCGTGTTTGCTGTGACTTGCTGTTGAGTAATCAAATCGGGTGTTGTGAATGCATGTTTGCCTGTTGGAGAAGG\n+>chr10:3446120-3446320\n+CTGTGACCTTTTAAAGACACTCTGTCACCGACACAGCAGCATCCGTGCATAACAGCCCTCGATTTTATATTCAGTGTGATTCAGAGTTTGAATTCCCAAGTCCCAGATGGCAAGGATGGTGTGTAATTTGCACTTGTAGTTTTGTGTCTACCACAATACTTGGCTCAAAATTAGATTGTTTGAGAATATCGTCCATGCAT\n+>chr10:4222150-4222350\n+ATTCATTGAGATAAGTAGTTCAATCATTTTTTAAAAACACAGACACTCTTCTTCAGCTTTGGGCTCTCTAATTGCAATTCCCACAGAAGGCACAGAAGGAAGGTAGTTTATTACTTTTAGCTTTATG
ATTATGGGTAGTTTATTTTAAGAAAAGCAGGCATAGCCGAGGACTCCTTTGCTATAAATGCACTTTCTTTAAA\n+>chr10:4478080-4478280\n+AATTGAGCAGAACGCCTCCCCTATACATGACAAGATGACCACAATGATCATGATGTGCATTGTTGGATAACGAGCTGCTTGTGGGTATGATGATGAGTCACTCGGGACTCGGTTCATTTACATGCAATAAAAATTACAGCGTGGAAGTGCAAGGAAAAGAGCCAACTCTTTAGTCTTATTTCTTCACTGAGTAAGCTTTT\n+>chr10:4838795-4838995\n+AAAACAGCTACACAGCCTCTTTGGGAGACTGCTAATTATGATTCCAGAGAAAACTCTTCCTAGCATGGAGCCAATAGCAACCAGTTTATTTTTGGCAAGTAGCAAAATCACTTCAGTAGCACTCTAGTCATTAGGGATTGGCACCCTCTCCAAAATTGATTCGATTTAATTCTTAAAAAATACAAAAGTCTCTTCTTAAG\n+>chr10:5111120-5111320\n+TGCAGTAGCAATGCTTTCTCCACACCGTCTAGACTCTCAGCACCTTTCTTTCCTACTTCTTGTGCCAGGCATGCATCTGCTGATCTATTACCTTGTTTTTCCTCAGAAGAGTGCCAGTCCATGAGAGCCAAGATTGTGTTATTTGGAGTGCTTTATACACTGTATCTAAAACATATGTTTCCTCAATGCTAAGCAGCTGT\n+>chr10:5111550-5111750\n+CTGGTAGAAAAATCCAGAGAGTCTGGAGACAAGCCTTACGTTGCCCCCTTCCCAGACCACGCAGACAGCTTTAATTATAATGGGAAGCAGATGGCGGACAAACAGATCCTACAGTGTGTTTGGAATTCTCTCCCATGAGTCACAAATATATTTCAGGACAGTGTGGAAAACACTGACATAACAAGATGATTTTTGATATA\n+>chr10:5142480-5142680\n+ATCATGCAGGCAGGTGCAACTGTGCATCCCGTTCAGCCCGGGGGCCCACGCTGGCAGAGCAAAAGCACATGAGTCATGGGCTGCTTTTCTTAGCATTGTGTGGGCTCAGCCATCAAAACACCCAGAGTGCTAAATTTGACAGTCTGTTGTACTTATATAAAATCAAGAGATTAAGACTCTATTATGTGGATCATCCACAC\n+>chr10:5214690-5214890\n+TCATTTGGGGTGGTGGGTTTCTGTTTATATTCAGTATGTGATACATTGTAATGTGAGTCAGCTGGATGAATCCCAACCTTAGAGAGTATGTGTGAAGATGCTTACAGACTTGAGCTTCAAGCCAAAACCCTAATTAGGAGTATGAAATATGGAAATTACATATCCCTGGCTCTATCTATGCATTTGTTCTTGTGATAAGA\n+>chr10:5333335-5333535\n+TGAGATTTACAGTGTGATGTGGGGGGACATATATAGACAGTAAAGTGGTTACTATAGTGACAATAATTAGCATTGCTATCATCTCCCATAGTGACTCTTTTGTGTGAACCAGGGCAGCTGTTCAAAATTCCAAGTGCAACACAATTCTGTTAACTTCCCTCCTAATGCTGAACACTTACTTCTAAATGTGTCTGTCTATG\n+>chr10:5421860-5422060\n+TAGATCTGAGAATCAGACTTGGGGATCAAAAAAAGAAAAAATGTCCCTCTGAGAAACAAAGCAGATGGCAAAAGTGCCGTGCCACCAACCCCATCTGTATCCCAAGCAGAGCCTTGGTTTATGAAGAGATGGTTGTCATGACGATAAGCATTTGAGAAGAGTTCTGGAGAAGCTGATGAATTTATGAGTGTCTCCTAGCT\n+>chr10:6243415-6243615\n+ACAAGCATAGTATTGAAAGCCAGCCATCCGTTCTCGCTCTGTCATCCAGGAACTGAAATGCTCACAGCAGACATC
AGCTGGCAGGAGGGAAGCCTGTCAACCTAGTTAACCA'..b'GTGTAATATGTAAGCTGGTTAGACAGATTGTTGGGCCCAACAAATAGTATGTGCTAACAAATGTGTATGAATGTATGTGTGTTTATGCATATTTGCTTGTGTAAA\n+>chr9:121073775-121073975\n+CTGTCACATCCTTTCTATCTGATGCTGACGGCTCAGATCCACGGCCAGTTCTCACAGCAGATGGATGGGAGAGCTGACATCACACAAAATGTGCTGAAGTATAGACAATCAATGGATGGCAGAGATGACAGGAGGCAGAAAAATATCCACAGAATACCATAAAGGTTATCACGGGCTGAGGGCCAGGGACGTCCATCAGT\n+>chr9:121258405-121258605\n+CATTGCTGATTACGTCATCCTTACATGTTAGCATCATTGCTTATGCGTTATCTTCATTGCTTCCGTGCCAGCATTGCTGCTTATGTGTTAGCACTGTTGTTTACGTGTCAGCATTGTTGTTACATGTTGGCATCATTGCTTACATCATTGCTTATGTGTCAGCATTGCTGCTTACATTGTTGCGACGTGTTAGCATCGCT\n+>chr9:121307560-121307760\n+TTTTGATCAAACCAGATTTCAACATTATCTTTTTCTGCTCATGAACAACACGGCCGGCAAGCTGTTCTGTACAGATTGTTTTCTTCTTGTGTTAATTCAGCAAGTTTGAGACAGAACTGGACCCAGCTCAGCCAGGAAACTGACATGACAGCTCAGCTCTCACTCTCGATCCCCACGGGCTGCCGCTGTTTGGCAGGGTT\n+>chr9:121308045-121308245\n+TCCCTCCCCTGGTGGCGGGGCTGCCTGTGCCTGGAGTGAGAAGGGACAGTTGGCATCAAGCTGTTCCTTGCAGACTGCCTGCTGTCATGAGCTCTCACTGGCTGGCATTGACAGTCTCAGCAGCGAAGATTGGAAAGAGAAAGCTGGAACGTCAGGCAGAGTGATGGCACGCCTGCTCTCTAGCCAGGTCCTCCCTCCCT\n+>chr9:121309930-121310130\n+GCTCCGTTAGTCTCTGCCCTGTGTGTGAGAGCTGGCGCATTCCCTGCCTGCCTTGCTTTGGATATTCACTGTGGCGTCATGGCTTTCGAACACACCACCTGTTCCTTTTAGGTATAATTAACTGTATTTTCGGTACAATGCTTAAATTGTCACAACTCAGTTCAGCTCTTCTAAGCTGGCTTTTTTGTCCATCTTTTCAT\n+>chr9:121414565-121414765\n+GGGTGAAGGAAGGAGTTTATTAGCTTCTGTGGGGATGGCTGGCAGCATCCTGCCTCCAGATGGGTGAGGAACTTCCGGCTGCCGTACCCAGGGCTGCCAAGGACTTGCAGAGGAGTGAGAGAATGATTTTTGTCTGCCAGATTGTGTTGCTTATGTAAGAGTTTGGAGCTTGGAGTCAAGAGGAAAAGGAGAAGCTCAGC\n+>chr9:121459475-121459675\n+GTGGCTTCCAAGTCCCCTGGAAAGCAGCCTTAGTCAGTAGCAAGAGCCTGTAATTACACCAAGCAGCCGTGGTGCTTCAAATGATGATTAAGGCGTGAATGAAAAAATAAAGCTCCTTTGTTTTGTGCTCATTATGTGCTCTTTCTGATGACTACATCAGAAAAAAAATCCTGCTTTTCAGACAGTTGGCTTAATTAGCA\n+>chr9:121509765-121509965\n+TCAGGAAGTGCAAATTCGTTGCAAGTTTGCCAAGTGTGGGACGCTTCTCTTCTTTCAGTCCTTCACATCCTTTCGTGACTGAGACCCGTGGTTTGCACAAACGGGTACATATTTAGCCTGATGGCTGAGGAGAAACACTTGCCAGAATTACACGCTATCTCCCTTGCTTCACAGGCTCCAGGCAGGATACCTCAGAGGTT\n+>c
hr9:121599470-121599670\n+CACTCTTCATCTGGGAGGAGCTCATGTGGTTAAAATGCCACTTTGTTTTCTGAATGACACTGTCTGCAGAAAGAGAGAGGCCAAGTTGAGCATCGTCAGAATGGTGCAGATGGGAAGCGAAGCGCCCGTTGCTAAGCAACCAGTAAGTCAGATTCCGAACAGAGTCAGTTATCGGAGGTCAGGTGACTGCTGCGTATTAA\n+>chr9:121916825-121917025\n+TCATTTGAGTACAAAACTCAGTGTTTTAAACCACAAGTTTTATGCGTTCACAGCATTGACAACACATTACAAATTATATGATTGAGAAATAAGTCAGACTAAGTGACTCATGCTTGTAATCTTAGTGCTTGAGAAGCTGAGGCAGGAGTATTGCCATGAGTGGGACTCCAGCCTGGACAGTATACTAATTTCCAGACCAG\n+>chr9:121949170-121949370\n+GTTCTCAACAGAAGCTTCAAAGGTCAGGAGATGGGGAAATGATGTATTTCCAACCCTGAAAGAAAATAGCTGCCAAACTCAATTACTATATATAGCGAAGTTGTTCTTCCAAGTTGAAGGAGAAATAAAGACCTTCCAAGATAAGCACGAAAGCCTGCAGAAGATGCTCAGAGGAATCTTACACACAGAAGTGAAGATAA\n+>chr9:122512290-122512490\n+GGCTTGCTTTTAAAATACTGTCTGGAAATCTTACTCATGCCTCCATGCTCCTCTTACTCATCAGCACACTGAGTCAAGCGTTTAAAAGCAAAGCTGCTCTCTTCCTCTTCCTGCTGCAGGATGTGTTGTCAGTGTGGTGCTCTGTGTTTCTTCCCACAGACTGTGGGCAAAGCCACCAACATCTTTTGAGCAAAACTGCA\n+>chr9:122544610-122544810\n+TGTTTTCAGTATAAAATTTGGGCCTTTATCCTCTTTGAAATAAAAACATAGCATGCATATTAGACAGTATTGGCAGTGAACTTCACACACTATTCTTTTATTCCTTTCCAACTATTAACTCCTATTAAACAAGACAGAGGGATAATCAGATCCCCTGTGCTCCCCCGGCTGCCTGGCGTGTGCGCTGTCTCATCCTCCCT\n+>chr9:123293790-123293990\n+AAAAAGGCTGTTCTGACTGTTGGCTGCTCCTTAAACTCAAAAGAAGCCTGATGACATACTTAAGCTTAGCTCCATATGGTGACTTTGTTGTGTTTTCATATGTTGTATGCACAGATGGTTGGAAAATAAATTCCCTGCCTCCCACCTCAGCTGGTGTCCCCAGCACCTTCTGGTCCAGGAGCAGGGTGATGAAGCATCAG\n+>chr9:123294500-123294700\n+TTATACTGGCTAAATGAAATTAAACACTGCATTTCTCATTTCTTGGAAATAGAGTGCTGAACTCACTCCAGAGAAGCCCAGACACTGAAAATAGCTTTTCTTTGTGAAGCAGTAACCATGGTGCCTGAATCTGAATCTGAATTGGTTTCAAATATAAAGCTTTGTTTTATGGGGGAGGGAGAGAAGCTGGCCATTGCTAC\n+>chr9:123376470-123376670\n+TCTTAATTGTCTGCTTTTTTATCAATGTCTTTAATTATCTTGTAAGTATCTAAAGGCTCTGCTGACACTGCTTAACAAACACCTCCTGAGTGACAGTCATGTACCTTCTAAATCTGTGAGAGCGGCTGACTGTGGTCTGAAATAGCTGTACAATCAGAGTCCTTTGCTGTCTGCCAAACTGGGCTACCAGCCCTTTGTTC\n+>chr9:123412475-123412675\n+AGTGCTTCCTCATTTACTTAGCAGCTGTAATTGCACCAGTGCTCACATGGTAACCCACTTAGTCACTGCACAGTCTGGCTTATGAGTTATTAAATGTCTTTATTCAGTTGCTTTTTTCTGAAAAAA
CAAAACAAAAAACCCAAATCACTGAGCTGTTTCCTTGGCATTTATTCCCAGAAACTGCGTTCAGGGAGTGGAAG\n' |
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/tool-data/nullseq_indices.loc.sample --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/tool-data/nullseq_indices.loc.sample Mon Aug 20 18:07:22 2012 -0400 |
b |
@@ -0,0 +1,12 @@ +#This is a sample file used by several tools in the KmerSVM suite. +#The nullseq_indices.loc file has this format (white space characters +#are TAB characters): +# +#<Build> <DBKey> <FullPathToFile> +#Your nullseq_indices.loc file should include an entry per line for each set of +#indices you have stored. For example: + +#mm9 Mouse(mm9) /home/example/galaxy-dist/tools/kmersvm/mm9 +#mm8 Mouse(mm8) /home/example/galaxy-dist/tools/kmersvm/nullseq_indices_mm8 +#hg18 Human(hg18) /home/example/galaxy-dist/tools/kmersvm/hg18 +#hg19 Human(hg19) /home/example/nullseq_indices_hg19 |
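To make the three-column layout above concrete, here is a minimal sketch of reading such a .loc file into a dictionary keyed by the first column. The function name `parse_loc` is hypothetical. The column order (key, display name, path) follows the `<options>` block in seqprofile.xml, and the sample entry is the mm9 line from the file above with its comment marker removed.

```python
def parse_loc(lines):
    """Map the first column of each entry to its (display name, path) pair.

    Comment lines (starting with '#') and blank lines are ignored.
    """
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows rather than guess
        key, name, path = fields[:3]
        entries[key] = (name, path)
    return entries


sample = [
    "#<Build>\t<DBKey>\t<FullPathToFile>\n",
    "mm9\tMouse(mm9)\t/home/example/galaxy-dist/tools/kmersvm/mm9\n",
]
indices = parse_loc(sample)
print(indices["mm9"][1])
```

Galaxy itself reads this file through the tool XML, so a parser like this would only be useful for checking entries by hand.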
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/tool-data/sample_roc_chen.png |
b |
Binary file kmersvm/tool-data/sample_roc_chen.png has changed |
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/tool-data/test_negative.fa --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/tool-data/test_negative.fa Mon Aug 20 18:07:22 2012 -0400 |
b |
b'@@ -0,0 +1,1048 @@\n+>chr10:6238184-6239049\n+AGGTAAGGGACCGCGTTCTTCCCCATGCATGCTGGAGGTACCATTAATGACTTGCTGAATAAGAGAATGAATGAATGAATGAATGAATTCTATTGAAGAATGGAGAACAGTTCTGATTGTGGACGTCAGGGAGTGGGGATTGGAGGTTAGCCTTGTATCTCAGTATCTCCGATGCCTGATAGCAATTGAATGTGGAATGACTAAGCATAACATTGCTTCATCCAGCTAGAATTTTCCCACCTGCAGGGGTCAGTTCCCTGAGTGTCATCCCGCCCCCCACCCCCCATTCCTTCTCCTTAGTCATCCCACCAGGAAGTTACCACAATGTCTCTCAGAGGAGTGATGAATGCTTGATGCATTCCAGCAAGTCTGTGGGAGGGCCCAGAGTGCGGACAAAGAAAGTAAATTCCACCCTTATTAGGAGATCCTGGCCTCCTGGGGGAATTTGCCCAGCTTGCATATGGCTTCCTGTACTGTCACAGAGTGCCTCCAAGTGACAGTGTCCTGCCTTCCTGGACCCCCTTTACTTAGGCCTCACTTCCTTCCTGATTCCTCAGGCCTGATTTCGGGCAGGCAGGGACACTCCCTCTGTCCCTGAGAGCAGCCACCAGGGGAACTTCTGCGTTGAGATAACGTCACTGATGGCTCACTCCAGGAAACCTGTGAGTAGCTGCTGCTGCTCCAGGACTGTTCCATGCATTCATTTATTACAGGTCATAAGTCAGAGAGCCAGGCCCAGGAACAGGCTGGGGGAAAGAGTCCACGCTGATTTTACAGGACAGCTTTGGTCTTAGCCTGGCCATGGAGGTGGGGGTGGGGTGGGGTGGGAGATTCACCACAGGACCATCTTCCTGGCTTCATCTATG\n+>chr10:6345620-6346433\n+CGTACATGGCTTGGTACATGGACCACACAGTCTTCGTGCACACACGAGGCACATGTACATGGCCCGGTACATGGACCACACAGTCTTCGTGCACACACGAGGCACATGTACATGGCCTGGTACATGGACCACAGTCTTCGTGCACACACGAGGCACATGTACATGGCCGGGGCTGTGGGGAATGAGCCTCTACATTTACACTTTCTAGTTCCCAGTCTCCCACTGCACATCTGCAGATCACATTTCTGCATCTGCTCTCTTGGAGGAGCTCTCTGGGCCTTAGCTGCTTCCCCTGAGAGGTGAGGAATCCAGACTCAGGGCAGGTGGAGTCACATCACCTGGACTTGGGCCTAGACCCTGCTTTACCAGCAATGGTCAGGCTCCAGGCTCTGGTTTCTAGTGACTTACATGCATACAGAAAAGACTAGGACATGGACTCTCCATACAAACAGACAAGAGTGTGCCATTAGATTTCTTATTTTGTCCTCAGAAACCACTCGCTTTGTGGATGAGGGAAATACAACCCATGTGCCAGTAGACTGAAGGTGACACACCCAGGCTGTTTTCTGCAGATCACAGGCTCAGAGCCCTGCAGGACTGATGCAGTGGGGCTCTGCAGTGCTGTGTATGCCCCACTTCCACAGTGCACTCTGATGACACGGGGTGTGACACAGGTGGCTGCAAAGTGACACGGGCTGTCTGAAAGCTGAACCAAAACCACGGCCTGGAGAAACATCTCGGTCCTGATCCAGGTTTCCCACGGAGAACGAGACAACCGCTAGCAACACAATTCCACTCCTGATTTGTTAAACTG\n+>chr10:8805039-8805846\n+ATGGTGTCAGCGTTTGGATGCTGATTATGGGGTGGATCCCTGGATATGGCAGTCTCTACATGGTCCAGGAACGCCAGGGCCAAAAAGGGAGAGTGGGTGGGTAGGGGAGTGGGGGTGGGTGGGTATGGGGGACTTTTGGTATAGCATTGGAAATGTAAATGAGCTAAATACCTAATTAAAAAAAAAAAGAACTACAAGACATTTTGTTTCTCCTTC
AACCGCCCTTGTCAGTCTCAGAAGTGCGTGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTTGGCCTAGGGGCAGCAAACTTCTCCAAATGCTACAGCCTATTCTTTCAAAGCTCACCTCCTGGGGTAATGGGTCATTAAAAGACAAAAGAGCGGATCAAAAGAGTAGAGTAGATAAGTGTGAGCCAAAGACAGAGGGAAGACAGCCTAGACTGAAAAGGTGGATTCAGTCATGAGGTGATGGCAGCTTAGGCGCCTTGTCTTTCCTTCACGTCTATTGTACTTACAAGGAGGAACGTGTCTGTGAGATCTCTTGTTCTTGTAATTCCCACTCTAAGAACTTATACATTATTTATTTTTATTTGGGTCCCGTCGTCTCCTGGAATGTTATCTTGACACGCTTGCCAGTGATCTGTTAATTTTTATTTTTAAAGATGTGTGGAGTCTCATTTACTCCAGAGACATGCGCCGACTTCTAAATTCCCAAGCTGGTTTTAAAATCTCAAGTCTTATTCTTCCTCCTAGCCTGGCTCTGCTTTTTTCCCACCTCAGCATTATGTCTGTGCGA\n+>chr10:12696947-12697763\n+GTGGGTGGGAATGATACAACCATACACTTGCAAATTTGAGCATGCTGCCAACCCCGGATTGTAACTGTCATATCAAAGTGAAAACTCAGTTTTTGCAGATCCCTTAGACCCCTGCAAGCATCAATGCTACTTAGAAACGTTGGCCTGCACTACATCAAGTGGAAGAGAGATTGGTCCTGCCTCTAAGTTTCTAGCCTGCCATCAAAGGCCCCCCAGGGGCTGTCTGACCCTGATGCTTCTTTTGTGAAGAAAGGACCTGCCAGTGGTCAGAGGCTGCTGGCGTTGGGACTTGCATGGCCATTTTCTTCTTGACCTCTTAGGAGACCTGCACACTATTGAGCCCATCAACAATCCATCATGGAGGCTAGAGGGACTGTCAAGGCTGAACCCAACTGATGATTTAGAAGGAGTCAGTGGTTACTGGTGGAGGGAGAGACATCTCCCTTAATGATGCAGCCACTTCTAAATCATCTTCATTGATGGGAATAGCCCTGGTGAACTTTCCTAGGCTGAGACAAAGGGACTGGGGTAGGAGGAGGGACTGGTTGGGAAGATAAAGGGGATTAGCAGGAGGGAGAGGGGAGAAAGAGAGGGTGATGGGGTGAATGTAATCAAAATACATTATATTCATGTTCCACAGTGTCATAACGAAACCCGTCATTTTGTATAATTAACATACACTTGCCCACGTACACACCAAAGTGTAAGCTCTGAAACCAGACTGCTTGAGTTGAACTGAATCATCCATTCAACTTTTTGTTTTAAACAGGCACTGGATTATTGAGTTTGTGCTTCCTCCCGTAGGAGAGAGCTTGAT\n+>chr10:20821578-20822392\n+CACTCCACATTCAAACAGCAAGATGCGCATCCAGAGAAAAGCTTTTAACGGGTGAGGGGCTACAGAGCATCAAGGGGCAGCTTGTGCTAGGGAGTCATCTACTTTGCTCTCGGGTTATTTGCGGTGTACATCTCAAGTGCACATGTGTAATGTGAAGTAGACAGTGTTCATGCTGCACACACCACCACAGCAAGATGAATAGCTTATAACCTGCTGGCATGCTTAGAAGTACAGCGAGTAAATGCTCTGCTCACAATGGAGCGCTACACCTCGAGGTAACATTTGTTCCATCCCTAACTCTATTCTATTTGACATGCTAATTGTCTTCCTTTTTCCAGAGTCCAAGCATTTGGTTTGGAAGAATAAGGCCAAAGACTCACATTGTTCCAACCGAAGGGCCAGCTGCTGGCTTTTGCACAATGAGCTTCATTGTTGAGTAAATGGAGTCTCTCAGGTGAATGTAATTATGCAAATCCTTCCCTAACTACTAGCCTTTCCTGGCCCTTCCCCAGGCTGCCCTGCCCTGGGCAC
ACTGTCACTTC'..b'CACACTGGGGACCACACAGAGTGCCTCTTGCGCCAAAAGCTGGAAGGTTCGTCGGTTTGGCTTTTCTTTTGACCCGGGCGTTGCATACGAAATGTAGATATTTTCCTAAACGCCGGCTTTGACATTTAATGCGTTCCTAAGTAGTCTGAAATGTGGACTGGATTAGGAGGAGGCTAGGGGCCCACCGGGACACTCAGGCAAATCCCCTTAAATCCTCCAAGACAATTGTGTGTCACCACACGGATTCGCTTTGCCCCAAATGAGGAGGCTAATGAACTTCACTGCCTCCTTCCGTGACAGCCCACTCGCACCTGGCCGAGCCTCTGCGCCCGCCTCCCAGCTTTGTGTCAGGGCGTGTTTACTTTGTGTTTACTTTAGGCCAGGCCTGTTGGCTTCTAGGATCCAGTCTGACCATTAACTAAGGCACCTATCCTGGGCCATTGCTCAA\n+>chr9:119032069-119032895\n+AAATCAATAAGTTCGTGGGGACTGAGGGGTATGGTCAGAGGCAGATGCAGGGGCCACCAGTTCTTGGAGAATCCTTAAAATCCGTAGACAGCAAAGCCGACATTTGGAGACGCCTCCAGTGTGAGCGCTGAATTTTTGTAATTTCTTCCACCTATTGTGTAACCCTCTGCCTTTTTCTTTTTCTTTTTTTTAAGGCTGTTGCAACTACTGGCTTTGTAGAGCAGCCTCCTTTTGGAGTAATGCCTTCCGTGTTTGAGCTGGCCCCAGGGTACGCCATACTGTTGGAGGTAGGTAATGTGACCTGTCCCTTCTGCTCACCCTTGGTGGCTTCCTCAGAAATCCCATTCTCACCTCCCCCTCTGTGTTCTCGTGACTCACGCCCACACTGTTCCGACCCAGGTCACAGCATCAGTGGGTTCTGGATGGGCCATGTGCTTGTGTTCTGCAGGGTGATGCATATGGCTGTCTTATGACTCCCACAGGACATCCTCTGGCCTCCAAATACACCCCGTGGCACACCATCCCACATACGTACTAACTAGCTCCATTAATTTAAAAAGTCAGTGCAGGGCATTGCGGTTTGAATTTGAAATGTCTCCCATAAGCTCATCTTTCAGTTTCCTCAGCTGCTGGCATTATTTGCACAGTCTATAGAACCTTTCACAGGTGCGGCCTCACTGGTGGAAGAAGGCCAGCAGGGAGTTACAGGGAGACCGCGGATGATTACAGCCAGACCTCCGAGGGTTACAGGCAGACCACTGAGGGTTACAGGCAGACCACTGAAGGTTACAGGCAGACCTCTGAGGGTTACAGGCAGACCACTGAGG\n+>chrX:55266955-55267798\n+AGTTTTTGTCCTAAAGACCCTCAGGCCATCTTCTACCAGCATACATGATGTCTTGTTAAGATGACAGCAATGAGTCACTGGATTGTTCCTAGACACCTTGCCCTGAAAGATCAATTGATGGATAACGGGTTCCACAAATGCCTTCCAAAAAGGAAAACAAAACTATACCAGGCCCTCTATTCCTCCGAGTGGAGCTAGGAGCTGCGAAAGGCAGGCAAGTGCAGACTTTCCCTGAAAAGACGGGGCCCCAATGGTGCACTAAGTACACACCTCATTTTTCCAACAAAATTCCAGTGGAGTGACTTTCTGACTACCTACAGAGAACCTGTAGCCGGTCTACTCCGTTAGTACTGGACTGTGCCTTTTGTTTTGAGTCCTCTGTTCTAGAGCAAGGGATGTCGCGGAAGGTGACCTTGAACACCTGATCTTCCTATCTCTATTTCCTAAGTGTTGGAATTAAAGGCGTGAGCCACCACGCTGGGATGGACTCCAGCTTTGTAAAGATGTTAAGTGTGATCCCAGCAACCGCCAAATGCTACCCAGTGTCACCCTGGTGCTCAGGAATGCAAGACTCTTTAAGGAACTGCCAGCCAAGGACCCAGAGGTGATTTGAATTACAAGCTGCCTTGAGCCAGAGTTAACTCTCTCA
CGTACTGCCTGGCCTCTGAGGCCTCCGGGAGCGCCACAGGCGTGGTGTTTCAAAGGCGGCTCTGGGGAAGTCTGAGAATTTCCCAGAGAAGTATTCCGCCACAAACTACCGGGCTCCAACACCTTAGTCCAAGAAGAATCCCTCCAGTTCTTCCTTAGAGAGTCAGAACCGCAAAGTTTAACAGTGCGCTGTCCT\n+>chrX:96751955-96752813\n+CCCTGAACCTCAGCAAATTACCCCGGCTTTGGGAGGAGATTGGAAAAGCTTTTTTTTTTTTTGAAAGGGCATATTGTGTAACTCAGAAACCCTATTTTTAGGTATTTATCCTAAAGAAATCATTGTGGATGTGAGTGAAAACTCACGCCCTCCGTGTCCTGGGTGTTACTTATAAAAGCAAATAACTGTGTTGAATGAATGGGGAGAGGTTTCAAATTGTGTAATGGAGATTTCTGAGACTGGGCATTAACATTTTTGAATCCTCAAAGGCTGGAGGCAAGAGACCAAAAGGGTAGTTAAAAAAAAAAACTGCTGTGGTACCCAGGTCAGGTCTCTAAGGGGAGCCTCTAAAATTTGTCCTGTGACCAGACGGTGCCTTTGTTTGGCTTTGAACATTCAAGCTGGCCTCTTGGTCATGCAGATGTACCTCAAGTACCTATTCTTTGTCTTGGGGAGGCCCTGACTCCATATGCAGATGCTATAAGCGGCGATGTTAAGCATTCTGTTTAATAATGTAAAAGGCTGTCTTCGCGGTGGGCTGGGGGGCATCCTCACAGACTCTCTTGTGATGTTAAATGATTTTCAGCTTGCTCCGCTTAGGCTTTGCAATAGCAAGGGGGCAGGTGGGGCACGGGGAAGGGGGGGTCACTGACAACCCTGGCATGAGAATAGAAAGGACAGAAGAAAAGAGCTGTGGGTTTGAGGCGCCTAGGTTCCCAGGCCTGAGGCAGAAGTCTGTTTTAAAGACTGTGACCGCCAGCAGTGAGCCCTGGGAAAGGGCAGAGATTCACAGAGGAAAGCAGGTCGTTGTGGCGATTGCAATTTCTTGGATCAACATTGCTCTAAAGGAAGGTCTGTG\n+>chrX:154039112-154040019\n+AAATCTTTTAATCCCTCTAACAAAAAAAATGACAGGGATTTGGCCAAAATGAAAAGTTAAAATTATACTTACTGCTCAAAATTTCAAAAATTAAACACGGTTGAAATTTTATCTCCCTACTATTTTCCAAATGTTCTTTTGTATAAATGTGTTAGAATTTGTACTTCTGAATTTCTAAGTTTATTGGTACACTTTATTTTTTGATTTGACCTTTGTTTGCTTTCTGTAATGGCTCTGTAGTTTTAACCTTGTTCCTTCCTGGGAGGAGGCCATTAACTTAACAATGGAAATAGTTTCTTTTTGTTAGCCCAGAGCTTTGTCTGGACCTTGTACTAGGACAAGAACAAGCTACTGCGTGGCTATCTGATAACCCAAAATGTGAAGGAAATGAGTTGATTGTGATTTTCTGTTCCCAGCTTAAAGATTGAGTTCTACATCTGAGTTGATTTTCCCACCAGGAGCACAGAAGTCATAACAAAGCACTGAATTGCCTCCTGCTGGCTGCTGGAATTTAGCTGTAGTAGCAGCCAGCTGGTGGCAGCAGCAGAAGCACGATGGGAGACAGAATGGCAATGAAGGAAGCCCCGATGAAAAAAGAAAAAAGATGAAAAAGGTATTTTTCAGTAAAATAAACCTGTGGCTTAATTGGCAAAACATTTTGGATTTCTGTCAAGGAACTGTATTAAATAAAGGACACATAACTGCGCTTTTATTCATAGAATTTTCATTTCTTACAATATTTTTAGAGGGACATCAAATATGTATTTTGGCTGAGTGCAACCAGCTGCATATTAAAAATTGATACTTGGCTGAAACAGAGTATCAGTATATTTTCTAACTTCAGAGATAAATATATATGAAATATAAATTTTTAAAAGTTGTATAA
TTTTAAATTAAAGCATTGATAT\n' |
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/tool-data/test_positive.fa --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/tool-data/test_positive.fa Mon Aug 20 18:07:22 2012 -0400 |
b |
b'@@ -0,0 +1,1048 @@\n+>chr10:6238484-6238749\n+TCATCCCACCAGGAAGTTACCACAATGTCTCTCAGAGGAGTGATGAATGCTTGATGCATTCCAGCAAGTCTGTGGGAGGGCCCAGAGTGCGGACAAAGAAAGTAAATTCCACCCTTATTAGGAGATCCTGGCCTCCTGGGGGAATTTGCCCAGCTTGCATATGGCTTCCTGTACTGTCACAGAGTGCCTCCAAGTGACAGTGTCCTGCCTTCCTGGACCCCCTTTACTTAGGCCTCACTTCCTTCCTGATTCCTCAGGCCTGATTT\n+>chr10:6345920-6346133\n+TGAGGAATCCAGACTCAGGGCAGGTGGAGTCACATCACCTGGACTTGGGCCTAGACCCTGCTTTACCAGCAATGGTCAGGCTCCAGGCTCTGGTTTCTAGTGACTTACATGCATACAGAAAAGACTAGGACATGGACTCTCCATACAAACAGACAAGAGTGTGCCATTAGATTTCTTATTTTGTCCTCAGAAACCACTCGCTTTGTGGATGAGG\n+>chr10:8805339-8805546\n+GTTGGCCTAGGGGCAGCAAACTTCTCCAAATGCTACAGCCTATTCTTTCAAAGCTCACCTCCTGGGGTAATGGGTCATTAAAAGACAAAAGAGCGGATCAAAAGAGTAGAGTAGATAAGTGTGAGCCAAAGACAGAGGGAAGACAGCCTAGACTGAAAAGGTGGATTCAGTCATGAGGTGATGGCAGCTTAGGCGCCTTGTCTTTCCT\n+>chr10:12697247-12697463\n+TTTTCTTCTTGACCTCTTAGGAGACCTGCACACTATTGAGCCCATCAACAATCCATCATGGAGGCTAGAGGGACTGTCAAGGCTGAACCCAACTGATGATTTAGAAGGAGTCAGTGGTTACTGGTGGAGGGAGAGACATCTCCCTTAATGATGCAGCCACTTCTAAATCATCTTCATTGATGGGAATAGCCCTGGTGAACTTTCCTAGGCTGAGACA\n+>chr10:20821878-20822092\n+TATTCTATTTGACATGCTAATTGTCTTCCTTTTTCCAGAGTCCAAGCATTTGGTTTGGAAGAATAAGGCCAAAGACTCACATTGTTCCAACCGAAGGGCCAGCTGCTGGCTTTTGCACAATGAGCTTCATTGTTGAGTAAATGGAGTCTCTCAGGTGAATGTAATTATGCAAATCCTTCCCTAACTACTAGCCTTTCCTGGCCCTTCCCCAGGCT\n+>chr10:21465959-21466182\n+GGGTGTGTGGATTAGTCTATGCATTTCGGTAAAGGGAGGTCTGCCATTGTCTGTGCTTATCCAAGGTGAGATAGACGAACAATGGCCCAGCAGGCAACCTTCCTGGAGTCTGATCTCCGTCTACAGACAGCCTCCTGTTTCTCATTTAACTTTTATCACCTCTAAGGATGTAAAATCTCAGCTTACCAATCACTAGCACAAAGAAACTTCAGGAGAAAGGAGGA\n+>chr10:21548928-21549240\n+GTGCAGAGCACCAGATCACAGAGGACAGCAAACAATAGCTGCTGCCCTTTTGCTTCTTTCCTTTGGGCTGGGCTTGCAGCTTCCAACCACTGGTGACTCAGTGAGTTCTGAGTGATTGCAGCTGTACATAAAAGGAAAGCGCGGAGCGGGAGGAGGGGTTGGGGGGAGGATGAAAGGAGGCCTTCCTCCGGGAAGTGGCATCTTTGCAGACCACCCATTCATTAGTAGGAATTAATTAGTCTAGGAATTGTAATCCAAATTCCTTCAAGGGATCAGGTGGGCGACACAAGCCTGTTCATCTAACCTGCAAGAC\n+>chr10:21708326-21708533\n+TTAAAACCTCCCACTTCACTTTCCACTGATTTTGACTCTCAGCCATTAGCCTTGACTTCAAACACTTGGTGAATTATCCTGGCATTCAG
TTCCTGGAGGTGACATCTTTGGGAACCTTGTAGGTGGGACCACGGCGCGAAAGCAGGAAGCTGAAGAACAAAGGTAGCAATCTTTCAGCCCTTTATCATGGAATTCCTACCACCCAGGC\n+>chr10:28289736-28289985\n+AAGATGACCCTTCTTCAGATTACATTAATGCCAACTACATCGACGTAAGTGTCCTTACGGTGTCTTTCCATAGTGTGCCCACTTTATCAGCTCACGTCTGTGAACTCAATGTTTGTTGTTACCTATGACTTAATTGCTACCTTTAATGTAAAGAACGAACTGTACACTTTTCCTTGTTCTACATCCTAGTACTTACTGACACTGTTGTGCTTTCTTTCTCACCTGACTTTGGTTTGGGCTGTAGATTTGG\n+>chr10:30457257-30457471\n+ATTGTGTATCTGACCATGGGTTTGGAACTATTCCTTGGAGCCTGATGAGTCCATCAGTGGGGAGCAACCGAAGACAATTACTACCCTTCCCCCTGACCCTATCAGAAGACAATAGCTCAGTAGGTTCTAGAAGTAGCTCCCAACCCCCACCCCATGGGCCCTTCCCAGATCCATGGTTGGCTATTGACAGGGACAGTCTTGTACAGTCTAGTCCC\n+>chr10:33856167-33856381\n+TCTGGCACGAAGTATCTACGCAGTCCAATTATGTCTACGCTCTCCCTCACCCCAGGTACCTGGTTTCCAACACCACTCATCCCTTAGACTTCAACAACACACCTTCCTGTTTGCTTTCTTGCTGCCGTGGATGGTGTGGATGGTGCAGTGGATGGTGTGGATGGTGTTGTGATGGCTGTTCTTCACTTGGCAAGCTACCCTCCCTTCCCTTTAGG\n+>chr10:51392420-51392658\n+GCCCCTCCCTGGATATCTATGTGGTCCCTAGCACTGCCTGCATCAGGGACATCCCCATGTTCTCTAGTGGTAACATAAGCTATGGACATCAACACTGACCCCTGAAGCTGTATAGCTACAGACCCAGACATGGTTCTTAGCAGTAGCTCTGGCTGGGACCTCACCATAGTCCTAGGTGACAGGGCTGGCCACTCACAACAGGCCTCCACCTCCACCATGGAAGTCCAGTTACATCTCTC\n+>chr10:56094079-56094279\n+TGTTTTCAATGAATAATTGCCCTTTCTGAGTTAAAGACCTAGAAAGCCTCAGGCTAGTGTTATTTTCCTTACAACTAAACCGCCTCCCAACCATCTTGAATGCCTTTCTCCTCTTGACCTTACTGGCTTTTGTCATTCTTTTCTTTGTGGGCTTTAGAGGAAATCCATTCAAGCTGTGCGCTTTGTCTTGGAGATTCACGG\n+>chr10:58131585-58131813\n+CTGCTTATTCTGTAAACACAAGGAGGGGCCCAACATTCTACTGTTCCCAGTTTTGCAATCTTCACAAATTAAATATTTGCCTTCAAAAGCAACAACTGCGATAGCATCAGGGTAATATTCTAAAGCACCGTTTCCTCGCCCGCCAAGTGCAGAAGGTAAGGCAGCGAGCTGTGGGTGGTCTGTTTGGATAATGGTTTTCATGTGGAAACCTCAAATTGGAAACTAACTG\n+>chr10:60878634-60878850\n+CCTTGTTCCCTACACTATAAGCATCCAGATGTTTGTTTAATTCTTTCATCAGGGTTCAGAGCCTCCCATGCACTTCAGCTTTAAGTGGGTAGGAGGAGATCTCGCCAGCTGTTTTACTGGGGACACATCCTACTAGGTAACTTGGCCCACCCAGGTTTCCAAAGCCTTCAGATAAGAACCCCTGGGGTGCTCGATGAATATGTAGATAGCCACGCCC\n+>chr10:62360770-62360986\n+CCAGTACAGAAGGATGTCTATTACCATCCCAAGGTCCAGGCATAAAGGGAGTCATGAGTAAAGACAAAAGGTGCT
GCCCAGCCAGGCTCAGACACAGGGCAGAAGCAC'..b'ATGTTCTACAGAACATATCCTTGGCCTTATAAAACACATCTCCATCAATCACTAATTAACAAAATGCT\n+>chr9:72617488-72617762\n+AGTCCCAAGGGTGGGGGACTGAGACATTATTCTATTGAAATGTATCTGAGGGGTCTGGAGAGGGCGCAGAGCTCAATTCTCAGCACCCATGTTGAGTGGCTCATAAGCACCTGCCATTCCAGCTCTGGAGGACCTGGTGTCTTTCTCTAGCCTCTGTGGGCACACATGTGGCATACTTACAAGTAAAAATAAACAGACATCTTTTAGAAGATGCATCTGAGGGGAGAAGCCAGCACTTTCTGGGAAGGCACAGGGAAGGAGATACATAGGGATCA\n+>chr9:72756364-72756592\n+TACAGGGATCTGACTCCCTCTTCTGACCTCCTCAGGCACGTGGCACACGTGCTTACAGACAAAACCCTTGGACACATAAACGGAAAATAAGTAAAGTTCTGCACAGGCACCTAAAGCGCCTTCTTCCCTTGATGGCCAAGGAAGTAGAATGTTCCGAGGAATGTCACCAGAGGATGCTGCCTCTTGTCTTTGCCAACAAGTGCTATGGAAGCCAGCTAATGCTGGTTAG\n+>chr9:75187616-75187817\n+CACGTCCTGACAGTGTCAATAGCTGATGTGGCTCTGTCCTTCTGACTCCGTTGTCCCTGTGCTCTGAGCCACACTTTGGGAATGAGGTGCATGACAGGGCTCCTGCTCTGCTCAGTACCCCGGGTGAATGTGTTGACTTCCTCTGGCCCTCAGTGTTCAGCTTTGCAAGGCAGTGCCCCTGTCCTGGGTTTACCCGAGGAAC\n+>chr9:75224325-75224531\n+GAGTCCACCTAGGGTGGAGTCTTCCAACCTAGGTAAAGTCTATTGTCCTTTCCTCCAAGACACCACTACCTGCTGAGAGGCTGAACCTGTTCTGACTCTGTCCTGGACAAAGCAGTGATCCTGACAATGAAGCAATGAGATCCTGGCGTGGTGGGGGATGGGCCCCCACCTTACTCCCCACCCTGCCTTTATAAGCTCAGACTCTCA\n+>chr9:77464725-77464975\n+TCAGGTTATATTAAAATACCTTAGACTCACCCCTCAGTGATGTCAGGGGCTGGCTTTTCAAATGCTGAACTCATCCATGTGTCTCCTCGCTGTTGGTGATCTCACACTGATTGTGATGTTGATTTGGAGCAGTGAGCTAAAGGCCCTGACTTAAGGTTGTAGTCTTCCCTGAAGAACACGGAGGTACCAGATGCACAGTTCAAGTGTAGTGCGTCTTTCCTGAGTTTATTGCCAGTTCCTTACTTCACACC\n+>chr9:103484970-103485173\n+GTCCTTCTAGACAGAACGACTAGAAAGCTTAGGGTTCCATTTCTCAGGTGGTCATGTGCATTTAGGCTGTGGGAGATCTGGAAGGGGAAGTGAGAGGACACTGCCCTGTTTCCATTGTGGCAGAGCAGGGCATGAAGATAGGGATGGAGAACAAGCAGCCATGAAGCAACTGTAGCTCCCATTGCAGCATCTTGGTTGCTGCTG\n+>chr9:106154273-106154493\n+AGTCTGTAAAGCTAAGCTCACAGACAAAGGCCTAATCCATCAAAGTCCACAGTTATGGAAAAGTCTCAAAATGTACTAATCTTTTTTTTTCTTCTACAGTTCTGCTTACTGTTCTTGTTAACTGGAGCATCCCATTCAAAGCTCACACCAAGAAAGACTCAGAGCTCAGAGCTACACACTGGGATACTAAACGCCTAGTGTAGTCCCCAACCAGCTGGCTA\n+>chr9:107532999-107533221\n+TCCCTAGCGTGGGAGCTCATCCAGGCCCGCCTGGGCAGACAAAGCCTTGCCCAGCCTGGGGGAGGGG
TGGGAAGAGACTTTGAGGCCCAAGCCAACCAGCCCCCCATTCAAGTGCTAATGAGCTCCAGCAGGGGAGAAGTGGGTTAGGGAAGGGAGGCTGCCTCTGGACCAGCCCACCTGGCTCGCAGAGCTGGGCCACACCCCCAAGGCTCGCTCCCCAGAG\n+>chr9:107612795-107613001\n+ATGGCTACCTCCAGAGTAGGTCAAGTGGGTGCAGCCCAGCCACTAGCTCTGTCCAGCTAGCCAAATGCTCCAAGTGATCACCTCAAACACCAGTAAGTCTCATTTCAAGCCAGCGGTCCTCACAGACTTTCCCTTTACAGTGCCATATTCCAGAGAACCCAGATGACAGGAGGACGACTCAAGGCATACCCCCCACCCCTCTGCTAT\n+>chr9:108756034-108756281\n+CACAGTTGAGGCTGGGGCCCCTGGTTGCACACCTAGGCATCCAGATCGGTCAAGAATGTCTCCTCTGCATAAAGTGGGGACCTTCTACAAAGCATACAATTAGGTCCAGCCACTTAAGGGCACAGAGGCATAACATTGGCCAAGCGTAGCGCTGGGCCTTCGTTCCAGTCCTCCAGTCTCAGCAGAATGCTCCAGCTCCCAGCAGGTTAACAAGAATCGCAGGCAGGGACAGAAGCTAGGCCGACGGG\n+>chr9:114493580-114493783\n+ATTTTCCCCTAGAGAAAGGACTGCATTCCTGGGAGGGGAGCGCGGCCCTCCCCATTCACACTGGGGACCACACAGAGTGCCTCTTGCGCCAAAAGCTGGAAGGTTCGTCGGTTTGGCTTTTCTTTTGACCCGGGCGTTGCATACGAAATGTAGATATTTTCCTAAACGCCGGCTTTGACATTTAATGCGTTCCTAAGTAGTCTG\n+>chr9:119032369-119032595\n+CCTGTCCCTTCTGCTCACCCTTGGTGGCTTCCTCAGAAATCCCATTCTCACCTCCCCCTCTGTGTTCTCGTGACTCACGCCCACACTGTTCCGACCCAGGTCACAGCATCAGTGGGTTCTGGATGGGCCATGTGCTTGTGTTCTGCAGGGTGATGCATATGGCTGTCTTATGACTCCCACAGGACATCCTCTGGCCTCCAAATACACCCCGTGGCACACCATCCCAC\n+>chrX:55267255-55267498\n+GACTTTCTGACTACCTACAGAGAACCTGTAGCCGGTCTACTCCGTTAGTACTGGACTGTGCCTTTTGTTTTGAGTCCTCTGTTCTAGAGCAAGGGATGTCGCGGAAGGTGACCTTGAACACCTGATCTTCCTATCTCTATTTCCTAAGTGTTGGAATTAAAGGCGTGAGCCACCACGCTGGGATGGACTCCAGCTTTGTAAAGATGTTAAGTGTGATCCCAGCAACCGCCAAATGCTACCCAGT\n+>chrX:96752255-96752513\n+AAAAAAAAACTGCTGTGGTACCCAGGTCAGGTCTCTAAGGGGAGCCTCTAAAATTTGTCCTGTGACCAGACGGTGCCTTTGTTTGGCTTTGAACATTCAAGCTGGCCTCTTGGTCATGCAGATGTACCTCAAGTACCTATTCTTTGTCTTGGGGAGGCCCTGACTCCATATGCAGATGCTATAAGCGGCGATGTTAAGCATTCTGTTTAATAATGTAAAAGGCTGTCTTCGCGGTGGGCTGGGGGGCATCCTCACAGAC\n+>chrX:154039412-154039719\n+TTGTTAGCCCAGAGCTTTGTCTGGACCTTGTACTAGGACAAGAACAAGCTACTGCGTGGCTATCTGATAACCCAAAATGTGAAGGAAATGAGTTGATTGTGATTTTCTGTTCCCAGCTTAAAGATTGAGTTCTACATCTGAGTTGATTTTCCCACCAGGAGCACAGAAGTCATAACAAAGCACTGAATTGCCTCCTGCTGGCTGCTGGAATTTAGCTGTAGTAGCAGCCAGCTGGTGGCAGC
AGCAGAAGCACGATGGGAGACAGAATGGCAATGAAGGAAGCCCCGATGAAAAAAGAAAAAAGATGA\n' |
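Reviewer note: the FASTA hunk above supplies training sequences whose k-mer content is what the spectrum kernel works on. A minimal sketch of that counting step follows. It is not the code in `scripts/kmersvm_train.py`. The function names are hypothetical, the toy record is illustrative rather than taken from the test data, and merging each k-mer with its reverse complement is an assumption based on the paired `k-mer`/`revcomp` columns in the weights output.

```python
from collections import Counter


def reverse_complement(kmer):
    """Return the reverse complement of an upper-case DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(kmer))


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_kmers(sequence, k):
    """Count k-mers, folding each one onto the lexicographically smaller
    of (k-mer, reverse complement). This folding is an assumption."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip windows containing N or other symbols
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


# Toy record (hypothetical, not from test_positive.fa).
toy = [">toy_seq", "ACGTAC"]
records = list(read_fasta(toy))
toy_counts = count_kmers(records[0][1], 3)
print(dict(toy_counts))
```

With real data, `k` would be the K-mer Length chosen in the tool form (5 to 10).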
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/tool-data/test_weights.out --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/tool-data/test_weights.out Mon Aug 20 18:07:22 2012 -0400 |
b |
b'@@ -0,0 +1,2088 @@\n+#parameters:\n+#kernel=1\n+#kmerlen=6\n+#bias=4.551\\d*\n+#A=-5.16\\d*\n+#B=0.626\\d*\n+#NOTE: k-mers with large negative weights are also important. They can be found at the bottom of the list.\n+#k-mer\trevcomp\tSVM-weight\n+AGCCCA\tTGGGCT\t0.204\\d*\n+GGGTCA\tTGACCC\t0.197\\d*\n+CTTTCA\tTGAAAG\t0.175\\d*\n+ATTGGG\tCCCAAT\t0.151\\d*\n+GGGGGA\tTCCCCC\t0.151\\d*\n+CTAGTC\tGACTAG\t0.141\\d*\n+AATGGC\tGCCATT\t0.134\\d*\n+ATATGG\tCCATAT\t0.130\\d*\n+GAATGC\tGCATTC\t0.116\\d*\n+GCAGGA\tTCCTGC\t0.113\\d*\n+ATGAAA\tTTTCAT\t0.107\\d*\n+ATCACA\tTGTGAT\t0.106\\d*\n+AATGCT\tAGCATT\t0.106\\d*\n+AGTAGC\tGCTACT\t0.104\\d*\n+ACTCAG\tCTGAGT\t0.101\\d*\n+AATCAC\tGTGATT\t0.100\\d*\n+AGTGGC\tGCCACT\t0.098\\d*\n+CCGTTC\tGAACGG\t0.097\\d*\n+CAATAC\tGTATTG\t0.096\\d*\n+GTGTTA\tTAACAC\t0.094\\d*\n+ATGCTA\tTAGCAT\t0.089\\d*\n+GCACCA\tTGGTGC\t0.083\\d*\n+CTCTTC\tGAAGAG\t0.083\\d*\n+ATTCAG\tCTGAAT\t0.082\\d*\n+ACTGGA\tTCCAGT\t0.082\\d*\n+AAACTG\tCAGTTT\t0.082\\d*\n+CTAACA\tTGTTAG\t0.080\\d*\n+TACGTA\tTACGTA\t0.078\\d*\n+GTTAAC\tGTTAAC\t0.076\\d*\n+ATCTGC\tGCAGAT\t0.074\\d*\n+CGTGTA\tTACACG\t0.073\\d*\n+ATAATC\tGATTAT\t0.071\\d*\n+GCAAAC\tGTTTGC\t0.069\\d*\n+AGTTGA\tTCAACT\t0.069\\d*\n+GTGGCA\tTGCCAC\t0.067\\d*\n+AACGGG\tCCCGTT\t0.066\\d*\n+CCGATA\tTATCGG\t0.065\\d*\n+ACGAAC\tGTTCGT\t0.063\\d*\n+ACCGCT\tAGCGGT\t0.063\\d*\n+AACTGA\tTCAGTT\t0.062\\d*\n+CTCCCC\tGGGGAG\t0.061\\d*\n+AACGGA\tTCCGTT\t0.060\\d*\n+GTGTAA\tTTACAC\t0.058\\d*\n+AAAGAC\tGTCTTT\t0.054\\d*\n+AAAACG\tCGTTTT\t0.054\\d*\n+ACGTAG\tCTACGT\t0.053\\d*\n+AGGATA\tTATCCT\t0.052\\d*\n+AGATGC\tGCATCT\t0.051\\d*\n+GGCTAC\tGTAGCC\t0.051\\d*\n+ATATCC\tGGATAT\t0.049\\d*\n+GCTGAC\tGTCAGC\t0.048\\d*\n+ATAAGC\tGCTTAT\t0.047\\d*\n+CGATAG\tCTATCG\t0.047\\d*\n+CGAACC\tGGTTCG\t0.045\\d*\n+GCGCTA\tTAGCGC\t0.044\\d*\n+CAGCAA\tTTGCTG\t0.044\\d*\n+AAGCGT\tACGCTT\t0.044\\d*\n+ATTCCA\tTGGAAT\t0.042\\d*\n+CGTATA\tTATACG\t0.040\\d*\n+TCGCCA\tTGGCGA\t0.038\\d*\n+CGCCAC\tGTGGCG\t0.038\\d*\n+CACGTA\tTACGTG\t0.036\\d*\n+CGTTAC\tGTAACG
\t0.035\\d*\n+CATAGG\tCCTATG\t0.034\\d*\n+ACTAGG\tCCTAGT\t0.034\\d*\n+CACGAC\tGTCGTG\t0.033\\d*\n+CAGTCA\tTGACTG\t0.033\\d*\n+GGACCA\tTGGTCC\t0.033\\d*\n+ATCGGG\tCCCGAT\t0.031\\d*\n+CATTTC\tGAAATG\t0.031\\d*\n+TGCGAA\tTTCGCA\t0.031\\d*\n+CCTTTC\tGAAAGG\t0.030\\d*\n+CGCTAA\tTTAGCG\t0.028\\d*\n+CTTGGA\tTCCAAG\t0.028\\d*\n+ACGGAG\tCTCCGT\t0.026\\d*\n+ACTATA\tTATAGT\t0.026\\d*\n+AGACTA\tTAGTCT\t0.025\\d*\n+CATGCG\tCGCATG\t0.025\\d*\n+CAAAGG\tCCTTTG\t0.024\\d*\n+CCCTAC\tGTAGGG\t0.024\\d*\n+AAGCAG\tCTGCTT\t0.023\\d*\n+CAGGAA\tTTCCTG\t0.022\\d*\n+GCGAAA\tTTTCGC\t0.022\\d*\n+CGGCAA\tTTGCCG\t0.020\\d*\n+AGCACG\tCGTGCT\t0.020\\d*\n+ACACCA\tTGGTGT\t0.018\\d*\n+CGTCCA\tTGGACG\t0.017\\d*\n+CGCTTA\tTAAGCG\t0.016\\d*\n+CATCCC\tGGGATG\t0.015\\d*\n+TAGCGA\tTCGCTA\t0.014\\d*\n+ACGTAA\tTTACGT\t0.014\\d*\n+GCCAGA\tTCTGGC\t0.013\\d*\n+CGAGCA\tTGCTCG\t0.013\\d*\n+CCAATA\tTATTGG\t0.013\\d*\n+TCGGCA\tTGCCGA\t0.013\\d*\n+AAACGC\tGCGTTT\t0.013\\d*\n+CGACAA\tTTGTCG\t0.012\\d*\n+CGTACG\tCGTACG\t0.012\\d*\n+AGTTTC\tGAAACT\t0.012\\d*\n+ACAGAA\tTTCTGT\t0.012\\d*\n+ACCGAT\tATCGGT\t0.011\\d*\n+ATACGC\tGCGTAT\t0.011\\d*\n+ACGTAT\tATACGT\t0.010\\d*\n+CCGTAA\tTTACGG\t0.010\\d*\n+CTATAC\tGTATAG\t0.009\\d*\n+CTACTA\tTAGTAG\t0.009\\d*\n+TGGCCA\tTGGCCA\t0.009\\d*\n+ACTCGA\tTCGAGT\t0.009\\d*\n+GTGGTA\tTACCAC\t0.009\\d*\n+AGTCGG\tCCGACT\t0.009\\d*\n+AGGTCG\tCGACCT\t0.008\\d*\n+CAGAAG\tCTTCTG\t0.008\\d*\n+ACGGTC\tGACCGT\t0.007\\d*\n+ACCGCA\tTGCGGT\t0.007\\d*\n+GACGAA\tTTCGTC\t0.005\\d*\n+AGAGGG\tCCCTCT\t0.005\\d*\n+CAGTAC\tGTACTG\t0.004\\d*\n+CATGTC\tGACATG\t0.004\\d*\n+ACTAGT\tACTAGT\t0.003\\d*\n+CATACG\tCGTATG\t0.003\\d*\n+CAATCG\tCGATTG\t0.003\\d*\n+ACGGAA\tTTCCGT\t0.001\\d*\n+CGCGAA\tTTCGCG\t0.001\\d*\n+TTGAAA\tTTTCAA\t0.000\\d*\n+GTCATA\tTATGAC\t0.000\\d*\n+ACGATC\tGATCGT\t6.983\\d*\n+ACGATA\tTATCGT\t0.0\n+ACGCCA\tTGGCGT\t-0.00\\d*\n+AACGAA\tTTCGTT\t-0.00\\d*\n+ACTCTC\tGAGAGT\t-0.00\\d*\n+AATACG\tCGTATT\t-0.00\\d*\n+CTGGAC\tGTCCAG\t-0.00\\d*\n+TGTTAA\tTTAACA\t-0.00\\d*\n+CCTGAA\tTTCAGG\t-0.00\\d*\n+ATCG
TG\tCACGAT\t-0.00\\d*\n+TCGCAA\tTTGCGA\t-0.00\\d*\n+ACGACT\tAGTCGT\t-0.00\\d*\n+TCGACA\tTGTCGA\t-0.00\\d*\n+GCCATA\tTATGGC\t-0.00\\d*\n+AGAGAA\tTTCTCT\t-0.00\\d*\n+GATAGC\tGCTATC\t-0.00\\d*\n+GATATC\tGATATC\t-0.00\\d*\n+ATACGG\tCCGTAT\t-0.00\\d*\n+GATCCC\tGGGATC\t-0.00\\d*\n+GGTAAC\tGTTACC\t-0.00\\d*\n+AAGAAT\tATTCTT\t-0.00\\d*\n+GCGATA\tTATCGC\t-0.00\\d*\n+CTTGTA\tTACAAG\t-0.00\\d*\n+ACTCCG\tCGGAGT\t-0.00\\d*\n+ATTTGC\tGCAAAT\t-0.00\\d*\n+CGATCG\tCGATCG\t-0.00\\d*\n+CATATG\tCATATG\t-0.00\\d*\n+AAAGAT\tATCTTT\t-0.00\\d*\n+GGACGA\tTCGTCC\t-0.00\\d*\n+CGATAA\tTTATCG\t-0.00\\d*\n+TAAGCA\tTGCTTA\t-0.00\\d*\n+AGGGGG\tCCCCCT\t-0.00\\d*\n+CC'..b'GACTG\t-0.43\\d*\n+TCAAAA\tTTTTGA\t-0.43\\d*\n+ACTCAT\tATGAGT\t-0.43\\d*\n+CCTAGA\tTCTAGG\t-0.44\\d*\n+TGGAAA\tTTTCCA\t-0.44\\d*\n+TACAAA\tTTTGTA\t-0.44\\d*\n+CAGTTC\tGAACTG\t-0.44\\d*\n+CCAGGG\tCCCTGG\t-0.44\\d*\n+AGAGAC\tGTCTCT\t-0.44\\d*\n+GCAACA\tTGTTGC\t-0.44\\d*\n+AAACTT\tAAGTTT\t-0.44\\d*\n+AGACTG\tCAGTCT\t-0.44\\d*\n+CTCTCC\tGGAGAG\t-0.44\\d*\n+CTGCAG\tCTGCAG\t-0.44\\d*\n+GGCCAC\tGTGGCC\t-0.44\\d*\n+GAGACC\tGGTCTC\t-0.45\\d*\n+ACAGCT\tAGCTGT\t-0.45\\d*\n+ATAAAG\tCTTTAT\t-0.45\\d*\n+GTGTCA\tTGACAC\t-0.45\\d*\n+TCCAAA\tTTTGGA\t-0.45\\d*\n+CTGGAA\tTTCCAG\t-0.45\\d*\n+ATGAAG\tCTTCAT\t-0.45\\d*\n+AGAACA\tTGTTCT\t-0.45\\d*\n+CCAAAA\tTTTTGG\t-0.45\\d*\n+ATCTTA\tTAAGAT\t-0.45\\d*\n+AGGTCC\tGGACCT\t-0.45\\d*\n+AAAATC\tGATTTT\t-0.45\\d*\n+AGAAAA\tTTTTCT\t-0.45\\d*\n+TCTCCA\tTGGAGA\t-0.45\\d*\n+AGTTCC\tGGAACT\t-0.45\\d*\n+AAAACC\tGGTTTT\t-0.45\\d*\n+CATGTA\tTACATG\t-0.45\\d*\n+ATGTTG\tCAACAT\t-0.45\\d*\n+AAGTCA\tTGACTT\t-0.46\\d*\n+ACATCA\tTGATGT\t-0.46\\d*\n+TATAAA\tTTTATA\t-0.46\\d*\n+CATAGA\tTCTATG\t-0.46\\d*\n+GGAGGA\tTCCTCC\t-0.46\\d*\n+AGATAG\tCTATCT\t-0.46\\d*\n+AGCTGG\tCCAGCT\t-0.46\\d*\n+GCCTCA\tTGAGGC\t-0.46\\d*\n+AAATCT\tAGATTT\t-0.46\\d*\n+CTAAGA\tTCTTAG\t-0.46\\d*\n+ACTGCA\tTGCAGT\t-0.46\\d*\n+AGAATA\tTATTCT\t-0.46\\d*\n+AGCTAG\tCTAGCT\t-0.46\\d*\n+CTGAAA\tTTTCAG\t-0.46\\d*\n+CTCCAC\tGTGGAG\t-0.46\\d*\n+TTAAAA\tTTTTAA\t-0
.47\\d*\n+GAGGAA\tTTCCTC\t-0.47\\d*\n+GCAGGC\tGCCTGC\t-0.47\\d*\n+ACATAT\tATATGT\t-0.47\\d*\n+ACTACT\tAGTAGT\t-0.47\\d*\n+TGGGAA\tTTCCCA\t-0.47\\d*\n+ATGTAA\tTTACAT\t-0.47\\d*\n+TGAGAA\tTTCTCA\t-0.47\\d*\n+AGAGGA\tTCCTCT\t-0.47\\d*\n+AAGGGC\tGCCCTT\t-0.47\\d*\n+ATAAAA\tTTTTAT\t-0.48\\d*\n+CCTCAG\tCTGAGG\t-0.48\\d*\n+ACCCCA\tTGGGGT\t-0.48\\d*\n+CAGGGC\tGCCCTG\t-0.48\\d*\n+AAGTGA\tTCACTT\t-0.49\\d*\n+CTCAAA\tTTTGAG\t-0.49\\d*\n+CATAAA\tTTTATG\t-0.49\\d*\n+AGCTTC\tGAAGCT\t-0.49\\d*\n+CCAGGC\tGCCTGG\t-0.49\\d*\n+AGGCCA\tTGGCCT\t-0.49\\d*\n+ATGATA\tTATCAT\t-0.49\\d*\n+TGCACA\tTGTGCA\t-0.49\\d*\n+GCTCCA\tTGGAGC\t-0.50\\d*\n+AAACAC\tGTGTTT\t-0.50\\d*\n+CCTGCC\tGGCAGG\t-0.50\\d*\n+TATATA\tTATATA\t-0.50\\d*\n+AGACAG\tCTGTCT\t-0.50\\d*\n+CTCACA\tTGTGAG\t-0.50\\d*\n+TACACA\tTGTGTA\t-0.51\\d*\n+ATAGAA\tTTCTAT\t-0.51\\d*\n+ATTTTG\tCAAAAT\t-0.51\\d*\n+GCAGCC\tGGCTGC\t-0.51\\d*\n+GGGAAA\tTTTCCC\t-0.51\\d*\n+CTGTAA\tTTACAG\t-0.52\\d*\n+AGGAGC\tGCTCCT\t-0.52\\d*\n+AGTGGA\tTCCACT\t-0.52\\d*\n+AAAACA\tTGTTTT\t-0.52\\d*\n+CAGGGG\tCCCCTG\t-0.52\\d*\n+ATTACA\tTGTAAT\t-0.52\\d*\n+AGGCAT\tATGCCT\t-0.52\\d*\n+CCACAG\tCTGTGG\t-0.53\\d*\n+CACATA\tTATGTG\t-0.53\\d*\n+GTCTGA\tTCAGAC\t-0.53\\d*\n+AGGAGG\tCCTCCT\t-0.53\\d*\n+TCCTCA\tTGAGGA\t-0.53\\d*\n+CCACCC\tGGGTGG\t-0.53\\d*\n+AGATGG\tCCATCT\t-0.53\\d*\n+AATGCC\tGGCATT\t-0.53\\d*\n+GAAGGA\tTCCTTC\t-0.53\\d*\n+ATTTTA\tTAAAAT\t-0.53\\d*\n+ATTGTG\tCACAAT\t-0.53\\d*\n+ATCATC\tGATGAT\t-0.54\\d*\n+CTAGAA\tTTCTAG\t-0.54\\d*\n+CCACCA\tTGGTGG\t-0.54\\d*\n+ACAGAT\tATCTGT\t-0.54\\d*\n+AGCAAC\tGTTGCT\t-0.54\\d*\n+GATAGA\tTCTATC\t-0.55\\d*\n+ATTCAA\tTTGAAT\t-0.55\\d*\n+AGGGCA\tTGCCCT\t-0.55\\d*\n+CACCCA\tTGGGTG\t-0.55\\d*\n+AGTGAC\tGTCACT\t-0.55\\d*\n+ACCAAA\tTTTGGT\t-0.55\\d*\n+AGCAGC\tGCTGCT\t-0.55\\d*\n+TAGATA\tTATCTA\t-0.56\\d*\n+TAGAAA\tTTTCTA\t-0.56\\d*\n+CAGGCA\tTGCCTG\t-0.57\\d*\n+CCTCTC\tGAGAGG\t-0.57\\d*\n+CTCCCA\tTGGGAG\t-0.57\\d*\n+CCCGCC\tGGCGGG\t-0.57\\d*\n+ACAAAC\tGTTTGT\t-0.57\\d*\n+GACACA\tTGTGTC\t-0.57\\d*\n+CAGCAG\tCTGCTG\t-0.58\\d*\n+AT
CCAT\tATGGAT\t-0.58\\d*\n+CATATA\tTATATG\t-0.59\\d*\n+AAACAA\tTTGTTT\t-0.59\\d*\n+CTGGGA\tTCCCAG\t-0.60\\d*\n+ATAGAT\tATCTAT\t-0.60\\d*\n+ACACAT\tATGTGT\t-0.60\\d*\n+ATTTAA\tTTAAAT\t-0.61\\d*\n+ACTGGG\tCCCAGT\t-0.61\\d*\n+AAAATG\tCATTTT\t-0.61\\d*\n+ATATAT\tATATAT\t-0.61\\d*\n+CGCCCC\tGGGGCG\t-0.62\\d*\n+GAAAGA\tTCTTTC\t-0.62\\d*\n+CCTCCC\tGGGAGG\t-0.63\\d*\n+ACAGGA\tTCCTGT\t-0.64\\d*\n+AATAAA\tTTTATT\t-0.64\\d*\n+AGAAAG\tCTTTCT\t-0.64\\d*\n+CCTTCC\tGGAAGG\t-0.64\\d*\n+CCGCCC\tGGGCGG\t-0.65\\d*\n+ATACAT\tATGTAT\t-0.65\\d*\n+CCCCAG\tCTGGGG\t-0.66\\d*\n+TAAATA\tTATTTA\t-0.67\\d*\n+CATACA\tTGTATG\t-0.67\\d*\n+AACCAA\tTTGGTT\t-0.67\\d*\n+AGAAAT\tATTTCT\t-0.68\\d*\n+CCAAAG\tCTTTGG\t-0.69\\d*\n+CCTGGA\tTCCAGG\t-0.69\\d*\n+CAACCA\tTGGTTG\t-0.69\\d*\n+ACATAC\tGTATGT\t-0.71\\d*\n+ATAAAT\tATTTAT\t-0.72\\d*\n+ACCCAC\tGTGGGT\t-0.72\\d*\n+GAGGGA\tTCCCTC\t-0.74\\d*\n+AAATAA\tTTATTT\t-0.75\\d*\n+CATCCA\tTGGATG\t-0.76\\d*\n+AGGGAG\tCTCCCT\t-0.77\\d*\n+CTCCTC\tGAGGAG\t-0.77\\d*\n+TACATA\tTATGTA\t-0.77\\d*\n+CCCACC\tGGTGGG\t-0.80\\d*\n+CCCTCC\tGGAGGG\t-0.82\\d*\n+AAAGAA\tTTCTTT\t-0.94\\d*\n+CCCTGC\tGCAGGG\t-0.97\\d*\n+AAGAAA\tTTTCTT\t-0.98\\d*\n+CCCCCC\tGGGGGG\t-1.23\\d*\n+AGAGAG\tCTCTCT\t-1.32\\d*\n+AAAAAA\tTTTTTT\t-1.44\\d*\n+GAGAGA\tTCTCTC\t-1.47\\d*\n+CACACA\tTGTGTG\t-1.69\\d*\n+ACACAC\tGTGTGT\t-1.80\\d*\n' |
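Reviewer note: the weights file in this hunk has `#`-prefixed header lines (`#kernel=`, `#kmerlen=`, `#bias=`, `#A=`, `#B=`) followed by tab-separated `k-mer`, `revcomp`, `SVM-weight` rows. The checked-in test file uses `\d*` patterns for `re_match` comparison, so its weight column is not numeric. The sketch below shows how an actual output file could be loaded. `parse_weights` is a hypothetical helper, not part of the package, and the sample rows are taken from the example in the `train.xml` help text.

```python
def reverse_complement(kmer):
    """Return the reverse complement of an upper-case DNA k-mer."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(kmer))


def parse_weights(lines):
    """Split a weights file into (params, rows).

    params: dict built from '#name=value' header lines (values kept as strings).
    rows:   list of (kmer, revcomp, weight) tuples in file order.
    """
    params, rows = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines without '=' (title, NOTE, column names) are skipped.
            if "=" in line:
                name, value = line[1:].split("=", 1)
                params[name] = value
            continue
        kmer, revcomp, weight = line.split("\t")
        rows.append((kmer, revcomp, float(weight)))
    return params, rows


sample = [
    "#parameters:",
    "#kernel=1",
    "#kmerlen=6",
    "#bias=-1.20239998751",
    "#k-mer\trevcomp\tSVM-weight",
    "AGGTCA\tTGACCT\t9.32110889151",
    "AAGGTC\tGACCTT\t8.22598019901",
]
params, rows = parse_weights(sample)

# Sanity check: the second column should be the reverse complement of the first.
pairs_ok = all(reverse_complement(kmer) == rc for kmer, rc, _ in rows)
print(params["kmerlen"], len(rows), pairs_ok)
```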
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/tool-data/train_predictions.out --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/tool-data/train_predictions.out Mon Aug 20 18:07:22 2012 -0400 |
b |
b'@@ -0,0 +1,1049 @@\n+#seq_id\tSVM score\tlabel\tNCV\n+chr10:6238484-6238749\t0.613347259057\t1\t4\n+chr10:6345920-6346133\t0.954807438919\t1\t0\n+chr10:8805339-8805546\t1.71419073996\t1\t2\n+chr10:12697247-12697463\t1.23892959115\t1\t1\n+chr10:20821878-20822092\t1.08715091805\t1\t2\n+chr10:21465959-21466182\t1.11018210836\t1\t3\n+chr10:21548928-21549240\t0.503404954015\t1\t4\n+chr10:21708326-21708533\t0.993516432142\t1\t3\n+chr10:28289736-28289985\t1.07694029364\t1\t3\n+chr10:30457257-30457471\t1.53235324402\t1\t3\n+chr10:33856167-33856381\t1.44887473869\t1\t3\n+chr10:51392420-51392658\t0.927847027599\t1\t1\n+chr10:56094079-56094279\t1.02615151756\t1\t3\n+chr10:58131585-58131813\t0.839197738464\t1\t2\n+chr10:60878634-60878850\t1.02551060148\t1\t4\n+chr10:62360770-62360986\t0.982635392453\t1\t4\n+chr10:67599479-67599721\t0.918360990753\t1\t2\n+chr10:67600241-67600456\t1.36691399325\t1\t1\n+chr10:69296771-69297011\t0.505226893998\t1\t1\n+chr10:72315525-72315760\t2.22059109506\t1\t1\n+chr10:76662120-76662329\t1.46282752352\t1\t0\n+chr10:79513518-79513761\t1.03653568136\t1\t4\n+chr10:79514069-79514297\t0.843029041756\t1\t0\n+chr10:82272165-82272393\t0.836947221302\t1\t3\n+chr10:84170918-84171118\t0.71367320737\t1\t4\n+chr10:85002598-85002811\t1.02189654988\t1\t3\n+chr10:92705007-92705282\t0.207390721602\t1\t3\n+chr10:107723907-107724110\t1.34978207126\t1\t2\n+chr10:110291825-110292092\t0.916139325907\t1\t1\n+chr10:110763940-110764165\t1.16214182095\t1\t2\n+chr10:120892135-120892361\t1.20816098317\t1\t2\n+chr10:126409106-126409356\t0.83925255883\t1\t2\n+chr10:126502141-126502343\t1.40795940938\t1\t1\n+chr11:4106498-4106708\t1.41962773252\t1\t0\n+chr11:6795098-6795318\t0.754184378346\t1\t2\n+chr11:8469255-8469455\t1.15209683574\t1\t2\n+chr11:9017015-9017224\t1.44835850345\t1\t0\n+chr11:11729287-11729490\t0.921633110168\t1\t2\n+chr11:17783256-17783462\t1.52794009584\t1\t4\n+chr11:18623594-18623823\t1.21571014789\t1\t4\n+chr11:20105386-20105638\t0.972923927205\t1\t1\n+chr
11:32282053-32282266\t0.836437353279\t1\t4\n+chr11:33191219-33191488\t0.360272650716\t1\t3\n+chr11:44560401-44560610\t0.546689800543\t1\t0\n+chr11:47454201-47454453\t0.892497848817\t1\t4\n+chr11:60817731-60817970\t0.986462760415\t1\t0\n+chr11:69395446-69395651\t0.616714300432\t1\t0\n+chr11:70046235-70046464\t0.708522025183\t1\t2\n+chr11:71315743-71315966\t1.12910924167\t1\t0\n+chr11:79617347-79617550\t0.892366019538\t1\t1\n+chr11:79945653-79945935\t0.419727893681\t1\t3\n+chr11:80530571-80530810\t0.946024726323\t1\t2\n+chr11:86571042-86571272\t0.712348666078\t1\t3\n+chr11:90204484-90204697\t0.808809318068\t1\t4\n+chr11:95842151-95842405\t0.947212593147\t1\t1\n+chr11:96051590-96051801\t1.29363119171\t1\t3\n+chr11:100800901-100801141\t0.643164139126\t1\t0\n+chr11:116626390-116626614\t0.675512021213\t1\t0\n+chr11:117386865-117387165\t0.403986907238\t1\t1\n+chr11:117459661-117459891\t0.886427162889\t1\t4\n+chr11:118102258-118102474\t0.919632360347\t1\t3\n+chr11:118189873-118190087\t1.23378435382\t1\t1\n+chr11:119052686-119052892\t0.987935503464\t1\t1\n+chr12:4045906-4046121\t0.962679974768\t1\t3\n+chr12:4179746-4180046\t0.255231359487\t1\t4\n+chr12:4904902-4905102\t0.671647694007\t1\t4\n+chr12:12950090-12950330\t0.772336976257\t1\t0\n+chr12:36327306-36327509\t1.50669802876\t1\t4\n+chr12:40724684-40724965\t0.566849897794\t1\t1\n+chr12:42830817-42831028\t0.918401219988\t1\t2\n+chr12:52622751-52623020\t0.862379020916\t1\t3\n+chr12:53820467-53820682\t1.19429377893\t1\t0\n+chr12:56894750-56895009\t0.734793616518\t1\t4\n+chr12:71812114-71812353\t1.14416321558\t1\t4\n+chr12:72397827-72398040\t1.32301262037\t1\t4\n+chr12:73750748-73750949\t0.929825921279\t1\t0\n+chr12:74186921-74187176\t1.05170436328\t1\t1\n+chr12:74565335-74565637\t0.518998613642\t1\t1\n+chr12:86853226-86853442\t1.07330865627\t1\t1\n+chr12:87602519-87602751\t1.1719289426\t1\t2\n+chr12:87812080-87812284\t1.19962948125\t1\t3\n+chr12:87842870-87843071\t1.52318657077\t1\t4\n+chr12:88658968-88659168\t1.2525663751\t1
\t1\n+chr12:89499854-89500161\t0.632134053075\t1\t4\n+chr12:92922931-92923173\t0.13718696752\t1\t0\n+chr12:101298737-101298989\t0.428134918515\t1\t4\n+chr12:107685474-107685710\t1.14175063533\t1\t2\n+chr12:109687864-109688082\t0.751780341701\t1\t0\n+chr12:116839915-116840131\t0.963921990431\t1\t4\n+chr12:116855882-116856082\t1.50156360543\t1\t0\n+chr12:119998094-1'..b'6679-37397500\t-1.09215019222\t-1\t3\n+chr7:51630323-51631137\t-1.30676405157\t-1\t0\n+chr7:71101763-71102572\t-0.974470953123\t-1\t0\n+chr7:71772889-71773720\t-1.00313803386\t-1\t4\n+chr7:73529986-73530816\t-1.08122972206\t-1\t1\n+chr7:75574735-75575601\t-1.54366870584\t-1\t0\n+chr7:80544956-80545794\t-1.42216788767\t-1\t1\n+chr7:86697573-86698409\t-0.067192746973\t-1\t1\n+chr7:88625555-88626376\t-1.30437420272\t-1\t4\n+chr7:97123917-97124732\t-1.53017337996\t-1\t2\n+chr7:105871475-105872281\t-1.73614337896\t-1\t4\n+chr7:106008485-106009364\t-1.45393891459\t-1\t1\n+chr7:106030356-106031157\t-1.10023293709\t-1\t2\n+chr7:108649842-108650662\t0.18836161647\t-1\t4\n+chr7:114379983-114380795\t-0.932069637314\t-1\t4\n+chr7:114482868-114483678\t-1.26672716467\t-1\t4\n+chr7:114828643-114829456\t-0.0459938127892\t-1\t0\n+chr7:119832029-119832888\t-1.43609175659\t-1\t0\n+chr7:119834806-119835614\t-1.1728070967\t-1\t3\n+chr7:134854266-134855094\t-0.939088085575\t-1\t1\n+chr7:135223239-135224102\t-1.00853356425\t-1\t0\n+chr7:136564522-136565325\t-0.800969524058\t-1\t1\n+chr7:137394897-137395750\t-1.1258924809\t-1\t2\n+chr7:140306187-140307004\t-1.14530217118\t-1\t1\n+chr7:144063563-144064383\t1.83022796573\t-1\t1\n+chr7:146041640-146042450\t-1.01526445816\t-1\t3\n+chr7:146549665-146550468\t-0.6409105513\t-1\t0\n+chr7:148145070-148145871\t-1.05913928394\t-1\t1\n+chr7:150576804-150577615\t0.00852496727436\t-1\t1\n+chr7:150985649-150986491\t-0.644553996402\t-1\t3\n+chr7:151031640-151032443\t-0.521851990157\t-1\t1\n+chr7:151954175-151954980\t0.641242814199\t-1\t0\n+chr8:13078824-13079645\t-1.25775391703\t-1\t2\n+chr8
:14783485-14784287\t-0.836653058691\t-1\t0\n+chr8:15519374-15520257\t-1.14973788583\t-1\t3\n+chr8:18479840-18480732\t-1.07745384634\t-1\t4\n+chr8:24265541-24266341\t-1.08030168778\t-1\t0\n+chr8:35282211-35283014\t-1.3836667735\t-1\t4\n+chr8:36198109-36198922\t-0.826963234513\t-1\t4\n+chr8:37602170-37603031\t-1.04471827095\t-1\t2\n+chr8:44405874-44406689\t-0.76688997668\t-1\t0\n+chr8:46064445-46065268\t-1.31744341689\t-1\t2\n+chr8:60211369-60212178\t-1.45835611605\t-1\t1\n+chr8:66004558-66005383\t-1.196581727\t-1\t0\n+chr8:72783236-72784080\t-1.36105112842\t-1\t2\n+chr8:82069499-82070371\t-1.52627893366\t-1\t3\n+chr8:86695871-86696680\t-0.925241162905\t-1\t4\n+chr8:86971540-86972347\t-0.728561918277\t-1\t1\n+chr8:89411748-89412569\t-1.11280243674\t-1\t4\n+chr8:90498407-90499233\t-0.973330984801\t-1\t0\n+chr8:92855703-92856612\t-1.26311996178\t-1\t1\n+chr8:96891490-96892346\t-1.07022083566\t-1\t3\n+chr8:106766185-106767003\t-0.823399900353\t-1\t0\n+chr8:108495006-108495822\t-0.994981353579\t-1\t0\n+chr8:114506469-114507276\t-1.35363839068\t-1\t2\n+chr8:114556675-114557494\t-1.08170513272\t-1\t1\n+chr8:119197900-119198711\t-1.30829818248\t-1\t2\n+chr8:121001939-121002742\t-1.16711231898\t-1\t2\n+chr8:125151804-125152630\t-1.14695021366\t-1\t2\n+chr8:125257593-125258396\t-0.898404958678\t-1\t2\n+chr8:128582733-128583534\t-0.862005331114\t-1\t4\n+chr9:7633695-7634507\t-1.49548159982\t-1\t0\n+chr9:10448933-10449802\t-1.34661346051\t-1\t4\n+chr9:20818985-20819887\t-0.94274002176\t-1\t4\n+chr9:21572554-21573359\t-1.4458606182\t-1\t1\n+chr9:24765372-24766208\t-1.14433497108\t-1\t0\n+chr9:42665637-42666471\t-0.926435047364\t-1\t2\n+chr9:43095695-43096496\t-0.266564159951\t-1\t3\n+chr9:43737102-43737924\t-1.38119705362\t-1\t2\n+chr9:55811017-55811832\t-1.06545992742\t-1\t3\n+chr9:58126862-58127669\t-1.25807785039\t-1\t3\n+chr9:59320753-59321631\t-1.27274229975\t-1\t4\n+chr9:64349949-64350752\t-1.04705105492\t-1\t4\n+chr9:66696935-66697742\t-1.18357717593\t-1\t4\n+chr9:72617188
-72618062\t-1.44764802427\t-1\t2\n+chr9:72756064-72756892\t-1.15293988254\t-1\t3\n+chr9:75187316-75188117\t-1.2592415165\t-1\t1\n+chr9:75224025-75224831\t-1.20598925552\t-1\t1\n+chr9:77464425-77465275\t-1.00190641686\t-1\t2\n+chr9:103484670-103485473\t-0.872105265863\t-1\t4\n+chr9:106153973-106154793\t-0.82999769989\t-1\t0\n+chr9:107532699-107533521\t-1.12970154082\t-1\t4\n+chr9:107612495-107613301\t-0.91316944298\t-1\t1\n+chr9:108755734-108756581\t-1.24403558211\t-1\t4\n+chr9:114493280-114494083\t-0.934200253342\t-1\t2\n+chr9:119032069-119032895\t-0.597589738725\t-1\t3\n+chrX:55266955-55267798\t-0.942885866594\t-1\t3\n+chrX:96751955-96752813\t-1.30697716745\t-1\t1\n+chrX:154039112-154040019\t-0.839277064438\t-1\t0\n' |
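Reviewer note: the predictions file in this hunk has four tab-separated columns, `seq_id`, `SVM score`, `label` (1 or -1) and `NCV`. `NCV` appears to be the cross-validation fold index (values 0 to 4 here), though the diff does not say so explicitly. The package plots ROC and PR curves through R and ROCR. The sketch below is a plain-Python stand-in that computes the area under the ROC curve by pairwise comparison instead. The helper names are hypothetical, and the five sample rows are copied from the hunk.

```python
def parse_predictions(lines):
    """Return a list of (seq_id, score, label, fold) tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        seq_id, score, label, fold = line.split("\t")
        rows.append((seq_id, float(score), int(label), int(fold)))
    return rows


def pairwise_auc(rows):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sequence scores higher; ties count as half a win."""
    pos = [score for _, score, label, _ in rows if label == 1]
    neg = [score for _, score, label, _ in rows if label == -1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


sample = [
    "#seq_id\tSVM score\tlabel\tNCV",
    "chr10:6238484-6238749\t0.613347259057\t1\t4",
    "chr10:6345920-6346133\t0.954807438919\t1\t0",
    "chr7:51630323-51631137\t-1.30676405157\t-1\t0",
    "chr7:108649842-108650662\t0.18836161647\t-1\t4",
    "chr7:144063563-144064383\t1.83022796573\t-1\t1",
]
pred_rows = parse_predictions(sample)
auc = pairwise_auc(pred_rows)
print(len(pred_rows), auc)
```

The double loop is quadratic in the number of sequences, which is acceptable for a file of about a thousand rows. A rank-based formula would be the usual choice for larger inputs.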
b |
diff -r 000000000000 -r 7fe1103032f7 kmersvm/train.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kmersvm/train.xml Mon Aug 20 18:07:22 2012 -0400 |
b |
@@ -0,0 +1,131 @@ +<tool id="kmersvm_train" name="Train SVM"> + <description>on regulatory DNA sequences</description> + <command interpreter="python">scripts/kmersvm_train.py -q -p -s -v $N -C $SVMC -e $EPS + #if $weight_type.weight_type_select == "custom" + -w $weight_type.weight + #end if + #if $kernel.kernel_select == "sk" + -t 1 -k $kernel.kmerlen_sk + #else + -t 2 -k $kernel.kmerlen_wsk -K $kernel.kmerlen_wsk2 + #end if + $inputA $inputB + </command> + <inputs> + <param format="fasta" name="inputA" type="data" label="Positives"/> + <param format="fasta" name="inputB" type="data" label="Negatives"/> + <conditional name="kernel"> + <param name="kernel_select" type="select" label="Kernel Type"> + <option value="sk">Spectrum Kernel</option> + <option value="wsk">Weighted Spectrum Kernel</option> + </param> + <when value="sk"> + <param name="kmerlen_sk" type="integer" value="6" label="K-mer Length"> + <validator type="in_range" message="K-mer length must be in range 5-10" min="5" max="10" /> + </param> + </when> + <when value="wsk"> + <param name="kmerlen_wsk" type="integer" value="6" label="Minimum K-mer Length"> + <validator type="in_range" message="K-mer length must be in range 5-10" min="5" max="10" /> + </param> + <param name="kmerlen_wsk2" type="integer" value="8" label="Maximum K-mer Length"> + <validator type="in_range" message="K-mer length must be in range 5-10" min="5" max="10" /> + </param> + </when> + </conditional> + <param name="N" type="select" label="N-Fold Cross Validation"> + <option value="3">3</option> + <option value="5" selected="true">5</option> + <option value="10">10</option> + </param> + <conditional name="weight_type"> + <param name="weight_type_select" type="select" label="Positive Set Weight"> + <option value="automatic">Automatic</option> + <option value="custom">Custom</option> + </param> + <when value="custom"> + <param name="weight" type="float" value="1" label="Input The Value of Positive Set Weight" /> + </when> + </conditional> 
+ <param name="SVMC" type="integer" value="1" label="Regularization Param C" /> + <param name="EPS" type="float" value="0.00001" label="Precision Param E" /> + </inputs> + <outputs> + <data format="tabular" name="SVM_weights" from_work_dir="kmersvm_output_weights.out" label="${tool.name} on ${on_string} : Weights" /> + <data format="tabular" name="CV_predictions" from_work_dir="kmersvm_output_cvpred.out" label="${tool.name} on ${on_string} : Predictions" /> + </outputs> + <tests> + <!--SK--> + <test> + <param name="kernel_select" value="sk"/> + <param name="inputA" value="test_positive.fa" /> + <param name="inputB" value="test_negative.fa" /> + <param name="weight_type_select" value="automatic" /> + <output name="output" file="test_weights.out" compare="re_match" lines_diff="20"/> + <output name="output2" file="train_predictions.out" compare="re_match"/> + </test> + </tests> + <help> + +**Note** + +.. class:: warningmark + +All values of K-mer lengths must be between 5 and 10 bp. + +---- + +**What it does** + +Takes as input 2 FASTA files, 1 of positive sequences and 1 of negative sequences. Produces 2 outputs: + + A) Weights: list of sequences of length K ranked by score and posterior probability for that score. + + B) Predictions: results of N-fold cross validation + +---- + +**Parameters** + +Kernel: 2 choices: + + A) Spectrum Kernel: Analyzes a sequence using strings of length K. + + B) Weighted Spectrum Kernel: Analyzes a sequence using strings of range of lengths K1 - Kn. + +N-Fold Cross Validation: Number of partitions of training data used for cross validation. + +Weight: Increases importance of positive data (increase if positive sets are very trustworthy or for training with very large negative sequence sets). + +Regularization Parameter: Penalty for misclassification. Trade-off is overfitting (high parameter) versus high error rate (low parameter). + +Precision Parameter: Insensitivity zone. 
Affects precision of SVM by altering number of support vectors used. + +---- + +**Example** + +Weights file:: + + #parameters: + #kernel=1 + #kmerlen=6 + #bias=-1.20239998751 + #A=-1.50821617139 + #B=-0.110516009177 + #NOTE: k-mers with large negative weights are also important. They can be found at the bottom of the list. + #k-mer revcomp SVM-weight + AGGTCA TGACCT 9.32110889151 + AAGGTC GACCTT 8.22598019901 + ACCTTG CAAGGT 5.78739494153 + AGGTCG CGACCT 5.40759311635 + +Predictions file:: + + mm8_chr1_10212203_10212303_+ 3.31832111466 1 1 + mm8_chr1_103584748_103584848_+ -0.253869299667 1 3 + mm8_chr1_105299130_105299230_+ -1.03463560077 1 3 + mm8_chr1_106367772_106367872_+ 5.36528447025 1 3 + + </help> +</tool> |
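Reviewer note: the `<command>` template in `train.xml` expands to one of two command lines depending on the selected kernel, with an optional `-w` when a custom positive-set weight is chosen. The sketch below mirrors that expansion outside Galaxy, which may help when reproducing a run from a shell. `build_train_command` is a hypothetical helper. The flags and their order are copied from the template, and the 5 to 10 range check mirrors the `in_range` validators.

```python
def build_train_command(positives, negatives, kernel="sk", kmerlen=6,
                        kmerlen_max=8, n_folds=5, svm_c=1, eps="0.00001",
                        weight=None):
    """Return the argv list that the train.xml template would produce."""
    lengths = (kmerlen,) if kernel == "sk" else (kmerlen, kmerlen_max)
    for k in lengths:
        if not 5 <= k <= 10:
            raise ValueError("K-mer length must be in range 5-10")

    cmd = ["python", "scripts/kmersvm_train.py", "-q", "-p", "-s",
           "-v", str(n_folds), "-C", str(svm_c), "-e", str(eps)]
    if weight is not None:
        # Emitted only when 'Positive Set Weight' is set to Custom.
        cmd += ["-w", str(weight)]
    if kernel == "sk":
        cmd += ["-t", "1", "-k", str(kmerlen)]
    else:
        cmd += ["-t", "2", "-k", str(kmerlen), "-K", str(kmerlen_max)]
    cmd += [positives, negatives]
    return cmd


sk_cmd = build_train_command("test_positive.fa", "test_negative.fa")
wsk_cmd = build_train_command("test_positive.fa", "test_negative.fa",
                              kernel="wsk", weight=2.0)
print(" ".join(sk_cmd))
print(" ".join(wsk_cmd))
```

The defaults match the form defaults in the XML (k = 6, 5-fold cross validation, C = 1, E = 0.00001). Passing `eps` as a string avoids Python rendering it as `1e-05`.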