Previous changeset 0:e4b87a00b1df (2016-11-23) Next changeset 2:f537d3e00eb8 (2017-06-23) |
Commit message:
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_star_index_builder commit 0d434bca5083e908114d93e11094e48f49b98ed1 |
modified:
data_manager_conf.xml tool-data/all_fasta.loc.sample tool_data_table_conf.xml.sample |
added:
README.md data_manager/macros.xml data_manager/rna_star_index_builder.py data_manager/rna_star_index_builder.xml tool-data/rnastar_index2.loc.sample |
removed:
README data_manager/rnastar_index_builder.py data_manager/rnastar_index_builder.xml tool-data/rnastar_index.loc.sample tool_dependencies.xml |
diff -r e4b87a00b1df -r cdc4d8a998e1 README --- a/README Wed Nov 23 17:55:57 2016 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
@@ -1,40 +0,0 @@ -*What it does* - -This is a Galaxy datamanager for the rna STAR gap-aware RNA aligner. It's a hack of Dan Blankenberg's BWA data manager -and works on any fasta file you have already downloaded with the all fasta data manager - start there! - -Warning - this is not well tested and there are some complexities to do with splice junction annotation in rna star -indexes - feedback welcomed. Send code. - -Note, currently you'll need a small patch to prevent an error when you try to generate splice junction indexes described at -https://bitbucket.org/galaxy/galaxy-central/pull-request/510/fix-for-data-manager-failure-to-update-a#comment-3265356 - -Please read the fine manual - that and the google group are the places to learn about the options above. - -*Note on sjdbOverhang* - -From https://groups.google.com/forum/#!topic/rna-star/h9oh10UlvhI:: - - James is right, using large enough --sjdbOverhang is safer and should not generally cause any problems with reads of varying length. If your reads are very short, <50b, then I would strongly recommend using optimum --sjdbOverhang=mateLength-1 - By mate length I mean the length of one of the ends of the read, i.e. it's 100 for 2x100b PE or 1x100b SE. For longer reads you can simply use generic --sjdbOverhang 100. - It is a bit confusing because of the way I named this parameter. --sjdbOverhang Noverhang is only used at the genome generation step for constructing the reference sequence out of the annotations. - Basically, the Noverhang exonic bases from the donor site and Noverhang exonic bases from the acceptor site are spliced together for each of the junctions, and these spliced sequences are added to the genome sequence. - - At the mapping stage, the reads are aligned to both genomic and splice sequences simultaneously. 
If a read maps to one of spliced sequences and crosses the "junction" in the middle of it, the coordinates of two pspliced pieces are translated back to genomic space and added to the collection of mapped pieces, which are then all "stitched" together to form the final alignment. Since in the process of "maximal mapped length" search the read is split into pieces of no longer than --seedSearchStartLmax (=50 by default) bases, even if the read (mate) is longer than --sjdbOverhang, it can still be mapped to the spliced reference, as long as --sjdbOverhang > --seedSearchStartLmax. - - Cheers - Alex - -*Note on gene model requirements for splice junctions* - -From https://groups.google.com/forum/#!msg/rna-star/3Y_aaTuzBrE/lUylTB8h5vMJ:: - - When you generate a genome with annotations, you need to specify --sjdbOverhang value, which ideally should be equal to (oneMateLength-1), or you could use a generic value of ~100. - - Your gtf lines look fine to me. STAR needs 3 features from a GTF file: - 1. Chromosome names in col.1 that agree with chromosome names in genome .fasta files. If you have "chr2L" names in the genome .fasta files, and "2L" in the .gtf file, then you need to use --sjdbGTFchrPrefix chr option. - 2. 'exon' in col.3 for the exons of all transcripts (this name can be changed with --sjdbGTFfeatureExon) - 3. 'transcript_id' attribute that assigns each exon to a transcript (--this name can be changed with --sjdbGTFtagExonParentTranscript) - - Cheers - Alex |
diff -r e4b87a00b1df -r cdc4d8a998e1 README.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/README.md Fri Apr 21 12:36:31 2017 -0400 |
@@ -0,0 +1,40 @@ +##What it does## + +This is a Galaxy datamanager for the rna STAR gap-aware RNA aligner. It's a hack of Dan Blankenberg's BWA data manager +and works on any fasta file you have already downloaded with the all fasta data manager - start there! + +Warning - this is not well tested and there are some complexities to do with splice junction annotation in rna star +indexes - feedback welcomed. Send code. + +Note, currently you'll need a small patch to prevent an error when you try to generate splice junction indexes described at +https://bitbucket.org/galaxy/galaxy-central/pull-request/510/fix-for-data-manager-failure-to-update-a#comment-3265356 + +Please read the fine manual - that and the google group are the places to learn about the options above. + +*Note on sjdbOverhang* + +From https://groups.google.com/forum/#!topic/rna-star/h9oh10UlvhI:: + + James is right, using large enough --sjdbOverhang is safer and should not generally cause any problems with reads of varying length. If your reads are very short, <50b, then I would strongly recommend using optimum --sjdbOverhang=mateLength-1 + By mate length I mean the length of one of the ends of the read, i.e. it's 100 for 2x100b PE or 1x100b SE. For longer reads you can simply use generic --sjdbOverhang 100. + It is a bit confusing because of the way I named this parameter. --sjdbOverhang Noverhang is only used at the genome generation step for constructing the reference sequence out of the annotations. + Basically, the Noverhang exonic bases from the donor site and Noverhang exonic bases from the acceptor site are spliced together for each of the junctions, and these spliced sequences are added to the genome sequence. + + At the mapping stage, the reads are aligned to both genomic and splice sequences simultaneously. 
If a read maps to one of spliced sequences and crosses the "junction" in the middle of it, the coordinates of two pspliced pieces are translated back to genomic space and added to the collection of mapped pieces, which are then all "stitched" together to form the final alignment. Since in the process of "maximal mapped length" search the read is split into pieces of no longer than --seedSearchStartLmax (=50 by default) bases, even if the read (mate) is longer than --sjdbOverhang, it can still be mapped to the spliced reference, as long as --sjdbOverhang > --seedSearchStartLmax. + + Cheers + Alex + +*Note on gene model requirements for splice junctions* + +From https://groups.google.com/forum/#!msg/rna-star/3Y_aaTuzBrE/lUylTB8h5vMJ:: + + When you generate a genome with annotations, you need to specify --sjdbOverhang value, which ideally should be equal to (oneMateLength-1), or you could use a generic value of ~100. + + Your gtf lines look fine to me. STAR needs 3 features from a GTF file: + 1. Chromosome names in col.1 that agree with chromosome names in genome .fasta files. If you have "chr2L" names in the genome .fasta files, and "2L" in the .gtf file, then you need to use --sjdbGTFchrPrefix chr option. + 2. 'exon' in col.3 for the exons of all transcripts (this name can be changed with --sjdbGTFfeatureExon) + 3. 'transcript_id' attribute that assigns each exon to a transcript (--this name can be changed with --sjdbGTFtagExonParentTranscript) + + Cheers + Alex |
diff -r e4b87a00b1df -r cdc4d8a998e1 data_manager/macros.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_manager/macros.xml Fri Apr 21 12:36:31 2017 -0400 |
@@ -0,0 +1,20 @@ +<macros> + <xml name="requirements"> + <requirements> + <requirement type="package" version="2.5.2b">star</requirement> + <requirement type="package" version="0.1.19">samtools</requirement> + </requirements> + </xml> + <token name="@FASTQ_GZ_OPTION@"> + --readFilesCommand zcat + </token> + <xml name="citations"> + <citations> + <citation type="doi">10.1093/bioinformatics/bts635</citation> + </citations> + </xml> + <xml name="@SJDBOPTIONS@"> + <param argument="--sjdbGTFfile" type="data" format="gff3,gtf" label="Gene model (gff3,gtf) file for splice junctions" optional="true" help="Exon junction information for mapping splices"/> + <param argument="--sjdbOverhang" type="integer" min="1" value="100" label="Length of the genomic sequence around annotated junctions" help="Used in constructing the splice junctions database. Ideal value is ReadLength-1"/> + </xml> +</macros> |
diff -r e4b87a00b1df -r cdc4d8a998e1 data_manager/rna_star_index_builder.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_manager/rna_star_index_builder.py Fri Apr 21 12:36:31 2017 -0400 |
@@ -0,0 +1,30 @@ +#!/usr/bin/env python + +import json +import optparse + + +def main(): + parser = optparse.OptionParser() + parser.add_option( '--config-file', dest='config_file', action='store', type="string") + parser.add_option( '--value', dest='value', action='store', type="string" ) + parser.add_option( '--dbkey', dest='dbkey', action='store', type="string" ) + parser.add_option( '--name', dest='name', action='store', type="string" ) + parser.add_option( '--subdir', dest='subdir', action='store', type="string" ) + parser.add_option( '--data-table', dest='data_table', action='store', type="string" ) + parser.add_option( '--withGTF', dest='withGTF', action='store_true' ) + (options, args) = parser.parse_args() + + if options.dbkey in [ None, '', '?' ]: + raise Exception( '"%s" is not a valid dbkey. You must specify a valid dbkey.' % ( options.dbkey ) ) + + withGTF = "0" + if options.withGTF: + withGTF = "1" + + data_manager_dict = {'data_tables': {options.data_table: [dict( value=options.value, dbkey=options.dbkey, name=options.name, path=options.subdir, withGTF=withGTF )]}} + open( options.config_file, 'wb' ).write( json.dumps( data_manager_dict ) ) + + +if __name__ == "__main__": + main() |
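Note on the hunk above: the new script no longer runs STAR itself; it only records one row for the data table in the JSON file that Galaxy reads back. A minimal sketch of that payload follows, assuming nothing beyond what the diff shows. The helper name `build_data_manager_json` is hypothetical and not part of the commit, and the sketch returns the JSON text instead of writing it to a file, which sidesteps the `'wb'` open mode used in the committed script (writing `json.dumps` output to a binary-mode file works on Python 2 but raises a `TypeError` on Python 3).

```python
import json


def build_data_manager_json(data_table, value, dbkey, name, subdir, with_gtf=False):
    """Return the JSON text for a single data table entry.

    Hypothetical helper mirroring the dictionary built in
    rna_star_index_builder.py; it is not part of the commit.
    """
    # Same validation as the committed script: reject missing or placeholder dbkeys.
    if dbkey in (None, "", "?"):
        raise ValueError('"%s" is not a valid dbkey. You must specify a valid dbkey.' % dbkey)

    entry = {
        "value": value,
        "dbkey": dbkey,
        "name": name,
        "path": subdir,
        # The script stores the flag as the string "1" or "0", not as a boolean.
        "withGTF": "1" if with_gtf else "0",
    }
    return json.dumps({"data_tables": {data_table: [entry]}})


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(build_data_manager_json("rnastar_index2", "phiX", "phiX", "phiX174",
                                  "dataset_1_files", with_gtf=True))
```

The `withGTF` string ends up in the extra column that the same changeset adds to `data_manager_conf.xml`.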
diff -r e4b87a00b1df -r cdc4d8a998e1 data_manager/rna_star_index_builder.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_manager/rna_star_index_builder.xml Fri Apr 21 12:36:31 2017 -0400 |
@@ -0,0 +1,133 @@ +<tool id="rna_star_index_builder_data_manager" name="rnastar index2" tool_type="manage_data" version="0.0.4" profile="17.01"> + <description>builder</description> + + <macros> + <import>macros.xml</import> + </macros> + + <expand macro="requirements" /> + + <command><![CDATA[ + #import json, os + #set params = json.loads( open( str($out_file) ).read() ) + #set target_directory = $params[ 'output_data' ][0]['extra_files_path'].encode('ascii', 'replace') + #set subdir = os.path.basename(target_directory) + + mkdir -p '${target_directory}/${subdir}' && + + STAR + --runMode genomeGenerate + --genomeFastaFiles '${all_fasta_source.fields.path}' + --genomeDir '${target_directory}/${subdir}' + #if str($GTFconditional.GTFselect) == "withGTF": + --sjdbGTFfile '${GTFconditional.sjdbGTFfile}' + --sjdbOverhang '${GTFconditional.sjdbOverhang}' + #end if + --runThreadN \${GALAXY_SLOTS:-2} && + + python ${__tool_directory__}/rna_star_index_builder.py + --config-file '${out_file}' + --value '${all_fasta_source.fields.value}' + --dbkey '${all_fasta_source.fields.dbkey}' + #if $name: + --name '$name' + #else + --name '${all_fasta_source.fields.name}' + #end if + #if str($GTFconditional.GTFselect) == "withGTF": + --withGTF 1 + #end if + --data-table 'rnastar_index2' + --subdir '${subdir}' + ]]></command> + <inputs> + <param name="all_fasta_source" type="select" label="Source FASTA Sequence"> + <options from_data_table="all_fasta"/> + </param> + <param name="name" + type="text" + value="" + label="Informative name for sequence index" + help="By using different settings, you may have several indices per reference genome. 
Give an appropriate description to the index to distinguish between indices"/> + <conditional name="GTFconditional"> + <param name="GTFselect" type="select" label="Reference genome with or without an annotation" help="Must the index have been created WITH a GTF file (if not you can specify one afterward)."> + <option value="withoutGTF">use genome reference without builtin gene-model</option> + <option value="withGTF">use genome reference with builtin gene-model</option> + </param> + <when value="withGTF"> + <param argument="--sjdbGTFfile" type="data" format="gff3,gtf" label="Gene model (gff3,gtf) file for splice junctions" optional="false" help="Exon junction information for mapping splices"/> + <param argument="--sjdbOverhang" type="integer" min="1" value="100" label="Length of the genomic sequence around annotated junctions" help="Used in constructing the splice junctions database. Ideal value is ReadLength-1"/> + </when> + <when value="withoutGTF" /> + </conditional> + </inputs> + + <outputs> + <data name="out_file" format="data_manager_json"/> + </outputs> + + <!-- not available in planemo at the moment of writing + <tests> + <test> + <param name="all_fasta_source" value="phiX.fa"/> + <param name="sequence_name" value="phiX"/> + <param name="sequence_id" value="minimal-settings"/> + <param name="modelformat" value="None"/> + + <output name="out_file" file="test_star_01.data_manager_json"/> + </test> + </tests> + --> + + <help> + +.. class:: infomark + +<![CDATA[ +*What it does* + +This is a Galaxy datamanager for the rna STAR gap-aware RNA aligner. + +Please read the fine manual - that and the google group are the places to learn about the options above. + +*Memory requirements* + +To run efficiently, RNA-STAR requires enough free memory to +hold the SA-indexed reference genome in RAM. For Human Genome hg19 this +index is about 27GB and running RNA-STAR requires approximately ~30GB of RAM. 
+For custom genomes, the rule of thub is to multiply the size of the +reference FASTA file by 9 to estimated required amount of RAM. + +*Note on sjdbOverhang* + +From https://groups.google.com/forum/#!topic/rna-star/h9oh10UlvhI:: + + James is right, using large enough --sjdbOverhang is safer and should not generally cause any problems with reads of varying length. If your reads are very short, <50b, then I would strongly recommend using optimum --sjdbOverhang=mateLength-1 + By mate length I mean the length of one of the ends of the read, i.e. it's 100 for 2x100b PE or 1x100b SE. For longer reads you can simply use generic --sjdbOverhang 100. + It is a bit confusing because of the way I named this parameter. --sjdbOverhang Noverhang is only used at the genome generation step for constructing the reference sequence out of the annotations. + Basically, the Noverhang exonic bases from the donor site and Noverhang exonic bases from the acceptor site are spliced together for each of the junctions, and these spliced sequences are added to the genome sequence. + + At the mapping stage, the reads are aligned to both genomic and splice sequences simultaneously. If a read maps to one of spliced sequences and crosses the "junction" in the middle of it, the coordinates of two pspliced pieces are translated back to genomic space and added to the collection of mapped pieces, which are then all "stitched" together to form the final alignment. Since in the process of "maximal mapped length" search the read is split into pieces of no longer than --seedSearchStartLmax (=50 by default) bases, even if the read (mate) is longer than --sjdbOverhang, it can still be mapped to the spliced reference, as long as --sjdbOverhang > --seedSearchStartLmax. 
+ + Cheers + Alex + +*Note on gene model requirements for splice junctions* + +From https://groups.google.com/forum/#!msg/rna-star/3Y_aaTuzBrE/lUylTB8h5vMJ:: + + When you generate a genome with annotations, you need to specify --sjdbOverhang value, which ideally should be equal to (oneMateLength-1), or you could use a generic value of ~100. + + Your gtf lines look fine to me. STAR needs 3 features from a GTF file: + 1. Chromosome names in col.1 that agree with chromosome names in genome .fasta files. If you have "chr2L" names in the genome .fasta files, and "2L" in the .gtf file, then you need to use --sjdbGTFchrPrefix chr option. + 2. 'exon' in col.3 for the exons of all transcripts (this name can be changed with --sjdbGTFfeatureExon) + 3. 'transcript_id' attribute that assigns each exon to a transcript (--this name can be changed with --sjdbGTFtagExonParentTranscript) + + Cheers + Alex + +**Notice:** If you leave name, description, or id blank, it will be generated automatically. +]]> + </help> + <expand macro="citations" /> +</tool> |
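Note on the help text in the hunk above: it gives two rules of thumb, an ideal `--sjdbOverhang` of one mate length minus 1, and a RAM estimate of roughly 9 times the size of the reference FASTA. They can be written out as simple arithmetic. This is only a sketch: the function names are hypothetical, and the factor 9 is the rough multiplier quoted in the help text, not a measured value.

```python
def ideal_sjdb_overhang(mate_length):
    """Ideal --sjdbOverhang per the quoted advice: length of one mate minus 1.

    For 2x100b paired-end or 1x100b single-end reads, mate_length is 100.
    """
    if mate_length < 2:
        # The tool's parameter has min="1", so the mate must be at least 2 bases.
        raise ValueError("mate_length must be at least 2")
    return mate_length - 1


def estimate_star_ram_bytes(fasta_size_bytes, factor=9):
    """Rough RAM estimate: reference FASTA size times the rule-of-thumb factor."""
    return fasta_size_bytes * factor


if __name__ == "__main__":
    print(ideal_sjdb_overhang(100))
    # A reference of about 3 GiB gives an estimate of about 27 GiB.
    print(estimate_star_ram_bytes(3 * 2**30) / 2**30)
```

The second figure is in line with the roughly 27GB index size the help text quotes for hg19.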
diff -r e4b87a00b1df -r cdc4d8a998e1 data_manager/rnastar_index_builder.py --- a/data_manager/rnastar_index_builder.py Wed Nov 23 17:55:57 2016 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
@@ -1,121 +0,0 @@ -#!/usr/bin/env python -# Dan Blankenberg -# adapted from Dan's BWA one for rna star -# ross lazarus sept 2014 -# fixed some stupid bugs January 2015 -import json -import optparse -import os -import subprocess -import sys -import tempfile - -CHUNK_SIZE = 2**20 -ONE_GB = 2**30 - -DEFAULT_DATA_TABLE_NAME = "rnastar_index" - - -def get_id_name( params, dbkey, fasta_description=None): - # TODO: ensure sequence_id is unique and does not already appear in location file - sequence_id = params['param_dict']['sequence_id'] - if not sequence_id: - sequence_id = dbkey - - sequence_name = params['param_dict']['sequence_name'] - if not sequence_name: - sequence_name = fasta_description - if not sequence_name: - sequence_name = dbkey - return sequence_id, sequence_name - - -def _add_data_table_entry( data_manager_dict, data_table_name, data_table_entry ): - data_manager_dict['data_tables'] = data_manager_dict.get( 'data_tables', {} ) - data_manager_dict['data_tables'][ data_table_name ] = data_manager_dict['data_tables'].get( data_table_name, [] ) - data_manager_dict['data_tables'][ data_table_name ].append( data_table_entry ) - return data_manager_dict - - -def build_rnastar_index(data_manager_dict, fasta_filename, target_directory, - dbkey, sequence_id, sequence_name, data_table_name, - sjdbOverhang, sjdbGTFfile, sjdbFileChrStartEnd, - sjdbGTFtagExonParentTranscript, sjdbGTFfeatureExon, - sjdbGTFchrPrefix, n_threads): - # TODO: allow multiple FASTA input files - fasta_base_name = os.path.basename( fasta_filename ) - sym_linked_fasta_filename = os.path.join( target_directory, fasta_base_name ) - os.symlink( fasta_filename, sym_linked_fasta_filename ) - # print >> sys.stdout,'made',sym_linked_fasta_filename - cl = ['STAR', '--runMode', 'genomeGenerate', '--genomeFastaFiles', sym_linked_fasta_filename, '--genomeDir', target_directory, '--runThreadN', n_threads ] - - if sjdbGTFfile: - cl += [ '--sjdbGTFfeatureExon', sjdbGTFfeatureExon, 
'--sjdbGTFtagExonParentTranscript', sjdbGTFtagExonParentTranscript] - if (sjdbGTFchrPrefix > ''): - cl += ['--sjdbGTFchrPrefix', sjdbGTFchrPrefix] - cl += ['--sjdbOverhang', sjdbOverhang, '--sjdbGTFfile', sjdbGTFfile] - elif sjdbFileChrStartEnd: - cl += ['--sjdbFileChrStartEnd', sjdbFileChrStartEnd, '--sjdbOverhang', sjdbOverhang] - - tmp_stderr = tempfile.NamedTemporaryFile( prefix="tmp-data-manager-rnastar-index-builder-stderr" ) - proc = subprocess.Popen( args=cl, shell=False, cwd=target_directory, stderr=tmp_stderr.fileno() ) - return_code = proc.wait() - if return_code: - tmp_stderr.flush() - tmp_stderr.seek(0) - print >> sys.stderr, "Error building index:" - while True: - chunk = tmp_stderr.read( CHUNK_SIZE ) - if not chunk: - break - sys.stderr.write( chunk ) - sys.exit( return_code ) - tmp_stderr.close() - data_table_entry = dict( value=sequence_id, dbkey=dbkey, name=sequence_name, path=fasta_base_name ) - data_manager_dict = _add_data_table_entry( data_manager_dict, data_table_name, data_table_entry ) - return data_manager_dict - - -def main(): - # Parse Command Line - parser = optparse.OptionParser() - parser.add_option( '--fasta_filename', dest='fasta_filename', action='store', type="string", default=None, help='fasta_filename' ) - parser.add_option( '--fasta_dbkey', dest='fasta_dbkey', action='store', type="string", default=None, help='fasta_dbkey' ) - parser.add_option( '--fasta_description', dest='fasta_description', action='store', type="string", default=None, help='fasta_description' ) - parser.add_option( '--data_table_name', dest='data_table_name', action='store', type="string", default=None, help='data_table_name' ) - parser.add_option( '--sjdbGTFfile', type="string", default=None ) - parser.add_option( '--sjdbGTFchrPrefix', type="string", default=None ) - parser.add_option( '--sjdbGTFfeatureExon', type="string", default=None ) - parser.add_option( '--sjdbGTFtagExonParentTranscript', type="string", default=None ) - parser.add_option( 
'--sjdbFileChrStartEnd', type="string", default=None ) - parser.add_option( '--sjdbOverhang', type="string", default='100' ) - parser.add_option( '--runThreadN', type="string", default='4' ) - (options, args) = parser.parse_args() - filename = args[0] - params = json.loads( open( filename ).read() ) - target_directory = params[ 'output_data' ][0]['extra_files_path'].encode('ascii', 'replace') - dbkey = options.fasta_dbkey - if dbkey in [ None, '', '?' ]: - raise Exception( '"%s" is not a valid dbkey. You must specify a valid dbkey.' % ( dbkey ) ) - - sequence_id, sequence_name = get_id_name( params, dbkey=dbkey, fasta_description=options.fasta_description ) - - try: - os.mkdir( target_directory ) - except OSError: - pass - # build the index - data_manager_dict = build_rnastar_index( - data_manager_dict={}, fasta_filename=options.fasta_filename, - target_directory=target_directory, dbkey=dbkey, sequence_id=sequence_id, - sequence_name=sequence_name, data_table_name=options.data_table_name, - sjdbOverhang=options.sjdbOverhang, sjdbGTFfile=options.sjdbGTFfile, - sjdbFileChrStartEnd=options.sjdbFileChrStartEnd, - sjdbGTFtagExonParentTranscript=options.sjdbGTFtagExonParentTranscript, - sjdbGTFfeatureExon=options.sjdbGTFfeatureExon, - sjdbGTFchrPrefix=options.sjdbGTFchrPrefix, n_threads=options.runThreadN) - open( filename, 'wb' ).write( json.dumps( data_manager_dict ) ) - - -if __name__ == "__main__": - main() |
diff -r e4b87a00b1df -r cdc4d8a998e1 data_manager/rnastar_index_builder.xml --- a/data_manager/rnastar_index_builder.xml Wed Nov 23 17:55:57 2016 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
@@ -1,127 +0,0 @@ -<tool id="rnastar_index_builder_data_manager" name="rnastar index" tool_type="manage_data" version="0.0.2"> - <description>builder</description> - <requirements> - <requirement type="package" version="2.4.0d">rnastar</requirement> - </requirements> - <command interpreter="python"> -rnastar_index_builder.py "${out_file}" --fasta_filename "${all_fasta_source.fields.path}" ---fasta_dbkey "${all_fasta_source.fields.dbkey}" --fasta_description "${all_fasta_source.fields.name}" ---runThreadN 1 -#if $genemodel.modelformat=="gff3": -#import pipes - --sjdbGTFchrPrefix ${ pipes.quote( str( $genemodel.sjdbGTFchrPrefix ) ) or "''" } - --sjdbOverhang "${genemodel.sjdbOverhang}" - --sjdbGTFfile "${genemodel.sjdbGTFfile}" - --sjdbGTFtagExonParentTranscript ${ pipes.quote( str( $genemodel.sjdbGTFtagExonParentTranscript ) ) or "''" } - --sjdbGTFfeatureExon ${ pipes.quote( str( $genemodel.sjdbGTFfeatureExon ) ) or "''" } -#end if -#if $genemodel.modelformat=="bed": - --sjdbFileChrStartEnd "${genemodel.sjdbFileChrStartEnd}" - --sjdbOverhang "${genemodel.sjdbOverhang}" -#end if -#if $genemodel.modelformat=="None": - --sjdbOverhang 0 -#end if ---data_table_name "rnastar_index" - </command> - <inputs> - <param name="all_fasta_source" type="select" label="Source FASTA Sequence"> - <options from_data_table="all_fasta"/> - </param> - <param type="text" name="sequence_name" value="" label="Informative name for sequence index" /> - <param type="text" name="sequence_id" value="" label="ID for sequence index" /> - <conditional name="genemodel"> - <param name="modelformat" type="select" - label="Choose the format of gene model data from your history - bed or gff3" - help="This will be the source of splice junction indexing if required"> - <option value="gff3" selected="true">gff3,gtf</option> - <option value="bed">BED - tabular chr,start,end,strand</option> - <option value="None">None - no splice junction index</option> - </param> - <when value="gff3"> - <param type="data" 
format="gff3,gff" name="sjdbGTFfile" value="" label="Gene model - must be gff3 or compatible and must match the input genome" - help="Required if you want to index splice junctions during index generation." /> - - <param type="text" name="sjdbGTFchrPrefix" value="chr" label="String prefix for GTF chromosomes" - help='GTF prefix for chromosome names (e.g. "chr" to use ENSMEBL annotations with UCSC geneomes)' > - <sanitizer invalid_char=""> - <valid initial="string.printable"/> - </sanitizer> - </param> - <param type="text" name="sjdbGTFfeatureExon" value="exon" label="GTF feature to use as exon marker" - help="GTF feature type in GTF file to be used as exons for building transcripts - use what's in your GTF"> - <sanitizer invalid_char=""> - <valid initial="string.printable"/> - </sanitizer> - </param> - - <param type="text" name="sjdbGTFtagExonParentTranscript" value="transcript_id" label="GTF feature to define for each exon's parents" - help="GTF tag name to be used as exons' parents for building transcripts - use what's in your gene model file eg parent for gff3"> - <sanitizer invalid_char=""> - <valid initial="string.printable"/> - </sanitizer> - </param> - - <param type="integer" name="sjdbOverhang" value="100" label="Splice junction overhang. If=0, splice junction database NOT used" - help="integer length of the donor/acceptor sequence on each side, (mate_length - 1)" /> - - </when> - <when value='bed'> - <param type="data" format="bed" name="sjdbFileChrStartEnd" value="" label="Introns as a tabular bed (chr,start,end,strand) file matching the input genome" - help="Required if you want to index splice junctions during index generation." /> - <param type="integer" name="sjdbOverhang" value="100" label="Splice junction overhang. 
If=0, splice junction database NOT used" - help="integer length of the donor/acceptor sequence on each side, (mate_length - 1)" /> - </when> - <when value='None'> - </when> - </conditional> - </inputs> - <outputs> - <data name="out_file" format="data_manager_json"/> - </outputs> - <help> - -.. class:: infomark - -<![CDATA[ -*What it does* - -This is a Galaxy datamanager for the rna STAR gap-aware RNA aligner. - -Please read the fine manual - that and the google group are the places to learn about the options above. - -*Note on sjdbOverhang* - -From https://groups.google.com/forum/#!topic/rna-star/h9oh10UlvhI:: - - James is right, using large enough --sjdbOverhang is safer and should not generally cause any problems with reads of varying length. If your reads are very short, <50b, then I would strongly recommend using optimum --sjdbOverhang=mateLength-1 - By mate length I mean the length of one of the ends of the read, i.e. it's 100 for 2x100b PE or 1x100b SE. For longer reads you can simply use generic --sjdbOverhang 100. - It is a bit confusing because of the way I named this parameter. --sjdbOverhang Noverhang is only used at the genome generation step for constructing the reference sequence out of the annotations. - Basically, the Noverhang exonic bases from the donor site and Noverhang exonic bases from the acceptor site are spliced together for each of the junctions, and these spliced sequences are added to the genome sequence. - - At the mapping stage, the reads are aligned to both genomic and splice sequences simultaneously. If a read maps to one of spliced sequences and crosses the "junction" in the middle of it, the coordinates of two pspliced pieces are translated back to genomic space and added to the collection of mapped pieces, which are then all "stitched" together to form the final alignment. 
Since in the process of "maximal mapped length" search the read is split into pieces of no longer than --seedSearchStartLmax (=50 by default) bases, even if the read (mate) is longer than --sjdbOverhang, it can still be mapped to the spliced reference, as long as --sjdbOverhang > --seedSearchStartLmax. - - Cheers - Alex - -*Note on gene model requirements for splice junctions* - -From https://groups.google.com/forum/#!msg/rna-star/3Y_aaTuzBrE/lUylTB8h5vMJ:: - - When you generate a genome with annotations, you need to specify --sjdbOverhang value, which ideally should be equal to (oneMateLength-1), or you could use a generic value of ~100. - - Your gtf lines look fine to me. STAR needs 3 features from a GTF file: - 1. Chromosome names in col.1 that agree with chromosome names in genome .fasta files. If you have "chr2L" names in the genome .fasta files, and "2L" in the .gtf file, then you need to use --sjdbGTFchrPrefix chr option. - 2. 'exon' in col.3 for the exons of all transcripts (this name can be changed with --sjdbGTFfeatureExon) - 3. 'transcript_id' attribute that assigns each exon to a transcript (--this name can be changed with --sjdbGTFtagExonParentTranscript) - - Cheers - Alex - -**Notice:** If you leave name, description, or id blank, it will be generated automatically. -]]> - </help> - <citations> - <citation type="doi">doi: 10.1093/bioinformatics/bts635</citation> - </citations> -</tool> |
diff -r e4b87a00b1df -r cdc4d8a998e1 data_manager_conf.xml --- a/data_manager_conf.xml Wed Nov 23 17:55:57 2016 -0500 +++ b/data_manager_conf.xml Fri Apr 21 12:36:31 2017 -0400 |
@@ -1,22 +1,24 @@ <?xml version="1.0"?> <data_managers> - - <data_manager tool_file="data_manager/rnastar_index_builder.xml" id="rnastar_index_builder" version="0.0.1"> - <data_table name="rnastar_index"> + <data_manager tool_file="data_manager/rna_star_index_builder.xml" id="rna_star_index_builder" version="0.0.3"> + <data_table name="rnastar_index2"> <output> <column name="value" /> <column name="dbkey" /> <column name="name" /> <column name="path" output_ref="out_file" > <move type="directory" relativize_symlinks="True"> - <!-- <source>${path}</source>--> <!-- out_file.extra_files_path is used as base by default --> <!-- if no source, eg for type=directory, then refers to base --> - <target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${dbkey}/rnastar_index/${value}</target> + <!-- <source>${path}</source> + out_file.extra_files_path is used as base by default + if no source, eg for type=directory, then refers to base + --> + <target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${dbkey}/rnastar_index2/${value}</target> </move> - <value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/rnastar_index/${value}/${path}</value_translation> + <value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/rnastar_index2/${value}/${path}</value_translation> <value_translation type="function">abspath</value_translation> </column> + <column name="withGTF" /> </output> </data_table> </data_manager> - </data_managers> |
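Note on the hunk above: the `value_translation` template decides what ends up in the `path` column of `rnastar_index2`. The sketch below illustrates how the placeholders expand. It uses Python's `string.Template` purely as a stand-in; Galaxy's own templating and move logic are not shown in this changeset, and `resolve_index_path` is a hypothetical name.

```python
import os
from string import Template

# Template text copied from the new data_manager_conf.xml.
VALUE_TRANSLATION = Template(
    "${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/rnastar_index2/${value}/${path}"
)


def resolve_index_path(data_path, dbkey, value, path):
    """Expand the template, then apply the second translation step (abspath)."""
    raw = VALUE_TRANSLATION.substitute(
        GALAXY_DATA_MANAGER_DATA_PATH=data_path,
        dbkey=dbkey,
        value=value,
        path=path,
    )
    return os.path.abspath(raw)


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(resolve_index_path("/galaxy/tool-data", "hg19", "hg19canon", "dataset_1_files"))
```

Because the directory segment changes from `rnastar_index` to `rnastar_index2`, indexes built with the new data manager are stored apart from those built with the removed one.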
diff -r e4b87a00b1df -r cdc4d8a998e1 tool-data/all_fasta.loc.sample --- a/tool-data/all_fasta.loc.sample Wed Nov 23 17:55:57 2016 -0500 +++ b/tool-data/all_fasta.loc.sample Fri Apr 21 12:36:31 2017 -0400 |
@@ -4,13 +4,13 @@ #all_fasta.loc. This file has the format (white space characters are #TAB characters): # -#<unique_build_id> <dbkey> <display_name> <file_path> +#<unique_build_id> <dbkey> <display_name> <file_path> # #So, all_fasta.loc could look something like this: # -#apiMel3 apiMel3 Honeybee (Apis mellifera): apiMel3 /path/to/genome/apiMel3/apiMel3.fa -#hg19canon hg19 Human (Homo sapiens): hg19 Canonical /path/to/genome/hg19/hg19canon.fa -#hg19full hg19 Human (Homo sapiens): hg19 Full /path/to/genome/hg19/hg19full.fa +#apiMel3 apiMel3 Honeybee (Apis mellifera): apiMel3 /path/to/genome/apiMel3/apiMel3.fa +#hg19canon hg19 Human (Homo sapiens): hg19 Canonical /path/to/genome/hg19/hg19canon.fa +#hg19full hg19 Human (Homo sapiens): hg19 Full /path/to/genome/hg19/hg19full.fa # #Your all_fasta.loc file should contain an entry for each individual #fasta file. So there will be multiple fasta files for each build, |
diff -r e4b87a00b1df -r cdc4d8a998e1 tool-data/rnastar_index.loc.sample
--- a/tool-data/rnastar_index.loc.sample Wed Nov 23 17:55:57 2016 -0500
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,38 +0,0 @@
-#This is a sample file distributed with Galaxy that enables tools
-#to use a directory of BWA indexed sequences data files. You will need
-#to create these data files and then create a bwa_index.loc file
-#similar to this one (store it in this directory) that points to
-#the directories in which those files are stored. The bwa_index.loc
-#file has this format (longer white space characters are TAB characters):
-#
-#<unique_build_id> <dbkey> <display_name> <file_path>
-#
-#So, for example, if you had phiX indexed stored in
-#/depot/data2/galaxy/phiX/base/,
-#then the bwa_index.loc entry would look like this:
-#
-#phiX174 phiX phiX Pretty /depot/data2/galaxy/phiX/base/phiX.fa
-#
-#and your /depot/data2/galaxy/phiX/base/ directory
-#would contain phiX.fa.* files:
-#
-#-rw-r--r-- 1 james universe 830134 2005-09-13 10:12 phiX.fa.amb
-#-rw-r--r-- 1 james universe 527388 2005-09-13 10:12 phiX.fa.ann
-#-rw-r--r-- 1 james universe 269808 2005-09-13 10:12 phiX.fa.bwt
-#...etc...
-#
-#Your bwa_index.loc file should include an entry per line for each
-#index set you have stored. The "file" in the path does not actually
-#exist, but it is the prefix for the actual index files. For example:
-#
-#phiX174 phiX phiX174 /depot/data2/galaxy/phiX/base/phiX.fa
-#hg18canon hg18 hg18 Canonical /depot/data2/galaxy/hg18/base/hg18canon.fa
-#hg18full hg18 hg18 Full /depot/data2/galaxy/hg18/base/hg18full.fa
-#/orig/path/hg19.fa hg19 hg19 /depot/data2/galaxy/hg19/base/hg19.fa
-#...etc...
-#
-#Note that for backwards compatibility with workflows, the unique ID of
-#an entry must be the path that was in the original loc file, because that
-#is the value stored in the workflow for that parameter. That is why the
-#hg19 entry above looks odd. New genomes can be better-looking.
-#
diff -r e4b87a00b1df -r cdc4d8a998e1 tool-data/rnastar_index2.loc.sample
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool-data/rnastar_index2.loc.sample Fri Apr 21 12:36:31 2017 -0400
@@ -0,0 +1,23 @@
+#This is a sample file distributed with Galaxy that enables tools
+#to use a directory of rna-star indexed sequences data files. You will
+#need to create these data files and then create a rnastar_index2.loc
+#file similar to this one (store it in this directory) that points to
+#the directories in which those files are stored. The rnastar_index2.loc
+#file has this format (longer white space characters are TAB characters):
+#
+#<unique_build_id> <dbkey> <display_name> <file_base_path> <withGTF>
+#
+#The <withGTF> column should be 1 or 0, indicating whether the index was made
+#with an annotation (i.e., --sjdbGTFfile and --sjdbOverhang were used) or not,
+#respectively.
+#
+#Note that STAR indices can become quite large. Consequently, it is only
+#advisable to create indices with annotations if it's known ahead of time that
+#(A) the annotations won't be frequently updated and (B) the read lengths used
+#will also rarely vary. If either of these is not the case, it's advisable to
+#create indices without annotations and then specify an annotation file and
+#maximum read length (minus 1) when running STAR.
+#
+#hg19 hg19 hg19 full /mnt/galaxyIndices/genomes/hg19/rnastar 0
+#hg19Ensembl hg19Ensembl hg19 full with Ensembl annotation /mnt/galaxyIndices/genomes/hg19Ensembl/rnastar 1
+
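The rnastar_index2.loc format described above (five TAB-separated columns, with # as the comment character) can be read with a few lines of Python. This is a minimal sketch, not the parser Galaxy uses: the function name parse_loc is hypothetical, and converting the last column to a boolean is a convenience chosen for the example.

```python
# Column order as documented in rnastar_index2.loc.sample.
COLUMNS = ["value", "dbkey", "name", "path", "withGTF"]


def parse_loc(text, comment_char="#"):
    """Return one dict per data line of a rnastar_index2.loc-style file.

    Hypothetical helper for illustration; blank lines and lines starting
    with the comment character are skipped.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_char):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                "expected %d TAB-separated fields, got %d: %r"
                % (len(COLUMNS), len(fields), line)
            )
        entry = dict(zip(COLUMNS, fields))
        # "1" means the index was built with an annotation, "0" without.
        entry["withGTF"] = entry["withGTF"] == "1"
        entries.append(entry)
    return entries


# The two example rows from the sample file, uncommented and TAB-separated.
SAMPLE = (
    "#<unique_build_id>\t<dbkey>\t<display_name>\t<file_base_path>\t<withGTF>\n"
    "hg19\thg19\thg19 full\t/mnt/galaxyIndices/genomes/hg19/rnastar\t0\n"
    "hg19Ensembl\thg19Ensembl\thg19 full with Ensembl annotation"
    "\t/mnt/galaxyIndices/genomes/hg19Ensembl/rnastar\t1\n"
)

if __name__ == "__main__":
    for entry in parse_loc(SAMPLE):
        print(entry["value"], entry["path"], entry["withGTF"])
```

A tool that needs an annotated index could then filter the entries on the withGTF field.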
diff -r e4b87a00b1df -r cdc4d8a998e1 tool_data_table_conf.xml.sample
--- a/tool_data_table_conf.xml.sample Wed Nov 23 17:55:57 2016 -0500
+++ b/tool_data_table_conf.xml.sample Fri Apr 21 12:36:31 2017 -0400
@@ -1,12 +1,12 @@
 <tables>
     <!-- Locations of all fasta files under genome directory -->
-    <table name="all_fasta" comment_char="#">
+    <table name="all_fasta" comment_char="#" allow_duplicate_entries="False">
         <columns>value, dbkey, name, path</columns>
         <file path="tool-data/all_fasta.loc" />
     </table>
     <!-- Locations of indexes in the BWA mapper format -->
-    <table name="rnastar_index" comment_char="#">
-        <columns>value, dbkey, name, path</columns>
-        <file path="tool-data/rnastar_index.loc" />
+    <table name="rnastar_index2" comment_char="#" allow_duplicate_entries="False">
+        <columns>value, dbkey, name, path, withGTF</columns>
+        <file path="tool-data/rnastar_index2.loc" />
     </table>
 </tables>
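To see how the column list in this table definition lines up with the .loc file, the new version of the XML can be inspected with the standard library. The sketch below is only an illustration of the file's structure, not how Galaxy loads data tables; the function name table_columns is hypothetical, and the embedded XML is the post-change content of the hunk above.

```python
import xml.etree.ElementTree as ET

# New-side content of tool_data_table_conf.xml.sample from the hunk above.
CONF = """<tables>
    <table name="all_fasta" comment_char="#" allow_duplicate_entries="False">
        <columns>value, dbkey, name, path</columns>
        <file path="tool-data/all_fasta.loc" />
    </table>
    <table name="rnastar_index2" comment_char="#" allow_duplicate_entries="False">
        <columns>value, dbkey, name, path, withGTF</columns>
        <file path="tool-data/rnastar_index2.loc" />
    </table>
</tables>"""


def table_columns(conf_xml):
    """Map each table name to its column list and .loc file path.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(conf_xml)
    tables = {}
    for table in root.findall("table"):
        raw = table.findtext("columns", default="")
        columns = [c.strip() for c in raw.split(",") if c.strip()]
        tables[table.get("name")] = {
            "columns": columns,
            "loc_file": table.find("file").get("path"),
        }
    return tables


if __name__ == "__main__":
    for name, info in table_columns(CONF).items():
        print(name, info["columns"], info["loc_file"])
```

The rnastar_index2 table carries the extra withGTF column, so each data line in rnastar_index2.loc needs five fields rather than the four used by all_fasta.loc.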
diff -r e4b87a00b1df -r cdc4d8a998e1 tool_dependencies.xml
--- a/tool_dependencies.xml Wed Nov 23 17:55:57 2016 -0500
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,6 +0,0 @@
-<?xml version="1.0"?>
-<tool_dependency>
-    <package name="rnastar" version="2.4.0d">
-        <repository changeset_revision="54c96a529c59" name="package_rnastar_2_4_0d" owner="iuc" toolshed="https://toolshed.g2.bx.psu.edu" />
-    </package>
-</tool_dependency>