# HG changeset patch # User damion # Date 1439150870 14400 # Node ID 5c5027485f7d113b2d4fad28a800cf6268cdf9a8 # Parent d31a1bd74e6349ad52de432e9a1708e2f34100c8 Uploaded correct file diff -r d31a1bd74e63 -r 5c5027485f7d LICENSE.md --- a/LICENSE.md Sun Aug 09 16:05:40 2015 -0400 +++ b/LICENSE.md Sun Aug 09 16:07:50 2015 -0400 @@ -1,7 +1,7 @@ Source Code License An Open Source Initiative (OSI) approved license -ffp_phylogeny source code is licensed under the Academic Free License version 3.0. +versioned_data source code is licensed under the Academic Free License version 3.0. Licensed under the Academic Free License version 3.0 diff -r d31a1bd74e63 -r 5c5027485f7d README.md --- a/README.md Sun Aug 09 16:05:40 2015 -0400 +++ b/README.md Sun Aug 09 16:07:50 2015 -0400 @@ -1,47 +1,52 @@ -Feature Frequency Profile Phylogenies -===================================== +# Versioned Data System +The Galaxy and command line Versioned Data System manages the retrieval of current and past versions of selected reference sequence databases from local data stores. +--- -Introduction ------------- - -FFP (Feature frequency profile) is an alignment free comparison tool for phylogenetic analysis and text comparison. It can be applied to nucleotide sequences, complete genomes, proteomes and even used for text comparison. This software is a Galaxy (http://galaxyproject.org) tool for calculating FFP on one or more fasta sequence or text datasets. - -The original command line ffp-phylogeny code is at http://ffp-phylogeny.sourceforge.net/ . This tool uses Aaron Petkau's modified version: https://github.com/apetkau/ffp-3.19-custom . Aaron has quite a good writeup of the technique as well at https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny . +0. Overview +1. [Setup for Admins](doc/setup.md) + 1. [Galaxy tool installation](doc/galaxy_tool_install.md) + 2. [Server data stores](doc/data_stores.md) + 3. [Data store examples](doc/data_store_examples.md) + 4. [Galaxy "Versioned Data" library setup](doc/galaxy_library.md) + 5. [Workflow configuration](doc/workflows.md) + 6. [Permissions, security, and maintenance](doc/maintenance.md) + 7. [Problem solving](doc/problem_solving.md) +2. [Using the Galaxy Versioned data tool](doc/galaxy_tool.md) +3. [System Design](doc/design.md) +4. [Background Research](doc/background.md) +5. [Server data store and galaxy library organization](doc/data_store_org.md) +6. [Data Provenance and Reproducibility](doc/data_provenance.md) +7. [Caching System](doc/caching.md) -**Installation Note** : Your Galaxy server will need the groff package to be installed on it first (to generate ffp-phylogeny man pages). A cryptic error will occur if it isn't: "troff: fatal error: can't find macro file s". This is different from the "groff-base" package. +--- + +## Overview + +This tool can be used on a server both via the command line and via the Galaxy bioinformatics workflow platform using the "Versioned Data" tool. Different kinds of content are suited to different archiving technologies, so the system provides a few storage system choices. -This Galaxy tool prepares a mini-pipeline consisting of **[ffpry | ffpaa | ffptxt] > [ ffpfilt | ffpcol > ffprwn] > ffpjsd > ffptree** . The last step is optional - by deselecting the "Generate Tree Phylogeny" checkbox, the tool will output a distance matrix rather than a Newick (.nhx) formatted tree file. 
+* Fasta sequences - accession ids, descriptions and their sequences - are suited to storage as one-line key-value records in a key-value store. For this kind of data we introduce **Kipper**, a low-tech, file-based database plugin (see the example command lines below). It is aimed squarely at producing complete versioned files, which covers much of the archiving problem for reference sequence databases. Consult https://github.com/Public-Health-Bioinformatics/kipper for up-to-date information on Kipper.
-Each sequence or text file has a profile containing tallies of each feature found. A feature is a string of valid characters of given length.
+* A **git** archiving system plugin is also provided for archiving software file trees; its differential (diff) compression is particularly effective for documents whose sentence-like lines are added and deleted between versions.
+
+* Super-large files that are not suited to Kipper or git can be handled by a simple "**folder**" data store that holds each version of the file(s) in a separate compressed archive.
-For nucleotide data, by default each character (ATGC) is grouped as either purine(R) or pyrmidine(Y) before being counted.
+* **Biomaj** (our reference database maintenance software) can be configured to download and store separate version files. A Biomaj plugin allows direct selection of versioned files within its "data bank" folders.
+
+The Galaxy Versioned Data tool shown below provides the interface for retrieving versions of a reference database. The tool lets you select the fasta database to retrieve, and then one or more workflows. The system generates and caches the versioned data in the data library, links it into one's history, runs the workflow(s) to produce the derivative data (a BLAST database, say), and caches that back into the data library. Future requests for that versioned data and its derivatives (keyed by workflow id and input data version ids) are served from the cache rather than regenerated, until the cache is deleted.
+
+![galaxy versioned data tool form](https://github.com/Public-Health-Bioinformatics/versioned_data/blob/master/doc/galaxy_tool_form.png)
-For amino acid data, by default each character is grouped into one of the following: (ST),(DE),(KQR),(IVLM),(FWY),C,G,A,N,H,P. Each group is represented by the first character in its series.
+## Project goals
-One other key concept is that a given feature, e.g. "TAA" is counted in forward AND reverse directions, mirroring the idea that a feature's orientation is not so important to distinguish when it comes to alignment-free comparison. The counts for "TAA" and "AAT" are merged.
+* **To enable reproducible molecular biology research:** To recreate a search result from a certain point in time, we need versioning so that search and mapping tools can run against the reference sequence database as it existed at a particular past date or version identifier. This recall can also explain the difference between what was known then and what is known now.
+
+* **To reduce hard drive space.** Some databases are too big to keep N copies around; e.g. five years of 16S, updated monthly, runs to roughly 670 MB + 668 MB + 665 MB + ... (Compressing each file individually is an option, but storing only the differences between subsequent versions is better still.)
-The labeling of the resulting counted feature items is perhaps the trickiest concept to master.
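+As a quick sketch of command-line use (the file names here are illustrative, and the options are those documented in kipper.py's own usage text later in this changeset), a Kipper store named "cpn60" could be driven like this:
+
+    kipper.py cpn60 -M fasta                          # initialize the metadata file (cpn60.md) for a fasta-type store
+    kipper.py cpn60 -i sequences.fasta -o .           # import a new version and write it back to the store
+    kipper.py cpn60 -l                                # list volumes and versions
+    kipper.py cpn60 -e -n 2 -o sequences_v2.fasta     # extract version 2 to a fasta file
+
+Within Galaxy, the same versions are retrieved through the Versioned Data tool shown above.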
Due to computational efficiency measures taken by the developers, a feature that we see on paper as "TAC" may be stored and labeled internally as "GTA", its reverse compliment. One must look for the alternative if one does not find the original. - -Also note that in amino acid sequences the stop codon "*" (or any other character that is not in the Amino acid alphabet) causes that character frame not to be counted. Also, character frames never span across fasta entries. +* **Maximize speed of archive recall.** Understanding that the archived version files can be large, we'd ideally like a versioned file to be retrieved in the time it takes to write a file of that size to disk. Caching this data and its derivatives (makeblastdb databases for example) is important. -A few tutorials: - * http://sourceforge.net/projects/ffp-phylogeny/files/Documentation/tutorial.pdf - * https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny - -------- -**Note** +* **Improve sequence archive management.** Provide an admin interface for managing regular scheduled import and log of reference sequence databases from our own and 3rd party sources like NCBI and cpndb.ca . -Taxonomy label details: If each file contains one profile, the file's name is used to label the profile. If each file contains fasta sequences to profile individually, their fasta identifiers will be used to label them. The "short labels" option will find the shortest label that uniquely identifies each profile. Either way, there are some quirks: ffpjsd clips labels to 10 characters if they are greater than 50 characters, so all labels are trimmed to 50 characters first. Also "id" is prefixed to any numeric label since some tree visualizers won't show purely numeric labels. In the accidental case where a Fasta sequence label is a duplicate of a previous one it will be prefixed by "DupLabel-". - -The command line ffpjsd can hang if one provides an l-mer length greater than the length of file content. One must identify its process id ("ps aux | grep ffpjsd") and kill it ("kill [process id]"). - -Finally, it is possible for the ffptree program to generate a tree where some of the branch distances are negative. See https://www.biostars.org/p/45597/ +* Integrate database versioning into the Galaxy workflow management software without adding a lot of complexity. -------- -**References** - -The development of the ffp-phylogeny command line software should be attributed to: - -Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 2009;106(8):2677-2682. doi:10.1073/pnas.0813249106. - +* A bonus would be to enable the efficient sharing of versioned data between computers/servers. diff -r d31a1bd74e63 -r 5c5027485f7d __init__.py diff -r d31a1bd74e63 -r 5c5027485f7d bccdc_macros.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/bccdc_macros.xml Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,27 @@ + + + + + + + + + + + + + + + + @BINARY@ + + + + + + + + REFERENCES. + + + diff -r d31a1bd74e63 -r 5c5027485f7d data_store_utils.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_store_utils.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,26 @@ + +def version_cache_setup(dataset_id, data_file_cache_folder, cacheable_dataset): + """ UNUSED: Idea was to enable caching of workflow products outside of galaxy for use by others. + CONSIDER METACODE. NOT INTEGRATED, NOT TESTED. 
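+    Note: as written, this metacode also assumes module-level imports of os, shutil and time,
+    plus an API object bound to the name `gi` (presumably a bioblend GalaxyInstance, given the
+    gi.datasets.download_dataset call below); none of these are declared in this file.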
+ """ + data_file_cache_name = os.path.join(data_file_cache_folder, dataset_id ) #'blastdb.txt' + if os.path.isfile(data_file_cache_name): + pass + else: + if os.path.isdir(data_file_cache_folder): + shutil.rmtree(data_file_cache_folder) + os.makedirs(data_file_cache_folder) + # Default filename=false means we're supplying the filename. + gi.datasets.download_dataset(dataset_id, file_path=data_file_cache_name, use_default_filename=False, wait_for_completion=True) # , maxwait=12000) is a default of 3 hours + + # Generically, any dataset might have subfolders - to check we have to + # see if galaxy dataset file path has contents at _files suffix. + # Find dataset_id in version retrieval history datasets, and get its folder path, and copy _files over... + galaxy_dataset_folder = cacheable_dataset['file_name'][0:-4] + '_files' + time.sleep(2) + if os.path.isdir(galaxy_dataset_folder) \ + and not os.path.isdir(data_file_cache_folder + '/files/'): + print 'Copying ' + galaxy_dataset_folder + ' to ' + data_file_cache_folder + # Copy program makes target folder. + shutil.copytree(galaxy_dataset_folder, data_file_cache_folder + '/files/') # , symlinks=False, ignore=None + diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/__init__.py diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/fasta_format.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_stores/fasta_format.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,195 @@ +#!/usr/bin/python +# Simple comparison and conversion tools for big fasta data + +import sys, os, optparse + +VERSION = "1.0.0" + +class MyParser(optparse.OptionParser): + """ + Provides a better class for displaying formatted help info. + From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output. + """ + def format_epilog(self, formatter): + return self.epilog + + +def split_len(seq, length): + return [seq[i:i+length] for i in range(0, len(seq), length)] + + +def check_file_path(file_path, message = "File "): + + path = os.path.normpath(file_path) + # make sure any relative paths are converted to absolute ones + if not os.path.isdir(os.path.dirname(path)) or not os.path.isfile(path): + # Not an absolute path, so try default folder where script was called: + path = os.path.normpath(os.path.join(os.getcwd(), path) ) + if not os.path.isfile(path): + print message + "[" + file_path + "] doesn't exist!" + sys.exit(1) + + return path + + +class FastaFormat(object): + + def __main__(self): + + options, args = self.get_command_line() + + if options.code_version: + print VERSION + return VERSION + + if len(args) > 0: + file_a = check_file_path(args[0]) + else: file_a = False + + if len(args) > 1: + file_b = check_file_path(args[1]) + else: file_b = False + + if options.to_fasta == True: + # Transform from key-value file to regular fasta format: 1 line for identifier and description(s), remaining 80 character lines for fasta data. + + with sys.stdout as outputFile : + for line in open(file_a,'r') if file_a else sys.stdin: + line_data = line.rsplit('\t',1) + if len(line_data) > 1: + outputFile.write(line_data[0] + '\n' + '\n'.join(split_len(line_data[1],80)) ) + else: + # Fasta one-liner didn't have any sequence data + outputFile.write(line_data[0]) + + #outputFile.close() #Otherwise terminal never looks like it closes? + + + elif options.to_keyvalue == True: + # Transform from fasta format to key-value format: + # Separates sequence lines are merged and separated from id/description line by a tab. 
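+            # Illustration (hypothetical record): a fasta entry such as
+            #   >XY123456.1 Example organism 16S ribosomal RNA
+            #   ACGTACGTACGT
+            #   TTGACGTACGTA
+            # is emitted as the single tab-delimited line
+            #   >XY123456.1 Example organism 16S ribosomal RNA<TAB>ACGTACGTACGTTTGACGTACGTA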
+ with sys.stdout as outputFile: + start = True + for line in open(file_a,'r') if file_a else sys.stdin: + if line[0] == ">": + if start == False: + outputFile.write('\n') + else: + start = False + outputFile.write(line.strip() + '\t') + + else: + outputFile.write(line.strip()) + + + elif options.compare == True: + + if len(args) < 2: + print "Error: Need two fasta file paths to compare" + sys.exit(1) + + file_a = open(file_a,'r') + file_b = open(file_b,'r') + + p = 3 + count_a = 0 + count_b = 0 + sample_length = 50 + + while True: + + if p&1: + a = file_a.readline() + count_a += 1 + + if p&2: + b = file_b.readline() + count_b += 1 + + if not a or not b: # blank line still has "cr\lf" in it so doesn't match here + print "EOF" + break + + a = a.strip() + b = b.strip() + + if a == b: + p = 3 + + elif a < b: + sys.stdout.write('f1 ln %s: -%s\nvs. %s \n' % (count_a, a[0:sample_length] , b[0:sample_length])) + p = 1 + + else: + sys.stdout.write('f2 ln %s: +%s\nvs. %s \n' % (count_b, b[0:sample_length] , a[0:sample_length])) + p = 2 + + + if count_a % 1000000 == 0: + print "At line %s:" % count_a + + # For testing: + #if count_a > 50: + # + # print "Quick exit at line 500" + # sys.exit(1) + + + for line in file_a.readlines(): + count_a += 1 + sys.stdout.write('f1 ln %s: -%s\nvs. %s \n' % (count_a, line[0:sample_length] )) + + for line in file_b.readlines(): + count_b += 1 + sys.stdout.write('f2 ln %s: +%s\nvs. %s \n' % (count_b, line[0:sample_length] )) + + + + + def get_command_line(self): + """ + *************************** Parse Command Line ***************************** + + """ + parser = MyParser( + description = 'Tool for comparing two fasta files, or transforming fasta data to single-line key-value format, or visa versa.', + usage = 'fasta_format.py [options]', + epilog=""" + Note: This tool uses stdin and stdout for transforming fasta data. 
+ + Convert from key-value data to fasta format data: + fasta_format.py [file] -f --fasta + cat [file] | fasta_format.py -f --fasta + + Convert from fasta format data to key-value data: + fasta_format.py [file] -k --keyvalue + cat [file] | fasta_format.py -k --keyvalue + + Compare two fasta format files: + fasta_format.py [file1] [file2] -c --compare + + Return version of this code: + fasta_format.py -v --version + +""") + + parser.add_option('-c', '--compare', dest='compare', default=False, action='store_true', + help='Compare two fasta files') + + parser.add_option('-f', '--fasta', dest='to_fasta', default=False, action='store_true', + help='Transform key-value file to fasta file format') + + parser.add_option('-k', '--keyvalue', dest='to_keyvalue', default=False, action='store_true', + help='Transform fasta file format to key-value format') + + parser.add_option('-v', '--version', dest='code_version', default=False, action='store_true', + help='Return version of this code.') + + return parser.parse_args() + + +if __name__ == '__main__': + + fasta_format = FastaFormat() + fasta_format.__main__() + diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/kipper.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_stores/kipper.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,1086 @@ +#!/usr/bin/python +# -*- coding: utf-8 -*- + +import subprocess +import datetime +import dateutil.parser as parser2 +import calendar +import optparse +import re +import os +import sys +from shutil import copy +import tempfile +import json +import glob +import gzip + + + + +CODE_VERSION = '1.0.0' +REGEX_NATURAL_SORT = re.compile('([0-9]+)') +KEYDB_LIST = 1 +KEYDB_EXTRACT = 2 +KEYDB_REVERT = 3 +KEYDB_IMPORT = 4 + +class MyParser(optparse.OptionParser): + """ + Provides a better class for displaying formatted help info. + From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output. + """ + def format_epilog(self, formatter): + return self.epilog + +def stop_err( msg ): + sys.stderr.write("%s\n" % msg) + sys.exit(1) + +class Kipper(object): + + + def __init__(self): + # Provide defaults + self.db_master_file_name = None + self.db_master_file_path = None + self.metadata_file_path = None + self.db_import_file_path = None + self.output_file = None # By default, printed to stdout + self.volume_id = None + self.version_id = None # Note, this is natural #, starts from 1; + self.metadata = None + self.options = None + self.compression = '' + + _nowabout = datetime.datetime.utcnow() + self.dateTime = long(_nowabout.strftime("%s")) + + self.delim = "\t" + self.nl = "\n" + + + def __main__(self): + """ + Handles all command line options for creating kipper archives, and extracting or reverting to a version. + """ + options, args = self.get_command_line() + self.options = options + + if options.code_version: + print CODE_VERSION + return CODE_VERSION + + # *********************** Get Master kipper file *********************** + if not len(args): + stop_err('A Kipper database file name needs to be included as first parameter!') + + self.db_master_file_name = args[0] #accepts relative path with file name + + self.db_master_file_path = self.check_folder(self.db_master_file_name, "Kipper database file") + # db_master_file_path is used from now on; db_master_file_name is used just for metadata labeling. + # Adjust it to remove any relative path component. 
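+        # e.g. an illustrative first argument of 'databases/cpn60' still resolves to a path for
+        # file access, but is recorded simply as 'cpn60' in the metadata labels.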
+ self.db_master_file_name = os.path.basename(self.db_master_file_name) + + if os.path.isdir(self.db_master_file_path): + stop_err('Error: Kipper data file "%s" is actually a folder!' % (self.db_master_file_path) ) + + self.metadata_file_path = self.db_master_file_path + '.md' + + # Returns path but makes sure its folder is real. Must come before get_metadata() + self.output_file = self.check_folder(options.db_output_file_path) + + + # ************************* Get Metadata ****************************** + if options.initialize: + if options.compression: + self.compression = options.compression + + self.set_metadata(type=options.initialize, compression=self.compression) + + self.get_metadata(options); + + self.check_date_input(options) + + if options.version_id or (options.extract and options.version_index): + if options.version_index: + vol_ver = self.version_lookup(options.version_index) + + else: + # Note version_id info overrides any date input above. + vol_ver = self.get_version(options.version_id) + + if not vol_ver: + stop_err("Error: Given version number or name does not exist in this database") + + (volume, version) = vol_ver + self.volume_id = volume['id'] + self.version_id = version['id'] + self.dateTime = float(version['created']) + else: + # Use latest version by default + if not self.version_id and len(self.metadata['volumes'][-1]['versions']) > 0: + self.volume_id = self.metadata['volumes'][-1]['id'] + self.version_id = self.metadata['volumes'][-1]['versions'][-1]['id'] + + # ************************** Action triggers ************************** + + if options.volume == True: + # Add a new volume to the metadata + self.metadata_create_volume() + self.write_metadata(self.metadata) + + if options.db_import_file_path != None: + # Any time an import file is specified, this is the only action: + self.try_import_file(options) + return + + if options.metadata == True: + # Writes metadata to disk or stdout + self.write_metadata2(self.metadata) + return + + if options.extract == True: + # Defaults to pulling latest version + if not (self.version_id): + stop_err('Error: Please supply a version id (-n [number]) or date (-d [date]) to extract.') + + if self.output_file and os.path.isdir(self.output_file): + # A general output file name for the data store as a whole + output_name = self.metadata['file_name'] + if output_name == '': + # Get output file name from version's original import file_name + output_name = self.metadata['volumes'][self.volume_id-1]['versions'][self.version_id-1]['file_name'] + # But remove the .gz suffix if it is there (refactor later). + if output_name[-3:] == '.gz': + output_name = output_name[0:-3] + self.output_file = os.path.join(self.output_file, output_name) + + self.db_scan_action(KEYDB_EXTRACT) + return + + if options.revert == True: + if not (options.version_id or options.dateTime or options.unixTime): + stop_err('Error: Please supply a version id (-n [number]) or date (-d [date]) to revert to.') + + # Send database back to given revision + if self.output_file and self.output_file == os.path.dirname(self.db_master_file_path): + self.output_file = self.get_db_path() + self.db_scan_action(KEYDB_REVERT) + return + + # Default to list datastore versions + self.get_list() + + + def get_db_path(self, volume_id = None): + #Note: metadata must be established before this method is called. 
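+        # e.g. for an illustrative master file 'cpn60', volume 2 and '.gz' compression this
+        # returns 'cpn60_2.gz'; with no compression it is simply 'cpn60_2'.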
+ if volume_id is None: volume_id = self.volume_id + return self.db_master_file_path + '_' + str(volume_id) + self.metadata['compression'] + + + def get_temp_output_file(self, action = None, path=None): + # Returns write handle (+name) of temp file. Returns gzip interface if compression is on. + if path == None: + path = self.output_file + + temp = tempfile.NamedTemporaryFile(mode='w+t',delete=False, dir=os.path.dirname(path) ) + + # If compression is called for, then we have to switch to gzip handler on the temp name: + if action in [KEYDB_REVERT, KEYDB_IMPORT] and self.metadata['compression'] == '.gz': + temp.close() + temp = myGzipFile(temp.name, 'wb') + + return temp + + + def get_list(self): + volumes = self.metadata['volumes'] + for ptr in range(0, len(volumes)): + volume = volumes[ptr] + if ptr < len(volumes)-1: + ceiling = str(volumes[ptr+1]['floor_id'] - 1) + else: + ceiling = '' + print "Volume " + str(ptr+1) + ", Versions " + str(volume['floor_id']) + "-" + ceiling + + for version in volume['versions']: + print str(version['id']) + ": " + self.dateISOFormat(float(version['created'])) + '_v' + version['name'] + + + def set_metadata(self, type='text', compression=''): + """ + Request to initialize metadata file + Output metadata to stdio or to -o output file by way of temp file. + If one doesn't include -o, then output goes to stdio; + If one includes only -o, then output overwrites .md file. + If one includes -o [filename] output overwrites [filename] + + Algorithm processes each line as it comes in database. This means there + is no significance to the version_ids ordering; earlier items in list can + in fact be later versions of db. So must resort and re-assign ids in end. + @param type string text or fasta etc. + """ + if os.path.isfile(self.metadata_file_path): + stop_err('Error: Metadata file "%s" exists. You must remove it before generating a new one.' % (self.metadata_file_path) ) + + self.metadata_create(type, compression) + + volumes = glob.glob(self.db_master_file_path + '_[0-9]*') + + volumes.sort(key=lambda x: natural_sort_key(x)) + for volume in volumes: + # Note: scanned volumes must be consecutive from 1. No error detection yet. 
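+            # Each data row in a volume file has the layout (as parsed by the split() below):
+            #   created_version_id <TAB> deleted_version_id <TAB> key <TAB> rest of line
+            # where an empty deleted_version_id means the key is still present in the latest version.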
+ self.metadata_create_volume(False) + versions = self.metadata['volumes'][-1]['versions'] + import_modified = os.path.getmtime(volume) + dbReader = bigFileReader(volume) + version_ids = [] + db_key_value = dbReader.read() + old_key = '' + while db_key_value: + + (created_vid, deleted_vid, db_key, restofline) = db_key_value.split(self.delim, 3) + version = versions[self.version_dict_lookup(version_ids, long(created_vid), import_modified)] + version['rows'] +=1 + if old_key != db_key: + version['keys'] +=1 + old_key = db_key + + version['inserts'] += 1 + if deleted_vid: + version = versions[self.version_dict_lookup(version_ids, long(deleted_vid), import_modified)] + version['deletes'] += 1 + + db_key_value = dbReader.read() + + # Reorder, and reassign numeric version ids: + versions.sort(key=lambda x: x['id']) + for ptr, version in enumerate(versions): + version['id'] = ptr+1 + + # If first master db volume doesn't exist, then this is an initialization situation + if len(volumes) == 0: + self.metadata_create_volume() + self.create_volume_file() + + with open(self.metadata_file_path,'w') as metadata_handle: + metadata_handle.write(json.dumps(self.metadata, sort_keys=True, indent=4, separators=(',', ': '))) + + return True + + + def get_metadata(self, options): + """ + Read in json metadata from file, and set file processor [fasta|text] engine accordingly. + """ + + if not os.path.isfile(self.metadata_file_path): + #stop_err('Error: Metadata file "%s" does not exist. You must regenerate it with the -m option before performing other actions.' % (self.metadata_file_path) ) + stop_err('Error: Unable to locate the "%s" metadata file. It should accompany the "%s" file. Use the -M parameter to initialize or regenerate the basic file.' % (self.metadata_file_path, self.db_master_file_name) ) + + with open(self.metadata_file_path,'r') as metadata_handle: + self.metadata = json.load(metadata_handle) + + # ******************* Select Kipper Pre/Post Processor ********************** + # FUTURE: More processor options here - including custom ones referenced in metadata + if self.metadata['type'] == 'fasta': + self.processor = VDBFastaProcessor() # for fasta sequence databases + else: + self.processor = VDBProcessor() # default text + + # Handle any JSON metadata defaults here for items that aren't present in previous databases. + if not 'compression' in self.metadata: + self.metadata['compression'] = '' + + + def write_metadata(self, content): + """ + Called when data store changes occur (revert and import). + If they are going to stdout then don't stream metadata there too. + """ + if self.output_file: self.write_metadata2(content) + + + def write_metadata2(self,content): + + with (open(self.metadata_file_path,'w') if self.output_file else sys.stdout) as metadata_handle: + metadata_handle.write(json.dumps(content, sort_keys=True, indent=4, separators=(',', ': '))) + + + def metadata_create(self, type, compression, floor_id=1): + """ + Initial metadata structure + """ + file_name = self.db_master_file_name.rsplit('.',1) + self.metadata = { + 'version': CODE_VERSION, + 'name': self.db_master_file_name, + 'db_file_name': self.db_master_file_name, + # A guess about what best base file name would be to write versions out as + 'file_name': file_name[0] + '.' + type, + 'type': type, + 'description': '', + 'processor': '', # Processing that overrides type-matched processor. 
+ 'compression': self.compression, + 'volumes': [] + } + + + def metadata_create_volume(self, file_create = True): + # Only add a volume if previous volume has at least 1 version in it. + if len(self.metadata['volumes']) == 0 or len(self.metadata['volumes'][-1]['versions']) > 0: + id = len(self.metadata['volumes']) + 1 + volume = { + 'floor_id': self.get_last_version()+1, + 'id': id, + 'versions': [] + } + self.metadata['volumes'].append(volume) + self.volume_id = id + if file_create: + self.create_volume_file() + + + return id + + else: + stop_err("Error: Didn't create a new volume because last one is empty already.") + + + def create_volume_file(self): + + if self.metadata['compression'] == '.gz': + gzip.open(self.get_db_path(), 'wb').close() + else: + open(self.get_db_path(),'w').close() + + + def metadata_create_version(self, mydate, file_name = '', file_size = 0, version_name = None): + id = self.get_last_version()+1 + if version_name == None: + version_name = str(id) + + version = { + 'id': id, + 'created': mydate, + 'name': version_name, + 'file_name': file_name, + 'file_size': file_size, + 'inserts': 0, + 'deletes': 0, + 'rows': 0, + 'keys': 0 + } + self.metadata['volumes'][-1]['versions'].append(version) + + return version + + + def get_version(self, version_id = None): + if version_id is None: + version_id = self.version_id + + for volume in self.metadata['volumes']: + for version in volume['versions']: + if version_id == version['id']: + return (volume, version) + + return False + + + def version_lookup(self, version_name): + for volume in self.metadata['volumes']: + for version in volume['versions']: + if version_name == version['name']: + return (volume, version) + + return False + + + def version_dict_lookup(self, version_ids, id, timestamp = None): + if id not in version_ids: + version_ids.append(id) + version = self.metadata_create_version(timestamp) + + return version_ids.index(id) + + + #****************** Methods Involving Scan of Master Kipper file ********************** + + def db_scan_action (self, action): + """ + #Python 2.6 needs this reopened if it was previously closed. + #sys.stdout = open("/dev/stdout", "w") + """ + dbReader = bigFileReader(self.get_db_path()) + # Setup temp file: + if self.output_file: + temp_file = self.get_temp_output_file(action=action) + + # Use temporary file so that db_output_file_path switches to new content only when complete + with (temp_file if self.output_file else sys.stdout) as output: + db_key_value = dbReader.read() + + while db_key_value: + if action == KEYDB_EXTRACT: + okLines = self.version_extract(db_key_value) + + elif action == KEYDB_REVERT: + okLines = self.version_revert(db_key_value) + + if okLines: + output.writelines(okLines) + + db_key_value = dbReader.read() + + # Issue: metadata lock while quick update with output_file??? + if self.output_file: + if action == KEYDB_EXTRACT: + self.processor.postprocess_file(temp_file.name) + + # Is there a case where we fail to get to this point? + os.rename(temp_file.name, self.output_file) + + if action == KEYDB_REVERT: + # When reverting, clear all volumes having versions > self.version_id + # Takes out volume structure too. + volumes = self.metadata['volumes'] + for volptr in range(len(volumes)-1, -1, -1): + volume = volumes[volptr] + if volume['floor_id'] > self.version_id: #TO REVERT IS TO KILL ALL LATER VOLUMES. 
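+                    # This whole volume starts after the version being reverted to, so its
+                    # data file and all of its version records are dropped.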
+ os.remove(self.get_db_path(volume['id'])) + versions = volume['versions'] + for verptr in range(len(versions)-1, -1, -1): + if versions[verptr]['id'] > self.version_id: + popped = versions.pop(verptr) + if len(versions) == 0 and volptr > 0: + volumes.pop(volptr) + + self.write_metadata(self.metadata) + + + def db_scan_line(self, db_key_value): + """ + FUTURE: transact_code will signal how key/value should be interpreted, to + allow for differential change storage from previous entries. + """ + # (created_vid, deleted_vid, transact_code, restofline) = db_key_value.split(self.delim,3) + (created_vid, deleted_vid, restofline) = db_key_value.split(self.delim,2) + if deleted_vid: deleted_vid = long(deleted_vid) + return (long(created_vid), deleted_vid, restofline) + + + def version_extract(self, db_key_value): + (created_vid, deleted_vid, restofline) = self.db_scan_line(db_key_value) + + if created_vid <= self.version_id and (not deleted_vid or deleted_vid > self.version_id): + return self.processor.postprocess_line(restofline) + + return False + + + def version_revert(self, db_key_value): + """ + Reverting database here. + """ + (created_vid, deleted_vid, restofline) = self.db_scan_line(db_key_value) + + if created_vid <= self.version_id: + if (not deleted_vid) or deleted_vid <= self.version_id: + return [str(created_vid) + self.delim + str(deleted_vid) + self.delim + restofline] + else: + return [str(created_vid) + self.delim + self.delim + restofline] + return False + + + def check_date_input(self, options): + """ + """ + if options.unixTime != None: + try: + _userTime = float(options.unixTime) + # if it is not a float, triggers exception + except ValueError: + stop_err("Given Unix time could not be parsed [" + options.unixTime + "]. Format should be [integer]") + + elif options.dateTime != None: + + try: + _userTime = parse_date(options.dateTime) + + except ValueError: + stop_err("Given date could not be parsed [" + options.dateTime + "]. Format should include at least the year, and any of the other more granular parts, in order: YYYY/MM/DD [H:M:S AM/PM]") + + else: + return False + + _dtobject = datetime.datetime.fromtimestamp(float(_userTime)) # + self.dateTime = long(_dtobject.strftime("%s")) + + + # Now see if we can set version_id by it. We look for version_id that has created <= self.dateTime + for volume in self.metadata['volumes']: + for version in volume['versions']: + if version['created'] <= self.dateTime: + self.version_id = version['id'] + self.volume_id = volume['id'] + else: + break + + return True + + + def check_folder(self, file_path, message = "Output directory for "): + """ + Ensures file folder path for output file exists. + We don't want to create output in a mistaken location. 
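+        For example, '-o .' resolves to the current working directory, while an illustrative
+        relative path such as '-o results/out.fasta' is accepted only if its folder already exists.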
+ """ + if file_path != None: + + path = os.path.normpath(file_path) + if not os.path.isdir(os.path.dirname(path)): + # Not an absolute path, so try default folder where script launched from: + path = os.path.normpath(os.path.join(os.getcwd(), path) ) + if not os.path.isdir(os.path.dirname(path)): + stop_err(message + "[" + path + "] does not exist!") + + return path + return None + + + def check_file_path(self, file, message = "File "): + + path = os.path.normpath(file) + # make sure any relative paths are converted to absolute ones + if not os.path.isdir(os.path.dirname(path)) or not os.path.isfile(path): + # Not an absolute path, so try default folder where script was called: + path = os.path.normpath(os.path.join(os.getcwd(),path) ) + if not os.path.isfile(path): + stop_err(message + "[" + path + "] doesn't exist!") + return path + + + def try_import_file(self, options): + """ + Create new version from comparison of import data file against Kipper + Note "-o ." parameter enables writing back to master database. + """ + self.db_import_file_path = self.check_file_path(options.db_import_file_path, "Import data file ") + + check_file = self.processor.preprocess_validate_file(self.db_import_file_path) + if not check_file: + stop_err("Import data file isn't sorted or composed correctly!") + + # SET version date to creation date of import file. + import_modified = os.path.getmtime(self.db_import_file_path) + + original_name = os.path.basename(self.db_import_file_path) + # creates a temporary file, which has conversion into 1 line key-value records + temp = self.processor.preprocess_file(self.db_import_file_path) + if (temp): + + self.db_import_file_path = temp.name + + self.import_file(original_name, import_modified, options.version_index) + + os.remove(temp.name) + + + def import_file(self, file_name, import_modified, version_index = None): + """ + Imports from an import file (or temp file if transformation done above) to + temp Kipper version which is copied over to main database on completion. + + Import algorithm only works if the import file is already sorted in the same way as the Kipper database file + + @uses self.db_import_file_path string A file full of one line key[tab]value records. + @uses self.output_file string A file to save results in. If empty, then stdio. + + @uses dateTime string Date time to mark created/deleted records by. + @puses delim char Separator between key/value pairs.ake it the function. + + @param file_name name of file being imported. This is stored in version record so that output file will be the same. + """ + delim = self.delim + + + file_size = os.path.getsize(self.db_import_file_path) + if version_index == None: + version_index = str(self.get_last_version()+1) + + self.volume_id = self.metadata['volumes'][-1]['id'] #For get_db_path() call below. + + if self.output_file: + temp_file = self.get_temp_output_file(action=KEYDB_IMPORT, path=self.get_db_path()) + + # We want to update database here when output file is db itself. 
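+            # The import that follows is a sorted merge: the current volume file and the
+            # (pre-sorted) import file are read in parallel in key order. Keys found only in the
+            # import file are written as inserts for this version, keys absent from the import
+            # file are marked deleted with this version id, and a changed value is recorded as a
+            # delete of the old row plus an insert of the new one.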
+ if os.path.isdir(self.output_file): + self.output_file = self.get_db_path() + + version = self.metadata_create_version(import_modified, file_name, file_size, version_index) + version_id = str(version['id']) + + with (temp_file if self.output_file else sys.stdout) as outputFile : + dbReader = bigFileReader(self.get_db_path()) + importReader = bigFileReader(self.db_import_file_path) + old_import_key='' + + while True: + + db_key_value = dbReader.turn() + #if import_key_value + import_key_value = importReader.turn() + + # Skip empty or whitespace lines: + if import_key_value and len(import_key_value.lstrip()) == 0: + import_key_value = importReader.read() + continue + + if not db_key_value: # eof + while import_key_value: # Insert remaining import lines: + (import_key, import_value) = self.get_key_value(import_key_value) + outputFile.write(version_id + delim + delim + import_key + delim + import_value) + import_key_value = importReader.read() + version['inserts'] += 1 + version['rows'] += 1 + + if import_key != old_import_key: + version['keys'] += 1 + old_import_key = import_key + + break # Both inputs are eof, so exit + + elif not import_key_value: # db has key that import file no longer has, so mark each subsequent db line as a delete of the key (if it isn't already) + while db_key_value: + (created_vid, deleted_vid, dbKey, dbValue) = db_key_value.split(delim,3) + version['rows'] += 1 + + if deleted_vid: + outputFile.write(db_key_value) + else: + outputFile.write(created_vid + delim + version_id + delim + dbKey + delim + dbValue) + version['deletes'] += 1 + + db_key_value = dbReader.read() + break + + else: + (import_key, import_value) = self.get_key_value(import_key_value) + (created_vid, deleted_vid, dbKey, dbValue) = db_key_value.split(delim,3) + + if import_key != old_import_key: + version['keys'] += 1 + old_import_key = import_key + + # All cases below lead to writing a row ... + version['rows'] += 1 + + if import_key == dbKey: + # When the keys match, we have enough information to act on the current db_key_value content; + # therefore ensure on next pass that we read it. + dbReader.step() + + if import_value == dbValue: + outputFile.write(db_key_value) + + # All past items marked with insert will also have a delete. Step until we find one + # not marked as a delete... or a new key. + if deleted_vid: # Good to go in processing next lines in both files. + pass + else: + importReader.step() + + else: # Case where value changed - so process all db_key_values until key no longer matches. + + # Some future pass will cause import line to be written to db + # (when key mismatch occurs) as long as we dont advance it (prematurely). + if deleted_vid: + #preserve deletion record. + outputFile.write(db_key_value) + + else: + # Mark record deletion + outputFile.write(created_vid + delim + version_id + delim + dbKey + delim + dbValue) + version['deletes'] += 1 + # Then advance since new key/value means new create + + else: + # Natural sort doesn't do text sort on numeric parts, ignores capitalization. + dbKeySort = natural_sort_key(dbKey) + import_keySort = natural_sort_key(import_key) + # False if dbKey less; Means db key is no longer in sync db, + if cmp(dbKeySort, import_keySort) == -1: + + if deleted_vid: #Already marked as a delete + outputFile.write(db_key_value) + + else: # Write dbKey as a new delete + outputFile.write(created_vid + delim + version_id + delim + dbKey + delim + dbValue) + version['deletes'] += 1 + # Advance ... there could be another db_key_value for deletion too. 
+ dbReader.step() + + else: #DB key is greater, so insert import_key,import_value in db. + # Write a create record + outputFile.write(version_id + delim + delim + import_key + delim + import_value) + version['inserts'] += 1 + importReader.step() # Now compare next two candidates. + + if self.output_file: + # Kipper won't write an empty version - since this is usually a mistake. + # If user has just added new volume though, then slew of inserts will occur + # even if version is identical to tail end of previous volume version. + if version['inserts'] > 0 or version['deletes'] > 0: + #print "Temp file:" + temp_file.name + os.rename(temp_file.name, self.output_file) + self.write_metadata(self.metadata) + else: + os.remove(temp_file.name) + + + def get_last_version(self): + """ + Returns first Volume version counting from most recent. + Catch is that some volume might be empty, so have to go to previous one + """ + for ptr in range(len(self.metadata['volumes'])-1, -1, -1): + versions = self.metadata['volumes'][ptr]['versions'] + if len(versions) > 0: + return versions[-1]['id'] + + return 0 + + + # May want to move this to individual data store processor since it can be sensitive to different kinds of whitespace then. + def get_key_value(self, key_value): + # ACCEPTS SPLIT AT ANY WHITESPACE PAST KEY BY DEFAULT + kvparse = key_value.split(None,1) + #return (key_value[0:kvptr], key_value[kvptr:].lstrip()) + return (kvparse[0], kvparse[1] if len(kvparse) >1 else '') + + + def dateISOFormat(self, atimestamp): + return datetime.datetime.isoformat(datetime.datetime.fromtimestamp(atimestamp)) + + + def get_command_line(self): + """ + *************************** Parse Command Line ***************************** + + """ + parser = MyParser( + description = 'Maintains versions of a file-based database with comparison to full-copy import file updates.', + usage = 'kipper.py [kipper database file] [options]*', + epilog=""" + + All outputs go to stdout and affect no change in Kipper database unless the '-o' parameter is supplied. (The one exception to this is when the -M regenerate metadata command is provided, as described below.) Thus by default one sees what would happen if an action were taken, but must take an additional step to affect the data. + + '-o .' is a special request that leads to: + * an update of the Kipper database for --import or --revert actions + * an update of the .md file for -M --rebuild action + + As well, when -o parameter is a path, and not a specific filename, then kipper.py looks up what the appropriate output file name is according to the metadata file. + + USAGE + + Initialize metadata file and Kipper file. + kipper.py [database file] -M --rebuild [type of database:text|fasta] + + View metadata (json) file. + kipper.py [database file] -m --metadata + + Import key/value inserts/deletes based on import file. (Current date used). + kipper.py [database file] -i --import [import file] + e.g. + kipper.py cpn60 -i sequences.fasta # outputs new master database to stdout; doesn't rewrite it. + kipper.py cpn60 -i sequences.fasta -o . # rewrites cpn60 with new version added. 
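+    kipper.py cpn60 -i sequences.fasta -I '1.4' -o .   # as above, but labels the new version '1.4' (illustrative label)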
+ + Extract a version of the file based on given date/time + kipper.py [database file] -e --extract -d datetime -o [output file] + + Extract a version of the file based on given version Id + kipper.py [database file] -e --extract -n [version id] -o [output file] + + List versions of dbFile key/value pairs (by date/time) + kipper.py [database file] + kipper.py [database file] -l --list + + Have database revert to previous version. Drops future records, unmarks corresponding deletes. + kipper.py [database file] -r --revert -d datetime -o [output file] + + Return version of this code: + kipper.py -v --version + """) + + # Data/Metadata changing actions + parser.add_option('-M', '--rebuild', type='choice', dest='initialize', choices=['text','fasta'], + help='(Re)generate metadata file [name of db].md . Provide the type of db [text|fasta| etc.].') + + parser.add_option('-i', '--import', type='string', dest='db_import_file_path', + help='Import key/value inserts/deletes based on delta comparison with import file') + + parser.add_option('-e', '--extract', dest='extract', default=False, action='store_true', + help='Extract a version of the file based on given date/time') + + parser.add_option('-r', '--revert', dest='revert', default=False, action='store_true', + help='Have database revert to previous version (-d date/time required). Drops future records, unmarks corresponding deletes.') + + parser.add_option('-V', '--volume', dest='volume', default=False, action='store_true', + help='Add a new volume to the metadata. New imports will be added here.') + + # Passive actions + parser.add_option('-m', '--metadata', dest='metadata', default=False, action='store_true', + help='View metadata file [name of db].md') + + parser.add_option('-l', '--list', dest='list', default=False, action='store_true', + help='List versions of dbFile key/value pairs (by date/time)') + + parser.add_option('-c', '--compression', dest='compression', type='choice', choices=['.gz'], + help='Enable compression of database. options:[.gz]') + + # Used "v" for standard code version identifier. + parser.add_option('-v', '--version', dest='code_version', default=False, action='store_true', + help='Return version of kipper.py code.') + + parser.add_option('-o', '--output', type='string', dest='db_output_file_path', + help='Output to this file. Default is to stdio') + + parser.add_option('-I', '--index', type='string', dest='version_index', + help='Provide title (index) e.g. "1.4" of version being imported/extracted.') + + parser.add_option('-d', '--date', type='string', dest='dateTime', + help='Provide date/time for sync, extract or revert operations. Defaults to now.') + parser.add_option('-u', '--unixTime', type='int', dest='unixTime', + help='Provide Unix time (integer) for sync, extract or revert operations.') + parser.add_option('-n', '--number', type='int', dest='version_id', + help='Provide a version id to extract or revert to.') + + return parser.parse_args() + + +class VDBProcessor(object): + + delim = '\t' + nl = '\n' + + #def preprocess_line(self, line): + # return [line] + + def preprocess_file(self, file_path): + temp = tempfile.NamedTemporaryFile(mode='w+t',delete=False, dir=os.path.dirname(file_path) ) + copy (file_path, temp.name) + temp.close() + sort_a = subprocess.call(['sort','-sfV','-t\t','-k1,1', '-o',temp.name, temp.name]) + return temp #Enables temp file name to be used by caller. 
+ + + def preprocess_validate_file(self, file_path): + + # Do import file preprocessing: + # 1) Mechanism to verify if downloaded file is complete - check md5 hash? + # 4) Could test file.newlines(): returns \r, \n, \r\n if started to read file (1st line). + # 5) Could auto-uncompress .tar.gz, bz2 etc. + # Ensures "[key] [value]" entries are sorted + # "sort --check ..." returns nothing if sorted, or e.g "sort: sequences_A.fastx.sorted:12: disorder: >114 AJ009959.1 … " + + # if not subprocess.call(['sort','--check','-V',db_import_file_path]): #very fast check + # subprocess.call(['sort','-V',db_import_file_path]): + + return True + + def postprocess_line(self, line): + #Lines are placed in array so that one can map to many in output file + return [line] + + def postprocess_file(self, file_path): + return False + + def sort(self, a, b): + pass + + +class VDBFastaProcessor(VDBProcessor): + + + def preprocess_file(self, file_path): + """ + Converts input fasta data into one line tab-delimited record format, then sorts. + """ + temp = tempfile.NamedTemporaryFile(mode='w+t',delete=False, dir=os.path.dirname(file_path) ) + fileReader = bigFileReader(file_path) + line = fileReader.read() + old_line = '' + while line: + line = line.strip() + if len(line) > 0: + + if line[0] == '>': + if len(old_line): + temp.write(old_line + self.nl) + lineparse = line.split(None,1) + key = lineparse[0].strip() + if len(lineparse) > 1: + description = lineparse[1].strip().replace(self.delim, ' ') + else: + description = '' + old_line = key[1:] + self.delim + description + self.delim + + else: + old_line = old_line + line + + line = fileReader.read() + + if len(old_line)>0: + temp.write(old_line+self.nl) + + temp.close() + + # Is this a consideration for natural sort in Python vs bash sort?: + # *** WARNING *** The locale specified by the environment affects sort order. + # Set LC_ALL=C to get the traditional sort order that uses native byte values. + #-s stable; -f ignore case; V natural sort (versioning) ; -k column, -t tab delimiter + sort_a = subprocess.call(['sort', '-sfV', '-t\t', '-k1,1', '-o',temp.name, temp.name]) + + return temp #Enables temp file name to be used by caller. + + + def postprocess_line(self, line): + """ + Transform Kipper fasta 1 line format key/value back into output file line(s) - an array + + @param line string containing [accession id][TAB][description][TAB][fasta sequence] + @return string containing lines each ending with newline, except end. + """ + line_data = line.split('\t',2) + # Set up ">[accession id] [description]\n" : + fasta_header = '>' + ' '.join(line_data[0:2]) + '\n' + # Put fasta sequences back into multi-line; note trailing item has newline. + sequences= self.split_len(line_data[2],80) + if len(sequences) and sequences[-1].strip() == '': + sequences[-1] = '' + + return fasta_header + '\n'.join(sequences) + + + def split_len(self, seq, length): + return [seq[i:i+length] for i in range(0, len(seq), length)] + + +class bigFileReader(object): + """ + This provides some advantage over reading line by line, and as well has a system + for skipping/not advancing reads - it has a memory via "take_step" about whether + it should advance or not - this is used when the master database and the import + database are feeding lines into a new database. + + Interestingly, using readlines() with byte hint parameter less + than file size improves performance by at least 30% over readline(). + + FUTURE: Adjust buffer lines dynamically based on file size/lines ratio? 
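+
+    Typical read loop (a sketch; 'cpn60_1' is an illustrative volume file name):
+
+        reader = bigFileReader('cpn60_1')
+        line = reader.read()
+        while line:
+            # ... process the line ...
+            line = reader.read()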
+ """ + + def __init__(self, filename): + self.lines = [] + # This simply allows any .gz repository to be opened + # It isn't connected to the Kipper metadata['compression'] feature. + if filename[-3:] == '.gz': + self.file = gzip.open(filename,'rb') + else: + self.file = open(filename, 'rb', 1) + + self.line = False + self.take_step = True + self.buffer_size=1000 # Number of lines to read into buffer. + + + def turn(self): + """ + When accessing bigFileReader via turn mechanism, we get current line if no step; + otherwise with step we read new line. + """ + if self.take_step == True: + self.take_step = False + return self.read() + return self.line + + + def read(self): + if len(self.lines) == 0: + self.lines = self.file.readlines(self.buffer_size) + if len(self.lines) > 0: + self.line = self.lines.pop(0) + #if len(self.lines) == 0: + # self.lines = self.file.readlines(self.buffer_size) + #make sure each line doesn't include carriage return + return self.line + + return False + + + def readlines(self): + """ + Small efficiency: + A test on self.lines after readLines() call can control loop. + Bulk write of remaining buffer; ensures lines array isn't copied + but is preserved when self.lines is removed + """ + self.line = False + if len(self.lines) == 0: + self.lines = self.file.readlines(self.buffer_size) + if len(self.lines) > 0: + shallowCopy = self.lines[:] + self.lines = self.file.readlines(self.buffer_size) + return shallowCopy + return False + + + def step(self): + self.take_step = True + + + +# Enables use of with ... syntax. See https://mail.python.org/pipermail/tutor/2009-November/072959.html +class myGzipFile(gzip.GzipFile): + def __enter__(self): + if self.fileobj is None: + raise ValueError("I/O operation on closed GzipFile object") + return self + + def __exit__(self, *args): + self.close() + + +def natural_sort_key(s, _nsre = REGEX_NATURAL_SORT): + return [int(text) if text.isdigit() else text.lower() + for text in re.split(_nsre, s)] + + +def generic_linux_sort(self): + import locale + locale.setlocale(locale.LC_ALL, "C") + yourList.sort(cmp=locale.strcoll) + + +def parse_date(adate): + """ + Convert human-entered time into linux integer timestamp + + @param adate string Human entered date to parse into linux time + + @return integer Linux time equivalent or 0 if no date supplied + """ + adate = adate.strip() + if adate > '': + adateP = parser2.parse(adate, fuzzy=True) + #dateP2 = time.mktime(adateP.timetuple()) + # This handles UTC & daylight savings exactly + return calendar.timegm(adateP.timetuple()) + return 0 + + +if __name__ == '__main__': + + kipper = Kipper() + kipper.__main__() + diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/tester.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_stores/tester.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,109 @@ +#!/usr/bin/python +# -*- coding: utf-8 -*- +import optparse +import sys +import difflib +import os + + +class MyParser(optparse.OptionParser): + """ + From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output + Provides a better class for displaying formatted help info in epilog() portion of optParse; allows for carriage returns. + """ + def format_epilog(self, formatter): + return self.epilog + +def stop_err( msg ): + sys.stderr.write("%s\n" % msg) + sys.exit(1) + + +def __main__(self): + + + """ + (This is run only in context of command line.) 
+ FUTURE: ALLOW GLOB IMPORT OF ALL FILES OF GIVEN SUFFX, each to its own version, initial date taken from file date + FUTURE: ALLOW dates of versions to be adjusted. + """ + options, args = self.get_command_line() + self.options = options + + if options.test_ids: + return self.test(options.test_ids) + + +def get_command_line(self): + """ + *************************** Parse Command Line ***************************** + + """ + parser = MyParser( + description = 'Tests a program against given input/output files.', + usage = 'tester.py [program] [input files] [output files] [parameters]', + epilog=""" + + + parser.add_option('-t', '--tests', dest='test_ids', help='Enter "all" or comma-separated id(s) of tests to run.') + + return parser.parse_args() + + + +def test(self, test_ids): + # Future: read this spec from test-data folder itself? + tests = { + '1': {'input':'a1','outputs':'','options':''} + } + self.test_suite('keydb.py', test_ids, tests, '/tmp/') + + +def test_suite(self, program, test_ids, tests, output_dir): + + if test_ids == 'all': + test_ids = sorted(tests.keys()) + else: + test_ids = test_ids.split(',') + + for test_id in test_ids: + if test_id in tests: + test = tests[test_id] + test['program'] = program + test['base_dir'] = base_dir = os.path.dirname(__file__) + executable = os.path.join(base_dir,program) + if not os.path.isfile(executable): + stop_err('\n\tUnable to locate ' + executable) + # Each output file has to be prefixed with the output (usualy /tmp/) folder + test['tmp_output'] = (' ' + test['outputs']).replace(' ',' ' + output_dir) + # Note: output_dir output files don't get cleaned up after each test. Should they?! + params = '%(base_dir)s/%(program)s %(base_dir)s/test-data/%(input)s%(tmp_output)s %(options)s' % test + print("Test " + test_id + ': ' + params) + print("................") + os.system(params) + print("................") + for file in test['outputs'].split(' '): + try: + f1 = open(test['base_dir'] + '/test-data/' + file) + f2 = open(output_dir + file) + except IOError as details: + stop_err('Error in test setup: ' + str(details)+'\n') + + #n=[number of context lines + diff = difflib.context_diff(f1.readlines(), f2.readlines(), lineterm='',n=0) + # One Galaxy issue: it doesn't convert entities when user downloads file. + # BUT IT appears to when generating directly to command line? + print '\nCompare ' + file + print '\n'.join(list(diff)) + + else: + stop_err("\nExpecting one or more test ids from " + str(sorted(tests.keys()))) + + stop_err("\nTest finished.") + + +if __name__ == '__main__': + + keydb = KeyDb() + keydb.__main__() + diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_biomaj.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_stores/vdb_biomaj.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,71 @@ +#!/usr/bin/python +## ******************************* Biomaj FOLDER ********************************* +## + +import os +import time +import glob +import vdb_common +import vdb_data_stores + +class VDBBiomajDataStore(vdb_data_stores.VDBDataStore): + + versions = None + library = None + + def __init__(self, retrieval_obj, spec_file_id): + """ + Provides list of available versions where each version is a data file sitting in a Biomaj data store subfolder. e.g. 
+ + /projects2/ref_databases/biomaj/ncbi/blast/silva_rna/ silva_rna_119/flat + /projects2/ref_databases/biomaj/ncbi/genomes/Bacteria/ Bacteria_2014-07-25/flat + + """ + super(VDBBiomajDataStore, self).__init__(retrieval_obj, spec_file_id) + + self.library = retrieval_obj.library + + versions = [] + # Linked, meaning that our data source spec file pointed to some folder on the server. + # Name of EACH subfolder should be a label for each version, including date/time and version id. + + base_file_path = os.path.join(os.path.dirname(self.base_file_name),'*','flat','*') + try: + #Here we need to get subfolder listing of linked file location. + for item in glob.glob(base_file_path): + + # Only interested in files, and not ones in the symlinked /current/ subfolder + # Also, Galaxy will strip off .gz suffixes - WITHOUT UNCOMPRESSING FILES! + # So, best to prevent data store from showing .gz files in first place. + if os.path.isfile(item) and not '/current/' in item and not item[-3:] == '.gz': + #Name includes last two subfolders: /[folder]/flat/[name] + item_name = '/'.join(item.rsplit('/',3)[1:]) + # Can't count on creation date being spelled out in name + created = vdb_common.parse_date(time.ctime(os.path.getmtime(item))) + versions.append({'name':item_name, 'id':item_name, 'created': created}) + + except Exception as err: + # This is the first call to api so api url or authentication erro can happen here. + versions.append({ + 'name':'Error: Unable to get version list: ' + err.message, + 'id':'', + 'created':'' + }) + + self.versions = sorted(versions, key=lambda x: x['name'], reverse=True) + + + def get_version(self, version_name): + """ + Return server path of requested version info + + @uses library_label_path string Full hierarchic label of a library file or folder, PARENT of version id folder. + @uses base_file_name string Server absolute path to data_store spec file + @param version_name alphaneumeric string + """ + self.version_label = version_name + self.library_version_path = os.path.join(self.library_label_path, self.version_label) + + #linked to some other folder, spec is location of base_file_name + self.version_path = os.path.join(self.data_store_path, version_name) + diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_data_stores.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_stores/vdb_data_stores.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,142 @@ +import vdb_common +import vdb_retrieval # For message text + +class VDBDataStore(object): + + """ + Provides data store engine super-class with methods to list available versions, and to generate a version (potentially providing a link to cached version). Currently have options for git, folder, and kipper. + + get_data_store_gateway() method loads the appropriate data_stores/vdb_.... data store variant. + + """ + + def __init__(self, retrieval_obj, spec_file_id): + """ + Note that api is only needed for data store type = folder. + + @init self.type + @init self.base_file_name + @init self.library_label_path + @init self.data_store_path + @init self.global_retrieval_date + + @sets self.library_version_path + @sets self.version_label + @sets self.version_path + """ + + self.admin_api = retrieval_obj.admin_api + self.library_id = retrieval_obj.library_id + self.library_label_path = retrieval_obj.get_library_label_path(spec_file_id) + try: + # Issue: Error probably gets trapped but is never reported back to Galaxy from thread via form's call? + # It appears galaxy user needs more than just "r" permission on this file, oddly? 
+ spec = self.admin_api.libraries.show_folder(self.library_id, spec_file_id) + + except IOError as e: + print 'Tried to fetch library folder spec file: %s. Check permissions?"' % spec_file_id + print "I/O error({0}): {1}".format(e.errno, e.strerror) + sys.exit(1) + + self.type = retrieval_obj.test_data_store_type(spec['name']) + + #Server absolute path to data_store spec file (can be Galaxy .dat file representing Galaxy library too. + + self.base_file_name = spec['file_name'] + + # In all cases a pointer file's content (data_store_path) + # should point to a real server folder (that galaxy has permission to read). + # Exception to this is for pointer.folder, where content can be empty, + # in which case idea is to use library folder contents directly. + + with open(self.base_file_name,'r') as path_spec: + self.data_store_path = path_spec.read().strip() + if len(self.data_store_path) > 0: + # Let people forget to put a trailing slash on the folder path. + if not self.data_store_path[-1] == '/': + self.data_store_path += '/' + + # Generated on subsequent subclass call + self.library_version_path = None + self.version_label = None + self.version_path = None + + + def get_version_options(self, global_retrieval_date=0, version_name=None, selection=False): + """ + Provides list of available versions of a given archive. List is filtered by + optional global_retrieval_date or version id. For date filter, the version immediately + preceeding given datetime (includes same datetime) is returned. + All comparisons are done by version NAME not id, because underlying db id might change. + + If global_retrieval_date datetime preceeds first version, no filtering is done. + + @param global_retrieval_date long unix time date to test entry against. + @param version_name string Name of version + @param version_id string Looks like a number, or '' to pull latest id. + """ + + data = [] + + date_match = vdb_common.date_matcher(global_retrieval_date) + found=False + + for ptr, item in enumerate(self.versions): + created = float(item['created']) + item_name = item['name'] + + # Note version_id is often "None", so must come last in conjunction. + selected = (found == False) \ + and ((item_name == version_name ) or date_match.next(created)) + + if selected == True: + found = True + if selection==True: + return item_name + + # Folder type data stores should already have something resembling created date in name. + if type(self).__name__ in 'VDBFolderDataStore VDBBiomajDataStore': + item_label = item['name'] + else: + item_label = vdb_common.lightDate(created) + '_' + item['name'] + + data.append([item_label, item['name'], selected]) + + if not found and len(self.versions) > 0: + if global_retrieval_date: # Select oldest date version since no match above. + item_name = data[-1][1] + data[-1][2] = True + else: + item_name = data[0][1] + data[0][2] = True + if selection == True: + return item_name + + + # For cosmetic display: Natural sort takes care of version keys that have mixed characters/numbers + data = sorted(data, key=lambda el: vdb_common.natural_sort_key(el[0]), reverse=True) #descending + + #Always tag the first item as the most current one + if len(data) > 0: + data[0][0] = data[0][0] + ' (current)' + else: + data.append([vdb_retrieval.VDB_DATASET_NOT_AVAILABLE + ' Is pointer file content right? 
: ' + self.data_store_path,'',False]) + + """ + globalFound = False + for i in range(len(data)): + if data[i][2] == True: + globalFound = True + break + + if globalFound == False: + data[0][2] = True # And select it if no other date has been selcted + + """ + + return data + + + def get_version(self, version_name): + # All subclasses must define this. + pass diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_folder.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_stores/vdb_folder.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,90 @@ +#!/usr/bin/python +## ******************************* FILE FOLDER ********************************* +## + +import re +import os +import vdb_common +import vdb_data_stores + +class VDBFolderDataStore(vdb_data_stores.VDBDataStore): + + versions = None + library = None + + def __init__(self, retrieval_obj, spec_file_id): + """ + Provides list of available versions where each version is indicated by the + existence of a folder that contains its content. This content can be directly + in the galaxy Versioned Data folder tree, OR it can be linked to another folder + on the server. In the latter case, galaxy will treat the Versioned Data folders as caches. + View of versions filters out any folders that are used for derivative data caching. + """ + super(VDBFolderDataStore, self).__init__(retrieval_obj, spec_file_id) + + self.library = retrieval_obj.library + + versions = [] + + # If data source spec file has no content, use the library folder directly. + # Name of EACH subfolder should be a label for each version, including date/time and version id. + if self.data_store_path == '': + try: + lib_label_len = len(self.library_label_path) +1 + + for item in self.library: + # If item is under library_label_path ... + if item['name'][0:lib_label_len] == self.library_label_path + '/': + item_name = item['name'][lib_label_len:len(item['name'])] + if item_name.find('/') == -1 and item_name.find('_') != -1: + (item_date, item_version) = item_name.split('_',1) + created = vdb_common.parse_date(item_date) + versions.append({'name':item_name, 'id':item_name, 'created': created}) + + except Exception as err: + # This is the first call to api so api url or authentication erro can happen here. + versions.append({ + 'name':'Software Error: Unable to get version list: ' + err.message, + 'id':'', + 'created':'' + }) + + else: + + base_file_path = self.data_store_path + #base_file_path = os.path.dirname(self.base_file_name) + #Here we need to get directory listing of linked file location. + for item_name in os.listdir(base_file_path): # Includes files and folders + # Only interested in folders + if os.path.isdir( os.path.join(base_file_path, item_name)) and item_name.find('_') != -1: + (item_date, item_version) = item_name.split('_',1) + created = vdb_common.parse_date(item_date) + versions.append({'name':item_name, 'id':item_name, 'created': created}) + + + self.versions = sorted(versions, key=lambda x: x['name'], reverse=True) + + def get_version(self, version_name): + """ + Return server path of requested version info - BUT ONLY IF IT IS LINKED. + IF NOT LINKED, returns None for self.version_path + + QUESTION: DOES GALAXY AUTOMATICALLY HANDLE tar.gz/zip decompression? + + @uses library_label_path string Full hierarchic label of a library file or folder, PARENT of version id folder. 
+ @uses base_file_name string Server absolute path to data_store spec file + + @param version_name alphaneumeric string (git tag) + """ + self.version_label = version_name + self.library_version_path = os.path.join(self.library_label_path, self.version_label) + + if self.data_store_path == '': + # In this case version content is held in library directly; + self.version_path = self.base_file_name + + else: + + #linked to some other folder, spec is location of base_file_name + self.version_path = os.path.join(self.data_store_path, version_name) + diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_git.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_stores/vdb_git.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,106 @@ +#!/usr/bin/python +## ********************************* GIT ARCHIVE ******************************* +## + +import os, sys +import vdb_common +import vdb_data_stores +import subprocess +import datetime + +class VDBGitDataStore(vdb_data_stores.VDBDataStore): + + def __init__(self, retrieval_obj, spec_file_id): + """ + Archive is expected to be in "master/" subfolder of data_store_path on server. + """ + super(VDBGitDataStore, self).__init__(retrieval_obj, spec_file_id) + self.command = 'git' + gitPath = self.data_store_path + 'master/' + if not os.path.isdir(os.path.join(gitPath,'.git') ): + print "Error: Unable to locate git archive file: " + gitPath + sys.exit(1) + + command = [self.command, '--git-dir=' + gitPath + '.git', '--work-tree=' + gitPath, 'for-each-ref','--sort=-*committerdate', "--format=%(*committerdate:raw) %(refname)"] + # to list just 1 id: command.append('refs/tags/' + version_id) + # git --git-dir=/.../NCBI_16S/master/.git --work-tree=/.../NCBI_16S/master/ tag + items, error = subprocess.Popen(command,stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate() + items = items.split("\n") #Loop through list of tags + versions = [] + for ptr, item in enumerate(items): + + # Ignore master branch name; time is included as separate field in all other cases + if item.strip().find(" ") >0: + (vtime, voffset, name) = item.split(" ") + created = vdb_common.get_unix_time(vtime, voffset) + item_name = name[10:] #strip 'refs/tags/' part off + versions.append({'name':item_name, 'id':item_name, 'created': created}) + + self.versions = versions + + + def get_version(self, version_name): + """ + Returns server folder path to version folder containing git files for a given version_id (git tag) + + FUTURE: TO AVOID USE CONFLICTS FOR VERSION RETRIEVAL, GIT CLONE INTO TEMP FOLDER + with -s / --shared and -n / --no-checkout (to avoid head build) THEN CHECKOUT version + ... + REMOVE CLONE GIT REPO + + @param galaxy_instance object A Bioblend galaxy instance + @param library_id string Identifier for a galaxy data library + @param library_label_path string Full hierarchic label of a library file or folder, PARENT of version id folder. + + @param base_folder_id string a library folder id under which version files should exist + @param version_id alphaneumeric string (git tag) + + """ + version = self.get_metadata_version(version_name) + + if not version: + print 'Error: Galaxy was not able to find the given version id in the %s data store.' 
% self.version_path + sys.exit( 1 ) + + version_name = version['name'] + self.version_path = os.path.join(self.data_store_path, version_name) + self.version_label = vdb_common.lightDate(version['created']) + '_v' + version_name + self.library_version_path = os.path.join(self.library_label_path, self.version_label) + + # If Data Library Versioned Data folder doesn't exist for this version, then create it + if not os.path.exists(self.version_path): + try: + os.mkdir(self.version_path) + except: + print 'Error: Galaxy was not able to create data store folder "%s". Check permissions?' % self.version_path + sys.exit( 1 ) + + if os.listdir(self.version_path) == []: + + git_path = self.data_store_path + 'master/' + # RETRIEVE LIST OF FILES FOR GIVEN GIT TAG (using "ls-tree". + # It can happen independently of git checkout) + + command = [self.command, '--git-dir=%s/.git' % git_path, 'ls-tree','--name-only','-r', version_name] + items, error = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate() + git_files = items.split('\n') + + # PERFORM GIT CHECKOUT + command = [self.command, '--git-dir=%s/.git' % git_path, '--work-tree=%s' % git_path, 'checkout', version_name] + results, error = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate() + + vdb_common.move_files(git_path, self.version_path, git_files) + + + + def get_metadata_version(self, version_name=''): + if version_name == '': + return self.versions[0] + + for version in self.versions: + if str(version['name']) == version_name: + return version + + return False + + diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_kipper.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_stores/vdb_kipper.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,102 @@ +#!/usr/bin/python +## ********************************* Kipper ************************************* +## + +import vdb_common +import vdb_data_stores +import subprocess +import json +import glob +import os +import sys + +class VDBKipperDataStore(vdb_data_stores.VDBDataStore): + + metadata = None + versions = None + command = None + + def __init__(self, retrieval_obj, spec_file_id): + """ + Metadata (version list) is expected in "master/" subfolder of data_store_path on server. + + Future: allow for more than one database / .md file in a folder? + """ + super(VDBKipperDataStore, self).__init__(retrieval_obj, spec_file_id) + # Ensure we're working with this tool's version of Kipper. + self.command = os.path.join(os.path.dirname(sys._getframe().f_code.co_filename),'kipper.py') + self.versions = [] + + metadata_path = os.path.join(self.data_store_path,'master','*.md') + metadata_files = glob.glob(metadata_path) + + if len(metadata_files) == 0: + return # Handled by empty list error in versioned_data_form.py + + metadata_file = metadata_files[0] + + with open(metadata_file,'r') as metadata_handle: + self.metadata = json.load(metadata_handle) + for volume in self.metadata['volumes']: + self.versions.extend(volume['versions']) + + if len(self.versions) == 0: + print "Error: Unable to locate metadata file: " + metadata_path + sys.exit(1) + + self.versions = sorted(self.versions, key=lambda x: x['id'], reverse=True) + + + def get_version(self, version_name): + """ + Trigger populating of the appropriate version folder. + Returns path information to version folder, and its appropriate library label. 
+ + @param version_id string + """ + + version = self.get_metadata_version(version_name) + + if not version: + print 'Error: Galaxy was not able to find the given version id in the %s data store.' % self.version_path + sys.exit( 1 ) + + version_name = version['name'] + self.version_path = os.path.join(self.data_store_path, version_name) + self.version_label = vdb_common.lightDate(version['created']) + '_v' + version_name + + print self.library_label_path + print self.version_label + + self.library_version_path = os.path.join(self.library_label_path, self.version_label) + + if not os.path.exists(self.version_path): + try: + os.mkdir(self.version_path) + except: + print 'Error: Galaxy was not able to create data store folder "%s". Check permissions?' % self.version_path + sys.exit( 1 ) + + # Generate cache if folder is empty (can take a while): + if os.listdir(self.version_path) == []: + + db_file = os.path.join(self.data_store_path, 'master', self.metadata["db_file_name"] ) + command = [self.command, db_file, '-e','-I', version_name, '-o', self.version_path ] + + try: + result = subprocess.call(command); + except: + print 'Error: Galaxy was not able to run the kipper.py program successfully for this job: "%s". Check permissions?' % self.version_path + sys.exit( 1 ) + + + def get_metadata_version(self, version_name=''): + if version_name == '': + return self.versions[0] + + for version in self.versions: + if str(version['name']) == version_name: + return version + + return False + diff -r d31a1bd74e63 -r 5c5027485f7d doc/README.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/README.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,16 @@ +## Documentation Index +0. [Overview](../README.md) +1. [Setup for Admins](setup.md) + 1. [Galaxy tool installation](galaxy_tool_install.md) + 2. [Server data stores](data_stores.md) + 4. [Galaxy "Versioned Data" library setup](galaxy_library.md) + 5. [Workflow configuration](workflows.md) + 6. [Permissions, security, and maintenance](maintenance.md) + 7. [Problem solving](problem_solving.md) +2. [Using the Galaxy Versioned data tool](galaxy_tool.md) +3. [System Design](design.md) +4. [Background Research](background.md) +5. [Server data store and galaxy library organization](data_store_org.md) +6. [Data Provenance and Reproducibility](data_provenance.md) +7. [Caching System](caching.md) + diff -r d31a1bd74e63 -r 5c5027485f7d doc/background.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/background.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,44 @@ +# Background Research + +### Git Archiving + +Git was investigated as a generic solution for versioning textual files but in a number of cases its performance was detrimental. Initially the git archive option seemed to fail on fasta files of any size. In tests git was replacing one version with another by a diff that contained deletes of every original line followed by inserts of every line in new file. + +By reformatting the incoming fasta file so that each line held one fasta record >[sequence id] [description] [sequence], git's diff algorithm seemed to regain its efficiency. The traditional format: + +``` +>gi|324234235|whatever a description +ATGCCGTAAGATTC ... +ATGCCGTAAGATTC ... +ATGCCGTAAGATTC ... +>gi|25345345|another entry +etc. +``` + +was converted to a 1 line format: + +``` +>gi|324234235|whatever a description [tab] ATGCCGTAAGATTCATGCCGTAAGATTCATGCCGTAAGATTC ... +>gi|25345345|another entry ... +``` + +This approach seemed to archive fasta files very efficiently. 
However, it turned out this only worked for some kinds of fasta data: git's file differential algorithm failed, possibly because of a lack of variation from line to line of a file - it may need lines to have different lengths, word counts, or variation in character content (it seemed to be ok for some nucleotide data, but we found it occasionally failed on uniprot v1 to v11 protein data).
+
+There is a git feature to disable diff analysis for files above a certain file size, which then makes the git archive size comparable to a straight text-file compression algorithm (zip/gzip). Overall, a practical limit of 15 GB has been reported for git archive size. The main barrier seems to be content format; archive retrieval does get slower over time if diffs are being calculated.
+
+Git also has very high memory requirements for its processing, which is a server liability. Tests indicate that for some inputs, git is successful for file sizes up to at least a few gigabytes.
+
+### XML Archiving
+
+We also examined the versioning of XML content. The XArch approach (http://xarch.sourceforge.net/) is very promising as a generic solution, though it entails slow processing due to the nature of XML documents. It compares current and past XML documents and datestamps the additions/deletions to an XML structure. A good article summarizing the XML archive problem: http://useless-factor.blogspot.ca/2008/01/matching-diffing-and-merging-xml.html
+
+
+### Other Database Solutions
+
+A number of key-value databases exist (e.g. LevelDB, LMDB, and DiscoDB) but they are designed for quick access to individual key-value entries, not for querying the key properties essential to version construction, namely filtering every key in the database by when it was created or deleted. Some key-value databases have features that could support speedy version retrieval.
+
+Google's LevelDB - a straight file-based key-value database - may work, though we would need to devise a scheme for including create and delete times in the key for each fasta identifier: possibly one key per fasta id recording the existence of that identifier's sequence, plus a subsequent fastaId.created.deleted key, as in the keydb solution. An advantage here is that compression is built in. In the case where the incoming data consists of a full file to synchronize with, inserts are easy to account for - but deletes can be very tricky, since finding them involves scanning through the entire LevelDB key list, on the assumption that the keys are sorted the same way as the incoming data (the keydb approach; this assumption needs verification).
+
+For processing an entire incoming version of key-value content, the functionality required is that the keys are stored in a reliable sort order and can be read quickly in sequence. During this loop we can retrieve all the create/update/delete transactions of each key by version or date. (Usually the database has no ability to structure the value data, so we would have to create such a schema ourselves.) This is basically what keydb does.
+
+For processing separate incoming files that contain only inserts and deletes, the job is much the same, only we skip swaths of the master database content. At some point it becomes more attractive performance-wise to implement an indexed key-value store, thus avoiding the sorting and sequential access necessary for simpler key-value databases.
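+
+To make the sorted, sequential pass described above concrete, here is a minimal sketch. It is not Kipper's implementation - the function and field names are hypothetical - but it shows how one pass over a sorted master store of (key, created, deleted) records and a sorted incoming version file yields the inserts and deletes that define a new version:
+
+```python
+# Hypothetical illustration of the sorted-key pass described above; not part of this tool.
+def diff_sorted(master, incoming):
+    """master: (key, created, deleted) tuples sorted by key; deleted is None while current.
+    incoming: keys of the new version file, sorted the same way.
+    Returns (inserts, deletes) needed to record the new version."""
+    inserts, deletes = [], []
+    master, incoming = iter(master), iter(incoming)
+    m, k = next(master, None), next(incoming, None)
+    while m is not None or k is not None:
+        if m is not None and m[2] is not None:    # key already deleted in an earlier version
+            m = next(master, None)
+        elif m is None or (k is not None and k < m[0]):
+            inserts.append(k)                     # key appears only in the incoming file
+            k = next(incoming, None)
+        elif k is None or m[0] < k:
+            deletes.append(m[0])                  # current key missing from the incoming file
+            m = next(master, None)
+        else:                                     # key present in both: unchanged
+            m, k = next(master, None), next(incoming, None)
+    return inserts, deletes
+```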
diff -r d31a1bd74e63 -r 5c5027485f7d doc/bad_history_item.png Binary file doc/bad_history_item.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/caching.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/caching.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,15 @@ +# Caching System + +The Versioned Data tool caching system operates in two stages, one covering server folders that contain data store version retrievals, and the second covering Galaxy's "Versioned Data" data library folder behaviour. The "Versioned Data" library folder files mainly link to the server folders having the respective content, so cache management consists of creating/deleting the server's versioned data as well as Galaxy's link to it. As well the "Versioned Data" library has a "Workflow Cache" folder which stores all derivative (output) workflow datasets as requested by Versioned Data tool users. + +The program that periodically clears out the Galaxy link and server data cache is called *versioned_data_cache_clear.py*, and should be set up by a Galaxy administrator to run on a monthly basis, say. It leaves each database's latest version in the cache. + +--- + +### Galaxy behaviour when user history references to missing cached data occur + +If a server data store version folder is deleted, the galaxy data library and user history references to it will be broken. The user will experience a "Not Found" display in their workspace when clicking on a linked datastore like that in their history. The user can simply rerun the "Versioned Data Retrieval" tool to restore the data. + +![message when selected dataset file no longer exists on the server](bad_history_item.png) + +In the example above, clicking on the view (eye) icon for "155:16s_rdp.fasta" triggered this message. To remedy the situation, the user simply reruns the "154: Versioned Data Retrieval" step below to regenerate the dataset(s). diff -r d31a1bd74e63 -r 5c5027485f7d doc/data_provenance.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/data_provenance.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,12 @@ +## Data Provenance and Reproducibility + +When a user selects particular version id or date for versioned data retrieval, this is recorded for future reference, and can be seen in a history item's "View details" (info icon) report, in the "Input Parameters" section. But if a user left the global date field blank or didn't select a particular version of a data source, they or another user can still rerun a Versioned Data retrieval to recreate the results by noting the original history item's view details "Created" date and entering it into the global retrieval date of the form. + +![A history dataset has a detailed view link](history_view_details.png) + +![Data provenance information is available in the detail view](history_dataset_details.png) + +Also, particular dates/versions of a Versioned Data history item's retrieved data are shown in its "Edit Attributes" (pencil icon) report in the "Info" field. + +Because Galaxy also preserves the version id of any galaxy tool it runs (e.g. the makeblastdb version #), rerunning a history/workflow that has these tools should also apply the appropriate software version to generate the secondary data as well. +However, the tool version ids contained within a workflow are not recorded by the versioned data tool per se.; they exist only in the selected workflow's design template, so some care must be taken to freeze or version any workflows used to generate derivative data. 
diff -r d31a1bd74e63 -r 5c5027485f7d doc/data_stores.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/data_stores.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,96 @@ +# Server Data Stores + +These folders hold the git, Kipper, and plain folder data stores. A data store connection directly to a Biomaj databank is also possible, though it is limited to those databanks that have .fasta files in their /flat/ subfolder. How you want to store your data depends on the data itself (size, format) and its frequency of use; generally large fasta databases should be held in a Kipper data store. Generally, each git or Kipper data store needs to have: +* An appropriately named folder; +* A sub-folder called "master/", with the data store files set up in it; +* A file called "pointer.[type of data store]" which contains a path to itself. This file will be linked into a galaxy data library. +Note that all these folders need to be accessible by the same user that runs your Galaxy installation for Galaxy integration to work. Specifically Galaxy needs recursive rwx permissions on these folders. + +To start, you may want to set up the Kipper RDP RNA database (included in the Kipper repository RDP-test-case folder, see README.md file there). It comes complete with the folder structure and files for the Kipper example that follows. You will have to adjust the pointer.Kipper file content appropriately. + +The data store folders can be placed within a single folder, or may be in different locations on the server as desired. In our example versioned data stores are all located under /projects2/reference_dbs/versioned/ . + +## Data Store Examples + +### Kipper data store example: A data store for the RDP RNA database: +```bash + /projects2/reference_dbs/versioned/RDP_RNA/ + /projects2/reference_dbs/versioned/RDP_RNA/pointer.kipper + /projects2/reference_dbs/versioned/RDP_RNA/master/ + /projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna_1 + /projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna_2 + /projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna.md +``` +To start a Kipper data store from scratch, go into the master folder, initialize a Kipper data store there, and import a version of a content file into the Kipper db. + +```bash +cd master/ +kipper.py rdp_rna -M fasta +kipper.py rdp_rna -i [file to import] -o . +``` + +Kipper stores and retrieves just one file at a time. Currently there is no provision for retrieving multiple files from different Kipper data stores at once. + +Note that very large temporary files can be generated during the archive/recall process. For example, a compressed 10Gb NCBI "nr" input fasta file may be resorted and reformatted; or a 10Gb file may be transformed from Kipper format and output to a file. For this reason we have located any necessary temporary files in the input and output folders specified in the Kipper command line. (The system /tmp/ folder can be too small to fit them). + + +### Folder data store example + +In this scenario the data we want to archive probably isn't of a key-value nature, nor is it amenable to diff storage via git, so we're storing each version as a separate file. We don't need the "master" sub-folder since there is no master database, but we do need 1 additional folder to store each version's file(s). + +The version folder names must be in the format **[date]_[version id]** to convey to users the date and version id of each version. 
The folder names will be displayed directly in the Galaxy Versioned Data tool's selectable list of versions. Note that this can allow for various date and time granularity, e.g. "2005_v1" and "2005-01-05 10:24_v1" are both acceptable folder names. Note that several versions can be published on the same day. + +Example of a refseq50 protein database as a folder data store: + +```bash +/projects/reference_dbs/versioned/refseq50/ +/projects/reference_dbs/versioned/refseq50/pointer.folder +/projects/reference_dbs/versioned/refseq50/2005-01-05_v1/file.fasta +/projects/reference_dbs/versioned/refseq50/2005-01-05_v2/file.fasta +/projects/reference_dbs/versioned/refseq50/2005-02-04_v3/file.fasta +... +/projects/reference_dbs/versioned/refseq50/2005-05-24_v4/file.fasta +... +/projects/reference_dbs/versioned/refseq50/2005-09-27_v5/file.fasta +``` + +etc... + +A data store of type "folder" doesn't have to be stored outside of galaxy. Exactly the same folder structure can be set up directly within the galaxy data library, and files can be uploaded inside them. The one drawback to this approach is that then other (non-galaxy platform) server users can't have easy access to version data. + +Needless to say, administrators should **never delete these files since they are not cached!** + +### Git Data Store example + +A git data store for versions of the NCBI 16S microbial database would look like: + +``` +/projects2/reference_dbs/versioned/ncbi_16S/ +/projects2/reference_dbs/versioned/ncbi_16S/pointer.git +/projects2/reference_dbs/versioned/ncbi_16S/master/.git (hidden file) +``` + +One must initialize a git repository (or clone one) into the master/ folder. The Versioned Data system depends on use of git 'tags' to specify versions of data. See the Git Data Store section for details. +To start a git repository from scratch, go into the master folder, initialize git there, copy versioned content into the folder, and then commit it. Finally add a git tag that describes the version identifier. The versioned data system only distinguishes versions by their tag name. Thus one can have several commits between versions. + +``` +cd master/ +git init +cp [files from wherever] ./ +git add [those files] +git commit -m 'various changes' +... +git add [changed files] +git commit -m 'various changes' +... +git tag -a v1 -m v1 +``` + +Once your tag is defined it will be listed in Galaxy as a version + + +### Biomaj data store example + +In this scenario the data we want versioned access to is sitting directly in the /flat/ folder of a Biomaj databank. Each version is a separate file that Biomaj manages. Biomaj can be set to keep all old versions alongside any new one it downloads, or it can limit the total # of versions to a fixed number (with the oldest removed when the newest arrives). + +*coming soon...* diff -r d31a1bd74e63 -r 5c5027485f7d doc/design.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/design.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,45 @@ +# System Design + +## Server Versioned data store folder and Galaxy data library folder structure + +Galaxy's data library will be the main portal to the server locations where versioned data are kept. We have designed the system so that versioned data sources usually exist outside of galaxy in "data store folders". 
One assumption is that, because of the size of the databases involved, and because we are mostly concerned with providing tools with direct access or access to indexes on this data, we will make use of the file system to hold and organize the data store and derived products rather than consider a SQL or NoSQL database warehouse approach.
+
+Our versioned data tool scans the galaxy data library for folders that signal where these data stores are. It only searches for content inside a data library called "Versioned Data". The folder hierarchy under this is flexible - a Galaxy admin can create a hierarchy to their liking, dividing versioned data first by data source (NCBI, 3rd party, etc.) or by type (viral, bacterial, rna, eukaryote, protein, nucleotide, etc.). Eventually a "versioned data folder" is encountered in the hierarchy. It has a specific format:
+
+A versioned data folder has a marker file that indicates what kind of archival technology is at work in it. The marker file also yields the server file path where the archive data exists. (The data library folders themselves are just textual names and so don't provide any information for sniffing where they are or what they contain.)
+
+* pointer.kipper signals a Kipper-managed data store.
+
+* pointer.git signals a git-managed data store.
+
+* pointer.folder signals a basic versioned file storage folder. No caching mechanism applies here; files are permanent unless deleted manually by an admin. The content of this file either points to a server folder, or is left empty to indicate that the library itself should be used directly to store permanent versioned data folder content.
+
+* pointer.biomaj signals a basic versioned file storage folder that points directly to Biomaj databank folders. No caching mechanism applies here; versioned data folders exist to the extent that Biomaj has been configured to keep them. Each data bank version folder will be examined for content in its /flat/ subfolder; those files will be listed as data to retrieve.
+
+The marker file can have permissions set so that only Galaxy admins can see it.
+
+Under the versioned data folder a user can see archive folders whose contents are linked to caches of particular archive data versions (version date and id are indicated in the folder name). More than one sub-folder of archived data can exist since different users may have them in use. The archived data are a COPY of the master archive folder at cache-generation time.
+
+For example, if using git to retrieve a version, the git database for that version is recalled, then a server folder is created to cache that archive version, and all of the git archive's contents are copied over to it. A Galaxy Versioned Data library cached folder's files are usually symlinked to the archive copy folder elsewhere on the server (this is designed so the system can also be used independently of Galaxy by other command line tools).
+
+If the archiving system has a cached version of a particular date available, this is marked by the presence of a "YYYY[-MM-DD-HH-MM-SS]_[VERSION ID]" folder. If this folder does not exist, the archive needs to be regenerated. Archive dates can be as coarse as a year, or can include finer granularity of month, day, hour, etc.
+
+In the server data store, the folder hierarchy for a data store basically provides file storage for the data store, as well as specifically named folders for cached data.
+
+A data store can be broken into separate volumes such that a particular volume covers a range of version ids.
As well, to facilitate parallel processing, a client server can hold a master database in a set of files distinguished by the range of keys they cover, e.g. one master_a.kipper handles all keys beginning with 'a'. + +``` +/.../[database name]/ + master/[datastore name]_[volume #] + master/[datastore name]_[volume #]_[key prefix] (If database is chunked by key prefix) + [version date]_[version_id]/ (A particular extracted version's data e.g. "2014-01-01_v5/file 1 .... file N") +``` + +Derivative data like blast databases are stored under a folder called "Workflow cache" that exists immediately under the versioned data library. The folder hierarchy here consists of a folder for each workflow by workflow id, followed by a folder coded by the id's of each of the workflows inputs; these reference dataset ids that exist in the versioned data library. The combination of workflow id and its input dataset ids yeids a cache of one or more output files that we can count on to be the right output function of the workflow + versioned data input. In the future, this derivative data could be accessed from links under a similar folder structure in the server data store folders. + +``` + workflow_cache/ Contains links to galaxy generated workflow products like blast databases. + workflow_id + [dataset_id]_[dataset_id]... (inputs to the workflow coded in folder name) + Workflow output files for given dataset(s) +``` diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_data_library.png Binary file doc/galaxy_data_library.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_library.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/galaxy_library.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,36 @@ +## Galaxy "Versioned Data" Library Setup + +1. Create a Galaxy data library called "Versioned Data". The name is important - it must be located by the Versioned Data tool. + +![Galaxy Data Library](galaxy_data_library.png) + +2. Set permissions so the special "versioneddata@localhost.com" user has add/update permissions on the "Versioned Data" library. + +3. Then add any folder structure you want under Versioned Data. Top level folders could be "Bacteria, Virus, Eukaryote", or "NCBI, ... ". Underlying folders can hold versioned data for particular bacteria / virus databases, e.g. "NCBI nt". + +4. Connect a **"pointer.[data store type]"** file (a simple text file) to any folder on your server that you want to activate as a versioned data store. Any folder that has a "pointer.[data store type]" file in it will be treated as a folder containing versioned content, as illustrated above. These folders will then be listed (by name) in the Versioned Data tool's list of data stores. Within these folders, links to caches of retrieved versioned data will be kept (shown as "cached data" items in illustration). For "folder" and "biomaj" data stores, links will be to permanent files, not cached ones. + +For example, on Galaxy page Shared Data > Data Libraries > Versioned Data, there is a folder/file: + + `Bacterial/RDP RNA/pointer.kipper` + +which contains one line of text: + + `/projects2/ref_databases/versioned/rdp_rna/` + +That enables the Versioned Data tool to list the Kipper archive as "RDP RNA" (name of folder that pointer file is in), and to know where its data store is. Beneath this folder other cached folders will accumulate. + +The "upload files to a data library" page (shown below) has the ability to link to the pointer file directly in the data store folder. 
Select the displayed "Upload files from filesystem paths" option (available only if you access this form from the admin menu in Galaxy). Enter the path and FILENAME (literally "pointer.kipper" in this case) of your pointer file. If you forget the filename, galaxy will link all the content files, which will need to be deleted before continuing. + +![Link pointer file to Versioned Data subfolder](library_dataset_upload.png) + +Preferably select the "Link to files without copying into Galaxy" option as well. This isn't required, but linking to the pointer file enables easier diagnosis of folder path problems. + +Then submit the form; if an error occurs then verify that the pointer file path is correct. + +**Note: a folder called "Workflow Cache" is automatically created within the Versioned Data folder to hold cached workflow results as triggered by the Versioned Data tool. No maintenance of this folder is needed.** + +### Direct use of Data Library Folder + +Note: The "folder" data store is a simple data store type - marking a folder this way enables the Versioned Data tool to directly list the subfolders of this Library folder (by name), with expectation that each folder's contents can be placed in user's history directly. The one optional twist to this is that if the folder's .pointer file is sym-linked to another folder on the server, then folder contents will be retrieved from the symlinked location. + diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_tool.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/galaxy_tool.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,26 @@ +# The Galaxy Versioned Data Tool + +This tool retrieves links to current or past versions of fasta or other types of data from a cache kept in the Galaxy data library called "Versioned Data". It then places them into the current history so that subsequent tools can work with that data. A blast search can be carried out on a version of a fasta database from a year ago for example. + +![Galaxy Versioned Data Tool](versioned_data_retrieval.png) + +You can select one or more files by version date or id. (This list is supplied from the Shared Data > Data Libraries > Versioned Data folder that has been set up by a Galaxy administrator). + +In the versioned data tool, user selects a data source, and then selects a version to retrieve (by date or version id). +If a cached version of that database exists, it is linked into user's history. +Otherwise a new version of it is created, placed in cache, and linked into history. +The Versioned Data form starts with an optional top-level "Global retrieval date" which is applied to all selected databases. This can be overridden by a retrieval date or version that you supply for a particular database. + +Finally, if you just select a data source to retrieve, but no global retrieval date or particular versions, the most recent version of the selected data source will be retrieved. + +The caching system caches both the versioned data and workflow data that the tool generates. If you request versioned data or derivative data that isn't cached, then (depending on the size of the archive) it may take time to regenerate. + + + +## Generation of workflow data + +The Workflows section allows you to select one or more pre-defined workflows to execute on the versioned data. Currently this includes any workflow that begins with the phrase "Versioned: ". The results are placed in your history for use by other tools or workflows. 
+ +Currently workflow parameters must be entirely specified ("canned"), when the workflow is created/updated, rather than being specified at runtime. This means that a separate workflow with fixed settings must be predefined for each desired retrieval process (e.g. a blastdb with regions of low complexity filtered out, which requires a few steps to execute -dustmasker + makeblastdb etc). + +Any user that needs more specific parameters for a reference database creation can just invoke the tools/steps after using the Versioned Data tool to retrieve the raw fasta data. The only drawback in this case is that the derivative data can't be cached - it has to be redone each time the tool is run. diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_tool_form.png Binary file doc/galaxy_tool_form.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_tool_install.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/galaxy_tool_install.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,22 @@ +## Galaxy tool installation + +1. **This step requires a restart of Galaxy**: Edit your Galaxy config file ("universe_wsgi.conf" in older installs): + 1. Add the tool's default user name "**versioneddata@localhost.com**" to the admin_user's list line. + 1. Ensure the Galaxy API is activated: "enable_api = True" + +2. Then ensure **you** have an API key (from User > Api keys menu). + +3. Login to Galaxy as an administrator, and install the Versioned Data tool. Go to the Galaxy Admin > Tool Sheds > "Search and browse tool sheds" menu. Currently this tool is available from the tool shed at https://toolshed.g2.bx.psu.edu/view/damion/versioned_data . The "versioned_data" repository is located under the "sequence analysis" category. + +4. After it is installed, display the tool form once in Galaxy - this allows the tool to do the remaining setup of its "versioneddata" user. The tool automatically creates a galaxy API user (by default, versioneddata@localhost.com). You will see an API key generated automatically as well. **The tool (via Galaxy) automatically writes a copy of this key in a file called "versioneddata_api_key.txt" in its install file folder for reference; if you ever want to change the API key for the versioneddata user, you need to delete this file, and it will be regenerated automatically on next use of the tool by an admin.** + +5. Also, ensure that the kipper.py file has the Galaxy server's system user as its owner with executable permission ( "chown galaxy kipper.py; chmod u+x kipper.py"). + + +### Command-line Kipper + + + Optional: If you want to work with Kipper files directly, e.g. to start a Kipper repository from scratch, then sym-link the kipper.py file that is in the Versioned Data tool code folder's /data_stores/ subfolder - you could link it as "/usr/bin/kipper" for example. + The tool path is shown in the "location" field when you click on "Versioned Data" tool in Galaxy's "admin > Administration > Manage installed tool shed repositories" report. + +More info on command-line kipper is available [here](https://github.com/Public-Health-Bioinformatics/kipper). 
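+
+As a quick reference, the configuration and permission steps above can be summarized in shell form. This is only a sketch: the config file name and install path are illustrative (they vary by Galaxy version and by where the toolshed placed the repository - use the "location" field mentioned above), not a literal recipe.
+
+```bash
+# In your Galaxy config file ("universe_wsgi.conf" in older installs), ensure:
+#   admin_users = you@yourdomain.org,versioneddata@localhost.com
+#   enable_api = True
+
+# Make kipper.py owned and executable by the Galaxy system user (illustrative path):
+cd /path/to/shed_tools/versioned_data/data_stores/
+chown galaxy kipper.py
+chmod u+x kipper.py
+
+# Optional: expose command-line Kipper for all server users.
+ln -s $(pwd)/kipper.py /usr/bin/kipper
+```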
diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_versioned_data_tool.odt Binary file doc/galaxy_versioned_data_tool.odt has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_versioned_data_tool.pdf Binary file doc/galaxy_versioned_data_tool.pdf has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/history_dataset_details.png Binary file doc/history_dataset_details.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/history_view_details.png Binary file doc/history_view_details.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/library_dataset_upload.png Binary file doc/library_dataset_upload.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/maintenance.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/maintenance.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,42 @@ +# Permissions, security, and maintenance + +## Permissions + +When installing the Versioned Data tool, ensure that you are in the Galaxy config admin_user's list and that you have an API key. Run the tool at least once after install, that way the tool can create its own "versioneddata" user if it isn't already there. The tool uses this account to run data retrieval workflow jobs. + +For your server data store folders, since galaxy is responsible for triggering the generation and storage of versions in them, it will need permissions. For example, assuming "galaxy" is the user account that runs galaxy: + +```bash + chown -R galaxy /projects2/reference_dbs/versioned/ + chmod u+rwx /projects2/reference_dbs/versioned/* +``` + +Otherwise you will be confronted with an error similar to the following error within Galaxy when you try to do a retrieval: + +```bash + File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child + raise child_exception + OSError: [Errno 13] Permission denied" +``` + +## Cache Clearing + +All the retrieved data store versions (except for static "folder" data stores) get cached on the server, and all the triggered galaxy workflow runs get cached in galaxy in the Versioned Data library's "Workflow cache" folder. A special script will clear out all but the most recent version of any data store's cached versions, and remove workflow caches as appropriate. This script is named + + `versioned_data_cache_clear.py` + +and is in the Versioned Data script's shed_tools folder. It can be run as a monthly scheduled task. + +The "versioneddata" user has a history for each workflow it runs on behalf of another user. A galaxy admin can impersonate the "versioneddata" user to see the workflows being executed by other users, and as well, manually delete any histories that haven't been properly deleted by the cache clearing script. + +Note that Galaxy won't delete a dataset if it is linked from another (user or library) context. + +## Notes + +In galaxy, the galaxy (i) information link from a "Versioned Data Retrieval" history job item will display all the form data collected for the job run. One item, + + `For user with Galaxy API Key http://salk.bccdc.med.ubc.ca/galaxylab/api-cc628a7dffdbeca7` + +is used to pass the api url, and user's current history id. We weren't able to convey this information any other way. The coded parameter is not the user's api key; that is kept confidential, i.e. it doesn't exist in the records of the job run. 
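+
+As noted in the Cache Clearing section above, the script can be run on a monthly schedule. A hedged example crontab entry follows; the script path is illustrative (use the shed_tools location of your install), and you should check the script's own usage for any arguments it may require:
+
+```bash
+# Run the cache clearing script monthly (2am on the 1st) as the galaxy user; path is illustrative.
+0 2 1 * * /usr/bin/python /path/to/shed_tools/versioned_data/versioned_data_cache_clear.py
+```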
+
+
diff -r d31a1bd74e63 -r 5c5027485f7d doc/problem_solving.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/problem_solving.md Sun Aug 09 16:07:50 2015 -0400
@@ -0,0 +1,27 @@
+# Problem solving
+
+If the Galaxy API is not working, then it needs to be enabled in universe_wsgi.ini with "enable_api = True".
+
+The "Versioned data retrieval" tool depends on the bioblend python module for accessing Galaxy's API. This enables reading and writing to a particular Galaxy data library and to the current user's history in a more elegant way.
+
+ `> pip install bioblend`
+
+This pulls in a variety of dependencies - requests, poster, boto, pyyaml, ... (On Ubuntu, install libyaml-dev first to avoid an odd pyyaml error.) Note that this must also be done on the Galaxy toolshed's install.
+
+**Data library download link doesn't work**
+
+If a data library has a version data folder that is linked to the data store elsewhere on the server, that folder's download link probably won't work until you adjust your galaxy webserver (apache or nginx) configuration; see Galaxy's Apache/nginx proxy documentation, including its note on galaxy user permissions vis-à-vis the apache user. For the example where all data stores are under /projects2/reference_dbs/versioned/, Apache needs two lines in the galaxy site configuration (usually in /etc/httpd/conf/httpd.conf).
+
+```
+XSendFile on
+XSendFilePath ... (other paths)
+XSendFilePath /projects2/reference_dbs/versioned/
+```
+
+Without this path and sufficient permissions, errors will show up in the Apache httpd log, and galaxy users will find that they can't download the versioned data if it is linked from the galaxy library to other locations on the server.
+
+**Data store version deletion**
+
+Note: if a server data store version folder has been deleted, but a link to it still exists in the Versioned Data library, then attempting to download the dataset from there will result in an error. Running the Versioned Data tool to request this file will regenerate it in the server data store cache. Alternatively, you can delete the versioned data library's version cache folder.
+
+
diff -r d31a1bd74e63 -r 5c5027485f7d doc/setup.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/setup.md Sun Aug 09 16:07:50 2015 -0400
@@ -0,0 +1,14 @@
+## Setup for Admins
+
+The general goal of the following configuration is to enable the versioned data to be generated and used directly by server (non-Galaxy-platform) users; at the same time, Galaxy users have access to the same versioned data system (and also have the benefit of generating derivative data via Galaxy workflows) without having to leave the Galaxy platform. In the background, a scheduled reference database import process (in our case maintained by Biomaj) keeps the master data store files up to date.
+
+To set up the Versioned Data Tool, do the following:
+
+ 1. [Galaxy tool installation](galaxy_tool_install.md)
+ 2. [Server data stores](data_stores.md)
+ 4. [Galaxy "Versioned Data" library setup](galaxy_library.md)
+ 5. [Workflow configuration](workflows.md)
+ 6. [Permissions, security, and maintenance](maintenance.md)
+ 7. [Problem solving](problem_solving.md)
+
+Any galaxy user who wants to use this tool will need a Galaxy API key. They can get one via their User menu, or a Galaxy admin can assign one for them.
diff -r d31a1bd74e63 -r 5c5027485f7d doc/versioned_data_retrieval.png Binary file doc/versioned_data_retrieval.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/workflow_tools.png Binary file doc/workflow_tools.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d doc/workflows.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/workflows.md Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,18 @@ +# Workflow configuration + +**All displayed workflows must be prefixed with the keyword "Versioning:"** This is how they are recognized by the Versioned Data tool + +In order for a workflow to show up in the tool's Workflows menu, **it must be published by its author**. Currently there is no other ability to offer finer-grained access to lists of workflows by individual or user group. + +![Galaxy workflows](workflows.png) + +Workflow processing is accomplished on behalf of a user by launching it in a history of the versioneddata@localhost.com user. Cache clearing will remove these file references from the data library when they are old (but as the files are often moved up to the library's cache they can still be used by others until that cache is cleared. An admin can impersonate a "versioneddata" user to see the workflows being executed by other users. + +Note that any workflow that uses additional input datasets must have those datasets set in the workflow design/template, so they must exist in a fixed location - a data library. (Currently they can't be in the Versioned Data library). + +If a a workflow references a tool version that has been uninstalled, one will receive this error when working on it. The only remedy is to reinstall that particular tool version, or to change the workflow to a newer version. + +```bash + File "/usr/lib/python2.6/site-packages/bioblend/galaxyclient.py", line 104, in make_post_request r.status_code, body=r.text) + bioblend.galaxy.client.ConnectionError: Unexpected response from galaxy: 500: {"traceback": "Traceback (most recent call last):\n File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/web/framework/decorators.py\", line 244, in decorator\n rval = func( self, trans, *args, **kwargs)\n File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/webapps/galaxy/api/workflows.py\", line 231, in create\n populate_state=True,\n File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/workflow/run.py\", line 21, in invoke\n return __invoke( trans, workflow, workflow_run_config, workflow_invocation, populate_state )\n File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/workflow/run.py\", line 60, in __invoke\n modules.populate_module_and_state( trans, workflow, workflow_run_config.param_map )\n File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/workflow/modules.py\", line 1014, in populate_module_and_state\n step_errors = module_injector.inject( step, step_args=step_args, source=\"json\" )\n File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/workflow/modules.py\", line 992, in inject\n raise MissingToolException()\nMissingToolException\n", "err_msg": "Uncaught exception in exposed API method:", "err_code": 0} +``` diff -r d31a1bd74e63 -r 5c5027485f7d doc/workflows.png Binary file doc/workflows.png has changed diff -r d31a1bd74e63 -r 5c5027485f7d ffp_macros.xml --- a/ffp_macros.xml Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,21 +0,0 @@ - - - - - - - - - - - - - - - - @BINARY@ - - @BINARY@ --version - - - \ No newline at end of file diff -r d31a1bd74e63 -r 5c5027485f7d ffp_phylogeny.py 
--- a/ffp_phylogeny.py Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,364 +0,0 @@ -#!/usr/bin/python -import optparse -import re -import time -import os -import tempfile -import sys -import shlex, subprocess -from string import maketrans - -VERSION_NUMBER = "0.1.03" - -class MyParser(optparse.OptionParser): - """ - From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output - Provides a better class for displaying formatted help info in epilog() portion of optParse; allows for carriage returns. - """ - def format_epilog(self, formatter): - return self.epilog - - -def stop_err( msg ): - sys.stderr.write("%s\n" % msg) - sys.exit(1) - -def getTaxonomyNames(type, multiple, abbreviate, filepaths, filenames): - """ - Returns a taxonomic list of names corresponding to each file being analyzed by ffp. - This may also include names for each fasta sequence found within a file if the - "-m" multiple option is provided. Default is to use the file names rather than fasta id's inside the files. - NOTE: THIS DOES NOT (MUST NOT) REORDER NAMES IN NAME ARRAY. - EACH NAME ENTRY IS TRIMMED AND MADE UNIQUE - - @param type string ['text','amino','nucleotide'] - @param multiple boolean Flag indicates to look within files for labels - @param abbreviate boolean Flag indicates to shorten labels - @filenames array original input file names as user selected them - @filepaths array resulting galaxy dataset file .dat paths - - """ - # Take off prefix/suffix whitespace/comma : - taxonomy = filenames.strip().strip(',').split(',') - names=[] - ptr = 0 - - for file in filepaths: - # Trim labels to 50 characters max. ffpjsd kneecaps a taxonomy label to 10 characters if it is greater than 50 chars. - taxonomyitem = taxonomy[ptr].strip()[:50] #.translate(translations) - # Convert non-alphanumeric characters to underscore in taxonomy names. ffprwn IS VERY SENSITIVE ABOUT THIS. - taxonomyitem = re.sub('[^0-9a-zA-Z]+', '_', taxonomyitem) - - if (not type in 'text') and multiple: - #Must read each fasta file, looking for all lines beginning ">" - with open(file) as fastafile: - lineptr = 0 - for line in fastafile: - if line[0] == '>': - name = line[1:].split(None,1)[0].strip()[:50] - # Odd case where no fasta description found - if name == '': name = taxonomyitem + '.' + str(lineptr) - names.append(name) - lineptr += 1 - else: - - names.append(taxonomyitem) - - ptr += 1 - - if abbreviate: - names = trimCommonPrefixes(names) - names = trimCommonPrefixes(names, True) # reverse = Suffixes. - - return names - -def trimCommonPrefixes(names, reverse=False): - """ - Examines sorted array of names. Trims off prefix of each subsequent pair. - - @param names array of textual labels (file names or fasta taxonomy ids) - @param reverse boolean whether to reverse array strings before doing prefix trimming. - """ - wordybits = '|.0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' - - if reverse: - names = map(lambda name: name[::-1], names) #reverses characters in names - - sortednames = sorted(names) - ptr = 0 - sortedlen = len(sortednames) - oldprefixlen=0 - prefixlen=0 - for name in sortednames: - ptr += 1 - - #If we're not at the very last item, reevaluate prefixlen - if ptr < sortedlen: - - # Skip first item in an any duplicate pair. Leave duplicate name in full. 
- if name == sortednames[ptr]: - if reverse: - continue - else: - names[names.index(name)] = 'DupLabel-' + name - continue - - # See http://stackoverflow.com/questions/9114402/regexp-finding-longest-common-prefix-of-two-strings - prefixlen = len( name[:([x[0]==x[1] for x in zip(name, sortednames[ptr])]+[0]).index(0)] ) - - if prefixlen <= oldprefixlen: - newprefix = name[:oldprefixlen] - else: - newprefix = name[:prefixlen] - # Expands label to include any preceeding characters that were probably part of it. - newprefix = newprefix.rstrip(wordybits) - newname = name[len(newprefix):] - # Some tree visualizers don't show numeric labels?!?! - if not reverse and newname.replace('.','',1).isdigit(): - newname = 'id_' + newname - names[names.index(name)] = newname #extract name after prefix part; has nl in it - oldprefixlen = prefixlen - - if reverse: - names = map(lambda name: name[::-1], names) #now back to original direction - - return names - -def getTaxonomyFile(names): - """ - FFP's ffpjsd -p [taxon file of labels] option creates a phylip tree with - given taxon labels - - @param names array of datafile names or fasta sequence ids - """ - - try: - temp = tempfile.NamedTemporaryFile(mode='w+t',delete=False) - taxonomyTempFile = temp.name - temp.writelines(name + '\n' for name in names) - - except: - stop_err("Galaxy configuration error for ffp_phylogeny tool. Unable to write taxonomy file " + taxonomyTempFile) - - finally: - temp.close() - - return taxonomyTempFile - - -def check_output(command): - """ - Execute a command line containing a series of pipes; and handle error cases by exiting at first error case. This is a substitute for Python 2.7 subprocess.check_output() - allowing piped commands without shell=True call . Based on Python subprocess docs 17.1.4.2 - - ISSUE: warnings on stderr are given with no exit code 0: - ffpry: Warning: No keys of length 6 found. - ffpcol: (null): Not a key valued FFP. - - Can't use communicate() because this closes processes' stdout - file handle even without errors because of read to end of stdout: - (stdoutdata, stderrdata) = processes[ptr-1].communicate() - - """ - commands = command.split("|") - processes = [] - ptr = 0 - substantive = re.compile('[a-zA-Z0-9]+') - - for command_line in commands: - print command_line.strip() - args = shlex.split(command_line.strip()) - if ptr == 0: - proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) - processes.append(proc) - else: - - #this has to come before error processing? - newProcess = subprocess.Popen(args, stdin=processes[ptr-1].stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE) - - # It seems the act of reading standard error output is enough to trigger - # error code signal for that process, i.e. so that retcode returns a code. - retcode = processes[ptr-1].poll() - stderrdata = processes[ptr-1].stderr.read() - #Issue with ffptree is it outputs ---- ... ---- on stderr even when ok. - if retcode or (len(stderrdata) > 0 and substantive.search(stderrdata)): - stop_err(stderrdata) - - processes.append(newProcess) - processes[ptr-1].stdout.close() # Allow prev. process to receive a SIGPIPE if current process exits. 
- - ptr += 1 - - retcode = processes[ptr-1].poll() - (stdoutdata, stderrdata) = processes[ptr-1].communicate() - if retcode or (len(stderrdata) > 0 and substantive.search(stderrdata)): - stop_err(stderrdata) - - return stdoutdata - - -class ReportEngine(object): - - def __init__(self): pass - - def __main__(self): - - - ## *************************** Parse Command Line ***************************** - parser = MyParser( - description = 'FFP (Feature frequency profile) is an alignment free comparison tool', - usage = 'python ffp_phylogeny.py [input_files] [output file] [options]', - epilog="""Details: - - FFP (Feature frequency profile) is an alignment free comparison tool for phylogenetic analysis and text comparison. It can be applied to nucleotide sequences, complete genomes, proteomes and even used for text comparison. - - """) - - parser.set_defaults(row_limit=0) - # Don't use "-h" , it is reserved for --help! - - parser.add_option('-t', '--type', type='choice', dest='type', default='text', - choices=['amino','nucleotide','text'], - help='Choice of Amino acid, nucleotide or plain text sequences to find features in') - - parser.add_option('-l', '--length', type='int', dest='length', default=6, - help='Features (any string of valid characters found in data) of this length will be counted. Synonyms: l-mer, k-mer, n-gram, k-tuple') - - #parser.add_option('-n', '--normalize', dest='normalize', default=True, action='store_true', - # help='Normalize counts into relative frequency') - - parser.add_option('-m', '--multiple', dest='multiple', default=False, action='store_true', - help='By default all sequences in a fasta file be treated as 1 sequence to profile. This option enables each sequence found in a fasta file to have its own profile.') - - parser.add_option('-M', '--metric', type='string', dest='metric', - help='Various metrics to measure count distances by.') - - parser.add_option('-x', '--taxonomy', type='string', dest='taxonomy', - help='Taxanomic label for each profile/sequence.') - - parser.add_option('-d', '--disable', dest='disable', default=False, action='store_true', - help='By default amino acid and nucleotide characters are grouped by functional category (protein or purine/pyrimidine group) before being counted. 
Disable this to treat individual characters as distinct.') - - parser.add_option('-a', '--abbreviate', dest='abbreviate', default=False, action='store_true', - help='Shorten tree taxonomy labels as much as possible.') - - parser.add_option('-s', '--similarity', dest='similarity', default=False, action='store_true', - help='Enables pearson correlation coefficient matrix and any of the binary distance measures to be turned into similarity matrixes.') - - parser.add_option('-f', '--filter', type='choice', dest='filter', default='none', - choices=['none','count','f','n','e','freq','norm','evd'], - help='Choice of [f=raw frequency|n=normal|e=extreme value (Gumbel)] distribution: Features are trimmed from the data based on lower/upper cutoff points according to the given distribution.') - - parser.add_option('-L', '--lower', type='float', dest='lower', - help='Filter lower bound is a 0.00 percentages') - - parser.add_option('-U', '--upper', type='float', dest='upper', - help='Filter upper bound is a 0.00 percentages') - - parser.add_option('-o', '--output', type='string', dest='output', - help='Path of output file to create') - - parser.add_option('-T', '--tree', dest='tree', default=False, action='store_true', help='Generate Phylogenetic Tree output file') - - parser.add_option('-v', '--version', dest='version', default=False, action='store_true', help='Version number') - - # Could also have -D INT decimal precision included for ffprwn . - - options, args = parser.parse_args() - - if options.version: - print VERSION_NUMBER - return - - import time - time_start = time.time() - - try: - in_files = args[:] - - except: - stop_err("Expecting at least 1 input data file.") - - - #ffptxt / ffpaa / ffpry - if options.type in 'text': - command = 'ffptxt' - - else: - if options.type == 'amino': - command = 'ffpaa' - else: - command = 'ffpry' - - if options.disable: - command += ' -d' - - if options.multiple: - command += ' -m' - - command += ' -l ' + str(options.length) - - if len(in_files): #Note: app isn't really suited to stdio - command += ' "' + '" "'.join(in_files) + '"' - - #ffpcol / ffpfilt - if options.filter != 'none': - command += ' | ffpfilt' - if options.filter != 'count': - command += ' -' + options.filter - if options.lower > 0: - command += ' --lower ' + str(options.lower) - if options.upper > 0: - command += ' --upper ' + str(options.upper) - - else: - command += ' | ffpcol' - - if options.type in 'text': - command += ' -t' - - else: - - if options.type == 'amino': - command += ' -a' - - if options.disable: - command += ' -d' - - #if options.normalize: - command += ' | ffprwn' - - #Now create a taxonomy label file, ensuring a name exists for each profile. - taxonomyNames = getTaxonomyNames(options.type, options.multiple, options.abbreviate, in_files, options.taxonomy) - taxonomyTempFile = getTaxonomyFile(taxonomyNames) - - # -p = Include phylip format 'infile' of the taxon names to use. Very simple, just a list of fasta identifier names. 
- command += ' | ffpjsd -p ' + taxonomyTempFile - - if options.metric and len(options.metric) >0 : - command += ' --' + options.metric - if options.similarity: - command += ' -s' - - # Generate Newick (.nhx) formatted tree if we have at least 3 taxonomy items: - if options.tree: - if len(taxonomyNames) > 2: - command += ' | ffptree -q' - else: - stop_err("For a phylogenetic tree display, one must have at least 3 ffp profiles.") - - #print command - - result = check_output(command) - with open(options.output,'w') as fw: - fw.writelines(result) - os.remove(taxonomyTempFile) - -if __name__ == '__main__': - - time_start = time.time() - - reportEngine = ReportEngine() - reportEngine.__main__() - - print('Execution time (seconds): ' + str(int(time.time()-time_start))) - diff -r d31a1bd74e63 -r 5c5027485f7d ffp_phylogeny.xml --- a/ffp_phylogeny.xml Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,293 +0,0 @@ - - An alignment free comparison tool for phylogenetic analysis and text comparison - - ffp-phylogeny - - - - ./ffp_phylogeny.py - ffp_macros.xml - - - 1 - - ]]> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - [ ffpfilt | ffpcol > ffprwn] > ffpjsd > ffptree** . The last step is optional - by deselecting the "Generate Tree Phylogeny" checkbox, the tool will output only the precursor distance matrix file rather than a Newick (.nhx) formatted tree file. - -Each sequence or text file has a profile containing tallies of each feature found. A feature is a string of valid characters of given length. - -For nucleotide data, by default each character (ATGC) is grouped as either purine(R) or pyrmidine(Y) before being counted. - -For amino acid data, by default each character is grouped into one of the following: -(ST),(DE),(KQR),(IVLM),(FWY),C,G,A,N,H,P. Each group is represented by the first character in its series. - -One other key concept is that a given feature, e.g. "TAA" is counted in forward -AND reverse directions, mirroring the idea that a feature's orientation is not -so important to distinguish when it comes to alignment-free comparison. -The counts for "TAA" and "AAT" are merged. - -The labeling of the resulting counted feature items is perhaps the trickiest -concept to master. Due to computational efficiency measures taken by the -developers, a feature that we see on paper as "TAC" may be stored and labeled -internally as "GTA", its reverse compliment. One must look for the alternative -if one does not find the original. - -Also note that in amino acid sequences the stop codon "*" (or any other character -that is not in the Amino acid alphabet) causes that character frame not to be -counted. Also, character frames never span across fasta entries. - -A few tutorials: - * http://sourceforge.net/projects/ffp-phylogeny/files/Documentation/tutorial.pdf - * https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny - -------- - -.. class:: warningmark - -**Note** - -Taxonomy label details: If each file contains one profile, the file's name is used to label the profile. -If each file contains fasta sequences to profile individually, their fasta identifiers will be used to label them. -The "short labels" option will find the shortest label that uniquely identifies each profile. 
-Either way, there are some quirks: ffpjsd clips labels to 10 characters if they are greater than 50 characters, so all labels are trimmed to 50 characters first. -Also "id" is prefixed to any numeric label since some tree visualizers won't show purely numeric labels. -In the accidental case where a Fasta sequence label is a duplicate of a previous one it will be prefixed by "DupLabel-". - -The command line ffpjsd can hang if one provides an l-mer length greater than the length of file content. -One must identify its process id (">ps aux | grep ffpjsd") and kill it (">kill [process id]"). -------- - -**References** - -The original ffp-phylogeny code is at http://ffp-phylogeny.sourceforge.net/ . -This tool uses Aaron Petkau's modified version: https://github.com/apetkau/ffp-3.19-custom . - -The development of the ff-phylogeny should be attributed to: - -Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 2009;106(8):2677-2682. doi:10.1073/pnas.0813249106. - - ]]> - - - diff -r d31a1bd74e63 -r 5c5027485f7d tarballit.sh --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tarballit.sh Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,2 @@ +#!/bin/bash + tar -zcvf versioned_data.tar.gz * --exclude "*~" --exclude "*.pyc" --exclude "tool_test_output*" --exclude "*.gz" diff -r d31a1bd74e63 -r 5c5027485f7d test-data/genome1 --- a/test-data/genome1 Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,2 +0,0 @@ ->genome1 -AATT diff -r d31a1bd74e63 -r 5c5027485f7d test-data/genome2 --- a/test-data/genome2 Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,2 +0,0 @@ ->genome2 -AAGG diff -r d31a1bd74e63 -r 5c5027485f7d test-data/test_length_1_output.tabular --- a/test-data/test_length_1_output.tabular Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,3 +0,0 @@ -2 -genome1 0.00e+00 1.89e-01 -genome2 1.89e-01 0.00e+00 diff -r d31a1bd74e63 -r 5c5027485f7d test-data/test_length_2_output.tabular --- a/test-data/test_length_2_output.tabular Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,3 +0,0 @@ -2 -genome1 0.00e+00 4.58e-01 -genome2 4.58e-01 0.00e+00 diff -r d31a1bd74e63 -r 5c5027485f7d test-data/test_length_2b_output.tabular --- a/test-data/test_length_2b_output.tabular Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,3 +0,0 @@ -2 -genome1 0.00e+00 1.42e-01 -genome2 1.42e-01 0.00e+00 diff -r d31a1bd74e63 -r 5c5027485f7d tool_dependencies.xml --- a/tool_dependencies.xml Sun Aug 09 16:05:40 2015 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,24 +0,0 @@ - - - - - - git clone https://github.com/apetkau/ffp-3.19-custom.git ffp-phylogeny - git reset --hard d4382db015acec0e5cc43d6c1ac80ae12cb7e6b3 - ./configure --disable-gui --prefix=$INSTALL_DIR - - - - $INSTALL_DIR/bin - - - - - apetkau/ffp-3.19-custom is a customized version of http://sourceforge.net/projects/ffp-phylogeny/ - - - - diff -r d31a1bd74e63 -r 5c5027485f7d vdb_common.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/vdb_common.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,88 @@ +#!/usr/bin/python +import re +import os +import time +import dateutil +import dateutil.parser as parser2 +import datetime +import calendar + +def parse_date(adate): + """ + Convert human-entered time into linux integer timestamp + + @param adate string Human 
entered date to parse into linux time + + @return integer Linux time equivalent or 0 if no date supplied + """ + adate = adate.strip() + if adate > '': + adateP = parser2.parse(adate, fuzzy=True) + #dateP2 = time.mktime(adateP.timetuple()) + # This handles UTC & daylight savings exactly + return calendar.timegm(adateP.timetuple()) + return 0 + + +def get_unix_time(vtime, voffset=0): + return float(vtime) - int(voffset)/100*60*60 + + +def natural_sort_key(s, _nsre=re.compile('([0-9]+)')): + return [int(text) if text.isdigit() else text.lower() + for text in re.split(_nsre, s)] + + +class date_matcher(object): + """ + Enables cycling through a list of versions and picking the one that matches + or immediately preceeds a given date. As soon as an item is found, subsequent + calls to date_matcher return false (because of the self.found flag) + """ + def __init__(self, unix_time): + """ + @param adate linux date/time + """ + self.found = False + self.unix_time = unix_time + + def __iter__(self): + return self + + def next(self, unix_datetime): + select = False + if (self.found == False) and (self.unix_time > 0) and (unix_datetime <= self.unix_time): + self.found = True + select = True + return select + + +def dateISOFormat(atimestamp): + return datetime.datetime.isoformat(datetime.datetime.fromtimestamp(atimestamp)) + +def lightDate(unixtime): + return datetime.datetime.utcfromtimestamp(float(unixtime)).strftime('%Y-%m-%d %H:%M') + +def move_files(source_path, destination_path, file_paths): + """ + MOVE FILES TO CACHE FOLDER (RATHER THAN COPYING THEM) FOR GREATER EFFICIENCY. + Since a number of data source systems have hidden / temporary files in their + data folder structure, a list of file_paths is required to select only that + content that should be copied over. Note: this will leave skeleton of folders; only files are moved. + + Note: Tried using os.renames() but it errors when attempting to remove folders + from git archive that aren't empty due to files that are not to be copied. + + + @param source_path string Absolute folder path to move data files from + @param destination_path string Absolute folder path to move data files to + @param file_paths string List of files and their relative paths from source_path root + """ + for file_name in file_paths: + if len(file_name): + print "(" + file_name + ")" + v_path = os.path.dirname(os.path.join(destination_path, file_name)) + if not os.path.isdir(v_path): + os.makedirs(v_path) + os.rename(os.path.join(source_path, file_name), os.path.join(destination_path, file_name) ) + diff -r d31a1bd74e63 -r 5c5027485f7d vdb_retrieval.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/vdb_retrieval.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,853 @@ +#!/usr/bin/python + +""" +****************************** vdb_retrieval.py ****************************** + VDBRetrieval() instance called in two stages: + 1) by tool's versioned_data.xml form (dynamic_option field) + 2) by its executable versioned_data.py script. + +""" + +import os, sys, glob, time +import string +from random import choice + +from bioblend.galaxy import GalaxyInstance +from requests.exceptions import ChunkedEncodingError +from requests.exceptions import ConnectionError + +import urllib2 +import json +import vdb_common + +# Store these values in python/galaxy environment variables? 
+VDB_DATA_LIBRARY = 'Versioned Data' +VDB_WORKFLOW_CACHE_FOLDER_NAME = 'Workflow cache' +VDB_CACHED_DATA_LABEL = 'Cached data' + +# Don't forget to add "versionedata@localhost.com" to galaxy config admin_users list. + +VDB_ADMIN_API_USER = 'versioneddata' +VDB_ADMIN_API_EMAIL = 'versioneddata@localhost.com' +VDB_ADMIN_API_KEY_PATH = os.path.join(os.path.dirname(sys._getframe().f_code.co_filename), 'versioneddata_api_key.txt') + +#kipper, git, folder and other registered handlers +VDB_STORAGE_OPTIONS = 'kipper git folder biomaj' + +# Used in versioned_data_form.py +VDB_DATASET_NOT_AVAILABLE = 'This database is not currently available (no items).' +VDB_DATA_LIBRARY_FOLDER_ERROR = 'Error: this data library folder is not configured correctly.' +VDB_DATA_LIBRARY_CONFIG_ERROR = 'Error: Check folder config file: ' + + +class VDBRetrieval(object): + + def __init__(self, api_key=None, api_url=None): + """ + This gets either trans.x.y from call in versioned_data.xml, + or it gets a call with api_key and api_url from versioned_data.py + + @param api_key_path string File path to temporary file containing user's galaxy api_key + @param api_url string contains http://[ip]:[port] for handling galaxy api calls. + + """ + # Initialized constants during the life of a request: + self.global_retrieval_date = None + self.library_id = None + self.history_id = None + self.data_stores = [] + + # Entire json library structure. item.url, type=file|folder, id, name (library path) + # Note: changes to library during a request aren't reflected here. + self.library = None + + self.user_api_key = None + self.user_api = None + self.admin_api_key = None + self.admin_api = None + self.api_url = None + + + def set_trans(self, api_url, history_id, user_api_key=None): #master_api_key=None, + """ + Used only on initial presentation of versioned_data.xml form. Doesn't need admin_api + """ + self.history_id = history_id + self.api_url = api_url + self.user_api_key = user_api_key + #self.master_api_key = master_api_key + + self.set_user_api() + self.set_admin_api() + self.set_datastores() + + + def set_api(self, api_info_path): + """ + "api_info_path" is provided only when user submits tool via versioned_data.py call. + It encodes both the api_url and the history_id of current session + Only at this point will we need the admin_api, so it is looked up below. + + """ + + with open(api_info_path, 'r') as access: + + self.user_api_key = access.readline().strip() + #self.master_api_key = access.readline().strip() + api_info = access.readline().strip() #[api_url]-[history_id] + self.api_url, self.history_id = api_info.split('-') + + self.set_user_api() + self.set_admin_api() + self.set_datastores() + + + def set_user_api(self): + """ + Note: error message tacked on to self.data_stores for display back to user. + """ + self.user_api = GalaxyInstance(url=self.api_url, key=self.user_api_key) + + if not self.user_api: + self.data_stores.append({'name':'Error: user Galaxy API connection was not set up correctly. Try getting another user API key.', 'id':'none'}) + return + + + def set_datastores(self): + """ + Provides the list of data stores that users can select versions from. + Note: error message tacked on to self.data_stores for display back to user. + """ + # Look for library called "Versioned Data" + try: + libs = self.user_api.libraries.get_libraries(name=VDB_DATA_LIBRARY, deleted=False) + except Exception as err: + # This is the first call to api so api url or authentication erro can happen here. 
+ self.data_stores.append({'name':'Error: Unable to make API connection: ' + err.message, 'id':'none'}) + return + + found = False + for lib in libs: + if lib['deleted'] == False: + found = True + self.library_id = lib['id'] + break; + + if not found: + self.data_stores.append({'name':'Error: Data Library [%s] needs to be set up by a galaxy administrator.' % VDB_DATA_LIBRARY, 'id':'none'}) + return + + try: + + if self.admin_api: + self.library = self.admin_api.libraries.show_library(self.library_id, contents=True) + else: + self.library = self.user_api.libraries.show_library(self.library_id, contents=True) + + except Exception as err: + # If data within a library is somehow messed up (maybe user has no permissions?), this can generate a bioblend errorapi. + if err.message[-21:] == 'HTTP status code: 403': + self.data_stores.append({'name':'Error: [%s] library needs permissions adjusted so users can view it.' % VDB_DATA_LIBRARY , 'id':'none'}) + else: + self.data_stores.append({'name':'Error: Unable to get [%s] library contents: %s' % (VDB_DATA_LIBRARY, err.message) , 'id':'none'}) + return + + # Need to ensure it is sorted folder/file wise such that folders listed by date/id descending (name leads with version date/id) files will follow). + self.library = sorted(self.library, key=lambda x: x['name'], reverse=False) + + # Gets list of data stores + # For given library_id (usually called "Versioned Data"), retrieves folder/name + # for any folder containing a data source specification file. A folder should + # have at most one of these. It indicates the storage method used for the folder. + + for item in self.library: + if item['type'] == "file" and self.test_data_store_type(item['name']): + # Returns id of specification file that points to data source. + self.data_stores.append({ + 'name':os.path.dirname(item['name']), + 'id':item['id'] + }) + + + + def set_admin_api(self): + + # Now fetch admin_api_key from disk, or regenerate user account and api from scratch. + if os.path.isfile(VDB_ADMIN_API_KEY_PATH): + + with open(VDB_ADMIN_API_KEY_PATH, 'r') as access: + self.admin_api_key = access.readline().strip() + self.api_url = access.readline().strip() + + else: + # VERIFY THAT USER IS AN ADMIN + user = self.user_api.users.get_current_user() + if user['is_admin'] == False: + print "Unable to establish the admin api: you need to be in the admin_user=... list in galaxy config." + sys.exit(1) + """ Future: will master API be able to do... + #if not self.master_api_key: + # print "Unable to establish the admin api: no existing path to config file, and no master_api_key." + self.master_api_key + # sys.exit(1) + # Generate from scratch: + #master_api = GalaxyInstance(url=self.api_url, key=self.master_api_key) + #users = master_api.users.get_users(deleted=False) + """ + users = self.user_api.users.get_users(deleted=False) + for user in users: + + if user['email'] == VDB_ADMIN_API_EMAIL: + self.admin_api_key = self.user_api.users.create_user_apikey(user['id']) + + if not self.admin_api_key: + #Create admin api access account with dummy email address and reliable but secure password: + # NOTE: this will only be considered an admin account if it is listed in galaxy config file as one. 
+ random_password = ''.join([choice(string.letters + string.digits) for i in range(15)]) + api_admin_user = self.user_api.users.create_local_user(VDB_ADMIN_API_USER, VDB_ADMIN_API_EMAIL, random_password) + self.admin_api_key = self.user_api.users.create_user_apikey(api_admin_user['id']) + + with open(VDB_ADMIN_API_KEY_PATH, 'w') as access: + access.write(self.admin_api_key + '\n' + self.api_url) + + self.admin_api = GalaxyInstance(url=self.api_url, key=self.admin_api_key) + + if not self.admin_api: + print 'Error: admin Galaxy API connection was not set up correctly. Admin user should be ' + VDB_ADMIN_API_EMAIL + print "Unexpected error:", sys.exc_info()[0] + sys.exit(1) + + + def get_data_store_gateway(self, type, spec_file_id): + # NOTE THAT PYTHON NEVER TIMES OUT FOR THESE CALLS - BUT IT WILL TIME OUT FOR API CALLS. + # FUTURE: Adapt this so that any modules in data_stores/ folder are usable + # e.g. https://bbs.archlinux.org/viewtopic.php?id=109561 + # http://stackoverflow.com/questions/301134/dynamic-module-import-in-python + + # ****************** GIT ARCHIVE **************** + if type == "git": + import data_stores.vdb_git + return data_stores.vdb_git.VDBGitDataStore(self, spec_file_id) + + # ****************** Kipper ARCHIVE **************** + elif type == "kipper": + import data_stores.vdb_kipper + return data_stores.vdb_kipper.VDBKipperDataStore(self, spec_file_id) + + # ****************** FILE FOLDER ****************** + elif type == "folder": + import data_stores.vdb_folder + return data_stores.vdb_folder.VDBFolderDataStore(self, spec_file_id) + + # ****************** BIOMAJ FOLDER ****************** + elif type == "biomaj": + import data_stores.vdb_biomaj + return data_stores.vdb_biomaj.VDBBiomajDataStore(self, spec_file_id) + + else: + print 'Error: %s not recognized as a valid data store type.' % type + sys.exit( 1 ) + + + #For a given path leading to pointer.[git|kipper|folder|biomaj] returns suffix + def test_data_store_type(self, file_name, file_path=None): + if file_path and not os.path.isfile(file_path): + return False + + suffix = file_name.rsplit('.',1) + if len(suffix) > 1 and suffix[1] in VDB_STORAGE_OPTIONS: + return suffix[1] + + return False + + + + + + def get_library_data_store_list(self): + """ + For display on tool form, returns names, ids of specification files that point to data sources. + + @return dirs array of [[folder label], [folder_id, selected]...] + """ + dirs = [] + # Gets recursive contents of library - files and folders + for item in self.data_stores: + dirs.append([item['name'], item['id'], False]) + + return dirs + + + def get_library_label_path(self, spec_file_id): + for item in self.data_stores: + if item['id'] == spec_file_id: + return item['name'] + + return None + + + def get_library_folder_datasets(self, library_version_path, admin=False): + """ + Gets set of ALL dataset FILES within folder - INCLUDING SUBFOLDERS - by searching + through a library, examining each item's full hierarchic label + BUT CURRENTLY: If any file has state='error' the whole list is rejected (and regenerated). + + WISHLIST: HAVE AN API FUNCTION TO GET ONLY A GIVEN FOLDER'S (BY ID) CONTENTS! + + @param library_version_path string Full hierarchic label of a library file or folder. + + @return array of ldda_id library dataset data association ids. 
+ """ + + if admin: + api_handle = self.admin_api + else: + api_handle = self.user_api + + count = 0 + while count < 4: + try: + items = api_handle.libraries.show_library(self.library_id, True) + break + except ChunkedEncodingError: + print "Error: Trying to fetch Versioned Data library listing. Try [" + str(count) + "]" + time.sleep (2) + pass + + count +=1 + + datasets = [] + libvpath_len = len(library_version_path) + 1 + for item in items: + if item['type'] == "file": + name = item['name'] + # need slash or else will match to similar prefixes. + if name[0:libvpath_len] == library_version_path + '/': + + # ISSUE seems to be that input library datasets can be queued / running, and this MUST wait till they are finished or it will plow ahead prematurely. + + count = 0 + + while count < 10: + + try: + lib_dataset = api_handle.libraries.show_dataset(self.library_id, item['id']) + break + + except: + print "Unexpected error:", sys.exc_info()[0] + sys.exit(1) + + if lib_dataset['state'] == 'running': + time.sleep(10) + count +=1 + continue + + elif lib_dataset['state'] == 'queued': + + # FUTURE: Check date. If it is really stale it should be killed? + print 'Note: library folder dataset item "%s" is [%s]. Please wait until it is finished processing, or have a galaxy administrator delete the dataset if its creation has failed.' % (name, lib_dataset['state']) + sys.exit(1) + + elif lib_dataset['state'] != 'ok' or not os.path.isfile(lib_dataset['file_name']): + print 'Note: library folder dataset "%s" had an error during job. Its state was [%s]. Regenerating.' % (name, lib_dataset['state']) + self.admin_api.libraries.delete_library_dataset(self.library_id, item['id'], purged=True) + return [] + + else: + break + + datasets.append(item['id']) + + + return datasets + + + def get_library_version_datasets(self, library_version_path, base_folder_id='', version_label='', version_path=''): + """ + Check if given library has a folder for given version_path. If so: + - and it has content, return its datasets. + - otherwise refetch content for verison folder + If no folder, populate the version folder with data from the archive and return those datasets. + Version exists in external cache (or in case of unlinked folder, in EXISTING galaxy library folder). + Don't call unless version_path contents have been established. + + @param library_version_path string Full hierarchic label of a library file or folder with version id. + + For creation: + @param base_folder_id string a library folder id under which version files should exist + @param version_label string Label to give newly created galaxy library version folder + @param version_path string Data source folder to retrieve versioned data files from + + @return array of dataset + """ + + + # Pick the first folder of any that match given 'Versioned Data/.../.../[version id]' path. + # This case will always match 'folder' data store: + + folder_matches = self.get_folders(name=library_version_path) + + if len(folder_matches): + + folder_id = folder_matches[0]['id'] + dataset_ids = self.get_library_folder_datasets(library_version_path) + + if len(dataset_ids) > 0: + + return dataset_ids + + if os.listdir(version_path) == []: + # version_path doesn't exist for 'folder' data store versions that are datasets directly in library (i.e. not linked) + print "Error: the data store didn't return any content for given version id. Looked in: " + version_path + sys.exit(1) + + # NOTE ONE 3rd party COMMENT THAT ONE SHOULD PUT IN file_type='fasta' FOR LARGE FILES. 
Problem with that is that then galaxy can't recognize other data types. + library_folder_datasets = self.admin_api.libraries.upload_from_galaxy_filesystem(self.library_id, version_path, folder_id, link_data_only=True, roles=None) + + + else: + if base_folder_id == '': #Normally shouldn't happen + + print "Error: no match to given version folder for [" + library_version_path + "] but unable to create one - missing parent folder identifier" + return [] + + # Provide archive folder with datestamped name and version (folderNew has url, id, name): + folderNew = self.admin_api.libraries.create_folder(self.library_id, version_label, description=VDB_CACHED_DATA_LABEL, base_folder_id=base_folder_id) + folder_id = str(folderNew[0]['id']) + + # Now link results to suitably named galaxy library dataset + # Note, this command links to EVERY file/folder in version_folder source. + # Also, Galaxy will strip off .gz suffixes - WITHOUT UNCOMPRESSING FILES! + # So, best to prevent data store from showing .gz files in first place + try: + library_folder_datasets = self.admin_api.libraries.upload_from_galaxy_filesystem(self.library_id, version_path, folder_id, link_data_only=True, roles=None) + + except: + # Will return error if version_path folder is empty or kipper unable to create folder or db due to permissions etc. + print "Error: a permission or other error was encountered when trying to retrieve version data for version folder [" + version_path + "]: Is the [%s] listed in galaxy config admin_users list?" % VDB_ADMIN_API_EMAIL, sys.exc_info()[0] + sys.exit(1) + + + library_dataset_ids = [dataset['id'] for dataset in library_folder_datasets] + + # LOOP WAITS UNTIL THESE DATASETS ARE UPLOADED. + # They still take time even for linked big data probably because they are read for metadata. + # Not nice that user doesn't see process as soon as it starts, but timeout possibilities + # later on down the line are more difficult to manage. + for dataset_id in library_dataset_ids: + # ten seconds x 60 = 6 minutes; should be longer? + for count in range(60): + try: + lib_dataset = self.admin_api.libraries.show_dataset(self.library_id, dataset_id) + break + + except: + print "Unexpected error:", sys.exc_info()[0] + continue + + if lib_dataset['state'] in 'running queued': + time.sleep(10) + count +=1 + continue + else: + # Possibly in a nice "ok" or not nice state here. + break + + + return library_dataset_ids + + + def get_folders(self, name): + """ + ISSUE: Have run into this sporadic error with a number of bioblend api calls. Means api calls may need to be wrapped in a retry mechanism: + File "/usr/lib/python2.6/site-packages/requests/models.py", line 656, in generate + raise ChunkedEncodingError(e) + requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(475 bytes read)', IncompleteRead(475 bytes read)) + """ + for count in range(3): + try: + return self.user_api.libraries.get_folders(self.library_id, name=name ) + break + + except: + print 'Try (%s) to fetch library folders for "%s"' % (str(count), name) + print sys.exc_info()[0] + time.sleep (5) + + print "Failed after (%s) tries!" % (str(count)) + return None + + + def get_library_folder(self, library_path, relative_path, relative_labels): + """ + Check if given library has folder that looks like library_path + relative_path. + If not, create and return resulting id. Used for cache creation. + Ignores bad library_path. + + @param library_path string Full hierarchic label of a library folder. 
NOTE: Library_path must have leading forward slash for a match, i.e. /derivative_path + @param relative_path string branch of folder tree stemming from library_path + @param relative_labels string label for each relative_path item + + @return folder_id + """ + created = False + root_match = self.get_folders( name=library_path) + + if len(root_match): + base_folder_id=root_match[0]['id'] + + relative_path_array = relative_path.split('/') + relative_labels_array = relative_labels.split('/') + + for ptr in range(len (relative_path_array)): + + _library_path = os.path.join(library_path, '/'.join(relative_path_array[0:ptr+1])) + folder_matches = self.get_folders( name=_library_path) + + if len(folder_matches): + folder_id = folder_matches[0]['id'] + else: + dataset_key = relative_path_array[ptr] + label = relative_labels_array[ptr] + folder_new = self.admin_api.libraries.create_folder(self.library_id, dataset_key, description=label, base_folder_id=base_folder_id) + folder_id = str(folder_new[0]['id']) + + base_folder_id = folder_id + + return folder_id + + return None + + + def get_library_folders(self, library_label_path): + """ + Gets set of ALL folders within given library path. Within each folder, lists its files as well. + Folders are ordered by version date/id, most recent first (natural sort). + + NOT Quite recursive. Nested folders don't have parent info. + + @param library_version_path string Full hierarchic label of a library folder. Inside it are version subfolders, their datasets, and the pointer file. + + @return array of ids of the version subfolders and also their dataset content ids + """ + + folders = [] + libvpath_len = len(library_label_path) + for item in self.library: + + name = item['name'] + if name[0:libvpath_len] == library_label_path: + + # Skip any file that is immediately under library_label_path + if item['type'] == 'file': + file_key_val = item['name'].rsplit('/',1) + #file_name_parts = file_key_val[1].split('.') + if file_key_val[0] == library_label_path: + #and len(file_name_parts) > 1 \ + #and file_name_parts[1] in VDB_STORAGE_OPTIONS: + continue + + if item['type'] == 'folder': + folders.append({'id':item['id'], 'name':item['name'], 'files':[]}) + + else: + # Items should be sorted ascending such that each item is contained in previous folder. + folders[-1]['files'].append({'id':item['id'], 'name':item['name']}) + + return folders + + + def get_workflow_data(self, workflow_list, datasets, version_id): + """ + Run each workflow in turn, given datasets generated above. + See if each workflow's output has been cached. + If not, run workflow and reestablish output data + Complexity is that cache could be: + 1) in user's history. + 2) in library data folder called "derivative_cache" under data source folder (as created by this galaxy install) + 3) in external data folder ..."/derivative_cache" (as created by this galaxy install) + BUT other galaxy installs can't really use this unless they know metadata on workflow that generated it + In future we'll design a system for different galaxies to be able to read metadata to determine if they can use the cached workflow data here. + + ISSUE Below: Unless it is a really short workflow, run_workflow() returns before work is actually complete. DO WE NEED TO DELAY UNTIL EVERY SINGLE OUTPUT DATASET IS "ok", not just "queued" or "running"? OR IS SERVER TO LIBRARY UPLOAD PAUSE ABOVE ENOUGH? 
+ + Note, workflow_list contains only ids for items beginning with "versioning: " + FUTURE IMPROVEMENT: LOCK WORKFLOW: VULNERABILITY: IF WORKFLOW CHANGES, THAT AFFECTS REPRODUCABILITY. + + FUTURE: NEED TO ENSURE EACH dataset id not found in history is retrieved from cache. + FUTURE: Check to see that EVERY SINGLE workflow output + has a corresponding dataset_id in history or library, + i.e. len(workflow['outputs']) == len(history_dataset_ids) + But do we know before execution how many outputs (given conditional output?) + + @param workflow_list + @param datasets: an array of correct data source versioned datasets that are inputs to tools and workflows + @param version_id + + """ + for workflow_id in workflow_list.split(): + + workflows = self.admin_api.workflows.get_workflows(workflow_id, published=True) + + if not len(workflows): + # Error occurs if admin_api user doesn't have permissions on this workflow??? + # Currently all workflows have to be shared with VDB_ADMIN_API_EMAIL. + # Future: could get around this by using publicly shared workflows via "import_shared_workflow(workflow_id)" + print 'Error: unable to run workflow - has it been shared with the Versioned Data tool user email address "%s" ?' % VDB_ADMIN_API_EMAIL + sys.exit(1) + + for workflow_summary in workflows: + + workflow = self.admin_api.workflows.show_workflow(workflow_id) + print 'Doing workflow: "' + workflow_summary['name'] + '"' + + if len(workflow['inputs']) == 0: + print "ERROR: This workflow is not configured correctly - it needs at least 1 input dataset step." + + # FUTURE: Bring greater intelligence to assigning inputs to workflow?!!! + if len(datasets) < len(workflow['inputs']): + + print 'Error: workflow requires more inputs (%s) than are available in retrieved datasets (%s) for this version of retrieved data.' % (len(workflow['inputs']), len(datasets)) + sys.exit(1) + + codings = self.get_codings(workflow, datasets) + (workflow_input_key, workflow_input_label, annotation_key, dataset_map) = codings + + history_dataset_ids = self.get_history_workflow_results(annotation_key) + + if not history_dataset_ids: + + library_cache_path = os.path.join("/", VDB_WORKFLOW_CACHE_FOLDER_NAME, workflow_id, workflow_input_key) + + # This has to be privileged api admin fetch. + library_dataset_ids = self.get_library_folder_datasets(library_cache_path, admin=True) + + if not len(library_dataset_ids): + # No cache in library so run workflow + + # Create admin_api history + admin_history = self.admin_api.histories.create_history() + admin_history_id = admin_history['id'] + + # If you try to run a workflow that hasn't been shared with you, it seems to go a bit brezerk. + work_result = self.admin_api.workflows.run_workflow(workflow_id, dataset_map=dataset_map, history_id=admin_history_id) + + # Then copy (link) results back to library so can match in future + self.cache_datasets(library_cache_path, work_result, workflow_summary, codings, version_id, admin_history_id) + + # Now return the new cached library dataset ids: + library_dataset_ids = self.get_library_folder_datasets(library_cache_path, admin=True) + """ If a dataset is purged, its purged everywhere... so don't purge! Let caching system do that. + THIS APPEARS TO HAPPEN TOO QUICKLY FOR LARGE DATABASES; LEAVE IT TO CACHING MECHANISM TO CLEAR. OR ABOVE FIX TO WAIT UNTIL DS IS OK. + self.admin_api.histories.delete_history(admin_history_id, purge=False) + """ + + # Now link library cache workflow results to history and add key there for future match. 
+ self.update_history(library_dataset_ids, annotation_key, version_id) + + + + + def update_history(self, library_dataset_ids, annotation, version_id): + """ + Copy datasets from library over to current history if they aren't already there. + Must cycle through history datasets, looking for "copied_from_ldda_id" value. This is available only with details view. + + @param library_dataset_ids array List of dataset Ids to copy from library folder + @param annotation string annotation to add (e.g. Path of original version folder added as annotation) + @param version_id string Label to add to copied dataset in user's history + """ + history_datasets = self.user_api.histories.show_history(self.history_id, contents=True, deleted=False, visible=True, details='all' , types=None) # , + + datasets = [] + for dataset_id in library_dataset_ids: + # USING ADMIN_API because that's only way to get workflow items back... user_api doesn't nec. have view rights on newly created workflow items. Only versioneddata@localhost.com has perms. + ld_dataset = self.admin_api.libraries.show_dataset(self.library_id, dataset_id) + + if not ld_dataset['state'] in 'ok running queued': + + print "Error when linking to library dataset cache [" + ld_dataset['name'] + ", " + ld_dataset['id'] + "] - it isn't in a good state: " + ld_dataset['state'] + sys.exit(1) + + if not os.path.isfile(ld_dataset['file_name']): + pass + #FUTURE: SHOULD TRIGGER LIBRARY REGENERATION OF ITEM? + + library_ldda_id = ld_dataset['ldda_id'] + + # Find out if library dataset item is already in history, and if so, just return that item. + dataset = None + for dataset2 in history_datasets: + + if 'copied_from_ldda_id' in dataset2 \ + and dataset2['copied_from_ldda_id'] == library_ldda_id \ + and dataset2['state'] in 'ok running' \ + and dataset2['accessible'] == True: + dataset = dataset2 + break + + if not dataset: # link in given dataset from library + + dataset = self.user_api.histories.upload_dataset_from_library(self.history_id, dataset_id) + + # Update dataset's label - not necessary, just hinting at its creation. + new_name = dataset['name'] + if dataset['name'][-len(version_id):] != version_id: + new_name += ' ' + version_id + + self.user_api.histories.update_dataset(self.history_id, dataset['id'], name=new_name, annotation = annotation) + + datasets.append({ + 'id': dataset['id'], + 'ld_id': ld_dataset['id'], + 'name': dataset['name'], + 'ldda_id': library_ldda_id, + 'library_dataset_name': ld_dataset['name'], + 'state': ld_dataset['state'] + }) + + return datasets + + + def get_codings(self, workflow, datasets): + """ + Returns a number of coded lists or arrays for use in caching or displaying workflow results. + Note: workflow['inputs'] = {u'23': {u'label': u'Input Dataset', u'value': u''}}, + Note: step_id is not incremental. + Note: VERY COMPLICATED because of hda/ldda/ld ids + + FUTURE: IS METADATA AVAILABLE TO BETTER MATCH WORKFLOW INPUTS TO DATA SOURCE RECALL VERSIONS? + ISSUE: IT IS ASSUMED ALL INPUTS TO WORKFLOW ARE AVAILABLE AS DATASETS BY ID IN LIBRARY. I.e. + one can't have a workflow that also makes reference to another just-generated file in user's + history. 
+ """ + db_ptr = 0 + dataset_map = {} + workflow_input_key = [] + workflow_input_labels = [] + + for step_id, ds_in in workflow['inputs'].iteritems(): + input_dataset_id = datasets[db_ptr]['ld_id'] + ldda_id = datasets[db_ptr]['ldda_id'] + dataset_map[step_id] = {'src': 'ld', 'id': input_dataset_id} + workflow_input_key.append(ldda_id) #like dataset_index but from workflow input perspective + workflow_input_labels.append(datasets[db_ptr]['name']) + db_ptr += 1 + + workflow_input_key = '_'.join(workflow_input_key) + workflow_input_labels = ', '.join(workflow_input_labels) + annotation_key = workflow['id'] + ":" + workflow_input_key + + return (workflow_input_key, workflow_input_labels, annotation_key, dataset_map) + + + def get_history_workflow_results(self, annotation): + """ + See if workflow-generated dataset exists in user's history. The only way to spot this + is to find some dataset in user's history that has workflow_id in its "annotation" field. + We added the specific dataset id's that were used as input to the workflow as well as the + workflow key since same workflow could have been run on different inputs. + + @param annotation_key string Contains workflow id and input dataset ids.. + """ + history_datasets = self.user_api.histories.show_history(self.history_id, contents=True, deleted=False, visible=True, details='all') # , types=None + dataset_ids = [] + + for dataset in history_datasets: + if dataset['annotation'] == annotation: + if dataset['accessible'] == True and dataset['state'] == 'ok': + dataset_ids.append(dataset['id']) + else: + print "Warning: dataset " + dataset['name'] + " is in an error state [ " + dataset['state'] + "] so skipped!" + + return dataset_ids + + + def cache_datasets(self, library_cache_path, work_result, workflow_summary, codings, version_id, history_id): + """ + Use the Galaxy API to LINK versioned data api admin user's history workflow-created item(s) into the appropriate Versioned Data Workflow Cache folder. Doing this via API call so that metadata is preserved, e.g. preserving that it is a product of makeblastdb/formatdb and all that entails. Only then does Galaxy remain knowledgeable about datatype/data collection. + + Then user gets link to workflow dataset in their history. (If a galaxy user deletes a workflow dataset in their history they actually only deletes their history link to that dataset. True of api admin user?) + + FUTURE: have the galaxy-created data shared from a server location? + """ + + (workflow_input_key, workflow_input_label, annotation_key, dataset_map) = codings + + # This will create folder if it doesn't exist: + _library_cache_labels = os.path.join("/", VDB_WORKFLOW_CACHE_FOLDER_NAME, workflow_summary['name'], 'On ' + workflow_input_label) + folder_id = self.get_library_folder("/", library_cache_path, _library_cache_labels) + if not folder_id: # Case should never happen + print 'Error: unable to determine library folder to place cache in:' + library_cache_path + sys.exit(1) + + + for dataset_id in work_result['outputs']: + # We have to mark each dataset entry with the Workflow ID and input datasets it was generated by. + # No other way to know they are associated. ADD VERSION ID TO END OF workflowinput_label? 
+ label = workflow_summary['name'] +' on ' + workflow_input_label + + # THIS WILL BE IN ADMIN API HISTORY + self.admin_api.histories.update_dataset(history_id, dataset_id, annotation = annotation_key, name=label) + + # Upload dataset_id and give it description 'cached data' + if 'copy_from_dataset' in dir(self.admin_api.libraries): + # IN BIOBLEND LATEST: + self.admin_api.libraries.copy_from_dataset(self.library_id, dataset_id, folder_id, VDB_CACHED_DATA_LABEL + ": version " + version_id) + else: + self.library_cache_setup_privileged(folder_id, dataset_id, VDB_CACHED_DATA_LABEL + ": version " + version_id) + + + + def library_cache_setup_privileged(self, folder_id, dataset_id, message): + """ + Copy a history HDA into a library LDDA (that the current admin api user has add permissions on) + in the given library and library folder. Requires that dataset_id has been created by admin_api_key user. Nicola Soranzo [nicola.soranzo@gmail.com will be adding to BIOBLEND eventually. + + We tried linking a Versioned Data library Workflow Cache folder to the dataset(s) a non-admin api user has just generated. It turns out API user that connects the two must be both a Library admin AND the owner of the history dataset being uploaded, or an error occurs. So system can't do action on behalf of non-library-privileged user. Second complication with that approach is that there is no Bioblend API call - one must do this directly in galaxy API via direct URL fetc. + + NOTE: This will raise "HTTPError(req.get_full_url(), code, msg, hdrs, fp)" if given empty folder_id for example + + @see def copy_hda_to_ldda( library_id, library_folder_id, hda_id, message='' ): + @see https://wiki.galaxyproject.org/Events/GCC2013/TrainingDay/API?action=AttachFile&do=view&target=lddas_1.py + + @uses library_id: the id of the library which we want to query. + + @param dataset_id: the id of the user's history dataset we want to copy into the library folder. + @param folder_id: the id of the library folder to copy into. + @param message: an optional message to add to the new LDDA. + """ + + + + full_url = self.api_url + '/libraries' + '/' + self.library_id + '/contents' + url = self.make_url( self.admin_api_key, full_url ) + + post_data = { + 'folder_id' : folder_id, + 'create_type' : 'file', + 'from_hda_id' : dataset_id, + 'ldda_message' : message + } + + req = urllib2.Request( url, headers = { 'Content-Type': 'application/json' }, data = json.dumps( post_data ) ) + #try: + + results = json.loads( urllib2.urlopen( req ).read() ) + return + + + #Expecting to phase this out with bioblend api call for library_cache_setup() + def make_url(self, api_key, url, args=None ): + # Adds the API Key to the URL if it's not already there. + if args is None: + args = [] + argsep = '&' + if '?' not in url: + argsep = '?' + if '?key=' not in url and '&key=' not in url: + args.insert( 0, ( 'key', api_key ) ) + return url + argsep + '&'.join( [ '='.join( t ) for t in args ] ) + + + diff -r d31a1bd74e63 -r 5c5027485f7d versioned_data.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/versioned_data.py Sun Aug 09 16:07:50 2015 -0400 @@ -0,0 +1,149 @@ +#!/usr/bin/python +import os +import optparse +import sys +import time +import re + +import vdb_common +import vdb_retrieval + +class MyParser(optparse.OptionParser): + """ + From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output + Provides a better display of formatted help info in epilog() portion of optParse. 
+ """ + def format_epilog(self, formatter): + return self.epilog + + +def stop_err( msg ): + sys.stderr.write("%s\n" % msg) + sys.exit(1) + + +class ReportEngine(object): + + def __init__(self): pass + + def __main__(self): + + options, args = self.get_command_line() + retrieval_obj = vdb_retrieval.VDBRetrieval() + retrieval_obj.set_api(options.api_info_path) + + retrievals=[] + + for retrieval in options.retrievals.strip().strip('|').split('|'): + # Normally xml form supplies "spec_file_id, [version list], [workflow_list]" + params = retrieval.strip().split(',') + + spec_file_id = params[0] + + if spec_file_id == 'none': + print 'Error: Form was selected without requesting a data store to retrieve!' + sys.exit( 1 ) + + # STEP 1: Determine data store type and location + data_store_spec = retrieval_obj.user_api.libraries.show_folder(retrieval_obj.library_id, spec_file_id) + data_store_type = retrieval_obj.test_data_store_type(data_store_spec['name']) + base_folder_id = data_store_spec['folder_id'] + + if not data_store_type: + print 'Error: unrecognized data store type [' + data_store_type + ']' + sys.exit( 1 ) + + ds_obj = retrieval_obj.get_data_store_gateway(data_store_type, spec_file_id) + + if len(params) > 1 and len(params[1].strip()) > 0: + _versionList = params[1].strip() + version_id = _versionList.split()[0] # VersionList SHOULD just have 1 id + else: + # User didn't select version_id via "Add new retrieval" + if options.globalRetrievalDate: + _retrieval_date = vdb_common.parse_date(options.globalRetrievalDate) + version_id = ds_obj.get_version_options(global_retrieval_date=_retrieval_date, selection=True) + + else: + version_id = '' + + # Reestablishes file(s) if they don't exist on disk. Do data library links to it as well. + ds_obj.get_version(version_id) + if ds_obj.version_path == None: + + print "Error: unable to retrieve version [%s] from %s archive [%s]. Archive doesn't contain this version id?" % (version_id, data_store_type, ds_obj.library_version_path) + sys.exit( 1 ) + + # Version data file(s) are sitting in [ds_obj.version_path] ready for retrieval. + library_dataset_ids = retrieval_obj.get_library_version_datasets(ds_obj.library_version_path, base_folder_id, ds_obj.version_label, ds_obj.version_path) + + # The only thing that doesn't have cache lookup is "folder" data that isn't linked in. + # In that case try lookup directly. + if len(library_dataset_ids) == 0 and data_store_type == 'folder': + library_version_datasets = retrieval_obj.get_library_folder_datasets(ds_obj.library_version_path) + library_dataset_ids = [item['id'] for item in library_version_datasets] + + if len(library_dataset_ids) == 0: + + print 'Error: unable to retrieve version [%s] from %s archive [%s] ' % (version_id, data_store_type, ds_obj.library_version_path) + sys.exit( 1 ) + + # At this point we have references to the galaxy ids of the requested versioned dataset, after regeneration + versioned_datasets = retrieval_obj.update_history(library_dataset_ids, ds_obj.library_version_path, version_id) + + if len(params) > 2: + + workflow_list = params[2].strip() + + if len(workflow_list) > 0: + # We have workflow run via admin_api and admin_api history. + retrieval_obj.get_workflow_data(workflow_list, versioned_datasets, version_id) + + + result=retrievals + + # Output file needs to exist. Otherwise Galaxy doesn't generate a placeholder file name for the output, and so we can't do things like check for [placeholder name]_files folder. Add something to report on? 
+ with open(options.output,'w') as fw:
+ fw.writelines(result)
+
+
+ def get_command_line(self):
+ ## *************************** Parse Command Line *****************************
+ parser = MyParser(
+ description = 'This Galaxy tool retrieves versions of prepared data sources and places them in a Galaxy "Versioned Data" library',
+ usage = 'python versioned_data.py [options]',
+ epilog="""Details:
+
+ This tool retrieves links to current or past versions of fasta (or other key-value text) databases from a cache kept in the data library called "Versioned Data". It then places them into the current history so that subsequent tools can work with that data.
+ """)
+
+ parser.add_option('-r', '--retrievals', type='string', dest='retrievals',
+ help='List of data sources and their versions and Galaxy workflows to return')
+
+ parser.add_option('-o', '--output', type='string', dest='output',
+ help='Path of output log file to create')
+
+ parser.add_option('-O', '--output_id', type='string', dest='output_id',
+ help='Output identifier')
+
+ parser.add_option('-d', '--date', type='string', dest='globalRetrievalDate',
+ help='Provide a date/time for data recall. Defaults to now.')
+
+ parser.add_option('-v', '--version', dest='version', default=False, action='store_true',
+ help='Version number of this program.')
+
+ parser.add_option('-s', '--api_info_path', type='string', dest='api_info_path', help='Galaxy user api key/path.')
+
+ return parser.parse_args()
+
+
+
+if __name__ == '__main__':
+
+ time_start = time.time()
+
+ reportEngine = ReportEngine()
+ reportEngine.__main__()
+
+ print('Execution time (seconds): ' + str(int(time.time()-time_start)))
+
diff -r d31a1bd74e63 -r 5c5027485f7d versioned_data.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/versioned_data.xml Sun Aug 09 16:07:50 2015 -0400
@@ -0,0 +1,124 @@
+ 
+ Retrieve versioned sequence files and/or their blast, bowtie, etc. database indexes
+ 
+ versioned_data.py
+ bccdc_macros.xml
+ 
+ 
+ 
+ #assert $__user__, Exception( 'You must be logged in to use this tool.' )
+ versioned_data.py
+ #if $globalRetrievalDate.strip() > ''
+ -d "$globalRetrievalDate"
+ #end if
+ -r
+ "
+ #for $v in $versions:
+ ${v.database},
+ #for $r in $v.retrieval:
+ ${r.retrievalId}
+ #end for
+ ,
+ #for $w in $v.workflows:
+ ${w.workflow}
+ #end for
+ |
+ #end for
+ "
+ -o "$log"
+ -O "$__app__.security.encode_id($log.id)"
+ --api_info_path "$api_info_path" ##Actually a file path to a configfile that holds the api key
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ ${__user__.api_keys[0].key}
+ $api_info
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+ 
+
+.. class:: infomark
+
+
+**What it does**
+
+This tool retrieves links to current or past versions of fasta or other types of
+data from a cache kept in the Galaxy data library called "Versioned Data". It then places
+them into one's current history so that subsequent tools can work with that data.
+
+For example, after using this tool to select a version of the NCBI nt database, a blast search can be carried out on it by selecting "BLAST database from your history" from the "Subject database/sequences" field of the NCBI BLAST+ search tool.
+
+You can select one or more files or databases by version date or id. This list
+is supplied from the Shared Data > Data Libraries > Versioned Data folder that has
+been set up by an administrator.
+
+The Workflows section allows you to select one or more pre-defined workflows
+to execute on the versioned data.
+The results are placed in your history for use
+by other tools or workflows.
+
+A caching system caches the versioned data and any workflow results that the tool generates.
+If you request versioned data or derivative data that isn't cached yet, it may take time to regenerate.
+
+The top-level "Global retrieval date [YYYY-MM-DD]" field that the form starts with will be applied to
+all selected databases. This can be overridden by a retrieval date or version that
+you supply for a particular database. Leave it and any "Retrievals" inputs empty if you just need the latest version of the selected databases.
+
+-------
+
+.. class:: warningmark
+
+**Note**
+
+Again, some past database versions can take time to regenerate if there is no cached version available. For example, NCBI nt is a 50+ gigabyte file that needs to be read through to produce a fasta version, and a makeblastdb workflow on top of that can take hours on the first call. Access to cached versions is immediate.
+
+Setup of versioned data sources and workflow options can only be done by a Galaxy administrator.
+
+-------
+
+**References**
+
+If you use this Galaxy tool in work leading to a scientific publication please
+cite the following paper:
+
+*Reference coming soon...*
+
+
+
diff -r d31a1bd74e63 -r 5c5027485f7d versioned_data_cache_clear.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/versioned_data_cache_clear.py Sun Aug 09 16:07:50 2015 -0400
@@ -0,0 +1,130 @@
+#!/usr/bin/python
+
+"""
+****************************** versioned_data_cache_clear.py ******************************
+ Call this script directly to clear out all but the latest cached version folders in the
+ Galaxy "Versioned Data" data library and in the server data stores.
+
+ SUGGEST RUNNING THIS AS THE GALAXY USER OR A LESS PRIVILEGED USER, but the
+ versioneddata_api_key file does need to be readable by that user.
+
+"""
+import vdb_retrieval
+import vdb_common
+import glob
+import os
+
+# Note that globals from vdb_retrieval can be referenced by prefixing them with vdb_retrieval.XYZ
+# Note that this script uses the admin_api established in vdb_retrieval.py
+
+retrieval_obj = vdb_retrieval.VDBRetrieval()
+retrieval_obj.set_admin_api()
+retrieval_obj.user_api = retrieval_obj.admin_api
+retrieval_obj.set_datastores()
+
+workflow_keepers = [] # Versioned Data library dataset ids; a workflow cache folder is kept only if every input dataset id in its name is in this list, otherwise it is removed.
+library_folder_deletes = []
+library_dataset_deletes = []
+
+# Cycle through data stores, listing subfolders under each, sorted.
+# Permanently delete all but the latest subfolder.
+for data_store in retrieval_obj.data_stores:
+ spec_file_id = data_store['id']
+ # STEP 1: Determine data store type and location
+ data_store_spec = retrieval_obj.admin_api.libraries.show_folder(retrieval_obj.library_id, spec_file_id)
+ data_store_type = retrieval_obj.test_data_store_type(data_store_spec['name'])
+
+ if data_store_type and data_store_type not in ('folder', 'biomaj'): # 'folder' and 'biomaj' stores are static - they don't do version caching.
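+ # For illustration (hypothetical names): if a Kipper store's library folder holds version
+ # subfolders "2015-06-01.2", "2015-07-01.3" and "2015-08-01.4", only the last (newest)
+ # subfolder is kept below; the earlier two are queued for permanent deletion.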
+
+ base_folder_id = data_store_spec['folder_id']
+ ds_obj = retrieval_obj.get_data_store_gateway(data_store_type, spec_file_id)
+
+ print
+
+ # Cycle through the library tree; have to look at the whole thing since there's no /[string]/* wildcard search:
+ folders = retrieval_obj.get_library_folders(ds_obj.library_label_path)
+ for ptr, folder in enumerate(folders):
+
+ # Ignore the folder that represents the data store itself:
+ if ptr == 0:
+ print 'Data Store :: ' + folder['name']
+
+ # Keep the most recent cache item:
+ elif ptr == len(folders)-1:
+ print 'Cached Version :: ' + folder['name']
+ workflow_keepers.extend([item['id'] for item in folder['files']]) # keep the dataset ids for the workflow cache check below
+
+ # Drop version caches that are further in the past:
+ else:
+ print 'Clearing version cache: ' + folder['name']
+ library_folder_deletes.append(folder)
+ library_dataset_deletes.extend(folder['files'])
+
+
+ # Now auto-clean the data store's version folders on the server as well.
+ print "Server loc: " + ds_obj.data_store_path
+
+ items = os.listdir(ds_obj.data_store_path)
+ items = sorted(items, key=lambda el: vdb_common.natural_sort_key(el), reverse=True)
+ count = 0
+ for name in items:
+
+ # If it is a directory and it isn't the master folder or a symlinked (e.g. "current") folder:
+ version_folder=os.path.join(ds_obj.data_store_path, name)
+ if not name == 'master' \
+ and os.path.isdir(version_folder) \
+ and not os.path.islink(version_folder):
+
+ count += 1
+ if count == 1:
+ print "Keeping cache: " + name
+ else:
+ print "Dropping cache: " + name
+ for root2, dirs2, files2 in os.walk(version_folder):
+ for version_file in files2:
+ full_path = os.path.join(root2, version_file)
+ print "Removing " + full_path
+ os.remove(full_path)
+ # Not expecting any subfolders here.
+
+ os.rmdir(version_folder)
+
+
+# Permanently delete specific data library datasets:
+for item in library_dataset_deletes:
+ retrieval_obj.admin_api.libraries.delete_library_dataset(retrieval_obj.library_id, item['id'], purged=True)
+
+
+# Newer BioBlend API method for deleting Galaxy library folders.
+# The OLD Galaxy way is also possible: an HTTP DELETE request to {{url}}/api/folders/{{encoded_folder_id}}?key={{key}}
+if 'folders' in dir(retrieval_obj.admin_api):
+ for folder in library_folder_deletes:
+ retrieval_obj.admin_api.folders.delete(folder['id'])
+
+
+print workflow_keepers
+
+workflow_cache_folders = retrieval_obj.get_library_folders('/'+ vdb_retrieval.VDB_WORKFLOW_CACHE_FOLDER_NAME+'/')
+
+for folder in workflow_cache_folders:
+ dataset_ids = folder['name'].split('_') # input dataset ids separated by underscores
+ count = 0
+ for id in dataset_ids:
+ if id in workflow_keepers:
+ count += 1
+
+ # If every input dataset of the workflow cache still exists in the library cache, keep it.
+ if count == len(dataset_ids):
+ continue
+
+ # We have one or more cached datasets to drop.
+ print "Dropping workflow cache: " + folder['name']
+ for id in [item['id'] for item in folder['files']]:
+ print id
+ retrieval_obj.admin_api.libraries.delete_library_dataset(retrieval_obj.library_id, id, purged=True)
+
+ # Now delete the workflow cache folder itself.
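+ # (This requires a BioBlend version that exposes the "folders" client; otherwise the
+ # emptied workflow cache folder is left in place, as with the library version folders above.)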
+ if 'folders' in dir(retrieval_obj.admin_api):
+ retrieval_obj.admin_api.folders.delete(folder['id'])
+
+
diff -r d31a1bd74e63 -r 5c5027485f7d versioned_data_form.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/versioned_data_form.py Sun Aug 09 16:07:50 2015 -0400
@@ -0,0 +1,149 @@
+#!/usr/bin/python
+#
+import os
+import sys
+
+# This extra step enables this script to locate vdb_common and vdb_retrieval.
+# From http://code.activestate.com/recipes/66062-determining-current-function-name/
+_self_dir = os.path.dirname(sys._getframe().f_code.co_filename)
+sys.path.append(_self_dir)
+
+import vdb_common
+import vdb_retrieval
+
+retrieval_obj = None # Global used here to manage the user's currently selected retrieval info.
+
+
+def vdb_init_tool_user(trans):
+ """
+ Retrieves a user's api key if they have one, otherwise lets them know they need one.
+ This function is automatically called from the versioned_data.xml form when it is presented to the user.
+ Note that this is how the api_url gets back into the form, for passage back to the 2nd call via versioned_data.py;
+ the api_key itself is passed via a secure construct.
+ ALSO: the history_id is squeezed into the select value this way since there is no other way to pass it.
+ "trans" is provided only by the tool form presentation machinery.
+ BUT NOW SEE John Chilton's: https://gist.github.com/jmchilton/27c5bb05e155a611294d
+ See galaxy source code at https://galaxy-dist.readthedocs.org/en/latest/_modules/galaxy/web/framework.html,
+ See http://dev.list.galaxyproject.org/error-using-get-user-id-in-xml-file-in-new-Galaxy-td4665274.html
+ See http://dev.list.galaxyproject.org/hg-galaxy-2780-Real-Job-tm-support-for-the-library-upload-to-td4133384.html
+ The master api key is set in the galaxy config: #self.master_api_key = trans.app.config.master_api_key
+ """
+ global retrieval_obj
+
+ api_url = trans.request.application_url + '/api'
+ history_id = str(trans.security.encode_id(trans.history.id))
+ user_api_key = None
+ #master_api_key = trans.app.config.master_api_key
+
+ if trans.user:
+
+ user_name = trans.user.username
+
+ if trans.user.api_keys and len(trans.user.api_keys) > 0:
+ user_api_key = trans.user.api_keys[0].key # The first key is always the active one?
+ items = [ { 'name': user_name, 'value': api_url + '-' + history_id, 'options':[], 'selected': True } ]
+
+ else:
+ items = [ { 'name': user_name + ' - Note: you need a key (see "User" menu)!', 'value': '0', 'options':[], 'selected': False } ]
+
+ else:
+ items = [ { 'name': 'You need to be logged in to use this tool!', 'value': '1', 'options':[], 'selected': False } ]
+
+ retrieval_obj = vdb_retrieval.VDBRetrieval()
+ retrieval_obj.set_trans(api_url, history_id, user_api_key) #, master_api_key
+
+ return items
+
+
+def vdb_get_databases():
+ """
+ Called by the tool form; retrieves the list of versioned databases from the Galaxy library called "Versioned Data".
+
+ @return [name,value,selected] array
+ """
+ global retrieval_obj
+ items = retrieval_obj.get_library_data_store_list()
+
+ if len(items) == 0:
+ # Not great: communicating a library problem by text in the form's select pulldown input.
+ items = [[vdb_retrieval.VDB_DATA_LIBRARY_FOLDER_ERROR, None, False]]
+
+ return items
+
+
+def vdb_get_versions(spec_file_id, global_retrieval_date):
+ """
+ Retrieve the applicable versions of the given database.
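+ (For illustration only - actual values depend on the data store - the returned options
+ might look like [['2015-08-01.4', '2015-08-01.4', False], ['2015-07-01.3', '2015-07-01.3', False]],
+ i.e. [name, value, selected] triples as noted below.)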
+ Unfortunately this list is only refreshed when the form screen is refreshed.
+
+ @param spec_file_id [data store spec file id]
+ @param global_retrieval_date [YYYY-MM-DD retrieval date from the form's top-level field, possibly empty]
+
+ @return [name,value,selected] array
+ """
+ global retrieval_obj
+
+ items = []
+ if spec_file_id:
+
+ data_store_spec = retrieval_obj.user_api.libraries.show_folder(retrieval_obj.library_id, spec_file_id)
+
+ if data_store_spec: # Otherwise, does this mean the user doesn't have permissions? Validate.
+
+ file_name = data_store_spec['name'] # Short (no path), original file name, not the galaxy-assigned file_name
+ data_store_type = retrieval_obj.test_data_store_type(file_name)
+ library_label_path = retrieval_obj.get_library_label_path(spec_file_id)
+
+ if not data_store_type or not library_label_path:
+ # Kludgy method of sending a message to the user
+ items.append([vdb_retrieval.VDB_DATA_LIBRARY_FOLDER_ERROR, None, False])
+ items.append([vdb_retrieval.VDB_DATA_LIBRARY_CONFIG_ERROR + '"' + vdb_retrieval.VDB_DATA_LIBRARY + '/' + str(library_label_path) + '/' + file_name + '"', None, False])
+ return items
+
+ _retrieval_date = vdb_common.parse_date(global_retrieval_date)
+
+ # Loads the interface for the appropriate data source retrieval
+ ds_obj = retrieval_obj.get_data_store_gateway(data_store_type, spec_file_id)
+ items = ds_obj.get_version_options(global_retrieval_date=_retrieval_date)
+
+ else:
+ items.append(['Unable to find ' + spec_file_id + ': check permissions?','',False])
+
+ return items
+
+
+def vdb_get_workflows(dbKey):
+ """
+ List the appropriate workflows for a database; these are indicated by the prefix "Versioning:" in their name.
+ Currently this lists ALL published workflows; the admin_api() can see these in all Galaxy versions, but it
+ seems only some Galaxy versions allow the user_api() to also see published workflows. The only alternative
+ is to list just the individual workflows the current user can see - ones they created, plus published ones -
+ but the versioneddata user needs to have permissions on those too.
+
+ Future: sensitivity - some workflows apply only to certain kinds of database.
+
+ @param dbKey [data_spec_id]
+ @return [name,value,selected] array
+ """
+ global retrieval_obj
+
+ data = []
+ try:
+ workflows = retrieval_obj.admin_api.workflows.get_workflows(published=True)
+
+ except Exception as err:
+ if err.message[-21:] == 'HTTP status code: 403':
+ data.append(['Error: User does not have permissions to see workflows (or they need to be published).' , 0, False])
+ else:
+ data.append(['Error: In getting workflows: %s' % err.message , 0, False])
+ return data
+
+ oldName = ""
+ for workflow in workflows:
+ name = workflow['name']
+ if name[0:11].lower() == "versioning:" and name != oldName:
+ # Interface bug: if an item is published and is also shared personally with a user, it is shown twice.
+ data.append([name, workflow['id'], False])
+ oldName = name
+
+ return data
+
+