
Changeset 1:5c5027485f7d (2015-08-09)

Previous changeset 0:d31a1bd74e63 (2015-08-09) Next changeset 2:269d246ce6d0 (2015-10-23)

Commit message:
Uploaded correct file

modified:
LICENSE.md
README.md

added:
__init__.py
bccdc_macros.xml
data_store_utils.py
data_stores/__init__.py
data_stores/fasta_format.py
data_stores/kipper.py
data_stores/tester.py
data_stores/vdb_biomaj.py
data_stores/vdb_data_stores.py
data_stores/vdb_folder.py
data_stores/vdb_git.py
data_stores/vdb_kipper.py
doc/README.md
doc/background.md
doc/bad_history_item.png
doc/caching.md
doc/data_provenance.md
doc/data_stores.md
doc/design.md
doc/galaxy_data_library.png
doc/galaxy_library.md
doc/galaxy_tool.md
doc/galaxy_tool_form.png
doc/galaxy_tool_install.md
doc/galaxy_versioned_data_tool.odt
doc/galaxy_versioned_data_tool.pdf
doc/history_dataset_details.png
doc/history_view_details.png
doc/library_dataset_upload.png
doc/maintenance.md
doc/problem_solving.md
doc/setup.md
doc/versioned_data_retrieval.png
doc/workflow_tools.png
doc/workflows.md
doc/workflows.png
tarballit.sh
vdb_common.py
vdb_retrieval.py
versioned_data.py
versioned_data.xml
versioned_data_cache_clear.py
versioned_data_form.py

removed:
ffp_macros.xml
ffp_phylogeny.py
ffp_phylogeny.xml
test-data/genome1
test-data/genome2
test-data/test_length_1_output.tabular
test-data/test_length_2_output.tabular
test-data/test_length_2b_output.tabular
tool_dependencies.xml

diff -r d31a1bd74e63 -r 5c5027485f7d LICENSE.md
--- a/LICENSE.md Sun Aug 09 16:05:40 2015 -0400
+++ b/LICENSE.md Sun Aug 09 16:07:50 2015 -0400

@@ -1,7 +1,7 @@
Source Code License

An Open Source Initiative (OSI) approved license
-ffp_phylogeny source code is licensed under the Academic Free License version 3.0.
+versioned_data source code is licensed under the Academic Free License version 3.0.

Licensed under the Academic Free License version 3.0

diff -r d31a1bd74e63 -r 5c5027485f7d README.md
--- a/README.md Sun Aug 09 16:05:40 2015 -0400
+++ b/README.md Sun Aug 09 16:07:50 2015 -0400

@@ -1,47 +1,52 @@
-Feature Frequency Profile Phylogenies
-=====================================
+# Versioned Data System
+The Galaxy and command line Versioned Data System manages the retrieval of current and past versions of selected reference sequence databases from local data stores.

+---

-Introduction
-------------
-
-FFP (Feature frequency profile) is an alignment free comparison tool for phylogenetic analysis and text comparison. It can be applied to nucleotide sequences, complete genomes, proteomes and even used for text comparison. This software is a Galaxy (http://galaxyproject.org) tool for calculating FFP on one or more fasta sequence or text datasets.
-
-The original command line ffp-phylogeny code is at http://ffp-phylogeny.sourceforge.net/ . This tool uses Aaron Petkau's modified version: https://github.com/apetkau/ffp-3.19-custom . Aaron has quite a good writeup of the technique as well at https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny .
+0. Overview
+1. [Setup for Admins](doc/setup.md)
+ 1. [Galaxy tool installation](doc/galaxy_tool_install.md)
+ 2. [Server data stores](doc/data_stores.md)
+ 3. [Data store examples](doc/data_store_examples.md)
+ 4. [Galaxy "Versioned Data" library setup](doc/galaxy_library.md)
+ 5. [Workflow configuration](doc/workflows.md)
+ 6. [Permissions, security, and maintenance](doc/maintenance.md)
+ 7. [Problem solving](doc/problem_solving.md)
+2. [Using the Galaxy Versioned data tool](doc/galaxy_tool.md)
+3. [System Design](doc/design.md)
+4. [Background Research](doc/background.md)
+5. [Server data store and galaxy library organization](doc/data_store_org.md)
+6. [Data Provenance and Reproducibility](doc/data_provenance.md)
+7. [Caching System](doc/caching.md)

-**Installation Note** : Your Galaxy server will need the groff package to be installed on it first (to generate ffp-phylogeny man pages). A cryptic error will occur if it isn't: "troff: fatal error: can't find macro file s". This is different from the "groff-base" package.
+---
+
+## Overview
+
+This tool can be used on a server both via the command line and via the Galaxy bioinformatics workflow platform using the "Versioned Data" tool. Different kinds of content are suited to different archiving technologies, so the system provides a few storage system choices.

-This Galaxy tool prepares a mini-pipeline consisting of **[ffpry | ffpaa | ffptxt] > [ ffpfilt | ffpcol > ffprwn] > ffpjsd > ffptree** . The last step is optional - by deselecting the "Generate Tree Phylogeny" checkbox, the tool will output a distance matrix rather than a Newick (.nhx) formatted tree file.
+* Fasta sequences - accession ids, descriptions and their sequences - are suited to storage as 1 line key-value pair records in a key-value store. Here we introduce a low-tech file-based database plugin for this kind of data called **Kipper**. It is suited entirely to the goal of producing complete versioned files. This covers much of the sequencing archiving problem for reference databases. Consult https://github.com/Public-Health-Bioinformatics/kipper for up-to-date information on Kipper.

-Each sequence or text file has a profile containing tallies of each feature found. A feature is a string of valid characters of given length.
+* A **git** archiving system plugin is also provided for software file tree archiving, with a particular file differential (diff) compression benefit for documents that have sentence-like lines added and deleted between versions.
+
+* Super-large files that are not suited to Kipper or git can be handled by a simple "**folder**" data store holds each version of file(s) in a separate compressed archive.

-For nucleotide data, by default each character (ATGC) is grouped as either purine(R) or pyrmidine(Y) before being counted.
+* **Biomaj** (our reference database maintenance software) can be configured to download and store separate version files. A Biomaj plugin a
..
,C,G,A,N,H,P. Each group is represented by the first character in its series.
+## Project goals

-One other key concept is that a given feature, e.g. "TAA" is counted in forward AND reverse directions, mirroring the idea that a feature's orientation is not so important to distinguish when it comes to alignment-free comparison. The counts for "TAA" and "AAT" are merged.
+* **To enable reproducible molecular biology research:** To recreate a search result at a certain point in time we need versioning so that search and mapping tools can look at reference sequence databases corresponding to a particular past date or version identifier. This recall can also explain the difference between what was known in the past vs. currently.
+
+* **To reduce hard drive space.** Some databases are too big to keep N copies around, e.g. 5 years of 16S, updated monthly, is say, 670Mb + 668Mb + 665Mb + .... (Compressing each file individually is an option but even better we could store just the differences between subsequent versions.)

-The labeling of the resulting counted feature items is perhaps the trickiest concept to master. Due to computational efficiency measures taken by the developers, a feature that we see on paper as "TAC" may be stored and labeled internally as "GTA", its reverse compliment. One must look for the alternative if one does not find the original.
-
-Also note that in amino acid sequences the stop codon "*" (or any other character that is not in the Amino acid alphabet) causes that character frame not to be counted. Also, character frames never span across fasta entries.
+* **Maximize speed of archive recall.** Understanding that the archived version files can be large, we'd ideally like a versioned file to be retrieved in the time it takes to write a file of that size to disk. Caching this data and its derivatives (makeblastdb databases for example) is important.

-A few tutorials:
- * http://sourceforge.net/projects/ffp-phylogeny/files/Documentation/tutorial.pdf
- * https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny
-
--------
-**Note**
+* **Improve sequence archive management.** Provide an admin interface for managing regular scheduled import and log of reference sequence databases from our own and 3rd party sources like NCBI and cpndb.ca .

-Taxonomy label details: If each file contains one profile, the file's name is used to label the profile. If each file contains fasta sequences to profile individually, their fasta identifiers will be used to label them. The "short labels" option will find the shortest label that uniquely identifies each profile. Either way, there are some quirks: ffpjsd clips labels to 10 characters if they are greater than 50 characters, so all labels are trimmed to 50 characters first. Also "id" is prefixed to any numeric label since some tree visualizers won't show purely numeric labels. In the accidental case where a Fasta sequence label is a duplicate of a previous one it will be prefixed by "DupLabel-".
-
-The command line ffpjsd can hang if one provides an l-mer length greater than the length of file content. One must identify its process id ("ps aux | grep ffpjsd") and kill it ("kill [process id]").
-
-Finally, it is possible for the ffptree program to generate a tree where some of the branch distances are negative. See https://www.biostars.org/p/45597/
+* Integrate database versioning into the Galaxy workflow management software without adding a lot of complexity.

--------
-**References**
-
-The development of the ffp-phylogeny command line software should be attributed to:
-
-Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 2009;106(8):2677-2682. doi:10.1073/pnas.0813249106.
-
+* A bonus would be to enable the efficient sharing of versioned data between computers/servers.

diff -r d31a1bd74e63 -r 5c5027485f7d bccdc_macros.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bccdc_macros.xml Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,27 @@
+<macros>
+
+    <xml name="stdio">
+        <stdio>
+            
+            <exit_code range="1:" />
+            <exit_code range=":-1" />
+            
+            <regex match="Error:" />
+            <regex match="Exception:" />
+        </stdio>
+    </xml>
+
+    <xml name="requirements">
+        <requirements>
+            <requirement type="binary">@BINARY@</requirement>
+        </requirements>
+        
+    </xml>
+
+    <token name="@REFERENCES@">
+    </token>
+    <token name="@REFERENCES_2@">
+ REFERENCES.
+    </token>
+
+</macros>

diff -r d31a1bd74e63 -r 5c5027485f7d data_store_utils.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_store_utils.py Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,26 @@
+
+def version_cache_setup(dataset_id, data_file_cache_folder, cacheable_dataset):
+ """ UNUSED: Idea was to enable caching of workflow products outside of galaxy for use by others.
+ CONSIDER METACODE. NOT INTEGRATED, NOT TESTED.
+ """
+ data_file_cache_name = os.path.join(data_file_cache_folder, dataset_id ) #'blastdb.txt'
+ if os.path.isfile(data_file_cache_name):
+ pass
+ else:
+ if os.path.isdir(data_file_cache_folder):
+ shutil.rmtree(data_file_cache_folder)
+ os.makedirs(data_file_cache_folder)
+ # Default filename=false means we're supplying the filename.
+ gi.datasets.download_dataset(dataset_id, file_path=data_file_cache_name, use_default_filename=False, wait_for_completion=True) # , maxwait=12000) is a default of 3 hours
+
+ # Generically, any dataset might have subfolders - to check we have to
+ # see if galaxy dataset file path has contents at _files suffix.
+ # Find dataset_id in version retrieval history datasets, and get its folder path, and copy _files over...
+ galaxy_dataset_folder = cacheable_dataset['file_name'][0:-4] + '_files'
+ time.sleep(2)
+ if os.path.isdir(galaxy_dataset_folder) \
+ and not os.path.isdir(data_file_cache_folder + '/files/'):
+ print 'Copying ' + galaxy_dataset_folder + ' to ' + data_file_cache_folder
+ # Copy program makes target folder.
+ shutil.copytree(galaxy_dataset_folder, data_file_cache_folder + '/files/') # , symlinks=False, ignore=None
+
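
The cache check in `version_cache_setup` can be sketched without a Galaxy connection. The block below is a minimal Python 3 illustration, not the tool's implementation: `ensure_cached` and `fake_fetch` are hypothetical names, and `fetch` stands in for the `gi.datasets.download_dataset` call in the diff above.

```python
import os
import shutil
import tempfile


def ensure_cached(dataset_id, cache_folder, fetch):
    """Return the cached file path, calling fetch(path) only on a cache miss.

    `fetch` is a hypothetical callable standing in for the Galaxy download.
    """
    cache_path = os.path.join(cache_folder, dataset_id)
    if os.path.isfile(cache_path):
        return cache_path  # cache hit: nothing to do
    # Cache miss: start from an empty folder, as the original function does.
    if os.path.isdir(cache_folder):
        shutil.rmtree(cache_folder)
    os.makedirs(cache_folder)
    fetch(cache_path)
    return cache_path


calls = []


def fake_fetch(path):
    # Record the call and write a placeholder file instead of downloading.
    calls.append(path)
    with open(path, "w") as handle:
        handle.write("placeholder\n")


cache_folder = os.path.join(tempfile.mkdtemp(), "cache")
first = ensure_cached("dataset_1", cache_folder, fake_fetch)
second = ensure_cached("dataset_1", cache_folder, fake_fetch)
```

The second call finds the file already in place, so the download stand-in runs only once.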

diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/fasta_format.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_stores/fasta_format.py Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,195 @@
+#!/usr/bin/python
+# Simple comparison and conversion tools for big fasta data
+
+import sys, os, optparse
+
+VERSION = "1.0.0"
+
+class MyParser(optparse.OptionParser):
+ """
+ Provides a better class for displaying formatted help info.
+ From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output.
+ """
+ def format_epilog(self, formatter):
+ return self.epilog
+
+
+def split_len(seq, length):
+ return [seq[i:i+length] for i in range(0, len(seq), length)]
+
+
+def check_file_path(file_path, message = "File "):
+
+ path = os.path.normpath(file_path)
+ # make sure any relative paths are converted to absolute ones
+ if not os.path.isdir(os.path.dirname(path)) or not os.path.isfile(path):
+ # Not an absolute path, so try default folder where script was called:
+ path = os.path.normpath(os.path.join(os.getcwd(), path) )
+ if not os.path.isfile(path):
+ print message + "[" + file_path + "] doesn't exist!"
+ sys.exit(1)
+
+ return path
+
+
+class FastaFormat(object):
+
+ def __main__(self):
+
+ options, args = self.get_command_line()
+
+ if options.code_version:
+ print VERSION
+ return VERSION
+
+ if len(args) > 0:
+ file_a = check_file_path(args[0])
+ else: file_a = False
+
+ if len(args) > 1:
+ file_b = check_file_path(args[1])
+ else: file_b = False
+
+ if options.to_fasta == True:
+ # Transform from key-value file to regular fasta format: 1 line for identifier and description(s), remaining 80 character lines for fasta data.
+
+ with sys.stdout as outputFile :
+ for line in open(file_a,'r') if file_a else sys.stdin:
+ line_data = line.rsplit('\t',1)
+ if len(line_data) > 1:
+ outputFile.write(line_data[0] + '\n' + '\n'.join(split_len(line_data[1],80)) )
+ else:
+ # Fasta one-liner didn't have any sequence data
+ outputFile.write(line_data[0])
+
+ #outputFile.close() #Otherwise terminal never looks like it closes?
+
+
+ elif options.to_keyvalue == True:
+ # Transform from fasta format to key-value format:
+ # Separate sequence lines are merged, then separated from the id/description line by a tab.
+ with sys.stdout as outputFile:
+ start = True
+ for line in open(file_a,'r') if file_a else sys.stdin:
+ if line[0] == ">":
+ if start == False:
+ outputFile.write('\n')
+ else:
+ start = False
+ outputFile.write(line.strip() + '\t')
+
+ else:
+ outputFile.write(line.strip())
+
+
+ elif options.compare == True:
+
+ if len(args) < 2:
+ print "Error: Need two fasta file paths to compare"
+ sys.exit(1)
+
+ file_a = open(file_a,'r')
+ file_b = open(file_b,'r')
+
+ p = 3
+ count_a = 0
+ count_b = 0
+ sample_length = 50
+
+ while True:
+
+ if p&1:
+ a = file_a.readline()
+ count_a += 1
+
+ if p&2:
+ b = file_b.readline()
+ count_b += 1
+
+ if not a or not b: # readline() returns '' only at EOF; a blank line still contains its line ending
+ print "EOF"
+ break
+
+ a = a.strip()
+ b = b.strip()
+
+ if a == b:
+ p = 3
+
+ elif a < b:
+ sys.stdout.write('f1 ln %s: -%s\nvs.   %s \n' % (count_a, a[0:sample_length] , b[0:sample_length]))
+ p = 1
+
+ else:
+ sys.stdout.write('f2 ln %s: +%s\nvs.   %s \n' % (count_b, b[0:sample_length] , a[0:sample_length]))
+ p = 2
+
+
+ if count_a % 1000000 == 0:
+ print "At line %s:" % count_a
+
+ # For testing:
+ #if count_a > 50:
+ #
+ # print "Quick exit at line 50"
+ # sys.exit(1)
+
+
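+ # Report lines left over in either file; there is no counterpart line to show,
+ # so the format string takes only the line count and the line sample.
+ for line in file_a.readlines():
+ count_a += 1
+ sys.stdout.write('f1 ln %s: -%s\n' % (count_a, line[0:sample_length] ))
+
+ for line in file_b.readlines():
+ count_b += 1
+ sys.stdout.write('f2 ln %s: +%s\n' % (count_b, line[0:sample_length] ))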
+
+
+
+
+ def get_command_line(self):
+ """
+ *************************** Parse Command Line *****************************
+
+ """
+ parser = MyParser(
+ description = 'Tool for comparing two fasta files, or transforming fasta data to single-line key-value format, or vice versa.',
+ usage = 'fasta_format.py [options]',
+ epilog="""
+ Note: This tool uses stdin and stdout for transforming fasta data.
+
+ Convert from key-value data to fasta format data:
+ fasta_format.py [file] -f --fasta
+ cat [file] | fasta_format.py -f --fasta
+
+ Convert from fasta format data to key-value data:
+ fasta_format.py [file] -k --keyvalue
+ cat [file] | fasta_format.py -k --keyvalue
+
+ Compare two fasta format files:
+ fasta_format.py [file1] [file2] -c --compare
+
+ Return version of this code:
+ fasta_format.py -v --version
+
+""")
+
+ parser.add_option('-c', '--compare', dest='compare', default=False, action='store_true',
+ help='Compare two fasta files')
+
+ parser.add_option('-f', '--fasta', dest='to_fasta', default=False, action='store_true',
+ help='Transform key-value file to fasta file format')
+
+ parser.add_option('-k', '--keyvalue', dest='to_keyvalue', default=False, action='store_true',
+ help='Transform fasta file format to key-value format')
+
+ parser.add_option('-v', '--version', dest='code_version', default=False, action='store_true',
+ help='Return version of this code.')
+
+ return parser.parse_args()
+
+
+if __name__ == '__main__':
+
+ fasta_format = FastaFormat()
+ fasta_format.__main__()
+
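
The two conversions in `fasta_format.py` are easier to follow as pure functions over lists of lines. This is a minimal Python 3 sketch of the same idea, not the script itself: `fasta_to_keyvalue`, `keyvalue_to_fasta` and the sample records are illustrative names and data, while `split_len` and the 80-character width come from the diff above.

```python
def split_len(seq, length):
    """Cut seq into consecutive chunks of at most `length` characters."""
    return [seq[i:i + length] for i in range(0, len(seq), length)]


def fasta_to_keyvalue(lines):
    """Yield one 'header<TAB>sequence' record per FASTA entry."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header + "\t" + "".join(chunks)


def keyvalue_to_fasta(records, width=80):
    """Yield FASTA lines: the header, then the sequence wrapped at `width`."""
    for record in records:
        # Split on the last tab only, as the original script does.
        parts = record.rstrip("\n").rsplit("\t", 1)
        yield parts[0]
        if len(parts) > 1:
            for chunk in split_len(parts[1], width):
                yield chunk


sequence = "ACGT" * 25  # 100 characters, so it wraps into 80 + 20
fasta = [">seq1 example", sequence[:80], sequence[80:], ">seq2", "TTGA"]
records = list(fasta_to_keyvalue(fasta))
round_trip = list(keyvalue_to_fasta(records))
```

Because each entry becomes exactly one line, the key-value form can be sorted and compared line by line, which is what the `--compare` option relies on.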

diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/kipper.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_stores/kipper.py Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,1086 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+import subprocess
+import datetime
+import dateutil.parser as parser2
+import calendar
+import optparse
+import re
+import os
+import sys
+from shutil import copy
+import tempfile
+import json
+import glob
+import gzip
+
+
+
+
+CODE_VERSION = '1.0.0'
+REGEX_NATURAL_SORT = re.compile('([0-9]+)')
+KEYDB_LIST = 1
+KEYDB_EXTRACT = 2
+KEYDB_REVERT = 3
+KEYDB_IMPORT = 4
+
+class MyParser(optparse.OptionParser):
+    """
+    Provides a better class for displaying formatted help info.
+    From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output.
+    """
+    def format_epilog(self, formatter):
+        return self.epilog
+
+def stop_err( msg ):
+    sys.stderr.write("%s\n" % msg)
+    sys.exit(1)
+
+class Kipper(object):
+
+
+    def __init__(self):
+        # Provide defaults
+        self.db_master_file_name = None
+        self.db_master_file_path = None
+        self.metadata_file_path = None
+        self.db_import_file_path = None
+        self.output_file = None # By default, printed to stdout
+        self.volume_id = None
+        self.version_id = None # Note, this is natural #, starts from 1;
+        self.metadata = None
+        self.options = None
+        self.compression = ''
+
+        _nowabout = datetime.datetime.utcnow()
+        self.dateTime = long(_nowabout.strftime("%s"))
+
+        self.delim = "\t"
+        self.nl = "\n"
+
+
+    def __main__(self):
+        """
+        Handles all command line options for creating kipper archives, and extracting or reverting to a version.
+        """
+        options, args = self.get_command_line()
+        self.options = options
+
+        if options.code_version:
+            print CODE_VERSION
+            return CODE_VERSION
+
+        # *********************** Get Master kipper file ***********************
+        if not len(args):
+            stop_err('A Kipper database file name needs to be included as first parameter!')
+
+        self.db_master_file_name = args[0] #accepts relative path with file name
+
+        self.db_master_file_path = self.check_folder(self.db_master_file_name, "Kipper database file")
+        # db_master_file_path is used from now on; db_master_file_name is used just for metadata labeling.
+        # Adjust it to remove any relative path component.
+        self.db_master_file_name = os.path.basename(self.db_master_file_name)
+
+        if os.path.isdir(self.db_master_file_path):
+            stop_err('Error: Kipper data file "%s" is actually a folder!' % (self.db_master_file_path) )
+
+        self.metadata_file_path = self.db_master_file_path + '.md'
+
+        # Returns path but makes sure its folder is real. Must come before get_metadata()
+        self.output_file = self.check_folder(options.db_output_file_path)
+
+
+        # ************************* Get Metadata ******************************
+        if options.initialize:
+            if options.compression:
+                self.compression = options.compression
+
+            self.set_metadata(type=options.initialize, compression=self.compression)
+
+        self.get_metadata(options);
+
+        self.check_date_input(options)
+
+        if options.version_id or (options.extract and options.version_index):
+            if options.version_index:
+                vol_ver = self.version_lookup(options.version_index)
+
+            else:
+                # Note version_id info overrides any date input above.
+                vol_ver = self.get_version(options.version_id)
+
+            if not vol_ver:
+                stop_err("Error: Given version number or name does not exist in this database")
+
+            (volume, version) = vol_ver
+            self.volume_id = volume['id']
+            self.version_id = version['id']
+            self.dateTime = float(version['created'])
+        else:
+            # Use latest version by default
+            if not self.version_id and len(self.metadata['volumes'][-1]['versions']) > 0:
+                self.volume_id = self.metadata['volumes'][-1]['id']
+                self.version_id = self.metadata['volumes'][-1]['versions'][-1]['id']
+
+        # ************************** Action triggers **************************
+
+        if options.volume == True:
+            # Add a new volume to the metadata
+            self.metadata_create_volume()
+            self.write_metad
..
ray
+
+        @param line string containing [accession id][TAB][description][TAB][fasta sequence]
+        @return string containing lines each ending with newline, except end.
+        """
+        line_data = line.split('\t',2)
+        # Set up ">[accession id] [description]\n" :
+        fasta_header = '>' + ' '.join(line_data[0:2]) + '\n'
+        # Put fasta sequences back into multi-line; note trailing item has newline.
+        sequences= self.split_len(line_data[2],80)
+        if len(sequences) and sequences[-1].strip() == '':
+            sequences[-1] = ''
+
+        return fasta_header + '\n'.join(sequences)
+
+
+    def split_len(self, seq, length):
+        return [seq[i:i+length] for i in range(0, len(seq), length)]
+
+
+class bigFileReader(object):
+    """
+    This provides some advantage over reading line by line, and as well has a system
+    for skipping/not advancing reads - it has a memory via "take_step" about whether
+    it should advance or not - this is used when the master database and the import
+    database are feeding lines into a new database.
+
+    Interestingly, using readlines() with byte hint parameter less
+    than file size improves performance by at least 30% over readline().
+
+    FUTURE: Adjust buffer lines dynamically based on file size/lines ratio?
+    """
+
+    def __init__(self, filename):
+        self.lines = []
+        # This simply allows any .gz repository to be opened
+        # It isn't connected to the Kipper metadata['compression'] feature.
+        if filename[-3:] == '.gz':
+            self.file = gzip.open(filename,'rb')
+        else:
+            self.file = open(filename, 'rb', 1)
+
+        self.line = False
+        self.take_step = True
+        self.buffer_size=1000 # Number of lines to read into buffer.
+
+
+    def turn(self):
+        """
+        When accessing bigFileReader via turn mechanism, we get current line if no step;
+        otherwise with step we read new line.
+        """
+        if self.take_step == True:
+            self.take_step = False
+            return self.read()
+        return self.line
+
+
+    def read(self):
+        if len(self.lines) == 0:
+            self.lines = self.file.readlines(self.buffer_size)
+        if len(self.lines) > 0:
+            self.line = self.lines.pop(0)
+            #if len(self.lines) == 0:
+            #    self.lines = self.file.readlines(self.buffer_size)
+            #make sure each line doesn't include carriage return
+            return self.line
+
+        return False
+
+
+    def readlines(self):
+        """
+        Small efficiency:
+        A test on self.lines after readLines() call can control loop.
+        Bulk write of remaining buffer; ensures lines array isn't copied
+        but is preserved when self.lines is removed
+        """
+        self.line = False
+        if len(self.lines) == 0:
+            self.lines = self.file.readlines(self.buffer_size)
+        if len(self.lines) > 0:
+            shallowCopy = self.lines[:]
+            self.lines = self.file.readlines(self.buffer_size)
+            return shallowCopy
+        return False
+
+
+    def step(self):
+        self.take_step = True
+
+
+
+# Enables use of with ... syntax. See https://mail.python.org/pipermail/tutor/2009-November/072959.html
+class myGzipFile(gzip.GzipFile):
+    def __enter__(self):
+        if self.fileobj is None:
+            raise ValueError("I/O operation on closed GzipFile object")
+        return self
+
+    def __exit__(self, *args):
+        self.close()
+
+
+def natural_sort_key(s, _nsre = REGEX_NATURAL_SORT):
+    return [int(text) if text.isdigit() else text.lower()
+        for text in re.split(_nsre, s)]
+
+
+def generic_linux_sort(self):
+    import locale
+    locale.setlocale(locale.LC_ALL, "C")
+    yourList.sort(cmp=locale.strcoll)
+
+
+def parse_date(adate):
+    """
+    Convert human-entered time into linux integer timestamp
+
+    @param adate string Human entered date to parse into linux time
+
+    @return integer Linux time equivalent or 0 if no date supplied
+    """
+    adate = adate.strip()
+    if adate > '':
+        adateP = parser2.parse(adate, fuzzy=True)
+        #dateP2 = time.mktime(adateP.timetuple())
+        # This handles UTC & daylight savings exactly
+        return calendar.timegm(adateP.timetuple())
+    return 0
+
+
+if __name__ == '__main__':
+
+    kipper = Kipper()
+    kipper.__main__()
+
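
The `natural_sort_key` helper near the end of `kipper.py` is worth seeing in action, since it decides how numbered labels are ordered. The sketch below reuses the function body from the diff under Python 3. The label list is made up for illustration, and how Kipper itself applies the key is not shown in this truncated view.

```python
import re

REGEX_NATURAL_SORT = re.compile("([0-9]+)")


def natural_sort_key(s, _nsre=REGEX_NATURAL_SORT):
    # Splitting on a captured digit run alternates text and number pieces,
    # so numeric pieces compare as integers rather than character by character.
    return [int(text) if text.isdigit() else text.lower()
            for text in re.split(_nsre, s)]


# Hypothetical version labels, for illustration only.
labels = ["silva_rna_119", "silva_rna_20", "silva_rna_3"]
plain = sorted(labels)
natural = sorted(labels, key=natural_sort_key)
```

A plain string sort compares digit characters one at a time and places `silva_rna_119` first, while the natural key orders the labels by the numeric value of the suffix.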

diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/tester.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_stores/tester.py Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,109 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+import optparse
+import sys
+import difflib
+import os
+
+
+class MyParser(optparse.OptionParser):
+ """
+ From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output
+ Provides a better class for displaying formatted help info in epilog() portion of optParse; allows for carriage returns.
+ """
+ def format_epilog(self, formatter):
+ return self.epilog
+
+def stop_err( msg ):
+    sys.stderr.write("%s\n" % msg)
+    sys.exit(1)
+
+
+def __main__(self):
+
+
+ """
+ (This is run only in context of command line.)
+ FUTURE: ALLOW GLOB IMPORT OF ALL FILES OF GIVEN SUFFIX, each to its own version, initial date taken from file date
+ FUTURE: ALLOW dates of versions to be adjusted.
+ """
+ options, args = self.get_command_line()
+ self.options = options
+
+ if options.test_ids:
+ return self.test(options.test_ids)
+
+
+def get_command_line(self):
+ """
+ *************************** Parse Command Line *****************************
+
+ """
+ parser = MyParser(
+ description = 'Tests a program against given input/output files.',
+ usage = 'tester.py [program] [input files] [output files] [parameters]',
+ epilog="""
+
+""")
+ parser.add_option('-t', '--tests', dest='test_ids', help='Enter "all" or comma-separated id(s) of tests to run.')
+
+ return parser.parse_args()
+
+
+
+def test(self, test_ids):
+ # Future: read this spec from test-data folder itself?
+ tests = {
+ '1': {'input':'a1','outputs':'','options':''}
+ }
+ self.test_suite('keydb.py', test_ids, tests, '/tmp/')
+
+
+def test_suite(self, program, test_ids, tests, output_dir):
+
+ if test_ids == 'all':
+ test_ids = sorted(tests.keys())
+ else:
+ test_ids = test_ids.split(',')
+
+ for test_id in test_ids:
+ if test_id in tests:
+ test = tests[test_id]
+ test['program'] = program
+ test['base_dir'] = base_dir = os.path.dirname(__file__)
+ executable = os.path.join(base_dir,program)
+ if not os.path.isfile(executable):
+ stop_err('\n\tUnable to locate ' + executable)
+ # Each output file has to be prefixed with the output (usually /tmp/) folder
+ test['tmp_output'] = (' ' + test['outputs']).replace(' ',' ' + output_dir)
+ # Note: output_dir output files don't get cleaned up after each test.  Should they?!
+ params = '%(base_dir)s/%(program)s %(base_dir)s/test-data/%(input)s%(tmp_output)s %(options)s' % test
+ print("Test " + test_id + ': ' + params)
+ print("................")
+ os.system(params)
+ print("................")
+ for file in test['outputs'].split(' '):
+ try:
+ f1 = open(test['base_dir'] + '/test-data/' + file)
+ f2 = open(output_dir + file)
+ except IOError as details:
+ stop_err('Error in test setup: ' + str(details)+'\n')
+
+ # n = number of context lines
+ diff = difflib.context_diff(f1.readlines(), f2.readlines(), lineterm='',n=0)
+ # One Galaxy issue: it doesn't convert entities when user downloads file.
+ # BUT IT appears to when generating directly to command line?
+ print '\nCompare ' + file
+ print '\n'.join(list(diff))
+
+ else:
+ stop_err("\nExpecting one or more test ids from " + str(sorted(tests.keys())))
+
+ stop_err("\nTest finished.")
+
+
+if __name__ == '__main__':
+
+ keydb = KeyDb()
+ keydb.__main__()
+
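
The comparison step in `test_suite` relies on `difflib.context_diff` with `n=0`, which reports only the lines that differ. The following Python 3 sketch shows that call on in-memory lists; the sample lines and the `fromfile`/`tofile` labels are illustrative and do not come from the repository's test data.

```python
import difflib

# Hypothetical expected and actual outputs, each line ending with a newline
# as readlines() would return them.
expected = ["alpha\n", "beta\n", "gamma\n"]
actual = ["alpha\n", "BETA\n", "gamma\n"]

# n=0 suppresses surrounding context so only changed lines are reported.
diff = list(difflib.context_diff(
    expected, actual, fromfile="expected", tofile="actual", lineterm="", n=0))
report = "\n".join(line.rstrip("\n") for line in diff)

# Identical inputs produce no diff lines at all, which makes an empty
# result a convenient "test passed" signal.
no_diff = list(difflib.context_diff(expected, expected, lineterm="", n=0))
```

Changed lines are prefixed with `! ` in context-diff output, so a test harness can scan for that marker or simply check whether the diff is empty.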

diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_biomaj.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_stores/vdb_biomaj.py Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,71 @@
+#!/usr/bin/python
+## ******************************* Biomaj FOLDER *********************************
+##
+
+import os
+import time
+import glob
+import vdb_common
+import vdb_data_stores
+
+class VDBBiomajDataStore(vdb_data_stores.VDBDataStore):
+
+ versions = None
+ library = None
+
+ def __init__(self, retrieval_obj, spec_file_id):
+ """
+ Provides list of available versions where each version is a data file sitting in a Biomaj data store subfolder. e.g.
+
+ /projects2/ref_databases/biomaj/ncbi/blast/silva_rna/ silva_rna_119/flat
+ /projects2/ref_databases/biomaj/ncbi/genomes/Bacteria/ Bacteria_2014-07-25/flat
+
+ """
+ super(VDBBiomajDataStore, self).__init__(retrieval_obj, spec_file_id)
+
+ self.library = retrieval_obj.library
+
+ versions = []
+ # Linked, meaning that our data source spec file pointed to some folder on the server.
+ # Name of EACH subfolder should be a label for each version, including date/time and version id.
+
+ base_file_path = os.path.join(os.path.dirname(self.base_file_name),'*','flat','*')
+ try:
+ #Here we need to get subfolder listing of linked file location.
+ for item in glob.glob(base_file_path):
+
+ # Only interested in files, and not ones in the symlinked /current/ subfolder
+ # Also, Galaxy will strip off .gz suffixes - WITHOUT UNCOMPRESSING FILES!
+ # So, best to prevent data store from showing .gz files in first place.
+ if os.path.isfile(item) and not '/current/' in item and not item[-3:] == '.gz':
+ #Name includes last two subfolders: /[folder]/flat/[name]
+ item_name = '/'.join(item.rsplit('/',3)[1:])
+ # Can't count on creation date being spelled out in name
+ created = vdb_common.parse_date(time.ctime(os.path.getmtime(item)))
+ versions.append({'name':item_name, 'id':item_name, 'created': created})
+
+ except Exception as err:
+ # This is the first call to the API, so an API URL or authentication error can surface here.
+ versions.append({
+ 'name':'Error: Unable to get version list: ' + err.message,
+ 'id':'',
+ 'created':''
+ })
+
+ self.versions = sorted(versions, key=lambda x: x['name'], reverse=True)
+
+
+ def get_version(self, version_name):
+ """
+ Return server path of requested version info
+
+ @uses library_label_path string Full hierarchic label of a library file or folder, PARENT of version id folder.
+ @uses base_file_name string Server absolute path to data_store spec file
+ @param version_name alphanumeric string
+ """
+ self.version_label = version_name
+ self.library_version_path = os.path.join(self.library_label_path, self.version_label)
+
+ #linked to some other folder, spec is location of base_file_name
+ self.version_path = os.path.join(self.data_store_path, version_name)
+

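A minimal sketch of the version naming used by the Biomaj data store above: the version name is the last three path components (`[folder]/flat/[name]`), and entries under `/current/` or ending in `.gz` are skipped. The function name `biomaj_version_name` and the sample paths are hypothetical; the `os.path.isfile` check from the original is left out so the example runs without a Biomaj folder on disk.

```python
def biomaj_version_name(path):
    """Return '[folder]/flat/[name]' for a Biomaj flat file path, or None if skipped."""
    # Skip the symlinked /current/ subfolder and any .gz files,
    # mirroring the filter in VDBBiomajDataStore.__init__.
    if '/current/' in path or path.endswith('.gz'):
        return None
    # Keep the last three components of the path.
    return '/'.join(path.rsplit('/', 3)[1:])


if __name__ == '__main__':
    print(biomaj_version_name('/data/biomaj/silva_rna/silva_rna_119/flat/silva.fasta'))
```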
diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_data_stores.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_stores/vdb_data_stores.py Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,142 @@
+import sys, vdb_common
+import vdb_retrieval # For message text
+
+class VDBDataStore(object):
+
+ """
+ Provides data store engine super-class with methods to list available versions, and to generate a version (potentially providing a link to cached version).  Currently have options for git, folder, and kipper.
+
+ get_data_store_gateway() method loads the appropriate data_stores/vdb_.... data store variant.
+
+ """
+
+ def __init__(self, retrieval_obj, spec_file_id):
+ """
+ Note that api is only needed for data store type = folder.
+
+ @init self.type
+ @init self.base_file_name
+ @init self.library_label_path
+ @init self.data_store_path
+ @init self.global_retrieval_date
+
+ @sets self.library_version_path
+ @sets self.version_label
+ @sets self.version_path
+ """
+
+ self.admin_api = retrieval_obj.admin_api
+ self.library_id = retrieval_obj.library_id
+ self.library_label_path = retrieval_obj.get_library_label_path(spec_file_id)
+ try:
+ # Issue: Error probably gets trapped but is never reported back to Galaxy from thread via <code file="versioned_data_form.py" /> form's call?
+ # It appears galaxy user needs more than just "r" permission on this file, oddly?
+ spec = self.admin_api.libraries.show_folder(self.library_id, spec_file_id)
+
+ except IOError as e:
+ print 'Tried to fetch library folder spec file: %s.  Check permissions?' % spec_file_id
+ print "I/O error({0}): {1}".format(e.errno, e.strerror)
+ sys.exit(1)
+
+ self.type = retrieval_obj.test_data_store_type(spec['name'])
+
+ #Server absolute path to data_store spec file (can be Galaxy .dat file representing Galaxy library too).
+
+ self.base_file_name = spec['file_name']
+
+ # In all cases a pointer file's content (data_store_path)
+ # should point to a real server folder (that galaxy has permission to read).
+ # Exception to this is for pointer.folder, where content can be empty,
+ # in which case idea is to use library folder contents directly.
+
+ with open(self.base_file_name,'r') as path_spec:
+ self.data_store_path = path_spec.read().strip()
+ if len(self.data_store_path) > 0:
+ # Let people forget to put a trailing slash on the folder path.
+ if not self.data_store_path[-1] == '/':
+ self.data_store_path += '/'
+
+ # Generated on subsequent subclass call
+ self.library_version_path = None
+ self.version_label = None
+ self.version_path = None
+
+
+ def get_version_options(self, global_retrieval_date=0, version_name=None, selection=False):
+ """
+ Provides list of available versions of a given archive.  List is filtered by
+ optional global_retrieval_date or version id.  For date filter, the version immediately
+ preceding given datetime (includes same datetime) is returned.
+ All comparisons are done by version NAME not id, because underlying db id might change.
+
+ If global_retrieval_date datetime precedes first version, no filtering is done.
+
+ @param global_retrieval_date long unix time date to test entry against.
+ @param version_name string Name of version
+ @param selection boolean If True, return only the name of the selected version.
+ """
+
+ data = []
+
+ date_match = vdb_common.date_matcher(global_retrieval_date)
+ found=False
+
+ for ptr, item in enumerate(self.versions):
+ created = float(item['created'])
+ item_name = item['name']
+
+ # Note version_name is often "None", so must come last in conjunction.
+ selected = (found == False) \
+ and ((item_name == version_name ) or date_match.next(created))
+
+ if selected == True:
+ found = True
+ if selection==True:
+ return item_name
+
+ # Folder type data stores should already have something resembling created date in name.
+ if type(self).__name__ in ('VDBFolderDataStore', 'VDBBiomajDataStore'):
+ item_label = item['name']
+ else:
+ item_label = vdb_common.lightDate(created) + '_' + item['name']
+
+ data.append([item_label, item['name'], selected])
+
+ if not found and len(self.versions) > 0:
+ if global_retrieval_date: # Select oldest date version since no match above.
+ item_name = data[-1][1]
+ data[-1][2] = True
+ else:
+ item_name = data[0][1]
+ data[0][2] = True
+ if selection == True:
+ return item_name
+
+
+ # For cosmetic display: Natural sort takes care of version keys that have mixed characters/numbers
+ data = sorted(data, key=lambda el: vdb_common.natural_sort_key(el[0]), reverse=True) #descending
+
+ #Always tag the first item as the most current one
+ if len(data) > 0:
+ data[0][0] = data[0][0] + ' (current)'
+ else:
+ data.append([vdb_retrieval.VDB_DATASET_NOT_AVAILABLE + ' Is pointer file content right? : ' + self.data_store_path,'',False])
+
+ """
+ globalFound = False
+ for i in range(len(data)):
+ if data[i][2] == True:
+ globalFound = True
+ break
+
+ if globalFound == False:
+ data[0][2] = True # And select it if no other date has been selected
+
+ """
+
+ return data
+
+
+ def get_version(self, version_name):
+ # All subclasses must define this.
+ pass

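The date-based selection documented in `get_version_options()` can be sketched on its own. This is an illustration under assumptions, not the tool's code: `vdb_common.date_matcher` is not shown in this changeset, so its behaviour is approximated here from the docstring (pick the newest version created at or before the retrieval date, fall back to the oldest version if the date precedes all of them, and pick the newest version when no date is given). The name `select_version` is hypothetical.

```python
def select_version(versions, retrieval_date=None):
    """Pick a version name from a newest-first list of {'name', 'created'} dicts.

    'created' is a unix timestamp; retrieval_date is a unix timestamp or None.
    """
    if not versions:
        return None
    if not retrieval_date:
        # No date filter: the first (newest) entry is selected.
        return versions[0]['name']
    for item in versions:
        # First entry at or before the requested date wins.
        if float(item['created']) <= retrieval_date:
            return item['name']
    # Requested date precedes every version: fall back to the oldest one.
    return versions[-1]['name']


if __name__ == '__main__':
    sample = [
        {'name': 'v3', 'created': 300},
        {'name': 'v2', 'created': 200},
        {'name': 'v1', 'created': 100},
    ]
    print(select_version(sample, 250))
```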
diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_folder.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_stores/vdb_folder.py Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,90 @@
+#!/usr/bin/python
+## ******************************* FILE FOLDER *********************************
+##
+
+import re
+import os
+import vdb_common
+import vdb_data_stores
+
+class VDBFolderDataStore(vdb_data_stores.VDBDataStore):
+
+ versions = None
+ library = None
+
+ def __init__(self, retrieval_obj, spec_file_id):
+ """
+ Provides list of available versions where each version is indicated by the
+ existence of a folder that contains its content.  This content can be directly
+ in the galaxy Versioned Data folder tree, OR it can be linked to another folder
+ on the server.  In the latter case, galaxy will treat the Versioned Data folders as caches.
+ View of versions filters out any folders that are used for derivative data caching.
+ """
+ super(VDBFolderDataStore, self).__init__(retrieval_obj, spec_file_id)
+
+ self.library = retrieval_obj.library
+
+ versions = []
+
+ # If data source spec file has no content, use the library folder directly.
+ # Name of EACH subfolder should be a label for each version, including date/time and version id.
+ if self.data_store_path == '':
+ try:
+ lib_label_len = len(self.library_label_path) +1
+
+ for item in self.library:
+ # If item is under library_label_path ...
+ if item['name'][0:lib_label_len] == self.library_label_path + '/':
+ item_name = item['name'][lib_label_len:len(item['name'])]
+ if item_name.find('/') == -1 and item_name.find('_') != -1:
+ (item_date, item_version) = item_name.split('_',1)
+ created = vdb_common.parse_date(item_date)
+ versions.append({'name':item_name, 'id':item_name, 'created': created})
+
+ except Exception as err:
+ # This is the first call to api so api url or authentication error can happen here.
+ versions.append({
+ 'name':'Software Error: Unable to get version list: ' + err.message,
+ 'id':'',
+ 'created':''
+ })
+
+ else:
+
+ base_file_path = self.data_store_path
+ #base_file_path = os.path.dirname(self.base_file_name)
+ #Here we need to get directory listing of linked file location.
+ for item_name in os.listdir(base_file_path): # Includes files and folders
+ # Only interested in folders
+ if os.path.isdir( os.path.join(base_file_path, item_name)) and  item_name.find('_') != -1:
+ (item_date, item_version) = item_name.split('_',1)
+ created = vdb_common.parse_date(item_date)
+ versions.append({'name':item_name, 'id':item_name, 'created': created})
+
+
+ self.versions = sorted(versions, key=lambda x: x['name'], reverse=True)
+
+ def get_version(self, version_name):
+ """
+ Return server path of requested version info - BUT ONLY IF IT IS LINKED.
+ IF NOT LINKED, returns None for self.version_path
+
+ QUESTION: DOES GALAXY AUTOMATICALLY HANDLE tar.gz/zip decompression?
+
+ @uses library_label_path string Full hierarchic label of a library file or folder, PARENT of version id folder.
+ @uses base_file_name string Server absolute path to data_store spec file
+
+ @param version_name alphanumeric string (version folder name)
+ """
+ self.version_label = version_name
+ self.library_version_path = os.path.join(self.library_label_path, self.version_label)
+
+ if self.data_store_path == '':
+ # In this case version content is held in library directly;
+ self.version_path = self.base_file_name
+
+ else:
+
+ #linked to some other folder, spec is location of base_file_name
+ self.version_path = os.path.join(self.data_store_path, version_name)
+

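The folder data store above relies on version folder names of the form `[date]_[version id]`, split on the first underscore only. A small sketch of that rule follows; `split_version_folder` is a hypothetical helper name, and date parsing (done by `vdb_common.parse_date` in the tool) is deliberately left out because that function is not shown here.

```python
def split_version_folder(folder_name):
    """Split '[date]_[version id]' into (date, version id), or None if not a version folder."""
    # Folders without an underscore are ignored by the data store.
    if '_' not in folder_name:
        return None
    # Split on the first underscore only, so version ids may contain underscores.
    item_date, item_version = folder_name.split('_', 1)
    return item_date, item_version


if __name__ == '__main__':
    print(split_version_folder('2005-01-05_v1'))
```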
diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_git.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_stores/vdb_git.py Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,106 @@
+#!/usr/bin/python
+## ********************************* GIT ARCHIVE *******************************
+##
+
+import os, sys
+import vdb_common
+import vdb_data_stores
+import subprocess
+import datetime
+
+class VDBGitDataStore(vdb_data_stores.VDBDataStore):
+
+ def __init__(self, retrieval_obj, spec_file_id):
+ """
+ Archive is expected to be in "master/" subfolder of data_store_path on server.
+ """
+ super(VDBGitDataStore, self).__init__(retrieval_obj, spec_file_id)
+ self.command = 'git'
+ gitPath = self.data_store_path + 'master/'
+ if not os.path.isdir(os.path.join(gitPath,'.git') ):
+ print "Error: Unable to locate git archive file: " + gitPath
+ sys.exit(1)
+
+ command = [self.command, '--git-dir=' + gitPath + '.git', '--work-tree=' + gitPath, 'for-each-ref','--sort=-*committerdate', "--format=%(*committerdate:raw) %(refname)"]
+ # to list just 1 id: command.append('refs/tags/' + version_id)
+ # git --git-dir=/.../NCBI_16S/master/.git --work-tree=/.../NCBI_16S/master/ tag
+ items, error = subprocess.Popen(command,stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
+ items = items.split("\n") #Loop through list of tags
+ versions = []
+ for ptr, item in enumerate(items):
+
+ # Ignore master branch name; time is included as separate field in all other cases
+ if item.strip().find(" ") >0:
+ (vtime, voffset, name) = item.split(" ")
+ created = vdb_common.get_unix_time(vtime, voffset)
+ item_name = name[10:] #strip 'refs/tags/' part off
+ versions.append({'name':item_name, 'id':item_name, 'created': created})
+
+ self.versions = versions
+
+
+ def get_version(self, version_name):
+ """
+ Returns server folder path to version folder containing git files for a given version_id (git tag)
+
+ FUTURE: TO AVOID USE CONFLICTS FOR VERSION RETRIEVAL, GIT CLONE INTO TEMP FOLDER
+ with -s / --shared and -n / --no-checkout (to avoid head build) THEN CHECKOUT version
+ ...
+ REMOVE CLONE GIT REPO
+
+ @param galaxy_instance object A Bioblend galaxy instance
+ @param library_id string Identifier for a galaxy data library
+ @param library_label_path string Full hierarchic label of a library file or folder, PARENT of version id folder.
+
+ @param base_folder_id string a library folder id under which version files should exist
+ @param version_id alphanumeric string (git tag)
+
+ """
+ version = self.get_metadata_version(version_name)
+
+ if not version:
+ print 'Error: Galaxy was not able to find the given version id in the %s data store.' % self.data_store_path
+ sys.exit( 1 )
+
+ version_name = version['name']
+ self.version_path = os.path.join(self.data_store_path, version_name)
+ self.version_label = vdb_common.lightDate(version['created']) + '_v' + version_name
+ self.library_version_path = os.path.join(self.library_label_path, self.version_label)
+
+ # If Data Library Versioned Data folder doesn't exist for this version, then create it
+ if not os.path.exists(self.version_path):
+ try:
+ os.mkdir(self.version_path)
+ except:
+ print 'Error: Galaxy was not able to create data store folder "%s".  Check permissions?' % self.version_path
+ sys.exit( 1 )
+
+ if os.listdir(self.version_path) == []:
+
+ git_path = self.data_store_path + 'master/'
+ # RETRIEVE LIST OF FILES FOR GIVEN GIT TAG (using "ls-tree".
+ # It can happen independently of git checkout)
+
+ command = [self.command,  '--git-dir=%s/.git' % git_path, 'ls-tree','--name-only','-r', version_name]
+ items, error = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
+ git_files = items.split('\n')
+
+ # PERFORM GIT CHECKOUT
+ command = [self.command, '--git-dir=%s/.git' % git_path, '--work-tree=%s' % git_path, 'checkout', version_name]
+ results, error = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
+
+ vdb_common.move_files(git_path, self.version_path, git_files)
+
+
+
+ def get_metadata_version(self, version_name=''):
+ if version_name == '':
+ return self.versions[0]
+
+ for version in self.versions:
+ if str(version['name']) == version_name:
+ return version
+
+ return False
+
+

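As a rough illustration of how the `git for-each-ref` output is turned into a version list above, the sketch below parses lines of the form `[unix time] [utc offset] refs/tags/[name]` and skips refs that carry no tag date (such as the master branch). `parse_tag_refs` is a hypothetical name, the sample output is made up, and the conversion done by `vdb_common.get_unix_time` is not reproduced; the raw timestamp and offset are kept as-is.

```python
def parse_tag_refs(output):
    """Parse 'time offset refs/tags/name' lines into a list of version dicts."""
    prefix = 'refs/tags/'
    versions = []
    for line in output.split('\n'):
        fields = line.split()
        # Lines without a date (e.g. branch refs) have fewer than three fields.
        if len(fields) != 3 or not fields[2].startswith(prefix):
            continue
        vtime, voffset, refname = fields
        versions.append({
            'name': refname[len(prefix):],  # strip 'refs/tags/'
            'created': int(vtime),
            'offset': voffset,
        })
    return versions


if __name__ == '__main__':
    sample = '1439150870 -0400 refs/tags/v2\n1420000000 -0500 refs/tags/v1\n refs/heads/master\n'
    for version in parse_tag_refs(sample):
        print(version)
```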
diff -r d31a1bd74e63 -r 5c5027485f7d data_stores/vdb_kipper.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data_stores/vdb_kipper.py Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,102 @@
+#!/usr/bin/python
+## ********************************* Kipper *************************************
+##
+
+import vdb_common
+import vdb_data_stores
+import subprocess
+import json
+import glob
+import os
+import sys
+
+class VDBKipperDataStore(vdb_data_stores.VDBDataStore):
+
+ metadata = None
+ versions = None
+ command = None
+
+ def __init__(self, retrieval_obj, spec_file_id):
+ """
+ Metadata (version list) is expected in "master/" subfolder of data_store_path on server.
+
+ Future: allow for more than one database / .md file in a folder?
+ """
+ super(VDBKipperDataStore, self).__init__(retrieval_obj, spec_file_id)
+ # Ensure we're working with this tool's version of Kipper.
+ self.command = os.path.join(os.path.dirname(sys._getframe().f_code.co_filename),'kipper.py')
+ self.versions = []
+
+ metadata_path = os.path.join(self.data_store_path,'master','*.md')
+ metadata_files = glob.glob(metadata_path)
+
+ if len(metadata_files) == 0:
+ return # Handled by empty list error in versioned_data_form.py
+
+ metadata_file = metadata_files[0]
+
+ with open(metadata_file,'r') as metadata_handle:
+ self.metadata = json.load(metadata_handle)
+ for volume in self.metadata['volumes']:
+ self.versions.extend(volume['versions'])
+
+ if len(self.versions) == 0:
+ print "Error: Unable to locate metadata file: " + metadata_path
+ sys.exit(1)
+
+ self.versions = sorted(self.versions, key=lambda x: x['id'], reverse=True)
+
+
+ def get_version(self, version_name):
+ """
+ Trigger populating of the appropriate version folder.
+ Returns path information to version folder, and its appropriate library label.
+
+ @param version_id string
+ """
+
+ version = self.get_metadata_version(version_name)
+
+ if not version:
+ print 'Error: Galaxy was not able to find the given version id in the %s data store.' % self.data_store_path
+ sys.exit( 1 )
+
+ version_name = version['name']
+ self.version_path = os.path.join(self.data_store_path, version_name)
+ self.version_label = vdb_common.lightDate(version['created']) + '_v' + version_name
+
+ print self.library_label_path
+ print self.version_label
+
+ self.library_version_path = os.path.join(self.library_label_path, self.version_label)
+
+ if not os.path.exists(self.version_path):
+ try:
+ os.mkdir(self.version_path)
+ except:
+ print 'Error: Galaxy was not able to create data store folder "%s".  Check permissions?' % self.version_path
+ sys.exit( 1 )
+
+ # Generate cache if folder is empty (can take a while):
+ if os.listdir(self.version_path) == []:
+
+ db_file = os.path.join(self.data_store_path, 'master', self.metadata["db_file_name"] )
+ command = [self.command, db_file, '-e','-I', version_name, '-o', self.version_path ]
+
+ try:
+ result = subprocess.call(command)
+ except:
+ print 'Error: Galaxy was not able to run the kipper.py program successfully for this job: "%s".  Check permissions?' % self.version_path
+ sys.exit( 1 )
+
+
+ def get_metadata_version(self, version_name=''):
+ if version_name == '':
+ return self.versions[0]
+
+ for version in self.versions:
+ if str(version['name']) == version_name:
+ return version
+
+ return False
+

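The Kipper data store above builds its version list by flattening the `versions` arrays of every entry in the metadata file's `volumes` list and sorting by `id`, newest first. A self-contained sketch of that step follows; the key names `volumes`, `versions`, `id`, and `name` come from the code above, while the function name `list_kipper_versions` and the sample JSON values are hypothetical and not taken from a real Kipper `.md` file.

```python
import json


def list_kipper_versions(metadata_text):
    """Flatten all volume version lists from Kipper metadata JSON, highest id first."""
    metadata = json.loads(metadata_text)
    versions = []
    for volume in metadata['volumes']:
        versions.extend(volume['versions'])
    return sorted(versions, key=lambda v: v['id'], reverse=True)


if __name__ == '__main__':
    sample = json.dumps({
        'volumes': [
            {'versions': [{'id': 1, 'name': '1'}, {'id': 2, 'name': '2'}]},
            {'versions': [{'id': 3, 'name': '3'}]},
        ]
    })
    print([v['name'] for v in list_kipper_versions(sample)])
```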
diff -r d31a1bd74e63 -r 5c5027485f7d doc/README.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/README.md Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,16 @@
+## Documentation Index
+0. [Overview](../README.md)
+1. [Setup for Admins](setup.md)
+  1. [Galaxy tool installation](galaxy_tool_install.md)
+  2. [Server data stores](data_stores.md)
+  4. [Galaxy "Versioned Data" library setup](galaxy_library.md)
+  5. [Workflow configuration](workflows.md)
+  6. [Permissions, security, and maintenance](maintenance.md)
+  7. [Problem solving](problem_solving.md)
+2. [Using the Galaxy Versioned data tool](galaxy_tool.md)
+3. [System Design](design.md)
+4. [Background Research](background.md)
+5. [Server data store and galaxy library organization](data_store_org.md)
+6. [Data Provenance and Reproducibility](data_provenance.md)
+7. [Caching System](caching.md)
+

diff -r d31a1bd74e63 -r 5c5027485f7d doc/background.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/background.md Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,44 @@
+# Background Research
+
+### Git Archiving
+
+Git was investigated as a generic solution for versioning textual files but in a number of cases its performance was detrimental.  Initially the git archive option seemed to fail on fasta files of any size.  In tests git was replacing one version with another by a diff that contained deletes of every original line followed by inserts of every line in new file.
+
+By reformatting the incoming fasta file so that each line held one fasta record >[sequence id] [description] [sequence], git's diff algorithm seemed to regain its efficiency. The traditional format:
+
+```
+>gi|324234235|whatever a description
+ATGCCGTAAGATTC ...
+ATGCCGTAAGATTC ...
+ATGCCGTAAGATTC ...
+>gi|25345345|another entry
+etc.
+```
+
+was converted to a 1 line format:
+
+```
+>gi|324234235|whatever a description [tab] ATGCCGTAAGATTCATGCCGTAAGATTCATGCCGTAAGATTC ...
+>gi|25345345|another entry ...
+```
+
+This approach seemed to archive fasta files very efficiently.  However, it turned out this only worked for some kinds of fasta data.  Git's file differential algorithm failed, possibly because of a lack of variation from line to line of a file; it may need lines to have different lengths, word counts, or variation in character content.  (It seemed to be ok for some nucleotide data, but we found it occasionally failed on uniprot v1 to v11 protein data.)
+
+There is a git feature to disable diff analysis for files above a certain file size, which then makes git archive size comparable to a straight text-file compression algorithm (zip/gzip).  Overall, a practical limit of 15 GB has been reported for git archive size.  The main barrier seems to be content format; archive retrieval does get slower over time if diffs are being calculated.
+
+Git also has very high memory requirements to do its processing, which is a server liability.  Tests indicate that for some inputs, git is successful for file sizes up to at least a few gigabytes.
+
+### XML Archiving
+
+We also examined the versioning of XML content.  The XArch: http://xarch.sourceforge.net/ approach is very promising as a generic solution, though it entails slow processing due to the nature of xml documents.  It compares current and past xml documents and datestamps the additions/deletions to an xml structure.  A good article summarizing the XML archive problem: http://useless-factor.blogspot.ca/2008/01/matching-diffing-and-merging-xml.html
+
+
+### Other Database Solutions
+
+A number of key-value databases exist (e.g. LevelDB, LMDB, and DiscoDB) but they are designed to gain quick access to individual key-value entries, not to query properties of each key essential to version construction, namely to filter every key in the database by when it was created or deleted.  Some key-value databases have features that could support speedy version retrieval.
+
+Google's LevelDB - a straight key-value file based database - may work, though we would need to think up a scheme for including create & delete times in the key for each fasta identifier.  Possibly 1 key=fasta id record indicating existence of fasta identifier sequence; and subsequent fastaId.created.deleted key as per the keydb solution.  An advantage here is that compression is built-in.  In the case where incoming data consists of a full file to synchronize with, inserts are easy to account for - but deletes can be very tricky since finding them involves scanning through the entire LevelDB key list, with the assumption that the keys are sorted the same way as the incoming data (the keydb approach) (VERIFY THIS???)
+
+For processing an incoming entire version of key-value content, the functionality required is that the keys are stored in a reliable sort order, and that they can be read fast sequentially.  During this loop we can retrieve all the create/update/delete transactions of each key by version or date.  (Usually the db has no ability to structure the value data though, so we would usually have to create such a schema).  This is basically what keydb does.
+
+For processing separate incoming files that contain only inserts and deletes, the job is much the same, only we skip swaths of the master database content.  At some point it becomes more attractive performance-wise to implement an indexed key value store, thus avoiding the sorting and sequential access necessary for simpler key-value dbs.

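The one-line fasta reformatting described in the Git Archiving section of background.md can be sketched as follows. This is an illustrative reimplementation only, not the contents of `data_stores/fasta_format.py` from this changeset; the function name `fasta_to_one_line` is hypothetical. Each record becomes a single line: the header, a tab, then the concatenated sequence.

```python
def fasta_to_one_line(lines):
    """Convert multi-line fasta records into one 'header<TAB>sequence' string per record."""
    records = []
    header, seq = None, []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(header + '\t' + ''.join(seq))
            header, seq = line, []
        elif line:
            seq.append(line.strip())
    # Flush the final record.
    if header is not None:
        records.append(header + '\t' + ''.join(seq))
    return records


if __name__ == '__main__':
    sample = ['>id1 a description', 'ATGCCG', 'TAAGAT', '>id2 another entry', 'TTAGGC']
    for record in fasta_to_one_line(sample):
        print(record)
```

Keeping each record on one line means an unchanged record is an unchanged line, which is the property a line-based diff needs.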
diff -r d31a1bd74e63 -r 5c5027485f7d doc/bad_history_item.png

Binary file doc/bad_history_item.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/caching.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/caching.md Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,15 @@
+# Caching System
+
+The Versioned Data tool caching system operates in two stages, one covering server folders that contain data store version retrievals, and the second covering Galaxy's "Versioned Data" data library folder behaviour.  The "Versioned Data" library folder files mainly link to the server folders having the respective content, so cache management consists of creating/deleting the server's versioned data as well as Galaxy's link to it.  In addition, the "Versioned Data" library has a "Workflow Cache" folder which stores all derivative (output) workflow datasets as requested by Versioned Data tool users.
+
+The program that periodically clears out the Galaxy link and server data cache is called *versioned_data_cache_clear.py*, and should be set up by a Galaxy administrator to run on a monthly basis, say.  It leaves each database's latest version in the cache.
+
+---
+
+### Galaxy behaviour when user history references to missing cached data occur
+
+If a server data store version folder is deleted, the galaxy data library and user history references to it will be broken.  The user will see a "Not Found" display in their workspace when clicking on a history dataset that links to the deleted folder.  The user can simply rerun the "Versioned Data Retrieval" tool to restore the data.
+
+![message when selected dataset file no longer exists on the server](bad_history_item.png)
+
+In the example above, clicking on the view (eye) icon for "155:16s_rdp.fasta" triggered this message.  To remedy the situation, the user simply reruns the "154: Versioned Data Retrieval" step below to regenerate the dataset(s).

diff -r d31a1bd74e63 -r 5c5027485f7d doc/data_provenance.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/data_provenance.md Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,12 @@
+## Data Provenance and Reproducibility
+
+When a user selects a particular version id or date for versioned data retrieval, this is recorded for future reference, and can be seen in a history item's "View details" (info icon) report, in the "Input Parameters" section. But if a user left the global date field blank or didn't select a particular version of a data source, they or another user can still rerun a Versioned Data retrieval to recreate the results by noting the original history item's view details "Created" date and entering it into the global retrieval date of the form.
+
+![A history dataset has a detailed view link](history_view_details.png)
+
+![Data provenance information is available in the detail view](history_dataset_details.png)
+
+Also, particular dates/versions of a Versioned Data history item's retrieved data are shown in its "Edit Attributes" (pencil icon) report in the "Info" field.
+
+Because Galaxy also preserves the version id of any galaxy tool it runs (e.g. the makeblastdb version #), rerunning a history/workflow that has these tools should also apply the appropriate software version to generate the secondary data as well.
+However, the tool version ids contained within a workflow are not recorded by the versioned data tool per se; they exist only in the selected workflow's design template, so some care must be taken to freeze or version any workflows used to generate derivative data.

diff -r d31a1bd74e63 -r 5c5027485f7d doc/data_stores.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/data_stores.md Sun Aug 09 16:07:50 2015 -0400

[

@@ -0,0 +1,96 @@
+# Server Data Stores
+
+These folders hold the git, Kipper, and plain folder data stores.  A data store connection directly to a Biomaj databank is also possible, though it is limited to those databanks that have .fasta files in their /flat/ subfolder.  How you want to store your data depends on the data itself (size, format) and its frequency of use; generally large fasta databases should be held in a Kipper data store.  Generally, each git or Kipper data store needs to have:
+* An appropriately named folder;
+* A sub-folder called "master/", with the data store files set up in it;
+* A file called "pointer.[type of data store]" which contains the server path of the data store folder it sits in.  This file will be linked into a galaxy data library.
+
+Note that all these folders need to be accessible by the same user that runs your Galaxy installation for Galaxy integration to work.  Specifically Galaxy needs recursive rwx permissions on these folders.
+
+To start, you may want to set up the Kipper RDP RNA database (included in the Kipper repository RDP-test-case folder, see README.md file there).  It comes complete with the folder structure and files for the Kipper example that follows.  You will have to adjust the pointer.kipper file content appropriately.
+
+The data store folders can be placed within a single folder, or may be in different locations on the server as desired.  In our example versioned data stores are all located under /projects2/reference_dbs/versioned/ .
+
+## Data Store Examples
+
+### Kipper data store example: A data store for the RDP RNA database:
+```bash
+  /projects2/reference_dbs/versioned/RDP_RNA/
+  /projects2/reference_dbs/versioned/RDP_RNA/pointer.kipper
+  /projects2/reference_dbs/versioned/RDP_RNA/master/
+  /projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna_1
+  /projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna_2
+  /projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna.md
+```
+To start a Kipper data store from scratch, go into the master folder, initialize a Kipper data store there, and import a version of a content file into the Kipper db.
+
+```bash
+cd master/
+kipper.py rdp_rna -M fasta
+kipper.py rdp_rna -i [file to import] -o .
+```
+
+Kipper stores and retrieves just one file at a time. Currently there is no provision for retrieving multiple files from different Kipper data stores at once.
+
+Note that very large temporary files can be generated during the archive/recall process.  For example, a compressed 10 GB NCBI "nr" input fasta file may be re-sorted and reformatted; or a 10 GB file may be transformed from Kipper format and output to a file.  For this reason we have located any necessary temporary files in the input and output folders specified in the Kipper command line.  (The system /tmp/ folder can be too small to fit them).
+
+
+### Folder data store example
+
+In this scenario the data we want to archive probably isn't of a key-value nature, nor is it amenable to diff storage via git, so we're storing each version as a separate file.  We don't need the "master" sub-folder since there is no master database, but we do need 1 additional folder to store each version's file(s).
+
+The version folder names must be in the format **[date]_[version id]** to convey to users the date and version id of each version.  The folder names will be displayed directly in the Galaxy Versioned Data tool's selectable list of versions. Note that this can allow for various date and time granularity, e.g. "2005_v1" and "2005-01-05 10:24_v1" are both acceptable folder names. Note that several versions can be published on the same day.
+
+Example of a refseq50 protein database as a folder data store:
+
+```bash
+/projects/reference_dbs/versioned/refseq50/
+/projects/reference_dbs/versioned/refseq50/pointer.folder
+/projects/reference_dbs/versioned/refseq50/2005-01-05_v1/file.fasta
+/projects/reference_dbs/versioned/refseq50/2005-01-05_v2/file.fasta
+/projects/reference_dbs/versioned/refseq50/2005-02-04_v3/file.fasta
+...
+/projects/reference_dbs/versioned/refseq50/2005-05-24_v4/file.fasta
+...
+/projects/reference_dbs/versioned/refseq50/2005-09-27_v5/file.fasta
+```
+
+etc...
+
+A data store of type "folder" doesn't have to be stored outside of galaxy.  Exactly the same folder structure can be set up directly within the galaxy data library, and files can be uploaded inside them.  The one drawback to this approach is that then other (non-galaxy platform) server users can't have easy access to version data.
+
+Needless to say, administrators should **never delete these files since they are not cached!**
+
+### Git Data Store example
+
+A git data store for versions of the NCBI 16S microbial database would look like:
+
+```
+/projects2/reference_dbs/versioned/ncbi_16S/
+/projects2/reference_dbs/versioned/ncbi_16S/pointer.git
+/projects2/reference_dbs/versioned/ncbi_16S/master/.git (hidden file)
+```
+
+One must initialize a git repository (or clone one) into the master/ folder.  The Versioned Data system depends on use of git 'tags' to specify versions of data.  See the Git Data Store section for details.
+To start a git repository from scratch, go into the master folder, initialize git there, copy versioned content into the folder, and then commit it.  Finally add a git tag that describes the version identifier.  The versioned data system only distinguishes versions by their tag name. Thus one can have several commits between versions.
+
+```
+cd master/
+git init
+cp [files from wherever] ./
+git add [those files]
+git commit -m 'various changes'
+...
+git add [changed files]
+git commit -m 'various changes'
+...
+git tag -a v1 -m v1
+```
+
+Once your tag is defined it will be listed in Galaxy as a version.
+
+
+### Biomaj data store example
+
+In this scenario the data we want versioned access to is sitting directly in the /flat/ folder of a Biomaj databank.  Each version is a separate file that Biomaj manages.  Biomaj can be set to keep all old versions alongside any new one it downloads, or it can limit the total # of versions to a fixed number (with the oldest removed when the newest arrives).
+
+*coming soon...*

diff -r d31a1bd74e63 -r 5c5027485f7d doc/design.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/design.md Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,45 @@
+# System Design
+
+## Server Versioned data store folder and Galaxy data library folder structure
+
+Galaxy's data library will be the main portal to the server locations where versioned data are kept.  We have designed the system so that versioned data sources usually exist outside of Galaxy in "data store folders".  Because the databases involved are large, and because we are mostly concerned with giving tools direct access to this data or to indexes on it, we use the file system to hold and organize the data store and derived products rather than a SQL or NoSQL database warehouse.
+
+Our versioned data tool scans the Galaxy data library for folders that signal where these data stores are. It only searches for content inside a data library called "Versioned Data". The folder hierarchy under this is flexible - a Galaxy admin can create a hierarchy to their liking, dividing versioned data first by data source (NCBI, 3rd party, etc.) or by type (viral, bacterial, RNA, eukaryote, protein, nucleotide, etc.). Eventually a "versioned data folder" is encountered in the hierarchy.  It has a specific format:
+
+A versioned data folder has a marker file that indicates what kind of archival technology is at work in it.  The marker file also yields the server file path where the archive data exists. (The data library folders themselves are just textual names and so don't provide any information for sniffing where they are or what they contain.)
+
+* pointer.kipper signals a Kipper-managed data store.
+
+* pointer.git signals a git-managed data store.
+
+* pointer.folder signals a basic versioned file storage folder.  No caching mechanism applies here; files are permanent unless deleted manually by an admin.  The content of this file either points to a server folder, or is left empty to indicate that the library itself should be used directly to store the permanent versioned data folder content.
+
+* pointer.biomaj signals a basic versioned file storage folder that points directly to Biomaj databank folders.  No caching mechanism applies here; versioned data folders exist to the extent that Biomaj has been configured to keep them.  Each data bank version folder will be examined for content in its /flat/ subfolder; those files will be listed as data to retrieve.
+
+The marker file can have permissions set so that only Galaxy admins can see it.
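The marker file convention above can be sketched as a small lookup. This is only an illustration in Python, assuming the four marker names listed above are the complete set; `STORE_TYPES` and `data_store_type` are hypothetical names, not taken from the tool's code.

```python
import os

# Marker file suffixes named in the list above.
STORE_TYPES = ("kipper", "git", "folder", "biomaj")


def data_store_type(filename):
    """Return the data store type signalled by a marker file name, or None."""
    base = os.path.basename(filename)
    # Split "pointer.kipper" into "pointer" and "kipper" at the first dot.
    prefix, _, suffix = base.partition(".")
    if prefix == "pointer" and suffix in STORE_TYPES:
        return suffix
    return None
```

Any other file found in a library folder yields `None`, so that folder would simply be treated as part of the admin's own hierarchy.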
+
+Under the versioned data folder a user can see archive folders whose contents are linked to caches of particular archive data versions (version date and id indicated in folder name).  More than one sub-folder of archived data can exist since different users may have them in use.  The archived data are a COPY of the master archive folder at cache-generation time.
+
+For example, if using git to retrieve a version, the git database for that version is recalled, then a server folder is created to cache that archive version, and all of the git archive's contents are copied over to it.  A Galaxy Versioned Data library cached folder's files are usually symlinked to the archive copy folder elsewhere on the server (this is designed so the system can also be used independently of Galaxy by other command line tools).
+
+If the archiving system has a cached version of a particular date available, this is marked by the presence of a "YYYY[-MM-DD-HH-MM-SS]_[VERSION ID]" folder.  If this folder does not exist, the archive needs to be regenerated. Archive dates can be as basic as year, or can include finer granularity of month, day, hour etc.
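As an illustration of the folder naming convention just described, a cached version folder name can be split into its date part and version id with a regular expression. This is a hedged sketch in Python: the pattern and the name `parse_version_folder` are assumptions made for this example, and the tool may parse folder names differently.

```python
import re

# A four-digit year, up to five further two-digit parts (month, day, hour,
# minute, second), an underscore, then the version id.
VERSION_FOLDER = re.compile(r"^(\d{4}(?:-\d{2}){0,5})_(.+)$")


def parse_version_folder(name):
    """Return (date_part, version_id) for a cached version folder name.

    Returns None for names that do not follow the convention, e.g. "master".
    """
    match = VERSION_FOLDER.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

A folder such as "2014-01-01_v5" splits into the date "2014-01-01" and the version id "v5", while a year-only name such as "2005_v1" is also accepted.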
+
+In the server data store, the folder hierarchy for a data store basically provides file storage for the data store, as well as specifically named folders for cached data.
+
+A data store can be broken into separate volumes such that a particular volume covers a range of version ids. As well, to facilitate parallel processing, a client server can hold a master database in a set of files distinguished by the range of keys they cover, e.g. one master_a.kipper handles all keys beginning with 'a'.
+
+```
+/.../[database name]/
+  master/[datastore name]_[volume #]
+  master/[datastore name]_[volume #]_[key prefix]     (If database is chunked by key prefix)
+  [version date]_[version_id]/                        (A particular extracted version's data e.g. "2014-01-01_v5/file 1 .... file N")
+```
+
+Derivative data like blast databases are stored under a folder called "Workflow cache" that exists immediately under the versioned data library.  The folder hierarchy here consists of a folder for each workflow by workflow id, followed by a folder coded by the ids of each of the workflow's inputs; these reference dataset ids that exist in the versioned data library.  The combination of workflow id and its input dataset ids yields a cache of one or more output files that we can count on to be the correct output of the workflow for those versioned data inputs.  In the future, this derivative data could be accessed from links under a similar folder structure in the server data store folders.
+
+```
+  workflow_cache/                                     Contains links to galaxy generated workflow products like blast databases.
+    workflow_id
+      [dataset_id]_[dataset_id]...   (inputs to the workflow coded in folder name)
+        Workflow output files for given dataset(s)
+```

diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_data_library.png

Binary file doc/galaxy_data_library.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_library.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/galaxy_library.md Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,36 @@
+## Galaxy "Versioned Data" Library Setup
+
+1. Create a Galaxy data library called "Versioned Data".  The name is important - it must be located by the Versioned Data tool.
+
+![Galaxy Data Library](galaxy_data_library.png)
+
+2. Set permissions so the special "versioneddata@localhost.com" user has add/update permissions on the "Versioned Data" library.
+
+3. Then add any folder structure you want under Versioned Data.  Top level folders could be "Bacteria, Virus, Eukaryote", or "NCBI, ... ".  Underlying folders can hold versioned data for particular bacteria / virus databases, e.g. "NCBI nt".
+
+4. Connect a **"pointer.[data store type]"** file (a simple text file) to any folder on your server that you want to activate as a versioned data store.  Any folder that has a "pointer.[data store type]" file in it will be treated as a folder containing versioned content, as illustrated above.  These folders will then be listed (by name) in the Versioned Data tool's list of data stores. Within these folders, links to caches of retrieved versioned data will be kept (shown as "cached data" items in illustration).  For "folder" and "biomaj" data stores, links will be to permanent files, not cached ones.
+
+For example, on Galaxy page Shared Data > Data Libraries > Versioned Data, there is a folder/file:
+
+  `Bacterial/RDP RNA/pointer.kipper`
+
+which contains one line of text:
+
+  `/projects2/ref_databases/versioned/rdp_rna/`
+
+That enables the Versioned Data tool to list the Kipper archive as "RDP RNA" (name of folder that pointer file is in), and to know where its data store is. Beneath this folder other cached folders will accumulate.
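The relationship between a pointer file, the listed name, and the data store path can be sketched in Python as follows. This sketch assumes the pointer file is readable at a local file system path that mirrors the library folder; the tool itself works through the Galaxy data library, and `read_pointer` is a hypothetical name.

```python
import os


def read_pointer(pointer_path):
    """Return (store_name, store_path) for a pointer file.

    store_name is the name of the folder that holds the pointer file;
    store_path is the first line of the file ('' if the file is empty).
    """
    with open(pointer_path) as handle:
        store_path = handle.readline().strip()
    parent = os.path.dirname(os.path.abspath(pointer_path))
    store_name = os.path.basename(parent)
    return store_name, store_path
```

For the example above, a `pointer.kipper` file inside a folder named "RDP RNA" would give the name "RDP RNA" and the path `/projects2/ref_databases/versioned/rdp_rna/`.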
+
+The "upload files to a data library" page (shown below) has the ability to link to the pointer file directly in the data store folder.  Select the displayed "Upload files from filesystem paths" option (available only if you access this form from the admin menu in Galaxy).  Enter the path and FILENAME (literally "pointer.kipper" in this case) of your pointer file. If you forget the filename, Galaxy will link all the content files in that folder, and these links will need to be deleted before continuing.
+
+![Link pointer file to Versioned Data subfolder](library_dataset_upload.png)
+
+Preferably select the "Link to files without copying into Galaxy" option as well.  This isn't required, but linking to the pointer file enables easier diagnosis of folder path problems.
+
+Then submit the form; if an error occurs then verify that the pointer file path is correct.
+
+**Note: a folder called "Workflow Cache" is automatically created within the Versioned Data folder to hold cached workflow results as triggered by the Versioned Data tool. No maintenance of this folder is needed.**
+
+### Direct use of Data Library Folder
+
+Note: The "folder" data store is a simple data store type - marking a folder this way enables the Versioned Data tool to directly list the subfolders of this Library folder (by name), with the expectation that each folder's contents can be placed in the user's history directly.  The one optional twist to this is that if the folder's pointer.folder file is sym-linked to another folder on the server, then folder contents will be retrieved from the symlinked location.
+

diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_tool.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/galaxy_tool.md Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,26 @@
+# The Galaxy Versioned Data Tool
+
+This tool retrieves links to current or past versions of fasta or other types of data from a cache kept in the Galaxy data library called "Versioned Data". It then places them into the current history so that subsequent tools can work with that data. A blast search can be carried out on a version of a fasta database from a year ago for example.
+
+![Galaxy Versioned Data Tool](versioned_data_retrieval.png)
+
+You can select one or more files by version date or id. (This list is supplied from the Shared Data > Data Libraries > Versioned Data folder that has been set up by a Galaxy administrator).
+
+In the versioned data tool, the user selects a data source, and then selects a version to retrieve (by date or version id).
+If a cached version of that database exists, it is linked into the user's history.
+Otherwise a new version of it is created, placed in cache, and linked into history.
+The Versioned Data form starts with an optional top-level "Global retrieval date" which is applied to all selected databases. This can be overridden by a retrieval date or version that you supply for a particular database.
+
+Finally, if you just select a data source to retrieve, but no global retrieval date or particular versions, the most recent version of the selected data source will be retrieved.
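The selection rules above (a per-database choice overrides the global retrieval date, and no date at all means the most recent version) can be sketched in Python. This is an illustration only, not the tool's implementation: `choose_version` is a hypothetical name, versions are assumed to be given as full "YYYY-MM-DD" date strings, and the rule that a date selects the latest version created on or before it is an assumption.

```python
def choose_version(versions, global_date=None, override=None):
    """Pick a version id from `versions`, a list of (date_string, version_id).

    Dates are "YYYY-MM-DD" strings, so plain string comparison orders them
    chronologically.
      override    - a date or version id given for this particular database;
                    it takes precedence over global_date.
      global_date - the form's "Global retrieval date".
    With neither given, the most recent version is returned.
    Assumption: a date selects the latest version dated on or before it.
    """
    if not versions:
        return None
    ordered = sorted(versions)
    ids = [version_id for _, version_id in ordered]
    if override in ids:
        return override
    cutoff = override or global_date
    if cutoff is None:
        return ordered[-1][1]
    eligible = [version_id for date, version_id in ordered if date <= cutoff]
    return eligible[-1] if eligible else None
```

Returning `None` when no version is old enough is a choice made for this sketch; how the tool reports that case is not stated here.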
+
+The caching system caches both the versioned data and workflow data that the tool generates. If you request versioned data or derivative data that isn't cached, then (depending on the size of the archive) it may take time to regenerate.
+
+
+
+## Generation of workflow data
+
+The Workflows section allows you to select one or more pre-defined workflows to execute on the versioned data.  Currently this includes any workflow whose name begins with the keyword "Versioning:".  The results are placed in your history for use by other tools or workflows.
+
+Currently workflow parameters must be entirely specified ("canned") when the workflow is created/updated, rather than being specified at runtime.  This means that a separate workflow with fixed settings must be predefined for each desired retrieval process (e.g. a blastdb with regions of low complexity filtered out, which requires a few steps to execute: dustmasker + makeblastdb, etc.).
+
+Any user that needs more specific parameters for a reference database creation can just invoke the tools/steps after using the Versioned Data tool to retrieve the raw fasta data.  The only drawback in this case is that the derivative data can't be cached - it has to be redone each time the tool is run.

diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_tool_form.png

Binary file doc/galaxy_tool_form.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_tool_install.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/galaxy_tool_install.md Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,22 @@
+## Galaxy tool installation
+
+1. **This step requires a restart of Galaxy**:  Edit your Galaxy config file ("universe_wsgi.ini" in older installs):
+  1. Add the tool's default user name "**versioneddata@localhost.com**" to the admin_users list line.
+  1. Ensure the Galaxy API is activated: "enable_api = True"
+
+2. Then ensure **you** have an API key (from User > Api keys menu).
+
+3. Login to Galaxy as an administrator, and install the Versioned Data tool.  Go to the Galaxy Admin > Tool Sheds > "Search and browse tool sheds" menu.  Currently this tool is available from the tool shed at https://toolshed.g2.bx.psu.edu/view/damion/versioned_data .  The "versioned_data" repository is located under the "sequence analysis" category.
+
+4. After it is installed, display the tool form once in Galaxy - this allows the tool to do the remaining setup of its "versioneddata" user.  The tool automatically creates a galaxy API user (by default, versioneddata@localhost.com).  You will see an API key generated automatically as well.  **The tool (via Galaxy) automatically writes a copy of this key in a file called "versioneddata_api_key.txt" in its install file folder for reference; if you ever want to change the API key for the versioneddata user, you need to delete this file, and it will be regenerated automatically on next use of the tool by an admin.**
+
+5. Also, ensure that the kipper.py file has the Galaxy server's system user as its owner with executable permission ("chown galaxy kipper.py; chmod u+x kipper.py").
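The two config edits from step 1 can be checked with a short script. The following Python 3 sketch assumes that both settings live in the `[app:main]` section of the Galaxy config file; `check_galaxy_config` is a hypothetical helper and is not part of the tool.

```python
from configparser import RawConfigParser

TOOL_USER = "versioneddata@localhost.com"


def check_galaxy_config(config_text, section="app:main"):
    """Return a list of problems found in the given Galaxy config text.

    An empty list means the tool's user is an admin and the API is enabled.
    Assumption: both settings are in the section named by `section`.
    """
    parser = RawConfigParser()
    parser.read_string(config_text)
    problems = []

    admins = []
    if parser.has_option(section, "admin_users"):
        raw_admins = parser.get(section, "admin_users")
        admins = [name.strip() for name in raw_admins.split(",")]
    if TOOL_USER not in admins:
        problems.append("admin_users does not list " + TOOL_USER)

    api_enabled = (
        parser.has_option(section, "enable_api")
        and parser.get(section, "enable_api").strip().lower() == "true"
    )
    if not api_enabled:
        problems.append("enable_api is not set to True")
    return problems
```

To check a real install, read the config file's text and pass it in, for example `check_galaxy_config(open(path).read())`, where `path` is wherever your Galaxy config file lives.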
+
+
+### Command-line Kipper
+
+
+  Optional: If you want to work with Kipper files directly, e.g. to start a Kipper repository from scratch, then sym-link the kipper.py file that is in the Versioned Data tool code folder's /data_stores/ subfolder - you could link it as "/usr/bin/kipper" for example.
+  The tool path is shown in the "location" field when you click on "Versioned Data" tool in Galaxy's "admin > Administration > Manage installed tool shed repositories" report.
+
+More info on command-line kipper is available [here](https://github.com/Public-Health-Bioinformatics/kipper).

diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_versioned_data_tool.odt

Binary file doc/galaxy_versioned_data_tool.odt has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/galaxy_versioned_data_tool.pdf

Binary file doc/galaxy_versioned_data_tool.pdf has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/history_dataset_details.png

Binary file doc/history_dataset_details.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/history_view_details.png

Binary file doc/history_view_details.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/library_dataset_upload.png

Binary file doc/library_dataset_upload.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/maintenance.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/maintenance.md Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,42 @@
+# Permissions, security, and maintenance
+
+## Permissions
+
+When installing the Versioned Data tool, ensure that you are in the Galaxy config admin_users list and that you have an API key.  Run the tool at least once after install so that it can create its own "versioneddata" user if it isn't already there.  The tool uses this account to run data retrieval workflow jobs.
+
+For your server data store folders, since galaxy is responsible for triggering the generation and storage of versions in them, it will need permissions. For example, assuming "galaxy" is the user account that runs galaxy:
+
+```bash
+  chown -R galaxy /projects2/reference_dbs/versioned/
+  chmod u+rwx /projects2/reference_dbs/versioned/*
+```
+
+Otherwise you will be confronted with an error similar to the following within Galaxy when you try to do a retrieval:
+
+```bash
+  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
+    raise child_exception
+  OSError: [Errno 13] Permission denied"
+```
+
+## Cache Clearing
+
+All the retrieved data store versions (except for static "folder" data stores) get cached on the server, and all the triggered galaxy workflow runs get cached in galaxy in the Versioned Data library's "Workflow cache" folder.  A special script will clear out all but the most recent version of any data store's cached versions, and remove workflow caches as appropriate.  This script is named
+
+  `versioned_data_cache_clear.py`
+
+and is in the Versioned Data script's shed_tools folder.  It can be run as a monthly scheduled task.
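The "all but the most recent version" rule can be illustrated with a small function that decides which cached version folders a clean-up pass could remove. This Python sketch is not the logic of `versioned_data_cache_clear.py` itself: the name `stale_cache_folders` is hypothetical, and it assumes cached folders follow the "YYYY[-MM-DD-HH-MM-SS]_[VERSION ID]" naming convention with the same date granularity, so that sorting by name sorts by date.

```python
import re

# Cached version folders start with a year, optional finer date parts,
# then an underscore and the version id.
VERSION_FOLDER = re.compile(r"^\d{4}(?:-\d{2}){0,5}_.+$")


def stale_cache_folders(folder_names):
    """Return cached version folders other than the most recent one.

    Folders that do not match the naming convention (such as "master")
    are never returned.
    """
    cached = sorted(name for name in folder_names if VERSION_FOLDER.match(name))
    return cached[:-1]
```

A scheduled job could call this per data store and delete only the returned folders, leaving master/ and the newest cache in place.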
+
+The "versioneddata" user has a history for each workflow it runs on behalf of another user.  A galaxy admin can impersonate the "versioneddata" user to see the workflows being executed by other users, and as well, manually delete any histories that haven't been properly deleted by the cache clearing script.
+
+Note that Galaxy won't delete a dataset if it is linked from another (user or library) context.
+
+## Notes
+
+In galaxy, the galaxy (i) information link from a "Versioned Data Retrieval" history job item will display all the form data collected for the job run.  One item,
+
+  `For user with Galaxy API Key http://salk.bccdc.med.ubc.ca/galaxylab/api-cc628a7dffdbeca7`
+
+is used to pass the api url, and user's current history id.  We weren't able to convey this information any other way.  The coded parameter is not the user's api key; that is kept confidential, i.e. it doesn't exist in the records of the job run.
+
+

diff -r d31a1bd74e63 -r 5c5027485f7d doc/problem_solving.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/problem_solving.md Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,27 @@
+# Problem solving
+
+If the Galaxy API is not working, then it needs to be enabled in universe_wsgi.ini with "enable_api = True".
+
+The "Versioned data retrieval" tool depends on the bioblend python module created for accessing Galaxy's API. This enables reading and writing to a particular Galaxy data library and to current user's history in a more elegant way.
+
+  `> pip install bioblend`
+
+This loads a variety of dependencies - requests, poster, boto, pyyaml,... (For ubuntu, install libyaml-dev first to avoid odd pyyaml error?)  (Note that this must be done on galaxy toolshed's install too.)
+
+**Data library download link doesn't work**
+
+If a data library has a version data folder that is linked to the data store elsewhere on the server, that folder's download link probably won't work until you adjust your galaxy webserver (apache or nginx) configuration.  See this and this (a note on galaxy user permissions vis-à-vis the apache user).  For the example of all data stores that are in /projects2/reference_dbs/versioned/ , Apache needs two lines for the galaxy site configuration (usually in /etc/httpd/conf/httpd.conf ).
+
+```
+XSendFile on
+XSendFilePath ... (other paths)
+XSendFilePath /projects2/reference_dbs/versioned/
+```
+
+Without this path, and sufficient permissions, errors will show up in the Apache httpd log, and galaxy users will find that they can't download the versioned data if it is linked from the galaxy library to other locations on the server.
+
+**Data store version deletion**
+
+Note: if a server data store version folder has been deleted, but a link to it still exists in the Versioned Data library, then attempting to download the dataset from there will result in an error.  Running the Versioned Data tool to request this file will regenerate it in the server data store cache. Alternatively, you can delete the versioned data library version cache folder.
+
+

diff -r d31a1bd74e63 -r 5c5027485f7d doc/setup.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/setup.md Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,14 @@
+## Setup for Admins
+
+The general goal with the following configuration is to enable the versioned data to be generated and used directly by server (non galaxy-platform) users; at the same time Galaxy users have access to the same versioned data system (and also have the benefit of generating derivative data via Galaxy workflows) without having to leave the Galaxy platform.  In the background a reference database scheduled import process (in our case maintained by Biomaj) keeps the master data store files up to date.
+
+To set up the Versioned Data Tool, do the following:
+
+  1. [Galaxy tool installation](galaxy_tool_install.md)
+  2. [Server data stores](data_stores.md)
+  3. [Galaxy "Versioned Data" library setup](galaxy_library.md)
+  4. [Workflow configuration](workflows.md)
+  5. [Permissions, security, and maintenance](maintenance.md)
+  6. [Problem solving](problem_solving.md)
+
+Any galaxy user who wants to use this tool will need a Galaxy API key.  They can get one via their User menu, or a Galaxy admin can assign one for them.

diff -r d31a1bd74e63 -r 5c5027485f7d doc/versioned_data_retrieval.png

Binary file doc/versioned_data_retrieval.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/workflow_tools.png

Binary file doc/workflow_tools.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d doc/workflows.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/workflows.md Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,18 @@
+# Workflow configuration
+
+**All displayed workflows must be prefixed with the keyword "Versioning:"**  This is how they are recognized by the Versioned Data tool.
+
+In order for a workflow to show up in the tool's Workflows menu, **it must be published by its author**. Currently there is no other ability to offer finer-grained access to lists of workflows by individual or user group.
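The two conditions above (the name prefix and the published flag) amount to a simple filter. The Python sketch below assumes each workflow is described by a dict with `name` and `published` keys and that the prefix match is case-sensitive; both are assumptions, and `selectable_workflows` is a hypothetical name rather than the tool's own function.

```python
PREFIX = "Versioning:"


def selectable_workflows(workflows):
    """Keep only published workflows whose name starts with the prefix.

    `workflows` is a list of dicts with 'name' and 'published' keys
    (an assumed shape for this example).
    """
    return [
        workflow
        for workflow in workflows
        if workflow.get("published") and workflow.get("name", "").startswith(PREFIX)
    ]
```

A workflow that has the right name but is not published is filtered out, which is why an unpublished workflow never appears in the tool's Workflows menu.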
+
+![Galaxy workflows](workflows.png)
+
+Workflow processing is accomplished on behalf of a user by launching it in a history of the versioneddata@localhost.com user.  Cache clearing will remove these file references from the data library when they are old (but as the files are often moved up to the library's cache, they can still be used by others until that cache is cleared).  An admin can impersonate the "versioneddata" user to see the workflows being executed by other users.
+
+Note that any workflow that uses additional input datasets must have those datasets set in the workflow design/template, so they must exist in a fixed location - a data library.  (Currently they can't be in the Versioned Data library).
+
+If a workflow references a tool version that has been uninstalled, one will receive the following error when working on it.  The only remedy is to reinstall that particular tool version, or to change the workflow to a newer version.
+
+```bash
+  File "/usr/lib/python2.6/site-packages/bioblend/galaxyclient.py", line 104, in make_post_request r.status_code, body=r.text)
+  bioblend.galaxy.client.ConnectionError: Unexpected response from galaxy: 500: {"traceback": "Traceback (most recent call last):\n  File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/web/framework/decorators.py\", line 244, in decorator\n    rval = func( self, trans, *args, **kwargs)\n  File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/webapps/galaxy/api/workflows.py\", line 231, in create\n    populate_state=True,\n  File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/workflow/run.py\", line 21, in invoke\n    return __invoke( trans, workflow, workflow_run_config, workflow_invocation, populate_state )\n  File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/workflow/run.py\", line 60, in __invoke\n    modules.populate_module_and_state( trans, workflow, workflow_run_config.param_map )\n  File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/workflow/modules.py\", line 1014, in populate_module_and_state\n    step_errors = module_injector.inject( step, step_args=step_args, source=\"json\" )\n  File \"/usr/local/galaxy/production1/galaxy-dist/lib/galaxy/workflow/modules.py\", line 992, in inject\n    raise MissingToolException()\nMissingToolException\n", "err_msg": "Uncaught exception in exposed API method:", "err_code": 0}
+```

diff -r d31a1bd74e63 -r 5c5027485f7d doc/workflows.png

Binary file doc/workflows.png has changed

diff -r d31a1bd74e63 -r 5c5027485f7d ffp_macros.xml
--- a/ffp_macros.xml Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

@@ -1,21 +0,0 @@
-<macros>
-
-    <xml name="stdio">
-        <stdio>
-            
-            <exit_code range="1:" />
-            <exit_code range=":-1" />
-            
-            <regex match="Error:" />
-            <regex match="Exception:" />
-        </stdio>
-    </xml>
-
-    <xml name="requirements">
-        <requirements>
-            <requirement type="binary">@BINARY@</requirement>
-        </requirements>
-        <version_command interpreter="python">@BINARY@ --version</version_command>
-    </xml>
-
-</macros>
\ No newline at end of file

diff -r d31a1bd74e63 -r 5c5027485f7d ffp_phylogeny.py
--- a/ffp_phylogeny.py Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

[

b'@@ -1,364 +0,0 @@\n-#!/usr/bin/python\n-import optparse\n-import re\n-import time\n-import os\n-import tempfile\n-import sys\n-import shlex, subprocess\n-from string import maketrans\n-\n-VERSION_NUMBER = "0.1.03"\n-\n-class MyParser(optparse.OptionParser):\n-\t"""\n-\t From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output\n-\t Provides a better class for displaying formatted help info in epilog() portion of optParse; allows for carriage returns.\n-\t"""\n-\tdef format_epilog(self, formatter):\n-\t\treturn self.epilog\n-\n-\n-def stop_err( msg ):\n- sys.stderr.write("%s\\n" % msg)\n- sys.exit(1)\n-\n-def getTaxonomyNames(type, multiple, abbreviate, filepaths, filenames):\n-\t"""\n-\tReturns a taxonomic list of names corresponding to each file being analyzed by ffp.\n-\tThis may also include names for each fasta sequence found within a file if the\n-\t"-m" multiple option is provided. \tDefault is to use the file names rather than fasta id\'s inside the files.\n-\tNOTE: THIS DOES NOT (MUST NOT) REORDER NAMES IN NAME ARRAY. \n-\tEACH NAME ENTRY IS TRIMMED AND MADE UNIQUE\n-\t\n-\t@param type string [\'text\',\'amino\',\'nucleotide\']\n-\t@param multiple boolean Flag indicates to look within files for labels\n-\t@param abbreviate boolean Flag indicates to shorten labels\t\n-\t@filenames array original input file names as user selected them\n-\t@filepaths array resulting galaxy dataset file .dat paths\n-\t\n-\t"""\n-\t# Take off prefix/suffix whitespace/comma :\n-\ttaxonomy = filenames.strip().strip(\',\').split(\',\')\n-\tnames=[]\n-\tptr = 0\n-\n-\tfor file in filepaths:\n-\t\t# Trim labels to 50 characters max. ffpjsd kneecaps a taxonomy label to 10 characters if it is greater than 50 chars.\n-\t\ttaxonomyitem = taxonomy[ptr].strip()[:50] #.translate(translations)\n-\t\t# Convert non-alphanumeric characters to underscore in taxonomy names. 
ffprwn IS VERY SENSITIVE ABOUT THIS.\n-\t\ttaxonomyitem = re.sub(\'[^0-9a-zA-Z]+\', \'_\', taxonomyitem)\n-\n-\t\tif (not type in \'text\') and multiple:\n-\t\t\t#Must read each fasta file, looking for all lines beginning ">"\n-\t\t\twith open(file) as fastafile:\n-\t\t\t\tlineptr = 0\n-\t\t\t\tfor line in fastafile:\n-\t\t\t\t\tif line[0] == \'>\':\n-\t\t\t\t\t\tname = line[1:].split(None,1)[0].strip()[:50]\n-\t\t\t\t\t\t# Odd case where no fasta description found\n-\t\t\t\t\t\tif name == \'\': name = taxonomyitem + \'.\' + str(lineptr)\n-\t\t\t\t\t\tnames.append(name)\n-\t\t\t\t\t\tlineptr += 1\n-\t\telse:\n-\n-\t\t\tnames.append(taxonomyitem)\n-\t\t\n-\t\tptr += 1\n-\n-\tif abbreviate:\n-\t\tnames = trimCommonPrefixes(names)\n-\t\tnames = trimCommonPrefixes(names, True) # reverse = Suffixes.\n-\n-\treturn names\n-\t\n-def trimCommonPrefixes(names, reverse=False):\n-\t"""\n-\tExamines sorted array of names. Trims off prefix of each subsequent pair.\n-\t\n-\t@param names array of textual labels (file names or fasta taxonomy ids)\n-\t@param reverse boolean whether to reverse array strings before doing prefix trimming.\n-\t"""\n-\twordybits = \'|.0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'\n-\n-\tif reverse:\n-\t\tnames = map(lambda name: name[::-1], names) #reverses characters in names\n-\t\n-\tsortednames = sorted(names)\n-\tptr = 0\n-\tsortedlen = len(sortednames)\n-\toldprefixlen=0\n-\tprefixlen=0\n-\tfor name in sortednames:\n-\t\tptr += 1\n-\n-\t\t#If we\'re not at the very last item, reevaluate prefixlen\n-\t\tif ptr < sortedlen:\n-\n-\t\t\t# Skip first item in an any duplicate pair. 
Leave duplicate name in full.\n-\t\t\tif name == sortednames[ptr]:\n-\t\t\t\tif reverse:\n-\t\t\t\t\tcontinue\n-\t\t\t\telse:\n-\t\t\t\t\tnames[names.index(name)] = \'DupLabel-\' + name\n-\t\t\t\t\tcontinue\n-\n-\t\t\t# See http://stackoverflow.com/questions/9114402/regexp-finding-longest-common-prefix-of-two-strings\n-\t\t\tprefixlen = len( name[:([x[0]==x[1] for x in zip(name, sortednames[ptr])]+[0]).index(0)] )\n-\t\t\t\t\n-\t\tif prefixlen <= oldprefixlen:\n-\t\t\tnewprefix = name[:oldprefixlen]\n-\t\telse:\n-\t\t\tnewprefix = name[:prefixlen]\n-\t\t# Expands label to include any preceeding characters that were probably part of it.\n-\t\tnewprefix = newprefix.rstrip(wordybits)\n-\t\tnewname = name[len(newprefix):]\n-\t\t# Some tree visua'..b'imidine group) before being counted. Disable this to treat individual characters as distinct.\')\n-\n-\t\tparser.add_option(\'-a\', \'--abbreviate\', dest=\'abbreviate\', default=False, action=\'store_true\', \n-\t\t\thelp=\'Shorten tree taxonomy labels as much as possible.\')\n-\t\t\n-\t\tparser.add_option(\'-s\', \'--similarity\', dest=\'similarity\', default=False, action=\'store_true\', \n-\t\t\thelp=\'Enables pearson correlation coefficient matrix and any of the binary distance measures to be turned into similarity matrixes.\')\n-\t\t\n-\t\tparser.add_option(\'-f\', \'--filter\', type=\'choice\', dest=\'filter\', default=\'none\',\n-\t\t\tchoices=[\'none\',\'count\',\'f\',\'n\',\'e\',\'freq\',\'norm\',\'evd\'],\n-\t\t\thelp=\'Choice of [f=raw frequency|n=normal|e=extreme value (Gumbel)] distribution: Features are trimmed from the data based on lower/upper cutoff points according to the given distribution.\')\n-\n-\t\tparser.add_option(\'-L\', \'--lower\', type=\'float\', dest=\'lower\', \n-\t\t\thelp=\'Filter lower bound is a 0.00 percentages\')\n-\t\t\n-\t\tparser.add_option(\'-U\', \'--upper\', type=\'float\', dest=\'upper\',\n-\t\t\thelp=\'Filter upper bound is a 0.00 
percentages\')\n-\n-\t\tparser.add_option(\'-o\', \'--output\', type=\'string\', dest=\'output\', \n-\t\t\thelp=\'Path of output file to create\')\n-\n-\t\tparser.add_option(\'-T\', \'--tree\', dest=\'tree\', default=False, action=\'store_true\', help=\'Generate Phylogenetic Tree output file\')\n-\n-\t\tparser.add_option(\'-v\', \'--version\', dest=\'version\', default=False, action=\'store_true\', help=\'Version number\')\n-\n-\t\t# Could also have -D INT decimal precision included for ffprwn .\n-\t\t\t\n-\t\toptions, args = parser.parse_args()\n-\n-\t\tif options.version:\n-\t\t\tprint VERSION_NUMBER\n-\t\t\treturn\n-\t\t\n-\t\timport time\n-\t\ttime_start = time.time()\n-\n-\t\ttry:\n-\t\t\tin_files = args[:]\n-\t\t\n-\t\texcept:\n-\t\t\tstop_err("Expecting at least 1 input data file.")\n-\n-\t\t\n-\t\t#ffptxt / ffpaa / ffpry\n-\t\tif options.type in \'text\':\n-\t\t\tcommand = \'ffptxt\'\n-\t\t\t\n-\t\telse:\n-\t\t\tif options.type == \'amino\':\n-\t\t\t\tcommand = \'ffpaa\'\n-\t\t\telse:\n-\t\t\t\tcommand = \'ffpry\'\n-\t\t\t\t\n-\t\t\tif options.disable:\n-\t\t\t\tcommand += \' -d\'\n-\t\t\t\t\n-\t\t\tif options.multiple:\n-\t\t\t\tcommand += \' -m\'\n-\t\t\n-\t\tcommand += \' -l \' + str(options.length)\n-\n-\t\tif len(in_files): #Note: app isn\'t really suited to stdio\n-\t\t\tcommand += \' "\' + \'" "\'.join(in_files) + \'"\'\n-\t\t\t\t\n-\t\t#ffpcol / ffpfilt\n-\t\tif options.filter != \'none\':\t\t\n-\t\t\tcommand += \' | ffpfilt\'\n-\t\t\tif options.filter != \'count\':\n-\t\t\t\tcommand += \' -\' + options.filter\n-\t\t\tif options.lower > 0:\n-\t\t\t\tcommand += \' --lower \' + str(options.lower)\n-\t\t\tif options.upper > 0:\n-\t\t\t\tcommand += \' --upper \' + str(options.upper)\n-\t\t\t\n-\t\telse:\n-\t\t\tcommand += \' | ffpcol\'\n-\n-\t\tif options.type in \'text\':\n-\t\t\tcommand += \' -t\'\n-\t\t\t\n-\t\telse:\n-\n-\t\t\tif options.type == \'amino\':\n-\t\t\t\tcommand += \' -a\'\n-\t\t\t\n-\t\t\tif options.disable:\n-\t\t\t\tcommand += \' 
-d\'\n-\t\t\t\n-\t\t#if options.normalize:\n-\t\tcommand += \' | ffprwn\'\n-\n-\t\t#Now create a taxonomy label file, ensuring a name exists for each profile.\n-\t\ttaxonomyNames = getTaxonomyNames(options.type, options.multiple, options.abbreviate, in_files, options.taxonomy)\n-\t\ttaxonomyTempFile = getTaxonomyFile(taxonomyNames)\n-\t\t\n-\t\t# -p = Include phylip format \'infile\' of the taxon names to use. Very simple, just a list of fasta identifier names.\n-\t\tcommand += \' | ffpjsd -p \' + taxonomyTempFile\n-\n-\t\tif options.metric and len(options.metric) >0 :\n-\t\t\tcommand += \' --\' + options.metric\n-\t\t\tif options.similarity:\n-\t\t\t\tcommand += \' -s\'\n-\n-\t\t# Generate Newick (.nhx) formatted tree if we have at least 3 taxonomy items:\n-\t\tif options.tree:\n-\t\t\tif len(taxonomyNames) > 2:\n-\t\t\t\tcommand += \' | ffptree -q\' \n-\t\t\telse:\n-\t\t\t\tstop_err("For a phylogenetic tree display, one must have at least 3 ffp profiles.")\n-\n-\t\t#print command\n-\t\t\n-\t\tresult = check_output(command)\n-\t\twith open(options.output,\'w\') as fw:\n-\t\t\tfw.writelines(result)\n-\t\tos.remove(taxonomyTempFile)\n-\n-if __name__ == \'__main__\':\n-\n-\ttime_start = time.time()\n-\n-\treportEngine = ReportEngine()\n-\treportEngine.__main__()\n-\t\n-\tprint(\'Execution time (seconds): \' + str(int(time.time()-time_start)))\n-\t\n'

diff -r d31a1bd74e63 -r 5c5027485f7d ffp_phylogeny.xml
--- a/ffp_phylogeny.xml Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

b'@@ -1,293 +0,0 @@\n-<tool id="ffp_phylogeny" name="Feature Frequency Profile Phylogeny" version="0.1.04">\n-\t<description>An alignment free comparison tool for phylogenetic analysis and text comparison</description>\n-\t<requirements>\n-\t\t<requirement type="package" version="0.3.19_d4382db015acec0e5cc43d6c1ac80ae12cb7e6b3">ffp-phylogeny</requirement>\n-\t</requirements>\n-\t\n-\t<macros>\n-\t\t<token name="@BINARY@">./ffp_phylogeny.py</token>\n-\t\t<import>ffp_macros.xml</import>\n-\t</macros>\n-\t<expand macro="requirements" />\n- <command interpreter="python"><![CDATA[\n-\t\tffp_phylogeny.py\n-\t\t#for $i in $sequence.filesin\n-\t\t\t"$i" ## full file paths\n-\t\t#end for\n-\t\t-x "\n-\t\t#for $i in $sequence.filesin\n-\t\t\t$i.name, ## original file names\n-\t\t#end for\n-\t\t"\n-\t\t-t "$(sequence.file_type.split(\'-\')[0])"\n-\t\t-l "$length"\n-\t\t-o "$info"\n-\t\t##if $normalize\n-\t\t##\t-n\n-\t\t##end if\n-\t\t#if $sequence.file_type != \'text\'\n-\t\t\t#if $sequence.file_type == \'amino-multi\' or $sequence.file_type == \'nucleotide-multi\'\n-\t\t\t\t-m\n-\t\t\t#end if\n-\t\t\t#if $sequence.groupings\n-\t\t\t\t#pass\n-\t\t\t#else\n-\t\t\t\t-d\n-\t\t\t#end if\n-\t\t\t#if $metric\n-\t\t\t\t-M "$metric"\n-\t\t\t#end if\n-\t\t\t#if $similarity\n-\t\t\t\t-s\n-\t\t\t#end if\n-\t\t\t#if $abbreviate\n-\t\t\t\t-a\n-\t\t\t#end if\t\t\t\n-\t\t#end if\n-\t\t#if $phylogeny.phylo_type == \'filt\'\n-\t\t\t-f "$phylogeny.filt.filter_type"\n-\t\t\t-L "$phylogeny.filt.lower"\n-\t\t\t-U "$phylogeny.filt.upper" \n-\t\t#end if\n-\t\t#if $tree\n-\t\t\t-T\n-\t\t#end if\t\t\n-\t\t##ffpjsd -n FLOAT , --normval=FLOAT\n-\t\t## For option -e, --euclid, change the n-norm distance (Default is n=2) to any other value where n > 1\n-\n- ]]></command>\n- <expand macro="stdio" />\n- <inputs>\n-\n-\t\t\n-\t\t\n-\n-\t\t<param name="length" type="integer" min="1" max="25" label="l-mer length" value="6" help="String of valid characters of this length will be counted. 
Synonyms: feature, k-mer, n-gram, k-tuple" size="2"/>\n-\t\t\n-\t\t<conditional name="sequence">\n-\t\t\t<param type="select" name="file_type" label="File type" help="Note: For phylogeny display, at least three profiles are required.">\n-\t\t\t\t<option value="amino">Amino Acids, one profile per file</option>\n-\t\t\t\t<option value="amino-multi">Amino Acids, one profile per fasta sequence in file</option>\n-\t\t\t\t<option value="nucleotide">Nucleic acids, one profile per file</option>\n-\t\t\t\t<option value="nucleotide-multi">Nucleic acids, one profile per fasta sequence in file</option>\n-\t\t\t\t<option value="text">Text, single file</option>\n-\t\t\t</param>\n-\n-\t\t\t<when value="amino">\n-\t\t\t\t<param name="filesin" type="data" label="Select input file(s)" format="fasta" multiple="true" />\n-\t\t\t\t<param name="groupings" label="Enable amino acid grouping" type="boolean" checked="true" help="Counts amino acids in groups rather than individually (usually advantageous, see below)." />\n-\t\t\t</when>\n-\n-\t\t\t<when value="amino-multi">\n-\t\t\t\t<param name="filesin" type="data" label="Select input file(s)" format="fasta" multiple="true" />\n-\t\t\t\t<param name="groupings" label="Enable amino acid grouping" type="boolean" checked="true" help="Counts amino acids in groups rather than individually (usually advantageous, see below)." />\n-\t\t\t</when>\n-\t\t\t\t\t\t\n-\t\t\t<when value="nucleotide">\n-\t\t\t\t<param name="filesin" type="data" label="Select input file(s)" format="fasta" multiple="true" />\t\n-\t\t\t\t<param name="groupings" label="Enable purine / pyrimidine grouping" type="boolean" checked="true" help="Counts each nucleotide as a purine(R) or pyrimidine(Y) rather than individually (usually advantageous)." 
/>\n-\t\t\t</when>\n-\t\t\t\n-\t\t\t<when value="nucleotide-multi">\n-\t\t\t\t<param name="filesin" type="data" label="Select input file(s)" format="fasta" multiple="true" />\n-\t\t\t\t<param name="groupings" label="Enable purine / pyrimidine grouping" type="boolean" checked="true" help="Counts each nucle'..b'<param name="length" value="2"/>\n-\t\t\t<param name="tree" value="0"/>\n-\t\t\t<param name="groupings" value="true"/>\n-\t\t\t<param name="file_type" value="nucleotide-multi"/>\n-\t\t\t<param name="filesin" value="genome1,genome2"/>\n-\t\t\t<output name="info" file="test_length_2b_output.tabular"/>\n-\t\t</test>\t\t\n-\t</tests>\n-\n- <help><![CDATA[\n- \n-.. class:: infomark\n-\n-\n-**What it does**\n-\n-FFP (Feature frequency profile) is an alignment free comparison tool for phylogenetic analysis and text comparison. It can be applied to nucleotide sequences, complete genomes, proteomes and even used for text comparison.\n-\n-This galaxy tool prepares a mini-pipeline consisting of **[ffpry | ffpaa | ffptxt] > [ ffpfilt | ffpcol > ffprwn] > ffpjsd > ffptree** . The last step is optional - by deselecting the "Generate Tree Phylogeny" checkbox, the tool will output only the precursor distance matrix file rather than a Newick (.nhx) formatted tree file.\n-\n-Each sequence or text file has a profile containing tallies of each feature found. A feature is a string of valid characters of given length. \n-\n-For nucleotide data, by default each character (ATGC) is grouped as either purine(R) or pyrmidine(Y) before being counted.\n-\n-For amino acid data, by default each character is grouped into one of the following:\n-(ST),(DE),(KQR),(IVLM),(FWY),C,G,A,N,H,P. Each group is represented by the first character in its series.\n-\n-One other key concept is that a given feature, e.g. "TAA" is counted in forward \n-AND reverse directions, mirroring the idea that a feature's orientation is not\n-so important to distinguish when it comes to alignment-free comparison. 
\n-The counts for "TAA" and "AAT" are merged.\n- \n-The labeling of the resulting counted feature items is perhaps the trickiest\n-concept to master. Due to computational efficiency measures taken by the \n-developers, a feature that we see on paper as "TAC" may be stored and labeled \n-internally as "GTA", its reverse compliment. One must look for the alternative\n-if one does not find the original. \n-\n-Also note that in amino acid sequences the stop codon "*" (or any other character \n-that is not in the Amino acid alphabet) causes that character frame not to be\n-counted. Also, character frames never span across fasta entries.\n-\n-A few tutorials:\n- * http://sourceforge.net/projects/ffp-phylogeny/files/Documentation/tutorial.pdf\n- * https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny\n-\n--------\n-\n-.. class:: warningmark\n-\n-**Note**\n-\n-Taxonomy label details: If each file contains one profile, the file\'s name is used to label the profile.\n-If each file contains fasta sequences to profile individually, their fasta identifiers will be used to label them.\n-The "short labels" option will find the shortest label that uniquely identifies each profile.\n-Either way, there are some quirks: ffpjsd clips labels to 10 characters if they are greater than 50 characters, so all labels are trimmed to 50 characters first.\n-Also "id" is prefixed to any numeric label since some tree visualizers won\'t show purely numeric labels.\n-In the accidental case where a Fasta sequence label is a duplicate of a previous one it will be prefixed by "DupLabel-".\n-\n-The command line ffpjsd can hang if one provides an l-mer length greater than the length of file content.\n-One must identify its process id (">ps aux | grep ffpjsd") and kill it (">kill [process id]").\n--------\n-\n-**References**\n-\n-The original ffp-phylogeny code is at http://ffp-phylogeny.sourceforge.net/ .\n-This tool uses Aaron Petkau\'s modified version: 
https://github.com/apetkau/ffp-3.19-custom .\n- \n-The development of the ff-phylogeny should be attributed to:\n-\n-Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 2009;106(8):2677-2682. doi:10.1073/pnas.0813249106.\n-\n- ]]></help>\n-</tool>\n-\n-\n'

diff -r d31a1bd74e63 -r 5c5027485f7d tarballit.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tarballit.sh Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,2 @@
+#!/bin/bash
+ tar -zcvf versioned_data.tar.gz * --exclude "*~" --exclude "*.pyc" --exclude "tool_test_output*" --exclude "*.gz"

diff -r d31a1bd74e63 -r 5c5027485f7d test-data/genome1
--- a/test-data/genome1 Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

@@ -1,2 +0,0 @@
->genome1
-AATT

diff -r d31a1bd74e63 -r 5c5027485f7d test-data/genome2
--- a/test-data/genome2 Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

@@ -1,2 +0,0 @@
->genome2
-AAGG

diff -r d31a1bd74e63 -r 5c5027485f7d test-data/test_length_1_output.tabular
--- a/test-data/test_length_1_output.tabular Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

@@ -1,3 +0,0 @@
-2
-genome1 0.00e+00 1.89e-01
-genome2 1.89e-01 0.00e+00

diff -r d31a1bd74e63 -r 5c5027485f7d test-data/test_length_2_output.tabular
--- a/test-data/test_length_2_output.tabular Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

@@ -1,3 +0,0 @@
-2
-genome1 0.00e+00 4.58e-01
-genome2 4.58e-01 0.00e+00

diff -r d31a1bd74e63 -r 5c5027485f7d test-data/test_length_2b_output.tabular
--- a/test-data/test_length_2b_output.tabular Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

@@ -1,3 +0,0 @@
-2
-genome1 0.00e+00 1.42e-01
-genome2 1.42e-01 0.00e+00

diff -r d31a1bd74e63 -r 5c5027485f7d tool_dependencies.xml
--- a/tool_dependencies.xml Sun Aug 09 16:05:40 2015 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000

@@ -1,24 +0,0 @@
-<?xml version="1.0"?>
-<tool_dependency>
- <package name="ffp-phylogeny" version="0.3.19_d4382db015acec0e5cc43d6c1ac80ae12cb7e6b3">
- <install version="1.0">
- <actions>
- <action type="shell_command">git clone https://github.com/apetkau/ffp-3.19-custom.git ffp-phylogeny</action>
- <action type="shell_command">git reset --hard d4382db015acec0e5cc43d6c1ac80ae12cb7e6b3</action>
- <action type="shell_command">./configure --disable-gui --prefix=$INSTALL_DIR</action>
- <action type="make_install"></action>
- 
- <action type="set_environment">
- <environment_variable name="PATH" action="prepend_to">$INSTALL_DIR/bin</environment_variable>
- </action>
- </actions>
- </install>
- <readme>
- apetkau/ffp-3.19-custom is a customized version of http://sourceforge.net/projects/ffp-phylogeny/
- </readme>
- </package>
-</tool_dependency>
-

diff -r d31a1bd74e63 -r 5c5027485f7d vdb_common.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/vdb_common.py Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,88 @@
+#!/usr/bin/python
+import re
+import os
+import time
+import dateutil
+import dateutil.parser as parser2
+import datetime
+import calendar
+
+def parse_date(adate):
+ """
+ Convert human-entered time into linux integer timestamp
+
+ @param adate string Human entered date to parse into linux time
+
+ @return integer Linux time equivalent or 0 if no date supplied
+ """
+ adate = adate.strip()
+ if adate > '':
+     adateP = parser2.parse(adate, fuzzy=True)
+     #dateP2 = time.mktime(adateP.timetuple())
+     # This handles UTC & daylight savings exactly
+     return calendar.timegm(adateP.timetuple())
+ return 0
+
+
+def get_unix_time(vtime, voffset=0):
+ return float(vtime) - int(voffset)/100*60*60
+
+
+def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
+ return [int(text) if text.isdigit() else text.lower()
+ for text in re.split(_nsre, s)]
+
+
+class date_matcher(object):
+ """
+ Enables cycling through a list of versions and picking the one that matches
+ or immediately precedes a given date.  As soon as an item is found, subsequent
+ calls to next() return False (because of the self.found flag)
+ """
+ def __init__(self, unix_time):
+ """
+ @param unix_time integer Linux date/time to match versions against
+ """
+ self.found = False
+ self.unix_time = unix_time
+
+ def __iter__(self):
+ return self
+
+ def next(self, unix_datetime):
+ select = False
+ if (self.found == False) and (self.unix_time > 0) and (unix_datetime <= self.unix_time):
+ self.found = True
+ select = True
+ return select
+
+
+def dateISOFormat(atimestamp):
+ return datetime.datetime.isoformat(datetime.datetime.fromtimestamp(atimestamp))
+
+def lightDate(unixtime):
+ return datetime.datetime.utcfromtimestamp(float(unixtime)).strftime('%Y-%m-%d %H:%M')
+
+def move_files(source_path, destination_path, file_paths):
+ """
+ MOVE FILES TO CACHE FOLDER (RATHER THAN COPYING THEM) FOR GREATER EFFICIENCY.
+ Since a number of data source systems have hidden / temporary files in their
+ data folder structure, a list of file_paths is required to select only the
+ content that should be moved.  Note: this leaves a skeleton of folders; only files are moved.
+
+ Note: Tried using os.renames() but it errors when attempting to remove folders
+ from git archive that aren't empty due to files that are not to be copied.
+
+
+ @param source_path string Absolute folder path to move data files from
+ @param destination_path string Absolute folder path to move data files to
+ @param file_paths list List of files and their relative paths from source_path root
+ """
+ for file_name in file_paths:
+ if len(file_name):
+ print "(" + file_name + ")"
+ v_path = os.path.dirname(os.path.join(destination_path, file_name))
+ if not os.path.isdir(v_path):
+ os.makedirs(v_path)
+ os.rename(os.path.join(source_path, file_name), os.path.join(destination_path, file_name) )
+
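
Reviewer's sketch (not part of the changeset): the `date_matcher` docstring above says the class picks the version that matches or immediately precedes a given date, and then refuses every later candidate. The snippet below illustrates that selection rule under the assumption that versions are iterated newest first. It is written for Python 3, whereas the changeset targets Python 2; `DateMatcher`, `to_unix`, `pick_version` and the sample version list are hypothetical names used only for illustration.

```python
import calendar
import datetime


class DateMatcher(object):
    """Mirrors date_matcher: selects the first item at or before the target time."""

    def __init__(self, unix_time):
        self.found = False
        self.unix_time = unix_time

    def matches(self, unix_datetime):
        # Only the first qualifying item is selected; later calls return False.
        if not self.found and self.unix_time > 0 and unix_datetime <= self.unix_time:
            self.found = True
            return True
        return False


def to_unix(year, month, day):
    """UTC midnight of the given date as an integer timestamp."""
    return calendar.timegm(datetime.datetime(year, month, day).timetuple())


def pick_version(versions, target_time):
    """versions: list of (version_id, unix_time), assumed sorted newest first."""
    matcher = DateMatcher(target_time)
    for version_id, created in versions:
        if matcher.matches(created):
            return version_id
    return None


versions = [
    ("v3", to_unix(2015, 8, 1)),
    ("v2", to_unix(2015, 5, 1)),
    ("v1", to_unix(2015, 1, 1)),
]
print(pick_version(versions, to_unix(2015, 6, 15)))
```

If the list were sorted oldest first, the same rule would select the oldest version instead, so the ordering assumption matters.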

diff -r d31a1bd74e63 -r 5c5027485f7d vdb_retrieval.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/vdb_retrieval.py Sun Aug 09 16:07:50 2015 -0400

b'@@ -0,0 +1,853 @@\n+#!/usr/bin/python\n+\t\n+"""\n+****************************** vdb_retrieval.py ******************************\n+ VDBRetrieval() instance called in two stages:\n+ 1) by tool\'s versioned_data.xml form (dynamic_option field) \n+ 2) by its executable versioned_data.py script.\n+\n+"""\n+\n+import os, sys, glob, time\n+import string\n+from random import choice\n+\n+from bioblend.galaxy import GalaxyInstance\n+from requests.exceptions import ChunkedEncodingError\n+from requests.exceptions import ConnectionError\n+\n+import urllib2\n+import json\n+import vdb_common\n+\n+# Store these values in python/galaxy environment variables?\n+VDB_DATA_LIBRARY = \'Versioned Data\'\n+VDB_WORKFLOW_CACHE_FOLDER_NAME = \'Workflow cache\'\n+VDB_CACHED_DATA_LABEL = \'Cached data\'\n+\n+# Don\'t forget to add "versionedata@localhost.com" to galaxy config admin_users list.\n+\n+VDB_ADMIN_API_USER = \'versioneddata\'\n+VDB_ADMIN_API_EMAIL = \'versioneddata@localhost.com\'\n+VDB_ADMIN_API_KEY_PATH = os.path.join(os.path.dirname(sys._getframe().f_code.co_filename), \'versioneddata_api_key.txt\')\n+\t\t\t\n+#kipper, git, folder and other registered handlers\n+VDB_STORAGE_OPTIONS = \'kipper git folder biomaj\'\n+\n+# Used in versioned_data_form.py\n+VDB_DATASET_NOT_AVAILABLE = \'This database is not currently available (no items).\'\n+VDB_DATA_LIBRARY_FOLDER_ERROR = \'Error: this data library folder is not configured correctly.\'\n+VDB_DATA_LIBRARY_CONFIG_ERROR = \'Error: Check folder config file: \'\n+\t\t\n+\n+class VDBRetrieval(object):\n+\n+\tdef __init__(self, api_key=None, api_url=None):\n+\t\t"""\n+\t\tThis gets either trans.x.y from <code file="..."> call in versioned_data.xml,\n+\t\tor it gets a call with api_key and api_url from versioned_data.py\n+\n+\t\t@param api_key_path string File path to temporary file containing user\'s galaxy api_key\n+\t\t@param api_url string contains http://[ip]:[port] for handling galaxy api calls. 
\n+\t\t\n+\t\t"""\n+\t\t# Initialized constants during the life of a request:\n+\t\tself.global_retrieval_date = None\n+\t\tself.library_id = None\n+\t\tself.history_id = None\n+\t\tself.data_stores = []\n+\t\t\n+\t\t# Entire json library structure. item.url, type=file|folder, id, name (library path) \n+\t\t# Note: changes to library during a request aren\'t reflected here.\n+\t\tself.library = None \t\t\n+\n+\t\tself.user_api_key = None\t\n+\t\tself.user_api = None\n+\t\tself.admin_api_key = None\n+\t\tself.admin_api = None\n+\t\tself.api_url = None\n+\t\t\t\n+\n+\tdef set_trans(self, api_url, history_id, user_api_key=None): #master_api_key=None, \n+\t\t"""\n+\t\tUsed only on initial presentation of versioned_data.xml form. Doesn\'t need admin_api\n+\t\t"""\n+\t\tself.history_id = history_id\n+\t\tself.api_url = api_url\n+\t\tself.user_api_key = user_api_key\n+\t\t#self.master_api_key = master_api_key\n+\t\t\n+\t\tself.set_user_api()\n+\t\tself.set_admin_api()\t\t\n+\t\tself.set_datastores()\n+\t\t\n+\t\n+\tdef set_api(self, api_info_path):\n+\t\t"""\n+\t\t"api_info_path" is provided only when user submits tool via versioned_data.py call. 
\n+\t\tIt encodes both the api_url and the history_id of current session\n+\t\tOnly at this point will we need the admin_api, so it is looked up below.\n+\n+\t\t"""\n+\n+\t\twith open(api_info_path, \'r\') as access:\n+\t\t\t\n+\t\t\tself.user_api_key = access.readline().strip()\n+\t\t\t#self.master_api_key = access.readline().strip()\n+\t\t\tapi_info = access.readline().strip() #[api_url]-[history_id]\n+\t\t\tself.api_url, self.history_id = api_info.split(\'-\')\n+\t\t\n+\t\tself.set_user_api()\n+\t\tself.set_admin_api()\n+\t\tself.set_datastores()\n+\n+\n+\tdef set_user_api(self):\n+\t\t"""\n+\t\tNote: error message tacked on to self.data_stores for display back to user.\n+\t\t"""\n+\t\tself.user_api = GalaxyInstance(url=self.api_url, key=self.user_api_key)\n+\n+\t\tif not self.user_api:\n+\t\t\tself.data_stores.append({\'name\':\'Error: user Galaxy API connection was not set up correctly. Try getting another user API key.\', \'id\':\'none\'})\n+\t\t\treturn\n+\t\n+\t\n+\tdef set_datastores(self):\n+\t\t"""\n+\t\tProvides the list of data stores that users can select versions from.\n+\t\tNote: error message tacked on to self.data_stores for display back to'..b'ry they actually only deletes their history link to that dataset. 
True of api admin user?)\n+\t\t\n+\t\tFUTURE: have the galaxy-created data shared from a server location?\n+\t\t"""\n+\n+\t\t(workflow_input_key, workflow_input_label, annotation_key, dataset_map) = codings\n+\n+\t\t# This will create folder if it doesn\'t exist:\n+\t\t_library_cache_labels = os.path.join("/", VDB_WORKFLOW_CACHE_FOLDER_NAME, workflow_summary[\'name\'], \'On \' + workflow_input_label)\n+\t\tfolder_id = self.get_library_folder("/", library_cache_path, _library_cache_labels)\n+\t\tif not folder_id: # Case should never happen\n+\t\t\tprint \'Error: unable to determine library folder to place cache in:\' + library_cache_path\n+\t\t\tsys.exit(1)\n+\n+\t\t\n+\t\tfor dataset_id in work_result[\'outputs\']:\n+\t\t\t# We have to mark each dataset entry with the Workflow ID and input datasets it was generated by. \n+\t\t\t# No other way to know they are associated. ADD VERSION ID TO END OF workflowinput_label?\n+\t\t\tlabel = workflow_summary[\'name\'] +\' on \' + workflow_input_label \n+\t\t\t\n+\t\t\t# THIS WILL BE IN ADMIN API HISTORY\n+\t\t\tself.admin_api.histories.update_dataset(history_id, dataset_id, annotation = annotation_key, name=label) \n+\n+\t\t\t# Upload dataset_id and give it description \'cached data\'\n+\t\t\tif \'copy_from_dataset\' in dir(self.admin_api.libraries):\n+\t\t\t\t# IN BIOBLEND LATEST:\n+\t\t\t\tself.admin_api.libraries.copy_from_dataset(self.library_id, dataset_id, folder_id, VDB_CACHED_DATA_LABEL + ": version " + version_id)\n+\t\t\telse:\n+\t\t\t\tself.library_cache_setup_privileged(folder_id, dataset_id, VDB_CACHED_DATA_LABEL + ": version " + version_id)\n+\n+\n+\n+\tdef library_cache_setup_privileged(self, folder_id, dataset_id, message):\n+\t\t"""\n+\t\tCopy a history HDA into a library LDDA (that the current admin api user has add permissions on)\n+\t\tin the given library and library folder. 
Requires that dataset_id has been created by admin_api_key user.\t Nicola Soranzo [nicola.soranzo@gmail.com will be adding to BIOBLEND eventually.\n+\t\n+\t\tWe tried linking a Versioned Data library Workflow Cache folder to the dataset(s) a non-admin api user has just generated. It turns out API user that connects the two must be both a Library admin AND the owner of the history dataset being uploaded, or an error occurs. So system can\'t do action on behalf of non-library-privileged user. Second complication with that approach is that there is no Bioblend API call - one must do this directly in galaxy API via direct URL fetc.\n+\n+\t\tNOTE: This will raise "HTTPError(req.get_full_url(), code, msg, hdrs, fp)" if given empty folder_id for example\n+\t\n+\t\t@see def copy_hda_to_ldda( library_id, library_folder_id, hda_id, message=\'\' ):\n+\t\t@see https://wiki.galaxyproject.org/Events/GCC2013/TrainingDay/API?action=AttachFile&do=view&target=lddas_1.py\n+\n+\t\t@uses library_id: the id of the library which we want to query.\n+\t\n+\t\t@param dataset_id: the id of the user\'s history dataset we want to copy into the library folder.\n+\t\t@param folder_id: the id of the library folder to copy into.\n+\t\t@param message: an optional message to add to the new LDDA.\n+\t\t"""\n+\n+\n+\n+\t\tfull_url = self.api_url + \'/libraries\' + \'/\' + self.library_id + \'/contents\'\n+\t\turl = self.make_url( self.admin_api_key, full_url )\n+\t\t\t\n+\t\tpost_data = {\n+\t\t\'folder_id\' : folder_id,\n+\t\t\'create_type\' : \'file\',\n+\t\t\'from_hda_id\' : dataset_id,\n+\t\t\'ldda_message\' : message\n+\t\t}\n+\n+\t\treq = urllib2.Request( url, headers = { \'Content-Type\': \'application/json\' }, data = json.dumps( post_data ) )\n+\t\t#try:\n+\n+\t\tresults = json.loads( urllib2.urlopen( req ).read() ) \n+\t\treturn\n+\n+\n+\t#Expecting to phase this out with bioblend api call for library_cache_setup()\n+\tdef make_url(self, api_key, url, args=None ):\n+\t\t# Adds the API Key to 
the URL if it\'s not already there.\n+\t\tif args is None:\n+\t\t\targs = []\n+\t\targsep = \'&\'\n+\t\tif \'?\' not in url:\n+\t\t\targsep = \'?\'\n+\t\tif \'?key=\' not in url and \'&key=\' not in url:\n+\t\t\targs.insert( 0, ( \'key\', api_key ) )\n+\t\treturn url + argsep + \'&\'.join( [ \'=\'.join( t ) for t in args ] )\n+\n+\n+\n'
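
Reviewer's sketch (not part of the changeset): `make_url` above adds the API key to a URL unless a `key` parameter is already present. The snippet below shows the same idea in Python 3 using `urllib.parse.urlencode`. It is an illustration, not the committed code: unlike the original it URL-encodes values and returns the URL unchanged when there is nothing to append. The function name `add_api_key` and the example URL are hypothetical.

```python
from urllib.parse import urlencode


def add_api_key(url, api_key, args=None):
    """Append key=<api_key> (plus any extra args) unless the URL already has a key."""
    params = list(args or [])
    if "?key=" not in url and "&key=" not in url:
        params.insert(0, ("key", api_key))
    if not params:
        return url
    # Use '&' when the URL already carries a query string, '?' otherwise.
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(params)


print(add_api_key("http://localhost:8080/api/libraries/abc/contents", "SECRET"))
```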

diff -r d31a1bd74e63 -r 5c5027485f7d versioned_data.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/versioned_data.py Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,149 @@
+#!/usr/bin/python
+import os
+import optparse
+import sys
+import time
+import re
+
+import vdb_common
+import vdb_retrieval
+
+class MyParser(optparse.OptionParser):
+ """
+ From http://stackoverflow.com/questions/1857346/python-optparse-how-to-include-additional-info-in-usage-output
+ Provides a better display of formatted help info in epilog() portion of optParse.
+ """
+ def format_epilog(self, formatter):
+ return self.epilog
+
+
+def stop_err( msg ):
+    sys.stderr.write("%s\n" % msg)
+    sys.exit(1)
+
+
+class ReportEngine(object):
+
+ def __init__(self): pass
+
+ def __main__(self):
+
+ options, args = self.get_command_line()
+ retrieval_obj = vdb_retrieval.VDBRetrieval()
+ retrieval_obj.set_api(options.api_info_path)
+
+ retrievals=[]
+
+ for retrieval in options.retrievals.strip().strip('|').split('|'):
+ # Normally xml form supplies "spec_file_id, [version list], [workflow_list]"
+ params = retrieval.strip().split(',')
+
+ spec_file_id = params[0]
+
+ if spec_file_id == 'none':
+ print 'Error: Form was selected without requesting a data store to retrieve!'
+ sys.exit( 1 )
+
+ # STEP 1:  Determine data store type and location
+ data_store_spec = retrieval_obj.user_api.libraries.show_folder(retrieval_obj.library_id, spec_file_id)
+ data_store_type = retrieval_obj.test_data_store_type(data_store_spec['name'])
+ base_folder_id = data_store_spec['folder_id']
+
+ if not data_store_type:
+ print 'Error: unrecognized data store type for [' + data_store_spec['name'] + ']'
+ sys.exit( 1 )
+
+ ds_obj = retrieval_obj.get_data_store_gateway(data_store_type, spec_file_id)
+
+ if len(params) > 1 and len(params[1].strip()) > 0:
+ _versionList = params[1].strip()
+ version_id = _versionList.split()[0] # VersionList SHOULD just have 1 id
+ else:
+ # User didn't select version_id via "Add new retrieval"
+ if options.globalRetrievalDate:
+ _retrieval_date = vdb_common.parse_date(options.globalRetrievalDate)
+ version_id = ds_obj.get_version_options(global_retrieval_date=_retrieval_date, selection=True)
+
+ else:
+ version_id = ''
+
+ # Re-establishes file(s) if they don't exist on disk, and links them into the data library as well.
+ ds_obj.get_version(version_id)
+ if ds_obj.version_path == None:
+
+ print "Error: unable to retrieve version [%s] from %s archive [%s].  Archive doesn't contain this version id?" % (version_id, data_store_type, ds_obj.library_version_path)
+ sys.exit( 1 )
+
+ # Version data file(s) are sitting in [ds_obj.version_path] ready for retrieval.
+ library_dataset_ids = retrieval_obj.get_library_version_datasets(ds_obj.library_version_path, base_folder_id, ds_obj.version_label, ds_obj.version_path)
+
+ # The only thing that doesn't have cache lookup is "folder" data that isn't linked in.
+ # In that case try lookup directly.
+ if len(library_dataset_ids) == 0 and data_store_type == 'folder':
+ library_version_datasets = retrieval_obj.get_library_folder_datasets(ds_obj.library_version_path)
+ library_dataset_ids = [item['id'] for item in library_version_datasets]
+
+ if len(library_dataset_ids) == 0:
+
+ print 'Error: unable to retrieve version [%s] from %s archive [%s] ' % (version_id, data_store_type, ds_obj.library_version_path)
+ sys.exit( 1 )
+
+ # At this point we have references to the galaxy ids of the requested versioned dataset, after regeneration
+ versioned_datasets = retrieval_obj.update_history(library_dataset_ids, ds_obj.library_version_path, version_id)
+
+ if len(params) > 2:
+
+ workflow_list = params[2].strip()
+
+ if len(workflow_list) > 0:
+ # We have workflow run via admin_api and admin_api history.
+ retrieval_obj.get_workflow_data(workflow_list, versioned_datasets, version_id)
+
+
+ result=retrievals
+
+ # Output file needs to exist.  Otherwise Galaxy doesn't generate a placeholder file name for the output, and so we can't do things like check for [placeholder name]_files folder.  Add something to report on?
+ with open(options.output,'w') as fw:
+ fw.writelines(result)
+
+
+ def get_command_line(self):
+ ## *************************** Parse Command Line *****************************
+ parser = MyParser(
+ description = 'This Galaxy tool retrieves versions of prepared data sources and places them in a galaxy "Versioned Data" library',
+ usage = 'python versioned_data.py [options]',
+ epilog="""Details:
+
+ This tool retrieves links to current or past versions of fasta (or other key-value text) databases from a cache kept in the data library called "Versioned Data". It then places them into the current history so that subsequent tools can work with that data.
+ """)
+
+ parser.add_option('-r', '--retrievals', type='string', dest='retrievals',
+ help='List of datasources and their versions and galaxy workflows to return')
+
+ parser.add_option('-o', '--output', type='string', dest='output',
+ help='Path of output log file to create')
+
+ parser.add_option('-O', '--output_id', type='string', dest='output_id',
+ help='Output identifier')
+
+ parser.add_option('-d', '--date', type='string', dest='globalRetrievalDate',
+ help='Provide date/time for data recall.  Defaults to now.')
+
+ parser.add_option('-v', '--version', dest='version', default=False, action='store_true',
+ help='Version number of this program.')
+
+ parser.add_option('-s', '--api_info_path', type='string', dest='api_info_path', help='Galaxy user api key/path.')
+
+ return parser.parse_args()
+
+
+
+if __name__ == '__main__':
+
+ time_start = time.time()
+
+ reportEngine = ReportEngine()
+ reportEngine.__main__()
+
+ print('Execution time (seconds): ' + str(int(time.time()-time_start)))
+
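
Reviewer's sketch (not part of the changeset): the `-r` argument handled above is a `|`-separated list of retrievals, each of the form "spec_file_id, [version list], [workflow_list]". The snippet below shows how such a string decomposes, in Python 3. The function name `parse_retrievals` and the sample ids are hypothetical, and splitting the workflow list on whitespace is an assumption, since the code that consumes it is not shown in this diff.

```python
def parse_retrievals(raw):
    """Split the -r argument into (spec_file_id, version_id, workflow_ids) tuples."""
    parsed = []
    for retrieval in raw.strip().strip("|").split("|"):
        params = [p.strip() for p in retrieval.strip().split(",")]
        spec_file_id = params[0]
        # The form allows at most one version per data source; take the first token.
        version_id = params[1].split()[0] if len(params) > 1 and params[1] else ""
        # Assumption: workflow ids are separated by whitespace.
        workflows = params[2].split() if len(params) > 2 else []
        parsed.append((spec_file_id, version_id, workflows))
    return parsed


raw = " f1a2, 2015-08-01 , wf1 wf2 | f3b4, , | "
for item in parse_retrievals(raw):
    print(item)
```

An empty version field, as in the second retrieval, corresponds to the case where the tool falls back to the global retrieval date or to the latest version.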

diff -r d31a1bd74e63 -r 5c5027485f7d versioned_data.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/versioned_data.xml Sun Aug 09 16:07:50 2015 -0400

@@ -0,0 +1,124 @@
+<tool id="versioned_data" name="Versioned data retrieval" version="0.1.03">
+ <description>Retrieve versioned sequence files and/or their blast, bowtie, etc. database indexes</description>
+ <macros>
+ <token name="@BINARY@">versioned_data.py</token>
+ <import>bccdc_macros.xml</import>
+ </macros>
+ <expand macro="requirements" />
+ <command interpreter="python">
+ #assert $__user__, Exception( 'You must be logged in to use this tool.' )
+ versioned_data.py
+ #if $globalRetrievalDate.strip() > ''
+ -d "$globalRetrievalDate"
+ #end if
+ -r
+ "
+ #for $v in $versions:
+ ${v.database},
+ #for $r in $v.retrieval:
+ ${r.retrievalId}
+ #end for
+ ,
+ #for $w in $v.workflows:
+ ${w.workflow}
+ #end for
+ |
+ #end for
+ "
+ -o "$log"
+ -O "$__app__.security.encode_id($log.id)"
+ --api_info_path "$api_info_path" ##Actually a file path to configfile that holds api key
+ </command>
+ 
+ <expand macro="stdio" />
+
+ <inputs>
+ 
+ <param name="globalRetrievalDate" type="text" label="Global retrieval date [YYYY-MM-DD]" help="The recall system will use this date to try to select the appropriate versions below.  Leave empty to select current versions." size="25" />
+
+ <param name="api_info" display="radio" type="drill_down" label="For user with Galaxy API Key" dynamic_options="vdb_init_tool_user(__trans__)" />
+
+ <repeat name="versions" title="Data Source" min="1" max="15">
+
+ <param name="database" type="select" label="Data" dynamic_options="vdb_get_databases()" multiple="false" />
+
+ <repeat name="retrieval" title="Retrieval" min="0" max="1">
+ <param name="retrievalId" label="Version date/id" type="select" dynamic_options="vdb_get_versions(database, globalRetrievalDate)"/>
+ </repeat>
+
+ <repeat name="workflows" title="Workflow" min="0" max="5" >
+ <param name="workflow" type="select" label="Name" dynamic_options="vdb_get_workflows(database)" />
+ </repeat>
+
+ </repeat>
+
+ </inputs>
+
+ <configfiles>
+ <configfile name="api_info_path">${__user__.api_keys[0].key}
+ $api_info
+ </configfile>
+ </configfiles>
+
+ <outputs>
+ <data name="log" format="txt" label="Versioned Data Retrieval" />
+    </outputs>
+
+ <code file="versioned_data_form.py" />
+
+ <tests>
+ <test>
+ <param name="db_type" value="nucl"/>
+ 
+ </test>
+ </tests>
+
+    <help>
+
+.. class:: infomark
+
+
+**What it does**
+
+This tool retrieves links to current or past versions of fasta or other types of
+data from a cache kept in the Galaxy data library called "Versioned Data".  It then places
+them into one's current history so that subsequent tools can work with that data.
+
+For example, after using this tool to select a version of the NCBI nt database, a blast search can be carried out on it by selecting "BLAST database from your history" from the "Subject database/sequences" field of the NCBI BLAST+ search tool.
+
+You can select one or more files or databases by version date or id. This list
+is supplied from the Shared Data > Data Libraries > Versioned Data folder that has
+been set up by an administrator.
+
+The Workflows section allows you to select one or more pre-defined workflows
+to execute on the versioned data.  The results are placed in your history for use
+by other tools or workflows.
+
+The tool caches the versioned data and workflow results that it generates.
+If you request versioned data or derivative data that isn't cached, it may take time to regenerate.
+
+The "Global retrieval date [YYYY-MM-DD]" field at the top of the form is applied to
+all selected databases.  This can be overridden by a retrieval date or version that
+you supply for a particular database.  Leave it and any "Retrieval" inputs empty if you just need the latest version of the selected databases.
+
+-------
+
+.. class:: warningmark
+
+**Note**
+
+Again, some past database versions can take time to regenerate if there is no cached version available.  For example, NCBI nt is a 50+ gigabyte file that must be read through to produce a fasta version, and a makeblastdb workflow on top of that can take hours on the first call.  Access to cached versions is immediate.
+
+Setup of versioned data sources and workflow options can only be done by a Galaxy administrator.
+
+-------
+
+**References**
+
+If you use this Galaxy tool in work leading to a scientific publication please
+cite the following paper:
+
+*Reference coming soon...*
+
+    </help>
+</tool>
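
The `-r` argument assembled by the `<command>` template above is a single quoted string: each Data Source contributes `database,retrievalId,workflows` followed by `|`, with whatever whitespace the Cheetah loops emit in between. Below is a minimal sketch of how a receiving script might split that string. It assumes ids contain no commas and that workflow ids are separated by whitespace; `parse_retrieval_spec` is a hypothetical helper written for illustration, not the parser used by versioned_data.py.

```python
def parse_retrieval_spec(spec):
    """Split 'db,version,wf1 wf2|db2,,|' into a list of request dicts."""
    requests = []
    for chunk in spec.split('|'):
        if not chunk.strip():
            continue  # skip the empty piece after the trailing '|'
        # At most three comma-separated fields; missing ones default to ''.
        fields = [field.strip() for field in chunk.split(',', 2)]
        fields += [''] * (3 - len(fields))
        database, version, workflows = fields
        requests.append({
            'database': database,
            # Assumption: an empty version field means "no explicit version chosen".
            'version': version or None,
            'workflows': workflows.split(),
        })
    return requests


# Shape of the string the template would emit for two data sources.
example = "\n db1,\n 2015-01-01\n ,\n wfA\n wfB\n |\n db2,\n ,\n |\n"
parsed = parse_retrieval_spec(example)
```

Splitting on `|` first and then on at most two commas keeps the parser tolerant of the newlines and spaces that the template leaves around each field.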

diff -r d31a1bd74e63 -r 5c5027485f7d versioned_data_cache_clear.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/versioned_data_cache_clear.py Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,130 @@
+#!/usr/bin/python
+
+"""
+****************************** versioned_data_cache_clear.py ******************************
+ Call this script directly to clear out all but the latest galaxy Versioned Data data library
+ and server data store cached folder versions.
+
+ SUGGEST RUNNING THIS UNDER GALAXY OR LESS PRIVILEGED USER, BUT the versioneddata_api_key file does need to be readable by the user.
+
+"""
+import vdb_retrieval
+import vdb_common
+import glob
+import os
+
+# Note that globals from vdb_retrieval can be referenced by prefixing with vdb_retrieval.XYZ
+# Note that this script uses the admin_api established in vdb_retrieval.py
+
+retrieval_obj = vdb_retrieval.VDBRetrieval()
+retrieval_obj.set_admin_api()
+retrieval_obj.user_api = retrieval_obj.admin_api
+retrieval_obj.set_datastores()
+
+workflow_keepers = [] #stack of Versioned Data library dataset_ids that if found in a workflow data input folder key name, can be saved; otherwise remove folder.
+library_folder_deletes = []
+library_dataset_deletes = []
+
+# Cycle through datastores, listing subfolders under each, sorted.
+# Permanently delete all but latest subfolder.
+for data_store in retrieval_obj.data_stores:
+    spec_file_id = data_store['id']
+    # STEP 1:  Determine data store type and location
+    data_store_spec = retrieval_obj.admin_api.libraries.show_folder(retrieval_obj.library_id, spec_file_id)
+    data_store_type = retrieval_obj.test_data_store_type(data_store_spec['name'])
+
+    if data_store_type and data_store_type not in ('folder', 'biomaj'): # Folders are static - they don't do caching. Unrecognized types are skipped.
+
+        base_folder_id = data_store_spec['folder_id']
+        ds_obj = retrieval_obj.get_data_store_gateway(data_store_type, spec_file_id)
+
+        print
+
+        #Cycle through library tree; have to look at the whole thing since there's no /[string]/* wildcard search:
+        folders = retrieval_obj.get_library_folders(ds_obj.library_label_path)
+        for ptr, folder in enumerate(folders):
+
+            # Ignore folder that represents data store itself:
+            if ptr == 0:
+                print 'Data Store ::' + folder['name']
+
+            # Keep most recent cache item
+            elif ptr == len(folders)-1:
+                print 'Cached Version ::' + folder['name']
+                workflow_keepers.extend([item['id'] for item in folder['files']]) # Store dataset ids so the workflow cache check below can match them.
+
+            # Drop version caches that are further in the past:
+            else:
+                print 'Clearing version cache:' + folder['name']
+                library_folder_deletes.append(folder) # append(), since extend() on the id string would add single characters.
+                library_dataset_deletes.extend(folder['files'])
+
+
+        # Now auto-clean versioned/ folders too?
+        print "Server loc: " + ds_obj.data_store_path
+
+        items = os.listdir(ds_obj.data_store_path)
+        items = sorted(items, key=lambda el: vdb_common.natural_sort_key(el), reverse=True)
+        count = 0
+        for name in items:
+
+            # If it is a directory and it isn't the master or symlinked "current" one:
+            # Add ability to skip sym-linked folders too?
+            version_folder=os.path.join(ds_obj.data_store_path, name)
+            if not name == 'master' \
+                and os.path.isdir(version_folder) \
+                and not os.path.islink(version_folder):
+
+                count += 1
+                if count == 1:
+                    print "Keeping cache:" + name
+                else:
+                    print "Dropping cache:" + name
+                    for root2, dirs2, files2 in os.walk(version_folder):
+                        for version_file in files2:
+                            full_path = os.path.join(root2, version_file)
+                            print "Removing " + full_path
+                            os.remove(full_path)
+                        #Not expecting any subfolders here.
+
+                    os.rmdir(version_folder)
+
+
+# Permanently delete specific data library datasets:
+for item in library_dataset_deletes:
+    retrieval_obj.admin_api.libraries.delete_library_dataset(retrieval_obj.library_id, item['id'], purged=True)
+
+
+# Newer Bioblend API method for deleting galaxy library folders.
+# OLD Galaxy way possible: http DELETE request to {{url}}/api/folders/{{encoded_folder_id}}?key={{key}}
+if 'folders' in dir(retrieval_obj.admin_api):
+    for folder in library_folder_deletes:
+        retrieval_obj.admin_api.folders.delete(folder['id'])
+
+
+print workflow_keepers
+
+workflow_cache_folders = retrieval_obj.get_library_folders('/'+ vdb_retrieval.VDB_WORKFLOW_CACHE_FOLDER_NAME+'/')
+
+for folder in workflow_cache_folders:
+    dataset_ids = folder['name'].split('_') #input dataset ids separated by underscore
+    count = 0
+    for id in dataset_ids:
+        if id in workflow_keepers:
+            count += 1
+
+    # If every input dataset in workflow cache exists in library cache, then keep it.
+    if count == len(dataset_ids):
+        continue
+
+    # We have one or more cached datasets to drop.
+    print "Dropping workflow cache: " + folder['name']
+    for id in [item['id'] for item in folder['files']]:
+        print id
+        retrieval_obj.admin_api.libraries.delete_library_dataset(retrieval_obj.library_id, id, purged=True)
+
+    # NOW DELETE WORKFLOW FOLDER.
+    if 'folders' in dir(retrieval_obj.admin_api):
+        retrieval_obj.admin_api.folders.delete(folder['id'])
+
+
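
The server-side cleanup in the script above sorts version folder names in natural order, newest first, keeps the first real directory and removes the rest. The sketch below shows only that selection step as a dry run on plain names, so the outcome can be checked before anything is deleted. `natural_sort_key` here is a stand-in written for this example (the body of `vdb_common.natural_sort_key` is not shown in this part of the changeset), and `split_keep_and_drop` is a hypothetical helper.

```python
import re


def natural_sort_key(name):
    """Stand-in key: digit runs compare as integers, other text case-insensitively."""
    return [(0, int(part)) if part.isdigit() else (1, part.lower())
            for part in re.split(r'(\d+)', name) if part]


def split_keep_and_drop(names, skip=('master',)):
    """Return ([newest], [older...]) without touching the file system."""
    candidates = [name for name in names if name not in skip]
    ordered = sorted(candidates, key=natural_sort_key, reverse=True)
    return ordered[:1], ordered[1:]


keep, drop = split_keep_and_drop(['master', 'v2', 'v10', 'v1'])
```

A plain string sort would rank `v2` above `v10`; comparing digit runs as integers is what makes `v10` the version that is kept.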

diff -r d31a1bd74e63 -r 5c5027485f7d versioned_data_form.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/versioned_data_form.py Sun Aug 09 16:07:50 2015 -0400


@@ -0,0 +1,149 @@
+#!/usr/bin/python
+#
+import os
+import sys
+
+# Extra step enables this script to locate vdb_common and vdb_retrieval
+# From http://code.activestate.com/recipes/66062-determining-current-function-name/
+_self_dir = os.path.dirname(sys._getframe().f_code.co_filename)
+sys.path.append(_self_dir)
+
+import vdb_common
+import vdb_retrieval
+
+retrieval_obj = None # Global used here to manage user's currently selected retrieval info.
+
+
+def vdb_init_tool_user(trans):
+    """
+    Retrieves a user's api key if they have one, otherwise lets them know they need one
+    This function is automatically called from versioned_data.xml form on presentation to user
+    Note that this is how self.api_url gets back into form, for passage back to 2nd call via versioned_data.py
+    self.api_key is passed via secure <configfile> construct.
+    ALSO: squeezing history_id in this way since no other way to pass it.
+    "trans" is provided only by tool form presentation via <code file="...">
+    BUT NOW SEE John Chilton's: https://gist.github.com/jmchilton/27c5bb05e155a611294d
+    See galaxy source code at https://galaxy-dist.readthedocs.org/en/latest/_modules/galaxy/web/framework.html,
+    See http://dev.list.galaxyproject.org/error-using-get-user-id-in-xml-file-in-new-Galaxy-td4665274.html
+    See http://dev.list.galaxyproject.org/hg-galaxy-2780-Real-Job-tm-support-for-the-library-upload-to-td4133384.html
+    master api key, set in galaxy config: #self.master_api_key = trans.app.config.master_api_key
+    """
+    global retrieval_obj
+
+    api_url = trans.request.application_url + '/api'
+    history_id = str(trans.security.encode_id(trans.history.id))
+    user_api_key = None
+    #master_api_key = trans.app.config.master_api_key
+
+    if trans.user:
+
+        user_name = trans.user.username
+
+        if trans.user.api_keys and len(trans.user.api_keys) > 0:
+            user_api_key = trans.user.api_keys[0].key #First key is always the active one?
+            items = [ { 'name': user_name, 'value': api_url + '-' + history_id, 'options':[], 'selected': True } ]
+
+        else:
+            items = [ { 'name': user_name + ' - Note: you need a key (see "User" menu)!', 'value': '0', 'options':[], 'selected': False } ]
+
+    else:
+        items = [ { 'name': 'You need to be logged in to use this tool!', 'value': '1', 'options':[], 'selected': False } ]
+
+    retrieval_obj = vdb_retrieval.VDBRetrieval()
+    retrieval_obj.set_trans(api_url, history_id, user_api_key) #, master_api_key
+
+    return items
+
+
+def vdb_get_databases():
+    """
+    Called by Tool Form, retrieves list of versioned databases from galaxy library called "Versioned Data"
+
+    @return [name,value,selected] array
+    """
+    global retrieval_obj
+    items = retrieval_obj.get_library_data_store_list()
+
+    if len(items) == 0:
+        # Not great: Communicating library problem by text in form select pulldown input.
+        items = [[vdb_retrieval.VDB_DATA_LIBRARY_FOLDER_ERROR, None, False]]
+
+    return items
+
+
+def vdb_get_versions(spec_file_id, global_retrieval_date):
+    """
+    Retrieve applicable versions of given database.
+    Unfortunately this is only refreshed when form screen is refreshed
+
+    @param spec_file_id [folder_id]
+
+    @return [name,value,selected] array
+    """
+    global retrieval_obj
+
+    items = []
+    if spec_file_id:
+
+        data_store_spec = retrieval_obj.user_api.libraries.show_folder(retrieval_obj.library_id, spec_file_id)
+
+        if data_store_spec: #OTHERWISE DOES THIS MEAN USER DOESN'T HAVE PERMISSIONS?  VALIDATE
+
+            file_name = data_store_spec['name']  # Short (no path), original file name, not galaxy-assigned file_name
+            data_store_type = retrieval_obj.test_data_store_type(file_name)
+            library_label_path = retrieval_obj.get_library_label_path(spec_file_id)
+
+            if not data_store_type or not library_label_path:
+                # Kludgy method of sending message to user
+                items.append([vdb_retrieval.VDB_DATA_LIBRARY_FOLDER_ERROR, None, False])
+                items.append([vdb_retrieval.VDB_DATA_LIBRARY_CONFIG_ERROR + '"' + vdb_retrieval.VDB_DATA_LIBRARY + '/' + str(library_label_path) + '/' + file_name + '"', None, False])
+                return items
+
+            _retrieval_date = vdb_common.parse_date(global_retrieval_date)
+
+            # Loads interface for appropriate data source retrieval
+            ds_obj = retrieval_obj.get_data_store_gateway(data_store_type, spec_file_id)
+            items = ds_obj.get_version_options(global_retrieval_date=_retrieval_date)
+
+        else:
+            items.append(['Unable to find ' + spec_file_id + ': Check permissions?','',False])
+
+    return items
+
+
+def vdb_get_workflows(dbKey):
+    """
+    List appropriate workflows for database.  These are indicated by prefix "Versioning:" in name.
+    Currently can see ALL workflows that are published; admin_api() receives this in all galaxy versions.
+    Seems like only some galaxy versions allow user_api() to also see published workflows.
+    Only alternative is to list only individual workflows that current user can see - ones they created, and published workflows; but versioneddata user needs to have permissions on these too.
+
+    Future: Sensitivity: Some workflows apply only to some kinds of database
+
+    @param dbKey [data_spec_id]
+    @return [name,value,selected] array
+    """
+    global retrieval_obj
+
+    data = []
+    try:
+        workflows = retrieval_obj.admin_api.workflows.get_workflows(published=True)
+
+    except Exception as err:
+        if err.message[-21:] == 'HTTP status code: 403':
+            data.append(['Error: User does not have permissions to see workflows (or they need to be published).' , 0, False])
+        else:
+            data.append(['Error: In getting workflows: %s' % err.message , 0, False])
+        return data
+
+    oldName = ""
+    for workflow in workflows:
+        name = workflow['name']
+        if name[0:11].lower() == "versioning:" and name != oldName:
+            # Interface Bug: If an item is published and is also shared personally with a user, it is shown twice.
+            data.append([name, workflow['id'], False])
+            oldName = name
+
+    return data
+
+
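
`vdb_get_workflows()` above keeps workflows whose names start with "Versioning:" and skips a duplicate only when it directly follows the entry it repeats. The sketch below applies the same filter but tracks seen names in a set, so a duplicate is dropped even if the API does not return the two entries next to each other. `versioning_options` is a hypothetical helper that works on plain dicts with `name` and `id` keys; it makes no call to Galaxy or BioBlend.

```python
def versioning_options(workflows, prefix='versioning:'):
    """Build [name, value, selected] rows for workflows matching the prefix."""
    seen = set()
    options = []
    for workflow in workflows:
        name = workflow['name']
        # Case-insensitive prefix match, first occurrence of each name only.
        if name.lower().startswith(prefix) and name not in seen:
            seen.add(name)
            options.append([name, workflow['id'], False])
    return options


sample = [
    {'name': 'Versioning: makeblastdb', 'id': 'a1'},
    {'name': 'Unrelated workflow', 'id': 'b2'},
    {'name': 'Versioning: makeblastdb', 'id': 'c3'},  # published and shared copy
    {'name': 'versioning: bowtie index', 'id': 'd4'},
]
options = versioning_options(sample)
```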