comparison doc/data_stores.md @ 1:5c5027485f7d draft

Uploaded correct file
author damion
date Sun, 09 Aug 2015 16:07:50 -0400
# Server Data Stores

These folders hold the git, Kipper, and plain folder data stores. A data store can also connect directly to a Biomaj databank, though only to databanks that have .fasta files in their /flat/ subfolder. How you store your data depends on the data itself (size, format) and how often it is used; in general, large fasta databases should be held in a Kipper data store. Each git or Kipper data store needs:
* An appropriately named folder;
* A sub-folder called "master/", with the data store files set up in it;
* A file called "pointer.[type of data store]" which contains a path to itself. This file will be linked into a Galaxy data library.

Note that all these folders need to be accessible by the same user that runs your Galaxy installation for Galaxy integration to work. Specifically, Galaxy needs recursive rwx permissions on these folders.
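
The checklist above can be sketched as a short shell script. This is a minimal sketch, not part of the Versioned Data distribution: the `make_store` function name and the scratch base folder are illustrative assumptions, and it assumes the pointer file simply holds its own absolute path. On a real server it would be run as the user that runs Galaxy.

```bash
# Sketch: lay out an empty git or Kipper data store skeleton.
# make_store is a hypothetical helper, not part of Kipper or Galaxy.
make_store() {
    local base="$1" name="$2" type="$3"
    local store="$base/$name"

    # The named folder plus its master/ sub-folder.
    mkdir -p "$store/master"

    # Assumption: the pointer file contains its own absolute path.
    printf '%s\n' "$store/pointer.$type" > "$store/pointer.$type"

    # Recursive rwx for the owning user; X adds execute only to
    # directories and to files that are already executable.
    chmod -R u+rwX "$store"
}

# Demonstration in a scratch location rather than a real data store root.
base="$(mktemp -d)"
make_store "$base" RDP_RNA kipper
cat "$base/RDP_RNA/pointer.kipper"
```

The data store files themselves still have to be set up inside master/ afterwards, as described in the examples that follow.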

To start, you may want to set up the Kipper RDP RNA database (included in the Kipper repository's RDP-test-case folder; see the README.md file there). It comes complete with the folder structure and files for the Kipper example that follows. You will have to adjust the content of the pointer.kipper file appropriately.

The data store folders can be placed within a single folder, or in different locations on the server as desired. In our example, versioned data stores are all located under /projects2/reference_dbs/versioned/.

## Data Store Examples

### Kipper data store example

A data store for the RDP RNA database:
```bash
/projects2/reference_dbs/versioned/RDP_RNA/
/projects2/reference_dbs/versioned/RDP_RNA/pointer.kipper
/projects2/reference_dbs/versioned/RDP_RNA/master/
/projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna_1
/projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna_2
/projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna.md
```
To start a Kipper data store from scratch, go into the master folder, initialize a Kipper data store there, and import a version of a content file into the Kipper db.

```bash
cd master/
# Initialize a Kipper data store named rdp_rna for fasta content
kipper.py rdp_rna -M fasta
# Import a version of a content file into the Kipper db
kipper.py rdp_rna -i [file to import] -o .
```

Kipper stores and retrieves just one file at a time. Currently there is no provision for retrieving multiple files from different Kipper data stores at once.

Note that very large temporary files can be generated during the archive/recall process. For example, a compressed 10 GB NCBI "nr" input fasta file may be re-sorted and reformatted, or a 10 GB file may be transformed from Kipper format and written to an output file. For this reason, any necessary temporary files are placed in the input and output folders specified on the Kipper command line, because the system /tmp/ folder can be too small to hold them.
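
Because these temporary files land in the input and output folders, it can be worth checking the free space there before a large import or recall. The following is a minimal sketch using the standard `df` utility; the `has_free_kb` helper and the 20 GB threshold are illustrative assumptions, not Kipper features.

```bash
# Succeeds if the filesystem holding folder $1 has at least $2 KB available.
# has_free_kb is a hypothetical helper, not part of Kipper.
has_free_kb() {
    local dir="$1" needed_kb="$2" avail_kb
    # -P forces POSIX single-line output; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ "$avail_kb" -ge "$needed_kb" ]
}

# Example: ask for roughly 20 GB (expressed in KB) in the current folder.
if has_free_kb . 20971520; then
    echo "enough space for temporary files"
else
    echo "not enough space; choose another output folder" >&2
fi
```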

### Folder data store example

In this scenario the data we want to archive probably isn't of a key-value nature, nor is it amenable to diff storage via git, so we store each version as a separate file. We don't need the "master" sub-folder since there is no master database, but we do need one additional folder per version to hold that version's file(s).

The version folder names must be in the format **[date]_[version id]** to convey to users the date and version id of each version. The folder names are displayed directly in the Galaxy Versioned Data tool's selectable list of versions. This allows for various date and time granularity, e.g. "2005_v1" and "2005-01-05 10:24_v1" are both acceptable folder names. Several versions can be published on the same day.

Example of a refseq50 protein database as a folder data store:

```bash
/projects2/reference_dbs/versioned/refseq50/
/projects2/reference_dbs/versioned/refseq50/pointer.folder
/projects2/reference_dbs/versioned/refseq50/2005-01-05_v1/file.fasta
/projects2/reference_dbs/versioned/refseq50/2005-01-05_v2/file.fasta
/projects2/reference_dbs/versioned/refseq50/2005-02-04_v3/file.fasta
...
/projects2/reference_dbs/versioned/refseq50/2005-05-24_v4/file.fasta
...
/projects2/reference_dbs/versioned/refseq50/2005-09-27_v5/file.fasta
```

etc.
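
Publishing a new version in a folder data store then amounts to creating one more dated folder and copying the file(s) into it. The following is a minimal sketch, assuming ISO-style dates as in the listing; `add_version` is a hypothetical helper, and the scratch paths stand in for a real data store.

```bash
# Copy file $3 into a new [date]_[version id] folder under store $1.
# add_version is a hypothetical helper, not part of the Versioned Data tools.
add_version() {
    local store="$1" version_id="$2" src="$3"
    local dir
    dir="$store/$(date +%Y-%m-%d)_$version_id"
    # Refuse to reuse an existing version folder.
    if [ -e "$dir" ]; then
        echo "version folder already exists: $dir" >&2
        return 1
    fi
    mkdir -p "$dir" && cp "$src" "$dir/"
}

# Demonstration in a scratch folder.
tmp="$(mktemp -d)"
store="$tmp/refseq50"
src="$tmp/file.fasta"
mkdir -p "$store"
printf '>seq1\nMKV\n' > "$src"
add_version "$store" v6 "$src"
ls "$store"
```

Refusing to overwrite an existing folder is a deliberate choice here, since an already published version should not change underneath its users.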

A data store of type "folder" doesn't have to be stored outside of Galaxy. Exactly the same folder structure can be set up directly within the Galaxy data library, and files can be uploaded into it. The one drawback of this approach is that other (non-Galaxy) server users then don't have easy access to the versioned data.

Needless to say, administrators should **never delete these files since they are not cached!**

### Git data store example

A git data store for versions of the NCBI 16S microbial database would look like:

```
/projects2/reference_dbs/versioned/ncbi_16S/
/projects2/reference_dbs/versioned/ncbi_16S/pointer.git
/projects2/reference_dbs/versioned/ncbi_16S/master/.git (hidden folder)
```

One must initialize a git repository (or clone one) into the master/ folder. The Versioned Data system depends on the use of git tags to specify versions of data. See the Git Data Store section for details.

To start a git repository from scratch, go into the master folder, initialize git there, copy versioned content into the folder, and then commit it. Finally, add a git tag that describes the version identifier. The Versioned Data system distinguishes versions only by their tag name, so there can be several commits between versions.

```
cd master/
git init
# Copy the versioned content in and commit it
cp [files from wherever] ./
git add [those files]
git commit -m 'various changes'
...
# Any number of further commits may precede a version
git add [changed files]
git commit -m 'various changes'
...
# The tag name is the version identifier
git tag -a v1 -m v1
```

Once your tag is defined, it will be listed in Galaxy as a version.
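
Since tags are what define versions, listing them is a quick check of what should appear. The sketch below uses only standard git commands in a throwaway repository; how the Versioned Data system itself reads tags is not shown here, and the file name and identity settings are placeholders.

```bash
# Build a scratch repository with two tagged versions, then list the tags.
repo="$(mktemp -d)/master"
mkdir -p "$repo" && cd "$repo" || exit 1
git init -q
# Placeholder identity so the commits work on a machine with no git config.
git config user.name "Data Admin"
git config user.email "admin@example.org"

printf '>seq1\nACGT\n' > db.fasta
git add db.fasta
git commit -q -m 'first release'
git tag -a v1 -m v1

printf '>seq2\nGGCC\n' >> db.fasta
git commit -q -a -m 'second release'
git tag -a v2 -m v2

# One line per version tag.
git tag --list

# The content of any version can be read back by its tag name.
git show v1:db.fasta
```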

### Biomaj data store example

In this scenario the data we want versioned access to sits directly in the /flat/ folder of a Biomaj databank. Each version is a separate file that Biomaj manages. Biomaj can be set to keep all old versions alongside any new one it downloads, or it can limit the total number of versions to a fixed number (with the oldest removed when the newest arrives).

*coming soon...*