annotate doc/data_stores.md @ 1:5c5027485f7d draft

Uploaded correct file
author damion
date Sun, 09 Aug 2015 16:07:50 -0400
# Server Data Stores

These folders hold the git, Kipper, and plain folder data stores. A data store can also connect directly to a Biomaj databank, though only to databanks that have .fasta files in their /flat/ subfolder. How you store your data depends on the data itself (size, format) and how often it is used; large fasta databases should generally be held in a Kipper data store. Each git or Kipper data store needs:
* An appropriately named folder;
* A sub-folder called "master/", with the data store files set up in it;
* A file called "pointer.[type of data store]" which contains a path to itself. This file will be linked into a Galaxy data library.

Note that all these folders must be accessible to the user that runs your Galaxy installation for Galaxy integration to work. Specifically, Galaxy needs recursive rwx permissions on these folders.

To start, you may want to set up the Kipper RDP RNA database (included in the RDP-test-case folder of the Kipper repository; see the README.md file there). It comes complete with the folder structure and files for the Kipper example that follows. You will have to adjust the content of the pointer.kipper file appropriately.

The data store folders can be placed within a single folder, or in different locations on the server as desired. In our example, the versioned data stores are all located under /projects2/reference_dbs/versioned/.
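The folder, "master/" sub-folder, pointer file, and permissions described above could be set up along the following lines. This is only a minimal sketch: `make_store` is a hypothetical helper (not part of Kipper or Galaxy), the demo runs in a throwaway temporary folder, and it assumes that "a path to itself" means the absolute path of the pointer file and that the script is run as the user that runs Galaxy.

```bash
#!/usr/bin/env bash
# Sketch: create the skeleton of a git or Kipper data store.
# make_store is a hypothetical helper, not part of Kipper or Galaxy.
set -eu

make_store() {
    local root="$1" name="$2" type="$3"   # type: kipper or git
    local store="$root/$name"

    # Folder for the data store plus its master/ sub-folder.
    mkdir -p "$store/master"

    # Assumption: the pointer file holds its own absolute path.
    printf '%s\n' "$store/pointer.$type" > "$store/pointer.$type"

    # Owner gets rw on files and rwx on folders, recursively.
    # Assumes the owner is the user that runs Galaxy.
    chmod -R u+rwX "$store"
}

# Demo in a throwaway folder; a real setup would use a root such as
# /projects2/reference_dbs/versioned instead.
ROOT="$(mktemp -d)"
make_store "$ROOT" RDP_RNA kipper
find "$ROOT" | sort
```

The data store files themselves (the Kipper database or the git repository) still have to be created inside "master/", as the examples below describe.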

## Data Store Examples

### Kipper data store example

A data store for the RDP RNA database:
```bash
/projects2/reference_dbs/versioned/RDP_RNA/
/projects2/reference_dbs/versioned/RDP_RNA/pointer.kipper
/projects2/reference_dbs/versioned/RDP_RNA/master/
/projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna_1
/projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna_2
/projects2/reference_dbs/versioned/RDP_RNA/master/rdp_rna.md
```
To start a Kipper data store from scratch, go into the master folder, initialize a Kipper data store there, and import a version of a content file into the Kipper database.

```bash
cd master/
# Initialize a Kipper data store named rdp_rna
kipper.py rdp_rna -M fasta
# Import a version of a content file into it
kipper.py rdp_rna -i [file to import] -o .
```

Kipper stores and retrieves just one file at a time. Currently there is no provision for retrieving multiple files from different Kipper data stores at once.

Note that very large temporary files can be generated during the archive/recall process. For example, a compressed 10 GB NCBI "nr" input fasta file may be re-sorted and reformatted, or a 10 GB file may be transformed from Kipper format and written to an output file. For this reason, any necessary temporary files are placed in the input and output folders specified on the Kipper command line, because the system /tmp/ folder can be too small to hold them.
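Because the temporary files land in the input and output folders, it can be worth checking free space there before a large archive or recall. The following is a rough sketch using the standard `df` utility; the `free_kb` helper and the 10 GB threshold are illustrative assumptions, not something Kipper provides or requires.

```bash
#!/usr/bin/env bash
# Sketch: report free space (in KB) on the filesystem holding a folder,
# so it can be compared with the size of the file being archived or recalled.
set -eu

free_kb() {
    # df -P prints one POSIX-format line per filesystem; column 4 is the
    # available space, reported in KB because of -k.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

OUT_DIR="${1:-.}"                   # folder that would be given to Kipper's -o option
NEEDED_KB=$((10 * 1024 * 1024))     # hypothetical threshold: 10 GB

AVAILABLE_KB="$(free_kb "$OUT_DIR")"
if [ "$AVAILABLE_KB" -lt "$NEEDED_KB" ]; then
    echo "Warning: only ${AVAILABLE_KB} KB free in $OUT_DIR" >&2
fi
```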

### Folder data store example

In this scenario the data we want to archive probably isn't of a key-value nature, nor is it amenable to diff storage via git, so we store each version as a separate file. We don't need the "master" sub-folder since there is no master database, but we do need one additional folder per version to hold that version's file(s).

The version folder names must be in the format **[date]_[version id]** to convey to users the date and version id of each version. The folder names are displayed directly in the Galaxy Versioned Data tool's selectable list of versions. This format allows for various date and time granularities, e.g. "2005_v1" and "2005-01-05 10:24_v1" are both acceptable folder names, and several versions can be published on the same day.

Example of a refseq50 protein database as a folder data store:

```bash
/projects2/reference_dbs/versioned/refseq50/
/projects2/reference_dbs/versioned/refseq50/pointer.folder
/projects2/reference_dbs/versioned/refseq50/2005-01-05_v1/file.fasta
/projects2/reference_dbs/versioned/refseq50/2005-01-05_v2/file.fasta
/projects2/reference_dbs/versioned/refseq50/2005-02-04_v3/file.fasta
...
/projects2/reference_dbs/versioned/refseq50/2005-05-24_v4/file.fasta
...
/projects2/reference_dbs/versioned/refseq50/2005-09-27_v5/file.fasta
```

etc...

A data store of type "folder" doesn't have to be stored outside of Galaxy. Exactly the same folder structure can be set up directly within the Galaxy data library, and files can be uploaded into it. The one drawback to this approach is that other server users (those not working through Galaxy) then don't have easy access to the versioned data.

Needless to say, administrators should **never delete these files since they are not cached!**
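The **[date]_[version id]** folder naming convention described earlier in this section can be checked before a new version folder is published. The sketch below is illustrative only: `split_version_name` is a hypothetical helper, and it assumes the version id is everything after the last underscore, which the text above does not state explicitly.

```bash
#!/usr/bin/env bash
# Sketch: split a version folder name of the form [date]_[version id].
# split_version_name is a hypothetical helper, not part of Versioned Data.
set -eu

split_version_name() {
    local name="$1"
    # Assumption: the version id is everything after the last underscore.
    local date="${name%_*}" id="${name##*_}"
    # Reject names with no underscore or with an empty date or id.
    [ "$date" != "$name" ] && [ -n "$date" ] && [ -n "$id" ] || return 1
    printf '%s\t%s\n' "$date" "$id"
}

for name in "2005_v1" "2005-01-05 10:24_v1" "file.fasta"; do
    if parts="$(split_version_name "$name")"; then
        printf 'ok       %s\n' "$parts"
    else
        printf 'rejected %s\n' "$name"
    fi
done
```

Splitting at the last underscore keeps the date part free to contain spaces, dashes, or colons, as in the second example name.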

### Git data store example

A git data store for versions of the NCBI 16S microbial database would look like:

```
/projects2/reference_dbs/versioned/ncbi_16S/
/projects2/reference_dbs/versioned/ncbi_16S/pointer.git
/projects2/reference_dbs/versioned/ncbi_16S/master/.git (hidden folder)
```

One must initialize a git repository (or clone one) into the master/ folder. The Versioned Data system depends on git tags to specify versions of data. See the Git Data Store section for details.

To start a git repository from scratch, go into the master folder, initialize git there, copy versioned content into the folder, and then commit it. Finally, add a git tag that gives the version identifier. The Versioned Data system distinguishes versions only by their tag name, so there can be several commits between versions.

```
cd master/
git init
cp [files from wherever] ./
git add [those files]
git commit -m 'various changes'
...
git add [changed files]
git commit -m 'various changes'
...
# Tag the current commit as version v1
git tag -a v1 -m v1
```

Once your tag is defined, it will be listed in Galaxy as a version.
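To see which tags a repository exposes, and what a given version contains, something like the following can be run. It is a self-contained sketch in a throwaway repository: the file name `db.fasta`, the demo identity, and the `commit` and `tag` wrappers are made up for illustration, and `git archive` is just one way to export a tagged version, not necessarily how the Versioned Data system retrieves it.

```bash
#!/usr/bin/env bash
# Sketch: build a throwaway repository with two tagged versions, list the
# tags, and export one version without touching the working copy.
set -eu

REPO="$(mktemp -d)/master"
mkdir -p "$REPO"
cd "$REPO"
git init -q

# Identity is passed per command so the demo runs without global git config.
commit() { git -c user.name=demo -c user.email=demo@example.org commit -q -m "$1"; }
tag()    { git -c user.name=demo -c user.email=demo@example.org tag -a "$1" -m "$1"; }

echo ">seq1" > db.fasta;  git add db.fasta; commit 'first import'
echo "ACGT" >> db.fasta;  git add db.fasta; commit 'fix sequence'
tag v1                    # two commits, but only one version
echo ">seq2" >> db.fasta; git add db.fasta; commit 'add seq2'
tag v2

# Each tag listed here corresponds to one selectable version.
git tag -l

# Export the files exactly as they were at tag v1 into a separate folder.
OUT="$(mktemp -d)"
git archive --format=tar v1 | tar -xf - -C "$OUT"
cat "$OUT/db.fasta"
```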

### Biomaj data store example

In this scenario the data we want versioned access to sits directly in the /flat/ folder of a Biomaj databank. Each version is a separate file that Biomaj manages. Biomaj can be set to keep all old versions alongside any new one it downloads, or to limit the total number of versions to a fixed number (with the oldest removed when the newest arrives).

*coming soon...*