annotate doc/background.md @ 1:5c5027485f7d draft

Uploaded correct file
author damion
date Sun, 09 Aug 2015 16:07:50 -0400
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
1
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
1 # Background Research
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
2
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
3 ### Git Archiving
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
4
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
5 Git was investigated as a generic solution for versioning textual files but in a number of cases its performance was detrimental. Initially the git archive option seemed to fail on fasta files of any size. In tests git was replacing one version with another by a diff that contained deletes of every original line followed by inserts of every line in new file.
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
6
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
7 By reformatting the incoming fasta file so that each line held one fasta record >[sequence id] [description] [sequence], git's diff algorithm seemed to regain its efficiency. The traditional format:
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
8
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
9 ```
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
10 >gi|324234235|whatever a description
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
11 ATGCCGTAAGATTC ...
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
12 ATGCCGTAAGATTC ...
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
13 ATGCCGTAAGATTC ...
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
14 >gi|25345345|another entry
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
15 etc.
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
16 ```
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
17
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
18 was converted to a 1 line format:
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
19
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
20 ```
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
21 >gi|324234235|whatever a description [tab] ATGCCGTAAGATTCATGCCGTAAGATTCATGCCGTAAGATTC ...
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
22 >gi|25345345|another entry ...
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
23 ```
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
24
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
25 This approach seemed to archive fasta files very efficiently. However it turned out this only worked for some kinds of fasta data. Git's file differential algorithm failed possibly because of a lack of variation from line to line of a file - it may need lines to have different lengths, word count, or variation in character content (it seemed to be ok for some nucleotide data, but we found it occasionally failed on uniprot v1 to v11 protein data.
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
26
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
27 There is a git feature to disable diff analysis for files above a certain file size, which then makes git archive size comparable to a straight text-file compression algorithm (zip/gzip). Overall, a practical limit of 15gb has been reported for git archive size. The main barrier seems to be content format; archive retrieval does get slower over time if diffs are being calculated.
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
28
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
29 Git also has has very high memory requirements to do its processing, which is a server liability. Tests indicate that for some inputs, git is successful for file sizes up to at least a few gigabytes.
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
30
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
31 ### XML Archiving
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
32
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
33 We also examined the versioning of XML content. The XArch: http://xarch.sourceforge.net/ approach is very promising as a generic solution, though it entails slow processing due to the nature of xml documents. It compares current and past xml documents and datestamps the additions/deletions to an xml structure. A good article summarizing the XML archive problem: http://useless-factor.blogspot.ca/2008/01/matching-diffing-and-merging-xml.html
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
34
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
35
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
36 ### Other Database Solutions
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
37
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
38 A number of key-value databases exist (e.g. LevelDB, LMDB, and DiscoDB) but they are designed to gain quick access to individual key-value entries, not to query properties of each key essential to version construction, namely to filter every key in the database by when it was created or deleted. Some key-value databases have features that could support speedy version retrieval.
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
39
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
40 Google's Leveldb - a straight key-value file based database - may work, though we would need to think up a scheme for including create & delete times in the key for each fasta identifier. Possibly 1 key=fasta id record indicating existence of fasta identifier sequence; and subsequent fastaId.created.deleted key as per the keydb solution. An advantage here is that compression is built-in. In the case where incoming data consists a full file to synchronize with, inserts are easy to account for - but deletes can be very tricky since finding them involves scanning through the entire leveldb's key list, with the assumption that they are sorted the same way as the incoming data (the keydb approach) (VERIFY THIS???)
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
41
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
42 For processing an incoming entire version of key-value content, the functionality required is that the keys are stored in a reliable sort order, and that they can be read fast sequentially . During this loop we can retrieve all the create/update/delete transactions of each key by version or date. (Usually the db has no ability to structure the value data though, so we would usually have to create such a schema). This is basically what keydb does.
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
43
5c5027485f7d Uploaded correct file
damion
parents:
diff changeset
44 For processing separate incoming files that contain only inserts and deletes, the job is much the same, only we skip swaths of the master database content. At some point it becomes more attractive performance-wise to implement an indexed key value store, thus avoiding the sorting and sequential access necessary for simpler key-value dbs.