srf2fastq/io_lib-1.12.2/docs/ZTR_format @ 0:d901c9f41a6a (default tip)
author: dawe
date:   Tue, 07 Jun 2011 17:48:05 -0400
Migrated tool version 1.0.1 from old tool shed archive to new tool shed repository

Notes: 28th May 2008

For version 2.0 consider the following:

1) Remove defunct or useless chunk types and compression formats.
2) Rationalise inconsistent behaviour (eg endianness on zlib chunk).
3) Support split header/data formats for SRF.
4) Formalise meta-data use better.
5) More pie-in-the-sky ideas?

What we've described so far could easily be said to be v1.4. It's
backwards compatible and the change is fairly minor. If we truly want
to go for version 2 then taking the chance to remove all those niggles
that we've kept purely for backwards compatibility would be good.

In more detail:

1) Removal of RLE and floating point chebyshev polynomials. Mark XRLE
   as deprecated?

   We may wish to add an extra option to XRLE2 to indicate the repeat
   count before specifying the remaining run-length. This breaks the
   format though. (Or add XRLE3 to allow such control?)

2) Strange things I can see are:

   2.1) All chunks use big-endian data except for zlib which has a
        little-endian length.

   2.2) The order that data is stored in differs per chunk type. For
        trace data we store all As, then all Cs, all Gs and finally
        all Ts. For confidence values we store called first followed
        by remaining. Both SMP4 and CNF4 essentially hold 1 piece of
        data per base type per base position; it's just the word size
        and packing order that differs.

        This means TSHIFT and QSHIFT compression types are tied very
        much to trace and quality value chunks, rather than being
        generic transforms. Maybe we should always have the same
        encoding order and some standard compression/transformations
        to reorder as desired.

   An example:
       All data related per call is stored in the natural order
       produced (eg as utilised in CNF1, BPOS).

       All data related per base-type per call is stored in the order
       produced: A, C, G, T for the first base position, A, C, G, T
       for the second position, and so on.

       Then we have standard filters that can swap between
       ACGTACGTACGT... and AAA...CCC...GGG...TTT... or to
       <called><non-called * 3>... order (which requires a BASE chunk
       present to encode/decode). We'd have 1, 2 and 4 byte variants
       of such filters. They do not need to understand the nature of
       the data they're manipulating, just the word size and a
       predetermined order to shuffle the data around in.

       For CNF4 a combination of {ACGT}* to {<called><non-called*3>}*
       followed by {ACGT}* to A*C*G*T* ordering would end up with all
       <called> followed by all 3 remaining non-called. Ie as it is
       now (which we then promptly "undo" in solexa data by using
       TSHIFT).
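
   Such a generic reorder filter is little more than a word-wise
   transpose. A minimal sketch (not part of the spec; the function
   name, argument names and channel count are all assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical generic reorder filter: convert channel-interleaved
 * data (A C G T, A C G T, ...) into planar order (A A ... C C ...
 * G G ... T T ...) for any word size. The filter never interprets
 * the data; it only needs the word size and the shuffle order. */
void interleaved_to_planar(const unsigned char *in, unsigned char *out,
                           size_t nbases, size_t nchan, size_t word)
{
    for (size_t b = 0; b < nbases; b++)
        for (size_t c = 0; c < nchan; c++)
            memcpy(out + (c * nbases + b) * word,
                   in  + (b * nchan + c) * word, word);
}
```

   The inverse (planar to interleaved) is the same loop with the two
   memcpy index expressions swapped.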

3) I'm wondering if there's mileage here in having negative lengths to
   indicate constant data + variable data further on.

   Eg length -10 means the next 10 bytes are the start of the data for
   this chunk. At some later stage we'll read a 4-byte length followed
   by the remaining data for this chunk.

   Rationale: often we end up with many identical bytes at the start
   of a chunk. For example, we take a solexa trace (0 0 value...), run
   it through TSHIFT (80 0 0 0 previous data => 80 0 0 0 0 0 value ...)
   and then through STHUFF (77 80(eg) data), but data is the
   compressed stream always starting with 80 0 0 0 0 0 so typically it's
   always the same starting string.

   Tested on an SRF file I see SMP4 always starting with the same 9
   bytes of data, BASE starting with the same 3 bytes and CNF4 always
   starting with the same 7 bytes. Hence we'd have lengths -9, -3 and
   -7 in the chunk headers and move that common data to the header
   block too. That's approx 3% of the size of our SRF file.

4) I propose *all* chunks have some standard meta-data fields
   available for use. These can be:

   4.1) GROUP - all chunks sharing the same GROUP value are considered
        as being related to one another. This provides a mechanism for
        multiple base-call, base position and confidence value chunks
        while still knowing which confidence values belong to which
        call. It also allows for multiple SAMP chunks (instead of the
        SMP4 chunk) to be collated together if desired.

        I don't expect many ZTR files to contain calls from multiple
        base-callers, but it's maybe a nice extension and seems quite
        a simple/clean use of meta-data.

   4.2) ENCODING - the default encoding for the chunk data is as
        described in the chunk. We may however wish to override this
        and, for example, store SMP4 data as 32-bit floating point
        values instead of 16-bit integers. This specifies that.

        Question: do we want this available universally everywhere? If
        not, we should at least use the same meta-data keyword for all
        occurrences.

   4.3) TRANSFORM - a simple transformation description. This is
        essentially a mini-formula. It replaces the OFFS meta-data
        used in SMP4 which is simply a transform of X+value.

5) There are more generic ways to save storage by removing redundancy.

   Most probably they're not worth it, but I list them here for
   discussion anyway.

   5.1) Use 7-bit variable sized encodings for values instead of fixed
        32-bit sizes.

        Eg instead of storing 1000 as 0x3*0x100 + 0xe8 (00 00 03 e8)
        we could store it as 0x7*0x80 + 0x68 (80|07 68). The logic
        here is that setting the top bit implies this isn't the final
        value and more data follows. It allows for variable sized
        fields so that small numbers take up fewer bytes. The same can
        be applied to data in SRF structs too.

        Realistically it saves 2 bytes per record in SRF and an
        unknown amount for ZTR - estimated 8 or so (3 for cnf4/base
        and 2 for smp4). It's only a 1.5% saving in total though.
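
        The scheme above (most-significant 7-bit group first, top bit
        set meaning "more bytes follow", so 1000 becomes 87 68 hex)
        can be sketched as follows. This is an illustration only; the
        function names are assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode v as 7-bit groups, most significant group first; the top
 * bit flags that more bytes follow. Returns the number of bytes
 * written to out (at most 5 for a 32-bit value). */
size_t varint_encode(uint32_t v, unsigned char *out)
{
    unsigned char tmp[5];
    size_t n = 0, len = 0;
    do {
        tmp[n++] = v & 0x7f;   /* least significant group first */
        v >>= 7;
    } while (v);
    while (n--)                /* emit in reverse, flag all but last */
        out[len++] = tmp[n] | (n ? 0x80 : 0);
    return len;
}

/* Decode one value, advancing *in past it. */
uint32_t varint_decode(const unsigned char **in)
{
    uint32_t v = 0;
    unsigned char c;
    do {
        c = *(*in)++;
        v = (v << 7) | (c & 0x7f);
    } while (c & 0x80);
    return v;
}
```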

   5.2) A general purpose dictionary system. Instead of attempting to
        move headers to one area and data somewhere else, possibly
        also taking common portions of data and putting that somewhere
        too, we could provide a dictionary system whereby we first
        remove redundancy by replacing all occurrences of a
        particular byte pattern with a new shorter code. (We'd need an
        escape mechanism for when it occurs by chance.) The dictionary
        can then be specified in its own chunk which is stored in the
        header portion.

        This then works for portions of chunk header (eg if the
        meta-data changes) rather than full headers, where the data
        blocks always start with the same text, or where we want to
        have sensible names in text fields but don't like them taking
        up too much space.

        It's maybe a bit messy though and complex to implement, plus
        it's unknown how big an impact having to escape accidental
        dictionary codes would have on real data. The more formal
        way of removing redundancy is probably better.

   5.3) Lossy compression. I believe there's still room for this,
        although it needs careful thought.

        The floating point format really isn't an ideal way to do it
        though, so I'd much rather have an encoding system that uses
        N*log(signal/M+1) plus a sign bit, stored in integers.

        As we store data in integers the value of N combined with the
        maximum value for log(signal/M+1) gives us the number of bits
        we wish to encode to. Essentially we're storing the log value
        to a fixed point precision.

        The value of M dictates the slope of the errors we get from
        logging. It's hard to describe, but basically as signal gets
        larger our average error in storing the signal also gets
        larger. That's true for floating point values too as there's
        a fixed number of bits and they're being used to represent
        larger and larger values, meaning the resolution drops.

        I have various test code and graphs showing error profiles
        for logs vs fixed point vs floating point. Logs or fixed
        point are nearly always preferable to a floating point format
        for size vs accuracy.

-----------------------------------------------------------------------------

CHANGE (since 1.2):
    SAMP and SMP4 now have meta-data fields indicating the zero base-line.

CLARIFICATION
    The specification now explicitly states that trace samples are
    unsigned, although the new OFFS meta-data can be used to turn these
    into signed values.

CLARIFICATION
    We explicitly state that multiple TEXT chunks may be present in the
    ZTR file and will be concatenated together. Also the trailing (nul)
    byte is now optional.

CHANGE
    Added CSET (character set) meta-data for BASEs so ABI SOLID encoding
    can be used. This removes the requirement of IUPAC characters only.

CHANGE
    Added XRLE2, QSHIFT, TSHIFT and STHUFF compression types.

INCOMPATIBLE CHANGE:
    I propose for this version to make all meta-data adhere to a specific
    format rather than being ad hoc. It'll consist of zero or more copies
    of 'identifier nul value nul'. See the format below for details.

    The only use of meta-data in 1.2 was for SAMP (not SMP4) chunks to
    indicate the channel the data came from. From now on file readers will
    need to check the version number in the header to determine how to
    parse the SAMP meta-data.


[Search for "FIXME" for my comments / questions to be answered. They
elaborate on the summary below and provide more context.]


QUESTION1:
Should we adapt ZTR to not be so inefficient with regards to tiny
chunks? Specifically a 5 byte chunk size, 4 byte meta-data size
(normally zero anyway) and 4 byte data length is all wasteful. These
combined comprise 5-10% of the total SRF size. Note that changing this
would break backwards compatibility.

QUESTION2:
Do I need a means to specify the "default meta-data"? Specifically if
we have lots of SAMP chunks (for example) and every single one is
stating that the zero "offset" value is 32768 then we may want a
mechanism of specifying that the default OFFS value is 32768 for all
subsequent SAMP chunks.

One possible way to do this is to have a new chunk type which sets the
default. Eg for the SAMP chunk we could define a SaMP chunk to modify
the default for SAMP. This seems oddly named, but it's utilising
bit 5 of the 2nd byte which so far has been reserved as zero. (In the
first byte bit 5 set => private namespace and not part of the public spec.)

For now I'm just ignoring this issue though.

QUESTION3:
I've defined new transforms named TSHIFT and QSHIFT specifically
designed for adjusting the layout of CNF4 and SMP4 chunk types to an
order more amenable for compression by interlaced deflate. They do the
job, but I'm wondering if it's better to simply redefine the input
data to be a more consistent ordering so that we can define more
general purpose transforms rather than one dedicated to the original
trace layout and one for the quality layout.

I'm ignoring this for now as it would break backwards compatibility.

QUESTION4:
For the OFFS meta-data in SMP4 and SAMP chunks I have a 16-bit offset
to specify the zero position. Ie OFFS of 10000 means a sample of 9000
becomes -1000 after processing.

Should it be a signed or unsigned 16-bit value? Signed means we could
encode values ranging from 10000 to 70000 by specifying OFFS as -10000.

Should it be 32-bit instead? Should we have OFFI and OFFF for integer
and floating point equivalents?

QUESTION5:
For region encoding where should the region name belong - the
meta-data section or the REGION_LIST TEXT identifier? It's currently
in both places. My gut instinct tells me it belongs in the meta-data
for the REGION_LIST chunk itself.

QUESTION6:
Can we have clarification on what the region code types mean,
specifically "tech read"?

QUESTION7:
Should we add SAMP/SMP4 meta-data indicating a down-scale factor? For
454 data this could be 100, so we know value 123 is really 1.23. Note
this is maybe better implemented below using fixed-point precision.

QUESTION8:
How do we deal with floating point values?

I think the chunk meta-data should detail the format of the data block
itself (as it is strictly speaking data about the data so it fits
there well). A lack of meta-data should imply the usual unsigned
16-bit quantities.

There are two main ways to encode fractions:

Floating point, where we have a mantissa and an exponent.
    - See http://en.wikipedia.org/wiki/IEEE_floating-point_standard
    - large dynamic range
    - fixed number of significant bits
    - varying "resolution". Ie can represent tiny differences
      between two very small floating point numbers, but not
      between two very large floating point numbers.

Fixed point, where we have a fixed number of bits for the component
before and after the decimal point.
    - See http://en.wikipedia.org/wiki/Q_%28number_format%29
    - constant resolution
    - effectively used by SFF (specified to 2 decimal places)
    - easy to treat as integers so can be fast and dealt with by
      small embedded CPUs without FPUs.


Floating point may be appropriate as effectively it's the same as
logging your signals and storing those. It offers large dynamic range
so can cope with abnormally large values (at the expense of precision)
while retaining lots of variation at the low end to distinguish small
values. However it's CPU intensive to cope with anything other than
the CPU provided 32-bit and 64-bit floating point formats.

Single precision 32-bit floats in IEEE-754 have:
    1 bit   (31):    Sign
    8 bits  (23-30): Exponent (bias 127, so storing 100 => -27)
    23 bits (0-22):  Mantissa

Effectively we store any binary value as a normalised expression:

    1.<mantissa> * 2^<exponent>


Eg 1732.5:

    => 11011000100.1 (binary)
    => 1.10110001001 (binary) * 2^10

    Exponent+127 => 137 => 10001001 (binary)

    sign  exponent  mantissa
    0     10001001  10110001001000000000000

    (For comparison, the integer 17325 => 0x43ad => 0100001110101101 binary.)

However we probably want 16-bit and 24-bit floating point types for
efficiency's sake. Do we go with some fixed predefined floating point
formats for 8-bit, 16-bit, 24-bit and 32-bit layouts (with 32-bit
being identical to IEEE754) or do we allow for specification of the
mantissa and exponent, eg FLOAT=23.8, FLOAT=17.6 or FLOAT=5.2 in the
meta-data block?

    FLOAT=17.6 (24-bit) gives ranges +/- 8.6*10^9
    FLOAT=5.2  (8-bit)  gives ranges +/- 64 (I think).

Alternatively if we restrict ourselves to only using the most
significant 14 bits of the mantissa then storing as standard 32-bit
floats implies 1 in every 4 bytes is zero. This may provide for a
very crude, but fast, way to implement reduced size floating point
values - ie FLOAT=15.8 (24-bit signed).

For fixed point (as in SFF values) there's already a draft standard
for implementation in C (ISO/IEC TR 18037:2004).

One benefit of fixed point over floating point is speed of
implementation. Fixed point numbers can just be dealt with as
integers. Eg subtracting two fixed point 16-bit values can be done in
integers using a-b and the result is the same as if we'd done all the
bit twiddling and maths directly simulating a real fixed-point unit.
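
As a tiny illustration of that point, a hypothetical Q8.8 type (8
integer bits, 8 fraction bits; the names and format choice are
assumptions, not anything defined above):

```c
#include <assert.h>
#include <stdint.h>

/* Q8.8 fixed point: the value is just (real number * 256) rounded
 * to an integer, so add/subtract are plain integer operations. */
typedef int16_t q8_8;

q8_8 q_from_double(double d)
{
    return (q8_8)(d * 256.0 + (d >= 0 ? 0.5 : -0.5));
}

double q_to_double(q8_8 q)
{
    return q / 256.0;
}
```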

My gut feeling is that we'd want to explicitly declare the number of
bits for integral and fractional components in the meta-data block.

Comments?

James

PS. The latest (only minor tweaks from before) ZTR draft spec
follows.



1.3 draft 3 (19 Oct 2007)

ZTR SPEC v1.3
=============

Header
======

The header consists of an 8 byte magic number (see below), followed by
a 1-byte major version number and 1-byte minor version number.

Changes in minor numbers should not cause problems for parsers. It
indicates a change in chunk types (different contents), but the file
format is the same.

The major number is reserved for any incompatible file format changes
(which hopefully should be never).

/* The header */
typedef struct {
    unsigned char  magic[8];       /* 0xae5a54520d0a1a0a (b.e.) */
    unsigned char  version_major;  /* 1 */
    unsigned char  version_minor;  /* 3 */
} ztr_header_t;

/* The ZTR magic numbers */
#define ZTR_MAGIC         "\256ZTR\r\n\032\n"
#define ZTR_VERSION_MAJOR 1
#define ZTR_VERSION_MINOR 3

So the total header will consist of:

Byte number   0  1  2  3  4  5  6  7  8  9
             +--+--+--+--+--+--+--+--+--+--+
Hex values   |ae 5a 54 52 0d 0a 1a 0a|01 03|
             +--+--+--+--+--+--+--+--+--+--+
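
A reader might validate this 10-byte header as follows (a sketch; the
function name and return convention are assumptions):

```c
#include <assert.h>
#include <string.h>

/* Check the 8-byte magic number and return the major version,
 * or -1 if this is not a ZTR file. buf must hold >= 10 bytes. */
int ztr_check_header(const unsigned char *buf)
{
    static const unsigned char magic[8] =
        { 0xae, 'Z', 'T', 'R', '\r', '\n', 0x1a, '\n' };
    if (memcmp(buf, magic, 8) != 0)
        return -1;
    return buf[8];   /* version_major; buf[9] is version_minor */
}
```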

Chunk format
============

The basic structure of a ZTR file is (header,chunk*) - ie header followed by
zero or more chunks. Each chunk consists of a type, some meta-data and some
data, along with the lengths of both the meta-data and data.

Byte number   0  1  2  3  4    5    6    7   8     9
             +--+--+--+--+----+----+----+---+--+ - +--+--+--+--+--+-- - --+
Hex values   | type      |meta-data length  | meta-data |data length| data .. |
             +--+--+--+--+----+----+----+---+--+ - +--+--+--+--+--+-- - --+

FIXME: For very short reads this is a large overhead. We have 8 bytes
of length information (of which typically only 1-2 are non-zero) and 4
bytes for type (which typically only has one of 4-5 values). This
means about 10 bytes wasted per chunk, or maybe 5-10% of the total
file size. Changing this would be a radical departure from ZTR; is it
justified given the savings? (est. 4.8% for 74bp reads, 8.4% for 27bp
reads.)

One idea is to consider a ZTR file (the non "block" components at
least) to be a series of huffman codes, by default all 8-bit long and
matching their ASCII codes. Then a dedicated chunk could be used to
adjust these default codes. It's therefore backwards compatible, but
is that also overkill? (NB, this looks like it'd save 6% on the
overall file size.)

Ie in C:

typedef struct {
    uint4  type;      /* chunk type (b.e.) */
    uint4  mdlength;  /* length of meta-data field (b.e.) */
    char  *mdata;     /* meta data */
    uint4  dlength;   /* length of data field (b.e.) */
    char  *data;      /* a format byte and the data itself */
} ztr_chunk_t;

All 2 and 4-byte integer values are stored in big endian format.

The meta-data is uncompressed (and so it does not start with a format
byte). From version 1.3 onwards meta-data is defined to be in key
value pairs adhering to the same structure defined in the TEXT chunk
("key\0value\0"). Exceptions are made for this only for purposes of
backwards compatibility in the SAMP chunk type. The contents of the
meta-data is chunk specific, and many chunk types will have no
meta-data. In this case the meta-data length field will be zero and
this will be followed immediately by the data-length field.

Ie all meta-data adheres to the following structure:

Meta-data: (version 1.3 onwards only)
             +- - -+--+- - -+--+- -+- - -+--+- - -+--+
Hex values   | ident | 0| value | 0| - | ident | 0| value | 0|
             +- - -+--+- - -+--+- -+- - -+--+- - -+--+

FIXME: Can we specify the meta-data once per ZTR file and omit it in
subsequent chunks? Eg a blank chunk with meta-data only in the
header. Chunks in the body then specify meta-data length as 0xFFFFFFFF
as an indicator meaning "use the last meta-data defined for this chunk
type". Useful when split in two, as in SRF?

Note that this means both ident and values must not themselves contain
the zero byte (a nul character), hence we generally store ident-value
pairs in ASCII string form.
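
A sketch of a lookup over such a "key\0value\0" meta-data block (a
hypothetical helper, not something the spec defines):

```c
#include <assert.h>
#include <string.h>

/* Find the value for 'key' in an mdlength-byte meta-data block of
 * nul-terminated ident/value pairs. Returns the value string, or
 * NULL if the key is absent or the block is truncated mid-pair. */
const char *ztr_meta_find(const char *mdata, unsigned mdlength,
                          const char *key)
{
    unsigned i = 0;
    while (i < mdlength) {
        const char *k = mdata + i;
        i += (unsigned)strlen(k) + 1;   /* skip ident + nul */
        if (i >= mdlength)
            break;                      /* truncated pair */
        const char *v = mdata + i;
        i += (unsigned)strlen(v) + 1;   /* skip value + nul */
        if (strcmp(k, key) == 0)
            return v;
    }
    return 0;
}
```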

The data length ("dlength") is the length in bytes of the entire
'data' block, including the format information held within it.

The first byte of the data consists of a format byte. The most basic format is
zero - indicating that the data is "as is"; it's the real thing. Other formats
exist in order to encode various filtering and compression techniques. The
information encoded in the next bytes will depend on the format byte.


RAW (#0) - no formatting
--------

Byte number   0  1  2     N
             +--+--+-- - --+
Hex values   | 0| raw data |
             +--+--+-- - --+

Raw data has no compression or filtering. It just contains the unprocessed
data. It consists of a one byte header (0) indicating raw format followed by N
bytes of data.


RLE (#1) - simple run-length encoding
--------

Byte number   0  1    2    3     4     5       6  7  8     N
             +--+----+----+-----+-----+-------+--+--+--+-- - --+--+--+
Hex values   | 1| Uncompressed length | guard | run length encoded data|
             +--+----+----+-----+-----+-------+--+--+--+-- - --+--+--+

Run length encoding replaces stretches of N identical bytes (with value V)
with the guard byte G followed by N and V. All other byte values are stored
as normal, except for occurrences of the guard byte, which is stored as G 0.
For example with a guard value of 8:

Input data:
    20 9 9 9 9 9 10 9 8 7

Output data:
    1                      (rle format)
    0 0 0 10               (original length)
    8                      (guard)
    20 8 5 9 10 9 8 0 7    (rle data)
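
A sketch encoder for the RLE body (the format byte and 4-byte
uncompressed length are left to the caller). The spec does not fix the
minimum run length worth encoding; using runs of 3 or more, as here,
reproduces the example above:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Run-length encode in[0..len) into out using the given guard byte.
 * Runs shorter than 3 are stored literally (a threshold chosen here,
 * not mandated by the spec). out must be large enough (worst case
 * 2*len). Returns the number of bytes written. */
size_t rle_encode(const unsigned char *in, size_t len,
                  unsigned char guard, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < len) {
        size_t run = 1;
        while (i + run < len && in[i + run] == in[i] && run < 255)
            run++;
        if (run >= 3) {
            out[o++] = guard;                 /* G N V */
            out[o++] = (unsigned char)run;
            out[o++] = in[i];
        } else {
            for (size_t j = 0; j < run; j++) {
                out[o++] = in[i];
                if (in[i] == guard)
                    out[o++] = 0;             /* escape literal guard */
            }
        }
        i += run;
    }
    return o;
}
```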


ZLIB (#2) - see RFC 1950
---------

Byte number   0  1    2    3     4     5  6  7     N
             +--+----+----+-----+-----+--+--+--+-- - --+
Hex values   | 2| Uncompressed length | Zlib encoded data|
             +--+----+----+-----+-----+--+--+--+-- - --+

This uses the zlib code to compress a data stream. The ZLIB data may itself be
encoded using a variety of methods (LZ77, Huffman), but zlib will
automatically determine the format itself. Often using zlib mode
Z_HUFFMAN_ONLY will provide best compression when combined with other
filtering techniques.


XRLE (#3) - multi-byte run-length encoding
---------

Byte number   0  1      2       3  4  5     N
             +--+------+-------+--+--+--+-- - --+--+--+
Hex values   | 3| size | guard | run length encoded data|
             +--+------+-------+--+--+--+-- - --+--+--+

Much like standard RLE, but this mechanism has a byte to specify the
length of the data item we compare to check for runs. It is not
restricted to spotting runs aligned on 'size'-byte boundaries either.

No uncompressed length is encoded here as technically this is not
required (although it does make decoding a bit slower). The compressed
length alone is sufficient to work out the uncompressed length after
decompressing.

Guard bytes in the input stream are 'escaped' by replacing them with
the guard byte followed by zero. Guard bytes in a parameterised run
(ie X copies of Y where Y contains the guard) do not need to be
'escaped'.

Input data:
    10 12 12 13 12 13 12 13 12 13 14

Output data:
    3                     (xrle format)
    2                     (size of blocks to compare)
    12                    (guard, 12 is a bad choice but illustrative)
    10 12 0 12 4 12 13 14 (rle data)
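
A sketch decoder for the XRLE body, after the format, size and guard
bytes have already been read (helper name is an assumption):

```c
#include <assert.h>
#include <stddef.h>

/* Decode an XRLE body of 'len' bytes with block size 'size' and the
 * given guard byte. out must be large enough for the expanded data.
 * Returns the number of output bytes written. */
size_t xrle_decode(const unsigned char *in, size_t len, size_t size,
                   unsigned char guard, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < len) {
        if (in[i] == guard) {
            size_t count = in[i + 1];
            if (count == 0) {            /* escaped literal guard byte */
                out[o++] = guard;
                i += 2;
            } else {                     /* count copies of next 'size' bytes */
                for (size_t c = 0; c < count; c++)
                    for (size_t j = 0; j < size; j++)
                        out[o++] = in[i + 2 + j];
                i += 2 + size;
            }
        } else {
            out[o++] = in[i++];
        }
    }
    return o;
}
```

Fed the rle data from the example above it reproduces the original
11 input bytes.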


XRLE2 (#4) - word aligned multi-byte run-length encoding
----------
Version 1.3 onwards

Byte number   0  1     RSZ               multiple of RSZ
             +--+-----+---------+-- - - - - - - - - - ---+
Hex values   | 4| RSZ | padding | run length encoded data|
             +--+-----+---------+-- - - - - - - - - - ---+

This achieves the same goal as XRLE, but is designed to maintain data
aligned to specific 'record size' boundaries. This sometimes has
benefits over XRLE in that a subsequent interlaced deflate entropy
encoding may work better on record-aligned data streams.

The first byte holds the format (#4) while the record size (RSZ) is
held in the second byte. In order to ensure the entire block of data
is aligned on 'RSZ' boundaries, RSZ-2 padding bytes are written out
before the data itself starts. The contents of these bytes can be
anything.

Unlike XRLE it also does not use an explicit guard byte. If we term a
'word' to be a block of data of size RSZ, then whenever we read a word
which is identical to the last word written we write out that word (so
we have two consecutive words in the output data) followed by a
counter of how many additional copies of that word are found, up to
255. This counter consists of 1 byte indicating the number of
additional copies of the word followed by RSZ-1 padding bytes to
maintain word alignment. While the contents of these padding bytes may
be anything, it is suggested that they adhere to the same value
distribution as observed elsewhere in the data block in order to keep
the data entropy low. (For example repeating the previous bytes from
'word' will do.)

Example:

Input data, taken in pairs:
    1 0  2 2 2 2  3 1 3 1 3 1  2 4 2 4 2 4  2 3

Output data:
    4 2             (xrle2 format, rec size 2)
    1 0             ("1 0" from input)
    2 2 2 2 0 2     ("2 2" x 2)
    3 1 3 1 1 1     ("3 1" x 3)
    2 4 2 4 1 4     ("2 4" x 3)
    2 3             ("2 3")


DELTA1 (#64) - 8-bit delta
------------

Byte number   0  1             2     N
             +--+-------------+-- - --+
Hex values   |40| Delta level | data  |
             +--+-------------+-- - --+

This technique replaces successive bytes with their differences. The level
indicates how many rounds of differencing to apply, which should be between 1
and 3. For determining the first difference we compare against zero. All
differences are internally performed using unsigned values with an automatic
wrap-around (taking the bottom 8-bits). Hence 2-1 is 1 and 1-2 is 255.

For example, with level set to 1:

Input data:
    10 20 10 200 190 5

Output data:
    64                      (delta1 format)
    1                       (level)
    10 10 246 190 246 71    (delta data)

For level set to 2:

Input data:
    10 20 10 200 190 5

Output data:
    64                      (delta1 format)
    2                       (level)
    10 0 236 200 56 81      (delta data)
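
The differencing itself can be sketched as an in-place transform;
applying it to the input above reproduces both example outputs:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Apply 'level' rounds of 8-bit differencing in place. Unsigned
 * arithmetic wraps automatically (bottom 8 bits kept), so 1-2
 * becomes 255, as the spec requires. */
void delta1(unsigned char *data, size_t len, int level)
{
    for (int l = 0; l < level; l++) {
        unsigned char prev = 0;          /* first difference vs zero */
        for (size_t i = 0; i < len; i++) {
            unsigned char cur = data[i];
            data[i] = (unsigned char)(cur - prev);
            prev = cur;
        }
    }
}
```

Decoding is the same loops with a running sum instead of a
difference, applied level times.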


DELTA2 (#65) - 16-bit delta
------------

Byte number   0  1             2     N
             +--+-------------+-- - --+
Hex values   |41| Delta level | data  |
             +--+-------------+-- - --+

This format is as data format 64 except that the input data is read in 2-byte
values, so we take the difference between successive 16-bit numbers. For
example "0x10 0x20 0x30 0x10" (4 8-bit numbers; 2 16-bit numbers) yields "0x10
0x20 0x1f 0xf0". All 16-bit input data is assumed to be aligned to the start
of the buffer and is assumed to be in big-endian format.


DELTA4 (#66) - 32-bit delta
------------

Byte number   0  1             2  3  4     N
             +--+-------------+--+--+-- - --+
Hex values   |42| Delta level | 0| 0| data  |
             +--+-------------+--+--+-- - --+


This format is as data formats 64 and 65 except that the input data is read in
4-byte values, so we take the difference between successive 32-bit numbers.

Two padding bytes (2 and 3) should always be set to zero. Their purpose is to
make sure that the compressed block is still aligned on a 4-byte boundary
(hence making it easy to pass straight into the 32to8 filter).


Data format 67-69/0x43-0x45 - reserved
---------------------------

At present these are reserved for dynamic differencing where the 'level' field
varies - applying the appropriate level for each section of data. Experimental
at present...


16TO8 (#70) - 16 to 8 bit conversion
-----------

Byte number   0
             +--+-- - --+
Hex values   |46| data  |
             +--+-- - --+

This method assumes that the input data is a series of big endian 2-byte
signed integer values. If the value is in the range of -127 to +127 inclusive
then it is written as a single signed byte in the output stream, otherwise we
write out -128 followed by the 2-byte value (in big endian format). This
method works well following one of the delta techniques as most of the 16-bit
values are typically then small enough to fit in one byte.

Example input data:
    0 10 0 5 -1 -5 0 200 -4 -32    (bytes)
    (As 16-bit big-endian values: 10 5 -5 200 -800)

Output data:
    70    (16-to-8 format)
    10 5 -5 -128 0 200 -128 -4 -32
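
A sketch encoder for 16TO8; fed the example bytes above it produces
the example output (the function name is an assumption):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduce big-endian signed 16-bit values to bytes. Values in
 * -127..127 become one signed byte; anything else becomes -128
 * followed by the two original big-endian bytes. Returns the
 * number of bytes written to out. */
size_t to8_encode(const unsigned char *in, size_t len, signed char *out)
{
    size_t o = 0;
    for (size_t i = 0; i + 1 < len; i += 2) {
        int16_t v = (int16_t)((in[i] << 8) | in[i + 1]);
        if (v >= -127 && v <= 127) {
            out[o++] = (signed char)v;
        } else {
            out[o++] = -128;                  /* escape marker */
            out[o++] = (signed char)in[i];    /* original b.e. bytes */
            out[o++] = (signed char)in[i + 1];
        }
    }
    return o;
}
```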


32TO8 (#71) - 32 to 8 bit conversion
-----------

Byte number   0
             +--+-- - --+
Hex values   |47| data  |
             +--+-- - --+

This format is similar to format 16TO8, but we are reducing 32-bit numbers (big
endian) to 8-bit numbers.
715 | |
716 | |
717 FOLLOW1 (#72) - "follow" predictor | |
718 ------------- | |
719 | |
720 Byte number 0 1 FF 100 101 N | |
721 +--+-- - - - --+-- - --+ | |
722 Hex values |48| follow bytes | data | | |
723 +--+-- - - - --+-- - --+ | |
724 | |
725 For each symbol we compute the most frequent symbol following it. This is | |
726 stored in the "follow bytes" block (256 bytes). The first character in the | |
727 data block is stored as-is. Then for each subsequent character we store the | |
728 difference between the predicted character value (obtained by using | |
follow[previous_character]) and the real value. This is a very crude, but
fast, method of removing some residual non-randomness in the input data,
and so it will reduce the data entropy. It is best applied prior to entropy
encoding (such as Huffman encoding).
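A minimal sketch of the idea, assuming the differences are stored modulo 256 (the exact arithmetic is not spelt out above):

```python
def follow1_encode(data: bytes) -> bytes:
    """FOLLOW1: build a 256-entry table of each byte's most frequent
    successor, store the first byte as-is, then each later byte as its
    difference (mod 256) from the predicted byte."""
    counts = [[0] * 256 for _ in range(256)]
    for a, b in zip(data, data[1:]):
        counts[a][b] += 1
    follow = bytes(max(range(256), key=row.__getitem__) for row in counts)
    out = bytearray([0x48])                # format byte 72
    out += follow                          # the 256 "follow bytes"
    if data:
        out.append(data[0])                # first character as-is
        for a, b in zip(data, data[1:]):
            out.append((b - follow[a]) & 0xFF)
    return bytes(out)

def follow1_decode(block: bytes) -> bytes:
    """Invert follow1_encode using the stored follow table."""
    assert block[0] == 0x48
    follow, rest = block[1:257], block[257:]
    out = bytearray(rest[:1])
    for d in rest[1:]:
        out.append((follow[out[-1]] + d) & 0xFF)
    return bytes(out)
```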
733 | |
734 | |
735 CHEB445 (#73) - floating point 16-bit chebyshev polynomial predictor | |
736 ------------- | |
737 Version 1.1 only. | |
738 Deprecated: replaced by format 74 in Version 1.2. | |
739 | |
WARNING: This method was experimental and has been replaced by an
integer equivalent. The floating point method may give system-specific
results.
743 | |
744 Byte number 0 1 2 N | |
745 +--+--+-- - --+ | |
746 Hex values |49| 0| data | | |
747 +--+--+-- - --+ | |
748 | |
This method takes big-endian 16-bit data and attempts to curve-fit it using
Chebyshev polynomials. The exact method employed uses the 4 preceding values
to calculate Chebyshev polynomials with 5 coefficients. Of these 5
coefficients only 4 are used to predict the next value. We then store the
difference between the predicted value and the real value. This procedure is
repeated for each 16-bit value in the data. The first four 16-bit values are
stored with a simple 1-level 16-bit delta function. Reversing the predictor
follows the same procedure, except now adding each stored difference to the
predicted value to recover the real value.
758 | |
759 | |
760 ICHEB (#74) - integer based 16-bit chebyshev polynomial predictor | |
761 ----------- | |
762 Version 1.2 onwards | |
763 This replaces the floating point CHEB445 format in ZTR v1.1. | |
764 | |
765 | |
766 Byte number 0 1 2 N | |
767 +--+--+-- - --+ | |
768 Hex values |4A| 0| data | | |
769 +--+--+-- - --+ | |
770 | |
This method takes big-endian 16-bit data and attempts to curve-fit it using
Chebyshev polynomials. The exact method employed uses the 4 preceding values
to calculate Chebyshev polynomials with 5 coefficients. Of these 5
coefficients only 4 are used to predict the next value. We then store the
difference between the predicted value and the real value. This procedure is
repeated for each 16-bit value in the data. The first four 16-bit values are
stored with a simple 1-level 16-bit delta function. Reversing the predictor
follows the same procedure, except now adding each stored difference to the
predicted value to recover the real value.
780 | |
781 STHUFF (#77) - Interlaced Deflate | |
782 ------------ | |
783 Version 1.3 onwards | |
784 | |
785 Byte number 0 1 2 N | |
786 +--+--+-- - - - - - --+-- - - --+ | |
787 Hex values |4D| C| huffman codes | data | | |
788 +--+--+-- - - - - - --+-- - - --+ | |
789 | |
This compresses data with Huffman encoding, using the Deflate algorithm
to store both the codes and the data. It is analogous to using zlib with
the Z_HUFFMAN_ONLY strategy and a negative window size, but with a few
tweaks for optimal compression of very small data sets. See RFC 1951 for
details of Deflate. If the following text disagrees with RFC 1951 then
the RFC takes priority; the text below is included as additional
explanatory material only.
797 | |
798 Huffman compression works by replacing each character (or 'symbol') | |
with a string of bits. Common symbols are encoded using few bits
800 and rare symbols need a longer string of bits. The net effect is that | |
801 the overall number of bits needed to store a message is reduced. | |
802 | |
803 To uncompress a compressed data stream it is necessary to know which | |
804 symbols are present and what their bit-strings are. For brevity this | |
805 is achieved by storing only the lengths of the bit-string for each | |
806 symbol and generating bit-strings from the lengths. As long as the | |
807 same canonical algorithm is used in both the encoder and decoder then | |
808 knowing the lengths alone is sufficient. Knowledge of this algorithm | |
809 is required for uncompressing the data, so it is defined as follows: | |
810 | |
811 1. Sort symbols by the length of their bit-strings, smallest first. | |
812 | |
813 The collating order for symbols sharing the same length is defined | |
814 as ASCII values 0 to 255 inclusive followed by the EOF symbol. | |
815 | |
816 2. X = 0 | |
817 | |
818 3. For all bit lengths 'L' from 1 to 24 inclusive: | |
819 | |
820 For all Symbols of bit length 'L', sorted as above: | |
821 Code(Symbol) = least significant 'L' bits of X | |
822 X = X + 1 | |
823 End loop | |
824 | |
825 X = X * 2 | |
826 | |
827 End loop | |
828 | |
829 This is the same algorithm utilised in the Deflate algorithm (RFC 1951). | |
830 | |
831 | |
832 For example compressing "abracadabra" gives: /\ | |
833 0 1 | |
834 Symbol bit-length Code(X) / \ | |
835 ------------------------------- a /\ | |
836 a 1 0 0 / \ | |
837 b 3 4 100 0 1 | |
838 c 3 5 101 / \ | |
839 r 3 6 110 / \ | |
840 d 4 14 1110 /\ /\ | |
841 EOF 4 15 1111 0 1 0 1 | |
842 / \ / \ | |
843 which in turn leads to 28 bits b c r /\ | |
844 of output: 0 1 | |
845 / \ | |
846 0100110010101110010011001111 d EOF | |
847 (ab r ac ad ab r aEOF) | |
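The canonical code construction above can be sketched in a few lines. This is a hedged illustration using integer symbols, with 256 standing in for EOF so that the collating order matches step 1:

```python
def canonical_codes(bit_lengths):
    """Assign canonical Huffman codes from a {symbol: bit-length} map,
    following the algorithm above: sort by (length, symbol), count
    upwards, doubling X once for each increase in length."""
    syms = sorted(bit_lengths, key=lambda s: (bit_lengths[s], s))
    codes, x, cur_len = {}, 0, 0
    for s in syms:
        x <<= bit_lengths[s] - cur_len     # X = X * 2 per length step
        cur_len = bit_lengths[s]
        codes[s] = format(x, "0%db" % cur_len)
        x += 1
    return codes
```

Applied to the bit-lengths from the abracadabra example, this reproduces the code table and the 28-bit output shown above.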
848 | |
849 | |
In the data format defined above, 'C' is a code-set number. If it is
zero then the huffman codes needed to uncompress 'data' are stored in
the following bytes using the same format described in the DFLH chunk
type below. Otherwise no huffman codes are stored and a predefined set
of huffman codes is used, either defined in a preceding DFLH chunk
(for 128 <= 'C' <= 255) or statically defined in this document
(for 1 <= 'C' <= 127). Immediately following this is the compressed
bit-stream itself.
858 | |
859 The statically defined huffman code-sets are as follows. The symbols | |
860 are listed below as their printable ASCII character or hash followed | |
861 by a number, so A and #65 are the same symbol. We use the algorithm | |
862 described above to turn these bit-lengths into actual huffman codes. | |
863 | |
864 C=1: CODE_DNA | |
865 | |
866 Length Symbols | |
867 ---------------- | |
868 2 A C T | |
869 3 G | |
870 4 N | |
871 5 #0 | |
872 6 EOF | |
873 13 #1 to #6 inclusive | |
874 14 #7 to #255 except where already listed above | |
875 | |
876 C=2: CODE_DNA_AMBIG (DNA with IUPAC ambiguity codes) | |
877 | |
878 Length Symbols | |
879 ---------------- | |
880 2 A C T | |
881 3 G | |
882 4 N | |
883 7 #0 #45 | |
884 8 B D H K M R S V W Y | |
885 11 EOF | |
886 14 #226 | |
887 15 #1 to #255 except where already listed above | |
888 | |
889 C=3: CODE_ENGLISH (English text) | |
890 | |
891 Length Symbols | |
892 ---------------- | |
893 3 #32 e | |
894 4 a i n o s t | |
895 5 d h l r u | |
896 6 #10 #13 #44 c f g m p w y | |
897 7 #46 b v | |
898 8 #34 I k | |
899 9 #45 A N T | |
900 10 #39 #59 #63 B C E H M S W x | |
901 11 #33 0 1 F G | |
902 15 #0 to #255 except where already listed above | |
903 | |
904 | |
It is recommended that this compression format be used only for small
data sizes and that ZLIB be used for larger data (a few kilobytes and above).
907 | |
908 | |
909 QSHIFT (#79) - 4-byte quality reorder | |
910 ------------ | |
911 Version 1.3 onwards | |
912 | |
913 This reorders the quality signal to be 4-tuples of the quality for the | |
914 called base followed by the quality of the other 3 base types in the | |
915 order they appear in a,c,g,t (minus the called base). | |
916 | |
917 The purpose is to allow a 4-byte interlaced deflate algorithm to | |
918 operate efficiently. | |
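In outline the reorder might look like this. It is a sketch only: the function names are invented, and the handling of non-ACGT calls is borrowed from the CNF4 rule rather than stated here.

```python
def qshift(calls, quals):
    """QSHIFT-style reorder: per base, emit the called base's quality
    first, then the other three in A,C,G,T order. 'quals' is a flat
    list of 4 values per base in A,C,G,T order."""
    order = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = []
    for i, base in enumerate(calls):
        q = quals[4 * i:4 * i + 4]
        j = order.get(base, 3)         # non-ACGT treated as T (CNF4 rule)
        out.append(q[j])               # called base's quality first
        out.extend(v for k, v in enumerate(q) if k != j)
    return out
```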
919 | |
920 | |
TSHIFT (#80) - 8-byte trace reorder
922 ------------ | |
923 Version 1.3 onwards | |
924 | |
925 This reorders the trace signal to be 4-tuples of the 16-bit trace | |
926 signals for the called base followed by the signal from the other 3 | |
927 base types in the order they appear in a,c,g,t (minus the called | |
928 base). | |
929 | |
930 The purpose is to allow a 8-byte interlaced deflate algorithm to | |
931 operate efficiently. | |
932 | |
933 FIXME: QSHIFT and TSHIFT could be general purpose byte rearrangements | |
934 without any knowledge of the data type they're holding. They need the | |
935 input data to be consistently ordered and not the large differences we | |
936 see between quality and trace right now. | |
937 | |
938 | |
940 Chunk types | |
941 =========== | |
942 | |
943 As described above, each chunk has a type. The format of the data contained in | |
944 the chunk data field (when written in format 0) is described below. | |
945 Note that no chunks are mandatory. It is valid to have no chunks at all. | |
However some chunk types may depend on the existence of others. This will be
947 indicated below, where applicable. | |
948 | |
949 Each chunk type is stored as a 4-byte value. Bit 5 of the first byte is used | |
950 to indicate whether the chunk type is part of the public ZTR spec (bit 5 of | |
951 first byte == 0) or is a private/custom type (bit 5 of first byte == 1). Bit | |
952 5 of the remaining 3 bytes is reserved - they must always be set to zero. | |
953 | |
954 Practically speaking this means that public chunk types consist entirely of | |
955 upper case letters (eg TEXT) whereas private chunk types start with a | |
956 lowercase letter (eg tEXT). Note that in this example TEXT and tEXT are | |
957 completely independent types and they may have no more relationship with each | |
958 other than (for example) TEXT and BPOS types. | |
959 | |
960 It is valid to have multiples of some chunks (eg text chunks), but not for | |
961 others (such as base calls). The order of chunks does not matter unless | |
962 explicitly specified. | |
963 | |
964 A chunk may have meta-data associated with it. This is data about the data | |
965 chunk. For example the data chunk could be a series of 16-bit trace samples, | |
966 while the meta-data could be a label attached to that trace (to distinguish | |
trace A from traces C, G and T). Meta-data is typically very small and so
never needs to be compressed in any of the public chunk types (although
969 meta-data is specific to each chunk type and so it would be valid to have | |
970 private chunks with compressed meta-data if desirable). | |
971 | |
972 The first byte of each chunk data when uncompressed must be zero, indicating | |
973 raw format. If, having read the chunk data, this is not the case then the | |
974 chunk needs decompressing or reverse filtering until the first byte is | |
975 zero. There may be a few padding bytes between the format byte and the first | |
976 element of real data in the chunk. This is to make file processing simpler | |
977 when the chunk data consists of 16 or 32-bit words; the padding bytes ensure | |
978 that the data is aligned to the appropriate word size. Any padding bytes | |
required will be listed in the appropriate chunk definition below.
980 | |
981 | |
982 The following lists the chunk types available in 32-bit big-endian format. | |
983 In all cases the data is presented in the uncompressed form, starting with the | |
984 raw format byte and any appropriate padding. | |
985 | |
986 SAMP | |
987 ---- | |
988 | |
Meta-data (version 1.2 and before):
990 Byte number 0 1 2 3 | |
991 +--+--+--+--+ | |
992 Hex values | data name | | |
993 +--+--+--+--+ | |
994 | |
995 Data: | |
996 Byte number 0 1 2 3 4 5 6 7 N | |
997 +--+--+--+--+--+--+--+--+- -+ | |
998 Hex values | 0| 0| data| data| data| - | | |
999 +--+--+--+--+--+--+--+--+- -+ | |
1000 | |
This encodes a series of 16-bit unsigned trace samples. The first data
byte is the format (raw); the second data byte is present for padding
purposes only. After that comes a series of 16-bit big-endian
values. Although stored as unsigned, a baseline value can be
specified which should then be subtracted from all values to
generate signed data if required. By default the baseline is zero.
1007 | |
1008 Valid identifiers for the meta-data (version 1.3 onwards) are: | |
1009 | |
1010 Ident Value(s) | |
1011 --------------------------------------------------------------------- | |
1012 TYPE "A", "C", "G", "T", "PYNO" or "PYRW" | |
1013 OFFS 16-bit signed integer representing the 'zero' position, | |
1014 in ASCII. | |
1015 | |
1016 [ FIXME: signed or unsigned? Signed means we couldn't store data in | |
1017 the range from -48K to +16K. Unsigned means we couldn't store data in | |
1018 the range 10K to 70K. What's most useful? Or should OFFS be 32-bit | |
1019 instead? ] | |
1020 | |
1021 Versions prior to 1.3 specified meta-data consisted of a single 4-byte | |
1022 block containing a 4-byte name associated with the trace. If a | |
1023 type-name is shorter than 4 bytes then it should be right padded with | |
1024 nul characters to 4 bytes. For sequencing traces the four lanes | |
representing A, C, G and T signals have names "A\0\0\0", "C\0\0\0",
1026 "G\0\0\0" and "T\0\0\0". PYNO and PYRW refer to normalised and raw | |
1027 pyrogram data (eg from 454 instruments). At present other names are | |
1028 not reserved, but it is recommended that (for consistency with | |
1029 elsewhere) you label private trace arrays with names starting in a | |
1030 lowercase letter (specifically, bit 5 is 1). | |
1031 | |
1032 For the purposes of backwards compatibility, readers should check the | |
1033 version number in the ZTR header to determine whether the old or new | |
1034 style meta-data formatting is in use. | |
1035 | |
1036 For sequencing traces it is expected that there will be four SAMP chunks, | |
1037 although the order is not specified. | |
1038 | |
1039 | |
1040 SMP4 | |
1041 ---- | |
1042 | |
1043 Meta-data: optional - see below | |
1044 | |
1045 Data: | |
1046 Byte number 0 1 2 3 4 5 6 7 N | |
1047 +--+--+--+--+--+--+--+--+- -+ | |
1048 Hex values | 0| 0| data| data| data| - | | |
1049 +--+--+--+--+--+--+--+--+- -+ | |
1050 | |
1051 | |
1052 As per SAMP, this encodes a series of unsigned 16-bit trace values, to | |
1053 be base-line corrected by the OFFS meta-data value as appropriate. | |
1054 | |
1055 The first byte is 0 (raw format). Next is a single padding byte (also 0). | |
1056 Then follows a series of 2-byte big-endian trace samples for the "A" trace, | |
followed by a series of 2-byte big-endian trace samples for the "C" trace,
1058 also followed by the "G" and "T" traces (in that order). The assumption is | |
1059 made that there is the same number of data points for all traces and hence the | |
1060 length of each trace is simply the number of data elements divided by four. | |
1061 | |
Experimentation has shown that this gives around a 3% saving over 4
separate SAMP chunks, but it lacks flexibility.
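Deinterleaving the four traces from a raw (format 0) SMP4 chunk body can be sketched as:

```python
import struct

def smp4_split(chunk_data: bytes):
    """Split raw SMP4 chunk data into the four trace arrays. Layout:
    format byte 0, one padding byte, then all A samples, all C, all G,
    all T, as big-endian unsigned 16-bit values."""
    assert chunk_data[0] == 0          # raw format
    body = chunk_data[2:]              # skip format byte + padding byte
    n = len(body) // 2 // 4            # samples per trace = elements / 4
    vals = struct.unpack(">%dH" % (n * 4), body)
    return {base: vals[i * n:(i + 1) * n]
            for i, base in enumerate("ACGT")}
```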
1064 | |
1065 Valid identifiers for the meta-data are: | |
1066 | |
1067 Ident Value(s) | |
1068 --------------------------------------------------------------------- | |
1069 OFFS 16-bit signed integer representing the 'zero' position | |
1070 TYPE The type of data-set encoded. Values can be: | |
1071 "PROC" - processed data for viewing, also the default | |
1072 when no type field is found. | |
1073 "SLXI" - Illumina GA raw intensities (.int.txt files) | |
1074 "SLXN" - Illumina GA noise intensities (.nse.txt files) | |
1075 | |
1076 | |
1077 BASE | |
1078 ---- | |
1079 | |
1080 Meta-data: optional - see below | |
1081 | |
1082 Data: | |
1083 Byte number 0 1 2 3 N | |
1084 +--+--+--+-- - --+ | |
1085 Hex values | 0| base calls | | |
1086 +--+--+--+-- - --+ | |
1087 | |
1088 The first byte is 0 (raw format). This is followed by the base calls in ASCII | |
1089 format (one base per byte). By default it is assumed that all base | |
1090 calls are stored using the IUPAC characters[1]. | |
1091 | |
1092 Valid identifiers for the meta-data are: | |
1093 | |
1094 Ident Meaning Value(s) | |
1095 --------------------------------------------------------------------- | |
1096 CSET Character-set 'I' (ASCII #73) => IUPAC ("ACGTUMRWSYKVHDBN") | |
                                   '0' (ASCII #48) => ABI SOLiD ("0123N")
1098 | |
1099 BPOS | |
1100 ---- | |
1101 | |
1102 Meta-data: none present | |
1103 | |
1104 Data: | |
1105 Byte number 0 1 2 3 4 5 6 7 | |
1106 +--+--+--+--+--+--+--+--+- -+--+--+--+--+ | |
1107 Hex values | 0| padding| data | - | data | | |
1108 +--+--+--+--+--+--+--+--+- -+--+--+--+--+ | |
1109 | |
1110 This chunk contains the mapping of base call (BASE) numbers to sample (SAMP) | |
1111 numbers; it defines the position of each base call in the trace data. The | |
1112 position here is defined as the numbering of the 16-bit positions held in the | |
1113 SAMP array, counting zero as the first value. | |
1114 | |
1115 The format is 0 (raw format) followed by three padding bytes (all 0). Next | |
1116 follows a series of 4-byte big-endian numbers specifying the position of each | |
1117 base call as an index into the sample arrays (when considered as a 2-byte | |
1118 array with the format header stripped off). | |
1119 | |
1120 Excluding the format and padding bytes, the number of 4-byte elements should | |
1121 be identical to the number of base calls. All sample numbers are counted from | |
1122 zero. No sample number in BPOS should be beyond the end of the SAMP arrays | |
1123 (although it should not be assumed that the SAMP chunks will be before this | |
1124 chunk). Note that the BPOS elements may not be totally in sorted order as | |
1125 the base calls may be shifted relative to one another due to compressions. | |
1126 | |
1127 | |
1128 CNF1 | |
1129 ---- | |
1130 | |
1131 Meta-data: optional - see below | |
1132 | |
1133 Data: | |
1134 Byte number 0 1 N | |
1135 +--+--+-- - --+--+ | |
1136 Hex values | 0| call confidence | | |
1137 +--+--+-- - --+--+ | |
1138 | |
1139 (N == number of bases in BASE chunk) | |
1140 | |
1141 Valid identifiers for the meta-data are: | |
1142 | |
1143 Ident Value(s) Meaning | |
1144 --------------------------------------------------------------------- | |
1145 SCALE PH Phred-scaled confidence values. (Default). i.e. for | |
1146 a call with probability p: -10*log10(1-p) | |
1147 LO Log-odds scaled values. ie: 10*log10(p/(1-p)) | |
1148 | |
1149 | |
The first byte of this chunk is 0 (raw format). This is then followed by a
series of signed 8-bit confidence values for the called bases.
1152 | |
1153 Either phred or log-odds (as used by the Illumina GA) scale ranges are | |
1154 appropriate. | |
1155 | |
1156 | |
1157 CNF4 | |
1158 ---- | |
1159 | |
1160 Meta-data: optional - see below | |
1161 | |
1162 Data: | |
1163 Byte number 0 1 N 4N | |
1164 +--+--+-- - --+--+----- - -----+ | |
1165 Hex values | 0| call confidence | A/C/G/T conf | | |
1166 +--+--+-- - --+--+----- - -----+ | |
1167 | |
1168 (N == number of bases in BASE chunk) | |
1169 | |
1170 Valid identifiers for the meta-data are: | |
1171 | |
1172 Ident Value(s) Meaning | |
1173 --------------------------------------------------------------------- | |
1174 SCALE PH Phred-scaled confidence values. i.e. for a call | |
1175 with probability p: -10*log10(1-p) | |
1176 (NB: default, but often inappropriate.) | |
1177 LO Log-odds scaled values. ie: 10*log10(p/(1-p)) | |
1178 | |
1179 | |
The first byte of this chunk is 0 (raw format). This is then followed by a
series of signed 8-bit confidence values for the called bases. Next comes
1182 all the remaining confidence values for A, C, G and T excluding those | |
1183 that have already been written (ie the called base). So for a sequence | |
1184 AGT we would store confidences A1 G2 T3 C1 G1 T1 A2 C2 T2 A3 C3 G3. | |
1185 | |
1186 The purpose of this is to group the (likely) highest confidence value (those | |
1187 for the called base) at the start of the chunk followed by the remaining | |
1188 values. Hence if phred confidence values are written in a CNF4 chunk the first | |
1189 quarter of chunk will consist of phred confidence values and the last three | |
1190 quarters will (assuming no ambiguous base calls) consist entirely of zeros. | |
1191 | |
1192 For the purposes of storage the confidence value for a base call that is not | |
1193 A, C, G or T (in any case) is stored as if the base call was T. | |
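The ordering can be sketched as follows. This is illustrative only: `conf` here is a hypothetical per-base-type table, not the on-disk layout.

```python
def cnf4_order(calls, conf):
    """CNF4 ordering: confidences of the called bases first, then for
    each position the remaining A,C,G,T confidences excluding the
    called base. conf[b][i] is the confidence of base type b at
    position i; non-ACGT calls are treated as T, as required above."""
    calls = [c if c in "ACGT" else "T" for c in calls.upper()]
    out = [conf[c][i] for i, c in enumerate(calls)]   # called bases first
    for i, c in enumerate(calls):
        out += [conf[b][i] for b in "ACGT" if b != c]  # remaining values
    return out
```

For the sequence AGT this yields A1 G2 T3 C1 G1 T1 A2 C2 T2 A3 C3 G3, matching the example above.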
1194 | |
1195 If only one confidence value exists per base then either the phred or | |
1196 log-odds scales work well. The first N bytes will be the called bases | |
1197 and the remaining 3*N will be zero (optimal for run-length-encoding), | |
1198 but consider using the CNF1 chunk type instead in this situation. | |
1199 | |
1200 If all 4 base types have their own confidence value then the log-odds | |
1201 scale will work well. In this case the phred scale is an inappropriate | |
1202 choice as it cannot encode both very likely and very unlikely events. | |
1203 | |
1204 Note: if this chunk exists it must exist after a BASE chunk. | |
1205 | |
1206 TEXT | |
1207 ---- | |
1208 | |
1209 Meta-data: none present | |
1210 | |
1211 Data: 0 | |
1212 +--+- - -+--+- - -+--+- -+- - -+--+- - -+--+-----+ | |
1213 Hex values | 0| ident | 0| value | 0| - | ident | 0| value | 0| (0) | | |
1214 +--+- - -+--+- - -+--+- -+- - -+--+- - -+--+-----+ | |
1215 | |
1216 This contains a series of "identifier\0value\0" pairs. | |
1217 | |
1218 The identifiers and values may be any length and may contain any data | |
1219 except the nul character. The nul character marks the end of the | |
1220 identifier or the end of the value. Multiple identifier-value pairs | |
1221 are allowable. Prior to version 1.3 a double nul character marked the | |
1222 end of the list (labeled "(0)" above), but from version 1.3 the end | |
1223 of the list may also be marked by the end of chunk. | |
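Parsing a raw TEXT chunk is straightforward; a sketch that tolerates both the old double-nul terminator and end-of-chunk termination:

```python
def parse_text_chunk(chunk_data: bytes) -> dict:
    """Parse a raw TEXT chunk: format byte 0 followed by
    "identifier\\0value\\0" pairs."""
    assert chunk_data[0] == 0              # raw format
    fields = chunk_data[1:].split(b"\0")
    pairs = {}
    for i in range(0, len(fields) - 1, 2):
        if not fields[i]:                  # empty ident => end-of-list
            break
        pairs[fields[i].decode()] = fields[i + 1].decode()
    return pairs
```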
1224 | |
1225 Identifiers starting with bit 5 clear (uppercase) are part of the public ZTR | |
1226 spec. Any public identifier not listed as part of this spec should be | |
1227 considered as reserved. Identifiers that have bit 6 set (lowercase) are for | |
1228 private use and no restriction is placed on these. | |
1229 | |
1230 Multiple TEXT chunks may exist within the ZTR file. If so they are | |
1231 considered to be concatenated together. | |
1232 | |
1233 See below for the text identifier list. | |
1234 | |
1235 CLIP | |
1236 ---- | |
1237 | |
1238 Meta-data: none present | |
1239 | |
1240 Data: | |
1241 Byte number 0 1 2 3 4 5 6 7 8 | |
1242 +--+--+--+--+--+--+--+--+--+ | |
1243 Hex values | 0| left clip | right clip| | |
1244 +--+--+--+--+--+--+--+--+--+ | |
1245 | |
1246 This contains suggested quality clip points. These are stored as zero (raw | |
1247 data) followed by a 4-byte big endian value for the left clip point and a | |
1248 4-byte big endian value for the right clip point. Clip points are defined in | |
1249 units of base calls, starting from 0. (Q: is that correct!?) | |
1250 | |
1251 | |
1252 | |
1253 CR32 | |
1254 ---- | |
1255 | |
1256 Meta-data: none present | |
1257 | |
1258 Data: | |
1259 Byte number 0 1 2 3 4 | |
1260 +--+--+--+--+--+ | |
1261 Hex values | 0| CRC-32 | | |
1262 +--+--+--+--+--+ | |
1263 | |
1264 This chunk is always just 4 bytes of data containing a CRC-32 checksum, | |
1265 computed according to the widely used ANSI X3.66 standard. If present, the | |
1266 checksum will be a check of all of the data since the last CR32 chunk. | |
1267 This will include checking the header if this is the first CR32 chunk, and | |
including the previous CR32 chunk if it is not. Obviously the checksum will
1269 not include checks on this CR32 chunk. | |
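A CR32 chunk body could be built as below, assuming the ANSI X3.66 CRC-32 is the same widely implemented polynomial provided by zlib, and assuming (the spec does not state it) that the checksum is stored big-endian:

```python
import struct, zlib

def cr32_chunk(preceding_bytes: bytes) -> bytes:
    """Build a CR32 chunk body: format byte 0 then the 4-byte CRC-32 of
    everything since the last CR32 chunk. Big-endian storage and the
    zlib/IEEE CRC-32 polynomial are assumptions here."""
    crc = zlib.crc32(preceding_bytes) & 0xFFFFFFFF
    return b"\0" + struct.pack(">I", crc)
```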
1270 | |
1271 | |
1272 COMM | |
1273 ---- | |
1274 | |
1275 Meta-data: none present | |
1276 | |
1277 Data: | |
1278 Byte number 0 1 N | |
1279 +--+-- - --+ | |
1280 Hex values | 0| free text | | |
1281 +--+-- - --+ | |
1282 | |
This allows arbitrary textual data to be added. It does not require an
identifier-value pairing or any nul termination.
1285 | |
1286 | |
1287 DFLH | |
1288 ---- | |
1289 | |
1290 Meta-data: none present | |
1291 | |
1292 Data: | |
1293 Byte number 0 1 N | |
1294 +--+--+-- - - - - - - - - - - --+ | |
1295 Hex values | 0| C| Deflate format data ... | | |
1296 +--+--+-- - - - - - - - - - - --+ | |
1297 | |
1298 'C' is the code-set number referred to within that compression method. | |
1299 It should be 128 onwards and is used to distinguish between multiple | |
1300 huffman tables. It is used in conjunction with the data compression | |
1301 format 77 ("Deflate"). | |
1302 | |
1303 Following this is data in the Deflate format (RFC 1951). This should | |
1304 consist of the header for a single block using dynamic huffman with | |
1305 the BFINAL (last block) flag set. | |
1306 | |
1307 In Deflate streams the end of the huffman codes and the start of | |
1308 the compressed data stream itself may occur part way through a | |
byte. Therefore the last byte of this block is bitwise ORed
1310 with the first byte of the data stream compressed referring back to | |
1311 this code-set number. Therefore all unused bits in the last byte of | |
1312 this block should be set to zero. Likewise if the data bit-stream in | |
1313 this block ends on an exact byte boundary then an additional blank | |
1314 byte must be added to ensure the ORing method above still works. | |
1315 | |
1316 | |
1317 DFLC | |
1318 ---- | |
1319 | |
1320 Meta-data: none present | |
1321 | |
1322 Data: | |
1323 Byte number 0 | |
1324 +--+---+- - - - ---+--+-- - - - - - - - - - - - --+ | |
1325 Hex values | 0| C |code-order |FF| Deflate dynamic codes ... | | |
1326 +--+---+- - - - ---+--+-- - - - - - - - - - - - --+ | |
1327 | |
1328 Multi-context Deflate compression codes defined for use by data format | |
1329 78 (HUFF_MULTI). | |
1330 | |
1331 This is like the DFLH format, except it encodes multiple huffman trees | |
1332 instead of a single tree along with the order in which the multiple | |
1333 trees should be used (the "code-order"). | |
1334 | |
1335 'C' is the code-set number referred to within that compression method. | |
1336 It should be 128 onwards and is used to distinguish between multiple | |
1337 huffman tables. | |
1338 | |
1339 The code-order is a run-length encoded series of 8-bit numbers | |
1340 indicating which huffman code set should be used for which byte. For | |
1341 each byte in the input stream the HUFF_MULTI method selects the | |
appropriate huffman code by indexing code-order with the input
1343 data position modulo the number of values in code-order. | |
1344 | |
1345 Following this is data in the Deflate format (RFC 1951). This should | |
1346 consist of the header component for a single block using dynamic | |
1347 huffman with the BFINAL (last block) flag set, up to and including | |
1348 the HDIST+1 code lengths for the distance alphabet. This will then be | |
1349 immediately followed by the next set of huffman codes, and so on until | |
all index values contained within the code-order have been accounted
1351 for. | |
1352 | |
1353 In Deflate streams the end of the huffman codes and the start of | |
1354 the compressed data stream itself may occur part way through a | |
byte. Therefore the last byte of this block is bitwise ORed
1356 with the first byte of the data stream compressed referring back to | |
1357 this code-set number. Therefore all unused bits in the last byte of | |
1358 this block should be set to zero. Likewise if the data bit-stream in | |
1359 this block ends on an exact byte boundary then an additional blank | |
1360 byte must be added to ensure the ORing method above still works. | |
1361 | |
1362 | |
1363 For example, compression of 16-bit data is sometimes best achieved by | |
1364 producing one set of huffman codes for the top 8 bits and another set | |
1365 for the bottom 8 bits, rather than mixing these together by treating | |
1366 the 16-bit data as a series of 8-bit quantities. In this case our | |
1367 code-order would consist of just two entries; (0, 1). | |
1368 | |
1369 Alternatively we may have 4 1-byte confidence values stored per base | |
1370 in the order of the confidence of the base-called base type first | |
1371 followed by the 3 remaining confidence values. We observe that | |
1372 compressing byte 0, 4, 8, 12, ... as one set and bytes 1,2,3, 5,6,7, | |
1373 ... as another set yields higher compression ratios. In this case the | |
1374 code-order would consist of 4 entries; (0, 1, 1, 1). | |
1375 | |
1376 | |
1377 REGN | |
1378 ---- | |
1379 | |
1380 Meta-data: optional - see below | |
1381 | |
1382 Data: | |
1383 Byte number 0 1 2 3 4 5 6 7 8 | |
1384 +--+---+---+---+---+---+---+---+---+ | |
1385 Hex values | 0| 1st boundary | 2nd boundary | ... | |
1386 +--+---+---+---+---+---+---+---+---+ | |
1387 | |
1388 This chunk is used to break a trace down into a series of segments. We | |
1389 store the boundary between segments, so the list above will contain | |
1390 one less boundary than there are segments with the first segment | |
implicitly starting from the first base and the last segment implicitly
1392 extending to the last base. | |
1393 | |
1394 Each 4-byte unsigned value indicates a position within the sequence or | |
1395 trace counting from 0 as the first element and marking the first base | |
1396 of the next region. For example three regions of DNA may be: | |
1397 | |
1398 0 1 2 3 4 5 6 7 8 9 10 11 12 | |
1399 T A C G G A T T C G A A C | |
1400 |<-reg. 1->| |<--reg. 2--->| |<-reg. 3->| | |
1401 | |
1402 This would give the 1st boundary as 4 and the 2nd boundary as 9. | |
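Recovering the segments from a boundary list can be sketched as:

```python
def regn_segments(boundaries, nbases):
    """Split base positions 0..nbases-1 into regions. Each boundary
    marks the first base of the next region; the first region starts
    implicitly at base 0 and the last extends to the final base."""
    edges = [0] + list(boundaries) + [nbases]
    return [(edges[i], edges[i + 1] - 1) for i in range(len(edges) - 1)]
```

With the boundaries 4 and 9 over the 13-base example this returns the three regions shown above.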
1403 | |
1404 The lack of a REGN chunk implies one single region extending from the | |
1405 first to last base in the sequence. | |
1406 | |
1407 Valid identifiers for the meta-data are: | |
1408 | |
1409 Ident Meaning Value(s) | |
1410 --------------------------------------------------------------------- | |
1411 COORD Coordinate system 'T' = trace coordinates | |
                                   'B' = base coordinates (default)
1413 | |
1414 NAME Region names A semicolon separated list of | |
1415 "name:code" pairs. Eg | |
1416 primer1:T;read1:P;primer2:T;read2:P | |
1417 | |
1418 [FIXME: NAME identifier here is the same as the REGION_LIST TEXT | |
1419 identifier. We need to decide where it belongs and pick one. If we can | |
1420 get a way to specify the default meta-data contents then logically | |
1421 speaking the best place to store this is in the meta-data along side | |
1422 the chunk data itself.] | |
1423 | |
1424 The NAME identifier is used to attach a meaning to the regions | |
1425 described in the data chunk. It consists of a semi-colon separated | |
1426 list of names or name:code pairs. The codes, if present are a single | |
1427 character from the predefined list below and are separated from the | |
1428 name by a colon. | |
1429 | |
1430 Code Meaning | |
1431 --------------------------------------- | |
1432 T Tech read (e.g. primer, linker) | |
1433 B Bio read | |
1434 I Inverted read | |
1435 D Duplicate read | |
1436 P Paired read | |
1437 | |
1438 FIXME: I don't like the above meanings. They don't, well, "mean" much | |
1439 to me! What's a tech read? | |


Text Identifiers
================

These are for use in the TEXT segments. None are required, but if any of these
identifiers are present they must conform to the description below. Much
(currently all) of this list has been taken from the NCBI Trace Archive [2]
documentation. It is duplicated here as the ZTR spec is not tied to the same
revision schedules as the NCBI trace archive (although it is intended that any
suitable updates to the trace archive should be mirrored in this ZTR spec).

The Trace Archive specifies a maximum length for values. The ZTR spec does not
have length limitations, but for compatibility these sizes should still be
observed.

The Trace Archive also states that some identifiers are mandatory; these are
marked by asterisks below. These identifiers are not mandatory in the ZTR spec
(but clearly they need to exist if the data is to be submitted to the NCBI).

Finally, some fields are not appropriate for use in the ZTR spec, such as
BASE_FILE (the name of a file containing the base calls). Such fields are
included only for compatibility with the Trace Archive. It is not expected that
use of ZTR would allow for the base calls to be read from an external file
instead of the ZTR BASE chunk.

[ Quoted from TraceArchiveRFC v1.17 ]

Identifier         Size  Meaning                      Example value(s)
----------         ----- ---------------------------- -----------------
TRACE_NAME *       250   name of the trace            HBBBA1U2211
                         as used at the center
                         unique within the center
                         but not among centers.

SUBMISSION_TYPE *  -     type of submission

CENTER_NAME *      100   name of center               BCM
CENTER_PROJECT     200   internal project name        HBBB
                         used within the center

TRACE_FILE *       200   file name of the trace       ./traces/TRACE001.scf
                         relative to the top of
                         the volume.

TRACE_FORMAT *     20    format of the tracefile

SOURCE_TYPE *      -     source of the read

INFO_FILE          200   file name of the info file
INFO_FILE_FORMAT   20

BASE_FILE          200   file name of the base calls
QUAL_FILE          200   file name of the quality values


TRACE_DIRECTION    -     direction of the read
TRACE_END          -     end of the template
PRIMER             200   primer sequence
PRIMER_CODE              which primer was used

STRATEGY           -     sequencing strategy
TRACE_TYPE_CODE    -     purpose of trace

PROGRAM_ID         100   creator of trace file        phred-0.990722.h
                         program-version

TEMPLATE_ID        20    used for read pairing        HBBBA2211

CHEMISTRY_CODE     -     code of the chemistry        (see below)
ITERATION          -     attempt/redo                 1
                         (int 1 to 255)

CLIP_QUALITY_LEFT        left clip of the read in bp due to quality
CLIP_QUALITY_RIGHT       right  "   "   "   "   "
CLIP_VECTOR_LEFT         left clip of the read in bp due to vector
CLIP_VECTOR_RIGHT        right  "   "   "   "   "


SVECTOR_CODE       40    sequencing vector used       (in table)
SVECTOR_ACCESSION  40    sequencing vector used       (in table)
CVECTOR_CODE       40    clone vector used            (in table)
CVECTOR_ACCESSION  40    clone vector used            (in table)

INSERT_SIZE        -     expected size of insert      2000,10000
                         in base pairs (bp)
                         (int 1 to 2^32)

PLATE_ID           32    plate id at the center
WELL_ID                  well                         1-384


SPECIES_CODE *     -     code for species
SUBSPECIES_ID      40    name of the subspecies
                         Is this the same as strain

CHROMOSOME         8     name of the chromosome       ChrX, Chr01, Chr09


LIBRARY_ID         30    the source library of the clone
CLONE_ID           30    clone id                     RPCI11-1234

ACCESSION          30    NCBI accession number        AC00001

PICK_GROUP_ID      30    an id to group traces picked
                         at the same time.
PREP_GROUP_ID      30    an id to group traces prepared
                         at the same time


RUN_MACHINE_ID     30    id of sequencing machine
RUN_MACHINE_TYPE   30    type/model of machine
RUN_LANE           30    lane or capillary of the trace
RUN_DATE           -     date of run
RUN_GROUP_ID       30    an identifier to group traces
                         run on the same machine

[ End of quote from TraceArchiveRFC ]

More detailed information on the format of these values should be obtained
from the Trace Archive RFC [2].

In addition to the above, the following TEXT identifiers have meaning
specific to the ZTR format:

Identifier   Meaning                      Example value(s)
----------   ---------------------------- -------------------------------
REGION_LIST  A semicolon separated list   primer1:T;read1:P
             identifying regions of a
             trace. See the REGN chunk    Region 1;Region 2;Region 3
             definition for details.


FIXME: Should this simply be the meta-data associated with the REGN
chunk?


References
==========
[1] IUPAC: http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html

[2] http://www.ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html

[3] J.Bonfield and R.Staden, "ZTR: a new format for DNA sequence trace
    data". Bioinformatics Vol. 18 no. 1 2002.

FIXME: As an aside, not doing the final entropy encoding steps (zlib,
deflate, etc) and just using bzip2 on an entire SRF archive yields a
considerable saving. On tests it varied between 23% (27bp reads) and
13% (74bp reads) smaller than the Deflate compressed data.
Unfortunately it pretty much removes all chance of random access
in the data unless I can get a working FM-Index implementation
(which is very unlikely in a short time). This makes it appropriate
for transmission perhaps, but not for indexing and querying random
sequences.
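
The kind of comparison described can be sketched with Python's
standard zlib (Deflate) and bz2 modules. The data below is a synthetic
stand-in for many concatenated ZTR chunks; it does not reproduce the
13-23% figures above, which were measured on real SRF archives:

```python
import bz2
import zlib

# Synthetic stand-in for a stream of ZTR chunks: repeated block type
# names and zeroed 4-byte meta-data lengths, plus a varying payload.
data = (b"SMP4" + b"\x00" * 4 + bytes(range(64))) * 500

deflated = zlib.compress(data, 9)   # per-chunk style Deflate
bzipped = bz2.compress(data, 9)     # whole-stream bzip2 (BWT based)

print(len(data), len(deflated), len(bzipped))
```

bzip2's block-sorting transform tends to exploit the long-range
repetition between chunks that per-chunk Deflate never sees, which is
where the whole-archive saving comes from.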

A substantial chunk (5-9%) of this saving comes from the repeated ZTR
block types (names like "BASE", "CNF4" and common components like
0x00000000 for the meta-data size). The remainder probably comes from
similarities between one ZTR file and another.