comparison srf2fastq/io_lib-1.12.2/docs/ZTR_format @ 0:d901c9f41a6a default tip

Migrated tool version 1.0.1 from old tool shed archive to new tool shed repository
author dawe
date Tue, 07 Jun 2011 17:48:05 -0400
parents
children
1 Notes: 28th May 2008
2
3 For version 2.0 consider the following:
4
5 1) Remove defunct or useless chunk types and compression formats.
6 2) Rationalise inconsistent behaviour (eg endianness on zlib chunk).
7 3) Support split header/data formats for SRF
8 4) Formalise meta-data use better.
9 5) More pie-in-the-sky ideas?
10
11 What we've described so far could easily be said to be v1.4. It's
12 backwards compatible and fairly minor in change. If we truly want to
13 go for version 2 then taking the chance to remove all those niggles
14 that we've kept purely for backwards compatibility would be good.
15
16 In more detail:
17
18 1) Removal of RLE and floating point chebyshev polynomials. Mark XRLE
19 as deprecated?
20
21 We may wish to add an extra option to XRLE2 to indicate the repeat
22 count before specifying the remaining run-length. This breaks the
23 format though. (Or add XRLE3 to allow such control?)
24
25 2) Strange things I can see are:
26
27 2.1) All chunks use big-endian data except for zlib which has a
28 little-endian length.
29
30 2.2) The order that data is stored in differs per chunk type. For
31 trace data we store all As, then all Cs, all Gs and finally
32 all Ts. For confidence values we store called first followed
33 by remaining. Both SMP4 and CNF4 essentially hold 1 piece of
34 data per base type per base position, it's just the word size
35 and packing order that differs.
36
37 This means TSHIFT and QSHIFT compression types are tied very
38 much to trace and quality value chunks, rather than being
39 generic transforms. Maybe we should always have the same
40 encoding order and some standard compression/transformations
41 to reorder as desired.
42
43 An example:
44 All data related per call is stored in the natural order
45 produced. (eg as utilised in CNF1, BPOS).
46
47 All data related per base-type per call is stored in the order
48 produced: A, C, G, T for the first base position, A, C, G, T
49 for the second position, and so on.
50
51 Then we have standard filters that can swap between
52 ACGTACGTACGT... and AAA...CCC...GGG...TTT... or to
53 <called><non-called * 3>... order (which requires a BASE chunk
54 present to encode/decode). We'd have 1, 2 and 4 byte variants
55 of such filters. They do not need to understand the nature of
56 the data they're manipulating, just the word size and a
57 predetermined order to shuffle the data around in.
58
59 For CNF4 a combination of {ACGT}* to {<called><non-called*3>}*
60 followed by {ACGT}* to A*C*G*T* ordering would end up with all
61 <called> followed by all 3 remaining non-called. Ie as it is
62 now (which we then promptly "undo" in solexa data by using
63 TSHIFT).
64
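As an illustration of how generic such reorder filters could be, here is a Python sketch (not part of any spec; the helper names are made up) converting between the interleaved ACGTACGT... order and the planar A*C*G*T* order, parameterised only by word size and channel count:

```python
def interleaved_to_planar(data, nchan=4, word=1):
    """Reorder ACGTACGT... words into AAA...CCC...GGG...TTT... planes.

    The filter only needs the word size and channel count; it never
    interprets the data itself.
    """
    n = len(data) // (nchan * word)
    out = bytearray()
    for chan in range(nchan):
        for i in range(n):
            pos = (i * nchan + chan) * word
            out += data[pos:pos + word]
    return bytes(out)

def planar_to_interleaved(data, nchan=4, word=1):
    """Inverse transform: planes back to interleaved order."""
    n = len(data) // (nchan * word)
    out = bytearray(len(data))
    for chan in range(nchan):
        for i in range(n):
            src = (chan * n + i) * word
            dst = (i * nchan + chan) * word
            out[dst:dst + word] = data[src:src + word]
    return bytes(out)
```

The same pair, with 1, 2 and 4 byte word variants, would cover both the SMP4 and CNF4 layouts described above.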
65 3) I'm wondering if there's mileage here in having negative lengths to
66 indicate constant data + variable data further on.
67
68 Eg length -10 means the next 10 bytes are the start of the data for
69 this chunk. Some stage later we'll read a 4-byte length followed by
70 the remaining data for this chunk.
71
72 Rationale: often we end up with many identical bytes at the start
73 of a chunk. For example, we take a solexa trace (0 0 value...), run
74 it through TSHIFT (80 0 0 0 previous data => 80 0 0 0 0 0 value ...)
75 and then through STHUFF (77 80(eg) data), but data is the
76 compressed stream always starting with 80 0 0 0 0 0 so typically it's
77 always the same starting string.
78
79 Tested on an SRF file I see SMP4 always starting with the same 9
80 bytes of data, BASE starting with the same 3 bytes and CNF4 always
81 starting with the same 7 bytes. Hence we'd have lengths -9, -3 and
82 -7 in the chunk headers and move that common data to the header
83 block too. That's approx 3% of the size of our SRF file.
84
85 4) I propose *all* chunks have some standard meta-data fields
86 available for use. These can be:
87
88 4.1) GROUP - all chunks sharing the same GROUP value are considered
89 as being related to one another. This provides a mechanism for
90 multiple base-call, base position and confidence value chunks
91 while still knowing which confidence values belong to which
92 call. It also allows for multiple SAMP chunks (instead of the
93 SMP4 chunk) to be collated together if desired.
94
95 I don't expect many ZTR files to contain calls from multiple
96 base-callers, but it's maybe a nice extension and seems quite
97 a simple/clean use of meta-data.
98
99 4.2) ENCODING - the default encoding for the chunk data is as
100 described in the chunk. We may however wish to override this
101 and, for example, store SMP4 data as 32-bit floating point
102 values instead of 16-bit integers. This specifies that.
103
104 Question: do we want this available universally everywhere? If
105 not, we should at least use the same meta-data keyword for all
106 occurrences.
107
108 4.3) TRANSFORM - a simple transformation description. This is
109 essentially a mini-formula. It replaces the OFFS meta-data
110 used in SMP4 which is simply a transform of X+value.
111
112 5) There are more generic ways to save storage by removing redundancy.
113
114 Most probably they're not worth it, but I list them here for
115 discussion anyway.
116
117 5.1) Use 7-bit variable sized encodings for values instead of fixed
118 32-bit sizes.
119
120 Eg instead of storing 1000 as 0x3*0x100 + 0xe8 (00 00 03 e8)
121 we could store it as 0x7*0x80 + 0x68 (80|07 68). The logic
122 here being setting the top bit implies this isn't the final
123 value and more data follows. It allows for variable sized
124 fields so that small numbers take up fewer bytes. The same can
125 be applied to data in SRF structs too.
126
127 Realistically it saves 2 bytes per record in SRF and an
128 unknown amount for ZTR - estimated 8 or so (3 for cnf4/base
129 and 2 for smp4). It's only 1.5% saving though in total.
130
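The 7-bit scheme above is easy to sketch. A Python illustration (hypothetical helper names), with the top bit as a continuation flag and the most significant 7-bit groups first, so 1000 becomes 0x87 0x68:

```python
def encode_varint(value):
    # split into 7-bit groups, most significant first;
    # every byte except the last has the top bit set
    groups = []
    while True:
        groups.append(value & 0x7f)
        value >>= 7
        if not value:
            break
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def decode_varint(data, pos=0):
    # returns (value, next position) so callers can walk a stream
    value = 0
    while True:
        b = data[pos]; pos += 1
        value = (value << 7) | (b & 0x7f)
        if not (b & 0x80):
            return value, pos
```

Small numbers cost one byte instead of four, which is where the estimated saving comes from.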
131 5.2) A general purpose dictionary system. Instead of attempting to
132 move headers to one area and data somewhere else, possibly
133 also taking common portions of data and putting that somewhere
134 too, we could provide a dictionary system whereby we
135 previously remove redundancy by replacing all occurrences of a
136 particular byte pattern with a new shorter code. (We'd need an
137 escape mechanism for when it occurs by chance.) The dictionary
138 can then be specified in its own chunk which is stored in the
139 header portion.
140
141 This then works for portions of chunk header (eg if the
142 meta-data changes) rather than full headers, where the data
143 blocks always start with the same text, or where we want to
144 have sensible names in text fields but don't like them taking
145 up too much space.
146
147 It's maybe a bit messy though and complex to implement, plus
148 it's unknown how big an impact escaping accidental dictionary
149 codes in real data would have. The more formal
150 way of removing redundancy is probably better.
151
152 5.3) Lossy compression. I believe there's still room for this,
153 although it needs careful thought.
154
155 The floating point format really isn't an ideal way to do it
156 though, so I'd much rather have an encoding system that uses
157 N*log(signal/M+1) plus a sign bit, stored in integers.
158
159 As we store data in integers the value of N combined with the
160 maximum value for log(signal/M+1) gives us the number of bits
161 we wish to encode to. Essentially we're storing the log value
162 to a fixed point precision.
163
164 The value of M dictates the slope of the errors we get from
165 logging. It's hard to describe, but basically as signal gets
166 larger our average error in storing the signal also gets
167 larger. That's true for floating point values too as there's
168 a fixed number of bits and they're being used to represent
169 larger and larger values, meaning the resolution drops.
170
171 I have various test code and graphs showing error profiles
172 for logs vs fixed point vs floating point. Logs or fixed
173 point are nearly always preferable to a floating point format
174 for size vs accuracy.
175
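For concreteness, a Python sketch of the proposed N*log(signal/M+1) encoding with a sign bit; the N and M below are arbitrary example constants, not proposed values:

```python
import math

def log_quantise(signal, N=64.0, M=16.0):
    # store round(N * log(|signal|/M + 1)) with the sign carried separately
    sign = -1 if signal < 0 else 1
    return sign * round(N * math.log(abs(signal) / M + 1))

def log_dequantise(q, N=64.0, M=16.0):
    # inverse: the error grows with |signal|, as described above
    sign = -1 if q < 0 else 1
    return sign * (math.exp(abs(q) / N) - 1) * M
```

As the text notes, the relative error stays roughly constant as the signal grows, which is the behaviour we get from floating point but at a fixed, chosen bit cost.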
176 -----------------------------------------------------------------------------
177
178 CHANGE (since 1.2):
179 SAMP and SMP4 now have meta-data fields indicating the zero base-line.
180
181 CLARIFICATION
182 The specification now explicitly states that trace samples are
183 unsigned, although the new OFFS meta-data can be used to turn these
184 into signed values.
185
186 CLARIFICATION
187 We explicitly state that multiple TEXT chunks may be present in the ZTR
188 file and will be concatenated together. Also the trailing (nul) byte
189 is now optional.
190
191 CHANGE
192 Added CSET (character set) meta-data for BASEs so ABI SOLID encoding
193 can be used. This removes the requirement of IUPAC characters only.
194
195 CHANGE
196 Added XRLE2, QSHIFT, TSHIFT and STHUFF compression types.
197
198 INCOMPATIBLE CHANGE:
199 I propose for this version to make all meta-data adhere to a specific
200 format rather than ad hoc. It'll consist of zero or more copies of
201 'identifier nul value nul'. See the format below for details.
202
203 The only use of meta-data in 1.2 was for SAMP (not SMP4) chunks to
204 indicate the channel the data came from. From now on file readers will
205 need to check the version number in the header to determine how to
206 parse the SAMP meta-data.
207
208
209 [Search for "FIXME" for my comments / questions to be answered. They
210 elaborate on the summary below and provide more context.]
211
212
213 QUESTION1:
214 Should we adapt ZTR to not be so inefficient with regards
215 to tiny chunks. Specifically a 5 byte chunk size, 4 byte meta-data
216 size (normally zero anyway) and 4 byte data length is all
217 wasteful. These combined comprise 5-10% of the total SRF size. Note
218 that changing this would break backwards compatibility.
219
220 QUESTION2:
221 Do I need a means to specify the "default meta-data". Specifically if
222 we have lots of SAMP chunks (for example) and every single one is
223 stating that the zero "offset" value is 32768 then we may want a
224 mechanism of specifying that the default OFFS value is 32768 for all
225 subsequent SAMP chunks.
226
227 One possible way to do this is to have a new chunk type which sets the
228 default. Eg for the SAMP chunk we could define a SaMP chunk to modify
229 the default for SAMP. This seems oddly named, but it's utilising the
230 bit 5 of the 2nd byte which so far has been reserved as zero. (In the
231 first byte bit 5 set => private namespace and not part of the public spec.)
232
233 For now I'm just ignoring this issue though.
234
235 QUESTION3:
236 I've defined new transforms named TSHIFT and QSHIFT specifically
237 designed for adjusting the layout of CNF4 and SMP4 chunk types to an
238 order more amenable for compression by interlaced deflate. They do the
239 job, but I'm wondering if it's better to simply redefine the input
240 data to be a more consistent ordering so that we can define more
241 general purpose transforms rather than one dedicated to the original
242 trace layout and one for the quality layout.
243
244 I'm ignoring this for now as it would break backwards compatibility.
245
246 QUESTION4:
247 For the OFFS meta-data in SMP4 and SAMP chunks I have a 16-bit offset
248 to specify the zero position. Ie OFFS of 10000 means a sample of 9000
249 becomes -1000 after processing.
250
251 Should it be a signed or unsigned 16-bit value? Signed means we could
252 encode values ranging from 10000 to 70000 by specifying OFFS as -10000.
253
254 Should it be 32-bit instead? Should we have OFFI and OFFF for integer
255 and floating point equivalents?
256
257 QUESTION5:
258 For region encoding where should the region name belong - the
259 meta-data section or the REGION_LIST TEXT identifier? It's currently
260 in both places. My gut instinct tells me it belongs in the meta-data
261 for the REGION_LIST chunk itself.
262
263 QUESTION6:
264 Can we have clarification on what the region code types mean,
265 specifically "tech read".
266
267 QUESTION7:
268 Should we add SAMP/SMP4 meta-data indicating a down-scale factor? For
269 454 data this could be 100, so we know value 123 is really 1.23. Note
270 this is maybe better implemented below using fixed-point precision.
271
272 QUESTION8:
273 How do we deal with floating point values?
274
275 I think the chunk meta-data should detail the format of the data block
276 itself (as it is strictly speaking data about the data so it fits
277 there well). A lack of meta data should imply the usual unsigned
278 16-bit quantities.
279
280 There are two main ways to encode fractions:
281
282 Floating point where we have a mantissa and an exponent.
283 - See http://en.wikipedia.org/wiki/IEEE_floating-point_standard
284 - large dynamic range
285 - fixed number of significant bits
286 - varying "resolution". Ie can represent tiny differences
287 between two very small floating point numbers, but not
288 between two very large floating point numbers.
289
290 Fixed point where we have a fixed number of bits for the component
291 before and after the decimal point.
292 - See http://en.wikipedia.org/wiki/Q_%28number_format%29
293 - constant resolution
294 - effectively used by SFF (specified to 2 decimal places)
295 - easy to treat as integers so can be fast and dealt with by
296 small embedded CPUs without FPUs.
297
298
299 Floating point may be appropriate as effectively it's the same as
300 logging your signals and storing those. It offers large dynamic range
301 so can cope with abnormally large values (at the expense of precision)
302 while retaining lots of variation at the low end to distinguish small
303 values. However it's CPU intensive to cope with anything other than
304 the CPU provided 32-bit and 64-bit floating point formats.
305
306 Single precision 32-bit floats in IEEE-754 have:
307 1 bit (31): Sign
308     8 bits  (23-30): Exponent (bias 127, so storing 100 => -27)
309 23 bits (0-22): Mantissa
310
311 Effectively we store any binary value as a normalised expression:
312
313 <exponent>
314 1.<mantissa> * 2
315
316
317 Eg 1732.5:
318
319 => 11011000100.1 (binary)
320 => 1.10110001001 (binary) * 2^10
321
322 Exponent+127 => 137 => 10001001 (binary)
323
324 sign exponent mantissa
325 0 10001001 10110001001000000000000
326
327 (17325 => 0x43ad => 0100001110101101 in binary)
328
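The worked example can be checked mechanically. A Python snippet (illustrative, not part of the spec) that pulls the sign, exponent and mantissa fields out of an IEEE-754 single:

```python
import struct

def float32_fields(x):
    # round-trip through the big-endian IEEE-754 single precision encoding
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xff       # stored with a bias of 127
    mantissa = bits & 0x7fffff           # 23 fraction bits, implicit leading 1
    return sign, exponent, mantissa
```

For 1732.5 this reproduces the sign 0, biased exponent 137 and mantissa 10110001001 followed by zeros shown above.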
329 However we probably want 16-bit and 24-bit floating point types for
330 efficiency's sake. Do we go with some fixed predefined floating point
331 formats for 8-bit, 16-bit, 24-bit and 32-bit layouts (with 32-bit
332 being identical to IEEE754) or do we allow for specification of the
333 mantissa and exponent, eg FLOAT=23.8, FLOAT=17.6 or FLOAT=5.2 in the
334 meta-data block?
335
336 FLOAT=17.6 (24-bit) gives ranges +/- 8.6*10^9
337 FLOAT=5.2 (8-bit) gives ranges +/- 64 (I think).
338
339 Alternatively if we restrict ourselves to only using the most
340 significant 14 bits of the mantissa then storing as standard 32-bit
341 floats implies 1 in every 4 bytes is zero. This may provide for a
342 very crude, but fast way to implement reduced size floating point
343 values - ie FLOAT=15.8 (24-bit signed).
344
345 For fixed point (as in SFF values) there's already a draft standard
346 for implementation in C (ISO/IEC TR 18037:2004).
347
348 One benefit of fixed point over floating point is speed of
349 implementation. Fixed point numbers can just be dealt with as
350 integers. Eg subtracting two fixed point 16-bit values can be done in
351 integers using a-b and the result is the same as if we'd done all the
352 bit twiddling and maths directly simulating a real fixed-point unit.
353
354 My gut feeling is that we'd want to explicitly declare the number of
355 bits for integral and fractional components in the meta-data block.
356
357 Comments?
358
359 James
360
361 PS. The latest (only minor tweaks from before) ZTR draft spec
362 follows.
363
364
365
366
367 1.3 draft 3 (19 Oct 2007)
368
369 ZTR SPEC v1.3
370 =============
371
372 Header
373 ======
374
375 The header consists of an 8 byte magic number (see below), followed by
376 a 1-byte major version number and 1-byte minor version number.
377
378 Changes in minor numbers should not cause problems for parsers. It indicates
379 a change in chunk types (different contents), but the file format is the
380 same.
381
382 The major number is reserved for any incompatible file format changes (which
383 hopefully should be never).
384
385 /* The header */
386 typedef struct {
387 unsigned char magic[8]; /* 0xae5a54520d0a1a0a (b.e.) */
388 unsigned char version_major; /* 1 */
389 unsigned char version_minor; /* 3 */
390 } ztr_header_t;
391
392 /* The ZTR magic numbers */
393 #define ZTR_MAGIC "\256ZTR\r\n\032\n"
394 #define ZTR_VERSION_MAJOR 1
395 #define ZTR_VERSION_MINOR 3
396
397 So the total header will consist of:
398
399 Byte number 0 1 2 3 4 5 6 7 8 9
400 +--+--+--+--+--+--+--+--+--+--+
401 Hex values |ae 5a 54 52 0d 0a 1a 0a|01 03|
402 +--+--+--+--+--+--+--+--+--+--+
403
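A minimal Python sketch of writing and validating this header (helper names are illustrative):

```python
ZTR_MAGIC = b"\xaeZTR\r\n\x1a\n"   # 0xae5a54520d0a1a0a big-endian

def make_header(major=1, minor=3):
    return ZTR_MAGIC + bytes([major, minor])

def check_header(buf):
    # minor version changes must not break parsers; only reject on
    # an unknown major number or a bad magic
    if buf[:8] != ZTR_MAGIC:
        raise ValueError("not a ZTR file")
    major, minor = buf[8], buf[9]
    if major != 1:
        raise ValueError("unsupported major version %d" % major)
    return major, minor
```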
404 Chunk format
405 ============
406
407 The basic structure of a ZTR file is (header,chunk*) - ie header followed by
408 zero or more chunks. Each chunk consists of a type, some meta-data and some
409 data, along with the lengths of both the meta-data and data.
410
411 Byte number 0 1 2 3 4 5 6 7 8 9
412 +--+--+--+--+----+----+----+---+--+ - +--+--+--+--+--+-- - --+
413 Hex values | type |meta-data length | meta-data |data length| data .. |
414 +--+--+--+--+----+----+----+---+--+ - +--+--+--+--+--+-- - --+
415
416 FIXME: For very short reads this is a large overhead. We have 8 bytes
417 of length information (of which typically only 1-2 are non-zero) and 4
418 bytes for type (which typically only has one of 4-5 values). This
419 means about 10 bytes wasted per chunk, or maybe 5-10% of the total
420 file size. Changing this would be a radical departure from ZTR; is it
421 justified given the savings? (est. 4.8% for 74bp reads, 8.4% for 27bp
422 reads).
423 One idea is to consider a ZTR file (the non "block" components at
424 least) to be a series of huffman codes, by default all 8-bit long and
425 matching their ASCII codes. Then a dedicated chunk could be used to
426 adjust these default codes. It's therefore backwards compatible, but
427 is that also overkill? (NB, this looks like it'd save 6% on the
428 overall file size.)
429
430 Ie in C:
431
432 typedef struct {
433 uint4 type; /* chunk type (b.e.) */
434 uint4 mdlength; /* length of meta-data field (b.e.) */
435 char *mdata; /* meta data */
436 uint4 dlength; /* length of data field (b.e.) */
437 char *data; /* a format byte and the data itself */
438 } ztr_chunk_t;
439
440 All 2 and 4-byte integer values are stored in big endian format.
441
442 The meta-data is uncompressed (and so it does not start with a format
443 byte). From version 1.3 onwards meta-data is defined to be in key
444 value pairs adhering to the same structure defined in the TEXT chunk
445 ("key\0value\0"). Exceptions are made for this only for purposes of
446 backwards compatibility in the SAMP chunk type. The contents of the
447 meta-data is chunk specific, and many chunk types will have no
448 meta-data. In this case the meta-data length field will be zero and
449 this will be followed immediately by the data-length field.
450
451 Ie all meta-data adheres to the following structure:
452
453 Meta-data: (version 1.3 onwards only)
454 +- - -+--+- - -+--+- -+- - -+--+- - -+--+
455 Hex values | ident | 0| value | 0| - | ident | 0| value | 0|
456 +- - -+--+- - -+--+- -+- - -+--+- - -+--+
457
458 FIXME: Can we specify the meta-data once per ZTR file and omit it
459 in subsequent chunks? Eg a blank chunk with meta-data only in the
460 header. Chunks in the body then specify meta-data length as 0xFFFFFFFF
461 as an indicator meaning "use the last meta-data defined for this chunk
462 type". Useful when split in two, as in SRF?
463
464 Note that this means both ident and values must not themselves contain
465 the zero byte (a nul character), hence we generally store ident-value
466 pairs in ASCII string forms.
467
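A Python sketch of serialising and parsing such "ident nul value nul" meta-data (illustrative helper names; assumes idents and values are ASCII and contain no nul bytes, as required above):

```python
def encode_meta(pairs):
    # zero or more copies of 'identifier nul value nul'
    out = bytearray()
    for ident, value in pairs:
        out += ident.encode() + b"\0" + value.encode() + b"\0"
    return bytes(out)

def decode_meta(data):
    fields = data.split(b"\0")
    if fields[-1] == b"":
        fields.pop()                 # drop the split after the final nul
    if len(fields) % 2:
        raise ValueError("truncated ident/value pair")
    return [(fields[i].decode(), fields[i + 1].decode())
            for i in range(0, len(fields), 2)]
```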
468 The data length ("dlength") is the length in bytes of the entire
469 'data' block, including the format information held within it.
470
471 The first byte of the data consists of a format byte. The most basic format is
472 zero - indicating that the data is "as is"; it's the real thing. Other formats
473 exist in order to encode various filtering and compression techniques. The
474 information encoded in the next bytes will depend on the format byte.
475
476
477 RAW (#0) - no formatting
478 --------
479
480 Byte number 0 1 2 N
481 +--+--+-- - --+
482 Hex values | 0| raw data |
483 +--+--+-- - --+
484
485 Raw data has no compression or filtering. It just contains the unprocessed
486 data. It consists of a one byte header (0) indicating raw format followed by N
487 bytes of data.
488
489
490 RLE (#1) - simple run-length encoding
491 -------
492
493 Byte number 0 1 2 3 4 5 6 7 8 N
494 +--+----+----+-----+-----+-------+--+--+--+-- - --+--+--+
495 Hex values | 1| Uncompressed length | guard | run length encoded data|
496 +--+----+----+-----+-----+-------+--+--+--+-- - --+--+--+
497
498 Run length encoding replaces stretches of N identical bytes (with value V)
499 with the guard byte G followed by N and V. All other byte values are stored
500 as normal, except for occurrences of the guard byte, which is stored as G 0.
501 For example with a guard value of 8:
502
503 Input data:
504 20 9 9 9 9 9 10 9 8 7
505
506 Output data:
507 1 (rle format)
508 0 0 0 10 (original length)
509 8 (guard)
510 20 8 5 9 10 9 8 0 7 (rle data)
511
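A Python sketch of a decoder for this format (illustrative, checked against the example above):

```python
def rle_decode(block):
    assert block[0] == 1                         # RLE format byte
    length = int.from_bytes(block[1:5], "big")   # uncompressed length
    guard = block[5]
    out = bytearray()
    i = 6
    while i < len(block):
        b = block[i]; i += 1
        if b != guard:
            out.append(b)
        elif block[i] == 0:
            out.append(guard)                    # escaped literal guard
            i += 1
        else:
            count, value = block[i], block[i + 1]
            out += bytes([value]) * count
            i += 2
    assert len(out) == length
    return bytes(out)
```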
512
513 ZLIB (#2) - see RFC 1950
514 ---------
515
516 Byte number 0 1 2 3 4 5 6 7 N
517 +--+----+----+-----+-----+--+--+--+-- - --+
518 Hex values | 2| Uncompressed length | Zlib encoded data|
519 +--+----+----+-----+-----+--+--+--+-- - --+
520
521 This uses the zlib code to compress a data stream. The ZLIB data may itself be
522 encoded using a variety of methods (LZ77, Huffman), but zlib will
523 automatically determine the format itself. Often using zlib mode
524 Z_HUFFMAN_ONLY will provide best compression when combined with other
525 filtering techniques.
526
527
528 XRLE (#3) - multi-byte run-length encoding
529 ---------
530
531 Byte number 0 1 2 3 4 5 N
532 +--+------+-------+--+--+--+-- - --+--+--+
533 Hex values | 3| size | guard | run length encoded data|
534 +--+------+-------+--+--+--+-- - --+--+--+
535
536 Much like standard RLE, but this mechanism has a byte to specify the
537 length of the data item we compare to check for runs. It is not
538 restricted to spotting runs aligned on 'size'-byte boundaries either.
539
540 No uncompressed length is encoded here as technically this is not
541 required (although it does make decoding a bit slower). The compressed
542 length alone is sufficient to work out the uncompressed length after
543 decompressing.
544
545 Guard bytes in the input stream are 'escaped' by replacing them with
546 the guard byte followed by zero. Guard bytes in a parameterised run
547 (ie X copies of Y where Y contains the guard) do not need to be 'escaped'.
548
549 Input data:
550 10 12 12 13 12 13 12 13 12 13 14
551
552 Output data:
553 3 (xrle format)
554 2 (size of blocks to compare)
555 12 (guard, 12 is a bad choice but illustrative)
556 10 12 0 12 4 12 13 14 (rle data)
557
558
559 XRLE2 (#4) - word aligned multi-byte run-length encoding
560 ----------
561 Version 1.3 onwards
562
563 Byte number 0 1 RSZ multiple of RSZ
564 +--+-----+---------+-- - - - - - - - - - ---+
565 Hex values | 4| RSZ | padding | run length encoded data|
566 +--+-----+---------+-- - - - - - - - - - ---+
567
568 This achieves the same goal as XRLE, but is designed to maintain data
569 aligned to specific 'record size' boundaries. This sometimes has
570 benefits over XRLE in that a subsequent interlaced deflate entropy
571 encoding may work better on record-aligned data streams.
572
573 The first byte holds the format (#4) while the record size (RSZ) is
574 held in the second byte. In order to ensure the entire block of data
575 is aligned on 'RSZ' boundaries RSZ-2 padding bytes are written out
576 before the data itself starts. The contents of these bytes can be
577 anything.
578
579 Unlike XRLE it also does not use an explicit guard byte. If we term a
580 'word' to be a block of data of size RSZ, then whenever we read a word
581 which is identical to the last word written then we write out that
582 word (so we have two consecutive words in the output data) followed by
583 a counter of how many additional copies of that word are found, up to
584 255. This counter consists of 1 byte indicating the number of
585 additional copies of the word followed by RSZ-1 padding bytes to
586 maintain word alignment. While the contents of these padding bytes may
587 be anything, it is suggested that they adhere to same value
588 distribution as observed elsewhere in the data block in order to keep
589 the data entropy low. (For example repeating the previous bytes from
590 'word' will do.)
591
592 Example:
593
594 Input data, taken in pairs:
595 1 0 2 2 2 2 3 1 3 1 3 1 2 4 2 4 2 4 2 3
596
597 Output data:
598 4 2 (xrle2 format, rec size 2)
599 1 0 ("1 0" from input)
600 2 2 2 2 0 2 ("2 2" x 2)
601 3 1 3 1 1 1 ("3 1" x 3)
602 2 4 2 4 1 4 ("2 4" x 3)
603 2 3 ("2 3")
604
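A Python sketch of an XRLE2 decoder following the rules above (it assumes, as the example implies, that run detection restarts after a counter word is consumed):

```python
def xrle2_decode(block):
    assert block[0] == 4                 # XRLE2 format byte
    rsz = block[1]
    pad = max(rsz - 2, 0)                # realign the data on RSZ boundaries
    data = block[2 + pad:]
    out = bytearray()
    prev = None
    i = 0
    while i < len(data):
        word = bytes(data[i:i + rsz]); i += rsz
        out += word
        if word == prev:
            # two identical words in a row => a counter word follows:
            # 1 count byte plus RSZ-1 padding bytes
            extra = data[i]; i += rsz
            out += word * extra
            prev = None                  # restart run detection
        else:
            prev = word
    return bytes(out)
```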
605
606 DELTA1 (#64) - 8-bit delta
607 ------------
608
609 Byte number 0 1 2 N
610 +--+-------------+-- - --+
611 Hex values |40| Delta level | data |
612 +--+-------------+-- - --+
613
614 This technique replaces successive bytes with their differences. The level
615 indicates how many rounds of differencing to apply, which should be between 1
616 and 3. For determining the first difference we compare against zero. All
617 differences are internally performed using unsigned values with
618 automatic wrap-around (taking the bottom 8-bits). Hence 2-1 is 1 and 1-2 is 255.
619
620 For example, with level set to 1:
621
622 Input data:
623 10 20 10 200 190 5
624
625 Output data:
626 64 (delta1 format)
627 1 (level)
628 10 10 246 190 246 71 (delta data)
629
630 For level set to 2:
631
632 Input data:
633 10 20 10 200 190 5
634
635 Output data:
636 64 (delta1 format)
637 2 (level)
638 10 0 236 200 56 81 (delta data)
639
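The examples above can be reproduced with a short Python sketch (one differencing pass per level; illustrative helper names):

```python
def delta1(data, level=1):
    # apply 'level' rounds of byte-wise differencing, wrapping mod 256;
    # the first difference in each round is taken against zero
    out = bytes(data)
    for _ in range(level):
        prev = 0
        nxt = bytearray()
        for b in out:
            nxt.append((b - prev) & 0xff)
            prev = b
        out = bytes(nxt)
    return out

def undelta1(data, level=1):
    # inverse: 'level' rounds of running sums mod 256
    out = bytes(data)
    for _ in range(level):
        prev = 0
        nxt = bytearray()
        for b in out:
            prev = (prev + b) & 0xff
            nxt.append(prev)
        out = bytes(nxt)
    return out
```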
640
641 DELTA2 (#65) - 16-bit delta
642 ------------
643
644 Byte number 0 1 2 N
645 +--+-------------+-- - --+
646 Hex values |41| Delta level | data |
647 +--+-------------+-- - --+
648
649 This format is as data format 64 except that the input data is read in 2-byte
650 values, so we take the difference between successive 16-bit numbers. For
651 example "0x10 0x20 0x30 0x10" (4 8-bit numbers; 2 16-bit numbers) yields "0x10
652 0x20 0x1f 0xf0". All 16-bit input data is assumed to be aligned to the start
653 of the buffer and is assumed to be in big-endian format.
654
655
656 DELTA4 (#66) - 32-bit delta
657 ------------
658
659 Byte number 0 1 2 3 4 N
660 +--+-------------+--+--+-- - --+
661 Hex values |42| Delta level | 0| 0| data |
662 +--+-------------+--+--+-- - --+
663
664
665 This format is as data formats 64 and 65 except that the input data is read in
666 4-byte values, so we take the difference between successive 32-bit numbers.
667
668 Two padding bytes (2 and 3) should always be set to zero. Their purpose is to
669 make sure that the compressed block is still aligned on a 4-byte boundary
670 (hence making it easy to pass straight into the 32to8 filter).
671
672
673 Data format 67-69/0x43-0x45 - reserved
674 ---------------------------
675
676 At present these are reserved for dynamic differencing where the 'level' field
677 varies - applying the appropriate level for each section of data. Experimental
678 at present...
679
680
681 16TO8 (#70) - 16 to 8 bit conversion
682 -----------
683
684 Byte number 0
685 +--+-- - --+
686 Hex values |46| data |
687 +--+-- - --+
688
689 This method assumes that the input data is a series of big endian 2-byte
690 signed integer values. If the value is in the range of -127 to +127 inclusive
691 then it is written as a single signed byte in the output stream, otherwise we
692 write out -128 followed by the 2-byte value (in big endian format). This
693 method works well following one of the delta techniques as most of the 16-bit
694 values are typically then small enough to fit in one byte.
695
696 Example input data:
697 0 10 0 5 -1 -5 0 200 -4 -32 (bytes)
698 (As 16-bit big-endian values: 10 5 -5 200 -800)
699
700 Output data:
701 70 (16-to-8 format)
702 10 5 -5 -128 0 200 -128 -4 -32
703
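A Python sketch of both directions of this transform (illustrative names; input is a stream of big-endian signed 16-bit values):

```python
def to8(data):
    out = bytearray()
    for i in range(0, len(data), 2):
        v = int.from_bytes(data[i:i + 2], "big", signed=True)
        if -127 <= v <= 127:
            out.append(v & 0xff)                 # fits in one signed byte
        else:
            out.append(0x80)                     # -128 acts as the escape
            out += v.to_bytes(2, "big", signed=True)
    return bytes(out)

def from8(data):
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]; i += 1
        if b == 0x80:                            # escaped: next 2 bytes verbatim
            out += data[i:i + 2]; i += 2
        else:
            v = b - 256 if b > 127 else b
            out += v.to_bytes(2, "big", signed=True)
    return bytes(out)
```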
704
705 32TO8 (#71) - 32 to 8 bit conversion
706 -----------
707
708 Byte number 0
709 +--+-- - --+
710 Hex values |47| data |
711 +--+-- - --+
712
713 This format is similar to format 16TO8, but we are reducing 32-bit numbers (big
714 endian) to 8-bit numbers.
715
716
717 FOLLOW1 (#72) - "follow" predictor
718 -------------
719
720 Byte number 0 1 FF 100 101 N
721 +--+-- - - - --+-- - --+
722 Hex values |48| follow bytes | data |
723 +--+-- - - - --+-- - --+
724
725 For each symbol we compute the most frequent symbol following it. This is
726 stored in the "follow bytes" block (256 bytes). The first character in the
727 data block is stored as-is. Then for each subsequent character we store the
728 difference between the predicted character value (obtained by using
729 follow[previous_character]) and the real value. This is a very crude, but
730 fast, method of removing some residual non-randomness in the input data and so
731 will reduce the data entropy. It is best to use this prior to entropy encoding
732 (such as huffman encoding).
733
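A Python sketch of the predictor (illustrative names; the follow table is built from the data being encoded, as described):

```python
def follow1_encode(data):
    # follow[c] = most frequent byte seen immediately after byte value c
    counts = [[0] * 256 for _ in range(256)]
    for a, b in zip(data, data[1:]):
        counts[a][b] += 1
    follow = bytes(max(range(256), key=counts[c].__getitem__)
                   for c in range(256))
    out = bytearray([0x48]) + follow        # format byte + 256 follow bytes
    prev = None
    for b in data:
        # first byte as-is, then difference from the predicted byte
        out.append(b if prev is None else (b - follow[prev]) & 0xff)
        prev = b
    return bytes(out)

def follow1_decode(block):
    follow = block[1:257]
    out = bytearray()
    prev = None
    for b in block[257:]:
        v = b if prev is None else (b + follow[prev]) & 0xff
        out.append(v)
        prev = v
    return bytes(out)
```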
734
735 CHEB445 (#73) - floating point 16-bit chebyshev polynomial predictor
736 -------------
737 Version 1.1 only.
738 Deprecated: replaced by format 74 in Version 1.2.
739
740 WARNING: This method was experimental and has been replaced with an
741 integer equivalent. The floating point method may give system specific
742 results.
743
744 Byte number 0 1 2 N
745 +--+--+-- - --+
746 Hex values |49| 0| data |
747 +--+--+-- - --+
748
749 This method takes big-endian 16-bit data and attempts to curve-fit it using
750 chebyshev polynomials. The exact method employed uses the 4 preceeding values
751 to calculate chebyshev polynomials with 5 coefficents. Of these 5 coefficients
752 only 4 are used to predict the next value. Then we store the difference
753 between the predicted value and the real value. This procedure is repeated
754 throughout each 16-bit value in the data. The first four 16-bit values are
755 stored with a simple 1-level 16-bit delta function. Reversing the predictor
756 follows the same procedure, except now adding the differences between stored
757 value and predicted value to get the real value.
758
759
760 ICHEB (#74) - integer based 16-bit chebyshev polynomial predictor
761 -----------
762 Version 1.2 onwards
763 This replaces the floating point CHEB445 format in ZTR v1.1.
764
765
766 Byte number 0 1 2 N
767 +--+--+-- - --+
768 Hex values |4A| 0| data |
769 +--+--+-- - --+
770
771 This method takes big-endian 16-bit data and attempts to curve-fit it using
772 chebyshev polynomials. The exact method employed uses the 4 preceeding values
773 to calculate chebyshev polynomials with 5 coefficents. Of these 5 coefficients
774 only 4 are used to predict the next value. Then we store the difference
775 between the predicted value and the real value. This procedure is repeated
776 throughout each 16-bit value in the data. The first four 16-bit values are
777 stored with a simple 1-level 16-bit delta function. Reversing the predictor
778 follows the same procedure, except now adding the differences between stored
779 value and predicted value to get the real value.

STHUFF (#77) - Interlaced Deflate
------------
Version 1.3 onwards

Byte number   0  1  2                    N
             +--+--+-- - - - - - --+-- - - --+
Hex values   |4D| C| huffman codes | data    |
             +--+--+-- - - - - - --+-- - - --+

This compresses data using Huffman encoding, using the Deflate
algorithm for storing the codes and data. It is analogous to using
zlib with the Z_HUFFMAN_ONLY strategy and a negative window
size. However it has a few tweaks for optimal compression of very
small data sets. See RFC 1951 for details of Deflate. If the following
text disagrees with RFC 1951 then the RFC takes priority. The
following is included as additional explanatory material only.

Huffman compression works by replacing each character (or 'symbol')
with a string of bits. Common symbols are encoded using few bits
and rare symbols need a longer string of bits. The net effect is that
the overall number of bits needed to store a message is reduced.

To uncompress a compressed data stream it is necessary to know which
symbols are present and what their bit-strings are. For brevity this
is achieved by storing only the length of the bit-string for each
symbol and generating the bit-strings from the lengths. As long as the
same canonical algorithm is used in both the encoder and decoder then
knowing the lengths alone is sufficient. Knowledge of this algorithm
is required for uncompressing the data, so it is defined as follows:

1. Sort symbols by the length of their bit-strings, smallest first.

   The collating order for symbols sharing the same length is defined
   as ASCII values 0 to 255 inclusive followed by the EOF symbol.

2. X = 0

3. For all bit lengths 'L' from 1 to 24 inclusive:

       For all symbols of bit length 'L', sorted as above:
           Code(Symbol) = least significant 'L' bits of X
           X = X + 1
       End loop

       X = X * 2

   End loop

This is the same algorithm utilised in the Deflate algorithm (RFC 1951).
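The canonical code construction above can be sketched in Python. This is an
illustrative implementation of the algorithm as stated, not code taken from
io_lib; shifting X by the difference in lengths is equivalent to the
"X = X * 2" doubling performed once per length step:

```python
def canonical_codes(bit_lengths):
    """Assign canonical Huffman codes from symbol bit lengths (RFC 1951).

    bit_lengths maps symbol -> code length in bits.  Symbols of equal
    length are taken in collating order: byte values 0..255, then EOF.
    Returns symbol -> (length, code value as an integer).
    """
    def key(sym):
        # EOF collates after all 256 byte values.
        return (bit_lengths[sym], 256 if sym == "EOF" else sym)

    codes = {}
    x = 0
    current_len = 0
    for sym in sorted(bit_lengths, key=key):
        l = bit_lengths[sym]
        x <<= (l - current_len)   # one doubling per length step
        current_len = l
        codes[sym] = (l, x)
        x += 1
    return codes
```

Running this on the "abracadabra" lengths from the worked example below
reproduces the listed codes (a=0, b=100, c=101, r=110, d=1110, EOF=1111).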


For example compressing "abracadabra" gives:        /\
                                                   0  1
    Symbol  bit-length  Code(X)                   /    \
    -------------------------------             a       /\
    a       1            0  0                          /  \
    b       3            4  100                       0    1
    c       3            5  101                      /      \
    r       3            6  110                     /        \
    d       4           14  1110                  /\          /\
    EOF     4           15  1111                 0  1        0  1
                                                /    \      /    \
which in turn leads to 28 bits                 b      c    r      /\
of output:                                                       0  1
                                                                /    \
0100110010101110010011001111                                   d      EOF
(ab  r  ac ad ab  r  aEOF)


In the data format defined above, 'C' is a code-set number. If it is
zero then the huffman codes to uncompress 'data' are stored in the
following bytes using the same format described in the DFLH chunk type
below. Otherwise no huffman codes are stored and a predefined set of
huffman codes is used, either defined in a preceding DFLH chunk
(for 128 <= 'C' <= 255) or statically defined in this document
(for 1 <= 'C' <= 127). Immediately following this is the
compressed bit-stream itself.

The statically defined huffman code-sets are as follows. The symbols
are listed below as their printable ASCII character or hash followed
by a number, so A and #65 are the same symbol. We use the algorithm
described above to turn these bit-lengths into actual huffman codes.

C=1: CODE_DNA

    Length  Symbols
    ----------------
    2       A C T
    3       G
    4       N
    5       #0
    6       EOF
    13      #1 to #6 inclusive
    14      #7 to #255 except where already listed above

C=2: CODE_DNA_AMBIG (DNA with IUPAC ambiguity codes)

    Length  Symbols
    ----------------
    2       A C T
    3       G
    4       N
    7       #0 #45
    8       B D H K M R S V W Y
    11      EOF
    14      #226
    15      #1 to #255 except where already listed above

C=3: CODE_ENGLISH (English text)

    Length  Symbols
    ----------------
    3       #32 e
    4       a i n o s t
    5       d h l r u
    6       #10 #13 #44 c f g m p w y
    7       #46 b v
    8       #34 I k
    9       #45 A N T
    10      #39 #59 #63 B C E H M S W x
    11      #33 0 1 F G
    15      #0 to #255 except where already listed above


It is recommended that this compression format is used only for small
data sizes, with ZLIB used for larger data (a few kilobytes and above).


QSHIFT (#79) - 4-byte quality reorder
------------
Version 1.3 onwards

This reorders the quality signal to be 4-tuples of the quality for the
called base followed by the quality of the other 3 base types in the
order they appear in a,c,g,t (minus the called base).

The purpose is to allow a 4-byte interlaced deflate algorithm to
operate efficiently.
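As an illustration only (the real transform operates on the raw byte
stream), the QSHIFT reordering might be sketched like this in Python,
assuming per-base-type quality lists and calls restricted to A, C, G and T:

```python
def qshift(calls, quals):
    """Reorder per-base qualities into 4-tuples: the called base's
    quality first, then the other three base types in A,C,G,T order.

    calls - string of called bases (assumed to be A/C/G/T only)
    quals - dict mapping 'A'/'C'/'G'/'T' to per-position quality lists
    Returns a flat list of reordered quality values.
    """
    out = []
    for i, call in enumerate(calls):
        out.append(quals[call][i])                          # called base first
        out.extend(quals[b][i] for b in "ACGT" if b != call)  # remaining three
    return out
```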


TSHIFT (#70) - 8-byte trace reorder
------------
Version 1.3 onwards

This reorders the trace signal to be 4-tuples of the 16-bit trace
signals for the called base followed by the signal from the other 3
base types in the order they appear in a,c,g,t (minus the called
base).

The purpose is to allow an 8-byte interlaced deflate algorithm to
operate efficiently.

FIXME: QSHIFT and TSHIFT could be general purpose byte rearrangements
without any knowledge of the data type they're holding. They need the
input data to be consistently ordered and not the large differences we
see between quality and trace right now.


Chunk types
===========

As described above, each chunk has a type. The format of the data contained in
the chunk data field (when written in format 0) is described below.
Note that no chunks are mandatory. It is valid to have no chunks at all.
However some chunk types may depend on the existence of others. This will be
indicated below, where applicable.

Each chunk type is stored as a 4-byte value. Bit 5 of the first byte is used
to indicate whether the chunk type is part of the public ZTR spec (bit 5 of
first byte == 0) or is a private/custom type (bit 5 of first byte == 1). Bit
5 of each of the remaining 3 bytes is reserved and must always be set to
zero.

Practically speaking this means that public chunk types consist entirely of
upper case letters (eg TEXT) whereas private chunk types start with a
lowercase letter (eg tEXT). Note that in this example TEXT and tEXT are
completely independent types and they may have no more relationship with each
other than (for example) TEXT and BPOS types.
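Since bit 5 (value 0x20) is the ASCII lowercase bit, the public/private
test reduces to a single mask. A minimal sketch (illustrative, not io_lib
code):

```python
def is_private_type(chunk_type: bytes) -> bool:
    """True if a 4-byte chunk type is private/custom: bit 5 of the
    first byte is set, i.e. the first character is lowercase ASCII."""
    return bool(chunk_type[0] & 0x20)
```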

It is valid to have multiples of some chunks (eg text chunks), but not for
others (such as base calls). The order of chunks does not matter unless
explicitly specified.

A chunk may have meta-data associated with it. This is data about the data
chunk. For example the data chunk could be a series of 16-bit trace samples,
while the meta-data could be a label attached to that trace (to distinguish
trace A from traces C, G and T). Meta-data is typically very small and so it
never needs to be compressed in any of the public chunk types (although
meta-data is specific to each chunk type and so it would be valid to have
private chunks with compressed meta-data if desirable).

The first byte of each chunk data when uncompressed must be zero, indicating
raw format. If, having read the chunk data, this is not the case then the
chunk needs decompressing or reverse filtering until the first byte is
zero. There may be a few padding bytes between the format byte and the first
element of real data in the chunk. This is to make file processing simpler
when the chunk data consists of 16 or 32-bit words; the padding bytes ensure
that the data is aligned to the appropriate word size. Any padding bytes
required will be listed in the appropriate chunk definition below.


The following lists the chunk types available in 32-bit big-endian format.
In all cases the data is presented in the uncompressed form, starting with the
raw format byte and any appropriate padding.

SAMP
----

Meta-data: (version 1.2 and before)
Byte number   0  1  2  3
             +--+--+--+--+
Hex values   | data name |
             +--+--+--+--+

Data:
Byte number   0  1  2  3  4  5  6  7     N
             +--+--+--+--+--+--+--+--+- -+
Hex values   | 0| 0| data| data| data| - |
             +--+--+--+--+--+--+--+--+- -+

This encodes a series of 16-bit unsigned trace samples. The first data
byte is the format (raw); the second data byte is present for padding
purposes only. After that comes a series of 16-bit big-endian
values. Although stored as unsigned, a baseline value can be
specified which should then be subtracted from all values to
generate signed data if required. By default the baseline is zero.

Valid identifiers for the meta-data (version 1.3 onwards) are:

Ident  Value(s)
---------------------------------------------------------------------
TYPE   "A", "C", "G", "T", "PYNO" or "PYRW"
OFFS   16-bit signed integer representing the 'zero' position,
       in ASCII.

[ FIXME: signed or unsigned? Signed means we couldn't store data in
the range from -48K to +16K. Unsigned means we couldn't store data in
the range 10K to 70K. What's most useful? Or should OFFS be 32-bit
instead? ]

Versions prior to 1.3 specified that meta-data consisted of a single 4-byte
block containing a 4-byte name associated with the trace. If a
type-name is shorter than 4 bytes then it should be right padded with
nul characters to 4 bytes. For sequencing traces the four lanes
representing A, C, G and T signals have names "A\0\0\0", "C\0\0\0",
"G\0\0\0" and "T\0\0\0". PYNO and PYRW refer to normalised and raw
pyrogram data (eg from 454 instruments). At present other names are
not reserved, but it is recommended that (for consistency with
elsewhere) you label private trace arrays with names starting in a
lowercase letter (specifically, bit 5 is 1).

For the purposes of backwards compatibility, readers should check the
version number in the ZTR header to determine whether the old or new
style meta-data formatting is in use.

For sequencing traces it is expected that there will be four SAMP chunks,
although the order is not specified.
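A minimal sketch of decoding a raw-format SAMP chunk body in Python
(illustrative only; the OFFS baseline is passed in here as an already-parsed
integer):

```python
import struct

def decode_samp(data, offs=0):
    """Decode a raw-format SAMP chunk body: format byte 0, one padding
    byte, then big-endian unsigned 16-bit samples.  'offs' is the OFFS
    baseline, subtracted from every sample to recover signed values."""
    if data[0] != 0:
        raise ValueError("chunk is still compressed or filtered")
    n = (len(data) - 2) // 2
    samples = struct.unpack(">%dH" % n, data[2:2 + 2 * n])
    return [s - offs for s in samples]
```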


SMP4
----

Meta-data: optional - see below

Data:
Byte number   0  1  2  3  4  5  6  7     N
             +--+--+--+--+--+--+--+--+- -+
Hex values   | 0| 0| data| data| data| - |
             +--+--+--+--+--+--+--+--+- -+


As per SAMP, this encodes a series of unsigned 16-bit trace values, to
be base-line corrected by the OFFS meta-data value as appropriate.

The first byte is 0 (raw format). Next is a single padding byte (also 0).
Then follows a series of 2-byte big-endian trace samples for the "A" trace,
followed by a series of 2-byte big-endian trace samples for the "C" trace,
followed in turn by the "G" and "T" traces (in that order). The assumption is
made that there is the same number of data points for all traces and hence the
length of each trace is simply the number of data elements divided by four.

Experimentation has shown that this gives around a 3% saving over 4
separate SAMP chunks, but it lacks in flexibility.
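The A-then-C-then-G-then-T layout can be deinterleaved as follows (an
illustrative sketch, not io_lib code):

```python
import struct

def decode_smp4(data):
    """Split a raw SMP4 body into four equal-length traces.
    Layout: format byte 0, padding byte, then all A samples, all C
    samples, all G samples, all T samples as big-endian 16-bit values."""
    vals = struct.unpack(">%dH" % ((len(data) - 2) // 2), data[2:])
    n = len(vals) // 4                      # per-trace length
    return {b: list(vals[i * n:(i + 1) * n]) for i, b in enumerate("ACGT")}
```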

Valid identifiers for the meta-data are:

Ident  Value(s)
---------------------------------------------------------------------
OFFS   16-bit signed integer representing the 'zero' position
TYPE   The type of data-set encoded. Values can be:
       "PROC" - processed data for viewing, also the default
                when no type field is found.
       "SLXI" - Illumina GA raw intensities (.int.txt files)
       "SLXN" - Illumina GA noise intensities (.nse.txt files)


BASE
----

Meta-data: optional - see below

Data:
Byte number   0  1  2  3      N
             +--+--+--+-- - --+
Hex values   | 0| base calls  |
             +--+--+--+-- - --+

The first byte is 0 (raw format). This is followed by the base calls in ASCII
format (one base per byte). By default it is assumed that all base
calls are stored using the IUPAC characters[1].

Valid identifiers for the meta-data are:

Ident  Meaning        Value(s)
---------------------------------------------------------------------
CSET   Character-set  'I' (ASCII #73) => IUPAC ("ACGTUMRWSYKVHDBN")
                      '0' (ASCII #48) => ABI SOLiD ("0123N")

BPOS
----

Meta-data: none present

Data:
Byte number   0  1  2  3  4  5  6  7        N
             +--+--+--+--+--+--+--+--+- -+--+--+--+--+
Hex values   | 0| padding| data      | - | data      |
             +--+--+--+--+--+--+--+--+- -+--+--+--+--+

This chunk contains the mapping of base call (BASE) numbers to sample (SAMP)
numbers; it defines the position of each base call in the trace data. The
position here is defined as the numbering of the 16-bit positions held in the
SAMP array, counting zero as the first value.

The format is 0 (raw format) followed by three padding bytes (all 0). Next
follows a series of 4-byte big-endian numbers specifying the position of each
base call as an index into the sample arrays (when considered as a 2-byte
array with the format header stripped off).

Excluding the format and padding bytes, the number of 4-byte elements should
be identical to the number of base calls. All sample numbers are counted from
zero. No sample number in BPOS should be beyond the end of the SAMP arrays
(although it should not be assumed that the SAMP chunks will be before this
chunk). Note that the BPOS elements may not be totally in sorted order as
the base calls may be shifted relative to one another due to compressions.
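A minimal decoding sketch for a raw BPOS body (illustrative, not io_lib
code):

```python
import struct

def decode_bpos(data):
    """Decode a raw BPOS body: format byte 0, three padding bytes, then
    one big-endian unsigned 32-bit sample index per base call."""
    n = (len(data) - 4) // 4
    return list(struct.unpack(">%dI" % n, data[4:4 + 4 * n]))
```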


CNF1
----

Meta-data: optional - see below

Data:
Byte number   0  1               N
             +--+-- - --+--+
Hex values   | 0| call confidence |
             +--+-- - --+--+

(N == number of bases in BASE chunk)

Valid identifiers for the meta-data are:

Ident  Value(s)  Meaning
---------------------------------------------------------------------
SCALE  PH        Phred-scaled confidence values. (Default). i.e. for
                 a call with probability p: -10*log10(1-p)
       LO        Log-odds scaled values. ie: 10*log10(p/(1-p))


The first byte of this chunk is 0 (raw format). This is then followed by a
series of signed 8-bit confidence values for the called bases.

Either phred or log-odds (as used by the Illumina GA) scale ranges are
appropriate.
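The two scales above are simple functions of the call probability p; a
small sketch for reference:

```python
import math

def phred(p):
    """Phred scale: -10*log10(1-p) for call probability p."""
    return -10 * math.log10(1 - p)

def log_odds(p):
    """Log-odds scale: 10*log10(p/(1-p))."""
    return 10 * math.log10(p / (1 - p))
```

For example a call probability of 0.99 gives a phred value of 20, while a
probability of 0.5 gives a log-odds value of 0.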


CNF4
----

Meta-data: optional - see below

Data:
Byte number   0  1               N            4N
             +--+-- - --+--+----- - -----+
Hex values   | 0| call confidence | A/C/G/T conf |
             +--+-- - --+--+----- - -----+

(N == number of bases in BASE chunk)

Valid identifiers for the meta-data are:

Ident  Value(s)  Meaning
---------------------------------------------------------------------
SCALE  PH        Phred-scaled confidence values. i.e. for a call
                 with probability p: -10*log10(1-p)
                 (NB: default, but often inappropriate.)
       LO        Log-odds scaled values. ie: 10*log10(p/(1-p))


The first byte of this chunk is 0 (raw format). This is then followed by a
series of signed 8-bit confidence values for the called base. Next comes
all the remaining confidence values for A, C, G and T excluding those
that have already been written (ie the called base). So for a sequence
AGT we would store confidences A1 G2 T3 C1 G1 T1 A2 C2 T2 A3 C3 G3.

The purpose of this is to group the (likely) highest confidence values (those
for the called base) at the start of the chunk followed by the remaining
values. Hence if phred confidence values are written in a CNF4 chunk the first
quarter of the chunk will consist of phred confidence values and the last three
quarters will (assuming no ambiguous base calls) consist entirely of zeros.

For the purposes of storage the confidence value for a base call that is not
A, C, G or T (in any case) is stored as if the base call was T.
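The reordering can be sketched as follows (illustrative only; real data is
a byte stream, and here the per-base-type confidences are supplied as
lists):

```python
def cnf4_order(calls, conf):
    """CNF4 storage order: confidences of the called bases first, then
    for each base the remaining A,C,G,T confidences in order, skipping
    the called base.  Non-A/C/G/T calls are treated as T.
    conf maps 'A'/'C'/'G'/'T' -> list of per-position confidences."""
    norm = [c if c in "ACGT" else "T" for c in calls.upper()]
    out = [conf[c][i] for i, c in enumerate(norm)]        # called bases first
    for i, c in enumerate(norm):
        out.extend(conf[b][i] for b in "ACGT" if b != c)  # the other three
    return out
```

With the AGT example this yields the A1 G2 T3 C1 G1 T1 A2 C2 T2 A3 C3 G3
order given above.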

If only one confidence value exists per base then either the phred or
log-odds scales work well. The first N bytes will be the confidences for the
called bases and the remaining 3*N will be zero (optimal for
run-length-encoding), but consider using the CNF1 chunk type instead in this
situation.

If all 4 base types have their own confidence value then the log-odds
scale will work well. In this case the phred scale is an inappropriate
choice as it cannot encode both very likely and very unlikely events.

Note: if this chunk exists it must exist after a BASE chunk.

TEXT
----

Meta-data: none present

Data:
Byte number   0
             +--+- - -+--+- - -+--+- -+- - -+--+- - -+--+-----+
Hex values   | 0| ident | 0| value | 0| - | ident | 0| value | 0| (0) |
             +--+- - -+--+- - -+--+- -+- - -+--+- - -+--+-----+

This contains a series of "identifier\0value\0" pairs.

The identifiers and values may be any length and may contain any data
except the nul character. The nul character marks the end of the
identifier or the end of the value. Multiple identifier-value pairs
are allowable. Prior to version 1.3 a double nul character marked the
end of the list (labeled "(0)" above), but from version 1.3 the end
of the list may also be marked by the end of the chunk.
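Parsing such a body is a straightforward split on nul bytes; a minimal
sketch handling both the double-nul and end-of-chunk terminations:

```python
def parse_text(data):
    """Parse a raw TEXT body into an ident -> value dict.  Fields are
    nul-terminated; the list ends at a double nul (pre-1.3) or simply
    at the end of the chunk (1.3 onwards)."""
    fields = data[1:].split(b"\0")        # skip the format byte
    pairs = {}
    it = iter(fields)
    for ident in it:
        if not ident:                     # empty field: end-of-list marker
            break
        pairs[ident.decode()] = next(it, b"").decode()
    return pairs
```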

Identifiers starting with bit 5 clear (uppercase) are part of the public ZTR
spec. Any public identifier not listed as part of this spec should be
considered as reserved. Identifiers that have bit 5 set (lowercase) are for
private use and no restriction is placed on these.

Multiple TEXT chunks may exist within the ZTR file. If so they are
considered to be concatenated together.

See below for the text identifier list.

CLIP
----

Meta-data: none present

Data:
Byte number   0  1  2  3  4  5  6  7  8
             +--+--+--+--+--+--+--+--+--+
Hex values   | 0| left clip | right clip|
             +--+--+--+--+--+--+--+--+--+

This contains suggested quality clip points. These are stored as zero (raw
data) followed by a 4-byte big endian value for the left clip point and a
4-byte big endian value for the right clip point. Clip points are defined in
units of base calls, starting from 0. (Q: is that correct!?)
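A minimal decoding sketch (illustrative only):

```python
import struct

def decode_clip(data):
    """Decode a raw CLIP body: format byte 0 then two big-endian
    unsigned 32-bit values, the left and right quality clip points
    (in units of base calls)."""
    left, right = struct.unpack(">II", data[1:9])
    return left, right
```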



CR32
----

Meta-data: none present

Data:
Byte number   0  1  2  3  4
             +--+--+--+--+--+
Hex values   | 0| CRC-32    |
             +--+--+--+--+--+

This chunk is always just 4 bytes of data containing a CRC-32 checksum,
computed according to the widely used ANSI X3.66 standard. If present, the
checksum will be a check of all of the data since the last CR32 chunk.
This will include checking the header if this is the first CR32 chunk, and
including the previous CR32 chunk if it is not. Obviously the checksum will
not include checks on this CR32 chunk itself.
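A sketch of building such a chunk body in Python. This assumes the common
CRC-32 variant as implemented by zlib.crc32 (ISO 3309 / ANSI X3.66), and
the big-endian storage order of the checksum is an assumption here, not
something this section states:

```python
import struct
import zlib

def cr32_chunk(data_since_last):
    """Build a CR32 chunk body: format byte 0 plus the CRC-32 of
    everything since the previous CR32 chunk (or the file start),
    stored big-endian (assumed byte order)."""
    crc = zlib.crc32(data_since_last) & 0xffffffff
    return b"\x00" + struct.pack(">I", crc)
```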


COMM
----

Meta-data: none present

Data:
Byte number   0  1         N
             +--+-- - --+
Hex values   | 0| free text |
             +--+-- - --+

This allows arbitrary textual data to be added. It does not require an
identifier-value pairing or any nul termination.


DFLH
----

Meta-data: none present

Data:
Byte number   0  1                        N
             +--+--+-- - - - - - - - - - - --+
Hex values   | 0| C| Deflate format data ... |
             +--+--+-- - - - - - - - - - - --+

'C' is the code-set number referred to within that compression method.
It should be 128 onwards and is used to distinguish between multiple
huffman tables. It is used in conjunction with the data compression
format 77 ("Deflate").

Following this is data in the Deflate format (RFC 1951). This should
consist of the header for a single block using dynamic huffman with
the BFINAL (last block) flag set.

In Deflate streams the end of the huffman codes and the start of
the compressed data stream itself may occur part way through a
byte. Therefore the last byte of this block is bitwise ORed
with the first byte of the data stream compressed with reference to
this code-set number, and so all unused bits in the last byte of
this block should be set to zero. Likewise if the data bit-stream in
this block ends on an exact byte boundary then an additional blank
byte must be added to ensure the ORing method above still works.


DFLC
----

Meta-data: none present

Data:
Byte number   0
             +--+---+- - - - ---+--+-- - - - - - - - - - - - --+
Hex values   | 0| C |code-order |FF| Deflate dynamic codes ... |
             +--+---+- - - - ---+--+-- - - - - - - - - - - - --+

Multi-context Deflate compression codes defined for use by data format
78 (HUFF_MULTI).

This is like the DFLH format, except it encodes multiple huffman trees
instead of a single tree, along with the order in which the multiple
trees should be used (the "code-order").

'C' is the code-set number referred to within that compression method.
It should be 128 onwards and is used to distinguish between multiple
huffman tables.

The code-order is a run-length encoded series of 8-bit numbers
indicating which huffman code set should be used for which byte. For
each byte in the input stream the HUFF_MULTI method selects the
appropriate huffman code by indexing code-order with the input
data position modulo the number of values in code-order.

Following this is data in the Deflate format (RFC 1951). This should
consist of the header component for a single block using dynamic
huffman with the BFINAL (last block) flag set, up to and including
the HDIST+1 code lengths for the distance alphabet. This will then be
immediately followed by the next set of huffman codes, and so on until
all index values contained within the code-order have been accounted
for.

In Deflate streams the end of the huffman codes and the start of
the compressed data stream itself may occur part way through a
byte. Therefore the last byte of this block is bitwise ORed
with the first byte of the data stream compressed with reference to
this code-set number, and so all unused bits in the last byte of
this block should be set to zero. Likewise if the data bit-stream in
this block ends on an exact byte boundary then an additional blank
byte must be added to ensure the ORing method above still works.


For example, compression of 16-bit data is sometimes best achieved by
producing one set of huffman codes for the top 8 bits and another set
for the bottom 8 bits, rather than mixing these together by treating
the 16-bit data as a series of 8-bit quantities. In this case our
code-order would consist of just two entries; (0, 1).

Alternatively we may have 4 1-byte confidence values stored per base
in the order of the confidence of the base-called base type first
followed by the 3 remaining confidence values. We observe that
compressing bytes 0, 4, 8, 12, ... as one set and bytes 1,2,3, 5,6,7,
... as another set yields higher compression ratios. In this case the
code-order would consist of 4 entries; (0, 1, 1, 1).


REGN
----

Meta-data: optional - see below

Data:
Byte number   0  1   2   3   4   5   6   7   8
             +--+---+---+---+---+---+---+---+---+
Hex values   | 0| 1st boundary  | 2nd boundary  | ...
             +--+---+---+---+---+---+---+---+---+

This chunk is used to break a trace down into a series of segments. We
store the boundary between segments, so the list above will contain
one less boundary than there are segments, with the first segment
implicitly starting from the first base and the last segment implicitly
extending to the last base.

Each 4-byte unsigned value indicates a position within the sequence or
trace, counting from 0 as the first element and marking the first base
of the next region. For example three regions of DNA may be:

 0 1 2 3 4 5 6 7 8 9 10 11 12
 T A C G G A T T C G A  A  C
|<-reg. 1->| |<--reg. 2--->| |<-reg. 3->|

This would give the 1st boundary as 4 and the 2nd boundary as 9.

The lack of a REGN chunk implies one single region extending from the
first to last base in the sequence.
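Converting a boundary list into explicit regions is simple; a sketch
using half-open (start, end) pairs (illustrative, not io_lib code):

```python
def regions(boundaries, length):
    """Turn REGN boundary positions into half-open (start, end) regions
    over a sequence of the given length.  Each boundary marks the first
    base of the following region; no boundaries means one region."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [length]
    return list(zip(starts, ends))
```

With the TACGGATTCGAAC example above, boundaries (4, 9) over 13 bases give
the three regions (0,4), (4,9) and (9,13).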

Valid identifiers for the meta-data are:

Ident  Meaning            Value(s)
---------------------------------------------------------------------
COORD  Coordinate system  'T' = trace coordinates
                          'B' = base coordinates (default)

NAME   Region names       A semicolon separated list of
                          "name:code" pairs. Eg
                          primer1:T;read1:P;primer2:T;read2:P

[FIXME: NAME identifier here is the same as the REGION_LIST TEXT
identifier. We need to decide where it belongs and pick one. If we can
get a way to specify the default meta-data contents then logically
speaking the best place to store this is in the meta-data alongside
the chunk data itself.]

The NAME identifier is used to attach a meaning to the regions
described in the data chunk. It consists of a semi-colon separated
list of names or name:code pairs. The codes, if present, are a single
character from the predefined list below and are separated from the
name by a colon.

Code  Meaning
---------------------------------------
T     Tech read (e.g. primer, linker)
B     Bio read
I     Inverted read
D     Duplicate read
P     Paired read

FIXME: I don't like the above meanings. They don't, well, "mean" much
to me! What's a tech read?



Text Identifiers
================

These are for use in the TEXT segments. None are required, but if any of these
identifiers are present they must conform to the description below. Much
(currently all) of this list has been taken from the NCBI Trace Archive [2]
documentation. It is duplicated here as the ZTR spec is not tied to the same
revision schedules as the NCBI trace archive (although it is intended that any
suitable updates to the trace archive should be mirrored in this ZTR spec).

The Trace Archive specifies a maximum length for values. The ZTR spec does not
have length limitations, but for compatibility these sizes should still be
observed.

The Trace Archive also states that some identifiers are mandatory; these are
marked by asterisks below. These identifiers are not mandatory in the ZTR spec
(but clearly they need to exist if the data is to be submitted to the NCBI).

Finally, some fields are not appropriate for use in the ZTR spec, such as
BASE_FILE (the name of a file containing the base calls). Such fields are
included only for compatibility with the Trace Archive. It is not expected
that use of ZTR would allow for the base calls to be read from an external
file instead of the ZTR BASE chunk.

[ Quoted from TraceArchiveRFC v1.17 ]

Identifier          Size  Meaning                      Example value(s)
----------          ----- ---------------------------- -----------------
TRACE_NAME *        250   name of the trace            HBBBA1U2211
                          as used at the center
                          unique within the center
                          but not among centers.

SUBMISSION_TYPE *   -     type of submission

CENTER_NAME *       100   name of center               BCM
CENTER_PROJECT      200   internal project name        HBBB
                          used within the center

TRACE_FILE *        200   file name of the trace       ./traces/TRACE001.scf
                          relative to the top of
                          the volume.

TRACE_FORMAT *      20    format of the tracefile

SOURCE_TYPE *       -     source of the read

INFO_FILE           200   file name of the info file
INFO_FILE_FORMAT    20

BASE_FILE           200   file name of the base calls
QUAL_FILE           200   file name of the quality values


TRACE_DIRECTION     -     direction of the read
TRACE_END           -     end of the template
PRIMER              200   primer sequence
PRIMER_CODE               which primer was used

STRATEGY            -     sequencing strategy
TRACE_TYPE_CODE     -     purpose of trace

PROGRAM_ID          100   creator of trace file        phred-0.990722.h
                          program-version

TEMPLATE_ID         20    used for read pairing        HBBBA2211

CHEMISTRY_CODE      -     code of the chemistry (see below)
ITERATION           -     attempt/redo                 1
                          (int 1 to 255)

CLIP_QUALITY_LEFT         left clip of the read in bp due to quality
CLIP_QUALITY_RIGHT        right  "    "   "   "   "
CLIP_VECTOR_LEFT          left clip of the read in bp due to vector
CLIP_VECTOR_RIGHT         right  "    "   "   "   "


SVECTOR_CODE        40    sequencing vector used (in table)
SVECTOR_ACCESSION   40    sequencing vector used (in table)
CVECTOR_CODE        40    clone vector used (in table)
CVECTOR_ACCESSION   40    clone vector used (in table)

INSERT_SIZE         -     expected size of insert      2000,10000
                          in base pairs (bp)
                          (int 1 to 2^32)

PLATE_ID            32    plate id at the center
WELL_ID                   well                         1-384


SPECIES_CODE *      -     code for species
SUBSPECIES_ID       40    name of the subspecies
                          Is this the same as strain?

CHROMOSOME          8     name of the chromosome       ChrX, Chr01, Chr09


LIBRARY_ID          30    the source library of the clone
CLONE_ID            30    clone id                     RPCI11-1234

ACCESSION           30    NCBI accession number        AC00001

PICK_GROUP_ID       30    an id to group traces picked
                          at the same time.
PREP_GROUP_ID       30    an id to group traces prepared
                          at the same time


RUN_MACHINE_ID      30    id of sequencing machine
RUN_MACHINE_TYPE    30    type/model of machine
RUN_LANE            30    lane or capillary of the trace
RUN_DATE            -     date of run
RUN_GROUP_ID        30    an identifier to group traces
                          run on the same machine

[ End of quote from TraceArchiveRFC ]

More detailed information on the format of these values should be obtained
from the Trace Archive RFC [2].

In addition to the above, the following TEXT identifiers have meaning
specific to the ZTR format:

Identifier   Meaning                      Example value(s)
----------   ---------------------------- -------------------------------
REGION_LIST  A semi-colon separated list  primer1:T;read1:P
             identifying regions of a
             trace. See the REGN chunk    Region 1;Region 2;Region 3
             definition for details.


FIXME: Should this simply be the meta-data associated with the REGN
chunk?



References
==========
[1] IUPAC: http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html

[2] http://www.ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html

[3] J.Bonfield and R.Staden, "ZTR: a new format for DNA sequence trace
    data". Bioinformatics Vol. 18 no. 1 2002.


FIXME: As an aside, not doing the final entropy encoding steps (zlib,
deflate, etc) and just using bzip2 on an entire SRF archive yields a
considerable saving. On tests it varied between 23% (27bp reads) and
13% (74bp reads) smaller than the Deflate compressed
data. Unfortunately it pretty much removes all chance of random access
in the data unless I can get a working FM-Index implementation
(which is very unlikely in a short time). This makes it appropriate
for transmission perhaps, but not for indexing and querying random
sequences.

A substantial chunk (5-9%) of this saving comes from the repeated ZTR
block types (names like "BASE", "CNF4" and common components like 0x00000000
for the meta-data size). The remainder probably comes from
similarities between one ZTR file and another.