
Notes: 28th May 2008

For version 2.0 consider the following:

1) Remove defunct or useless chunk types and compression formats.
2) Rationalise inconsistent behaviour (eg endianness on zlib chunk).
3) Support split header/data formats for SRF
4) Formalise meta-data use better.
5) More pie-in-the-sky ideas?

What we've described so far could easily be said to be v1.4. It's
backwards compatible and fairly minor in change. If we truly want to
go for version 2 then taking the chance to remove all those niggles
that we've kept purely for backwards compatibility would be good.

In more detail:

1) Removal of RLE and floating point chebyshev polynomials. Mark XRLE
   as deprecated?

   We may wish to add an extra option to XRLE2 to indicate the repeat
   count before specifying the remaining run-length. This breaks the
   format though. (Or add XRLE3 to allow such control?)

2) Strange things I can see are:

   2.1) All chunks use big-endian data except for zlib which has a
        little-endian length.

   2.2) The order that data is stored in differs per chunk type. For
        trace data we store all As, then all Cs, all Gs and finally
        all Ts. For confidence values we store called first followed
        by remaining. Both SMP4 and CNF4 essentially hold 1 piece of
        data per base type per base position, it's just the word size
        and packing order that differs.

	This means TSHIFT and QSHIFT compression types are tied very
	much to trace and quality value chunks, rather than being
	generic transforms. Maybe we should always have the same
	encoding order and some standard compression/transformations
	to reorder as desired.

	An example:
	All data related per call is stored in the natural order
	produced. (eg as utilised in CNF1, BPOS).

	All data related per base-type per call is stored in the order
	produced: A, C, G, T for the first base position, A, C, G, T
	for the second position, and so on.

	Then we have standard filters that can swap between
	ACGTACGTACGT... and AAA...CCC...GGG...TTT... or to
	<called><non-called * 3>... order (which requires a BASE chunk
	present to encode/decode). We'd have 1, 2 and 4 byte variants
	of such filters. They do not need to understand the nature of
	the data they're manipulating, just the word size and a
	predetermined order to shuffle the data around in.

	For CNF4 a combination of {ACGT}* to {<called><non-called*3>}*
	followed by {ACGT}* to A*C*G*T* ordering would end up with all
	<called> followed by all 3 remaining non-called. Ie as it is
	now (which we then promptly "undo" in solexa data by using
	TSHIFT).
	
3) I'm wondering if there's mileage here in having negative lengths to
   indicate constant data + variable data further on.

   Eg length -10 means the next 10 bytes are the start of the data for
   this chunk. Some stage later we'll read a 4-byte length followed by
   the remaining data for this chunk.

   Rationale: often we end up with many identical bytes at the start
   of a chunk. For example, we take a solexa trace (0 0 value...), run
   it through TSHIFT (80 0 0 0 previous data => 80 0 0 0 0 0 value ...)
   and then through STHUFF (77 80(eg) data), where the data is the
   compressed stream always starting with 80 0 0 0 0 0, so it
   typically begins with the same byte string.

   Tested on an SRF file I see SMP4 always starting with the same 9
   bytes of data, BASE starting with the same 3 bytes and CNF4 always
   starting with the same 7 bytes. Hence we'd have lengths -9, -3 and
   -7 in the chunk headers and move that common data to the header
   block too. That's approx 3% of the size of our SRF file.

4) I propose *all* chunks have some standard meta-data fields
   available for use. These can be:

   4.1) GROUP - all chunks sharing the same GROUP value are considered
        as being related to one another. This provides a mechanism for
        multiple base-call, base position and confidence value chunks
        while still knowing which confidence values belong to which
        call. It also allows for multiple SAMP chunks (instead of the
        SMP4 chunk) to be collated together if desired.

	I don't expect many ZTR files to contain calls from multiple
	base-callers, but it's maybe a nice extension and seems quite
	a simple/clean use of meta-data.

   4.2) ENCODING - the default encoding for the chunk data is as
        described in the chunk. We may however wish to override this
        and, for example, store SMP4 data as 32-bit floating point
        values instead of 16-bit integers. This specifies that.

	Question: do we want this available universally everywhere? If
	not, we should at least use the same meta-data keyword for all
	occurrences.

   4.3)	TRANSFORM - a simple transformation description. This is
        essentially a mini-formula. It replaces the OFFS meta-data
        used in SMP4 which is simply a transform of X+value.

5) There are more generic ways to save storage by removing redundancy.

   Most probably they're not worth it, but I list them here for
   discussion still.

   5.1) Use 7-bit variable sized encodings for values instead of fixed
        32-bit sizes.

	Eg instead of storing 1000 as 0x3*0x100 + 0xe8 (00 00 03 e8)
	we could store it as 0x7*0x80 + 0x68 (80|07 68). The logic
	here being setting the top bit implies this isn't the final
	value and more data follows. It allows for variable sized
	fields so that small numbers take up fewer bytes. The same can
	be applied to data in SRF structs too.

	Realistically it saves 2 bytes per record in SRF and an
	unknown amount for ZTR - estimated 8 or so (3 for cnf4/base
	and 2 for smp4). It's only 1.5% saving though in total. (A
	sketch of such an encoding follows after item 5.3 below.)

   5.2) A general purpose dictionary system. Instead of attempting to
        move headers to one area and data somewhere else, possibly
        also taking common portions of data and putting that somewhere
        too, we could provide a dictionary system whereby we first
        remove redundancy by replacing all occurrences of a particular
        byte pattern with a new shorter code. (We'd need an escape
        mechanism for when it occurs by chance.) The dictionary can
        then be specified in its own chunk which is stored in the
        header portion.

	This then works for portions of chunk header (eg if the
	meta-data changes) rather than full headers, where the data
	blocks always start with the same text, or where we want to
	have sensible names in text fields but don't like them taking
	up too much space.

	It's maybe a bit messy though and complex to implement, plus
	it's unknown how big an impact escaping accidental dictionary
	codes in real data would have. The more formal way of removing
	redundancy is probably better.

    5.3) Lossy compression. I believe there's still room for this,
         although it needs careful thought.

	 The floating point format really isn't an ideal way to do it
	 though, so I'd much rather have an encoding system that uses
	 N*log(signal/M+1) plus a sign bit, stored in integers.

	 As we store data in integers the value of N combined with the
	 maximum value for log(signal/M+1) gives us the number of bits
	 we wish to encode to. Essentially we're storing the log value
	 to a fixed point precision.

	 The value of M dictates the slope of the errors we get from
	 logging. It's hard to describe, but basically as signal gets
	 larger our average error in storing the signal also gets
	 larger. That's true for floating point values too as there's
	 a fixed number of bits and they're being used to represent
	 larger and larger values, meaning the resolution drops.

	 I have various test code and graphs showing error profiles
	 for logs vs fixed point vs floating point. Logs or fixed
	 point are nearly always preferable to a floating point format
	 for size vs accuracy.
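
To make item 5.1 concrete, here is a sketch of such a 7-bit
variable-length encoder and decoder (the function names are mine, not
part of any ZTR or SRF structure; big-endian 7-bit groups with the top
bit as a continuation flag, so 1000 encodes as hex 87 68):

#include <stdint.h>
#include <stdio.h>

/* Sketch only: encode an unsigned value as big-endian 7-bit groups,
 * setting the top bit on every byte except the last to mean "more
 * data follows". */
static int encode_uint7(uint32_t v, unsigned char *out) {
    unsigned char grp[5];
    int n = 0, len = 0;
    do {
        grp[n++] = v & 0x7f;
        v >>= 7;
    } while (v);
    while (n > 1)                      /* most significant group first */
        out[len++] = 0x80 | grp[--n];
    out[len++] = grp[--n];             /* final group: top bit clear */
    return len;
}

static uint32_t decode_uint7(const unsigned char *in, int *used) {
    uint32_t v = 0;
    int i = 0;
    do {
        v = (v << 7) | (in[i] & 0x7f);
    } while (in[i++] & 0x80);
    *used = i;
    return v;
}

int main(void) {
    unsigned char buf[5];
    int len = encode_uint7(1000, buf), used;
    printf("1000 -> %d bytes: %02x %02x\n", len, buf[0], buf[1]);
    printf("back -> %u\n", decode_uint7(buf, &used));
    return 0;
}

And for the lossy log-scale encoding of item 5.3, a sketch of one
possible quantiser (N and M here are illustrative tuning constants,
not proposed spec values; compile with -lm):

#include <math.h>
#include <stdio.h>

/* Sketch only: q = round(N*log(|s|/M + 1)) plus a sign, decoded as
 * s = M*(exp(|q|/N) - 1). */
#define N 20.0
#define M 10.0

static int log_encode(double s) {
    int q = (int)(N * log(fabs(s) / M + 1.0) + 0.5);
    return s < 0 ? -q : q;
}

static double log_decode(int q) {
    double s = M * (exp((q < 0 ? -q : q) / N) - 1.0);
    return q < 0 ? -s : s;
}

int main(void) {
    int v[] = { 0, 5, 100, 1000, 30000 };
    int i;
    for (i = 0; i < 5; i++) {
        int q = log_encode(v[i]);
        printf("%6d -> q=%4d -> %9.1f\n", v[i], q, log_decode(q));
    }
    return 0;
}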

-----------------------------------------------------------------------------

CHANGE (since 1.2):
SAMP and SMP4 now has meta data fields indicating the zero base-line.

CLARIFICATION
The specification now explicitly states that trace samples are
unsigned, although the new OFFS meta-data can be used to turn these
into signed values.

CLARIFICATION
We explicitly state that multiple TEXT chunks may be present in the ZTR
file and will be concatenated together. Also the trailing (nul) byte
is now optional.

CHANGE
Added CSET (character set) meta-data for BASEs so ABI SOLID encoding
can be used. This removes the requirement of IUPAC characters only.

CHANGE
Added XRLE2, QSHIFT, TSHIFT and STHUFF compression types.

INCOMPATIBLE CHANGE:
I propose for this version to make all meta-data adhere to a specific
format rather than adhoc. It'll consist of zero or more copies of
'identifier nul value nul'. See the format below for details.

The only use of meta-data in 1.2 was for SAMP (not SMP4) chunks to
indicate the channel the data came from. From now on file readers will
need to check the version number in the header to determine how to
parse the SAMP meta-data.


[Search for "FIXME" for my comments / questions to be answered. They
elaborate on the summary below and provide more context.]


QUESTION1:
Should we adapt ZTR to not be so inefficient with regard
to tiny chunks? Specifically a 5 byte chunk size, 4 byte meta-data
size (normally zero anyway) and 4 byte data length is all
wasteful. These combined comprise 5-10% of the total SRF size. Note
that changing this would break backwards compatibility.

QUESTION2:
Do I need a means to specify the "default meta-data"? Specifically if
we have lots of SAMP chunks (for example) and every single one is
stating that the zero "offset" value is 32768 then we may want a
mechanism of specifying that the default OFFS value is 32768 for all
subsequent SAMP chunks.

One possible way to do this is to have a new chunk type which sets the
default. Eg for the SAMP chunk we could define a SaMP chunk to modify
the default for SAMP. This seems oddly named, but it's utilising the
bit5 of the 2nd byte which so far has been reserved as zero. (In the
first byte bit 5 set => private namespace and not part of the public spec.)

For now I'm just ignoring this issue though.

QUESTION3:
I've defined new transforms named TSHIFT and QSHIFT specifically
designed for adjusting the layout of CNF4 and SMP4 chunk types to an
order more amenable for compression by interlaced deflate. They do the
job, but I'm wondering if it's better to simply redefine the input
data to be a more consistent ordering so that we can define more
general purpose transforms rather than one dedicated to the original
trace layout and one for the quality layout.

I'm ignoring this for now as it would break backwards compatibility.

QUESTION4:
For the OFFS meta-data in SMP4 and SAMP chunks I have a 16-bit offset
to specify the zero position.  Ie OFFS of 10000 means a sample of 9000
becomes -1000 after processing.

Should it be a signed or unsigned 16-bit value? Signed means we could
encode values ranging from 10000 to 70000 by specifying OFFS as -10000.

Should it be 32-bit instead? Should we have OFFI and OFFF for integer
and floating point equivalents?

QUESTION5:
For region encoding where should the region name belong - the
meta-data section or the REGION_LIST TEXT identifier? It's currently
in both places. My gut instinct tells me it belongs in the meta-data
for the REGION_LIST chunk itself.

QUESTION6:
Can we have clarification on what the region code types mean,
specifically "tech read".

QUESTION7:
Should we add SAMP/SMP4 meta-data indicating a down-scale factor? For
454 data this could be 100, so we know value 123 is really 1.23. Note
this is maybe better implemented below using fixed-point precision.

QUESTION8:
How do we deal with floating point values?

I think the chunk meta-data should detail the format of the data block
itself (as it is strictly speaking data about the data so it fits
there well). A lack of meta data should imply the usual unsigned
16-bit quantities.

There's two main ways to encode fractions:

Floating point where we have a mantissa and an exponent.
	- See http://en.wikipedia.org/wiki/IEEE_floating-point_standard
	- large dynamic range
	- fixed number of significant bits
	- varying "resolution". Ie can represent tiny differences
	  between two very small floating point numbers, but not
	  between two very large floating point numbers.

Fixed point where we have a fixed number of bits for the component
before and after the decimal point.
	- See http://en.wikipedia.org/wiki/Q_%28number_format%29
	- constant resolution
	- effectively used by SFF (specified to 2 decimal places)
	- easy to treat as integers so can be fast and dealt with by
	  small embedded CPUs without FPUs.


Floating point may be appropriate as effectively it's the same as
logging your signals and storing those. It offers large dynamic range
so can cope with abnormally large values (at the expense of precision)
while retaining lots of variation at the low end to distinguish small
values. However it's CPU intensive to cope with anything other than
the CPU provided 32-bit and 64-bit floating point formats.

Single precision 32-bit floats in IEEE-754 have:
 1 bit  (31):    Sign
 8 bits (23-30): Exponent (bias 127, so storing 100 => -27)
23 bits (0-22):  Mantissa 

Effectively we store any binary value as a normalised expression:

                    <exponent>
    1.<mantissa> * 2


Eg 1732.5:

=> 11011000100.1 (binary)
=> 1.10110001001 (binary) * 2^10

Exponent+127 => 137 => 10001001 (binary)

sign  exponent  mantissa
   0  10001001  10110001001000000000000

(For comparison, 1732.5 stored as the fixed-point integer 17325
=> 0x43ad => 0100001110101101 in binary.)

However we probably want 16-bit and 24-bit floating point types for
efficiencies sake. Do we go with some fixed predefined floating point
formats for 8-bit, 16-bit, 24-bit and 32-bit layouts (with 32-bit
being identical to IEEE754) or do we allow for specification of the
mantissa and exponent, eg FLOAT=23.8, FLOAT=17.6 or FLOAT=5.2 in the
meta-data block?

FLOAT=17.6 (24-bit) gives ranges +/- 8.6*10^9
FLOAT=5.2  (8-bit)  gives ranges +/- 64 (I think).

Alternatively if we restrict ourselves to only using the most
significant 14 bits of the mantissa then storing as standard 32-bit
floats implies 1 in every 4 bytes is zero. This may provide for a
very crude, but fast way to implement reduced size floating point
values - ie FLOAT=15.8 (24-bit signed).

For fixed point (as in SFF values) there's already a draft standard
for implementation in C (ISO/IEC TR 18037:2004).

One benefit of fixed point over floating point is speed of
implementation. Fixed point numbers can just be dealt with as
integers. Eg subtracting two fixed point 16-bit values can be done in
integers using a-b and the result is the same as if we'd done all the
bit twiddling and maths directly simulating a real fixed-point unit.
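
A small illustration of that point, using a hypothetical Q8.8 layout
(8 integral and 8 fractional bits; my own example, not a spec value):

#include <stdint.h>
#include <stdio.h>

/* Sketch only: Q8.8 fixed point. Subtraction is ordinary integer
 * subtraction; only conversion to and from doubles needs to know
 * the scale factor of 256. */
typedef int16_t q8_8;

static q8_8 to_q8_8(double d)    { return (q8_8)(d * 256.0); }
static double from_q8_8(q8_8 q)  { return q / 256.0; }

int main(void) {
    q8_8 a = to_q8_8(3.5), b = to_q8_8(1.25);
    q8_8 c = a - b;                    /* plain integer subtract */
    printf("3.5 - 1.25 = %g\n", from_q8_8(c));   /* prints 2.25 */
    return 0;
}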

My gut feeling is that we'd want to explicitly declare the number of
bits for integral and fractional components in the meta-data block.

Comments?

James

PS. The latest (only minor tweaks from before) ZTR draft spec
follows.




1.3 draft 3 (19 Oct 2007)

				ZTR SPEC v1.3
				=============

Header
======

The header consists of an 8 byte magic number (see below), followed by
a 1-byte major version number and 1-byte minor version number.

Changes in minor numbers should not cause problems for parsers. It indicates
a change in chunk types (different contents), but the file format is the
same.

The major number is reserved for any incompatible file format changes (which
hopefully should be never).

/* The header */
typedef struct {
    unsigned char  magic[8];	  /* 0xae5a54520d0a1a0a (b.e.) */
    unsigned char  version_major; /* 1 */
    unsigned char  version_minor; /* 3 */
} ztr_header_t;

/* The ZTR magic numbers */
#define ZTR_MAGIC		"\256ZTR\r\n\032\n"
#define ZTR_VERSION_MAJOR	1
#define ZTR_VERSION_MINOR	3

So the total header will consist of:

Byte number   0  1  2  3  4  5  6  7  8  9
            +--+--+--+--+--+--+--+--+--+--+
Hex values  |ae 5a 54 52 0d 0a 1a 0a|01 03|
            +--+--+--+--+--+--+--+--+--+--+
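
For illustration, a minimal reader-side check of this header might
look like the sketch below ('check_ztr_header' is my own name, not
io_lib's API):

#include <stdio.h>
#include <string.h>

/* Sketch only: read and validate the 10-byte ZTR header from 'fp'.
 * Returns the minor version on success or -1 on failure. */
static int check_ztr_header(FILE *fp) {
    unsigned char hdr[10];
    static const unsigned char magic[8] =
        { 0xae, 'Z', 'T', 'R', '\r', '\n', 0x1a, '\n' };

    if (fread(hdr, 1, 10, fp) != 10)
        return -1;                     /* truncated */
    if (memcmp(hdr, magic, 8) != 0)
        return -1;                     /* not a ZTR file */
    if (hdr[8] != 1)
        return -1;                     /* incompatible major version */
    return hdr[9];                     /* minor version, eg 3 */
}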

Chunk format
============

The basic structure of a ZTR file is (header,chunk*) - ie header followed by
zero or more chunks. Each chunk consists of a type, some meta-data and some
data, along with the lengths of both the meta-data and data.

Byte number   0  1  2  3   4    5    6   7   8  9
            +--+--+--+--+----+----+----+---+--+  -  +--+--+--+--+--+--  -  --+
Hex values  |   type    |meta-data length  | meta-data |data length| data .. |
            +--+--+--+--+----+----+----+---+--+  -  +--+--+--+--+--+--  -  --+

FIXME: For very short reads this is a large overhead. We have 8 bytes
of length information (of which typically only 1-2 are non-zero) and 4
bytes for type (which typically only has one of 4-5 values). This
means about 10 bytes wasted per chunk, or maybe 5-10% of the total
file size. Changing this would be a radical departure from ZTR; is it
justified given the savings? (est. 4.8% for 74bp reads, 8.4% for 27bp
reads).
One idea is to consider a ZTR file (the non "block" components at
least) to be a series of huffman codes, by default all 8-bit long and
matching their ASCII codes. Then a dedicated chunk could be used to
adjust these default codes. It's therefore backwards compatible, but
is that also overkill? (NB, this looks like it'd save 6% on the
overall file size.)

Ie in C:

typedef struct {
    uint4 type;			/* chunk type (b.e.) */
    uint4 mdlength;		/* length of meta-data field (b.e.) */
    char *mdata;		/* meta data */
    uint4 dlength;		/* length of data field (b.e.) */
    char *data;			/* a format byte and the data itself */
} ztr_chunk_t;

All 2 and 4-byte integer values are stored in big endian format.

The meta-data is uncompressed (and so it does not start with a format
byte). From version 1.3 onwards meta-data is defined to be in key
value pairs adhering to the same structure defined in the TEXT chunk
("key\0value\0"). Exceptions are made for this only for purposes of
backwards compatibility in the SAMP chunk type. The contents of the
meta-data is chunk specific, and many chunk types will have no
meta-data. In this case the meta-data length field will be zero and
this will be followed immediately by the data-length field.

Ie all meta-data adheres to the following structure:

Meta-data: (version 1.3 onwards only)
            +-  -  -+--+-  -  -+--+-     -+-  -  -+--+-  -  -+--+
Hex values  | ident | 0| value | 0|   -   | ident | 0| value | 0|
            +-  -  -+--+-  -  -+--+-     -+-  -  -+--+-  -  -+--+

FIXME: Can we specify the meta-data once per ZTR file and omit it
in subsequent chunks? Eg a blank chunk with meta-data only in the
header. Chunks in the body then specify meta-data length as 0xFFFFFFFF
as an indicator meaning "use the last meta-data defined for this chunk
type". Useful when split in two, as in SRF?

Note that this means both ident and values must not themselves contain
the zero byte (a nul character), hence we generally store ident-value
pairs in ASCII string forms.
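
For illustration, a sketch of looking up a key in such a block
('meta_find' is my own name, not an io_lib function, and it assumes a
well-formed block):

#include <string.h>

/* Sketch only: return the value for 'key' within an mdlength-byte
 * meta-data block of "ident\0value\0" pairs, or NULL if absent. */
static const char *meta_find(const char *md, int mdlength, const char *key) {
    const char *end = md + mdlength;
    while (md < end) {
        const char *ident = md;
        md += strlen(md) + 1;          /* step over ident and its nul */
        if (md >= end)
            break;                     /* ident with no value: malformed */
        const char *value = md;
        md += strlen(md) + 1;          /* step over value and its nul */
        if (strcmp(ident, key) == 0)
            return value;
    }
    return NULL;
}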

The data length ("dlength") is the length in bytes of the entire
'data' block, including the format information held within it.

The first byte of the data consists of a format byte. The most basic format is
zero - indicating that the data is "as is"; it's the real thing. Other formats
exist in order to encode various filtering and compression techniques. The
information encoded in the next bytes will depend on the format byte.


RAW (#0) - no formatting
--------

Byte number   0 1  2       N
            +--+--+--  -  --+
Hex values  | 0|  raw data  |
            +--+--+--  -  --+

Raw data has no compression or filtering. It just contains the unprocessed
data. It consists of a one byte header (0) indicating raw format followed by N 
bytes of data.


RLE (#1) - simple run-length encoding
-------

Byte number   0  1    2     3     4      5     6  7  8               N
            +--+----+----+-----+-----+-------+--+--+--+--  -  --+--+--+
Hex values  | 1| Uncompressed length | guard | run length encoded data|
            +--+----+----+-----+-----+-------+--+--+--+--  -  --+--+--+

Run length encoding replaces stretches of N identical bytes (with value V)
with the guard byte G followed by N and V. All other byte values are stored 
as normal, except for occurrences of the guard byte, which is stored as G 0.
For example with a guard value of 8:

Input data:
	20 9 9 9 9 9 10 9 8 7

Output data:
	1			(rle format)
	0 0 0 10		(original length)
	8			(guard)
	20 8 5 9 10 9 8 0 7	(rle data)
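
For illustration, a sketch of a decoder for this payload (my own
code, assuming well-formed input; 'in' points at the guard byte, ie
past the format byte and the 4-byte uncompressed length):

#include <string.h>

/* Sketch only: expand the RLE payload. 'out' must be large enough to
 * hold the stored uncompressed length. Returns bytes written. */
static int rle_decode(const unsigned char *in, int inlen, unsigned char *out) {
    unsigned char guard = in[0];
    int i = 1, o = 0;

    while (i < inlen) {
        if (in[i] != guard) {
            out[o++] = in[i++];        /* ordinary literal byte */
        } else if (in[i+1] == 0) {
            out[o++] = guard;          /* escaped literal guard byte */
            i += 2;
        } else {
            memset(out + o, in[i+2], in[i+1]);  /* guard, count, value */
            o += in[i+1];
            i += 3;
        }
    }
    return o;
}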


ZLIB (#2) - see RFC 1950
---------

Byte number   0  1    2     3     4    5  6  7         N
            +--+----+----+-----+-----+--+--+--+--  -  --+
Hex values  | 2| Uncompressed length | Zlib encoded data|
            +--+----+----+-----+-----+--+--+--+--  -  --+

This uses the zlib code to compress a data stream. The ZLIB data may itself be 
encoded using a variety of methods (LZ77, Huffman), but zlib will
automatically determine the format itself. Often using zlib mode
Z_HUFFMAN_ONLY will provide best compression when combined with other
filtering techniques.


XRLE (#3) - multi-byte run-length encoding
---------

Byte number   0    1     2     3  4  5                N
            +--+------+-------+--+--+--+--  -  --+--+--+
Hex values  | 3| size | guard | run length encoded data|
            +--+------+-------+--+--+--+--  -  --+--+--+

Much like standard RLE, but this mechanism has a byte to specify the
length of the data item we compare to check for runs. It is not
restricted to spotting runs aligned on 'size'-byte boundaries either.

No uncompressed length is encoded here as technically this is not
required (although it does make decoding a bit slower). The compressed
length alone is sufficient to work out the uncompressed length after
decompressing.

Guard bytes in the input stream are 'escaped' by replacing them with
the guard byte followed by zero. Guard bytes in a parameterised run (ie
X copies of Y where Y contains the guard) do not need to be 'escaped'.

Input data:
10 12 12 13 12 13 12 13 12 13 14

Output data:
3				(xrle format)
2				(size of blocks to compare)
12			        (guard, 12 is a bad choice but illustrative)
10 12 0 12 4 12 13 14           (rle data)


XRLE2 (#4) - word aligned multi-byte run-length encoding
----------
Version 1.3 onwards

Byte number   0     1       RSZ          multiple of RSZ
            +--+-----+---------+-- - - - - - - - - - ---+
Hex values  | 4| RSZ | padding | run length encoded data|
            +--+-----+---------+-- - - - - - - - - - ---+

This achieves the same goal as XRLE, but is designed to maintain data
aligned to specific 'record size' boundaries. This sometimes has
benefits over XRLE in that a subsequent interlaced deflate entropy
encoding may work better on record-aligned data streams.

The first byte holds the format (#4) while the record size (RSZ) is
held in the second byte. In order to ensure the entire block of data
is aligned on 'RSZ' boundaries, RSZ-2 padding bytes are written out
before the data itself starts. The contents of these bytes can be
anything.

Unlike XRLE it also does not use an explicit guard byte. If we term a
'word' to be a block of data of size RSZ, then whenever we read a word
which is identical to the last word written then we write out that
word (so we have two consecutive words in the output data) followed by
a counter of how many additional copies of that word are found, up to
255. This counter consists of 1 byte indicating the number of
additional copies of the word followed by RSZ-1 padding bytes to
maintain word alignment. While the contents of these padding bytes may
be anything, it is suggested that they adhere to the same value
distribution as observed elsewhere in the data block in order to keep
the data entropy low. (For example repeating the previous bytes from
'word' will do.)

Example:

Input data (taken in pairs):
        1 0  2 2  2 2  3 1  3 1  3 1  2 4  2 4  2 4  2 3

Output data:
        4 2				(xrle2 format, rec size 2)
	1 0				("1 0" from input)
	2 2 2 2 0 2			("2 2" x 2)
	3 1 3 1 1 1			("3 1" x 3)
	2 4 2 4 1 4			("2 4" x 3)
	2 3				("2 3")
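
For illustration, a sketch of a matching decoder (my own code; it
assumes RSZ >= 2 and well-formed input, with 'in' pointing at the
format byte):

#include <string.h>

/* Sketch only: expand an XRLE2 payload. Data starts at offset RSZ
 * (2 header bytes plus RSZ-2 padding). Returns bytes written. */
static int xrle2_decode(const unsigned char *in, int inlen,
                        unsigned char *out) {
    int rsz = in[1];
    int i = rsz, o = 0;

    while (i + rsz <= inlen) {
        memcpy(out + o, in + i, rsz);          /* copy one word */
        o += rsz;
        i += rsz;
        /* two identical consecutive output words => a count follows */
        if (o >= 2*rsz && memcmp(out + o - rsz, out + o - 2*rsz, rsz) == 0) {
            int extra = in[i];                 /* count + RSZ-1 padding */
            i += rsz;
            while (extra-- > 0) {
                memcpy(out + o, out + o - rsz, rsz);
                o += rsz;
            }
        }
    }
    return o;
}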


DELTA1 (#64) - 8-bit delta
------------

Byte number   0       1        2      N 
            +--+-------------+--  -  --+
Hex values  |40| Delta level |   data  |
            +--+-------------+--  -  --+

This technique replaces successive bytes with their differences. The level
indicates how many rounds of differencing to apply, which should be between 1
and 3. For determining the first difference we compare against zero. All
differences are internally performed using unsigned values with automatic
wrap-around (taking the bottom 8-bits). Hence 2-1 is 1 and 1-2 is 255.

For example, with level set to 1:

Input data:
      10 20 10 200 190 5

Output data:
       64			(delta1 format)
       1			(level)
       10 10 246 190 246 71	(delta data)

For level set to 2:
       
Input data:
      10 20 10 200 190 5

Output data:
       64			(delta1 format)
       2			(level)
       10 0 236 200 56 81	(delta data)
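
For illustration, a sketch of one level of the forward transform (my
own code; applying it repeatedly gives levels 2 and 3):

/* Sketch only: one level of the DELTA1 transform, in place.
 * unsigned char arithmetic wraps modulo 256, giving eg 1-2 => 255
 * as described above. */
static void delta1_level(unsigned char *data, int len) {
    unsigned char prev = 0, cur;
    int i;
    for (i = 0; i < len; i++) {
        cur = data[i];
        data[i] = cur - prev;          /* difference vs previous byte */
        prev = cur;
    }
}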


DELTA2 (#65) - 16-bit delta
------------

Byte number   0       1        2      N 
            +--+-------------+--  -  --+
Hex values  |41| Delta level |   data  |
            +--+-------------+--  -  --+

This format is as data format 64 except that the input data is read in 2-byte
values, so we take the difference between successive 16-bit numbers. For
example "0x10 0x20 0x30 0x10" (4 8-bit numbers; 2 16-bit numbers) yields "0x10
0x20 0x1f 0xf0". All 16-bit input data is assumed to be aligned to the start
of the buffer and is assumed to be in big-endian format.


DELTA4 (#66) - 32-bit delta
------------

Byte number   0       1        2  3  4      N 
            +--+-------------+--+--+--  -  --+
Hex values  |42| Delta level | 0| 0|   data  |
            +--+-------------+--+--+--  -  --+


This format is as data formats 64 and 65 except that the input data is read in
4-byte values, so we take the difference between successive 32-bit numbers.

Two padding bytes (2 and 3) should always be set to zero. Their purpose is to
make sure that the compressed block is still aligned on a 4-byte boundary
(hence making it easy to pass straight into the 32to8 filter).


Data format 67-69/0x43-0x45 - reserved
---------------------------

At present these are reserved for dynamic differencing where the 'level' field 
varies - applying the appropriate level for each section of data. Experimental 
at present...


16TO8 (#70) - 16 to 8 bit conversion
-----------

Byte number   0
            +--+--  -  --+
Hex values  |46|   data  |
            +--+--  -  --+

This method assumes that the input data is a series of big endian 2-byte
signed integer values. If the value is in the range of -127 to +127 inclusive
then it is written as a single signed byte in the output stream, otherwise we
write out -128 followed by the 2-byte value (in big endian format). This
method works well following one of the delta techniques as most of the 16-bit
values are typically then small enough to fit in one byte.

Example input data:
	0 10 0 5 -1 -5 0 200 -4 -32 (bytes)
	(As 16-bit big-endian values: 10 5 -5 200 -800)

Output data:
       70			(16-to-8 format)
       10 5 -5 -128 0 200 -128 -4 -32
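
For illustration, a sketch of the forward transform (my own code,
operating on an array of signed 16-bit values):

#include <stdint.h>

/* Sketch only: the 16TO8 transform over 'n' signed 16-bit values.
 * Values in -127..127 become one byte; others are escaped as -128
 * followed by the big-endian 16-bit value. Returns output length. */
static int to8_encode(const int16_t *in, int n, signed char *out) {
    int i, o = 0;
    for (i = 0; i < n; i++) {
        if (in[i] >= -127 && in[i] <= 127) {
            out[o++] = (signed char)in[i];
        } else {
            uint16_t u = (uint16_t)in[i];
            out[o++] = -128;                    /* escape marker */
            out[o++] = (signed char)(u >> 8);   /* high byte */
            out[o++] = (signed char)(u & 0xff); /* low byte */
        }
    }
    return o;
}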


32TO8 (#71) - 32 to 8 bit conversion
-----------

Byte number   0
            +--+--  -  --+
Hex values  |47|   data  |
            +--+--  -  --+

This format is similar to format 16TO8, but we are reducing 32-bit numbers (big
endian) to 8-bit numbers.


FOLLOW1 (#72) - "follow" predictor
-------------

Byte number   0  1     FF 100  101   N
            +--+--  -  -  - --+-- - --+
Hex values  |48| follow bytes |  data |
            +--+--  -  -  - --+-- - --+

For each symbol we compute the most frequent symbol following it. This is
stored in the "follow bytes" block (256 bytes). The first character in the
data block is stored as-is. Then for each subsequent character we store the
difference between the predicted character value (obtained by using
follow[previous_character]) and the real value. This is a very crude, but
fast, method of removing some residual non-randomness in the input data and so 
will reduce the data entropy. It is best to use this prior to entropy encoding 
(such as huffman encoding).
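
For illustration, a sketch of the forward transform (my own code; it
assumes the 256-byte follow table has already been built by counting
symbol pairs, and uses real-minus-predicted as the residual, a
convention the decoder must simply mirror):

/* Sketch only: the FOLLOW1 forward transform. 'follow' maps each
 * byte value to its most frequent successor and is emitted as the
 * 256-byte table before the transformed data. */
static void follow1_encode(const unsigned char *in, int len,
                           const unsigned char follow[256],
                           unsigned char *out) {
    int i;
    if (len <= 0)
        return;
    out[0] = in[0];                    /* first byte stored as-is */
    for (i = 1; i < len; i++)
        out[i] = in[i] - follow[in[i-1]];  /* residual, mod 256 */
}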


CHEB445 (#73) - floating point 16-bit chebyshev polynomial predictor
-------------
Version 1.1 only.
Deprecated: replaced by format 74 in Version 1.2.

WARNING: This method was experimental and has been replaced with an
integer equivalent. The floating point method may give system specific
results.

Byte number   0  1  2      N
            +--+--+--  -  --+
Hex values  |49| 0|   data  |
            +--+--+--  -  --+

This method takes big-endian 16-bit data and attempts to curve-fit it using
chebyshev polynomials. The exact method employed uses the 4 preceding values
to calculate chebyshev polynomials with 5 coefficients. Of these 5 coefficients
only 4 are used to predict the next value. Then we store the difference
between the predicted value and the real value. This procedure is repeated
throughout each 16-bit value in the data. The first four 16-bit values are
stored with a simple 1-level 16-bit delta function. Reversing the predictor
follows the same procedure, except now adding the differences between stored
value and predicted value to get the real value.


ICHEB (#74) - integer based 16-bit chebyshev polynomial predictor
-----------
Version 1.2 onwards
This replaces the floating point CHEB445 format in ZTR v1.1.


Byte number   0  1  2      N
            +--+--+--  -  --+
Hex values  |4A| 0|   data  |
            +--+--+--  -  --+

This method takes big-endian 16-bit data and attempts to curve-fit it using
chebyshev polynomials. The exact method employed uses the 4 preceding values
to calculate chebyshev polynomials with 5 coefficients. Of these 5 coefficients
only 4 are used to predict the next value. Then we store the difference
between the predicted value and the real value. This procedure is repeated
throughout each 16-bit value in the data. The first four 16-bit values are
stored with a simple 1-level 16-bit delta function. Reversing the predictor
follows the same procedure, except now adding the differences between stored
value and predicted value to get the real value.

STHUFF (#77) - Interlaced Deflate
------------
Version 1.3 onwards

Byte number   0  1  2                      N
            +--+--+-- - - - - - --+-- - - --+
Hex values  |4D| C| huffman codes |  data   |
            +--+--+-- - - - - - --+-- - - --+

This compresses data using huffman encoding using the Deflate
algorithm for storing the codes and data. It is analogous to using
zlib with the Z_HUFFMAN_ONLY strategy and a negative window
size. However it has a few tweaks for optimal compression of very
small data sets. See RFC 1951 for details of Deflate. If the following
text disagrees with RFC 1951 then the RFC takes priority. The
following is included as additional explanatory material only.

Huffman compression works by replacing each character (or 'symbol')
with a string of bits. Common symbols are encoded using few bits
and rare symbols need a longer string of bits. The net effect is that
the overall number of bits needed to store a message is reduced.

To uncompress a compressed data stream it is necessary to know which
symbols are present and what their bit-strings are. For brevity this
is achieved by storing only the lengths of the bit-string for each
symbol and generating bit-strings from the lengths. As long as the
same canonical algorithm is used in both the encoder and decoder then
knowing the lengths alone is sufficient. Knowledge of this algorithm
is required for uncompressing the data, so it is defined as follows:

1. Sort symbols by the length of their bit-strings, smallest first.

   The collating order for symbols sharing the same length is defined
   as ASCII values 0 to 255 inclusive followed by the EOF symbol.

2. X = 0

3. For all bit lengths 'L' from 1 to 24 inclusive: 

           For all Symbols of bit length 'L', sorted as above:
                   Code(Symbol) = least significant 'L' bits of X
                   X = X + 1
           End loop

           X = X * 2

   End loop

This is the same algorithm utilised in the Deflate algorithm (RFC 1951).


For example compressing "abracadabra" gives:      /\            
                                                 0  1           
Symbol    bit-length    Code(X)                 /    \          
-------------------------------                a     /\         
a         1              0 0                        /  \                
b         3              4 100                     0    1       
c         3              5 101                    /      \        
r         3              6 110                   /        \       
d         4             14 1110                 /\        /\    
EOF       4             15 1111                0  1      0  1   
                                              /    \    /    \  
which in turn leads to 28 bits               b      c  r     /\ 
of output:                                                  0  1        
                                                           /    \       
 0100110010101110010011001111                             d     EOF
(ab  r  ac  ad   ab  r  aEOF)      
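
The algorithm transcribes directly into C. The sketch below (my own
code) reproduces the code assignments from the example above, using
symbol 256 for EOF:

#include <stdio.h>

/* Sketch only: len[s] is the bit-length per symbol (0 = absent);
 * symbol 256 is EOF, so scanning s upwards gives the collating order
 * stated in step 1. code[s] receives the assigned bit-string,
 * right-justified in an unsigned int. */
static void canonical_codes(const int len[257], unsigned code[257]) {
    unsigned x = 0;
    int l, s;
    for (l = 1; l <= 24; l++) {        /* bit lengths, shortest first */
        for (s = 0; s <= 256; s++) {
            if (len[s] == l)
                code[s] = x++ & ((1u << l) - 1);
        }
        x *= 2;
    }
}

int main(void) {
    int len[257] = {0};
    unsigned code[257];
    len['a'] = 1; len['b'] = 3; len['c'] = 3; len['r'] = 3;
    len['d'] = 4; len[256] = 4;        /* EOF */
    canonical_codes(len, code);
    printf("a=%u b=%u c=%u r=%u d=%u EOF=%u\n",  /* 0 4 5 6 14 15 */
           code['a'], code['b'], code['c'], code['r'],
           code['d'], code[256]);
    return 0;
}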


In the data format defined above, 'C' is a code-set number. If it is
zero then the huffman codes to uncompress 'data' are stored in the
following bytes using the same format described in the DFLH chunk type
below; otherwise no huffman codes are stored and a predefined set of
huffman codes is used, being either defined in a preceding DFLH chunk
(for 128 <= 'C' <= 255) or statically defined in this document (for
1 <= 'C' <= 127). Immediately following this is the compressed
bit-stream itself.

The statically defined huffman code-sets are as follows. The symbols
are listed below as their printable ASCII character or hash followed
by a number, so A and #65 are the same symbol. We use the algorithm
described above to turn these bit-lengths into actual huffman codes.

C=1: CODE_DNA

    Length   Symbols
    ----------------
    2        A C T
    3        G
    4        N
    5	     #0
    6        EOF
    13       #1 to #6 inclusive
    14       #7 to #255 except where already listed above

C=2: CODE_DNA_AMBIG (DNA with IUPAC ambiguity codes)

    Length   Symbols
    ----------------
    2        A C T
    3        G
    4        N
    7	     #0 #45
    8        B D H K M R S V W Y
    11	     EOF
    14	     #226
    15	     #1 to #255 except where already listed above

C=3: CODE_ENGLISH (English text)

    Length   Symbols
    ----------------
    3	     #32 e
    4	     a i n o s t
    5	     d h l r u
    6	     #10 #13 #44 c f g m p w y
    7	     #46 b v
    8	     #34 I k
    9	     #45 A N T
    10	     #39 #59 #63 B C E H M S W x
    11	     #33 0 1 F G
    15	     #0 to #255 except where already listed above


It is recommended that this compression format is used only for small
data sizes and ZLIB is used for larger (a few K and above) data.


QSHIFT (#79) - 4-byte quality reorder
------------
Version 1.3 onwards

This reorders the quality signal to be 4-tuples of the quality for the
called base followed by the quality of the other 3 base types in the
order they appear in a,c,g,t (minus the called base).

The purpose is to allow a 4-byte interlaced deflate algorithm to
operate efficiently.
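
For illustration, a sketch of one plausible implementation (my own
code; the direction assumed, native CNF4 layout in and per-base
4-tuples out, is my reading rather than explicit spec text):

/* Sketch only: 'in' holds nbases called-base qualities followed by
 * 3*nbases remaining qualities grouped per base; 'out' receives
 * per-base 4-tuples. */
static void qshift(const unsigned char *in, int nbases, unsigned char *out) {
    int i;
    for (i = 0; i < nbases; i++) {
        out[4*i + 0] = in[i];                 /* called-base quality */
        out[4*i + 1] = in[nbases + 3*i + 0];  /* remaining 3 values in */
        out[4*i + 2] = in[nbases + 3*i + 1];  /* a,c,g,t order minus   */
        out[4*i + 3] = in[nbases + 3*i + 2];  /* the called base       */
    }
}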


TSHIFT (#80) - 8-byte trace reorder
------------
Version 1.3 onwards

This reorders the trace signal to be 4-tuples of the 16-bit trace
signals for the called base followed by the signal from the other 3
base types in the order they appear in a,c,g,t (minus the called
base).

The purpose is to allow an 8-byte interlaced deflate algorithm to
operate efficiently.

FIXME: QSHIFT and TSHIFT could be general purpose byte rearrangements
without any knowledge of the data type they're holding. They need the
input data to be consistently ordered and not the large differences we
see between quality and trace right now.


Chunk types
===========

As described above, each chunk has a type. The format of the data contained in 
the chunk data field (when written in format 0) is described below.
Note that no chunks are mandatory. It is valid to have no chunks at all.
However some chunk types may depend on the existence of others. This will be
indicated below, where applicable.

Each chunk type is stored as a 4-byte value. Bit 5 of the first byte is used
to indicate whether the chunk type is part of the public ZTR spec (bit 5 of
first byte == 0) or is a private/custom type (bit 5 of first byte == 1). Bit
5 of the remaining 3 bytes is reserved - they must always be set to zero.

Practically speaking this means that public chunk types consist entirely of
upper case letters (eg TEXT) whereas private chunk types start with a
lowercase letter (eg tEXT). Note that in this example TEXT and tEXT are
completely independent types and they may have no more relationship with each
other than (for example) TEXT and BPOS types.

It is valid to have multiples of some chunks (eg text chunks), but not for
others (such as base calls). The order of chunks does not matter unless
explicitly specified.

A chunk may have meta-data associated with it. This is data about the data
chunk. For example the data chunk could be a series of 16-bit trace samples,
while the meta-data could be a label attached to that trace (to distinguish
trace A from traces C, G and T). Meta-data is typically very small and so it
is never need be compressed in any of the public chunk types (although
meta-data is specific to each chunk type and so it would be valid to have
private chunks with compressed meta-data if desirable).

The first byte of each chunk data when uncompressed must be zero, indicating
raw format. If, having read the chunk data, this is not the case then the
chunk needs decompressing or reverse filtering until the first byte is
zero. There may be a few padding bytes between the format byte and the first
element of real data in the chunk. This is to make file processing simpler
when the chunk data consists of 16 or 32-bit words; the padding bytes ensure
that the data is aligned to the appropriate word size. Any padding bytes
required will be listed in the appropriate chunk definition below.


The following lists the chunk types available in 32-bit big-endian format.
In all cases the data is presented in the uncompressed form, starting with the 
raw format byte and any appropriate padding.

SAMP
----

Meta-data: (version 1.2 and before - for 1.3 onwards see below)
Byte number   0  1  2  3
            +--+--+--+--+
Hex values  | data name |
            +--+--+--+--+

Data:
Byte number   0  1  2  3  4  5  6  7       N
            +--+--+--+--+--+--+--+--+-     -+
Hex values  | 0| 0| data| data| data|   -   |
            +--+--+--+--+--+--+--+--+-     -+

This encodes a series of 16-bit unsigned trace samples. The first data
byte is the format (raw); the second data byte is present for padding
purposes only. After that comes a series of 16-bit big-endian
values.  Although stored as unsigned, a baseline value can be
specified which should then be subtracted from all values to
generate signed data if required. By default the baseline is zero.

Valid identifiers for the meta-data (version 1.3 onwards) are:

Ident	     Value(s)
---------------------------------------------------------------------
TYPE	     "A", "C", "G", "T", "PYNO" or "PYRW"
OFFS         16-bit signed integer representing the 'zero' position,
	     in ASCII.

[ FIXME: signed or unsigned? Signed means we couldn't store data in
the range from -48K to +16K. Unsigned means we couldn't store data in
the range 10K to 70K. What's most useful? Or should OFFS be 32-bit
instead? ]

Versions prior to 1.3 specified that the meta-data consisted of a single
4-byte block containing a 4-byte name associated with the trace. If a
type-name is shorter than 4 bytes then it should be right padded with
nul characters to 4 bytes. For sequencing traces the four lanes
representing A, C, G and T signals have names "A\0\0\0", "C\0\0\0",
"G\0\0\0" and "T\0\0\0". PYNO and PYRW refer to normalised and raw
pyrogram data (eg from 454 instruments). At present other names are
not reserved, but it is recommended that (for consistency with
elsewhere) you label private trace arrays with names starting in a
lowercase letter (specifically, bit 5 is 1).

For the purposes of backwards compatibility, readers should check the
version number in the ZTR header to determine whether the old or new
style meta-data formatting is in use.

For sequencing traces it is expected that there will be four SAMP chunks,
although the order is not specified.


SMP4
----

Meta-data: optional - see below

Data:
Byte number   0  1  2  3  4  5  6  7       N
            +--+--+--+--+--+--+--+--+-     -+
Hex values  | 0| 0| data| data| data|   -   |
            +--+--+--+--+--+--+--+--+-     -+


As per SAMP, this encodes a series of unsigned 16-bit trace values, to
be base-line corrected by the OFFS meta-data value as appropriate.

The first byte is 0 (raw format). Next is a single padding byte (also 0).
Then follows a series of 2-byte big-endian trace samples for the "A" trace,
followed by a series of 2-byte big-endian trace samples for the "C" trace,
also followed by the "G" and "T" traces (in that order). The assumption is
made that there is the same number of data points for all traces and hence the 
length of each trace is simply the number of data elements divided by four.

Experimentation has shown that this gives around 3% saving over 4
separate SAMP chunks, but it lacks in flexibility.

Valid identifiers for the meta-data are:

Ident	     Value(s)
---------------------------------------------------------------------
OFFS         16-bit signed integer representing the 'zero' position
TYPE         The type of data-set encoded. Values can be:
             "PROC" - processed data for viewing, also the default
                      when no type field is found.
             "SLXI" - Illumina GA raw intensities (.int.txt files)
             "SLXN" - Illumina GA noise intensities (.nse.txt files)
	      

BASE
----

Meta-data: optional - see below

Data:
Byte number   0  1  2  3      N  
            +--+--+--+--  -  --+
Hex values  | 0| base calls    |
            +--+--+--+--  -  --+

The first byte is 0 (raw format). This is followed by the base calls in ASCII
format (one base per byte). By default it is assumed that all base
calls are stored using the IUPAC characters[1].

Valid identifiers for the meta-data are:

Ident	 Meaning         Value(s)
---------------------------------------------------------------------
CSET     Character-set   'I' (ASCII #73) => IUPAC ("ACGTUMRWSYKVHDBN")
			 '0' (ASCII #48) => ABI SOLiD ("0123N")

BPOS
----

Meta-data: none present

Data:
Byte number   0  1  2  3  4  5  6  7       
            +--+--+--+--+--+--+--+--+-     -+--+--+--+--+
Hex values  | 0| padding|   data    |   -   |    data   |
            +--+--+--+--+--+--+--+--+-     -+--+--+--+--+

This chunk contains the mapping of base call (BASE) numbers to sample (SAMP)
numbers; it defines the position of each base call in the trace data. The
position here is defined as the numbering of the 16-bit positions held in the
SAMP array, counting zero as the first value.

The format is 0 (raw format) followed by three padding bytes (all 0). Next
follows a series of 4-byte big-endian numbers specifying the position of each
base call as an index into the sample arrays (when considered as a 2-byte
array with the format header stripped off).

Excluding the format and padding bytes, the number of 4-byte elements should
be identical to the number of base calls. All sample numbers are counted from
zero. No sample number in BPOS should be beyond the end of the SAMP arrays
(although it should not be assumed that the SAMP chunks will be before this
chunk). Note that the BPOS elements may not be totally in sorted order as
the base calls may be shifted relative to one another due to compressions.


CNF1
----

Meta-data: optional - see below

Data:
Byte number   0  1              N 
            +--+--+--   -   --+--+
Hex values  | 0| call confidence |
            +--+--+--   -   --+--+

(N == number of bases in BASE chunk)

Valid identifiers for the meta-data are:

Ident	 Value(s)   Meaning
---------------------------------------------------------------------
SCALE    PH         Phred-scaled confidence values. (Default). i.e. for
                    a call with probability p:  -10*log10(1-p)
         LO         Log-odds scaled values. ie:  10*log10(p/(1-p))
	

The first byte of this chunk is 0 (raw format). This is then followed by a
series of signed 8-bit confidence values for the called bases.

Either phred or log-odds (as used by the Illumina GA) scale ranges are
appropriate.


CNF4
----

Meta-data: optional - see below

Data:
Byte number   0  1              N              4N
            +--+--+--   -   --+--+----- -  -----+
Hex values  | 0| call confidence | A/C/G/T conf |
            +--+--+--   -   --+--+----- -  -----+

(N == number of bases in BASE chunk)

Valid identifiers for the meta-data are:

Ident	 Value(s)   Meaning
---------------------------------------------------------------------
SCALE    PH         Phred-scaled confidence values. i.e. for a call
		    with probability p:  -10*log10(1-p)	 
		    (NB: default, but often inappropriate.)
         LO         Log-odds scaled values. ie:  10*log10(p/(1-p))
	

The first byte of this chunk is 0 (raw format). This is then followed by a
series of signed 8-bit confidence values for the called base. Next comes
all the remaining confidence values for A, C, G and T excluding those
that have already been written (ie the called base). So for a sequence
AGT we would store confidences A1 G2 T3 C1 G1 T1 A2 C2 T2 A3 C3 G3.

The purpose of this is to group the (likely) highest confidence value (those
for the called base) at the start of the chunk followed by the remaining
values. Hence if phred confidence values are written in a CNF4 chunk the first
quarter of the chunk will consist of phred confidence values and the last three
quarters will (assuming no ambiguous base calls) consist entirely of zeros.

For the purposes of storage the confidence value for a base call that is not
A, C, G or T (in any case) is stored as if the base call was T.

If only one confidence value exists per base then either the phred or
log-odds scales work well. The first N bytes will be the called-base
confidences and the remaining 3*N will be zero (optimal for run-length-encoding),
but consider using the CNF1 chunk type instead in this situation.

If all 4 base types have their own confidence value then the log-odds
scale will work well. In this case the phred scale is an inappropriate
choice as it cannot encode both very likely and very unlikely events.

Note: if this chunk exists it must exist after a BASE chunk.
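
For illustration, a sketch of producing this ordering from a
per-base-type confidence matrix ('cnf4_order' is my own name; conf is
a 4 x npos matrix stored row-major, one row per base type A=0, C=1,
G=2, T=3, and calls[] holds the called row index per position):

/* Sketch only: for "AGT" (calls = {0,2,3}) this yields
 * A1 G2 T3 C1 G1 T1 A2 C2 T2 A3 C3 G3 as in the text above. */
static void cnf4_order(const signed char *conf, const int *calls,
                       int npos, signed char *out) {
    int i, b, o = 0;
    for (i = 0; i < npos; i++)                 /* called values first */
        out[o++] = conf[calls[i]*npos + i];
    for (i = 0; i < npos; i++)                 /* then the rest, per pos */
        for (b = 0; b < 4; b++)
            if (b != calls[i])
                out[o++] = conf[b*npos + i];
}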

TEXT
----

Meta-data: none present

Data:	      0 
            +--+-  -  -+--+-  -  -+--+-     -+-  -  -+--+-  -  -+--+-----+
Hex values  | 0| ident | 0| value | 0|   -   | ident | 0| value | 0| (0) |
            +--+-  -  -+--+-  -  -+--+-     -+-  -  -+--+-  -  -+--+-----+

This contains a series of "identifier\0value\0" pairs.

The identifiers and values may be any length and may contain any data
except the nul character. The nul character marks the end of the
identifier or the end of the value. Multiple identifier-value pairs
are allowable. Prior to version 1.3 a double nul character marked the
end of the list (labeled "(0)" above), but from version 1.3 the end
of the list may also be marked by the end of chunk.

Identifiers starting with bit 5 clear (uppercase) are part of the public ZTR
spec. Any public identifier not listed as part of this spec should be
considered as reserved. Identifiers that have bit 6 set (lowercase) are for
private use and no restriction is placed on these.

Multiple TEXT chunks may exist within the ZTR file. If so they are
considered to be concatenated together.

See below for the text identifier list.

CLIP
----

Meta-data: none present

Data:
Byte number   0  1  2  3  4  5  6  7  8
            +--+--+--+--+--+--+--+--+--+
Hex values  | 0| left clip | right clip|
            +--+--+--+--+--+--+--+--+--+

This contains suggested quality clip points. These are stored as zero (raw
data) followed by a 4-byte big endian value for the left clip point and a
4-byte big endian value for the right clip point. Clip points are defined in
units of base calls, starting from 0. (Q: is that correct!?)



CR32
----

Meta-data: none present

Data:
Byte number   0  1  2  3  4 
            +--+--+--+--+--+
Hex values  | 0|   CRC-32  |
            +--+--+--+--+--+

This chunk is always just 4 bytes of data containing a CRC-32 checksum,
computed according to the widely used ANSI X3.66 standard. If present, the
checksum will be a check of all of the data since the last CR32 chunk.
This will include checking the header if this is the first CR32 chunk, and
including the previous CR32 chunk if it is not. Obviously the checksum will
not include checks on this CR32 chunk.


COMM
----

Meta-data: none present

Data:
Byte number   0  1        N
            +--+--   -   --+
Hex values  | 0| free text |
            +--+--   -   --+

This allows arbitrary textual data to be added. It does not require an
identifier-value pairing or any nul termination.


DFLH
----

Meta-data: none present

Data:
Byte number   0  1                         N
            +--+--+-- - - - - - - - - - - --+
Hex values  | 0| C| Deflate format data ... |
            +--+--+-- - - - - - - - - - - --+

'C' is the code-set number referred to within that compression method.
It should be 128 onwards and is used to distinguish between multiple
huffman tables. It is used in conjunction with the data compression
format 77 (STHUFF, "Interlaced Deflate").

Following this is data in the Deflate format (RFC 1951). This should
consist of the header for a single block using dynamic huffman with
the BFINAL (last block) flag set.

In Deflate streams the end of the huffman codes and the start of
the compressed data stream itself may occur part way through a
byte. Therefore the last byte of this block is bitwise ORed with
the first byte of the compressed data stream that refers back to
this code-set number, and so all unused bits in the last byte of
this block should be set to zero. Likewise if the data bit-stream in
this block ends on an exact byte boundary then an additional blank
byte must be added to ensure the ORing method above still works.


DFLC
----

Meta-data: none present

Data:
Byte number   0
            +--+---+- - - - ---+--+-- - - - - - - - - - - - --+
Hex values  | 0| C |code-order |FF| Deflate dynamic codes ... |
            +--+---+- - - - ---+--+-- - - - - - - - - - - - --+

Multi-context Deflate compression codes defined for use by data format
78 (HUFF_MULTI).

This is like the DFLH format, except it encodes multiple huffman trees
instead of a single tree along with the order in which the multiple
trees should be used (the "code-order").

'C' is the code-set number referred to within that compression method.
It should be 128 onwards and is used to distinguish between multiple
huffman tables. 

The code-order is a run-length encoded series of 8-bit numbers
indicating which huffman code set should be used for which byte. For
each byte in the input stream the HUFF_MULTI method selects the
appropriate huffman code by indexing code-order with the input
data position modulo the number of values in code-order.

Following this is data in the Deflate format (RFC 1951). This should
consist of the header component for a single block using dynamic
huffman with the BFINAL (last block) flag set, up to and including
the HDIST+1 code lengths for the distance alphabet. This will then be
immediately followed by the next set of huffman codes, and so on until
all index values contained within the code-order have been accounted
for.

In Deflate streams the end of the huffman codes and the start of
the compressed data stream itself may occur part way through a
byte. Therefore the last byte of this block is bitwise ORed with
the first byte of the compressed data stream that refers back to
this code-set number, and so all unused bits in the last byte of
this block should be set to zero. Likewise if the data bit-stream in
this block ends on an exact byte boundary then an additional blank
byte must be added to ensure the ORing method above still works.


For example, compression of 16-bit data is sometimes best achieved by
producing one set of huffman codes for the top 8 bits and another set
for the bottom 8 bits, rather than mixing these together by treating
the 16-bit data as a series of 8-bit quantities. In this case our
code-order would consist of just two entries; (0, 1).

Alternatively we may have 4 1-byte confidence values stored per base
in the order of the confidence of the base-called base type first
followed by the 3 remaining confidence values. We observe that
compressing byte 0, 4, 8, 12, ... as one set and bytes 1,2,3, 5,6,7,
... as another set yields higher compression ratios. In this case the
code-order would consist of 4 entries; (0, 1, 1, 1).


REGN
----

Meta-data: optional - see below

Data:
Byte number   0   1   2   3   4   5   6   7   8
            +--+---+---+---+---+---+---+---+---+
Hex values  | 0| 1st boundary  | 2nd boundary  | ...
            +--+---+---+---+---+---+---+---+---+

This chunk is used to break a trace down into a series of segments. We
store the boundary between segments, so the list above will contain
one less boundary than there are segments with the first segment
implicitly starting from the first base and the last segment implicitly
extending to the last base.

Each 4-byte unsigned value indicates a position within the sequence or
trace counting from 0 as the first element and marking the first base
of the next region. For example three regions of DNA may be:

    0  1  2  3   4  5  6  7  8   9 10 11 12
    T  A  C  G   G  A  T  T  C   G  A  A  C
   |<-reg. 1->| |<--reg. 2--->| |<-reg. 3->|

This would give the 1st boundary as 4 and the 2nd boundary as 9.

The lack of a REGN chunk implies one single region extending from the
first to last base in the sequence.

Valid identifiers for the meta-data are:

Ident	 Meaning              Value(s)
---------------------------------------------------------------------
COORD    Coordinate system    'T' = trace coordinates
	                      'B' = base coordinates (default)

NAME     Region names	      A semicolon separated list of
			      "name:code" pairs. Eg
			      primer1:T;read1:P;primer2:T;read2:P

[FIXME: NAME identifier here is the same as the REGION_LIST TEXT
identifier. We need to decide where it belongs and pick one. If we can
get a way to specify the default meta-data contents then logically
speaking the best place to store this is in the meta-data along side
the chunk data itself.]

The NAME identifier is used to attach a meaning to the regions
described in the data chunk. It consists of a semi-colon separated
list of names or name:code pairs. The codes, if present are a single
character from the predefined list below and are separated from the
name by a colon.

Code    Meaning
---------------------------------------
T	Tech read (e.g. primer, linker)
B	Bio read
I	Inverted read
D	Duplicate read
P	Paired read

FIXME: I don't like the above meanings. They don't, well, "mean" much
to me! What's a tech read?



Text Identifiers
================

These are for use in the TEXT segments. None are required, but if any of these
identifiers are present they must conform to the description below. Much
(currently all) of this list has been taken from the NCBI Trace Archive [2]
documentation. It is duplicated here as the ZTR spec is not tied to the same
revision schedules as the NCBI trace archive (although it is intended that any
suitable updates to the trace archive should be mirrored in this ZTR spec).

The Trace Archive specifies a maximum length of values. The ZTR spec does not
have length limitations, but for compatibility these sizes should still be
observed.

The Trace Archive also states some identifiers are mandatory; these are marked
by asterisks below. These identifiers are not mandatory in the ZTR spec (but
clearly they need to exist if the data is to be submitted to the NCBI).

Finally, some fields are not appropriate for use in the ZTR spec, such as
BASE_FILE (the name of a file containing the base calls). Such fields are
included only for compatibility with the Trace Archive. It is not expected that 
use of ZTR would allow for the base calls to be read from an external file
instead of the ZTR BASE chunk.

[ Quoted from TraceArchiveRFC v1.17 ]

Identifier      Size       Meaning			 Example value(s)
----------      -----      ----------------------------  -----------------
TRACE_NAME *      250      name of the trace             HBBBA1U2211
                           as used at the center
                           unique within the center
                           but not among centers.
                           
SUBMISSION_TYPE *   -      type of submission
                           
CENTER_NAME *     100      name of center                BCM
CENTER_PROJECT    200      internal project name         HBBB
                           used within the center
                           
TRACE_FILE *      200      file name of the trace	 ./traces/TRACE001.scf
                           relative to the top of
                           the volume.
                           
TRACE_FORMAT *     20      format of the tracefile
                           
SOURCE_TYPE *       -      source of the read
                           
INFO_FILE         200      file name of the info file
INFO_FILE_FORMAT   20        
                           
BASE_FILE         200      file name of the base calls
QUAL_FILE         200      file name of the base calls
                           
                           
TRACE_DIRECTION     -      direction of the read
TRACE_END           -      end of the template
PRIMER            200      primer sequence
PRIMER_CODE                which primer was used
                           
STRATEGY            -      sequencing strategy
TRACE_TYPE_CODE     -      purpose of trace
                           
PROGRAM_ID         100     creator of trace file         phred-0.990722.h
                           program-version
                           
TEMPLATE_ID         20     used for read pairing         HBBBA2211
                           
CHEMISTRY_CODE       -     code of the chemistry         (see below)
ITERATION            -     attempt/redo                  1
                           (int 1 to 255)
                           
CLIP_QUALITY_LEFT          left clip of the read in bp due to quality
CLIP_QUALITY_RIGHT         right " " " " "
CLIP_VECTOR_LEFT           left clip of the read in bp due to vector
CLIP_VECTOR_RIGHT          right " " " " "

                           
SVECTOR_CODE        40     sequencing vector used        (in table)
SVECTOR_ACCESSION   40     sequencing vector used        (in table)
CVECTOR_CODE        40     clone vector used             (in table)
CVECTOR_ACCESSION   40     clone vector used             (in table)
                           
INSERT_SIZE          -     expected size of insert       2000,10000
                           in base pairs (bp)
                           (int 1 to 2^32)
                           
PLATE_ID            32     plate id at the center          
WELL_ID                    well                          1-384


SPECIES_CODE *       -     code for species
SUBSPECIES_ID       40     name of the subspecies
                           Is this the same as strain

CHROMOSOME           8     name of the chromosome        ChrX, Chr01, Chr09
                           
                           
LIBRARY_ID          30     the source library of the clone
CLONE_ID            30     clone id                      RPCI11-1234 
 
ACCESSION           30     NCBI accession number         AC00001
                           
PICK_GROUP_ID       30     an id to group traces picked
                           at the same time.
PREP_GROUP_ID       30     an id to group traces prepared
                           at the same time
                           
                           
RUN_MACHINE_ID      30     id of sequencing machine
RUN_MACHINE_TYPE    30     type/model of machine
RUN_LANE            30     lane or capillary of the trace
RUN_DATE             -     date of run
RUN_GROUP_ID        30     an identifier to group traces
                           run on the same machine

[ End of quote from TraceArchiveRFC ]

More detailed information on the format of these values should be obtained
from the Trace Archive RFC [2].

In addition to the above the following TEXT identifiers have meaning
specific to the ZTR format:

Identifier     Meaning			     Example value(s)
----------     ----------------------------  -------------------------------
REGION_LIST    A semi-colon separated list   primer1:T;read1:P
               identifying regions of a      
	       trace. See the REGN chunk     Region 1;Region 2;Region 3
	       definition for details.
                           

FIXME: Should this simply be the meta-data associated with the REGN
chunk?
                           


References
==========
[1] IUPAC: http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html

[2] http://www.ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html

[3] J.Bonfield and R.Staden, "ZTR: a new format for DNA sequence trace
data". Bioinformatics Vol. 18 no. 1 2002. 


FIXME: As an aside, not doing the final entropy encoding steps (zlib,
deflate, etc) and just using bzip2 on an entire SRF archive yields a
considerable saving. On tests it varied between 23% (27bp reads) and
13% (74bp reads) smaller than the Deflate compressed
data. Unfortunately it pretty much removes all chance of random access
in the data unless I can get a working FM-Index implementation
(which is very unlikely in a short time). This makes it appropriate
for transmission perhaps, but not for indexing and querying random
sequences.

A substantial chunk (5-9%) of this saving comes from the repeated ZTR
block types (names like "BASE", "CNF4" and common components like 0x00000000
for the meta-data size). The remainder probably comes from
similarities between one ZTR file and another.