comparison srf2fastq/io_lib-1.12.2/man/man1/illumina2srf.1 @ 0:d901c9f41a6a default tip

Migrated tool version 1.0.1 from old tool shed archive to new tool shed repository
author dawe
date Tue, 07 Jun 2011 17:48:05 -0400
parents
children
-1:000000000000 0:d901c9f41a6a
1 .TH illumina2srf 1 "September 29" "" "Staden io_lib"
2
3 .SH "NAME"
4
5 .PP
6 .BR illumina2srf
7 \- Builds an SRF file from an Illumina/Solexa GA run folder.
8
9 .SH "SYNOPSIS"
10 .PP
11 \fBillumina2srf\fR [\fIoptions\fR] \fItile_seq_file\fR ...
12
13 .SH "DESCRIPTION"
14 .PP
15 \fBillumina2srf\fR converts the Illumina GA-pipeline run folder output
16 into an SRF file. It should be run from the
17 Bustard\fI<version><date>\fR directory. It has a wealth of options,
18 listed below, although many have defaults and may be omitted if the
19 run folder follows the standard directory layout. The arguments, after
20 the options, should be the filenames of the sequence files, for example
21 \fIs_8_*_seq.txt\fR. All other filenames are derived from the _seq.txt
22 filenames.
23 .PP
24 The main structure of an SRF file is as a container, much like zip or
25 tar. The contents however may be split into variable and common
26 components allowing for better compression. For \fBillumina2srf\fR
27 that means that we store trace data in ZTR format with common ZTR
28 chunks (text identifiers such as base-caller name and version, matrix
29 files and compression specifications) in an SRF \fIData Block
30 Header\fR and variable components (sequence, quality and traces) in
31 ZTR chunks held within an SRF \fIData Block\fR. Typically we have
32 10,000 Data Blocks per Data Block Header.
33 .PP
34 The main decision in producing the SRF file is what data to put
35 in it. By default the program writes the sequence and probability
36 values along with the "processed" trace intensities. In GAPipeline
37 v1.0 and earlier these are in the \fI_seq.txt\fR, \fI_prb.txt\fR and
38 \fI_sig2.txt\fR files held within the main Bustard directory. In
39 addition to these the \fB-r\fR option requests storage of the "raw"
40 trace intensities, comprising both the pre-processed intensities and
41 noise estimates from the Firecrest \fI_int.txt\fR and \fI_nse.txt\fR
42 files respectively. To store only raw intensities, skipping processed
43 data, specify the \fB-r -P\fR options. Finally the \fB-I\fR option can
44 be used to store data from IPAR format files.
45 .PP
46 Confidence values have been a source of large variation over the
47 pipeline releases. In GAPipeline 1.0 and earlier the \fI_prb.txt\fR
48 files in the Bustard directory contain four quality values per base
49 encoded using a log-odds system: 10*log10(P/(1-P)). In addition to this
50 there are various calibrated formats in the GERALD directory with one
51 Phred scale value per base. See the \fB-qf\fR, \fB-qr\fR and \fB-qc\fR
52 parameters.
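.PP
As a rough illustration of how the log-odds and Phred scales relate,
the following Python sketch converts between them. It assumes the
logarithm is base 10 and that P is the probability that the call is
correct. The function names are hypothetical and are not part of
\fBillumina2srf\fR.
.PP
.nf
```python
import math

def logodds_to_prob(q):
    # Invert q = 10*log10(P/(1-P)) to recover P.
    odds = 10.0 ** (q / 10.0)
    return odds / (1.0 + odds)

def logodds_to_phred(q):
    # Phred = -10*log10(1-P), which equals 10*log10(10^(q/10) + 1).
    return 10.0 * math.log10(10.0 ** (q / 10.0) + 1.0)
```
.fi
.PP
Under these assumptions a log-odds score of 0 corresponds to P = 0.5,
and the two scales converge for high-quality calls.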
53 .PP
54 There are a number of smaller ancillary data files that get stored
55 too. As there is no per-lane or per-run storage mechanism in SRF,
56 these are added for every SRF Data Block Header of which there may be
57 several per tile. However the overhead in duplicating this data is not
58 significant given the size of the individual SRF Data Blocks. The
59 ancillary data files also stored are \fI.params\fR files (for both Bustard
60 and Firecrest), matrices (specified using \fB-mf\fR and \fB-mr\fR) and
61 phasing XML files (\fB-pf\fR and \fB-pr\fR).
62
63 .SH "OPTIONS"
64 .PP
65 .SS "Trace data-source options"
66 .TP
67 \fB-r\fR, \fB-R\fR
68 Specifies to store (\fB-r\fR) or not to store (\fB-R\fR - the default)
69 "raw" data. This currently comprises the contents of the
70 \fI_int.txt\fR and \fI_nse.txt\fR files in the Firecrest directory.
71 .TP
72 \fB-p\fR, \fB-P\fR
73 Specifies to store (\fB-p\fR - the default) or not to store (\fB-P\fR)
74 the "processed" data. This is the contents of the \fI_sig2.txt\fR
75 files in the Bustard directory.
76 .TP
77 \fB-u\fR
78 Deprecated. Older GAPipeline releases created \fI_sig.txt\fR files
79 holding semi-processed data with compensation for the dye spectral
80 overlap, but before phasing correction steps. The \fB-u\fR argument
81 indicates that the processed data should be taken from these files
82 instead of \fI_sig2.txt\fR.
83 .TP
84 \fB-I\fR
85 Reads \fIIPAR\fR files instead of the raw trace data files. These are
86 a different format used by the incremental processing software when
87 the pipeline is run on the instrument control PC itself.
88 .SS "Quality value data-source options"
89 .TP
90 \fB-qf\fR \fIfilename\fR
91 Specifies the filename of the calibrated quality values for the
92 forward-read or both the forward and reverse read combined if
93 appropriate. \fIfilename\fR should be in Illumina's fastq derivative
94 format, with quality values stored as ASCII 64 plus the log-odds
95 score.
96 .TP
97 \fB-qr\fR \fIfilename\fR
98 If the calibrated fastq files are split into forward and reverse files
99 then \fIfilename\fR specifies the reverse sequences. Otherwise we
100 assume they are tacked onto the end of the forward sequences specified
101 in \fB-qf\fR. Like the former file, this should be in Illumina's
102 fastq-like format.
103 .TP
104 \fB-qc\fR \fIdirectory\fR
105 This is an alternative to the \fB-qf\fR and \fB-qr\fR options above
106 and is mutually exclusive with them. This specifies that the
107 calibrated data should come from files named
108 "\fIdirectory\fR/s_%d_qcal.txt" where "%d" is replaced by the current
109 tile number.
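.PP
The "ASCII 64 plus the log-odds score" encoding described for
\fB-qf\fR can be sketched in Python as below. This is only an
illustration of the encoding; the function name is hypothetical.
.PP
.nf
```python
def decode_ascii64(qual):
    # Each character stores one score as chr(64 + score),
    # so scores below zero map to characters before "@".
    return [ord(ch) - 64 for ch in qual]
```
.fi
.PP
For instance, the characters "@", "J" and "h" decode to 0, 10 and 40.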
110
111 .SS "Filtering options"
112 .TP
113 \fB-c\fR \fIvalue\fR
114 Only store traces that have a "chastity" score >= \fIvalue\fR.
115 This is mutually exclusive with the \fB-C\fR option.
116 .TP
117 \fB-C\fR \fIvalue\fR
118 Unlike the \fB-c\fR option, traces with a "chastity" score < \fIvalue\fR
119 are still stored in the SRF file but are marked as bad reads.
120 \fBsrf2fasta\fR and \fBsrf2fastq\fR have options to
121 subsequently filter out bad reads using this flag.
122 This is mutually exclusive with the \fB-c\fR option.
123 .TP
124 \fB-s\fR \fIN\fR
125 This skips the first \fIN\fR cycles of a trace (including signal,
126 sequence and quality values) when writing it to an SRF file. The
127 purpose of this is to remove primer bases, but it is not
128 recommended. Instead the SRF file should be using the ZTR region chunk
129 (REGN) to indicate which portion of a trace is valid.
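.PP
The difference between \fB-c\fR (discard) and \fB-C\fR (keep but flag)
can be sketched in Python as follows. The list of (name, chastity)
pairs and the function name are hypothetical and do not reflect how
\fBillumina2srf\fR represents reads internally.
.PP
.nf
```python
def apply_chastity(reads, threshold, discard):
    # discard=True mimics -c: failing reads are dropped.
    # discard=False mimics -C: failing reads are kept and flagged bad.
    kept = []
    for name, chastity in reads:
        bad = chastity < threshold
        if bad and discard:
            continue
        kept.append((name, bad))
    return kept
```
.fi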
130 .SS "Read naming"
131 .PP
132 Read names are split into two halves, a prefix and a suffix. One
133 common prefix is stored in each and every SRF Data Block Header while
134 the suffix is stored in every Data Block. This combination allows for
135 removal of repetitive data in order to shrink the SRF file size.
136 .TP
137 \fB-n\fR \fIformat\fR
138 .RS
139 Controls the format used for creating the sequence name suffix. This
140 uses a printf style system of percent expansions that will be replaced
141 with the appropriate data. The percent expansions are:
142 .TP
143 %%
144 A literal percent character
145 .TP
146 %d
147 Run date (taken from parsing the current working directory)
148 .TP
149 %m
150 Machine name (taken from parsing the current working directory)
151 .TP
152 %r
153 Run number (taken from parsing the current working directory)
154 .TP
155 %l
156 Lane number (%L for hexadecimal encoding)
157 .TP
158 %t
159 Tile number (%T for hexadecimal encoding)
160 .TP
161 %x
162 X coordinate (%X for hexadecimal encoding)
163 .TP
164 %y
165 Y coordinate (%Y for hexadecimal encoding)
166 .TP
167 %c
168 Counter; increments by 1 for every sequence in the tile (%C for
169 hexadecimal encoding).
170 .PP
171 All the above format strings have an optional numerical value between
172 the percent and the format character. This is used to control the
173 field width. For example to print the X and Y coordinates to 3
174 hexadecimal places we could use \fB-n "%3X:%3Y"\fR.
175 .PP
176 The default format is "\fB%x:%y\fR".
177 .RE
178 .TP
179 \fB-N\fR \fIformat\fR
180 .RS
181 Specifies the format string for encoding the read name prefix. It
182 follows the same formatting rules specified for \fB-n\fR above.
183 .PP
184 The default format is "\fB%m_%r:%l:%t:\fR".
185 .RE
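.PP
The percent expansions used by \fB-n\fR and \fB-N\fR can be sketched in
Python as below. This is an illustration only: the function name and
the dictionary of field values are hypothetical, and both the zero
padding to the field width and the upper-case hexadecimal digits are
assumptions, since the text above only states that a field width is
applied.
.PP
.nf
```python
import re

def expand_name(fmt, fields):
    # fields maps a lower-case format letter to its value,
    # for example {"x": 151, "y": 873}.
    def repl(match):
        width, letter = match.group(1), match.group(2)
        if letter == "%":
            return "%"
        value = fields[letter.lower()]
        if letter.isupper():
            # Upper-case letters request hexadecimal encoding.
            text = "%X" % value
        else:
            text = str(value)
        # Assumption: pad with zeros up to the requested width.
        return text.zfill(int(width)) if width else text
    return re.sub("%([0-9]*)([%dmrltxycLTXYC])", repl, fmt)
```
.fi
.PP
With the default formats, a prefix such as "IL4_45:4:17:" and a suffix
such as "151:873" would be produced for machine IL4, run 45, lane 4,
tile 17 and coordinates (151, 873). These values are invented for the
example.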
186 .SS "Ancillary data files"
187 .PP
188 These options govern the extra files stored per tile (or strictly
189 speaking per SRF Data Block Header).
190 .TP
191 \fB-2\fR \fIcycle\fR
192 This specifies the cycle number, counting from 1, of the second read
193 forming a read-pair. It is used for automatic generation of filenames
194 in several of the options below and also for construction of the ZTR
195 region (REGN) chunks.
196 .TP
197 \fB-mf\fR \fIfilename\fR
198 The filename of the forward matrix file. If a single printf numerical
199 percent rule is used (such as "%d") then it will be replaced by the
200 lane number. When not specified the default \fIfilename\fR will be
201 \fI../Matrix/s_%d_02_matrix.txt\fR.
202 .TP
203 \fB-mr\fR \fIfilename\fR
204 The filename of the reverse matrix file - only used on paired end
205 runs. If a single printf numerical percent rule is used (such as "%d")
206 then it will be replaced by the lane number. If a second printf
207 percent rule is used then it will be replaced with the cycle number
208 that the paired read starts on. This is equivalent to the cycle number
209 specified in the \fB-2\fR option plus one. (The plus one comes
210 from using the second cycle per end for matrix calibration.)
211 When \fB-mr\fR is not specified the default \fIfilename\fR will be
212 \fI../Matrix/s_%d_%02d_matrix.txt\fR.
213 .TP
214 \fB-pf\fR \fIfilename\fR
215 Specifies the filename of the forward-read phasing XML file. As with
216 \fB-mf\fR a printf numerical percent rule will be replaced by the lane
217 number. The default \fIfilename\fR format is
218 \fIPhasing/s_%d_01_phasing.xml\fR.
219 .TP
220 \fB-pr\fR \fIfilename\fR
221 Specifies the filename of the reverse-read phasing XML file. As with
222 \fB-mr\fR the first two printf numerical percent rules will be
223 replaced by the lane number and the cycle number. Unlike \fB-mr\fR
224 though the cycle number is the value used in the \fB-2\fR option as-is
225 instead of plus one. The default \fIfilename\fR format is
226 \fIPhasing/s_%d_%02d_phasing.xml\fR.
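.PP
The default filename patterns above can be expanded with ordinary
printf-style formatting. The Python sketch below shows the
substitutions as described for \fB-mf\fR, \fB-mr\fR, \fB-pf\fR and
\fB-pr\fR, taking the cycle from \fB-2\fR. The function names are
hypothetical and the code is not taken from \fBillumina2srf\fR.
.PP
.nf
```python
def matrix_filenames(lane, second_read_cycle):
    forward = "../Matrix/s_%d_02_matrix.txt" % lane
    # The reverse matrix uses the -2 cycle plus one.
    cycle = second_read_cycle + 1
    reverse = "../Matrix/s_%d_%02d_matrix.txt" % (lane, cycle)
    return forward, reverse

def phasing_filenames(lane, second_read_cycle):
    forward = "Phasing/s_%d_01_phasing.xml" % lane
    # The reverse phasing file uses the -2 cycle as-is.
    cycle = second_read_cycle
    reverse = "Phasing/s_%d_%02d_phasing.xml" % (lane, cycle)
    return forward, reverse
```
.fi
.PP
For lane 4 with \fB-2 37\fR this would give a reverse matrix file of
\fI../Matrix/s_4_38_matrix.txt\fR and a reverse phasing file of
\fIPhasing/s_4_37_phasing.xml\fR.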
227 .SS "Other options"
228 .TP
229 \fB-o\fR \fIsrf_filename\fR
230 Specifies the output filename to write the SRF data to. Defaults to
231 "traces.srf".
232 .TP
233 \fB-i\fR
234 Indicates that an index should be appended to the SRF file. This
235 allows for random access based on the sequence name.
236 .TP
237 \fB-d\fR
238 Enable dots-mode. This outputs a full-stop per input tile. Most useful
239 in conjunction with quiet mode. Default is off.
240 .TP
241 \fB-q\fR
242 Quiet mode. Do not output commentary on which tile is being processed
243 and the metrics about it. Default off.
244
245 .SH "EXAMPLES"
246 .PP
247 To store lane 4 from a paired end run with raw traces, no
248 processed data and calibrated confidence values:
249 .PP
250 .nf
251 # From Bustard directory
252 illumina2srf -o all.srf -r -P \\
253 -qf GERALD*/s_4_1_sequence.txt \\
254 -qr GERALD*/s_4_2_sequence.txt \\
255 s_4_*_seq.txt
256 .fi
257
258 .PP
259 To store and index only processed traces with chastity >= 0.6:
260 .PP
261 .nf
262 illumina2srf -o s4.srf -i -c 0.6 s_4_*_seq.txt
263 .fi
264
265 .SH "CAVEATS"
266 .PP
267 There are many mutually exclusive options, some of which may be for
268 processing file formats that no longer exist. This is due to the
269 history of the program and the rapidly changing nature of the files
270 being processed. Some future culling of options and file formats can
271 be expected.
272 .PP
273 Some assumptions are made as to the directory layout and the ability
274 to parse the run folder directory name. There are currently no ways to
275 override some of this information, including run date, run number and
276 GAPipeline program version numbers.
277
278 .SH "AUTHOR"
279 .PP
280 James Bonfield, Wellcome Trust Sanger Institute