comparison srf2fastq/io_lib-1.12.2/man/man1/illumina2srf.1 @ 0:d901c9f41a6a default tip
Migrated tool version 1.0.1 from old tool shed archive to new tool shed repository
author | dawe |
---|---|
date | Tue, 07 Jun 2011 17:48:05 -0400 |
parents | |
children |
-1:000000000000 | 0:d901c9f41a6a |
---|---|
1 .TH illumina2srf 1 "September 29" "" "Staden io_lib" | |
2 | |
3 .SH "NAME" | |
4 | |
5 .PP | |
6 .BR illumina2srf | |
7 \- Builds an SRF file from an Illumina/Solexa GA run folder. | |
8 | |
9 .SH "SYNOPSIS" | |
10 .PP | |
11 \fBillumina2srf\fR [\fIoptions\fR] \fItile_seq_file\fR ... | |
12 | |
13 .SH "DESCRIPTION" | |
14 .PP | |
15 \fBillumina2srf\fR converts the Illumina GA-pipeline run folder output | |
16 into an SRF file. It should be run from the | |
17 Bustard\fI<version><date>\fR directory. It has a wealth of options, | |
18 listed below, although many have defaults and may be omitted if the | |
19 run folder follows the standard directory layout. The arguments, after | |
20 the options, should be the filenames of the sequence files, for example | |
21 \fIs_8_*_seq.txt\fR. All other filenames are derived from the _seq.txt | |
22 filenames. | |
23 .PP | |
24 The main structure of an SRF file is as a container, much like zip or | |
25 tar. The contents however may be split into variable and common | |
26 components allowing for better compression. For \fBillumina2srf\fR | |
27 that means that we store trace data in ZTR format with common ZTR | |
28 chunks (text identifiers such as base-caller name and version, matrix | |
29 files and compression specifications) in an SRF \fIData Block | |
30 Header\fR and variable components (sequence, quality and traces) in | |
31 ZTR chunks held within an SRF \fIData Block\fR. Typically we have | |
32 10,000 Data Blocks per Data Block Header. | |
33 .PP | |
34 The most important decision in producing the SRF file is what data to put | |
35 in it. By default the program writes the sequence and probability | |
36 values along with the "processed" trace intensities. In GAPipeline | |
37 v1.0 and earlier these are in the \fI_seq.txt\fR, \fI_prb.txt\fR and | |
38 \fI_sig2.txt\fR files held within the main Bustard directory. In | |
39 addition to these the \fB-r\fR option requests storage of the "raw" | |
40 trace intensities, comprising both the pre-processed intensities and | |
41 noise estimates from the Firecrest \fI_int.txt\fR and \fI_nse.txt\fR | |
42 files respectively. To store only raw intensities, skipping processed | |
43 data, specify the \fB-r -P\fR options. Finally the \fB-I\fR option can | |
44 be used to store data from IPAR format files. | |
45 .PP | |
46 Confidence values have been a source of large variation over the | |
47 pipeline releases. In GAPipeline 1.0 and earlier the \fI_prb.txt\fR | |
48 files in the Bustard directory contain four quality values per base | |
49 encoded using a log-odds system: 10*log10(P/(1-P)). In addition to this | |
50 there are various calibrated formats in the GERALD directory with one | |
51 Phred scale value per base. See the \fB-qf\fR, \fB-qr\fR and \fB-qc\fR | |
52 parameters. | |
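The relationship between the log-odds values and the Phred scale mentioned above can be sketched as follows. This is an illustration only, not code taken from io_lib: it assumes the logarithm is base 10 and that P is the probability that the call is correct, and both function names are hypothetical.

```python
import math


def logodds_from_prob(p_correct):
    """Log-odds score 10*log10(P/(1-P)) for a probability P with 0 < P < 1."""
    return 10.0 * math.log10(p_correct / (1.0 - p_correct))


def phred_from_logodds(score):
    """Convert a log-odds score to the Phred scale, i.e. -10*log10(1-P)."""
    # 10**(score/10) is P/(1-P); adding 1 gives 1/(1-P).
    return 10.0 * math.log10(10.0 ** (score / 10.0) + 1.0)
```

Under these assumptions a log-odds score of 0 corresponds to P = 0.5, which is roughly 3 on the Phred scale, and the two scales converge as the scores grow.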
53 .PP | |
54 There are a number of smaller ancillary data files that get stored | |
55 too. As there is no per-lane or per-run storage mechanism in SRF | |
56 these are added for every SRF Data Block Header of which there may be | |
57 several per tile. However the overhead in duplicating this data is not | |
58 significant given the size of the individual SRF Data Blocks. The | |
59 ancillary data files also stored are \fI.params\fR files (for both Bustard | |
60 and Firecrest), matrices (specified using \fB-mf\fR and \fB-mr\fR) and | |
61 phasing XML files (\fB-pf\fR and \fB-pr\fR). | |
62 | |
63 .SH "OPTIONS" | |
64 .PP | |
65 .SS "Trace data-source options" | |
66 .TP | |
67 \fB-r\fR, \fB-R\fR | |
68 Specifies to store (\fB-r\fR) or not to store (\fB-R\fR - the default) | |
69 "raw" data. This currently comprises the contents of the | |
70 \fI_int.txt\fR and \fI_nse.txt\fR files in the Firecrest directory. | |
71 .TP | |
72 \fB-p\fR, \fB-P\fR | |
73 Specifies to store (\fB-p\fR - the default) or not to store (\fB-P\fR) | |
74 the "processed" data. This is the contents of the \fI_sig2.txt\fR | |
75 files in the Bustard directory. | |
76 .TP | |
77 \fB-u\fR | |
78 Deprecated. Older GAPipeline releases created \fI_sig.txt\fR files | |
79 holding semi-processed data with compensation for the dye spectral | |
80 overlap, but before phasing correction steps. The \fB-u\fR argument | |
81 indicates that the processed data should be taken from these files | |
82 instead of \fI_sig2.txt\fR. | |
83 .TP | |
84 \fB-I\fR | |
85 Reads \fIIPAR\fR files instead of the raw trace data files. These are | |
86 a different format used by the incremental processing software when | |
87 the pipeline is run on the instrument control PC itself. | |
88 .SS "Quality value data-source options" | |
89 .TP | |
90 \fB-qf\fR \fIfilename\fR | |
91 Specifies the filename of the calibrated quality values for the | |
92 forward-read or both the forward and reverse read combined if | |
93 appropriate. \fIfilename\fR should be in Illumina's fastq derivative | |
94 format, with quality values stored as ASCII 64 plus the log-odds | |
95 score. | |
96 .TP | |
97 \fB-qr\fR \fIfilename\fR | |
98 If the calibrated fastq files are split into forward and reverse files | |
99 then \fIfilename\fR specifies the reverse sequences. Otherwise we | |
100 assume they are tacked onto the end of the forward sequences specified | |
101 in \fB-qf\fR. Like the former file, this should be in Illumina's | |
102 fastq-like format. | |
103 .TP | |
104 \fB-qc\fR \fIdirectory\fR | |
105 This is an alternative to the \fB-qf\fR and \fB-qr\fR options above | |
106 and is mutually exclusive with them. This specifies that the | |
107 calibrated data should come from files named | |
108 "\fIdirectory\fR/s_%d_qcal.txt" where "%d" is replaced by the current | |
109 tile number. | |
110 | |
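The "ASCII 64 plus the log-odds score" encoding expected by \fB-qf\fR and \fB-qr\fR can be pictured with a short sketch. The function name is hypothetical and this is not how illumina2srf itself parses the file; it only shows the character-to-score mapping described above.

```python
def decode_quality_string(qual, offset=64):
    """Return the integer scores encoded in one quality line.

    Each character stores (score + offset) as its ASCII code, so
    subtracting the offset recovers the score, which may be negative.
    """
    return [ord(ch) - offset for ch in qual]
```

For instance, the character "h" (ASCII 104) decodes to a score of 40 and "@" (ASCII 64) to a score of 0.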
111 .SS "Filtering options" | |
112 .TP | |
113 \fB-c\fR \fIvalue\fR | |
114 Only store traces that have a "chastity" score >= \fIvalue\fR. | |
115 This is mutually exclusive with the \fB-C\fR option. | |
116 .TP | |
117 \fB-C\fR \fIvalue\fR | |
118 Unlike the \fB-c\fR option, traces with a "chastity" score < \fIvalue\fR are | |
119 still stored in the SRF file but are marked as bad reads | |
120 instead. \fBsrf2fasta\fR and \fBsrf2fastq\fR have options to | |
121 subsequently filter out bad reads using this flag. | |
122 This is mutually exclusive with the \fB-c\fR option. | |
123 .TP | |
124 \fB-s\fR \fIN\fR | |
125 This skips the first \fIN\fR cycles of a trace (including signal, | |
126 sequence and quality values) when writing it to an SRF file. The | |
127 purpose of this is to remove primer bases, but it is not | |
128 recommended. Instead the SRF file should be using the ZTR region chunk | |
129 (REGN) to indicate which portion of a trace is valid. | |
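The difference between the two chastity options can be summarised in a small sketch. It assumes the chastity score for each read is already known (how it is computed is not covered here), and the function and parameter names are hypothetical rather than part of io_lib.

```python
def apply_chastity(reads, threshold, mark_only=False):
    """Filter or flag reads by chastity score.

    reads is a list of (name, chastity) pairs. The result is a list of
    (name, is_bad) pairs.

    mark_only=False mimics -c: reads scoring below the threshold are dropped.
    mark_only=True mimics -C: every read is kept, and those scoring below
    the threshold are flagged as bad.
    """
    result = []
    for name, chastity in reads:
        is_bad = chastity < threshold
        if is_bad and not mark_only:
            continue  # -c behaviour: the read is not stored at all
        result.append((name, is_bad))
    return result
```

With \fB-C\fR the decision is deferred: the flag travels with the read so that a later tool can choose whether to discard it.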
130 .SS "Read naming" | |
131 .PP | |
132 Read names are split into two halves, a prefix and a suffix. One | |
133 common prefix is stored in each and every SRF Data Block Header while | |
134 the suffix is stored in every Data Block. This combination allows for | |
135 removal of repetitive data in order to shrink the SRF file size. | |
136 .TP | |
137 \fB-n\fR \fIformat\fR | |
138 .RS | |
139 Controls the format used for creating the sequence name suffix. This | |
140 uses a printf style system of percent expansions that will be replaced | |
141 with the appropriate data. The list of percent expansions is: | |
142 .TP | |
143 %% | |
144 A literal percent character | |
145 .TP | |
146 %d | |
147 Run date (taken from parsing the current working directory) | |
148 .TP | |
149 %m | |
150 Machine name (taken from parsing the current working directory) | |
151 .TP | |
152 %r | |
153 Run number (taken from parsing the current working directory) | |
154 .TP | |
155 %l | |
156 Lane number (%L for hexadecimal encoding) | |
157 .TP | |
158 %t | |
159 Tile number (%T for hexadecimal encoding) | |
160 .TP | |
161 %x | |
162 X coordinate (%X for hexadecimal encoding) | |
163 .TP | |
164 %y | |
165 Y coordinate (%Y for hexadecimal encoding) | |
166 .TP | |
167 %c | |
168 Counter; increments by 1 for every sequence in the tile (%C for | |
169 hexadecimal encoding). | |
170 .PP | |
171 All the above format strings have an optional numerical value between | |
172 the percent and the format character. This is used to control the | |
173 field width. For example to print the X and Y coordinates to 3 | |
174 hexadecimal places we could use \fB-n "%3X:%3Y"\fR. | |
175 .PP | |
176 The default format is "\fB%x:%y\fR". | |
177 .RE | |
178 .TP | |
179 \fB-N\fR \fIformat\fR | |
180 .RS | |
181 Specifies the format string for encoding the read name prefix. It | |
182 follows the same formatting rules specified for \fB-n\fR above. | |
183 .PP | |
184 The default format is "\fB%m_%r:%l:%t:\fR". | |
185 .RE | |
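The percent expansions used by \fB-n\fR and \fB-N\fR can be illustrated with a rough sketch. This is not the io_lib implementation: the function name is hypothetical, and both the zero padding of short fields and the lowercase hexadecimal digits are assumptions, since the text above only says that the number controls the field width.

```python
import re


def expand_name(fmt, fields, pad="0"):
    """Expand %-codes in fmt using the values in fields.

    fields maps the lowercase codes (d, m, r, l, t, x, y, c) to values.
    The uppercase codes L, T, X, Y and C select hexadecimal output, and
    an optional number between '%' and the code is a minimum field width.
    The pad character is an assumption, not documented behaviour.
    """
    def repl(match):
        width, code = match.group(1), match.group(2)
        if code == "%":
            return "%"
        value = fields[code.lower()]
        if code in "LTXYC":
            text = format(value, "x")  # hexadecimal; values must be integers
        else:
            text = str(value)
        return text.rjust(int(width), pad) if width else text

    return re.sub(r"%(\d*)([%dmrltxycLTXYC])", repl, fmt)


# Made-up example values for one read.
fields = {"m": "IL4", "r": 123, "l": 4, "t": 17, "x": 255, "y": 1023}
prefix = expand_name("%m_%r:%l:%t:", fields)  # default -N format
suffix = expand_name("%x:%y", fields)         # default -n format
```

With these made-up values the default formats give the prefix "IL4_123:4:17:" and the suffix "255:1023", so the full read name is the concatenation of the two.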
186 .SS "Ancillary data files" | |
187 .PP | |
188 These options govern the extra files stored per tile (or strictly | |
189 speaking per SRF Data Block Header). | |
190 .TP | |
191 \fB-2\fR \fIcycle\fR | |
192 This specifies the cycle number, counting from 1, of the second read | |
193 forming a read-pair. It is used for automatic generation of filenames | |
194 in several of the options below and also for construction of the ZTR | |
195 region (REGN) chunks. | |
196 .TP | |
197 \fB-mf\fR \fIfilename\fR | |
198 The filename of the forward matrix file. If a single printf numerical | |
199 percent rule is used (such as "%d") then it will be replaced by the | |
200 lane number. When not specified the default \fIfilename\fR will be | |
201 \fI../Matrix/s_%d_02_matrix.txt\fR. | |
202 .TP | |
203 \fB-mr\fR \fIfilename\fR | |
204 The filename of the reverse matrix file - only used on paired end | |
205 runs. If a single printf numerical percent rule is used (such as "%d") | |
206 then it will be replaced by the lane number. If a second printf | |
207 percent rule is used then it will be replaced with the cycle number | |
208 that the paired read starts on. This is equivalent to the cycle number | |
209 specified in the \fB-2\fR option plus one. (The plus one comes | |
210 from using the second cycle per end for matrix calibration.) | |
211 When \fB-mr\fR is not specified the default \fIfilename\fR will be | |
212 \fI../Matrix/s_%d_%02d_matrix.txt\fR. | |
213 .TP | |
214 \fB-pf\fR \fIfilename\fR | |
215 Specifies the filename of the forward-read phasing XML file. As with | |
216 \fB-mf\fR a printf numerical percent rule will be replaced by the lane | |
217 number. The default \fIfilename\fR format is | |
218 \fIPhasing/s_%d_01_phasing.xml\fR. | |
219 .TP | |
220 \fB-pr\fR \fIfilename\fR | |
221 Specifies the filename of the reverse-read phasing XML file. As with | |
222 \fB-mr\fR the first two printf numerical percent rules will be | |
223 replaced by the lane number and the cycle number. Unlike \fB-mr\fR | |
224 though the cycle number is the value used in the \fB-2\fR option as-is | |
225 instead of plus one. The default \fIfilename\fR format is | |
226 \fIPhasing/s_%d_%02d_phasing.xml\fR. | |
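How the default matrix and phasing filenames follow from the lane number and the second-read cycle number can be sketched as below. The function name and dictionary keys are hypothetical; only the filename templates and the "plus one" rule for the reverse matrix come from the option descriptions above.

```python
def default_ancillary_files(lane, second_read_cycle=None):
    """Build the default matrix and phasing filenames for one lane.

    second_read_cycle is the first cycle of the second read on a
    paired-end run, or None for a single-end run.
    """
    files = {
        "matrix_fwd": "../Matrix/s_%d_02_matrix.txt" % lane,
        "phasing_fwd": "Phasing/s_%d_01_phasing.xml" % lane,
    }
    if second_read_cycle is not None:
        # The reverse matrix uses the second cycle of the read: cycle + 1.
        files["matrix_rev"] = "../Matrix/s_%d_%02d_matrix.txt" % (
            lane, second_read_cycle + 1)
        # The reverse phasing file uses the cycle number as-is.
        files["phasing_rev"] = "Phasing/s_%d_%02d_phasing.xml" % (
            lane, second_read_cycle)
    return files
```

For lane 4 with the second read starting at cycle 37, this gives ../Matrix/s_4_38_matrix.txt and Phasing/s_4_37_phasing.xml for the reverse read.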
227 .SS "Other options" | |
228 .TP | |
229 \fB-o\fR \fIsrf_filename\fR | |
230 Specifies the output filename to write the SRF data to. Defaults to | |
231 "traces.srf". | |
232 .TP | |
233 \fB-i\fR | |
234 Indicates that an index should be appended to the SRF file. This | |
235 allows for random access based on the sequence name. | |
236 .TP | |
237 \fB-d\fR | |
238 Enable dots-mode. This outputs a full-stop per input tile. Most useful | |
239 in conjunction with quiet mode. Default is off. | |
240 .TP | |
241 \fB-q\fR | |
242 Quiet mode. Do not output commentary on which tile is being processed | |
243 and the metrics about it. Default off. | |
244 | |
245 .SH "EXAMPLES" | |
246 .PP | |
247 To store lane 4 from a paired end run with raw traces, no | |
248 processed data and calibrated confidence values: | |
249 .PP | |
250 .nf | |
251 # From Bustard directory | |
252 illumina2srf -o all.srf -r -P \\ | |
253 -qf GERALD*/s_4_1_sequence.txt \\ | |
254 -qr GERALD*/s_4_2_sequence.txt \\ | |
255 s_4_*_seq.txt | |
256 .fi | |
257 | |
258 .PP | |
259 To store and index only processed traces with chastity >= 0.6 | |
260 .PP | |
261 .nf | |
262 illumina2srf -o s4.srf -i -c 0.6 s_4_*_seq.txt | |
263 .fi | |
264 | |
265 .SH "CAVEATS" | |
266 .PP | |
267 There are many mutually exclusive options, some of which may be for | |
268 processing file formats that no longer exist. This is due to the | |
269 history of the program and the rapidly changing nature of the files | |
270 being processed. Some future culling of options and file formats can | |
271 be expected. | |
272 .PP | |
273 Some assumptions are made as to the directory layout and the ability | |
274 to parse the run folder directory name. There are currently no ways to | |
275 override some of this information, including run date, run number and | |
276 GAPipeline program version numbers. | |
277 | |
278 .SH "AUTHOR" | |
279 .PP | |
280 James Bonfield, Wellcome Trust Sanger Institute |