Mercurial > repos > dawe > srf2fastq
diff srf2fastq/io_lib-1.12.2/man/man1/illumina2srf.1 @ 0:d901c9f41a6a default tip
Migrated tool version 1.0.1 from old tool shed archive to new tool shed repository
author | dawe |
---|---|
date | Tue, 07 Jun 2011 17:48:05 -0400 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/srf2fastq/io_lib-1.12.2/man/man1/illumina2srf.1 Tue Jun 07 17:48:05 2011 -0400 @@ -0,0 +1,280 @@ +.TH illumina2srf 1 "September 29" "" "Staden io_lib" + +.SH "NAME" + +.PP +.BR illumina2srf +\- Builds an SRF file from an Illumina/Solexa GA run folder. + +.SH "SYNOPSIS" +.PP +\fBillumina2srf\fR [\fIoptions\fR] \fItile_seq_file\fR ... + +.SH "DESCRIPTION" +.PP +\fBillumina2srf\fR converts the Illumina GA-pipeline run folder output +into an SRF file. It should be run from the +Bustard\fI<version><date>\fR directory. It has a wealth of options, +listed below, although many have defaults and may be ommitted if the +run folder follows the standard directory layout. The arguments, after +the options, should be the filenames of the sequence files, eg +\fIs_8_*_seq.txt\fR. All other filenames are derived from the _seq.txt +filenames. +.PP +The main structure of an SRF file is as a container, much like zip or +tar. The contents however may be split into variable and common +components allowing for better compression. For \fBillumina2srf\fR +that means that we store trace data in ZTR format with common ZTR +chunks (text identifiers such as base-caller name and version, matrix +files and compression specifications) in an SRF \fIData Block +Header\fR and variable components (sequence, quality and traces) in +ZTR chunks held within an SRF \fIData Block\fR. Typically we have +10,000 Data Blocks per Data Block Header. +.PP +The most major decision in producing the SRF file is what data to put +in it. By default the program writes the sequence and probability +values along with the "processed" trace intensities. In GAPipeline +v1.0 and earlier these are in the \fI_seq.txt\fR, \fI_prb.txt\fR and +\fI_sig2.txt\fR files held within the main Bustard directory. In +addition to these the \fB-r\fR option requests storage of the "raw" +trace intensities, comprising both the pre-processed intensities and +noise estimates from the Firecrest \fI_int.txt\fR and \fI_nse.txt\fR +files respectively. To store only raw intensities, skipping processed +data, specify the \fB-r -P\fR options. Finally the \fB-I\fR option can +be used to store data from IPAR format files. +.PP +Confidence values have been a source of large variation over the +pipeline releases. In GAPipeline 1.0 and earlier the \fI_prb.txt\fR +files in the Bustard directory contain four quality values per base +encoded using a log-odds system: 10*log(P/(1-P)). In addition to this +there are various calibrated formats in the GERALD directory with one +Phred scale value per base. See the \fB-qf\fR, \fB-qr\fR and \fB-qc\fR +parameters. +.PP +There are a number of smaller ancillary data files that get stored +too. As there is no per-lane or per-run storage mechanism in +these are added for every SRF Data Block Header of which there may be +several per tile. However the overhead in duplicating this data is not +significant given the size of the individual SRF Data Blocks. The +ancillary data files also stored are \fI.params\fR files (for both Bustard +and Firecrest), matrices (specified using \fB-mf\fR and \fB-mr\fR) and +phasing XML files (\fB-pf\fR and \fB-pr\fR). + +.SH "OPTIONS" +.PP +.SS "Trace data-source options" +.TP +\fB-r\fR, \fB-R\fR +Specifies to store (\fB-r\fR) or not to store (\fB-R\fR - the default) +"raw" data. This is currently comprised of the contents of the +\fI_int.txt\fR and \fI_nse.txt\fR files in the Firecrest directory. +.TP +\fB-p\fR, \fB-P\fR +Specifies to store (\fB-p\fR - the default) or not to store (\fB-P\fR) +the "processed" data. This is the contents of the \fI_sig2.txt\fR +files in the Bustard directory. +.TP +\fB-u\fR +Deprecated. Older GAPipeline releases created \fI_sig.txt\fR files +holding semi-processed data with compensation for the dye spectral +overlap, but before phasing correction steps. The \fB-u\fR argument +indicates that the processed data should be taken from these files +instead of \fI_sig2.txt\fR. +.TP +\fB-I\fR +Reads \fIIPAR\fR files instead of the raw trace data files. These are +a different format used by the incremental processing software when +the pipeline is run on the instrument control PC itself. +.SS "Quality value data-source options" +.TP +\fB-qf\fR \fIfilename\fR +Specifies the filename of the calibrated quality values for the +forward-read or both the forward and reverse read combined if +appropriate. \fIfilename\fR should be in Illumina's fastq derivative +format, with quality values stored as ASCII 64 plus the log-odds +score. +.TP +\fB-qr\fR \fIfilename\fR +If the calibrated fastq files are split into forward and reverse files +then \fIfilename\fR specifies the reverse sequences. Otherwise we +assume they are tacked onto the end of the forward sequences specified +in \fB-qf\fR. Like the former file, this should be in Illumina's +fastq-like format. +.TP +\fB-qc\fR \fIdirectory\fR +This is an alternative to the \fB-qf\fR and \fB-qr\fR options above +and is mutually exclusive with them. This specifies that the +calibrated data should come from files named +"\fIdirectory\fR/s_%d_qcal.txt" where "%d" is replaced by the current +tile number. + +.SS "Filtering options" +.TP +\fB-c\fR \fIvalue\fR +Only store traces that have a "chastity" score >= \fIValue\fR. +This is mutually exclusive with the \fB-C\fR option. +.TP +\fB-C\fR \fIvalue\fR +Until the -c option, traces with a "chastity" score < \fIValue\fR are +still stored in the SRF file but are marked as bad reads +instead. \fBsrf2fasta\fR and \fBsrf2fastq\fR have options to +subsequently filter out bad reads using this flag. +This is mutually exclusive with the \fB-c\fR option. +.TP +\fB-s\fR \fIN\fR +This skips the first \fIN\fR cycles of a trace (including signal, +sequence and quality values) when writing it to an SRF file. The +purpose of this is to remove primer bases, but it is not +recommended. Instead the SRF file should be using the ZTR region chunk +(REGN) to indicate which potion of a trace is valid. +.SS "Read naming" +.PP +Read names are split into two halves, a prefix and a suffix. One +common prefix is stored in each and every SRF Data Block Header while +the suffix is stored in every Data Block. This combination allows for +removal of repetitive data in order to shrink the SRF file size. +.TP +\fB-n\fR \fIformat\fR +.RS +Controls the format used for creating the sequence name suffix. This +uses a printf style system of percent expansions that will be replaced +with the appropriate data. The list of percent expansions are: +.TP +%% +A literal percent character +.TP +%d +Run date (taken from parsing the current working directory) +.TP +%m +Machine name (taken from parsing the current working directory) +.TP +%r +Run number (taken from parsing the current working directory) +.TP +%l +lane number (%L for hexidecimal encoding) +.TP +%t +tile number (%T for hexidecimal encoding) +.TP +%x +X coordinate (%X for hexidecimal encoding) +.TP +%y +Y coordinate (%Y for hexidecimal encoding) +.TP +%c +Counter; increments by 1 for every sequence in the tile (%C for +hexidecimal encoding). +.PP +All the above format strings have an optional numerical value between +the percent and the format character. This is used to control the +field width. For example to print the X and Y coordinates to 3 +hexidecimal places we could use \fB-n "%3X:%3Y"\fR. +.PP +The default format is "\fB%x:%y\fR". +.RE +.TP +\fB-N\fR \fIformat\fR +.RS +Specifies the format string for encoding the reading name prefix. It +follows the same formatting rules specified in the \fB-n\fR above. +.PP +The default format is "\fB%m_%r:%l:%t:\fR". +.RE +.SS "Ancillary data files" +.PP +These options govern the extra files stored per tile (or strictly +speaking per SRF Data Block Header). +.TP +\fB-2\fR \fIcycle\fR +This specifies the cycle number, counting from 1, of the second read +forming a read-pair. It is used for automatic generation of filenames +in several of the options below and also for construction of the ZTR +region (REGN) chunks. +.TP +\fB-mf\fR \fIfilename\fR +The filename of the forward matrix file. If a single printf numerical +percent rule is used (such as "%d") then it will be replaced by the +lane number. When not specified the default \fIfilename\fR will be +\fI../Matrix/s_%d_02_matrix.txt\fR. +.TP +\fB-mr\fR \fIfilename\fR +The filename of the reverse matrix file - only used on paired end +runs. If a single printf numerical percent rule is used (such as "%d") +then it will be replaced by the lane number. If a second printf +percent rule is used then it will be replaced with the cycle number +that the paired read starts on. This is equivalent to the cycle number +specified in the \fB-2\fR option plus one. (The plus one comes +from using the second cycle per end for matrix calibration.) +When \fB-mr\fR is not specified the default \fIfilename\fR will be +\fI../Matrix/s_%d_%02d_matrix.txt\fR. +.TP +\f-pf\fR \fIfilename\fR +Specifies the filename of the forward-read phasing XML file. As with +\fR-mf\fR a printf numerical percent rule will be replaced by the lane +number. The default \fIfilename\fR format is +\fIPhasing/s_%d_01_phasing.xml\fR. +.TP +\f-pr\fR \fIfilename\fR +Specifies the filename of the reverse-read phasing XML file. As with +\fR-mr\fR the first two printf numerical percent rules will be +replaced by the lane number and the cycle number. Unlike \fB-mr\fR +though the cycle number is the value used in the \fB-c\fR option as-is +instead of plus one. The default \fIfilename\fR format is +\fIPhasing/s_%d_%02d_phasing.xml\fR. +.SS "Other options" +.TP +\fB-o\fR \fIsrf_filename\fR +Specifies the output filename to write the SRF data too. Defaults to +"traces.srf". +.TP +\fB-i\fR +Indicates that an index should be appended to the SRF file. This +allows for random access based on the sequence name. +.TP +\fB-d\fR +Enable dots-mode. This outputs a full-stop per input tile. Most useful +in conjunction with quiet mode. Default is off. +.TP +\fB-q\fR +Quiet mode. Do not output commentary on which tile is being processed +and the metrics about it. Default off. + +.SH "EXAMPLES" +.PP +To store a lane 4 from a paired end run with raw traces, no +processed data and calibrated confidence values. +.PP +.nf + # From Bustard directory + illumina2srf -o all.srf -r -P \\ + -qf GERALD*/s_4_1_sequence.txt \\ + -qr GERALD*/s_4_2_sequence.txt \\ + s_4_*_seq.txt +.fi + +.PP +To store and index only processed traces with chastity >= 0.6 +.PP +.nf + illumina2srf -o s4.srf -c 0.6 s_4_*_seq.txt +.fi + +.SH "CAVEATS" +.PP +There are many mutually exclusive options, some of which may be for +processing file formats that no longer exist. This is due to the +history of the program and the rapidly changing nature of the files +being processed. Some future culling of options and file formats can +be expected. +.PP +Some assumptions are made as to the directory layout and the ability +to parse the run folder directory name. There are currently no ways to +override some of this information, including run date, run number and +GAPipeline program version numbers. + +.SH "AUTHOR" +.PP +James Bonfield, Wellcome Trust Sanger Institute