diff ezBAMQC/src/htslib/faidx.5 @ 0:dfa3745e5fd8
Uploaded
author | youngkim |
---|---|
date | Thu, 24 Mar 2016 17:12:52 -0400 |
parents | |
children | |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/ezBAMQC/src/htslib/faidx.5 Thu Mar 24 17:12:52 2016 -0400
@@ -0,0 +1,147 @@
+'\" t
+.TH faidx 5 "August 2013" "htslib" "Bioinformatics formats"
+.SH NAME
+faidx \- an index enabling random access to FASTA files
+.\"
+.\" Copyright (C) 2013 Genome Research Ltd.
+.\"
+.\" Author: John Marshall <jm18@sanger.ac.uk>
+.\"
+.\" Permission is hereby granted, free of charge, to any person obtaining a
+.\" copy of this software and associated documentation files (the "Software"),
+.\" to deal in the Software without restriction, including without limitation
+.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
+.\" and/or sell copies of the Software, and to permit persons to whom the
+.\" Software is furnished to do so, subject to the following conditions:
+.\"
+.\" The above copyright notice and this permission notice shall be included in
+.\" all copies or substantial portions of the Software.
+.\"
+.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+.\" DEALINGS IN THE SOFTWARE.
+.\"
+.SH SYNOPSIS
+.IR file.fa .fai,
+.IR file.fasta .fai
+.SH DESCRIPTION
+Using an \fBfai index\fP file in conjunction with a FASTA file containing
+reference sequences enables efficient access to arbitrary regions within
+those reference sequences.
+The index file typically has the same filename as the corresponding FASTA
+file, with \fB.fai\fP appended.
+.P
+An \fBfai index\fP file is a text file consisting of lines each with
+five TAB-delimited columns:
+.TS
+lbl.
+NAME	Name of this reference sequence
+LENGTH	Total length of this reference sequence, in bases
+OFFSET	Offset within the FASTA file of this sequence's first base
+LINEBASES	The number of bases on each line
+LINEWIDTH	The number of bytes in each line, including the newline
+.TE
+.P
+The \fBNAME\fP and \fBLENGTH\fP columns contain the same
+data as would appear in the \fBSN\fP and \fBLN\fP fields of a
+SAM \fB@SQ\fP header for the same reference sequence.
+.P
+The \fBOFFSET\fP column contains the offset within the FASTA file, in bytes
+starting from zero, of the first base of this reference sequence, i.e., of
+the character following the newline at the end of the "\fB>\fP" header line.
+Typically the lines of a \fBfai index\fP file appear in the order in which the
+reference sequences appear in the FASTA file, so \fB.fai\fP files are typically
+sorted according to this column.
+.P
+The \fBLINEBASES\fP column contains the number of bases in each of the sequence
+lines that form the body of this reference sequence, apart from the final line
+which may be shorter.
+The \fBLINEWIDTH\fP column contains the number of \fIbytes\fP in each of
+the sequence lines (except perhaps the final line), thus differing from
+\fBLINEBASES\fP in that it also counts the bytes forming the line terminator.
+.SS FASTA Files
+In order to be indexed with \fBsamtools faidx\fP, a FASTA file must be a text
+file of the form
+.LP
+.RS
+.RI > name
+.RI [ description ...]
+.br
+ATGCATGCATGCATGCATGCATGCATGCAT
+.br
+GCATGCATGCATGCATGCATGCATGCATGC
+.br
+ATGCAT
+.br
+.RI > name
+.RI [ description ...]
+.br
+ATGCATGCATGCAT
+.br
+GCATGCATGCATGC
+.br
+[...]
+.RE
+.LP
+In particular, each reference sequence must be "well-formatted", i.e., all
+of its sequence lines must be the same length, apart from the final sequence
+line which may be shorter.
+(While this sequence line length must be the same within each sequence,
+it may vary between different reference sequences in the same FASTA file.)
+.P
+This also means that although the FASTA file may have Unix- or Windows-style
+or other line termination, the newline characters present must be consistent,
+at least within each reference sequence.
+.P
+The \fBsamtools\fP implementation uses the first word of the "\fB>\fP" header
+line text (i.e., up to the first whitespace character) as the \fBNAME\fP column.
+At present, there may be no whitespace between the
+">" character and the \fIname\fP.
+.SH EXAMPLE
+For example, given this FASTA file
+.LP
+.RS
+>one
+.br
+ATGCATGCATGCATGCATGCATGCATGCAT
+.br
+GCATGCATGCATGCATGCATGCATGCATGC
+.br
+ATGCAT
+.br
+>two another chromosome
+.br
+ATGCATGCATGCAT
+.br
+GCATGCATGCATGC
+.br
+.RE
+.LP
+formatted with Unix-style (LF) line termination, the corresponding fai index
+would be
+.RS
+.TS
+lnnnn.
+one	66	5	30	31
+two	28	98	14	15
+.TE
+.RE
+.LP
+If the FASTA file were formatted with Windows-style (CR-LF) line termination,
+the fai index would be
+.RS
+.TS
+lnnnn.
+one	66	6	30	32
+two	28	103	14	16
+.TE
+.RE
+.SH SEE ALSO
+.IR samtools (1)
+.TP
+http://en.wikipedia.org/wiki/FASTA_format
+Further description of the FASTA format
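
The man page added by this changeset describes the five index columns and the "well-formatted" requirement, but not the arithmetic they enable. The sketch below is one way to read that description in Python: it builds index rows from in-memory FASTA bytes and maps a 0-based base position to a byte offset. It is not htslib's or samtools' implementation. `FaiEntry`, `build_fai`, `byte_offset` and `fetch` are hypothetical names introduced here for illustration, and the code assumes the input is already well-formatted (it does not check line lengths).

```python
from typing import List, NamedTuple


class FaiEntry(NamedTuple):
    """One index row; field order follows the five columns in the man page."""
    name: str
    length: int     # total bases in the sequence
    offset: int     # byte offset of the first base
    linebases: int  # bases per full sequence line
    linewidth: int  # bytes per full sequence line, terminator included


def build_fai(fasta: bytes) -> List[FaiEntry]:
    """Build index rows from FASTA bytes (assumes well-formatted input)."""
    entries: List[FaiEntry] = []
    cur = None  # [name, length, offset, linebases, linewidth] in progress
    pos = 0     # byte offset of the start of the current line
    for raw in fasta.splitlines(keepends=True):
        line = raw.rstrip(b"\r\n")
        if line.startswith(b">"):
            if cur is not None:
                entries.append(FaiEntry(*cur))
            # NAME is the header text up to the first whitespace character.
            words = line[1:].split(None, 1)
            name = words[0].decode("ascii") if words else ""
            cur = [name, 0, pos + len(raw), 0, 0]
        elif cur is not None and line:
            if cur[3] == 0:
                # The first sequence line fixes LINEBASES and LINEWIDTH.
                cur[3], cur[4] = len(line), len(raw)
            cur[1] += len(line)
        pos += len(raw)
    if cur is not None:
        entries.append(FaiEntry(*cur))
    return entries


def byte_offset(entry: FaiEntry, pos: int) -> int:
    """Byte offset in the FASTA file of 0-based base position `pos`."""
    full_lines, within_line = divmod(pos, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + within_line


def fetch(fasta: bytes, entry: FaiEntry, start: int, end: int) -> str:
    """Return bases [start, end) of a sequence, 0-based and half-open."""
    if not 0 <= start < end <= entry.length:
        raise ValueError("region outside sequence")
    lo = byte_offset(entry, start)
    hi = byte_offset(entry, end - 1) + 1
    chunk = fasta[lo:hi]
    # Drop any line terminators that fall inside the requested byte range.
    return chunk.replace(b"\r", b"").replace(b"\n", b"").decode("ascii")


if __name__ == "__main__":
    # The example FASTA file from the man page, with LF line termination.
    example = b"\n".join([
        b">one",
        b"ATGCATGCATGCATGCATGCATGCATGCAT",
        b"GCATGCATGCATGCATGCATGCATGCATGC",
        b"ATGCAT",
        b">two another chromosome",
        b"ATGCATGCATGCAT",
        b"GCATGCATGCATGC",
    ]) + b"\n"
    for entry in build_fai(example):
        print("\t".join(str(field) for field in entry))
```

Run against the man page's example file, this should reproduce the LF table (and the CR-LF table if the terminators are swapped). With a real file, `fetch` would seek to `lo` and read `hi - lo` bytes instead of slicing an in-memory buffer, which is where the index saves work on large references.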