diff xenome-1.0.1-r/xenome.1 @ 0:6d87470d68aa draft default tip

Uploaded
author sangok
date Thu, 23 Apr 2020 08:32:34 -0400
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xenome-1.0.1-r/xenome.1	Thu Apr 23 08:32:34 2020 -0400
@@ -0,0 +1,488 @@
+.TH xenome 1 "September 12, 2012" "Xenome User Manual"
+.SH NAME
+.PP
+xenome - a tool for classifying reads from xenograft sources.
+.PP
+Version 1.0.1
+.SH SYNOPSIS
+.PP
+xenome index -T 8 -P idx -H mouse.fa -G human.fa
+.PP
+xenome classify -T 8 -P idx \[em]pairs \[em]host-name mouse
+\[em]graft-name human -i in_1.fastq -i in_2.fastq
+.PP
+xenome help
+.SH DESCRIPTION
+.PP
+Shotgun sequence read data derived from xenograft material contains
+a mixture of reads arising from the host and reads arising from the
+graft.
+Xenome is an application for classifying the read mixture to
+separate the two, allowing for more precise analysis to be
+performed.
+.PP
+Xenome uses host and graft reference sequences to characterise the
+set of all possible k-mers according to whether they belong to:
+.IP \[bu] 2
+only the graft (and NOT the host)
+.IP \[bu] 2
+only the host (and NOT the graft)
+.IP \[bu] 2
+both references
+.IP \[bu] 2
+neither reference
+.IP \[bu] 2
+the subset of the host (or graft) k-mers which is one base
+substitution away from being in the graft (or host) - we call these
+k-mers \[lq]marginal\[rq]
+.PP
+Given a read, or read pair, xenome will calculate which of the
+above categories its k-mers belong to, and classify it as one of:
+graft, host, both, neither, or ambiguous.
+.PP
+Xenome has two distinct stages, which are embodied in two separate
+commands: `index' and `classify'.
+Before reads can be classified, an index must be constructed from
+the graft and host reference sequences.
+The references must be in FASTA format, and may optionally be
+compressed (gzip).
+.PP
+\f[CR]
+      xenome\ index\ -M\ 24\ -T\ 8\ -P\ idx\ -H\ mouse.fa\ -G\ human.fa
+\f[]
+.PP
+A xenome index consists of a number of related files which can be
+identified by a user-specified prefix, e.g.\ `idx' in the above
+command.
+The prefix may contain `/' characters, allowing the index to be in
+a sub-directory.
+(Any such sub-directory must already exist - xenome will not create
+it.)
+For example, the set of files comprising an index with prefix `idx'
+are:
+.PP
+\f[CR]
+      idx-both.header
+      idx-both.kmers-d0
+      idx-both.kmers-d1
+      idx-both.kmers.header
+      idx-both.kmers.high-bits
+      idx-both.kmers.low-bits.lwr
+      idx-both.kmers.low-bits.upr
+      idx-both.lhs-bits
+      idx-both.rhs-bits
+\f[]
+.PP
+Once an index is available, reads can be classified according to
+whether they appear to contain graft or host material.
+In the simplest case, Xenome can classify each read from a single
+source file individually.
+.PP
+\f[CR]
+      xenome\ classify\ -P\ idx\ -i\ in.fastq\ 
+\f[]
+.PP
+This step produces a file for each read category, containing all of
+the reads which have been assigned that classification:
+.PP
+\f[CR]
+      ambiguous.fastq
+      both.fastq
+      graft.fastq
+      host.fastq
+      neither.fastq
+\f[]
+.PP
+Input files are base-space reads in FASTA or FASTQ format or in a
+format with one read per line and in either plain text or
+compressed format (gzip).
+.PP
+The files produced are in the same format as the input file, with
+all of the input read data preserved.
+i.e.\ if the input reads are in FASTQ format, the reads written to
+each of the output files will also be in FASTQ format.
+.PP
+Multiple input files may be specified, but all inputs in the same
+format will be written to the same set of output files.
+.PP
+\f[CR]
+      xenome\ classify\ -P\ idx\ -i\ inA.fastq\ -i\ inB.fastq\ -I\ inC.fasta
+\f[]
+.PP
+The above will result in the following set of files:
+.PP
+\f[CR]
+      ambiguous.fasta
+      ambiguous.fastq
+      both.fasta
+      both.fastq
+      graft.fasta
+      graft.fastq
+      host.fasta
+      host.fastq
+      neither.fasta
+      neither.fastq
+\f[]
+.PP
+Each of the FASTQ files contains a mixture of reads from inA.fastq
+and inB.fastq.
+The FASTA files contain reads from inC.fasta.
+.PP
+If the combining of input reads from separate files is not desired,
+xenome should be run separately for each input.
+The output from different runs can be distinguished by prefixing
+the filenames with a distinct string.
+.PP
+\f[CR]
+      xenome\ classify\ -P\ idx\ -i\ inA.fastq\ --output-filename-prefix\ A
+      xenome\ classify\ -P\ idx\ -i\ inB.fastq\ --output-filename-prefix\ B
+\f[]
+.PP
+Running these two commands yields:
+.PP
+\f[CR]
+      A_ambiguous.fastq
+      A_both.fastq
+      A_graft.fastq
+      A_host.fastq
+      A_neither.fastq
+      B_ambiguous.fastq
+      B_both.fastq
+      B_graft.fastq
+      B_host.fastq
+      B_neither.fastq
+\f[]
+.PP
+Xenome can also process pairs of reads.
+.PP
+\f[CR]
+      xenome\ classify\ -P\ idx\ --pairs\ -i\ in_1.fastq\ -i\ in_2.fastq
+\f[]
+.PP
+This results in a pair of files for each read category.
+The two reads of each pair are written to the corresponding `_1'
+and `_2' files respectively.
+.PP
+\f[CR]
+      ambiguous_1.fastq
+      ambiguous_2.fastq
+      both_1.fastq
+      both_2.fastq
+      graft_1.fastq
+      graft_2.fastq
+      host_1.fastq
+      host_2.fastq
+      neither_1.fastq
+      neither_2.fastq
+\f[]
+.PP
+If desired, more specific names can be used in place of `host' and
+`graft'.
+.PP
+\f[CR]
+      xenome\ classify\ -P\ idx\ -i\ in.fastq\ --graft-name\ human\ --host-name\ mouse
+\f[]
+.PP
+This will cause xenome to produce the following files.
+.PP
+\f[CR]
+      ambiguous.fastq
+      both.fastq
+      human.fastq
+      mouse.fastq
+      neither.fastq
+\f[]
+.PP
+In addition to generating sets of output files, the classify
+command produces statistics about the number and proportion of
+reads assigned to each category.
+These are printed to standard out at the end of a run and look as
+follows:
+.PP
+\f[CR]
+      Statistics
+      B\ \ \ \ \ \ \ G\ \ \ \ \ \ \ H\ \ \ \ \ \ \ M\ \ \ \ \ \ \ count\ \ \ \ \ percent\ \ \ class
+      0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1900\ \ \ \ \ \ 0.938267\ \ "neither"
+      0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 21\ \ \ \ \ \ \ \ 0.0103703\ "both"
+      0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 28491\ \ \ \ \ 14.0696\ \ \ "definitely\ host"
+      0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 7366\ \ \ \ \ \ 3.63751\ \ \ "probably\ host"
+      0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 91895\ \ \ \ \ 45.38\ \ \ \ \ "definitely\ graft"
+      0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 30059\ \ \ \ \ 14.8439\ \ \ "probably\ graft"
+      0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 282\ \ \ \ \ \ \ 0.139259\ \ "ambiguous"
+      0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 330\ \ \ \ \ \ \ 0.162962\ \ "ambiguous"
+      1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 2878\ \ \ \ \ \ 1.42123\ \ \ "both"
+      1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 254\ \ \ \ \ \ \ 0.125431\ \ "probably\ both"
+      1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 610\ \ \ \ \ \ \ 0.301233\ \ "definitely\ host"
+      1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 5815\ \ \ \ \ \ 2.87159\ \ \ "probably\ host"
+      1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 3843\ \ \ \ \ \ 1.89777\ \ \ "definitely\ graft"
+      1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 27775\ \ \ \ \ 13.716\ \ \ \ "probably\ graft"
+      1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 99\ \ \ \ \ \ \ \ 0.0488886\ "ambiguous"
+      1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 883\ \ \ \ \ \ \ 0.436047\ \ "ambiguous"
+      
+      Summary
+      count\ \ \ \ \ percent\ \ \ class
+      153572\ \ \ \ 75.8377\ \ \ "graft"
+      42282\ \ \ \ \ 20.8799\ \ \ "host"
+      3153\ \ \ \ \ \ 1.55703\ \ \ "both"
+      1900\ \ \ \ \ \ 0.938267\ \ "neither"
+      1594\ \ \ \ \ \ 0.787157\ \ "ambiguous"
+\f[]
+.PP
+Both tables contain a single heading line, followed by rows of
+TAB-separated elements; a format suitable for loading into R or a
+spreadsheet.
+.PP
+Each row represents the number and proportion of reads assigned to
+a particular class.
+The B, G, H, and M fields represent the presence (1) or absence (0)
+of k-mers belonging to the both, graft, host and marginal k-mer
+subsets, according to the reference index.
+.PP
+The Statistics table contains 16 rows; one for each possible
+combination of k-mer classes present within a read.
+The first row of the above table, indicates that for the given
+input, 1,900 reads (or pairs) - 0.938267% of the total reads -
+contained no k-mers that belonged to the B, G, H, or M k-mer
+subsets, and are accordingly neither host nor graft reads.
+Similarly, the fourteenth line states that 27,775 reads (or pairs)
+- 13.716% of the total - contained k-mers that belong to the B, G,
+M, but not H subsets, and are therefore \[lq]probably graft\[rq]
+reads.
+.PP
+In the Summary table, the B, G, H, and M columns are removed, and
+the classes from the Statistics table have been collapsed into the
+five shown; the definitely/probably graft/host classes are combined
+into just graft/host classes.
+Notice that the different read output files, described earlier,
+correspond exactly to these classes.
+.SH OPTIONS COMMON TO ALL COMMANDS
+.PP
+The following options can be used with all of the \f[I]xenome\f[]
+commands and are therefore not listed separately for each command.
+.TP
+.B -h, --help
+Show a help message.
+.RS
+.RE
+.TP
+.B -l \f[I]FILE\f[], --log-file \f[I]FILE\f[]
+Place to write progress messages.
+Messages are only written if the -v flag is used.
+If omitted, messages are written to stderr.
+.RS
+.RE
+.TP
+.B -T \f[I]INT\f[], --num-threads \f[I]INT\f[]
+The maximum number of \f[I]worker\f[] threads to use.
+The actual number of threads used during the algorithms depends on
+each implementation.
+\f[I]xenome\f[] may use a small number of additional threads for
+performing non cpu-bound operations, such as file I/O.
+.RS
+.RE
+.TP
+.B --tmp-dir \f[I]DIRECTORY\f[]
+A directory to use for temporary files.
+This flag may be repeated in order to nominate multiple temporary
+directories.
+.RS
+.RE
+.TP
+.B -v, --verbose
+Show progress messages.
+.RS
+.RE
+.TP
+.B -V, --version
+Show the software version.
+.RS
+.RE
+.SH COMMANDS AND OPTIONS
+.SS xenome index
+.PP
+xenome index [-k \f[I]INT\f[]] [-M \f[I]INT\f[]] -P \f[I]PREFIX\f[]
+-G \f[I]FASTA-filename\f[] -H \f[I]FASTA-filename\f[]
+.PP
+Build the xenome reference index from the graft and host reference
+sequences.
+The input files must be in FASTA format.
+They may be gzip compressed, in which case the filename suffix must
+be \f[I]\&.gz\f[].
+.PP
+The k-mer size may be specified using the \f[I]-k\f[] flag.
+If omitted, xenome defaults to k=25.
+.PP
+During index construction, xenome maintains a hash table of the
+k-mers seen so far.
+When this table fills, its contents are written to disk, and the
+table is reinitialised.
+The more memory xenome can use, the less often it will need to
+write to disk, and the faster index construction will run.
+By default, xenome will limit itself to 2 GB during index
+construction.
+The -M, \[em]max-memory flag can be used to explicitly control the
+amount of memory available to xenome (in GB).
+To improve performance, this should generally be set close to the
+amount memory available in the system - having accounted for
+operating system and other overhead.
+.PP
+\f[I]OPTIONS\f[]
+.TP
+.B -k \f[I]INT\f[], --kmer-size \f[I]INT\f[]
+The k-mer size to use for building the graph: in version 1.0.0 this
+\f[I]must be an integer strictly less than 63\f[].
+If not supplied, the default value of 25 is used.
+.RS
+.RE
+.TP
+.B -M \f[I]INT\f[], --max-memory \f[I]INT\f[]
+The maximum amount of memory (in GB) of memory to use.
+Making more memory available will reduce the number of times xenome
+writes intermediate index data to disk.
+The default is 2 GB.
+.RS
+.RE
+.TP
+.B -P \f[I]PREFIX\f[], --prefix \f[I]PREFIX\f[]
+The path prefix for all generated reference index files.
+The prefix may contain directory separators (e.g.
+`/') in order to have the index files written to another directory.
+.RS
+.RE
+.TP
+.B -G \f[I]FILE\f[], --graft \f[I]FILE\f[]
+The name of the FASTA file containing the graft reference sequence.
+If the filename ends in \f[I]\&.gz\f[] it will be read as a gzip
+file.
+.RS
+.RE
+.TP
+.B -H \f[I]FILE\f[], --host \f[I]FILE\f[]
+The name of the FASTA file containing the host reference sequence.
+If the filename ends in \f[I]\&.gz\f[] it will be read as a gzip
+file.
+.RS
+.RE
+.SS xenome classify
+.PP
+xenome classify -P \f[I]PREFIX\f[] {-I \f[I]FASTA-filename\f[] | -i
+\f[I]FASTQ-filename\f[] | \[em]line-in \f[I]filename\f[]}+
+[\[em]pairs] [-M \f[I]INT\f[]] [\[em]graft-name \f[I]STRING\f[]]
+[\[em]host-name \f[I]STRING\f[]] [\[em]output-filename-prefix
+\f[I]STRING\f[]] [\[em]dont-write-reads] [\[em]preserve-read-order]
+.PP
+Classifies input reads according to a pre-computed k-mer index.
+The reads are written into separate files, according to their
+classification, and a breakdown of the number and proportion of
+reads in each class is printed.
+.PP
+If the total size of the index files is greater than available RAM,
+xenome will perform poorly.
+To overcome this, the -M, \[em]max-memory flag may be used to
+specify the maximum amount of memory (in GB) that xenome may use at
+any time.
+If this amount is less than the size of the index structures,
+xenome will (effectively) partition the index into multiple
+subsets, each no larger than the specified maximum memory size, and
+classify the reads in multiple passes - with each pass using a
+different index subset.
+The results from each passes are combined, and the result is
+produced as usual.
+If run with the -v, \[em]verbose flag, xenome will report the
+number of passes it will perform.
+Note that runtime will increase with the number of passes
+performed; the biggest increase will occur with the step from one
+pass to two.
+.PP
+\f[I]OPTIONS\f[]
+.TP
+.B -P \f[I]PREFIX\f[], --prefix \f[I]PREFIX\f[]
+The path prefix for all reference index files.
+The prefix may contain directory separators (e.g.
+`/') in order to have the index files written to another directory.
+.RS
+.RE
+.TP
+.B -I \f[I]FILE\f[], --fasta-in \f[I]FILE\f[]
+Input file in FASTA format.
+.RS
+.RE
+.TP
+.B -i \f[I]FILE\f[], --fastq-in \f[I]FILE\f[]
+Input file in FASTQ format.
+.RS
+.RE
+.TP
+.B \[em]line-in \f[I]FILE\f[]
+Input file with one read per line and no other annotation.
+.RS
+.RE
+.TP
+.B \[em]pairs
+Treat reads from consecutive input files of the same type as pairs.
+.RS
+.RE
+.TP
+.B -M \f[I]INT\f[], --max-memory \f[I]INT\f[]
+The maximum amount of memory (in GB) to use while classifying
+reads.
+If not specified, xenome will use as much memory as required to
+classify all reads in a single pass.
+When the maximum amount of memory is less than the size of the
+reference index files, xenome will need to perform multiple passes
+over the input data - increasing runtime.
+.RS
+.RE
+.TP
+.B \[em]graft-name \f[I]STRING\f[]
+The name of the graft reference to appear in filenames and
+statistics.
+If no explicit name is provided, the string \[lq]graft\[rq] is
+used.
+.RS
+.RE
+.TP
+.B \[em]host-name \f[I]STRING\f[]
+The name of the host reference to appear in filenames and
+statistics.
+If no explicit name is provided, the string \[lq]host\[rq] is used.
+.RS
+.RE
+.TP
+.B \[em]output-filename-prefix \f[I]STRING\f[]
+An optional prefix to apply to all output read filenames.
+The prefix is separated from the rest of the filename by an
+underscore (`_').
+.RS
+.RE
+.TP
+.B \[em]dont-write-reads
+The reads will not be written to any files after classification,
+and none of the usual per-category output files will be created.
+The classification statistics will still be printed to standard
+out.
+.RS
+.RE
+.TP
+.B \[em]preserve-read-order
+The relative ordering of reads within each output file will be the
+same as that in the input files.
+i.e.\ if read \f[I]r1\f[] precedes \f[I]r2\f[] in a single output
+file, then \f[I]r1\f[] also precedes \f[I]r2\f[] in the input.
+Note: If this flag is specified, the -T/\[em]num-threads flag is
+ignored, and xenome will only operate with a single worker thread.
+.RS
+.RE
+.SS xenome help
+.PP
+xenome help
+.PP
+Prints a summary of all of the xenome commands.
+.PP
+\[em]
+.SH FUTURE RELEASES
+.PP
+Bzip support will be introduced.
+.SH AUTHORS
+Bryan Beresford-Smith, Andrew Bromage, Thomas Conway, Jeremy Wazny.
+