author sangok
date Thu, 23 Apr 2020 08:32:34 -0400
.TH xenome 1 "September 12, 2012" "Xenome User Manual"
.SH NAME
.PP
xenome - a tool for classifying reads from xenograft sources.
.PP
Version 1.0.1
.SH SYNOPSIS
.PP
xenome index -T 8 -P idx -H mouse.fa -G human.fa
.PP
xenome classify -T 8 -P idx --pairs --host-name mouse
--graft-name human -i in_1.fastq -i in_2.fastq
.PP
xenome help
.SH DESCRIPTION
.PP
Shotgun sequence read data derived from xenograft material contains
a mixture of reads arising from the host and reads arising from the
graft.
Xenome is an application for classifying the read mixture to
separate the two, allowing for more precise analysis to be
performed.
.PP
Xenome uses host and graft reference sequences to characterise the
set of all possible k-mers according to whether they belong to:
.IP \[bu] 2
only the graft (and NOT the host)
.IP \[bu] 2
only the host (and NOT the graft)
.IP \[bu] 2
both references
.IP \[bu] 2
neither reference
.IP \[bu] 2
the subset of the host (or graft) k-mers which is one base
substitution away from being in the graft (or host) - we call these
k-mers \[lq]marginal\[rq]
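.PP
The categories above can be illustrated with a toy sketch in Python.
It is not xenome's implementation: it assumes short upper-case ACGT
strings, ignores reverse complements, holds everything in memory, and
assumes that marginal k-mers are reported separately from the
host-only and graft-only sets.
All function names are hypothetical.
.PP
.nf
```python
def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def neighbours(kmer):
    """Yield every k-mer exactly one base substitution away from kmer."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def categorise(host_seq, graft_seq, k):
    """Split the reference k-mers into graft, host, both and marginal sets."""
    host, graft = kmers(host_seq, k), kmers(graft_seq, k)
    both = host & graft
    host_only, graft_only = host - graft, graft - host
    # Marginal: a host-only k-mer one substitution away from a graft k-mer,
    # or a graft-only k-mer one substitution away from a host k-mer.
    marginal = {x for x in host_only if any(n in graft for n in neighbours(x))}
    marginal |= {x for x in graft_only if any(n in host for n in neighbours(x))}
    # Assumption: marginal k-mers are removed from the host/graft-only sets.
    return {
        "graft": graft_only - marginal,
        "host": host_only - marginal,
        "both": both,
        "marginal": marginal,
    }
```
.fi
.PP
Any k-mer that appears in none of the returned sets belongs to the
\[lq]neither\[rq] category.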
.PP
Given a read, or read pair, xenome will calculate which of the
above categories its k-mers belong to, and classify it as one of:
graft, host, both, neither, or ambiguous.
.PP
Xenome has two distinct stages, which are embodied in two separate
commands: `index' and `classify'.
Before reads can be classified, an index must be constructed from
the graft and host reference sequences.
The references must be in FASTA format, and may optionally be
compressed (gzip).
.PP
\f[CR]
      xenome\ index\ -M\ 24\ -T\ 8\ -P\ idx\ -H\ mouse.fa\ -G\ human.fa
\f[]
.PP
A xenome index consists of a number of related files which can be
identified by a user-specified prefix, e.g.\ `idx' in the above
command.
The prefix may contain `/' characters, allowing the index to be in
a sub-directory.
(Any such sub-directory must already exist - xenome will not create
it.)
For example, an index with prefix `idx' comprises the following
files:
.PP
\f[CR]
      idx-both.header
      idx-both.kmers-d0
      idx-both.kmers-d1
      idx-both.kmers.header
      idx-both.kmers.high-bits
      idx-both.kmers.low-bits.lwr
      idx-both.kmers.low-bits.upr
      idx-both.lhs-bits
      idx-both.rhs-bits
\f[]
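.PP
Before classifying, it can be useful to check that an index is
complete.
The following Python sketch does so under the assumption that the
listing above is the full set of index files for this version; the
helper names are hypothetical and not part of xenome.
.PP
.nf
```python
from pathlib import Path

# Suffixes copied from the file listing above (assumed to be complete).
INDEX_SUFFIXES = [
    "-both.header",
    "-both.kmers-d0",
    "-both.kmers-d1",
    "-both.kmers.header",
    "-both.kmers.high-bits",
    "-both.kmers.low-bits.lwr",
    "-both.kmers.low-bits.upr",
    "-both.lhs-bits",
    "-both.rhs-bits",
]


def expected_index_files(prefix):
    """Return the paths an index with the given prefix should consist of."""
    return [Path(str(prefix) + suffix) for suffix in INDEX_SUFFIXES]


def missing_index_files(prefix):
    """Return the expected index files that do not exist on disk."""
    return [p for p in expected_index_files(prefix) if not p.is_file()]
```
.fi
.PP
An empty result from missing_index_files("idx") suggests that the
index with prefix `idx' is complete.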
.PP
Once an index is available, reads can be classified according to
whether they appear to contain graft or host material.
In the simplest case, Xenome can classify each read from a single
source file individually.
.PP
\f[CR]
      xenome\ classify\ -P\ idx\ -i\ in.fastq\ 
\f[]
.PP
This step produces a file for each read category, containing all of
the reads which have been assigned that classification:
.PP
\f[CR]
      ambiguous.fastq
      both.fastq
      graft.fastq
      host.fastq
      neither.fastq
\f[]
.PP
Input files contain base-space reads in FASTA format, in FASTQ
format, or in a format with one read per line.
They may be plain text or compressed (gzip).
.PP
The files produced are in the same format as the input file, with
all of the input read data preserved.
For example, if the input reads are in FASTQ format, the reads
written to each of the output files will also be in FASTQ format.
.PP
Multiple input files may be specified, but all inputs in the same
format will be written to the same set of output files.
.PP
\f[CR]
      xenome\ classify\ -P\ idx\ -i\ inA.fastq\ -i\ inB.fastq\ -I\ inC.fasta
\f[]
.PP
The above will result in the following set of files:
.PP
\f[CR]
      ambiguous.fasta
      ambiguous.fastq
      both.fasta
      both.fastq
      graft.fasta
      graft.fastq
      host.fasta
      host.fastq
      neither.fasta
      neither.fastq
\f[]
.PP
Each of the FASTQ files contains a mixture of reads from inA.fastq
and inB.fastq.
The FASTA files contain reads from inC.fasta.
.PP
If the combining of input reads from separate files is not desired,
xenome should be run separately for each input.
The output from different runs can be distinguished by prefixing
the filenames with a distinct string.
.PP
\f[CR]
      xenome\ classify\ -P\ idx\ -i\ inA.fastq\ --output-filename-prefix\ A
      xenome\ classify\ -P\ idx\ -i\ inB.fastq\ --output-filename-prefix\ B
\f[]
.PP
Running these two commands yields:
.PP
\f[CR]
      A_ambiguous.fastq
      A_both.fastq
      A_graft.fastq
      A_host.fastq
      A_neither.fastq
      B_ambiguous.fastq
      B_both.fastq
      B_graft.fastq
      B_host.fastq
      B_neither.fastq
\f[]
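.PP
When there are many inputs, the per-file commands can be generated
rather than typed.
The Python sketch below builds one argument list per FASTQ file using
only the flags shown above; the helper name is hypothetical, and using
the file stem as the output prefix is an assumption (any distinct
string would do).
.PP
.nf
```python
from pathlib import Path


def classify_commands(index_prefix, inputs):
    """Build one xenome classify argument list per FASTQ input file."""
    commands = []
    for path in inputs:
        # Assumption: the file stem (inA for inA.fastq) is distinct per input.
        out_prefix = Path(path).stem
        commands.append([
            "xenome", "classify", "-P", index_prefix,
            "-i", str(path),
            "--output-filename-prefix", out_prefix,
        ])
    return commands
```
.fi
.PP
Each list can then be passed to subprocess.run(command, check=True)
to perform the run.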
.PP
Xenome can also process pairs of reads.
.PP
\f[CR]
      xenome\ classify\ -P\ idx\ --pairs\ -i\ in_1.fastq\ -i\ in_2.fastq
\f[]
.PP
This results in a pair of files for each read category.
The two reads of each pair are written to the corresponding `_1'
and `_2' files respectively.
.PP
\f[CR]
      ambiguous_1.fastq
      ambiguous_2.fastq
      both_1.fastq
      both_2.fastq
      graft_1.fastq
      graft_2.fastq
      host_1.fastq
      host_2.fastq
      neither_1.fastq
      neither_2.fastq
\f[]
.PP
If desired, more specific names can be used in place of `host' and
`graft'.
.PP
\f[CR]
      xenome\ classify\ -P\ idx\ -i\ in.fastq\ --graft-name\ human\ --host-name\ mouse
\f[]
.PP
This will cause xenome to produce the following files.
.PP
\f[CR]
      ambiguous.fastq
      both.fastq
      human.fastq
      mouse.fastq
      neither.fastq
\f[]
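.PP
The naming rules illustrated so far can be summarised in a short
Python sketch.
The function is hypothetical, and the way it combines a filename
prefix, custom names and `_1'/`_2' suffixes in a single run is an
assumption: the examples above only show each feature on its own.
.PP
.nf
```python
def output_filenames(ext="fastq", pairs=False, graft_name="graft",
                     host_name="host", prefix=None):
    """List the read files one classify run is expected to write."""
    categories = ["ambiguous", "both", graft_name, host_name, "neither"]
    # Paired runs write a _1 and a _2 file per category.
    suffixes = ["_1", "_2"] if pairs else [""]
    # An output prefix is joined to the rest of the name with an underscore.
    lead = prefix + "_" if prefix else ""
    return [lead + c + s + "." + ext for c in categories for s in suffixes]
```
.fi
.PP
For instance, output_filenames(graft_name="human", host_name="mouse")
reproduces the five filenames listed above.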
.PP
In addition to generating sets of output files, the classify
command produces statistics about the number and proportion of
reads assigned to each category.
These are printed to standard output at the end of a run and look as
follows:
.PP
\f[CR]
      Statistics
      B\ \ \ \ \ \ \ G\ \ \ \ \ \ \ H\ \ \ \ \ \ \ M\ \ \ \ \ \ \ count\ \ \ \ \ percent\ \ \ class
      0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1900\ \ \ \ \ \ 0.938267\ \ "neither"
      0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 21\ \ \ \ \ \ \ \ 0.0103703\ "both"
      0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 28491\ \ \ \ \ 14.0696\ \ \ "definitely\ host"
      0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 7366\ \ \ \ \ \ 3.63751\ \ \ "probably\ host"
      0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 91895\ \ \ \ \ 45.38\ \ \ \ \ "definitely\ graft"
      0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 30059\ \ \ \ \ 14.8439\ \ \ "probably\ graft"
      0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 282\ \ \ \ \ \ \ 0.139259\ \ "ambiguous"
      0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 330\ \ \ \ \ \ \ 0.162962\ \ "ambiguous"
      1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 2878\ \ \ \ \ \ 1.42123\ \ \ "both"
      1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 254\ \ \ \ \ \ \ 0.125431\ \ "probably\ both"
      1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 610\ \ \ \ \ \ \ 0.301233\ \ "definitely\ host"
      1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 5815\ \ \ \ \ \ 2.87159\ \ \ "probably\ host"
      1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 3843\ \ \ \ \ \ 1.89777\ \ \ "definitely\ graft"
      1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 27775\ \ \ \ \ 13.716\ \ \ \ "probably\ graft"
      1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 0\ \ \ \ \ \ \ 99\ \ \ \ \ \ \ \ 0.0488886\ "ambiguous"
      1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 1\ \ \ \ \ \ \ 883\ \ \ \ \ \ \ 0.436047\ \ "ambiguous"
      
      Summary
      count\ \ \ \ \ percent\ \ \ class
      153572\ \ \ \ 75.8377\ \ \ "graft"
      42282\ \ \ \ \ 20.8799\ \ \ "host"
      3153\ \ \ \ \ \ 1.55703\ \ \ "both"
      1900\ \ \ \ \ \ 0.938267\ \ "neither"
      1594\ \ \ \ \ \ 0.787157\ \ "ambiguous"
\f[]
.PP
Both tables contain a single heading line, followed by rows of
TAB-separated elements; a format suitable for loading into R or a
spreadsheet.
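.PP
As a minimal sketch of loading the tables programmatically, the Python
function below parses captured output with the standard csv module.
It assumes the layout shown above: a line reading Statistics or
Summary, then a heading line, then TAB-separated rows with the class
in double quotes.
The function name is hypothetical.
.PP
.nf
```python
import csv


def parse_tables(text):
    """Parse the Statistics and Summary tables into lists of dicts."""
    tables, name, header = {}, None, None
    for row in csv.reader(text.splitlines(), dialect="excel-tab"):
        row = [cell.strip() for cell in row]
        if not any(row):
            continue  # blank separator line
        if len(row) == 1 and row[0] in ("Statistics", "Summary"):
            name, header = row[0], None  # start of a new table
            tables[name] = []
        elif name is not None and header is None:
            header = row  # the single heading line
        elif name is not None:
            tables[name].append(dict(zip(header, row)))
    return tables
```
.fi
.PP
All values are returned as strings; convert the count and percent
fields with int() and float() as needed.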
.PP
Each row represents the number and proportion of reads assigned to
a particular class.
The B, G, H, and M fields represent the presence (1) or absence (0)
of k-mers belonging to the both, graft, host and marginal k-mer
subsets, according to the reference index.
.PP
The Statistics table contains 16 rows; one for each possible
combination of k-mer classes present within a read.
The first row of the above table indicates that for the given
input, 1,900 reads (or pairs) - 0.938267% of the total reads -
contained no k-mers that belonged to the B, G, H, or M k-mer
subsets, and are accordingly neither host nor graft reads.
Similarly, the fourteenth row states that 27,775 reads (or pairs)
- 13.716% of the total - contained k-mers that belong to the B, G,
and M subsets but not the H subset, and are therefore
\[lq]probably graft\[rq] reads.
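.PP
The class labels in the sample table follow a simple pattern, which
the Python sketch below captures.
The rule is inferred from the 16 rows shown above, not taken from
xenome's source, and the function name is hypothetical.
.PP
.nf
```python
def classify_flags(b, g, h, m):
    """Map B/G/H/M presence flags to the class label used in the sample table."""
    if g and h:
        return "ambiguous"  # graft-only and host-only k-mers both present
    if g or h:
        side = "graft" if g else "host"
        # Marginal k-mers weaken the call from definitely to probably.
        return ("probably " if m else "definitely ") + side
    if b and m:
        return "probably both"
    if b or m:
        return "both"
    return "neither"
```
.fi
.PP
For example, classify_flags(1, 1, 0, 1) gives the label of the
fourteenth row, and classify_flags(0, 0, 0, 0) gives that of the
first.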
.PP
In the Summary table, the B, G, H, and M columns are removed, and
the classes from the Statistics table have been collapsed into the
five shown; the definitely/probably graft/host classes are combined
into just graft/host classes.
Notice that the different read output files, described earlier,
correspond exactly to these classes.
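.PP
The collapse from the Statistics table to the Summary table can be
checked with a few lines of Python.
The counts below are transcribed from the sample output above; the
function name is hypothetical, and dropping the
\[lq]definitely\[rq]/\[lq]probably\[rq] qualifier is an inferred
rule, not xenome's code.
.PP
.nf
```python
from collections import Counter

# (count, class) pairs transcribed from the sample Statistics table.
STATISTICS = [
    (1900, "neither"), (21, "both"),
    (28491, "definitely host"), (7366, "probably host"),
    (91895, "definitely graft"), (30059, "probably graft"),
    (282, "ambiguous"), (330, "ambiguous"),
    (2878, "both"), (254, "probably both"),
    (610, "definitely host"), (5815, "probably host"),
    (3843, "definitely graft"), (27775, "probably graft"),
    (99, "ambiguous"), (883, "ambiguous"),
]


def summarise(rows):
    """Collapse detailed classes into the five summary classes."""
    totals = Counter()
    for count, label in rows:
        # Keep only the last word: "probably graft" becomes "graft".
        totals[label.split()[-1]] += count
    return totals


summary = summarise(STATISTICS)
total = sum(summary.values())
for label, count in summary.most_common():
    print(label, count, round(100 * count / total, 4))
```
.fi
.PP
The resulting counts match the Summary table: 153572 graft, 42282
host, 3153 both, 1900 neither and 1594 ambiguous.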
.SH OPTIONS COMMON TO ALL COMMANDS
.PP
The following options can be used with all of the \f[I]xenome\f[]
commands and are therefore not listed separately for each command.
.TP
.B -h, --help
Show a help message.
.RS
.RE
.TP
.B -l \f[I]FILE\f[], --log-file \f[I]FILE\f[]
Place to write progress messages.
Messages are only written if the -v flag is used.
If omitted, messages are written to stderr.
.RS
.RE
.TP
.B -T \f[I]INT\f[], --num-threads \f[I]INT\f[]
The maximum number of \f[I]worker\f[] threads to use.
The actual number of threads used by each algorithm depends on its
implementation.
\f[I]xenome\f[] may use a small number of additional threads for
performing non-CPU-bound operations, such as file I/O.
.RS
.RE
.TP
.B --tmp-dir \f[I]DIRECTORY\f[]
A directory to use for temporary files.
This flag may be repeated in order to nominate multiple temporary
directories.
.RS
.RE
.TP
.B -v, --verbose
Show progress messages.
.RS
.RE
.TP
.B -V, --version
Show the software version.
.RS
.RE
.SH COMMANDS AND OPTIONS
.SS xenome index
.PP
xenome index [-k \f[I]INT\f[]] [-M \f[I]INT\f[]] -P \f[I]PREFIX\f[]
-G \f[I]FASTA-filename\f[] -H \f[I]FASTA-filename\f[]
.PP
Build the xenome reference index from the graft and host reference
sequences.
The input files must be in FASTA format.
They may be gzip compressed, in which case the filename suffix must
be \f[I]\&.gz\f[].
.PP
The k-mer size may be specified using the \f[I]-k\f[] flag.
If omitted, xenome defaults to k=25.
.PP
During index construction, xenome maintains a hash table of the
k-mers seen so far.
When this table fills, its contents are written to disk, and the
table is reinitialised.
The more memory xenome can use, the less often it will need to
write to disk, and the faster index construction will run.
By default, xenome will limit itself to 2 GB during index
construction.
The -M, --max-memory flag can be used to explicitly control the
amount of memory available to xenome (in GB).
To improve performance, this should generally be set close to the
amount of memory available in the system, after accounting for
operating system and other overhead.
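.PP
The fill-and-flush behaviour described above can be pictured with the
following Python sketch.
It is only an illustration of the general technique: sorted runs are
kept in a list as a stand-in for files on disk, and xenome's actual
data structures and file formats are not described in this manual.
The function name and the max_table_size parameter are hypothetical.
.PP
.nf
```python
import heapq


def distinct_kmers(kmer_stream, max_table_size):
    """Yield the distinct k-mers of a stream, sorted, using a bounded table."""
    table, runs = set(), []
    for kmer in kmer_stream:
        table.add(kmer)
        if len(table) >= max_table_size:
            runs.append(sorted(table))  # stand-in for writing a run to disk
            table = set()               # reinitialise the table
    runs.append(sorted(table))          # flush whatever is left
    previous = None
    for kmer in heapq.merge(*runs):     # merge the runs, dropping duplicates
        if kmer != previous:
            yield kmer
        previous = kmer
```
.fi
.PP
A larger max_table_size produces fewer runs to write and merge, which
mirrors why giving xenome more memory speeds up index construction.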
.PP
\f[I]OPTIONS\f[]
.TP
.B -k \f[I]INT\f[], --kmer-size \f[I]INT\f[]
The k-mer size to use for building the index: in version 1.0.0 this
\f[I]must be an integer strictly less than 63\f[].
If not supplied, the default value of 25 is used.
.RS
.RE
.TP
.B -M \f[I]INT\f[], --max-memory \f[I]INT\f[]
The maximum amount of memory (in GB) to use.
Making more memory available will reduce the number of times xenome
writes intermediate index data to disk.
The default is 2 GB.
.RS
.RE
.TP
.B -P \f[I]PREFIX\f[], --prefix \f[I]PREFIX\f[]
The path prefix for all generated reference index files.
The prefix may contain directory separators (e.g.\ `/') in order to
have the index files written to another directory.
.RS
.RE
.TP
.B -G \f[I]FILE\f[], --graft \f[I]FILE\f[]
The name of the FASTA file containing the graft reference sequence.
If the filename ends in \f[I]\&.gz\f[] it will be read as a gzip
file.
.RS
.RE
.TP
.B -H \f[I]FILE\f[], --host \f[I]FILE\f[]
The name of the FASTA file containing the host reference sequence.
If the filename ends in \f[I]\&.gz\f[] it will be read as a gzip
file.
.RS
.RE
.SS xenome classify
.PP
xenome classify -P \f[I]PREFIX\f[] {-I \f[I]FASTA-filename\f[] | -i
\f[I]FASTQ-filename\f[] | --line-in \f[I]filename\f[]}+
[--pairs] [-M \f[I]INT\f[]] [--graft-name \f[I]STRING\f[]]
[--host-name \f[I]STRING\f[]] [--output-filename-prefix
\f[I]STRING\f[]] [--dont-write-reads] [--preserve-read-order]
.PP
Classifies input reads according to a pre-computed k-mer index.
The reads are written into separate files, according to their
classification, and a breakdown of the number and proportion of
reads in each class is printed.
.PP
If the total size of the index files is greater than available RAM,
xenome will perform poorly.
To overcome this, the -M, --max-memory flag may be used to
specify the maximum amount of memory (in GB) that xenome may use at
any time.
If this amount is less than the size of the index structures,
xenome will (effectively) partition the index into multiple
subsets, each no larger than the specified maximum memory size, and
classify the reads in multiple passes - with each pass using a
different index subset.
The results from all passes are combined, and the result is
produced as usual.
If run with the -v, --verbose flag, xenome will report the
number of passes it will perform.
Note that runtime will increase with the number of passes
performed; the biggest increase will occur with the step from one
pass to two.
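.PP
A rough way to anticipate the number of passes is to divide the total
size of the index files by the memory limit and round up.
The Python sketch below does this; it is an estimate only, the
function name is hypothetical, and treating 1 GB as 2**30 bytes is an
assumption.
Running xenome with the -v flag reports the actual number.
.PP
.nf
```python
import math


def estimate_passes(index_bytes, max_memory_gb):
    """Estimate classification passes as ceil(index size / memory limit)."""
    # Assumption: 1 GB means 2**30 bytes; this manual does not say.
    limit_bytes = max_memory_gb * 2**30
    # At least one pass is always needed.
    return max(1, math.ceil(index_bytes / limit_bytes))
```
.fi
.PP
The index_bytes argument could be obtained by summing the sizes of
the files that share the index prefix.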
.PP
\f[I]OPTIONS\f[]
.TP
.B -P \f[I]PREFIX\f[], --prefix \f[I]PREFIX\f[]
The path prefix for all reference index files.
The prefix may contain directory separators (e.g.\ `/') in order to
read index files located in another directory.
.RS
.RE
.TP
.B -I \f[I]FILE\f[], --fasta-in \f[I]FILE\f[]
Input file in FASTA format.
.RS
.RE
.TP
.B -i \f[I]FILE\f[], --fastq-in \f[I]FILE\f[]
Input file in FASTQ format.
.RS
.RE
.TP
.B --line-in \f[I]FILE\f[]
Input file with one read per line and no other annotation.
.RS
.RE
.TP
.B --pairs
Treat reads from consecutive input files of the same type as pairs.
.RS
.RE
.TP
.B -M \f[I]INT\f[], --max-memory \f[I]INT\f[]
The maximum amount of memory (in GB) to use while classifying
reads.
If not specified, xenome will use as much memory as required to
classify all reads in a single pass.
When the maximum amount of memory is less than the size of the
reference index files, xenome will need to perform multiple passes
over the input data - increasing runtime.
.RS
.RE
.TP
.B --graft-name \f[I]STRING\f[]
The name of the graft reference to appear in filenames and
statistics.
If no explicit name is provided, the string \[lq]graft\[rq] is
used.
.RS
.RE
.TP
.B --host-name \f[I]STRING\f[]
The name of the host reference to appear in filenames and
statistics.
If no explicit name is provided, the string \[lq]host\[rq] is used.
.RS
.RE
.TP
.B --output-filename-prefix \f[I]STRING\f[]
An optional prefix to apply to all output read filenames.
The prefix is separated from the rest of the filename by an
underscore (`_').
.RS
.RE
.TP
.B --dont-write-reads
The reads will not be written to any files after classification,
and none of the usual per-category output files will be created.
The classification statistics will still be printed to standard
output.
.RS
.RE
.TP
.B --preserve-read-order
The relative ordering of reads within each output file will be the
same as that in the input files.
i.e.\ if read \f[I]r1\f[] precedes \f[I]r2\f[] in a single output
file, then \f[I]r1\f[] also precedes \f[I]r2\f[] in the input.
Note: If this flag is specified, the -T/--num-threads flag is
ignored, and xenome will only operate with a single worker thread.
.RS
.RE
.SS xenome help
.PP
xenome help
.PP
Prints a summary of all of the xenome commands.
.SH FUTURE RELEASES
.PP
Bzip support will be introduced.
.SH AUTHORS
Bryan Beresford-Smith, Andrew Bromage, Thomas Conway, Jeremy Wazny.