Galaxy | Tool Preview

jackhmmer (version 0.1.0)

What it does

jackhmmer iteratively searches each query sequence in <seqfile> against the target sequence(s) in <seqdb>. The first iteration is identical to a phmmer search. For the next iteration, a multiple alignment of the query together with all target sequences satisfying inclusion thresholds is assembled, a profile is constructed from this alignment (identical to using hmmbuild on the alignment), and profile search of the <seqdb> is done (identical to an hmmsearch with the profile).
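The iteration described above can be pictured with a small toy loop. This is only a sketch, not HMMER's implementation: `search` is a hypothetical stand-in for "build a profile from the current alignment and search the database", and stopping when the included set no longer changes is an assumption made for illustration.

```python
from typing import Callable, Dict, Set, Tuple


def iterate_search(
    search: Callable[[Set[str]], Dict[str, float]],
    query: str,
    inc_e: float = 0.01,
    max_rounds: int = 5,
) -> Tuple[Set[str], int]:
    """Toy jackhmmer-style loop.

    `search` (hypothetical) takes the names of the sequences currently in the
    alignment and returns {target_name: E-value} for one search round.
    """
    included = {query}
    for round_no in range(1, max_rounds + 1):
        hits = search(included)
        # Keep the query plus every target that satisfies the inclusion threshold.
        new_included = {query} | {t for t, e in hits.items() if e <= inc_e}
        if new_included == included:
            return included, round_no  # nothing changed: treat as converged
        included = new_included
    return included, max_rounds
```

In round 1 the alignment holds only the query, which corresponds to the single-sequence (phmmer-like) search; later rounds use the growing set of included targets.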

Options


Options Controlling Single Sequence Scoring (First Iteration)

By default, the first iteration uses a search model constructed from a single query sequence. This model is constructed using a standard 20x20 substitution matrix for residue probabilities, and two additional parameters for position-independent gap open and gap extend probabilities. These options allow the default single-sequence scoring parameters to be changed.

Gap Open (--popen)

Set the gap open probability for a single sequence query model to <x>.

Gap Extend (--pextend)

Set the gap extend probability for a single sequence query model to <x>.

--mx/--mxfile

These options are not currently supported.

Options for Reporting Thresholds

Reporting thresholds control which hits are reported in output files (the main output, --tblout, and --domtblout).

E-value (-E)

In the per-target output, report target profiles with an E-value of <= <x>. The default is 10.0, meaning that on average, about 10 false positives will be reported per query, so you can see the top of the noise and decide for yourself if it’s really noise.

Bit score (-T)

Instead of thresholding per-profile output on E-value, report target profiles with a bit score of >= <x>.

Domain E-value (--domE)

In the per-domain output, for target profiles that have already satisfied the per-profile reporting threshold, report individual domains with a conditional E-value of <= <x>. The default is 10.0. A conditional E-value means the expected number of additional false positive domains in the smaller search space of those comparisons that already satisfied the per-profile reporting threshold (and thus must have at least one homologous domain already).

Domain bit score (--domT)

Instead of thresholding per-domain output on E-value, report domains with a bit score of >= <x>.

Options for Inclusion Thresholds

Inclusion thresholds are stricter than reporting thresholds. Inclusion thresholds control which hits are considered to be reliable enough to be included in an output alignment or a subsequent search round. Because jackhmmer is iterative, they determine which target sequences are added to the alignment that is used to build the profile for the next iteration. They also affect which domains get marked as significant (!) or questionable (?) in domain output.

E-value of per-target inclusion threshold (--incE)

Use an E-value of <= <x> as the per-target inclusion threshold. The default is 0.01, meaning that on average, about 1 false positive would be expected in every 100 searches with different query sequences.

Bit score of per-target inclusion threshold (--incT)

Instead of using E-values for setting the inclusion threshold, use a bit score of >= <x> as the per-target inclusion threshold. Bit score thresholds are unusual in practice, because you don’t expect a single score threshold to work for different profiles; different profiles have slightly different expected score distributions.

Domain E-value inclusion threshold (--incdomE)

Use a conditional E-value of <= <x> as the per-domain inclusion threshold, in targets that have already satisfied the overall per-target inclusion threshold.

Domain bit score inclusion threshold (--incdomT)

Instead of using E-values, use a bit score of >= <x> as the per-domain inclusion threshold. As with --incT above, it would be unusual to use a single bit score threshold across different profiles.
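To make the relationship between the two kinds of threshold concrete, here is a minimal sketch of how a domain could be labelled from its conditional E-value. The function name and the empty-string return for unreported domains are illustrative assumptions; the defaults mirror the 10.0 reporting and 0.01 inclusion values given above.

```python
def classify_domain(evalue: float, report_e: float = 10.0, include_e: float = 0.01) -> str:
    """Label a domain by E-value (illustrative helper, not part of HMMER).

    Returns "!" if the domain satisfies the inclusion threshold,
    "?" if it is only good enough to be reported,
    and "" if it falls below the reporting threshold.
    """
    if evalue <= include_e:
        return "!"
    if evalue <= report_e:
        return "?"
    return ""
```

Because the inclusion threshold is the stricter of the two, every included domain is also reported, but not every reported domain is included.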

Acceleration Heuristics (--F1, --F2, --F3)

MSV filter

The sequence is aligned to the profile using a specialized model that allows multiple high-scoring local ungapped segments to match. The optimal alignment score (Viterbi score) is calculated under this multi-segment model, hence the term MSV, for “multi-segment Viterbi”. This is HMMER’s main speed heuristic. The MSV score is comparable to BLAST’s sum score (optimal sum of ungapped alignment segments). Roughly speaking, MSV is comparable to skipping the heuristic word hit and hit extension steps of the BLAST acceleration algorithm.

The MSV filter is very, very fast. In addition to avoiding indel calculations in the dynamic programming table, it uses reduced precision scores scaled to 8-bit integers, enabling acceleration via 16-way parallel SIMD vector instructions.

The MSV score is a true log-odds likelihood ratio, so it obeys conjectures about the expected score distribution (Eddy, 2008) that allow immediate and accurate calculation of the statistical significance (P-value) of the MSV bit score.

By default, comparisons with a P-value of ≤ 0.02 pass this filter, meaning that about 2% of nonhomologous sequences are expected to pass. You can use the --F1 option to change this threshold. For example, --F1 0.05 would pass 5% of the comparisons, making a search more sensitive but slower. Setting the threshold to ≥ 1.0 (--F1 99 for example) assures that all comparisons will pass. Shutting off the MSV filter may be worthwhile if you want to make sure you don’t miss comparisons that have a lot of scattered insertions and deletions. Alternatively, the --max option causes the MSV filter step (and all other filter steps) to be bypassed.

The MSV bit score is calculated as a log-odds score using the null model for comparison. No correction for a biased composition or repetitive sequence is done at this stage. For comparisons involving biased sequences and/or profiles, more than 2% of comparisons will pass the MSV filter. At the end of search output, there is a line like:

Passed MSV filter: 107917 (0.020272); expected 106468.8 (0.02)

which tells you how many and what fraction of comparisons passed the MSV filter, versus how many (and what fraction) were expected.
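If you want to compare observed and expected pass counts programmatically, a summary line of this shape can be picked apart with a regular expression. This is a sketch that assumes the line looks like the example above (possibly with extra spacing); the field names in the returned dictionary are chosen here for illustration.

```python
import re

# Matches lines such as:
#   Passed MSV filter: 107917 (0.020272); expected 106468.8 (0.02)
FILTER_LINE = re.compile(
    r"Passed\s+(?P<name>\w+)\s+filter:\s+(?P<passed>\d+)\s+\((?P<frac>[\d.eE+-]+)\);"
    r"\s+expected\s+(?P<expected>[\d.eE+-]+)\s+\((?P<efrac>[\d.eE+-]+)\)"
)


def parse_filter_line(line: str) -> dict:
    """Extract counts and fractions from one 'Passed ... filter' summary line."""
    match = FILTER_LINE.search(line)
    if match is None:
        raise ValueError(f"not a filter summary line: {line!r}")
    passed = int(match["passed"])
    expected = float(match["expected"])
    return {
        "filter": match["name"],
        "passed": passed,
        "fraction": float(match["frac"]),
        "expected": expected,
        "expected_fraction": float(match["efrac"]),
        # A ratio well above 1 means more comparisons passed than expected.
        "ratio": passed / expected,
    }
```

A ratio far above 1 for a given filter is one hint that biased-composition sequences or profiles are inflating the number of comparisons that pass.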

Viterbi filter

The sequence is now aligned to the profile using a fast Viterbi algorithm for optimal gapped alignment.

This Viterbi implementation is specialized for speed. It is implemented in 8-way parallel SIMD vector instructions, using reduced precision scores that have been scaled to 16-bit integers. Only one row of the dynamic programming matrix is stored, so the routine only recovers the score, not the optimal alignment itself. The reduced representation has limited range; local alignment scores will not underflow, but high scoring comparisons can overflow and return infinity, in which case they automatically pass the filter.

The final Viterbi filter bit score is then computed using the appropriate null model log likelihood (by default the biased composition filter model score, or if the biased filter is off, just the null model score). If the P-value of this score passes the Viterbi filter threshold, the sequence passes on to the next step of the pipeline.

The --F2 <x> option controls the P-value threshold for passing the Viterbi filter score. The default is 0.001. The --max option bypasses all filters in the pipeline. At the end of a search output, you will see a line like:

Passed Vit filter: 2207 (0.00443803); expected 497.3 (0.001)

which tells you how many and what fraction of comparisons passed the Viterbi filter, versus how many were expected.

Forward filter/parser

The sequence is now aligned to the profile using the full Forward algorithm, which calculates the likelihood of the target sequence given the profile, summed over the ensemble of all possible alignments.

This is a specialized time- and memory-efficient Forward implementation called the “Forward parser”. It is implemented in 4-way parallel SIMD vector instructions, in full precision (32-bit floating point). It stores just enough information that, in combination with the results of the Backward parser (below), posterior probabilities of start and stop points of alignments (domains) can be calculated in the domain definition step (below), although the detailed alignments themselves cannot be.

The Forward filter bit score is calculated by correcting this score using the appropriate null model log likelihood (by default the biased composition filter model score, or if the biased filter is off, just the null model score). If the P-value of this bit score passes the Forward filter threshold, the sequence passes on to the next step of the pipeline.

The bias filter score has no further effect in the pipeline. It is only used in filter stages. It has no effect on final reported bit scores or P-values. Biased composition compensation for final bit scores is done by a more complex domain-specific algorithm, described below.

The --F3 <x> option controls the P-value threshold for passing the Forward filter score. The default is 1e-5. The --max option bypasses all filters in the pipeline. At the end of a search output, you will see a line like:

Passed Fwd filter: 1076 (0.00216371); expected 5.0 (1e-05)

which tells you how many and what fraction of comparisons passed the Forward filter, versus how many were expected.
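The three P-value thresholds act as successive gates. The sketch below shows that logic only: it assumes the P-values are already known, leaves out the bias filter that sits among these steps, and uses function and parameter names invented for illustration.

```python
from typing import Optional


def first_failed_filter(
    p_msv: float,
    p_vit: float,
    p_fwd: float,
    f1: float = 0.02,
    f2: float = 1e-3,
    f3: float = 1e-5,
    use_max: bool = False,
) -> Optional[str]:
    """Return the name of the first filter a comparison fails, or None if it passes all.

    f1, f2 and f3 play the role of --F1, --F2 and --F3; use_max mimics --max,
    which bypasses every filter.
    """
    if use_max:
        return None
    stages = (("MSV", p_msv, f1), ("Viterbi", p_vit, f2), ("Forward", p_fwd, f3))
    for name, pvalue, threshold in stages:
        if pvalue > threshold:
            return name
    return None
```

Raising a threshold to 1.0 or above (for example `f1=99`) makes that stage pass everything, which matches the behaviour described for --F1 99.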

Bias Filter Options

The --max option bypasses all filters in the pipeline, including the bias filter.

The --nobias option turns off (bypasses) the biased composition filter. The simple null model is used as a null hypothesis for MSV and in subsequent filter steps. The biased composition filter step compromises a small amount of sensitivity. Though it is good to have it on by default, you may want to shut it off if you know you will have no problem with biased composition hits.

Advanced Documentation

A more detailed look at the internals of the various filter pipelines was posted on the developer's blog. The information posted there may be useful to those who are struggling with poor-scoring sequences.

Options Controlling Profile Construction

These options control how consensus columns are defined in an alignment.

--fast

Define consensus columns as those that have a fraction >= symfrac of residues as opposed to gaps. (See below for the --symfrac option.) This is the default.

--hand

Define consensus columns in next profile using reference annotation to the multiple alignment. This allows you to define any consensus columns you like.

--symfrac

Define the residue fraction threshold necessary to define a consensus column when using the --fast option. The default is 0.5. The symbol fraction in each column is calculated after taking relative sequence weighting into account, and ignoring gap characters corresponding to ends of sequence fragments (as opposed to internal insertions/deletions). Setting this to 0.0 means that every alignment column will be assigned as consensus, which may be useful in some cases. Setting it to 1.0 means that only columns that include 0 gaps (internal insertions/deletions) will be assigned as consensus.
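As a rough illustration of the --symfrac rule, the sketch below marks consensus columns in a small alignment. It is deliberately simplified: all sequences get equal weight and every gap counts, whereas the text above says HMMER applies relative sequence weights and ignores gaps at the ends of fragments. The function name and gap characters are assumptions.

```python
from typing import List


def consensus_columns(alignment: List[str], symfrac: float = 0.5, gap_chars: str = "-.") -> List[bool]:
    """Flag each column as consensus if its residue fraction is >= symfrac.

    Simplification: unweighted sequences, no special handling of fragments.
    """
    if not alignment:
        return []
    ncols = len(alignment[0])
    if any(len(seq) != ncols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    flags = []
    for col in range(ncols):
        residues = sum(1 for seq in alignment if seq[col] not in gap_chars)
        flags.append(residues / len(alignment) >= symfrac)
    return flags
```

With `symfrac=0.0` every column qualifies, and with `symfrac=1.0` only gap-free columns do, matching the two extremes described above.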

--fragthresh

We only want to count terminal gaps as deletions if the aligned sequence is known to be full-length, not if it is a fragment (for instance, because only part of it was sequenced). HMMER uses a simple rule to infer fragments: if the sequence length L is less than or equal to a fraction <x> times the alignment length in columns, then the sequence is handled as a fragment. The default is 0.5. Setting --fragthresh 0 will define no (nonempty) sequence as a fragment; you might want to do this if you know you’ve got a carefully curated alignment of full-length sequences. Setting --fragthresh 1 will define all sequences as fragments; you might want to do this if you know your alignment is entirely composed of fragments, such as translated short reads in metagenomic shotgun data.
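The fragment rule is a single comparison, which the following sketch spells out. The function and argument names are made up for illustration; only the inequality comes from the description above.

```python
def is_fragment(seq_len: int, alignment_columns: int, fragthresh: float = 0.5) -> bool:
    """Apply the rule L <= x * (alignment length in columns)."""
    return seq_len <= fragthresh * alignment_columns
```

For a 100-column alignment and the default threshold, a 50-residue sequence is treated as a fragment and a 51-residue sequence is not. A threshold of 0 never flags a nonempty sequence, and a threshold of 1 flags every sequence, since no aligned sequence can have more residues than the alignment has columns.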

Options Controlling Relative Weights

HMMER uses an ad hoc sequence weighting algorithm to downweight closely related sequences and up-weight distantly related ones. This has the effect of making models less biased by uneven phylogenetic representation. For example, two identical sequences would typically each receive half the weight that one sequence would. These options control which algorithm gets used.

--wpb

Use the Henikoff position-based sequence weighting scheme [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is the default.

--wgsc

Use the Gerstein/Sonnhammer/Chothia weighting algorithm [Gerstein et al, J. Mol. Biol. 235:1067, 1994].

--wblosum

Use the same clustering scheme that was used to weight data in calculating BLOSUM substitution matrices [Henikoff and Henikoff, Proc. Natl. Acad. Sci 89:10915, 1992]. Sequences are single-linkage clustered at an identity threshold (default 0.62; see --wid) and within each cluster of c sequences, each sequence gets relative weight 1/c.
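A minimal sketch of this clustering scheme is shown below. The identity measure (matches divided by the number of columns where both sequences have a residue) and the use of `>=` at the threshold are assumptions made for illustration; HMMER's own definition may differ in detail.

```python
from typing import List


def pairwise_identity(a: str, b: str, gaps: str = "-.") -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in gaps and y not in gaps]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(alignment: List[str], wid: float = 0.62) -> List[float]:
    """Single-linkage cluster at identity >= wid; each member of a cluster of c gets 1/c."""
    n = len(alignment)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)  # link the two clusters

    roots = [find(i) for i in range(n)]
    sizes = {root: roots.count(root) for root in set(roots)}
    return [1.0 / sizes[root] for root in roots]
```

Three near-identical sequences plus one unrelated sequence would receive weights of 1/3, 1/3, 1/3 and 1, so the cluster as a whole counts about as much as the single outlier.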

--wnone

No relative weights. All sequences are assigned uniform weight.

--wid

Sets the identity threshold used by single-linkage clustering when using --wblosum. Invalid with any other weighting scheme. Default is 0.62.

Effective Sequence Number

After relative weights are determined, they are normalized to sum to a total effective sequence number, eff_nseq. This number may be the actual number of sequences in the alignment, but it is almost always smaller than that. The default entropy weighting method (--eent) reduces the effective sequence number to reduce the information content (relative entropy, or average expected score on true homologs) per consensus position. The target relative entropy is controlled by a two-parameter function, where the two parameters are settable with --ere and --esigma.

--eent

Adjust effective sequence number to achieve a specific relative entropy per position (see --ere). This is the default.

--eclust

Set effective sequence number to the number of single-linkage clusters at a specific identity threshold (see --eid). This option is not recommended; it’s for experiments evaluating how much better --eent is.

--enone

Turn off effective sequence number determination and just use the actual number of sequences. One reason you might want to do this is to try to maximize the relative entropy/position of your model, which may be useful for short models.

--eset

Explicitly set the effective sequence number for all models to <x>.

--ere

Set the minimum relative entropy/position target to <x>. Requires --eent. Default depends on the sequence alphabet. For protein sequences, it is 0.59 bits/position; for nucleotide sequences, it is 0.45 bits/position.

--esigma

Sets the minimum relative entropy contributed by an entire model alignment, over its whole length. This has the effect of making short models have higher relative entropy per position than --ere alone would give. The default is 45.0 bits.

--eid

Sets the fractional pairwise identity cutoff used by single linkage clustering with the --eclust option. The default is 0.62.

Options Controlling Priors

By default, weighted counts are converted to mean posterior probability parameter estimates using mixture Dirichlet priors. Default mixture Dirichlet prior parameters for protein models and for nucleic acid (RNA and DNA) models are built in. The following options allow you to override the default priors.

No priors (--pnone)

Don’t use any priors. Probability parameters will simply be the observed frequencies, after relative sequence weighting.

Laplace +1 prior (--plaplace)

Use a Laplace +1 prior in place of the default mixture Dirichlet prior.
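The difference between the two alternatives can be seen on a single column of counts. This sketch uses a toy four-letter alphabet and a made-up function name; it illustrates the arithmetic of "observed frequencies" versus "add one to every count", not HMMER's code.

```python
from typing import Dict


def estimate_probs(counts: Dict[str, float], prior: str = "none") -> Dict[str, float]:
    """Turn (weighted) counts into probabilities with no prior or a Laplace +1 prior."""
    total = sum(counts.values())
    k = len(counts)  # alphabet size
    if prior == "laplace":
        # Add one pseudocount per symbol, so unseen symbols keep nonzero probability.
        return {s: (c + 1.0) / (total + k) for s, c in counts.items()}
    if prior == "none":
        if total == 0:
            raise ValueError("cannot compute frequencies from all-zero counts")
        return {s: c / total for s, c in counts.items()}
    raise ValueError(f"unknown prior: {prior!r}")
```

With counts A=3, C=1, G=0, T=0, using no prior gives G a probability of 0, while the Laplace +1 prior gives it 1/8.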

Options Controlling H3 Parameter Estimation Methods

H3 uses three short random sequence simulations to estimate the location parameters for the expected score distributions for MSV scores, Viterbi scores, and Forward scores. These options allow these simulations to be modified.

--EmL

Sets the sequence length in simulation that estimates the location parameter mu for MSV E-values. Default is 200.

--EmN

Sets the number of sequences in simulation that estimates the location parameter mu for MSV E-values. Default is 200.

--EvL

Sets the sequence length in simulation that estimates the location parameter mu for Viterbi E-values. Default is 200.

--EvN

Sets the number of sequences in simulation that estimates the location parameter mu for Viterbi E-values. Default is 200.

--EfL

Sets the sequence length in simulation that estimates the location parameter tau for Forward E-values. Default is 100.

--EfN

Sets the number of sequences in simulation that estimates the location parameter tau for Forward E-values. Default is 200.

--Eft

Sets the tail mass fraction to fit in the simulation that estimates the location parameter tau for Forward E-values. Default is 0.04.
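To show what the location parameters are used for, here is a hedged sketch of how a bit score could be converted to a P-value. It assumes, following the conjectures cited as Eddy (2008), that MSV and Viterbi scores follow a Gumbel distribution with location mu, that high Forward scores follow an exponential tail starting at tau, and that the slope is lambda = ln 2 for scores in bits. Function names are illustrative.

```python
import math

LAMBDA = math.log(2.0)  # assumed slope for scores measured in bits


def gumbel_pvalue(score: float, mu: float, lam: float = LAMBDA) -> float:
    """P(S >= score) under a Gumbel distribution with location mu (MSV, Viterbi)."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def exp_tail_pvalue(score: float, tau: float, lam: float = LAMBDA) -> float:
    """P(S >= score) under an exponential tail with location tau (Forward)."""
    if score < tau:
        return 1.0
    return math.exp(-lam * (score - tau))
```

Under these assumptions, each additional bit of score above the location parameter roughly halves the P-value in the tail.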

Advanced Options

--nonull2

The biased-composition score correction can be too aggressive sometimes, causing you to miss homologs. You can turn the biased-composition score correction off with the --nonull2 option (and if you’re doing that, you may also want to set --nobias, to turn off another biased composition step called the bias filter, which affects which sequences get scored at all).

--domZ

Assert that the total number of targets in your searches is <x>, for the purposes of per-domain conditional E-value calculations, rather than the number of targets that passed the reporting thresholds.

-Z

Assert that the total number of targets in your searches is <x>, for the purposes of per-sequence E-value calculations, rather than the actual number of targets seen.
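The effect of asserting a search-space size is easiest to see from the relationship between a P-value and an E-value. The sketch below assumes the E-value is the P-value multiplied by the number of targets; the function name is illustrative.

```python
def evalue(pvalue: float, num_targets: float) -> float:
    """Expected number of false positives at this score: P-value times search-space size."""
    if num_targets < 0:
        raise ValueError("num_targets must be non-negative")
    return pvalue * num_targets
```

The same hit with a P-value of 1e-6 has an E-value of about 1 against one million targets but about 0.001 against one thousand, which is why fixing the target count makes E-values comparable across databases of different sizes.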

Random Seeding

Seed the random number generator with <n>, an integer >= 0. If <n> is nonzero, any stochastic simulations will be reproducible; the same command will give the same results. If <n> is 0, the random number generator is seeded arbitrarily, and stochastic simulations will vary from run to run of the same command.

Attribution

This Galaxy tool relies on HMMER3 from http://hmmer.janelia.org/. Internally the software is cited as:

# hmmscan :: search sequence(s) against a profile database
# HMMER 3.1 (February 2013); http://hmmer.org/
# Copyright (C) 2011 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

The wrappers were written by Eric Rasche and are licensed under Apache2. The documentation is copied from the HMMER3 documentation.