# HG changeset patch
# User bgruening
# Date 1457094293 18000
# Node ID 2c2c5e5e495b092906f2f1ae7ccb6abecbbb99f2
# Parent fac157e22e1b370c102a8551c9660db3ec0ed846
planemo upload for repository https://github.com/bgruening/galaxytools/tree/master/tools/infernal commit 9eeedfaf35c069d75014c5fb2e42046106bf813c-dirty
diff -r fac157e22e1b -r 2c2c5e5e495b cmalign._x_m_l_todo
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/cmalign._x_m_l_todo Fri Mar 04 07:24:53 2016 -0500
@@ -0,0 +1,256 @@
+
+ against a sequence database (cmsearch)
+
+
+ infernal
+ infernal
+ gnu_coreutils
+
+
+&1
+ ;
+
+ ## 1. replace all lines starting # (comment lines)
+ ## 2. replace the first 18 spaces with tabs, 18th field is a free text field (can contain spaces)
+ sed -e 's/#.*$//' -e '/^$/d' -e 's/ /\t/g' -e 's/\t/ /18g' \$temp_tabular_output > $outfile
+
+]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A is True
+
+
+
+
+`_.
+
+
+]]>
+
+
+
+ 10.1093/bioinformatics/btt509
+
+ @ARTICLE{bgruening_galaxytools,
+ Author = {Björn Grüning, Cameron Smith, Torsten Houwaart, Nicola Soranzo, Eric Rasche},
+ keywords = {bioinformatics, ngs, galaxy, cheminformatics, rna},
+ title = {{Galaxy Tools - A collection of bioinformatics and cheminformatics tools for the Galaxy environment}},
+ url = {https://github.com/bgruening/galaxytools}
+ }
+
+
+
+
+
diff -r fac157e22e1b -r 2c2c5e5e495b cmalign.xml
--- a/cmalign.xml Fri Feb 13 03:10:51 2015 -0500
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,378 +0,0 @@
-
- against a sequence database (cmsearch)
-
-
- infernal
- infernal
- gnu_coreutils
-
-
-&1
- ;
-
- ## 1. replace all lines starting # (comment lines)
- ## 2. replace the first 18 spaces with tabs, 18th field is a free text field (can contain spaces)
- sed -e 's/#.*$//' -e '/^$/d' -e 's/ /\t/g' -e 's/\t/ /18g' \$temp_tabular_output > $outfile
-
-]]>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- A is True
-
-
-
-
- to the covariance model (CM) in . The new alignment is
-output to stdout in Stockholm format, but can be redirected to a file with the -o option.
-Either or (but not both) may be ’-’ (dash), which means reading this input from stdin rather than a
-file.
-The sequence file must be in FASTA or Genbank format.
-cmalign uses an HMM banding technique to accelerate alignment by default as described below for the --hbanded
-option. HMM banding can be turned off with the --nonbanded option.
-By default, cmalign computes the alignment with maximum expected accuracy that is consistent with constraints
-(bands) derived from an HMM, using a banded version of the Durbin/Holmes optimal accuracy algorithm. This be-
-havior can be changed with the --cyk or --sample options.
-cmalign takes special care to correctly align truncated sequences, where some nucleotides from the beginning (5’)
-and/or end (3’) of the actual full length biological sequence are not present in the input sequence (see DL Kolbe and
-SR Eddy, Bioinformatics, 25:1236-1243, 2009). This behavior is on by default, but can be turned off with --notrunc. In
-previous versions of cmalign the --sub option was required to appropriately handle truncated sequences. The --sub
-option is still available in this version, but the new default method for handling truncated sequences should be as good
-or superior to the sub method in nearly all cases.
-The --mapali option allows inclusion of the fixed training alignment used to build the CM from file within the
-output alignment of cmalign.
-It is possible to merge two or more alignments created by the same CM using the Easel miniapp esl-alimerge (included
-in the easel/miniapps/ subdirectory of Infernal). Previous versions of cmalign included options to merge alignments
-but they were deprecated upon development of esl-alimerge, which is significantly more memory efficient.
-By default, cmalign will output the alignment to stdout. The alignment can be redirected to an output file with the
--o option. With -o, information on each aligned sequence, including score and model alignment boundaries will be
-printed to stdout (more on this below).
-The output alignment will be in Stockholm format by default. This can be changed to Pfam, aligned FASTA (AFA), A2M,
-Clustal, or Phylip format using the --outformat option, where is the name of the desired format. As a special
-case, if the output alignment is large (more than 10,000 sequences or more than 10,000,000 total nucleotides) than the
-output format will be Pfam format, with each sequence appearing on a single line, for reasons of memory efficiency. For
-alignments larger than this, using --ileaved will force interleaved Stockholm format, but the user should be aware that
-this may require a lot of memory. --ileaved will only work for alignments up to 100,000 sequences or 100,000,000 total
-nucleotides.
-If the output alignment format is Stockholm or Pfam, the output alignment will be annotated with posterior probabilities
-which estimate the confidence level of each aligned nucleotide. This annotation appears as lines beginning with ”#=GR
- PP”, one per sequence, each immediately below the corresponding aligned sequence ””.
-Characters in PP lines have 12 possible values: ”0-9”, ”*”, or ”.”. If ”.”, the position corresponds to a gap in the sequence.
-A value of ”0” indicates a posterior probability of between 0.0 and 0.05, ”1” indicates between 0.05 and 0.15, ”2”
-indicates between 0.15 and 0.25 and so on up to ”9” which indicates between 0.85 and 0.95. A value of ”*” indicates
-a posterior probability of between 0.95 and 1.0. Higher posterior probabilities correspond to greater confidence that
-the aligned nucleotide belongs where it appears in the alignment. With --nonbanded, the calculation of the posterior
-probabilities considers all possible alignments of the target sequence to the CM. Without --nonbanded (i.e. in default
-mode), the calculation considers only possible alignments within the HMM bands. Further, the posterior probabilities
-are conditional on the truncation mode of the alignment. For example, if the sequence alignment is truncated 5’, a PP
-value of ”9” indicates between 0.85 and 0.95 of all 5’ truncated alignments include the given nucleotide at the given
-position. The posterior annotation can be turned off with the --noprob option. If --small is enabled, posterior annotation
-must also be turned off using --noprob.
-The tabular output that is printed to stdout if the -o option is used includes one line per sequence and twelve fields
-per line: ”idx”: the index of the sequence in the input file, ”seq name”: the sequence name; ”length”: the length of the
-sequence; ”cm from” and ”cm to”: the model start and end positions of the alignment; ”trunc”: ”no” if the sequence is
-not truncated, ”5’” if the beginning of the sequence truncated 5’, ”3’” if the end of the sequence is truncated, and ”5’&3’”
-if both the beginning and the end are truncated; ”bit sc”: the bit score of the alignment, ”avg pp” the average posterior
-probability of all aligned nucleotides in the alignment; ”band calc”, ”alignment” and ”total”: the time in seconds required
-for calculating HMM bands, computing the alignment, and complete processing of the sequence, respectively; ”mem
-(Mb)”: the size in Mb of all dynamic programming matrices required for aligning the sequence. This tabular data can be
-saved to file with the --sfile option.
-
-
-Options for controlling the alignment algorithm
---optacc Align sequences using the Durbin/Holmes optimal accuracy algorithm. This is the default.
-The optimal accuracy alignment will be constrained by HMM bands for acceleration unless
-the --nonbanded option is enabled. The optimal accuracy algorithm determines the align-
-ment that maximizes the posterior probabilities of the aligned nucleotides within it. The
-posterior probabilites are determined using (possibly HMM banded) variants of the Inside
-and Outside algorithms.
---cyk Do not use the Durbin/Holmes optimal accuracy alignment to align the sequences, instead
-use the CYK algorithm which determines the optimally scoring (maximum likelihood) align-
-ment of the sequence to the model, given the HMM bands (unless --nonbanded is also
-enabled).
---sample Sample an alignment from the posterior distribution of alignments. The posterior distribution
-is determined using an HMM banded (unless --nonbanded) variant of the Inside algorithm.
---seed Seed the random number generator with , an integer >= 0. This option can only be
-used in combination with --sample. If is nonzero, stochastic sampling of alignments
-will be reproducible; the same command will give the same results. If is 0, the random
-number generator is seeded arbitrarily, and stochastic samplings may vary from run to run
-of the same command. The default seed is 181.
---notrunc Turn off truncated alignment algorithms. All sequences in the input file will be assumed to be
-full length, unless --sub is also used, in which case the program can still handle truncated
-sequences but will use an alternative strategy for their alignment.
---sub Turn on the sub model construction and alignment procedure. For each sequence, an HMM
-is first used to predict the model start and end consensus columns, and a new sub CM is
-constructed that only models consensus columns from start to end. The sequence is then
-aligned to this sub CM. Sub alignment is an older method than the default one for aligning
-sequences that are possibly truncated. By default, cmalign uses special DP algorithms to
-handle truncated sequences which should be more accurate than the sub method in most
-cases. --sub is still included as an option mainly for testing against this default truncated
-sequence handling. This ”sub CM” procedure is not the same as the ”sub CMs” described
-by Weinberg and Ruzzo.
-
-
-Other options
---mapali Reads the alignment from file used to build the model aligns it as a single object to
-the CM; e.g. the alignment in is held fixed. This allows you to align sequences to a
-model with cmalign and view them in the context of an existing trusted multiple alignment.
- must be the alignment file that the CM was built from. The program verifies that the
-checksum of the file matches that of the file used to construct the CM. A similar option to
-this one was called --withali in previous versions of cmalign.
---mapstr Must be used in combination with --mapali . Propogate structural information for any
-pseudoknots that exist in to the output alignment. A similar option to this one was called
---withstr in previous versions of cmalign.
---informat Assert that the input is in format . Do not run Babelfish format autodec-
-tion. This increases the reliability of the program somewhat, because the Babelfish can
-make mistakes; particularly recommended for unattended, high-throughput runs of Infernal.
-Acceptable formats are: FASTA, GENBANK, and DDBJ. is case-insensitive.
---outformat Specify the output alignment format as . Acceptable formats are: Pfam, AFA, A2M,
-Clustal, and Phylip. AFA is aligned fasta. Only Pfam and Stockholm alignment formats
-will include consensus structure annotation and posterior probability annotation of aligned
-residues.
---dnaout Output the alignments as DNA sequence alignments, instead of RNA ones.
---noprob Do not annotate the output alignment with posterior probabilities.
---matchonly Only include match columns in the output alignment, do not include any insertions relative
-to the consensus model. This option may be useful when creating very large alignments
-that require a lot of memory and disk space, most of which is necessary only to deal with
-insert columns that are gaps in most sequences.
---ileaved Output the alignment in interleaved Stockholm format of a fixed width that may be more con-
-venient for examination. This was the default output alignment format of previous versions
-of cmalign. Note that cmalign requires more memory when this option is used. For this
-reason, --ileaved will only work for alignments of up to 100,000 sequences or a total of
-100,000,000 aligned nucleotides.
---regress Save an additional copy of the output alignment with no author information to file .
---verbose Output additional information in the tabular scores output (output to stdout if -o is used, or
-to if --sfile is used). These are mainly useful for testing and debugging.
---cpu Specify that parallel CPU workers be used. If is set as ”0”, then the program will
-be run in serial mode, without using threads. You can also control this number by setting an
-environment variable, INFERNAL NCPU. This option will only be available if the machine on
-which Infernal was built is capable of using POSIX threading (see the Installation section of
-the user guide for more information).
---mpi Run as an MPI parallel program. This option will only be available if Infernal has been
-configured and built with the ”--enable-mpi” flag (see the Installation section of the user
-guide for more information).
-
-
-
-
-
-
-
-Output format
--------------
-
-(1) target name: The name of the target sequence or profile.
-(2) accession: The accession of the target sequence or profile, or ’-’ if none.
-(3) query name: The name of the query sequence or profile.
-(4) accession: The accession of the query sequence or profile, or ’-’ if none.
-(5) mdl (model): Which type of model was used to compute the final score. Either ’cm’ or ’hmm’. A CM is used to compute the final hit scores unless the model has zero basepairs or the --hmmonly option is used, in which case a HMM will be used.
-(6) mdl from (model coord): The start of the alignment of this hit with respect to the profile (CM or HMM), numbered 1..N for a profile of N consensus positions.
-(7) mdl to (model coord): The end of the alignment of this hit with respect to the profile (CM or HMM), numbered 1..N for a profile of N consensus positions.
-(8) seq from (ali coord): The start of the alignment of this hit with respect to the sequence, numbered 1..L for a sequence of L residues.
-(9) seq to (ali coord): The end of the alignment of this hit with respect to the sequence, numbered 1..L for a sequence of L residues.
-(10) strand: The strand on which the hit occurs on the sequence. ’+’ if the hit is on the top (Watson) strand, ’-’ if the hit is on the bottom (Crick) strand. If on the top strand, the “seq from” value will be less than or equal to the “seq to” value, else it will be greater than or equal to it.
-(11) trunc: Indicates if this is predicted to be a truncated CM hit or not. This will be “no” if it is a CM hit that is not predicted to be truncated by the end of the sequence, “5’ ” or “3’ ” if the hit is predicted to have one or more 5’ or 3’ residues missing due to a artificial truncation of the sequence, or “5’&3”’ if the hit is predicted to have one or more 5’ residues missing and one or more 3’ residues missing. If the hit is an HMM hit, this will always be ’-’.
-(12) pass: Indicates what “pass” of the pipeline the hit was detected on. This is probably only useful for testing and debugging. Non-truncated hits are found on the first pass, truncated hits are found on successive passes.
-(13) gc: Fraction of G and C nucleotides in the hit.
-(14) bias: The biased-composition correction: the bit score difference contributed by the null3 model for CM hits, or the null2 model for HMM hits. High bias scores may be a red flag for a false positive. It is difficult to correct for all possible ways in which a nonrandom but nonhomologous biological sequences can appear to be similar, such as short-period tandem repeats, so there are cases where the bias correction is not strong enough (creating false positives).
-(15) score: The score (in bits) for this target/query comparison. It includes the biased-composition cor-rection (the “null3” model for CM hits, or the “null2” model for HMM hits).
-(16) E-value: The expectation value (statistical significance) of the target. This is a per query E-value; i.e. calculated as the expected number of false positives achieving this comparison’s score for a single query against the search space Z. For cmsearch Z is defined as the total number of nucleotides in the target dataset multiplied by 2 because both strands are searched. For cmscan Z is the total number of nucleotides in the query sequence multiplied by 2 because both strands are searched and multiplied by the number of models in the target database. If you search with multiple queries and if you want to control the overall false positive rate of that search rather than the false positive rate per query, you will want to multiply this per-query E-value by how many queries you’re doing.
-(17) inc: Indicates whether or not this hit achieves the inclusion threshold: ’!’ if it does, ’?’ if it does not (and rather only achieves the reporting threshold). By default, the inclusion threshold is an E-value of 0.01 and the reporting threshold is an E-value of 10.0, but these can be changed with command line options as described in the manual pages.
-(18) description of target: The remainder of the line is the target’s description line, as free text.
-
-
-For further questions please refere to the Infernal Userguide_.
-
-.. _Userguide: http://selab.janelia.org/software/infernal/Userguide.pdf
-
-
-How do I cite Infernal?
------------------------
-
-The recommended citation for using Infernal 1.1 is E. P. Nawrocki and S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches , Bioinformatics 29:2933-2935 (2013).
-
-**Galaxy Wrapper Author**::
-
- * Bjoern Gruening, University of Freiburg
-
-]]>
-
-
diff -r fac157e22e1b -r 2c2c5e5e495b cmbuild.xml
--- a/cmbuild.xml Fri Feb 13 03:10:51 2015 -0500
+++ b/cmbuild.xml Fri Mar 04 07:24:53 2016 -0500
@@ -71,7 +71,7 @@
-
@@ -93,7 +93,7 @@
-
@@ -106,16 +106,16 @@
-
-
-
-
@@ -186,90 +186,114 @@
**What it does**
-For each multiple sequence alignment build a covariance model.
-The alignment file must be in Stockholm or SELEX format, and must contain consensus secondary structure annotation.
+cmbuild belongs to the INFERNAL software package that allows you to make consensus RNA secondary structure profiles, and use them to search nucleic acid sequence databases for homologous RNAs, or to create new structure-based multiple sequence alignments.
+
+cmbuild builds a covariance model of an RNA multiple alignment. cmbuild uses the consensus structure to determine the architecture of the CM.
+
+
+**Input**
+
+The input file is a multiple sequence alignment in Stockholm or SELEX format and must contain consensus secondary structure annotation.
cmbuild uses the consensus structure to determine the architecture of the CM.
-In addition to writing CM(s) to CMFILE_OUT, cmbuild also outputs a single line for each model created to stdout. Each
-line has the following fields: ”aln”: the index of the alignment used to build the CM; ”idx”: the index of the CM in the
-CMFILE_OUT; ”name”: the name of the CM; ”nseq”: the number of sequences in the alignment used to build the CM;
-”eff nseq”: the effective number of sequences used to build the model; ”alen”: the length of the alignment used to build
-the CM; ”clen”: the number of columns from the alignment defined as consensus (match) columns; ”bps”: the number
-of basepairs in the CM; ”bifs”: the number of bifurcations in the CM; ”rel entropy: CM”: the total relative entropy of the
-model divided by the number of consensus columns; ”rel entropy: HMM”: the total relative entropy of the model ignoring
-secondary structure divided by the number of consensus columns. ”description”: description of the model/alignment.
+Example: a simple multiple RNA sequence alignment with secondary structure annotation
+
+# STOCKHOLM 1.0
+tRNA1 GCGGAUUUAGCUCAGUUGGG.AGAGCGCCAGACUGAAGAUCUGGAGGUCC
+tRNA2 UCCGAUAUAGUGUAAC.GGCUAUCACAUCACGCUUUCACCGUGGAGA.CC
+tRNA3 UCCGUGAUAGUUUAAU.GGUCAGAAUGGGCGCUUGUCGCGUGCCAGA.UC
+tRNA4 GCUCGUAUGGCGCAGU.GGU.AGCGCAGCAGAUUGCAAAUCUGUUGGUCC
+tRNA5 GGGCACAUGGCGCAGUUGGU.AGCGCGCUUCCCUUGCAAGGAAGAGGUCA
+#=GC SS_cons <<<<<<<..<<<<.........>>>>.<<<<<.......>>>>>.....<
-Options controlling model construction
---------------------------------------
+**Output**
+
+The output of cmbuild contains information about the size of your input alignment (in aligned columns
+and number of sequences) and about the size of the resulting model.
+
+In addition to writing CM(s) to the output file, cmbuild also outputs a single line for each model created to stdout.
+Each line has the following fields:
+- aln: the index of the alignment used to build the CM
+- idx: the index of the CM in the output file
+- name: the name of the CM
+- nseq: the number of sequences in the alignment used to build the CM
+- eff nseq: the effective number of sequences used to build the model
+- alen: the length of the alignment used to build the CM
+- clen: the number of columns from the alignment defined as consensus (match) columns
+- bps: the number of basepairs in the CM
+- bifs: the number of bifurcations in the CM
+- rel entropy: CM: the total relative entropy of the model divided by the number of consensus columns
+- rel entropy: HMM: the total relative entropy of the model ignoring secondary structure divided by the number of consensus columns
+- description: description of the model/alignment.
+
+
+**Options controlling model construction**
+
These options control how consensus columns are defined in an alignment.
- * --fast Define consensus columns automatically as those that have a fraction >= symfrac of residues as opposed to gaps. (See below for the --symfrac option.) This is the default.
- * --hand Use reference coordinate annotation (#=GC RF line, in Stockholm) to determine which columns are consensus, and which are inserts. Any non-gap character indicates a consensus column. (For example, mark consensus columns with ”x”, and insert columns with ”.”.)
- * --symfrac Define the residue fraction threshold necessary to define a consensus column when not using --hand. The default is 0.5. The symbol fraction in each column is calculated after taking relative sequence weighting into account. Setting this to 0.0 means that every alignment column will be assigned as consensus, which may be useful in some cases. Setting it to 1.0 means that only columns that include 0 gaps will be assigned as consensus.
- * --noss Ignore the secondary structure annotation, if any, in MSA-Infile and build a CM with zero basepairs. This model will be similar to a profile HMM and the cmsearch and cmscan programs will use HMM algorithms which are faster than CM ones for this model. Additionally, a zero basepair model need not be calibrated with cmcalibrate prior to running cmsearch with it. The --noss option must be used if there is no secondary structure annotation in MSA-Infile.
+ - *--fast*: Define consensus columns automatically as those that have a fraction >= symfrac of residues as opposed to gaps. (See below for the --symfrac option.) This is the default.
+ - *--hand*: Use reference coordinate annotation (#=GC RF line, in Stockholm) to determine which columns are consensus, and which are inserts. Any non-gap character indicates a consensus column. (For example, mark consensus columns with ”x”, and insert columns with ”.”.)
+ - *--symfrac*: Define the residue fraction threshold necessary to define a consensus column when not using --hand. The default is 0.5. The symbol fraction in each column is calculated after taking relative sequence weighting into account. Setting this to 0.0 means that every alignment column will be assigned as consensus, which may be useful in some cases. Setting it to 1.0 means that only columns that include 0 gaps will be assigned as consensus.
+ - *--noss*: Ignore the secondary structure annotation, if any, in MSA-Infile and build a CM with zero basepairs. This model will be similar to a profile HMM and the cmsearch and cmscan programs will use HMM algorithms which are faster than CM ones for this model. Additionally, a zero basepair model need not be calibrated with cmcalibrate prior to running cmsearch with it. The --noss option must be used if there is no secondary structure annotation in MSA-Infile.
+
+
+**Options controlling relative weights**
-Options controlling relative weights
-------------------------------------
+cmbuild uses an ad hoc sequence weighting algorithm to downweight closely related sequences and upweight distantly related ones. This has the effect of making models less biased by uneven phylogenetic representation. For example, two identical sequences would typically each receive half the weight that one sequence would. These options control which algorithm gets used.
-cmbuild uses an ad hoc sequence weighting algorithm to downweight closely related sequences and upweight distantly
-related ones. This has the effect of making models less biased by uneven phylogenetic representation. For example,
-two identical sequences would typically each receive half the weight that one sequence would. These options control
-which algorithm gets used.
-
- * --wpb Use the Henikoff position-based sequence weighting scheme [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is the default.
- * --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm [Gerstein et al, J. Mol. Biol. 235:1067, 1994].
- * --wnone Turn sequence weighting off; e.g. explicitly set all sequence weights to 1.0.
- * --wgiven Use sequence weights as given in annotation in the input alignment file. If no weights were given, assume they are all 1.0. The default is to determine new sequence weights by the Gerstein/Sonnhammer/Chothia algorithm, ignoring any annotated weights.
- * --wblosum Use the BLOSUM filtering algorithm to weight the sequences, instead of the default GSC weighting. Cluster the sequences at a given percentage identity (see --wid); assign each cluster a total weight of 1.0, distributed equally amongst the members of that cluster.
- * --wid Controls the behavior of the --wblosum weighting option by setting the percent identity for clustering the alignment.
+ - *--wpb*: Use the Henikoff position-based sequence weighting scheme ([Henikoff and Henikoff](http://zhanglab.ccmb.med.umich.edu/literature/henikoff_weight_1994.pdf), J. Mol. Biol. 243:574, 1994). This is the default.
+ - *--wgsc*: Use the Gerstein/Sonnhammer/Chothia weighting algorithm ([Gerstein et al.](http://ac.els-cdn.com/0022283694900124/1-s2.0-0022283694900124-main.pdf?_tid=6ed29974-3044-11e5-8949-00000aacb35f&acdnat=1437550798_aaa62caa2c812bb81013f967e7b119ee), J. Mol. Biol. 236:1067, 1994).
+ - *--wnone*: Turn sequence weighting off; e.g. explicitly set all sequence weights to 1.0.
+ - *--wgiven*: Use sequence weights as given in annotation in the input alignment file. If no weights were given, assume they are all 1.0. The default is to determine new sequence weights by the Gerstein/Sonnhammer/Chothia algorithm, ignoring any annotated weights.
+ - *--wblosum*: Use the BLOSUM filtering algorithm to weight the sequences, instead of the default GSC weighting. Cluster the sequences at a given percentage identity (see --wid); assign each cluster a total weight of 1.0, distributed equally amongst the members of that cluster.
-Options controlling effective sequence number
----------------------------------------------
+**Options controlling effective sequence number**
+
+
+After relative weights are determined, they are normalized to sum to a total effective sequence number, eff nseq. This number may be the actual number of sequences in the alignment, but it is almost always smaller than that. The default entropy weighting method (--eent) reduces the effective sequence number to reduce the information content (relative entropy, or average expected score on true homologs) per consensus position. The target relative entropy is controlled by a two-parameter function, where the two parameters are settable with --ere and --esigma.
-After relative weights are determined, they are normalized to sum to a total effective sequence number, eff nseq. This
-number may be the actual number of sequences in the alignment, but it is almost always smaller than that. The default
-entropy weighting method (--eent) reduces the effective sequence number to reduce the information content (relative
-entropy, or average expected score on true homologs) per consensus position. The target relative entropy is controlled
-by a two-parameter function, where the two parameters are settable with --ere and --esigma.
-
- * --eent Use the entropy weighting strategy to determine the effective sequence number that gives a target mean match state relative entropy. This option is the default, and can be turned off with --enone. The default target mean match state relative entropy is 0.59 bits for models with at least 1 basepair and 0.38 bits for models with zero basepairs, but changed with --ere. The default of 0.59 or 0.38 bits is automatically changed if the total relative entropy of the model (summed match state relative entropy) is less than a cutoff, which is is 6.0 bits by default, but can be changed with the expert, undocumented --eX option. If you really want to play with that option, consult the source code.
- * --enone Turn off the entropy weighting strategy. The effective sequence number is just the number of sequences in the alignment.
- * --ere Set the target mean match state relative entropy. By default the target relative entropy per match position is 0.59 bits for models with at least 1 basepair and 0.38 for models with zero basepairs.
- * --eminseq Define the minimum allowed effective sequence number.
- * --ehmmre Set the target HMM mean match state relative entropy. Entropy for basepairing match states is calculated using marginalized basepair emission probabilities.
- * --eset Set the effective sequence number for entropy weighting.
+ - *--eent*: Use the entropy weighting strategy to determine the effective sequence number that gives a target mean match state relative entropy. This option is the default, and can be turned off with --enone. The default target mean match state relative entropy is 0.59 bits for models with at least 1 basepair and 0.38 bits for models with zero basepairs, but can be changed with --ere. The default of 0.59 or 0.38 bits is automatically changed if the total relative entropy of the model (summed match state relative entropy) is less than a cutoff, which is 6.0 bits by default, but can be changed with the expert, undocumented --eX option. If you really want to play with that option, consult the source code.
+ - *--enone*: Turn off the entropy weighting strategy. The effective sequence number is just the number of sequences in the alignment.
+ - *--ere*: Set the target mean match state relative entropy. By default the target relative entropy per match position is 0.59 bits for models with at least 1 basepair and 0.38 for models with zero basepairs.
+ - *--eminseq*: Define the minimum allowed effective sequence number.
+ - *--ehmmre*: Set the target HMM mean match state relative entropy. Entropy for basepairing match states is calculated using marginalized basepair emission probabilities.
+ - *--eset*: Set the effective sequence number for entropy weighting.
-Options for refining the input alignment
-----------------------------------------
+**Options for refining the input alignment**
- * --refine Attempt to refine the alignment before building the CM using expectation-maximization (EM). A CM is first built from the initial alignment as usual. Then, the sequences in the alignment are realigned optimally (with the HMM banded CYK algorithm, optimal means optimal given the bands) to the CM, and a new CM is built from the resulting alignment. The sequences are then realigned to the new CM, and a new CM is built from that alignment. This is continued until convergence, specifically when the alignments for two successive iterations are not significantly different (the summed bit scores of all the sequences in the alignment changes less than 1% between two successive iterations).
- * -l Turn on the local alignment algorithm, which allows the alignment to span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions in the structure to be penalized differently than normal indels. The default is to globally align the query model to the target sequences.
- * --gibbs Modifies the behavior of --refine so Gibbs sampling is used instead of EM. The difference is that during the alignment stage the alignment is not necessarily optimal, instead an alignment (parsetree) for each sequences is sampled from the posterior distribution of alignments as determined by the Inside algorithm. Due to this sampling step --gibbs is non- deterministic, so different runs with the same alignment may yield different results. This is not true when --refine is used without the --gibbs option, in which case the final alignment and CM will always be the same. When --gibbs is enabled, the --seed "number" option can be used to seed the random number generator predictably, making the results reproducible. The goal of the --gibbs option is to help expert RNA alignment curators refine structural alignments by allowing them to observe alternative high scoring alignments.
- * --seed Seed the random number generator with an integer >= 0. This option can only be used in combination with --gibbs. If the given number is nonzero, stochastic sampling of alignments will be reproducible; the same command will give the same results. If the given number is 0, the random number generator is seeded arbitrarily, and stochastic samplings may vary from run to run of the same command. The default seed is 0.
- * --cyk With --refine, align with the CYK algorithm. By default the optimal accuracy algorithm is used. There is more information on this in the cmalign manual page.
- * --notrunc With --refine, turn off the truncated alignment algorithm. There is more information on this in the cmalign manual page.
-
+ - *--refine*: Attempt to refine the alignment before building the CM using expectation-maximization (EM). A CM is first built from the initial alignment as usual. Then, the sequences in the alignment are realigned optimally (with the HMM banded CYK algorithm; optimal means optimal given the bands) to the CM, and a new CM is built from the resulting alignment. The sequences are then realigned to the new CM, and a new CM is built from that alignment. This is continued until convergence, specifically when the alignments for two successive iterations are not significantly different (the summed bit scores of all the sequences in the alignment change by less than 1% between two successive iterations).
+ - *Turn on the local alignment algorithm*: allows the alignment to span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions in the structure to be penalized differently than normal indels. The default is to globally align the query model to the target sequences.
+ - *--gibbs sampling*: Modifies the behavior of --refine so Gibbs sampling is used instead of EM. The difference is that during the alignment stage the alignment is not necessarily optimal; instead, an alignment (parsetree) for each sequence is sampled from the posterior distribution of alignments as determined by the Inside algorithm. Due to this sampling step --gibbs is non-deterministic, so different runs with the same alignment may yield different results. This is not true when --refine is used without the --gibbs option, in which case the final alignment and CM will always be the same. When --gibbs is enabled, the --seed "number" option can be used to seed the random number generator predictably, making the results reproducible. The goal of the --gibbs option is to help expert RNA alignment curators refine structural alignments by allowing them to observe alternative high scoring alignments.
+ - *--seed*: Seed the random number generator with an integer >= 0. This option can only be used in combination with --gibbs. If the given number is nonzero, stochastic sampling of alignments will be reproducible; the same command will give the same results. If the given number is 0, the random number generator is seeded arbitrarily, and stochastic samplings may vary from run to run of the same command. The default seed is 0.
+ - *--notrunc*: With --refine, turn off the truncated alignment algorithm. There is more information on this in the cmalign manual page.
+ - *--cyk algorithm*: With --refine, align with the CYK algorithm. By default the optimal accuracy algorithm is used. There is more information on this in the cmalign manual page.
+
+
For further questions please refere to the Infernal Userguide_.
.. _Userguide: http://selab.janelia.org/software/infernal/Userguide.pdf
-How do I cite Infernal?
------------------------
-
-The recommended citation for using Infernal 1.1 is E. P. Nawrocki and S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches , Bioinformatics 29:2933-2935 (2013).
-
-**Galaxy Wrapper Author**::
-
- * Bjoern Gruening, University of Freiburg
-
]]>
+
+
+ 10.1093/bioinformatics/btt509
+
+ @ARTICLE{bgruening_galaxytools,
+ Author = {Björn Grüning and Cameron Smith and Torsten Houwaart and Nicola Soranzo and Eric Rasche},
+ keywords = {bioinformatics, ngs, galaxy, cheminformatics, rna},
+ title = {{Galaxy Tools - A collection of bioinformatics and cheminformatics tools for the Galaxy environment}},
+ url = {https://github.com/bgruening/galaxytools}
+ }
+
+
+
diff -r fac157e22e1b -r 2c2c5e5e495b cmsearch.xml
--- a/cmsearch.xml Fri Feb 13 03:10:51 2015 -0500
+++ b/cmsearch.xml Fri Mar 04 07:24:53 2016 -0500
@@ -135,7 +135,7 @@
-
+
@@ -144,7 +144,7 @@
-
+
@@ -165,7 +165,7 @@
-
+
@@ -174,7 +174,7 @@
-
+
@@ -202,57 +202,88 @@
**What it does**
-Infernal is used to search sequence databases for homologs of structural RNA sequences, and to make
-sequence- and structure-based RNA sequence alignments. Infernal needs a profile from a structurally
-annotated multiple sequence alignment of an RNA family with a position-specific scoring system for substitutions,
-insertions, and deletions. Positions in the profile that are basepaired in the consensus secondary
-structure of the alignment are modeled as dependent on one another, allowing Infernal’s scoring system to
-consider the secondary structure, in addition to the primary sequence, of the family being modeled. Infernal
-profiles are probabilistic models called “covariance models”, a specialized type of stochastic context-free
-grammar (SCFG) (Lari and Young, 1990).
+cmsearch belongs to the INFERNAL software package that allows you to make consensus RNA secondary structure profiles, and use them to search nucleic acid sequence databases for homologous RNAs, or to create new structure-based multiple sequence alignments.
+You can use your model to search for new homologues of your RNA family. cmsearch is used to search one or more covariance models (CMs) against a sequence database. cmsearch searches both strands of each sequence in the target database, and returns alignments for high scoring hits.
+
+To build CMs from multiple alignments, see cmbuild (build covariance models).
-Compared to other alignment and database search tools based only on sequence comparison, Infernal
-aims to be significantly more accurate and more able to detect remote homologs because it models sequence
-and structure.
+
+**Input**
+
+The CM query file must have been calibrated for E-values with cmcalibrate. As a special exception, any models in the CM query file that have zero basepairs need not be calibrated.
-Output format
--------------
+**Options**
+
+- *Turn on the glocal alignment algorithm*: global with respect to the query model and local with respect to the target database. By default, the local alignment algorithm is used, which is local with respect to both the target sequence and the model. In local mode, the alignment is allowed to span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions in the structure to be penalized differently than normal indels. Local mode performs better on empirical benchmarks and is significantly more sensitive for remote homology detection. Empirically, glocal searches return many fewer hits than local searches, so glocal may be desired for some applications. With *Turn on the glocal alignment algorithm*, all models must be calibrated, even those with zero basepairs.
+
+- *Only search the bottom (Crick) strand of target sequences*: Hits can occur on either the top (Watson) or bottom (Crick) strand of the target sequence. By default, both strands are searched.
+
+- *Only search the top (Watson) strand of target sequences*: Hits can occur on either the top (Watson) or bottom (Crick) strand of the target sequence. By default, both strands are searched.
+
+- *Use the CYK algorithm, not Inside, to determine the final score of all hits*: If "yes" is selected, the CYK algorithm is used instead of the CM Inside algorithm (the SCFG analog of the HMM Forward algorithm).
-(1) target name: The name of the target sequence or profile.
-(2) accession: The accession of the target sequence or profile, or ’-’ if none.
-(3) query name: The name of the query sequence or profile.
-(4) accession: The accession of the query sequence or profile, or ’-’ if none.
-(5) mdl (model): Which type of model was used to compute the final score. Either ’cm’ or ’hmm’. A CM is used to compute the final hit scores unless the model has zero basepairs or the --hmmonly option is used, in which case a HMM will be used.
-(6) mdl from (model coord): The start of the alignment of this hit with respect to the profile (CM or HMM), numbered 1..N for a profile of N consensus positions.
-(7) mdl to (model coord): The end of the alignment of this hit with respect to the profile (CM or HMM), numbered 1..N for a profile of N consensus positions.
-(8) seq from (ali coord): The start of the alignment of this hit with respect to the sequence, numbered 1..L for a sequence of L residues.
-(9) seq to (ali coord): The end of the alignment of this hit with respect to the sequence, numbered 1..L for a sequence of L residues.
-(10) strand: The strand on which the hit occurs on the sequence. ’+’ if the hit is on the top (Watson) strand, ’-’ if the hit is on the bottom (Crick) strand. If on the top strand, the “seq from” value will be less than or equal to the “seq to” value, else it will be greater than or equal to it.
-(11) trunc: Indicates if this is predicted to be a truncated CM hit or not. This will be “no” if it is a CM hit that is not predicted to be truncated by the end of the sequence, “5’ ” or “3’ ” if the hit is predicted to have one or more 5’ or 3’ residues missing due to a artificial truncation of the sequence, or “5’&3”’ if the hit is predicted to have one or more 5’ residues missing and one or more 3’ residues missing. If the hit is an HMM hit, this will always be ’-’.
-(12) pass: Indicates what “pass” of the pipeline the hit was detected on. This is probably only useful for testing and debugging. Non-truncated hits are found on the first pass, truncated hits are found on successive passes.
-(13) gc: Fraction of G and C nucleotides in the hit.
-(14) bias: The biased-composition correction: the bit score difference contributed by the null3 model for CM hits, or the null2 model for HMM hits. High bias scores may be a red flag for a false positive. It is difficult to correct for all possible ways in which a nonrandom but nonhomologous biological sequences can appear to be similar, such as short-period tandem repeats, so there are cases where the bias correction is not strong enough (creating false positives).
-(15) score: The score (in bits) for this target/query comparison. It includes the biased-composition cor-rection (the “null3” model for CM hits, or the “null2” model for HMM hits).
-(16) E-value: The expectation value (statistical significance) of the target. This is a per query E-value; i.e. calculated as the expected number of false positives achieving this comparison’s score for a single query against the search space Z. For cmsearch Z is defined as the total number of nucleotides in the target dataset multiplied by 2 because both strands are searched. For cmscan Z is the total number of nucleotides in the query sequence multiplied by 2 because both strands are searched and multiplied by the number of models in the target database. If you search with multiple queries and if you want to control the overall false positive rate of that search rather than the false positive rate per query, you will want to multiply this per-query E-value by how many queries you’re doing.
-(17) inc: Indicates whether or not this hit achieves the inclusion threshold: ’!’ if it does, ’?’ if it does not (and rather only achieves the reporting threshold). By default, the inclusion threshold is an E-value of 0.01 and the reporting threshold is an E-value of 10.0, but these can be changed with command line options as described in the manual pages.
-(18) description of target: The remainder of the line is the target’s description line, as free text.
+- *Use the CYK algorithm to align hits*: By default, the Durbin/Holmes optimal accuracy
+algorithm is used, which finds the alignment that maximizes the expected accuracy of all aligned
+residues.
+
+- *Turn off truncated hit detection*: Turns off truncated hit detection and will reduce the running time most significantly for target files that include many short sequences.
+
+- *Turn off all filters, and run non-banded Inside on every full-length target sequence*: This
+increases sensitivity somewhat, at an extremely large cost in speed.
+
+- *Turn off all HMM filter stages*: The CYK filter, using QDBs, will be run on every full-length target sequence and will enforce a P-value threshold of 0.0001. Each subsequence that survives CYK will be passed to Inside, which will also use QDBs (but a looser set). This increases sensitivity somewhat, at a very large cost in speed.
+
+- *Turn off the HMM SSV and Viterbi filter stages*: Sets remaining HMM filter
+thresholds to 0.02 by default. This may increase sensitivity, at a significant cost in speed.
+
+- *Inclusion thresholds*: *Use E-value* - Use an E-value as the hit inclusion threshold. The default is 0.01, meaning that on average, about 1 false positive would be expected in every 100 searches with different
+query sequences. *Use Bit Score* - Use a bit score, rather than an E-value, as the hit inclusion threshold. By default this option is unset.
-For further questions please refere to the Infernal Userguide_.
+**Output Options**
-.. _Userguide: http://selab.janelia.org/software/infernal/Userguide.pdf
+- *reporting thresholds*: Hits are ranked by statistical significance (E-value). By *default*, all hits with an E-value <= 10 are reported. The following options allow you to change the default *E-value* reporting thresholds, or to use *bit score* thresholds instead.
+
+
+Output Example:
-How do I cite Infernal?
------------------------
+# cmsearch :: search CM(s) against a sequence database
+# INFERNAL 1.1.1 (July 2014)
+# Copyright (C) 2014 Howard Hughes Medical Institute.
+# Freely distributed under the GNU General Public License (GPLv3).
+# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
+# query CM file: tRNA5.cm
+# target sequence database: tutorial/mrum-genome.fa
+# number of worker threads: 8
+# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
+
-The recommended citation for using Infernal 1.1 is E. P. Nawrocki and S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches , Bioinformatics 29:2933-2935 (2013).
+The second section is a list of ranked top hits (sorted by E-value, most significant hit first):
-**Galaxy Wrapper Author**::
+rank E-value score bias sequence start end mdl trunc gc description
+---- --------- ------ ----- ----------- ------- ------- --- ----- ---- -----------
+(1) ! 1.3e-18 71.5 0.0 NC_013790.1 362026 361955 - cm no 0.50 Methanobrevibacter ruminantium M1
+(2) ! 3.3e-18 70.2 0.0 NC_013790.1 2585265 2585193 - cm no 0.60 Methanobrevibacter ruminantium M1
- * Bjoern Gruening, University of Freiburg
+
+
+For further questions please refer to the Infernal `Userguide `_.
]]>
+
+ 10.1093/bioinformatics/btt509
+
+ @ARTICLE{bgruening_galaxytools,
+ Author = {Björn Grüning and Cameron Smith and Torsten Houwaart and Nicola Soranzo and Eric Rasche},
+ keywords = {bioinformatics, ngs, galaxy, cheminformatics, rna},
+ title = {{Galaxy Tools - A collection of bioinformatics and cheminformatics tools for the Galaxy environment}},
+ url = {https://github.com/bgruening/galaxytools}
+ }
+
+
+
+
diff -r fac157e22e1b -r 2c2c5e5e495b cmstat.xml
--- a/cmstat.xml Fri Feb 13 03:10:51 2015 -0500
+++ b/cmstat.xml Fri Mar 04 07:24:53 2016 -0500
@@ -60,8 +60,8 @@
The cmstat utility prints out a tabular file of summary statistics for each given covariance model.
-Output format
--------------
+**Output format**
+
By default, cmstat prints general statistics of the model and the alignment it was built from, one line per model in a
tabular format.
@@ -86,20 +86,21 @@
relative entropy, the more the model will rely on structural conservation relative sequence conservation when identifying homologs.
-For further questions please refere to the Infernal Userguide_.
-
-.. _Userguide: http://selab.janelia.org/software/infernal/Userguide.pdf
-
-
-How do I cite Infernal?
------------------------
-
-The recommended citation for using Infernal 1.1 is E. P. Nawrocki and S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches , Bioinformatics 29:2933-2935 (2013).
-
-**Galaxy Wrapper Author**::
-
- * Bjoern Gruening, University of Freiburg
+For further questions please refer to the Infernal `Userguide `_.
]]>
+
+ 10.1093/bioinformatics/btt509
+
+ @ARTICLE{bgruening_galaxytools,
+ Author = {Björn Grüning and Cameron Smith and Torsten Houwaart and Nicola Soranzo and Eric Rasche},
+ keywords = {bioinformatics, ngs, galaxy, cheminformatics, rna},
+ title = {{Galaxy Tools - A collection of bioinformatics and cheminformatics tools for the Galaxy environment}},
+ url = {https://github.com/bgruening/galaxytools}
+ }
+
+
+
+
diff -r fac157e22e1b -r 2c2c5e5e495b infernal.py
--- a/infernal.py Fri Feb 13 03:10:51 2015 -0500
+++ b/infernal.py Fri Mar 04 07:24:53 2016 -0500
@@ -53,7 +53,7 @@
dataset.blurb = "1 model"
else:
dataset.blurb = "%s models" % dataset.metadata.number_of_models
- dataset.peek = data.get_file_peek( dataset.file_name, is_multi_byte=is_multi_byte )
+ dataset.peek = get_file_peek( dataset.file_name, is_multi_byte=is_multi_byte )
else:
dataset.peek = 'file does not exist'
dataset.blurb = 'file purged from disc'
diff -r fac157e22e1b -r 2c2c5e5e495b infernal.tar.gz
Binary file infernal.tar.gz has changed
diff -r fac157e22e1b -r 2c2c5e5e495b tool_dependencies.xml
--- a/tool_dependencies.xml Fri Feb 13 03:10:51 2015 -0500
+++ b/tool_dependencies.xml Fri Mar 04 07:24:53 2016 -0500
@@ -1,7 +1,7 @@
-
+