view mothur/README @ 25:bfbaf823be4c

Change metagenomics datatypes to include labels and groups metadata. Change Mothur tool configs to get label and group select options from a data_meta filter rather than using the options from_dataset attribute. This greatly decreases memory demand for the galaxy server.
author Jim Johnson <jj@umn.edu>
date Wed, 16 May 2012 12:28:44 -0500
parents 09740be2bc9c
children 5c77423823cb

Provides galaxy tools for the Mothur metagenomics package -  http://www.mothur.org/wiki/Main_Page 

(The environment variable MOTHUR_MAX_PROCESSORS can be used to limit the number of cpu processors used by mothur commands)

Install mothur v.1.24.1 on your galaxy system so galaxy can execute the mothur command
  ( This version of wrappers is designed for Mothur version 1.24 - it may work on later versions )
  http://www.mothur.org/wiki/Download_mothur
  http://www.mothur.org/wiki/Installation
  ( This Galaxy Mothur wrapper will invoke Mothur in command line mode: http://www.mothur.org/wiki/Command_line_mode )
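
To illustrate command line mode, the sketch below builds the argument list for a single mothur command in the "#command(name=value,...)" form described on the wiki page above. It is a minimal example, not the code in mothur_wrapper.py; the helper name mothur_command_line and the file name input.fasta are hypothetical.

```python
def mothur_command_line(cmd, params):
    """Build an argv list that runs one mothur command in command line mode.

    cmd    -- mothur command name, e.g. 'summary.seqs'
    params -- list of (name, value) pairs passed to that command
    (Hypothetical helper for illustration; not part of the wrapper.)
    """
    args = ','.join('%s=%s' % (name, value) for name, value in params)
    # Command line mode takes the whole command as one argument starting with '#'.
    return ['mothur', '#%s(%s)' % (cmd, args)]


if __name__ == '__main__':
    # input.fasta is a placeholder file name.
    print(mothur_command_line('summary.seqs', [('fasta', 'input.fasta')]))
```

The resulting list can be handed to subprocess without a shell, which avoids having to quote the '#' and parentheses.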

TreeVector is also packaged with this Mothur package to view phylogenetic trees:
  TreeVector is a utility to create and integrate phylogenetic trees as Scalable Vector Graphics (SVG) files.
  TreeVector was written by Ralph_Pethica, Department_of_Computer_Science, University_of_Bristol
  TreeVector: http://supfam.cs.bris.ac.uk/TreeVector/about.html
  Install in galaxy:  tool-data/shared/jars/TreeVector.jar

Install reference data from RDP, silva, and greengenes
 RDP reference file (modified for mothur):
  http://www.mothur.org/wiki/RDP_reference_files
   - 16S rRNA reference (RDP): A collection of 9,662 bacterial and 384 archaeal 16S rRNA gene sequences with an improved taxonomy compared to version 6.
     http://www.mothur.org/w/images/2/29/Trainset7_112011.rdp.zip
   - 16S rRNA reference (PDS): The RDP reference with three sequences reversed and 119 mitochondrial 16S rRNA gene sequences added as members of the Rickettsiales
     http://www.mothur.org/w/images/4/4a/Trainset7_112011.pds.zip
   - 28S rRNA reference (RDP): A collection of 8506 reference 28S rRNA gene sequences from the Fungi that were curated by the Kuske lab
     http://www.mothur.org/w/images/3/36/FungiLSU_train_v7.zip
 Silva reference:
  http://www.mothur.org/wiki/Silva_reference_files
  - Bacterial references (14,956 sequences)
    http://www.mothur.org/w/images/9/98/Silva.bacteria.zip
  - Archaeal references (2,297 sequences)
    http://www.mothur.org/w/images/3/3c/Silva.archaea.zip
  - Eukaryotic references (1,238 sequences)
    http://www.mothur.org/w/images/1/1a/Silva.eukarya.zip
  - Silva-based alignment of template file for chimera.slayer (5,181 sequences)
    http://www.mothur.org/w/images/f/f1/Silva.gold.bacteria.zip
 Alignment database rRNA gene sequences:
  http://www.mothur.org/wiki/Alignment_database
  - greengenes reference alignment
    http://www.mothur.org/w/images/7/72/Greengenes.alignment.zip
  - SILVA (Silva reference)
    http://www.mothur.org/w/images/f/f1/Silva.gold.bacteria.zip
 Secondary structure mapping files:
  http://www.mothur.org/wiki/Secondary_structure_map
    http://www.mothur.org/w/images/6/6d/Silva_ss_map.zip
    http://www.mothur.org/w/images/4/4b/Gg_ss_map.zip
 Lane masks:
  http://www.mothur.org/wiki/Lane_mask
  greengenes-compatible mask:
     - lane1241.gg.filter - A Lane mask that comes with the greengenes arb database
       http://www.mothur.org/w/images/2/2a/Lane1241.gg.filter
     - lane1287.gg.filter - A Lane mask that comes with the greengenes arb database
       http://www.mothur.org/w/images/a/a0/Lane1287.gg.filter
     - lane1349.gg.filter - Pat Schloss's transcription of the mask from the Lane paper
       http://www.mothur.org/w/images/3/3d/Lane1349.gg.filter
  SILVA-compatible mask:
     - lane1349.silva.filter - Pat Schloss's transcription of the mask from the Lane paper
       http://www.mothur.org/w/images/6/6d/Lane1349.silva.filter
 Lookup Files for sff flow analysis using shhh.flows:
  http://www.mothur.org/wiki/Alignment_database

 Example from UMN installation: (We also made these available in a Galaxy public data library)
    /project/db/galaxy/mothur/Silva.bacteria.zip
    /project/db/galaxy/mothur/silva.eukarya.fasta
    /project/db/galaxy/mothur/Greengenes.alignment.zip
    /project/db/galaxy/mothur/Silva.archaea.zip
    /project/db/galaxy/mothur/Silva_ss_map.zip
    /project/db/galaxy/mothur/silva.eukarya.ncbi.tax
    /project/db/galaxy/mothur/Silva.gold.bacteria.zip
    /project/db/galaxy/mothur/Silva.archaea/silva.archaea.silva.tax
    /project/db/galaxy/mothur/Silva.archaea/silva.archaea.gg.tax
    /project/db/galaxy/mothur/Silva.archaea/silva.archaea.rdp.tax
    /project/db/galaxy/mothur/Silva.archaea/nogap.archaea.fasta
    /project/db/galaxy/mothur/Silva.archaea/silva.archaea.ncbi.tax
    /project/db/galaxy/mothur/Silva.archaea/silva.archaea.fasta
    /project/db/galaxy/mothur/nogap.eukarya.fasta
    /project/db/galaxy/mothur/silva.eukarya.silva.tax
    /project/db/galaxy/mothur/silva.gold.align
    /project/db/galaxy/mothur/silva.ss.map
    /project/db/galaxy/mothur/gg.ss.map
    /project/db/galaxy/mothur/silva.bacteria/silva.bacteria.silva.tax
    /project/db/galaxy/mothur/silva.bacteria/silva.bacteria.rdp6.tax
    /project/db/galaxy/mothur/silva.bacteria/nogap.bacteria.fasta
    /project/db/galaxy/mothur/silva.bacteria/silva.bacteria.gg.tax
    /project/db/galaxy/mothur/silva.bacteria/silva.bacteria.ncbi.tax
    /project/db/galaxy/mothur/silva.bacteria/silva.bacteria.fasta
    /project/db/galaxy/mothur/silva.bacteria/silva.bacteria.rdp.tax
    /project/db/galaxy/mothur/Silva.eukarya.zip
    /project/db/galaxy/mothur/Gg_ss_map.zip
    /project/db/galaxy/mothur/core_set_aligned.imputed.fasta
    /project/db/galaxy/mothur/RDP/FungiLSU_train_1400bp_8506_mod.fasta
    /project/db/galaxy/mothur/RDP/FungiLSU_train_1400bp_8506_mod.tax
    /project/db/galaxy/mothur/RDP/trainset6_032010.rdp.fasta
    /project/db/galaxy/mothur/RDP/trainset6_032010.rdp.tax
    /project/db/galaxy/mothur/RDP/trainset7_112011.pds.fasta
    /project/db/galaxy/mothur/RDP/trainset7_112011.pds.tax
    /project/db/galaxy/mothur/RDP/trainset7_112011.rdp.fasta
    /project/db/galaxy/mothur/RDP/trainset7_112011.rdp.tax



Add tool-data:  (contains  pointers to silva, greengenes, and RDP reference data)
  tool-data/mothur_aligndb.loc
  tool-data/mothur_map.loc
  tool-data/mothur_taxonomy.loc
  tool-data/shared/jars/TreeVector.jar


add config files (*.xml) and wrapper code (*.py) from tools/mothur/*  to your galaxy installation 


add datatype definition file: lib/galaxy/datatypes/metagenomics.py

add the following import line to:  lib/galaxy/datatypes/registry.py
import metagenomics # added for metagenomics mothur


add datatypes to:  datatypes_conf.xml
        <!-- Start Mothur Datatypes -->
        <datatype extension="otu" type="galaxy.datatypes.metagenomics:Otu" display_in_upload="true"/>
        <datatype extension="list" type="galaxy.datatypes.metagenomics:OtuList" display_in_upload="true"/>
        <datatype extension="sabund" type="galaxy.datatypes.metagenomics:Sabund" display_in_upload="true"/>
        <datatype extension="rabund" type="galaxy.datatypes.metagenomics:Rabund" display_in_upload="true"/>
        <datatype extension="shared" type="galaxy.datatypes.metagenomics:SharedRabund" display_in_upload="true"/>
        <datatype extension="relabund" type="galaxy.datatypes.metagenomics:RelAbund" display_in_upload="true"/>
        <datatype extension="names" type="galaxy.datatypes.metagenomics:Names" display_in_upload="true"/>
        <datatype extension="design" type="galaxy.datatypes.metagenomics:Design" display_in_upload="true"/>
        <datatype extension="summary" type="galaxy.datatypes.metagenomics:Summary" display_in_upload="true"/>
        <datatype extension="groups" type="galaxy.datatypes.metagenomics:Group" display_in_upload="true"/>
        <datatype extension="oligos" type="galaxy.datatypes.metagenomics:Oligos" display_in_upload="true"/>
        <datatype extension="align" type="galaxy.datatypes.metagenomics:SequenceAlignment" display_in_upload="true"/>
        <datatype extension="accnos" type="galaxy.datatypes.metagenomics:AccNos" display_in_upload="true"/>
        <datatype extension="map" type="galaxy.datatypes.metagenomics:SecondaryStructureMap" display_in_upload="true"/>
        <datatype extension="align.check" type="galaxy.datatypes.metagenomics:AlignCheck" display_in_upload="true"/>
        <datatype extension="align.report" type="galaxy.datatypes.metagenomics:AlignReport" display_in_upload="true"/>
        <datatype extension="filter" type="galaxy.datatypes.metagenomics:LaneMask" display_in_upload="true"/>
        <datatype extension="dist" type="galaxy.datatypes.metagenomics:DistanceMatrix" display_in_upload="true"/>
        <datatype extension="pair.dist" type="galaxy.datatypes.metagenomics:PairwiseDistanceMatrix" display_in_upload="true"/>
        <datatype extension="square.dist" type="galaxy.datatypes.metagenomics:SquareDistanceMatrix" display_in_upload="true"/>
        <datatype extension="lower.dist" type="galaxy.datatypes.metagenomics:LowerTriangleDistanceMatrix" display_in_upload="true"/>
        <datatype extension="ref.taxonomy" type="galaxy.datatypes.metagenomics:RefTaxonomy" display_in_upload="true">
            <converter file="ref_to_seq_taxonomy_converter.xml" target_datatype="seq.taxonomy"/>
        </datatype>
        <datatype extension="seq.taxonomy" type="galaxy.datatypes.metagenomics:SequenceTaxonomy" display_in_upload="true"/>
        <datatype extension="rdp.taxonomy" type="galaxy.datatypes.metagenomics:RDPSequenceTaxonomy" display_in_upload="true"/>
        <datatype extension="cons.taxonomy" type="galaxy.datatypes.metagenomics:ConsensusTaxonomy" display_in_upload="true"/>
        <datatype extension="tax.summary" type="galaxy.datatypes.metagenomics:TaxonomySummary" display_in_upload="true"/>
        <datatype extension="freq" type="galaxy.datatypes.metagenomics:Frequency" display_in_upload="true"/>
        <datatype extension="quan" type="galaxy.datatypes.metagenomics:Quantile" display_in_upload="true"/>
        <datatype extension="filtered.quan" type="galaxy.datatypes.metagenomics:FilteredQuantile" display_in_upload="true"/>
        <datatype extension="masked.quan" type="galaxy.datatypes.metagenomics:MaskedQuantile" display_in_upload="true"/>
        <datatype extension="filtered.masked.quan" type="galaxy.datatypes.metagenomics:FilteredMaskedQuantile" display_in_upload="true"/>
        <datatype extension="axes" type="galaxy.datatypes.metagenomics:Axes" display_in_upload="true"/>
        <datatype extension="sff.flow" type="galaxy.datatypes.metagenomics:SffFlow" display_in_upload="true"/>
        <datatype extension="tre" type="galaxy.datatypes.data:Newick" display_in_upload="true"/>
        <!-- End Mothur Datatypes -->
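
One way to confirm that the entries above made it into datatypes_conf.xml is to list the registered extensions and compare them with the ones you expect. The sketch below is only an illustration: missing_datatypes is a hypothetical helper, not part of Galaxy, and it assumes the file is well-formed XML containing datatype elements with an extension attribute, as in the snippet above.

```python
import xml.etree.ElementTree as ET


def missing_datatypes(conf_xml, required):
    """Return the required extensions that have no <datatype> entry.

    conf_xml -- contents of datatypes_conf.xml as a string
    required -- iterable of extension names, e.g. ['otu', 'shared']
    (Hypothetical helper for illustration only.)
    """
    root = ET.fromstring(conf_xml)
    # iter() walks the whole tree, so the nesting depth of <datatype> does not matter.
    present = {dt.get('extension') for dt in root.iter('datatype')}
    return sorted(set(required) - present)


if __name__ == '__main__':
    # The path is an assumption; adjust it to your Galaxy installation.
    with open('datatypes_conf.xml') as handle:
        print(missing_datatypes(handle.read(), ['otu', 'list', 'shared', 'lower.dist']))
```

An empty list means every extension you asked about is registered.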

add mothur tools to:   tool_conf.xml
  <section name="Metagenomics Mothur" id="metagenomics_mothur">
    <label text="Mothur Utilities" id="mothur_utilities"/>
      <tool file="mothur/merge.files.xml"/>
      <tool file="mothur/make.group.xml"/>
      <tool file="mothur/get.groups.xml"/>
      <tool file="mothur/remove.groups.xml"/>
      <tool file="mothur/merge.groups.xml"/>
      <tool file="mothur/count.groups.xml"/>
      <tool file="mothur/make.design.xml"/>
      <tool file="mothur/sub.sample.xml"/>
    <label text="Mothur Sequence Analysis" id="mothur_sequence_analysis"/>
      <tool file="mothur/sffinfo.xml"/>
      <tool file="mothur/trim.flows.xml"/>
      <tool file="mothur/shhh.flows.xml"/>
      <tool file="mothur/make.fastq.xml"/>
      <tool file="mothur/fastq.info.xml"/>
      <tool file="mothur/summary.seqs.xml"/>
      <tool file="mothur/count.seqs.xml"/>
      <tool file="mothur/reverse.seqs.xml"/>
      <tool file="mothur/list.seqs.xml"/>
      <tool file="mothur/get.seqs.xml"/>
      <tool file="mothur/remove.seqs.xml"/>
      <tool file="mothur/trim.seqs.xml"/>
      <tool file="mothur/unique.seqs.xml"/>
      <tool file="mothur/deunique.seqs.xml"/>
      <tool file="mothur/chop.seqs.xml"/>
      <tool file="mothur/screen.seqs.xml"/>
      <tool file="mothur/filter.seqs.xml"/>
      <tool file="mothur/degap.seqs.xml"/>
      <tool file="mothur/consensus.seqs.xml"/>
      <tool file="mothur/align.seqs.xml"/>
      <tool file="mothur/align.check.xml"/>
      <tool file="mothur/dist.seqs.xml"/>
      <tool file="mothur/pairwise.seqs.xml"/>
      <tool file="mothur/split.abund.xml"/>
      <tool file="mothur/split.groups.xml"/>
      <tool file="mothur/pcoa.xml"/>
      <tool file="mothur/pca.xml"/>
      <tool file="mothur/nmds.xml"/>
      <tool file="mothur/corr.axes.xml"/>
      <tool file="mothur/classify.seqs.xml"/>
      <tool file="mothur/seq.error.xml"/>
    <label text="Mothur Sequence Chimera Detection" id="mothur_sequence_chimera"/>
      <tool file="mothur/chimera.bellerophon.xml"/>
      <tool file="mothur/chimera.ccode.xml"/>
      <tool file="mothur/chimera.check.xml"/>
      <tool file="mothur/chimera.pintail.xml"/>
      <tool file="mothur/chimera.slayer.xml"/>
      <tool file="mothur/chimera.uchime.xml"/>
    <label text="Mothur Operational Taxonomy Unit" id="mothur_taxonomy_unit"/>
      <tool file="mothur/pre.cluster.xml"/>
      <tool file="mothur/cluster.fragments.xml"/>
      <tool file="mothur/cluster.xml"/>
      <tool file="mothur/hcluster.xml"/>
      <tool file="mothur/cluster.classic.xml"/>
      <tool file="mothur/cluster.split.xml"/>
      <tool file="mothur/metastats.xml"/>
      <tool file="mothur/sens.spec.xml"/>
      <tool file="mothur/classify.otu.xml"/>
      <tool file="mothur/parse.list.xml"/>
      <tool file="mothur/get.otus.xml"/>
      <tool file="mothur/remove.otus.xml"/>
      <tool file="mothur/remove.rare.xml"/>
      <tool file="mothur/get.otulist.xml"/>
      <tool file="mothur/get.oturep.xml"/>
      <tool file="mothur/otu.hierarchy.xml"/>
      <tool file="mothur/get.rabund.xml"/>
      <tool file="mothur/get.sabund.xml"/>
      <tool file="mothur/get.relabund.xml"/>
      <tool file="mothur/make.shared.xml"/>
      <tool file="mothur/get.group.xml"/>
      <tool file="mothur/bin.seqs.xml"/>
      <tool file="mothur/get.sharedseqs.xml"/>
      <tool file="mothur/summary.tax.xml"/>
    <label text="Mothur Single Sample Analysis" id="mothur_single_sample_analysis"/>
      <tool file="mothur/collect.single.xml"/>
      <tool file="mothur/rarefaction.single.xml"/>
      <tool file="mothur/summary.single.xml"/>
      <tool file="mothur/heatmap.bin.xml"/>
    <label text="Mothur Multiple Sample Analysis" id="mothur_multiple_sample_analysis"/>
      <tool file="mothur/collect.shared.xml"/>
      <tool file="mothur/rarefaction.shared.xml"/>
      <tool file="mothur/normalize.shared.xml"/>
      <tool file="mothur/summary.shared.xml"/>
      <tool file="mothur/dist.shared.xml"/>
      <tool file="mothur/heatmap.bin.xml"/>
      <tool file="mothur/heatmap.sim.xml"/>
      <tool file="mothur/venn.xml"/>
      <tool file="mothur/tree.shared.xml"/>
    <label text="Mothur Hypothesis Testing" id="mothur_hypothesis_testing"/>
      <tool file="mothur/parsimony.xml"/>
      <tool file="mothur/unifrac.weighted.xml"/>
      <tool file="mothur/unifrac.unweighted.xml"/>
      <tool file="mothur/libshuff.xml"/>
      <tool file="mothur/amova.xml"/>
      <tool file="mothur/homova.xml"/>
      <tool file="mothur/mantel.xml"/>
      <tool file="mothur/anosim.xml"/>
    <label text="Mothur Phylotype Analysis" id="mothur_phylotype_analysis"/>
      <tool file="mothur/get.lineage.xml"/>
      <tool file="mothur/remove.lineage.xml"/>
      <tool file="mothur/phylotype.xml"/>
      <tool file="mothur/phylo.diversity.xml"/>
      <tool file="mothur/clearcut.xml"/>
      <tool file="mothur/indicator.xml"/>
      <tool file="mothur/deunique.tree.xml"/>
      <tool file="mothur/TreeVector.xml"/>
  </section> <!-- metagenomics_mothur -->

############ DESIGN NOTES #########################################################################################################
Each mothur command has its own tool_config (.xml) file, but all call the same python wrapper code: mothur_wrapper.py

  (The environment variable MOTHUR_MAX_PROCESSORS can be used to limit the number of cpu processors used by mothur commands)
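
This README does not show how the wrapper applies MOTHUR_MAX_PROCESSORS. One plausible reading is that a requested processor count is capped at the value of the variable. The sketch below illustrates that interpretation only; limit_processors is a hypothetical name, and falling back to the requested value when the variable is unset or not a number is an assumption.

```python
import os


def limit_processors(requested, environ=os.environ):
    """Cap a requested processor count at MOTHUR_MAX_PROCESSORS, if set.

    Assumed behaviour (not taken from mothur_wrapper.py): an unset or
    non-numeric variable leaves the request unchanged, and the result
    is never lower than 1.
    """
    limit = environ.get('MOTHUR_MAX_PROCESSORS')
    if limit is None:
        return requested
    try:
        return max(1, min(requested, int(limit)))
    except ValueError:
        return requested


if __name__ == '__main__':
    print(limit_processors(8))
```

Passing the environment as an argument keeps the function easy to test without changing the real environment.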

* Every mothur tool will call mothur_wrapper.py script with a --cmd= parameter that gives the mothur command name.
* Every tool will produce the logfile of the mothur run as an output.
* When the outputs of a mothur command could be determined in advance, they are included in the --result= parameter to mothur_wrapper.py
* When the number of outputs cannot be determined in advance, the name patterns and datatypes of the outputs
  are included in the --new_datasets parameter to mothur_wrapper.py
 
Here is an example call to the mothur_wrapper.py script with an explanation before each param:
 mothur_wrapper.py
 # name of a mothur command, this is required
 --cmd='summary.shared'
 # Galaxy output dataset list, these are output files that can be determined before the command is run
 # The items in the list are separated by commas
 # Each item contains a regex to match the output filename and a galaxy dataset filepath in which to copy the data (separated by :)
 --result='^mothur.\S+\.logfile$:'/home/galaxy/data/database/files/002/dataset_2613.dat,'^\S+\.summary$:'/home/galaxy/data/database/files/002/dataset_2614.dat
 # Galaxy output dataset extra_files_path directory in which to put all output files (usually the logfile's extra_files_path)
 --outputdir='/home/galaxy/data/database/files/002/dataset_2613_files'
 # The id of one of the galaxy outputs (e.g. the mothur logfile) used for dynamic dataset generation (when number of outputs not known in advance)
 #  see: http://bitbucket.org/galaxy/galaxy-central/wiki/ToolsMultipleOutput
 --datasetid='2578'
 # The galaxy directory in which to copy all output files for dynamic dataset generation (special galaxy tool param: $__new_file_path__)
 --new_file_path='$__new_file_path__'
 # specifies files to copy to the new_file_path
 # The list is separated by commas
 # Each item contains: a regex pattern for matching filenames and a galaxy datatype (separated by :)
 # The regex match.groups()[0] is used as the id name of the dataset, and must result in a unique name for each output
 --new_datasets='^\S+?\.((\S+)\.(unique|[0-9.]*)\.dist)$:lower.dist'
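
The --result and --new_datasets values in the example above share the same "regex:value" list format. The sketch below shows one way such lists could be parsed and used to sort mothur output file names into fixed and dynamic outputs. It is an illustration under stated assumptions, not the code in mothur_wrapper.py: parse_pairs and route_outputs are hypothetical names, the split is naive (it assumes no comma inside a regex and no colon inside a value), and the file names and /tmp paths are made up.

```python
import re


def parse_pairs(spec):
    """Split 'regex:value,regex:value' into (compiled_regex, value) pairs.

    Naive split: assumes commas only separate items and that the last
    colon in each item separates the regex from its value.
    """
    pairs = []
    for item in spec.split(','):
        pattern, _, value = item.rpartition(':')
        pairs.append((re.compile(pattern), value))
    return pairs


def route_outputs(filenames, result_spec, new_datasets_spec):
    """Return (fixed, dynamic) for a list of output file names.

    fixed   -- dict mapping file name to the dataset path from --result
    dynamic -- list of (file name, dataset id name, datatype) taken from
               --new_datasets, where the id name is match.groups()[0]
    """
    results = parse_pairs(result_spec)
    new_datasets = parse_pairs(new_datasets_spec)
    fixed, dynamic = {}, []
    for name in filenames:
        for regex, path in results:
            if regex.match(name):
                fixed[name] = path
                break
        else:
            # No --result pattern matched, so try the dynamic patterns.
            for regex, datatype in new_datasets:
                match = regex.match(name)
                if match:
                    dynamic.append((name, match.groups()[0], datatype))
                    break
    return fixed, dynamic


if __name__ == '__main__':
    # Hypothetical output names and dataset paths, for illustration only.
    files = ['mothur.1337.logfile', 'input.summary', 'input.fn.unique.dist']
    result = r'^mothur.\S+\.logfile$:/tmp/dataset_1.dat,^\S+\.summary$:/tmp/dataset_2.dat'
    new = r'^\S+?\.((\S+)\.(unique|[0-9.]*)\.dist)$:lower.dist'
    print(route_outputs(files, result, new))
```

Because the id name comes from the first capture group, the outermost parentheses in the --new_datasets regex decide what makes each dynamic output unique.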

 ## 
 ## NOTE:   The "read" commands were eliminated with Mothur version 1.18
 ##