# HG changeset patch
# User pjbriggs
# Date 1510240409 18000
# Node ID 47ec9c6f44b8be9984fd6ec487574cf15b684dc5
planemo upload for repository https://github.com/pjbriggs/Amplicon_analysis-galaxy commit b63924933a03255872077beb4d0fde49d77afa92
diff -r 000000000000 -r 47ec9c6f44b8 README.rst
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.rst Thu Nov 09 10:13:29 2017 -0500
@@ -0,0 +1,249 @@
+Amplicon_analysis-galaxy
+========================
+
+A Galaxy tool wrapper to Mauro Tutino's ``Amplicon_analysis`` pipeline
+script at https://github.com/MTutino/Amplicon_analysis
+
+The pipeline can analyse paired-end 16S rRNA data from Illumina MiSeq
+(Casava >= 1.8) and performs the following operations:
+
+ * QC and clean up of input data
+ * Removal of singletons and chimeras and building of OTU table
+   and phylogenetic tree
+ * Beta and alpha diversity analysis
+
+Usage documentation
+===================
+
+Usage of the tool (including required inputs) is documented within
+the ``help`` section of the tool XML.
+
+Installing the tool in a Galaxy instance
+========================================
+
+The following sections describe how to install the tool files,
+dependencies and reference data, and how to configure the Galaxy
+instance to detect the dependencies and reference data correctly
+at run time.
+
+1. Install the dependencies
+---------------------------
+
+The ``install_tool_deps.sh`` script can be used to fetch and install the
+dependencies locally, for example::
+
+ install_tool_deps.sh /path/to/local_tool_dependencies
+
+This can take some time to complete. When finished it should have
+created a set of directories containing the dependencies under the
+specified top level directory.
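+
+For example, after a successful run the top level directory should
+contain a set of versioned package directories along these lines
+(illustrative; the exact entries depend on the script version)::
+
+    /path/to/local_tool_dependencies/
+        amplicon_analysis_pipeline/1.1/
+        fastqc/0.11.3/
+        pandaseq/2.8.1/
+        spades/3.5.0/
+        ...
+
+with each versioned directory supplying an ``env.sh`` script that sets
+up the environment for that dependency.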
+
+2. Install the tool files
+-------------------------
+
+The core tool is hosted on the Galaxy toolshed, so it can be installed
+directly from there (this is the recommended route):
+
+ * https://toolshed.g2.bx.psu.edu/view/pjbriggs/amplicon_analysis_pipeline/
+
+Alternatively it can be installed manually; in this case there are two
+files to install:
+
+ * ``amplicon_analysis_pipeline.xml`` (the Galaxy tool definition)
+ * ``amplicon_analysis_pipeline.py`` (the Python wrapper script)
+
+Put these in a directory that is visible to Galaxy (e.g. a
+``tools/Amplicon_analysis/`` folder), and modify the ``tool_conf.xml``
+file to tell Galaxy to offer the tool by adding a line e.g.::
+
+    <tool file="Amplicon_analysis/amplicon_analysis_pipeline.xml" />
+
+3. Install the reference data
+-----------------------------
+
+The script ``References.sh`` from the pipeline package at
+https://github.com/MTutino/Amplicon_analysis can be run to install
+the reference data, for example::
+
+ cd /path/to/pipeline/data
+ wget https://github.com/MTutino/Amplicon_analysis/raw/master/References.sh
+ /bin/bash ./References.sh
+
+will install the data in ``/path/to/pipeline/data``.
+
+**NB** The final amount of data downloaded and uncompressed will be
+around 6GB.
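+
+A quick sanity check once the script finishes is to look at the size
+of the directory, e.g.::
+
+    du -sh /path/to/pipeline/data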
+
+4. Configure dependencies and reference data in Galaxy
+------------------------------------------------------
+
+The final steps are to make your Galaxy installation aware of the
+tool dependencies and reference data, so it can locate them both when
+the tool is run.
+
+To target the tool dependencies installed previously, add the
+following lines to the ``dependency_resolvers_conf.xml`` file in the
+Galaxy ``config`` directory::
+
+    <dependency_resolvers>
+      ...
+      <galaxy_packages base_path="/path/to/local_tool_dependencies" />
+      <galaxy_packages base_path="/path/to/local_tool_dependencies"
+                       versionless="true" />
+      ...
+    </dependency_resolvers>
+
+(NB it is recommended to place these *before* the ``<conda />``
+resolvers)
+
+(If you're not familiar with dependency resolvers in Galaxy then
+see the documentation at
+https://docs.galaxyproject.org/en/master/admin/dependency_resolvers.html
+for more details.)
+
+The tool locates the reference data via an environment variable called
+``AMPLICON_ANALYSIS_REF_DATA_PATH``, which needs to be set to the parent
+directory where the reference data has been installed.
+
+There are various ways to do this, depending on how your Galaxy
+installation is configured:
+
+ * **For local instances:** add a line to set it in the
+   ``config/local_env.sh`` file of your Galaxy installation, e.g.::
+
+       export AMPLICON_ANALYSIS_REF_DATA_PATH=/path/to/pipeline/data
+
+ * **For production instances:** set the value in the ``job_conf.xml``
+   configuration file, e.g.::
+
+       <destination id="amplicon_analysis">
+         <env id="AMPLICON_ANALYSIS_REF_DATA_PATH">/path/to/pipeline/data</env>
+       </destination>
+
+   and then specify that the pipeline tool uses this destination::
+
+       <tool id="amplicon_analysis_pipeline" destination="amplicon_analysis" />
+
+   (For more about job destinations see the Galaxy documentation at
+   https://galaxyproject.org/admin/config/jobs/#job-destinations)
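+
+   Putting the two fragments together, a minimal ``job_conf.xml``
+   sketch might look like this (the local job runner is assumed
+   purely for illustration)::
+
+       <job_conf>
+         <plugins>
+           <plugin id="local" type="runner"
+                   load="galaxy.jobs.runners.local:LocalJobRunner"/>
+         </plugins>
+         <destinations default="local">
+           <destination id="local" runner="local"/>
+           <destination id="amplicon_analysis" runner="local">
+             <env id="AMPLICON_ANALYSIS_REF_DATA_PATH">/path/to/pipeline/data</env>
+           </destination>
+         </destinations>
+         <tools>
+           <tool id="amplicon_analysis_pipeline" destination="amplicon_analysis"/>
+         </tools>
+       </job_conf>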
+
+5. Enable rendering of HTML outputs from pipeline
+-------------------------------------------------
+
+To ensure that HTML outputs are displayed correctly in Galaxy
+(for example the Vsearch OTU table heatmaps), Galaxy needs to be
+configured not to sanitize the outputs from the ``Amplicon_analysis``
+tool.
+
+Either:
+
+ * **For local instances:** set ``sanitize_all_html = False`` in
+   ``config/galaxy.ini`` (NB don't do this on production servers or
+   public instances!); or
+
+ * **For production instances:** add the ``Amplicon_analysis`` tool
+   to the display whitelist in the Galaxy instance (see the example
+   settings below):
+
+   - Set ``sanitize_whitelist_file = config/whitelist.txt`` in
+     ``config/galaxy.ini`` and restart Galaxy;
+   - Go to ``Admin>Manage Display Whitelist``, check the box for
+     ``Amplicon_analysis`` (hint: use your browser's 'find-in-page'
+     search function to help locate it) and click on
+     ``Submit new whitelist`` to update the settings.
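+
+   For reference, the production-instance settings in
+   ``config/galaxy.ini`` would then look something like this (with
+   global HTML sanitization left enabled)::
+
+       sanitize_all_html = True
+       sanitize_whitelist_file = config/whitelist.txt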
+
+Additional details
+==================
+
+Some other things to be aware of:
+
+ * Note that using the Silva database requires a minimum of 18GB of RAM
+
+Known problems
+==============
+
+ * Only the ``VSEARCH`` pipeline in Mauro's script is currently
+   available via the Galaxy tool; the ``USEARCH`` and ``QIIME``
+   pipelines have yet to be implemented.
+ * The images in the tool help section are not visible if the
+   tool has been installed locally, or if it has been installed in
+   a Galaxy instance which is served from a subdirectory.
+
+   These are both problems with Galaxy and not the tool; see
+   https://github.com/galaxyproject/galaxy/issues/4490 and
+   https://github.com/galaxyproject/galaxy/issues/1676
+
+Appendix: availability of tool dependencies
+===========================================
+
+The tool takes its dependencies from the underlying pipeline script (see
+https://github.com/MTutino/Amplicon_analysis/blob/master/README.md
+for details).
+
+As noted above, currently the ``install_tool_deps.sh`` script can be
+used to manually install the dependencies for a local tool install.
+
+In principle these should also be available if the tool were installed
+from a toolshed. However it would be preferable in this case to get as
+many of the dependencies as possible via the ``conda`` dependency
+resolver.
+
+The following are known to be available via conda, with the required
+version:
+
+ - cutadapt 1.8.1
+ - sickle-trim 1.33
+ - bioawk 1.0
+ - fastqc 0.11.3
+ - R 3.2.0
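+
+In principle these could be pulled in with a single ``conda install``
+along the following lines (package names are those used on the
+``bioconda`` channel and may differ; untested)::
+
+    conda install -c bioconda cutadapt=1.8.1 sickle-trim=1.33 \
+        bioawk=1.0 fastqc=0.11.3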
+
+Some dependencies are available but with the "wrong" versions:
+
+ - spades (need 3.5.0)
+ - qiime (need 1.8.0)
+ - blast (need 2.2.26)
+ - vsearch (need 1.1.3)
+
+The following dependencies are currently unavailable:
+
+ - fasta_number (need 02jun2015)
+ - fasta-splitter (need 0.2.4)
+ - rdp_classifier (need 2.2)
+ - microbiomeutil (need r20110519)
+
+(NB usearch 6.1.544 and 8.0.1623 are special cases which must be
+handled outside of Galaxy's dependency management systems.)
+
+History
+=======
+
+========== ======================================================================
+Version    Changes
+---------- ----------------------------------------------------------------------
+1.1.0      First official version on Galaxy toolshed.
+1.0.6      Expand inline documentation to provide detailed usage guidance.
+1.0.5      Updates including:
+
+           - Capture read counts from quality control as new output dataset
+           - Capture FastQC per-base quality boxplots for each sample as
+             new output dataset
+           - Add support for -l option (sliding window length for trimming)
+           - Default for -L set to "200"
+1.0.4      Various updates:
+
+           - Additional outputs are captured when a "Categories" file is
+             supplied (alpha diversity rarefaction curves and boxplots)
+           - Sample names derived from Fastqs in a collection of pairs
+             are trimmed to SAMPLE_S* (for Illumina-style Fastq filenames)
+           - Input Fastqs can now be of more general ``fastq`` type
+           - Log file outputs are captured in new output dataset
+           - User can specify a "title" for the job which is copied into
+             the dataset names (to distinguish outputs from different runs)
+           - Improved detection and reporting of problems with input
+             Metatable
+1.0.3      Take the sample names from the collection dataset names when
+           using collection as input (this is now the default input mode);
+           collect additional output dataset; disable ``usearch``-based
+           pipelines (i.e. ``UPARSE`` and ``QIIME``).
+1.0.2      Enable support for FASTQs supplied via dataset collections and
+           fix some broken output datasets.
+1.0.1      Initial version
+========== ======================================================================
diff -r 000000000000 -r 47ec9c6f44b8 amplicon_analysis_pipeline.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/amplicon_analysis_pipeline.py Thu Nov 09 10:13:29 2017 -0500
@@ -0,0 +1,329 @@
+#!/usr/bin/env python
+#
+# Wrapper script to run Amplicon_analysis_pipeline.sh
+# from Galaxy tool
+
+import sys
+import os
+import argparse
+import subprocess
+import glob
+
+class PipelineCmd(object):
+    def __init__(self,cmd):
+        self.cmd = [str(cmd)]
+    def add_args(self,*args):
+        for arg in args:
+            self.cmd.append(str(arg))
+    def __repr__(self):
+        return ' '.join([str(arg) for arg in self.cmd])
+
+def ahref(target,name=None,type=None):
+    # Return an HTML anchor tag linking to 'target'
+    if name is None:
+        name = os.path.basename(target)
+    ahref = "<a href='%s'" % target
+    if type is not None:
+        ahref += " type='%s'" % type
+    ahref += ">%s</a>" % name
+    return ahref
+
+def check_errors():
+    # Errors in Amplicon_analysis_pipeline.log
+    with open('Amplicon_analysis_pipeline.log','r') as pipeline_log:
+        log = pipeline_log.read()
+        if "Names in the first column of Metatable.txt and in the second column of Final_name.txt do not match" in log:
+            print_error("""*** Sample IDs don't match dataset names ***
+
+The sample IDs (first column of the Metatable file) don't match the
+supplied sample names for the input Fastq pairs.
+""")
+    # Errors in pipeline output
+    with open('pipeline.log','r') as pipeline_log:
+        log = pipeline_log.read()
+        if "Errors and/or warnings detected in mapping file" in log:
+            with open("Metatable_log/Metatable.log","r") as metatable_log:
+                # Echo the Metatable log file to the tool log
+                print_error("""*** Error in Metatable mapping file ***
+
+%s""" % metatable_log.read())
+        elif "No header line was found in mapping file" in log:
+            # Report error to the tool log
+            print_error("""*** No header in Metatable mapping file ***
+
+Check you've specified the correct file as the input Metatable""")
+
+def print_error(message):
+    width = max([len(line) for line in message.split('\n')]) + 4
+    sys.stderr.write("\n%s\n" % ('*'*width))
+    for line in message.split('\n'):
+        sys.stderr.write("* %s%s *\n" % (line,' '*(width-len(line)-4)))
+    sys.stderr.write("%s\n\n" % ('*'*width))
+
+def clean_up_name(sample):
+    # Remove trailing "_L[0-9]+_001" from Fastq
+    # pair names
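+    # e.g. "Mock_S1_L001_001" -> "Mock_S1"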
+    split_name = sample.split('_')
+    if split_name[-1] == "001":
+        split_name = split_name[:-1]
+    if split_name[-1].startswith('L'):
+        try:
+            int(split_name[-1][1:])
+            split_name = split_name[:-1]
+        except ValueError:
+            pass
+    return '_'.join(split_name)
+
+def list_outputs(filen=None):
+    # List the output directory contents
+    # If filen is specified then will be the filename to
+    # write to, otherwise write to stdout
+    if filen is not None:
+        fp = open(filen,'w')
+    else:
+        fp = sys.stdout
+    results_dir = os.path.abspath("RESULTS")
+    fp.write("Listing contents of output dir %s:\n" % results_dir)
+    ix = 0
+    for d,dirs,files in os.walk(results_dir):
+        ix += 1
+        fp.write("-- %d: %s\n" % (ix,
+                                  os.path.relpath(d,results_dir)))
+        for f in files:
+            ix += 1
+            fp.write("---- %d: %s\n" % (ix,
+                                        os.path.relpath(f,results_dir)))
+    # Close output file
+    if filen is not None:
+        fp.close()
+
+if __name__ == "__main__":
+    # Command line
+    print "Amplicon analysis: starting"
+    p = argparse.ArgumentParser()
+    p.add_argument("metatable",
+                   metavar="METATABLE_FILE",
+                   help="Metatable.txt file")
+    p.add_argument("fastq_pairs",
+                   metavar="SAMPLE_NAME FQ_R1 FQ_R2",
+                   nargs="+",
+                   default=list(),
+                   help="Triplets of SAMPLE_NAME followed by "
+                   "a R1/R2 FASTQ file pair")
+    p.add_argument("-g",dest="forward_pcr_primer")
+    p.add_argument("-G",dest="reverse_pcr_primer")
+    p.add_argument("-q",dest="trimming_threshold")
+    p.add_argument("-O",dest="minimum_overlap")
+    p.add_argument("-L",dest="minimum_length")
+    p.add_argument("-l",dest="sliding_window_length")
+    p.add_argument("-P",dest="pipeline",
+                   choices=["vsearch","uparse","qiime"],
+                   type=str.lower,
+                   default="vsearch")
+    p.add_argument("-S",dest="use_silva",action="store_true")
+    p.add_argument("-r",dest="reference_data_path")
+    p.add_argument("-c",dest="categories_file")
+    args = p.parse_args()
+
+    # Build the environment for running the pipeline
+    print "Amplicon analysis: building the environment"
+    metatable_file = os.path.abspath(args.metatable)
+    os.symlink(metatable_file,"Metatable.txt")
+    print "-- made symlink to Metatable.txt"
+
+    # Link to Categories.txt file (if provided)
+    if args.categories_file is not None:
+        categories_file = os.path.abspath(args.categories_file)
+        os.symlink(categories_file,"Categories.txt")
+        print "-- made symlink to Categories.txt"
+
+    # Link to FASTQs and construct Final_name.txt file
+    sample_names = []
+    with open("Final_name.txt",'w') as final_name:
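+        # NB zip over three references to the same iterator groups
+        # the flat SAMPLE,R1,R2 argument list into triplets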
+        fastqs = iter(args.fastq_pairs)
+        for sample_name,fqr1,fqr2 in zip(fastqs,fastqs,fastqs):
+            sample_name = clean_up_name(sample_name)
+            r1 = "%s_R1_.fastq" % sample_name
+            r2 = "%s_R2_.fastq" % sample_name
+            os.symlink(fqr1,r1)
+            os.symlink(fqr2,r2)
+            final_name.write("%s\n" % '\t'.join((r1,sample_name)))
+            final_name.write("%s\n" % '\t'.join((r2,sample_name)))
+            sample_names.append(sample_name)
+
+    # Construct the pipeline command
+    print "Amplicon analysis: constructing pipeline command"
+    pipeline = PipelineCmd("Amplicon_analysis_pipeline.sh")
+    if args.forward_pcr_primer:
+        pipeline.add_args("-g",args.forward_pcr_primer)
+    if args.reverse_pcr_primer:
+        pipeline.add_args("-G",args.reverse_pcr_primer)
+    if args.trimming_threshold:
+        pipeline.add_args("-q",args.trimming_threshold)
+    if args.minimum_overlap:
+        pipeline.add_args("-O",args.minimum_overlap)
+    if args.minimum_length:
+        pipeline.add_args("-L",args.minimum_length)
+    if args.sliding_window_length:
+        pipeline.add_args("-l",args.sliding_window_length)
+    if args.reference_data_path:
+        pipeline.add_args("-r",args.reference_data_path)
+    pipeline.add_args("-P",args.pipeline)
+    if args.use_silva:
+        pipeline.add_args("-S")
+
+    # Echo the pipeline command to stdout
+    print "Running %s" % pipeline
+
+    # Run the pipeline
+    with open("pipeline.log","w") as pipeline_out:
+        try:
+            subprocess.check_call(pipeline.cmd,
+                                  stdout=pipeline_out,
+                                  stderr=subprocess.STDOUT)
+            exit_code = 0
+            print "Pipeline completed ok"
+        except subprocess.CalledProcessError as ex:
+            # Non-zero exit status
+            sys.stderr.write("Pipeline failed: exit code %s\n" %
+                             ex.returncode)
+            exit_code = ex.returncode
+        except Exception as ex:
+            # Some other problem
+            sys.stderr.write("Unexpected error: %s\n" % str(ex))
+            exit_code = 1
+
+    # Write out the list of outputs
+    outputs_file = "Pipeline_outputs.txt"
+    list_outputs(outputs_file)
+
+    # Check for log file
+    log_file = "Amplicon_analysis_pipeline.log"
+    if os.path.exists(log_file):
+        print "Found log file: %s" % log_file
+        if exit_code == 0:
+            # Create an HTML file to link to log files etc
+            # NB the paths to the files should be correct once
+            # copied by Galaxy on job completion
+            with open("pipeline_outputs.html","w") as html_out:
+                html_out.write("""<html>
+<head>
+<title>Amplicon analysis pipeline: outputs</title>
+</head>
+<body>
+<h1>Amplicon analysis pipeline: outputs</h1>
+<ul>
+""")
+                html_out.write("<li>%s</li>\n" %
+                               ahref(log_file,type="text/plain"))
+                html_out.write("<li>%s</li>\n" %
+                               ahref("pipeline.log",type="text/plain"))
+                html_out.write("<li>%s</li>\n" %
+                               ahref(outputs_file,type="text/plain"))
+                html_out.write("""</ul>
+</body>
+</html>
+""")
+            # Write the per-base quality boxplots from FastQC to HTML
+            with open("fastqc_quality_boxplots.html","w") as quality_boxplots:
+                # Phred score used for trimming (tool default is 20)
+                phred_score = args.trimming_threshold
+                if phred_score is None:
+                    phred_score = "20"
+                quality_boxplots.write("""<html>
+<head>
+<title>Amplicon analysis pipeline: per-base quality boxplots</title>
+</head>
+<body>
+<h1>Amplicon analysis pipeline: per-base quality boxplots</h1>
+""")
+                # Look for raw and trimmed FastQC output for each sample
+                for sample_name in sample_names:
+                    fastqc_dir = os.path.join(sample_name,"FastQC")
+                    quality_boxplots.write("<h2>%s</h2>" % sample_name)
+                    for d in ("Raw","cutdapt_sickle/Q%s" % phred_score):
+                        quality_boxplots.write("<h3>%s</h3>" % d)
+                        fastqc_html_files = glob.glob(
+                            os.path.join(fastqc_dir,d,"*_fastqc.html"))
+                        if not fastqc_html_files:
+                            quality_boxplots.write("<p>No FastQC outputs found</p>")
+                            continue
+                        # Pull out the per-base quality boxplots
+                        for f in fastqc_html_files:
+                            boxplot = None
+                            with open(f) as fp:
+                                for line in fp.read().split(">"):
+                                    try:
+                                        line.index("alt=\"Per base quality graph\"")
+                                        boxplot = line + ">"
+                                        break
+                                    except ValueError:
+                                        pass
+                            if boxplot is None:
+                                boxplot = "Missing plot"
+                            quality_boxplots.write("<h4>%s</h4><p>%s</p>" %
+                                                   (os.path.basename(f),
+                                                    boxplot))
+                quality_boxplots.write("""</body>
+</html>
+""")
+
+    # Handle additional output when categories file was supplied
+    if args.categories_file is not None:
+        # Alpha diversity boxplots
+        print "Amplicon analysis: indexing alpha diversity boxplots"
+        boxplots_dir = os.path.abspath(
+            os.path.join("RESULTS",
+                         "%s_%s" % (args.pipeline.title(),
+                                    ("gg" if not args.use_silva
+                                     else "silva")),
+                         "Alpha_diversity",
+                         "Alpha_diversity_boxplot",
+                         "Categories_shannon"))
+        print "Amplicon analysis: gathering PDFs from %s" % boxplots_dir
+        boxplot_pdfs = [os.path.basename(pdf)
+                        for pdf in
+                        sorted(glob.glob(
+                            os.path.join(boxplots_dir,"*.pdf")))]
+        with open("alpha_diversity_boxplots.html","w") as boxplots_out:
+            boxplots_out.write("""<html>
+<head>
+<title>Amplicon analysis pipeline: Alpha Diversity Boxplots (Shannon)</title>
+</head>
+<body>
+<h1>Amplicon analysis pipeline: Alpha Diversity Boxplots (Shannon)</h1>
+""")
+            boxplots_out.write("<ul>\n")
+            for pdf in boxplot_pdfs:
+                boxplots_out.write("<li>%s</li>\n" % ahref(pdf))
+            boxplots_out.write("</ul>\n")
+            boxplots_out.write("""</body>
+</html>
+""")
+
+    # Finish
+    print "Amplicon analysis: finishing, exit code: %s" % exit_code
+    sys.exit(exit_code)
diff -r 000000000000 -r 47ec9c6f44b8 amplicon_analysis_pipeline.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/amplicon_analysis_pipeline.xml Thu Nov 09 10:13:29 2017 -0500
@@ -0,0 +1,484 @@
+<tool id="amplicon_analysis_pipeline" name="Amplicon analysis pipeline" version="1.1.0">
+  <description>analyse 16S rRNA data from Illumina MiSeq paired-end reads</description>
+  <requirements>
+    <requirement type="package" version="1.1">amplicon_analysis_pipeline</requirement>
+    <requirement type="package" version="1.8.1">cutadapt</requirement>
+    <requirement type="package" version="1.33">sickle</requirement>
+    <requirement type="package" version="1.0">bioawk</requirement>
+    <requirement type="package" version="2.8.1">pandaseq</requirement>
+    <requirement type="package" version="3.5.0">spades</requirement>
+    <requirement type="package" version="0.11.3">fastqc</requirement>
+    <requirement type="package" version="1.8.0">qiime</requirement>
+    <requirement type="package" version="2.2.26">blast</requirement>
+    <requirement type="package" version="0.2.4">fasta-splitter</requirement>
+    <requirement type="package" version="2.2">rdp-classifier</requirement>
+    <requirement type="package" version="3.2.0">R</requirement>
+    <requirement type="package" version="1.1.3">vsearch</requirement>
+    <requirement type="package" version="r20110519">microbiomeutil</requirement>
+    <requirement type="package" version="02jun2015">fasta_number</requirement>
+  </requirements>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ categories_file_in is not None
+
+
+
+
+
+ = 1.8) paired-end reads.
+
+Usage
+-----
+
+1. Preparation of the mapping file and format of unique sample id
+*****************************************************************
+
+Before using the amplicon analysis pipeline, follow the steps below to
+avoid analysis failures and to ensure samples are labelled
+appropriately. Sample names for the labelling are derived from the
+fastq file names generated by the sequencer. The labels will include
+everything between the beginning of the name and the sample number
+(from C11 to S19 in Fig. 1).
+
+.. image:: Pipeline_description_Fig1.png
+   :height: 46
+   :width: 382
+
+**Figure 1**
+
+If analysing 16S data from multiple runs:
+
+The samples from different runs may have identical IDs. For example,
+when sequencing the same samples twice, these could by chance be at
+the same position in both runs. This would cause the fastq files
+to have exactly the same IDs (Fig. 2).
+
+.. image:: Pipeline_description_Fig2.png
+   :height: 100
+   :width: 463
+
+**Figure 2**
+
+If sample IDs are identical the pipeline will fail to run and will
+generate an error at the beginning of the analysis.
+
+To avoid having to change the file names, ensure that the sample IDs
+are not repeated before uploading the files.
+
+2. To upload the file
+*********************
+
+Click on **Get Data/Upload File** from the Galaxy tool panel on the
+left hand side.
+
+From the pop-up window, choose how to upload the file. The
+**Choose local file** option can be used for files up to 4GB. Fastq files
+from Illumina MiSeq will rarely be bigger than this, so this option is
+recommended.
+
+After choosing the files click **Start** to begin the upload. The window can
+now be closed and the files will be uploaded onto the Galaxy server. You
+will see the progress on the ``HISTORY`` panel on the right
+side of the screen. The colour will change from grey (queuing), to yellow
+(uploading) and finally green (uploaded).
+
+Once all the files are uploaded, click on the "operations on multiple
+datasets" icon and select the fastq files that need to be analysed.
+Click on the tab **For all selected...** and on the option
+**Build List of Dataset pairs** (Fig. 3).
+
+.. image:: Pipeline_description_Fig3.png
+   :height: 247
+   :width: 586
+
+**Figure 3**
+
+Change the filter parameters ``_1`` and ``_2`` to ``_R1`` and ``_R2``.
+The fastq files forward R1 and reverse R2 should now appear in the
+corresponding columns.
+
+Select **Autopair**. This creates a collection of paired fastq files for
+the forward and reverse reads for each sample. The names of the pairs
+are the ones used by the pipeline. You are free to change the names at
+this point as long as they match the ones used in the Metatable file
+(see section 3).
+
+Name the collection and click on **create list**. This reduces the time
+required to input the forward and reverse reads for each individual sample.
+
+3. Create the Metatable files
+*****************************
+
+Metatable.txt
+~~~~~~~~~~~~~
+
+Click on the list of pairs you just created to see the name of the single
+pairs. The name of the pairs will be the ones used by the pipeline,
+therefore, these are the names that need to be used in the Metatable file.
+
+The Metatable file has to be in QIIME format. You can find a description
+of it on QIIME website http://qiime.org/documentation/file_formats.html
+
+EXAMPLE::
+
+ #SampleID BarcodeSequence LinkerPrimerSequence Disease Gender Description
+ Mock-RUN1 TAAGGCGAGCGTAAGA PsA Male Control
+ Mock-RUN2 CGTACTAGGCGTAAGA PsA Male Control
+ Mock-RUN3 AGGCAGAAGCGTAAGA PsC Female Control
+
+Briefly: the column ``LinkerPrimerSequence`` is empty but it cannot be
+deleted. The header is very important: ``#SampleID``, ``BarcodeSequence``,
+``LinkerPrimerSequence`` and ``Description`` are mandatory. Between
+``LinkerPrimerSequence`` and ``Description`` you can add as many columns
+as you want. For every column a PCoA plot will be created (see
+**Results** section). You can create this file in Excel but it will have
+to be saved as ``Text (Tab delimited)``.
+
+During the analysis the Metatable.txt will be checked to ensure that
+the file has the correct format. If necessary it will be modified and
+made available as Metatable_corrected.txt in the history panel. If you
+are going to use the metatable file for any other statistical analyses,
+remember to use the corrected version, otherwise the sample names
+might not match!
+
+Categories.txt (optional)
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This file is required if you want box plots for comparison of
+alpha diversity indices (see **Results** section). The file is a list
+(without a header and IN ONE COLUMN) of categories present in the
+Metatable.txt file. THE NAMES YOU ARE USING HAVE TO BE THE SAME AS THE
+ONES USED IN THE METATABLE.TXT. You can create this file in Excel but
+it will have to be saved as ``Text (Tab delimited)``.
+
+EXAMPLE::
+
+ Disease
+ Gender
+
+Metatable and categories files can be uploaded using Get Data as done
+with the fastq files.
+
+4. Analysis
+***********
+
+Under **Amplicon_Analysis_Pipeline**
+
+ * **Title** Name to distinguish between runs. It will be shown at
+   the beginning of each output file name.
+
+ * **Input Metatable.txt file** Select the Metatable.txt file related
+   to this analysis.
+
+ * **Input Categories.txt file (Optional)** Select the Categories.txt
+   file related to this analysis.
+
+ * **Input FASTQ type** Select *Dataset pairs in a collection* and,
+   then, the collection of pairs you created earlier.
+
+ * **Forward/Reverse PCR primer sequence** If the PCR primer sequences
+   have not been removed by the MiSeq software during fastq generation,
+   they have to be removed before the analysis. Insert the PCR primer
+   sequence in the corresponding field. DO NOT include any barcode or
+   adapter sequence. If the PCR primers have already been trimmed by the
+   MiSeq, including the sequence in this field will lead to an error.
+   Only include the sequences if they are still present in the fastq
+   files.
+
+ * **Threshold quality below which reads will be trimmed** Choose the
+   Phred score used by Sickle to trim the reads at the 3’ end.
+
+ * **Minimum length to retain a read after trimming** If the read length
+   after trimming is shorter than a user-defined length, the read, along
+   with the corresponding read pair, will be discarded.
+
+ * **Minimum overlap in bp between forward and reverse reads** Choose
+   the minimum basepair overlap used by Pandaseq to assemble the reads.
+   Default is 10.
+
+ * **Minimum length in bp to keep a sequence after overlapping** Choose
+   the minimum sequence length used by Pandaseq to keep a sequence after
+   the overlapping. This depends on the expected amplicon length.
+   Default is 380 (used for V3-V4 16S sequencing; expected length
+   ~440bp).
+
+ * **Pipeline to use for analysis** Choose the pipeline to use for OTU
+   clustering and chimera removal. The Galaxy tool currently supports
+   ``Vsearch`` only; ``Uparse`` and ``QIIME`` will be added shortly
+   (both are already available in the stand-alone pipeline).
+
+ * **Reference database** Choose between the ``GreenGenes`` and
+   ``Silva`` databases for taxa assignment.
+
+Click on **Execute** to start the analysis.
+
+5. Results
+**********
+
+Results are entirely generated using QIIME scripts. The results will
+appear in the History panel when the analysis is completed.
+
+ * **Vsearch_tax_OTU_table (biom format)** The OTU table in BIOM format
+   (http://biom-format.org/)
+
+ * **Vsearch_OTUs.tree** Phylogenetic tree constructed using
+   ``make_phylogeny.py`` (fasttree) QIIME script
+   (http://qiime.org/scripts/make_phylogeny.html)
+
+ * **Vsearch_phylum_genus_dist_barcharts_HTML** HTML file with bar
+   charts at Phylum, Genus and Species level
+   (http://qiime.org/scripts/summarize_taxa.html and
+   http://qiime.org/scripts/plot_taxa_summary.html)
+
+ * **Vsearch_OTUs_count_file** Summary of OTU counts per sample
+   (http://biom-format.org/documentation/summarizing_biom_tables.html)
+
+ * **Vsearch_table_summary_file** Summary of sequence counts per sample
+   (http://biom-format.org/documentation/summarizing_biom_tables.html)
+
+ * **Vsearch_multiplexed_linearized_dereplicated_mc2_repset_nonchimeras_OTUs.fasta**
+   Fasta file with OTU sequences
+
+ * **Vsearch_heatmap_OTU_table_HTML** Interactive OTU heatmap
+   (http://qiime.org/1.8.0/scripts/make_otu_heatmap_html.html)
+
+ * **Vsearch_beta_diversity_weighted_2D_plots_HTML** PCoA plots in HTML
+   format using the weighted UniFrac distance measure. Samples are
+   grouped by the column names present in the Metatable file. The
+   samples are first rarefied to the minimum sequencing depth
+   (http://qiime.org/scripts/beta_diversity_through_plots.html)
+
+ * **Vsearch_beta_diversity_unweighted_2D_plots_HTML** PCoA plots in
+   HTML format using the unweighted UniFrac distance measure. Samples
+   are grouped by the column names present in the Metatable file. The
+   samples are first rarefied to the minimum sequencing depth
+   (http://qiime.org/scripts/beta_diversity_through_plots.html)
+
+Code availability
+-----------------
+
+**Code is available at** https://github.com/MTutino/Amplicon_analysis
+
+Credits
+-------
+
+Pipeline author: Mauro Tutino
+
+Galaxy tool: Peter Briggs
+
+  ]]></help>
+  <citations>
+    <citation type="bibtex">
+ @misc{githubAmplicon_analysis,
+ author = {Tutino, Mauro},
+ year = {2017},
+ title = {Amplicon Analysis Pipeline},
+ publisher = {GitHub},
+ journal = {GitHub repository},
+ url = {https://github.com/MTutino/Amplicon_analysis},
+}
+    </citation>
+  </citations>
+</tool>
diff -r 000000000000 -r 47ec9c6f44b8 install_tool_deps.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/install_tool_deps.sh Thu Nov 09 10:13:29 2017 -0500
@@ -0,0 +1,706 @@
+#!/bin/bash -e
+#
+# Install the tool dependencies for Amplicon_analysis_pipeline.sh for
+# testing from command line
+#
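+# install_python_package: download and install a Python package
+# from a source tarball into a local directory
+# Arguments: $1 install dir, $2 package name, $3 version,
+#            $4 source tarball URL, $5 unpacked source directory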
+function install_python_package() {
+    echo Installing $2 $3 from $4 under $1
+    local install_dir=$1
+    local install_dirs="$install_dir $install_dir/bin $install_dir/lib/python2.7/site-packages"
+    for d in $install_dirs ; do
+        if [ ! -d $d ] ; then
+            mkdir -p $d
+        fi
+    done
+    wd=$(mktemp -d)
+    echo Moving to $wd
+    pushd $wd
+    wget -q $4
+    if [ ! -f "$(basename $4)" ] ; then
+        echo "No archive $(basename $4)"
+        exit 1
+    fi
+    tar xzf $(basename $4)
+    if [ ! -d "$5" ] ; then
+        echo "No directory $5"
+        exit 1
+    fi
+    cd $5
+    /bin/bash <<EOF
+python setup.py install --prefix=$install_dir >>$INSTALL_DIR/INSTALLATION.log 2>&1
+EOF
+    popd
+    rm -rf $wd/*
+    rmdir $wd
+}
+function install_amplicon_analysis_pipeline_1_1() {
+    install_amplicon_analysis_pipeline $1 1.1
+}
+function install_amplicon_analysis_pipeline_1_0() {
+    install_amplicon_analysis_pipeline $1 1.0
+}
+function install_amplicon_analysis_pipeline() {
+    version=$2
+    echo Installing Amplicon_analysis $version
+    install_dir=$1/amplicon_analysis_pipeline/$version
+    if [ -f $install_dir/env.sh ] ; then
+        return
+    fi
+    mkdir -p $install_dir
+    echo Moving to $install_dir
+    pushd $install_dir
+    wget -q https://github.com/MTutino/Amplicon_analysis/archive/v${version}.tar.gz
+    tar zxf v${version}.tar.gz
+    mv Amplicon_analysis-${version} Amplicon_analysis
+    rm -rf v${version}.tar.gz
+    popd
+ # Make setup file
+ cat > $install_dir/env.sh < $install_dir/env.sh < $INSTALL_DIR/env.sh <$INSTALL_DIR/INSTALLATION.log 2>&1
+ mv sickle $INSTALL_DIR/bin
+ popd
+ rm -rf $wd/*
+ rmdir $wd
+ # Make setup file
+ cat > $INSTALL_DIR/env.sh <$INSTALL_DIR/INSTALLATION.log 2>&1
+ mv bioawk $INSTALL_DIR/bin
+ mv maketab $INSTALL_DIR/bin
+ popd
+ rm -rf $wd/*
+ rmdir $wd
+ # Make setup file
+ cat > $INSTALL_DIR/env.sh <$install_dir/INSTALLATION.log 2>&1
+ ./configure --prefix=$install_dir >>$install_dir/INSTALLATION.log 2>&1
+ make; make install >>$install_dir/INSTALLATION.log 2>&1
+ popd
+ rm -rf $wd/*
+ rmdir $wd
+ # Make setup file
+ cat > $1/pandaseq/2.8.1/env.sh < $1/spades/3.5.0/env.sh < $1/fastqc/0.11.3/env.sh < test.f90
+ gfortran -o test test.f90
+ LGF=`ldd test | grep libgfortran | awk '{print $3}'`
+ LGF_CANON=`readlink -f $LGF`
+ LGF_VERS=`objdump -p $LGF_CANON | grep GFORTRAN_1 | sed -r 's/.*GFORTRAN_1\.([0-9])+/\1/' | sort -n | tail -1`
+ if [ $LGF_VERS -gt $BUNDLED_LGF_VERS ]; then
+ cp -p $BUNDLED_LGF_CANON ${BUNDLED_LGF_CANON}.bundled
+ cp -p $LGF_CANON $BUNDLED_LGF_CANON
+ fi
+ popd
+ rm -rf $wd/*
+ rmdir $wd
+ # Atlas 3.10 (build from source)
+ # NB this stolen from galaxyproject/iuc-tools
+ ##local wd=$(mktemp -d)
+ ##echo Moving to $wd
+ ##pushd $wd
+ ##wget -q https://depot.galaxyproject.org/software/atlas/atlas_3.10.2+gx0_src_all.tar.bz2
+ ##wget -q https://depot.galaxyproject.org/software/lapack/lapack_3.5.0_src_all.tar.gz
+ ##wget -q https://depot.galaxyproject.org/software/atlas/atlas_patch-blas-lapack-1.0_src_all.diff
+ ##wget -q https://depot.galaxyproject.org/software/atlas/atlas_patch-shared-lib-1.0_src_all.diff
+ ##wget -q https://depot.galaxyproject.org/software/atlas/atlas_patch-cpu-throttle-1.0_src_all.diff
+ ##tar -jxvf atlas_3.10.2+gx0_src_all.tar.bz2
+ ##cd ATLAS
+ ##mkdir build
+ ##patch -p1 < ../atlas_patch-blas-lapack-1.0_src_all.diff
+ ##patch -p1 < ../atlas_patch-shared-lib-1.0_src_all.diff
+ ##patch -p1 < ../atlas_patch-cpu-throttle-1.0_src_all.diff
+ ##cd build
+ ##../configure --prefix="$INSTALL_DIR" -D c -DWALL -b 64 -Fa alg '-fPIC' --with-netlib-lapack-tarfile=../../lapack_3.5.0_src_all.tar.gz -v 2 -t 0 -Si cputhrchk 0
+ ##make
+ ##make install
+ ##popd
+ ##rm -rf $wd/*
+ ##rmdir $wd
+ export ATLAS_LIB_DIR=$INSTALL_DIR/lib
+ export ATLAS_INCLUDE_DIR=$INSTALL_DIR/include
+ export ATLAS_BLAS_LIB_DIR=$INSTALL_DIR/lib/atlas
+ export ATLAS_LAPACK_LIB_DIR=$INSTALL_DIR/lib/atlas
+ export ATLAS_ROOT_PATH=$INSTALL_DIR
+ export LD_LIBRARY_PATH=$INSTALL_DIR/lib:$LD_LIBRARY_PATH
+ export LD_LIBRARY_PATH=$INSTALL_DIR/lib/atlas:$LD_LIBRARY_PATH
+ # Numpy 1.7.1
+ local wd=$(mktemp -d)
+ echo Moving to $wd
+ pushd $wd
+ wget -q https://depot.galaxyproject.org/software/numpy/numpy_1.7_src_all.tar.gz
+ tar -zxvf numpy_1.7_src_all.tar.gz
+ cd numpy-1.7.1
+ cat > site.cfg < $INSTALL_DIR/env.sh < $install_dir/env.sh <$install_dir/INSTALLATION.log 2>&1
+ mv * $install_dir
+ popd
+ # Clean up
+ rm -rf $wd/*
+ rmdir $wd
+ # Make setup file
+cat > $install_dir/env.sh < $install_dir/env.sh < $install_dir/env.sh <>$install_dir/INSTALLATION.log
+EOF
+ done
+ # Install fasta-splitter
+ wget -q http://kirill-kryukov.com/study/tools/fasta-splitter/files/fasta-splitter-0.2.4.zip
+ unzip -qq fasta-splitter-0.2.4.zip
+ chmod 0755 fasta-splitter.pl
+ mv fasta-splitter.pl $install_dir/bin
+ popd
+ # Clean up
+ rm -rf $wd/*
+ rmdir $wd
+ # Make setup file
+cat > $install_dir/env.sh < $install_dir/env.sh < $install_dir/env.sh <$install_dir/bin/uc2otutab.py
+ cat uc2otutab.py >>$install_dir/bin/uc2otutab.py
+ chmod +x $install_dir/bin/uc2otutab.py
+ popd
+ # Clean up
+ rm -rf $wd/*
+ rmdir $wd
+ # Make setup file
+cat > $install_dir/env.sh <