Mercurial > repos > triasteran > catdc_docker_test
changeset 9:81f36745bc9d draft
Uploaded
author | triasteran |
---|---|
date | Tue, 08 Mar 2022 11:43:08 +0000 |
parents | 7e70e7a0f4bb |
children | b75227b2833e |
files | smalt/Dockerfile smalt/README.md smalt/instructions.sh smalt/reads.fastq smalt/reference.fasta smalt/smalt_wrapper.xml |
diffstat | 6 files changed, 523 insertions(+), 0 deletions(-) |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/smalt/Dockerfile Tue Mar 08 11:43:08 2022 +0000
@@ -0,0 +1,26 @@
+# Galaxy - SMALT
+# From https://toolshed.g2.bx.psu.edu/repository/view_repository?sort=Repository.name&operation=view_or_manage_repository&changeset_revision=54855bd8d107&id=ec70d959cc6d865d
+
+FROM debian:wheezy
+
+MAINTAINER Aaron Petkau, aaron.petkau@gmail.com
+
+# make sure the package repository is up to date
+RUN DEBIAN_FRONTEND=noninteractive apt-get -qq update
+
+RUN DEBIAN_FRONTEND=noninteractive apt-get install -y python
+RUN DEBIAN_FRONTEND=noninteractive apt-get install -y wget
+RUN DEBIAN_FRONTEND=noninteractive apt-get install -y mercurial
+
+RUN mkdir /tmp/smalt
+WORKDIR /tmp/smalt
+
+RUN wget ftp://ftp.sanger.ac.uk/pub4/resources/software/smalt/smalt-0.7.3.tgz
+RUN tar -xvvzf smalt-0.7.3.tgz
+RUN cp smalt-0.7.3/smalt_x86_64 /usr/bin/smalt_unknown
+
+RUN hg clone https://toolshed.g2.bx.psu.edu/repos/cjav/smalt smalt_deps
+RUN cp smalt_deps/smalt_wrapper.py /usr/bin/smalt_wrapper.py
+RUN chmod a+x /usr/bin/smalt_wrapper.py
+
+RUN apt-get clean && rm -rf /tmp/smalt && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/smalt/README.md Tue Mar 08 11:43:08 2022 +0000 @@ -0,0 +1,162 @@ +Integration of SMALT into Galaxy and Docker +=========================================== + +This describes a way to construct a docker image and modify a tool within Galaxy to work with Docker. The tool I use is SMALT from https://toolshed.g2.bx.psu.edu/repository/find_tools?sort=Repository.name&operation=view_or_manage_repository&id=61631a13f8a13237. + +1. Building a Docker Image +========================== + +There are two different methods to construct a docker image for a tool: + +1. Interactively +2. Using a `Dockerfile` + +1. Building a Docker Image Interactively +---------------------------------------- + +The interactive method of constructing a docker image is a good method for installing all the dependencies for a tool if you're not quite sure exactly what commands you need to run to get the tool working. This method involves starting up a docker container, and running commands within this container to get your tool working. Then, you can **commit** this container to an image and save to dockerhub for re-use. + +### Step 1: Starting up a Docker base Image + +To build a docker image we need a starting point. The base image we will use is **debian/wheezy** since it contains `apt-get` for installing more software. To start up this image in an interactive mode please run: + +```bash +$ sudo docker run -i -t debian:wheezy +``` + +This should bring up a prompt that looks like: + +``` +root@2c1077a38bce:/# +``` + +Note that the id `2c1077a38bce` can be used later for saving this container to an image. + +### Step 2: Installing Basic Dependencies + +A few dependencies are required to install SMALT. 
These can be installed with: + +```bash +$ apt-get update +$ apt-get install python +$ apt-get install wget +$ apt-get install mercurial +``` + +### Step 3: Installing SMALT + +We will install SMALT by downloading the necessary software and copying to `/usr/bin`. This can be accomplished with: + +```bash +$ mkdir tool +$ cd tool + +$ wget ftp://ftp.sanger.ac.uk/pub4/resources/software/smalt/smalt-0.7.3.tgz +$ tar -xvvzf smalt-0.7.3.tgz + +# because the smalt_wrapper.py finds the binary name based on `uname -i` which is unknown in docker +$ mv smalt-0.7.3/smalt_x86_64 smalt_unknown + +$ hg clone https://toolshed.g2.bx.psu.edu/repos/cjav/smalt smalt_deps +$ cp smalt_deps/smalt_wrapper.py . + +# add smalt tools to /usr/bin so they're on the PATH +$ ln -s /tool/smalt_unknown /usr/bin +$ ln -s /tool/smalt_wrapper.py /usr/bin + +# make smalt_wrapper executable +$ chmod a+x /tool/smalt_wrapper.py +``` + +You can test out SMALT by trying to run `smalt_unknown` and `smalt_wrapper.py` in the docker container. + +### Step 4: Building an Image + +To build the docker image from the container, please exit the container and run the following: + +```bash +$ sudo docker commit -m "make smalt image" -a "Aaron Petkau" 2c1077a38bce apetkau/smalt-galaxy +``` + +Please fill in the appropriate information for your image. In particular, make sure the container id `2c1077a38bce` is correct. + +To push this image to dockerhub you can run: + +```bash +$ sudo docker push apetkau/smalt-galaxy +``` + +2. Building an image using a `Dockerfile` +----------------------------------------- + +Alternatively, instead of building an image interactively, you can build an image with a `Dockerfile`. An example Dockerfile can be found in this repository at [Dockerfile](Dockerfile). To build an image please run: + +```bash +$ sudo docker build -t apetkau/smalt-galaxy . +``` + +More information on Dockerfiles can be found at https://docs.docker.com/reference/builder/. + +2. 
Installing Tool configuration +================================ + +### Installing Example file + +Once a Docker image has been built, it can be integrated into a tool by modifying the tool configuration XML file. For SMALT, the configuration file is [smalt_wrapper.xml](smalt_wrapper.xml). It is based on https://toolshed.g2.bx.psu.edu/repos/cjav/smalt/file/54855bd8d107/smalt_wrapper.xml and can be installed by running: + +```bash +$ cp smalt_wrapper.xml galaxy-central/tools/docker/smalt_wrapper.xml +``` + +Then, please add this tool to the `tool_conf.xml` by adding: + +```xml + <tool file="docker/smalt_wrapper.xml"/> +``` + +### List of Changes + +The exact changes I needed to make are: + +1. I added the specific docker image name to the requirements by changing: + + ```xml + <requirements> + <requirement type="package" version="0.7.3">smalt</requirement> + </requirements> + ``` + + To + + ```xml + <requirements> + <container type="docker">apetkau/smalt-galaxy</container> + </requirements> + ``` + +2. I had to remove the `interpreter` attribute from the command element by changing: + + ```xml + <command interpreter="python"> + smalt_wrapper.py + ``` + + To + + ```xml + <command> + smalt_wrapper.py + ``` + +3. Running Galaxy +================= + +Once the tool is installed, please run Galaxy and test out the tool. Some example files (reference.fasta and reads.fastq) are included.
To make sure it's running in docker you can look for the following `sudo docker run` in the logs: + +``` +galaxy.jobs.runners DEBUG 2014-06-28 16:50:00,930 (18) command is: sudo docker run -e "GALAXY_SLOTS=$GALAXY_SLOTS" -v /home/aaron/Projects/galaxy-central:/home/aaron/Proj +ects/galaxy-central:ro -v /home/aaron/Projects/galaxy-central/tools/docker:/home/aaron/Projects/galaxy-central/tools/docker:ro -v /home/aaron/Projects/galaxy-central/data +base/job_working_directory/000/18:/home/aaron/Projects/galaxy-central/database/job_working_directory/000/18:rw -v /home/aaron/Projects/galaxy-central/database/files:/home +/aaron/Projects/galaxy-central/database/files:rw -w /home/aaron/Projects/galaxy-central/database/job_working_directory/000/18 --net none apetkau/smalt:v3 /home/aaron/Proj +ects/galaxy-central/database/job_working_directory/000/18/container.sh +```
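A quick way to confirm that a job went through Docker is to scan the Galaxy log for the `docker run` line shown above. The following is a minimal Python sketch and not part of Galaxy: the helper name `docker_job_lines`, the regular expression, and the second sample line are assumptions made for illustration, and the first sample line is abbreviated from the log excerpt above.

```python
import re

# Hypothetical pattern: matches the "command is: ... docker run" debug line,
# with or without a leading "sudo".
DOCKER_RUN = re.compile(r"command is: (?:sudo )?docker run\b")


def docker_job_lines(log_lines):
    """Return the log lines that record a job launched via `docker run`."""
    return [line for line in log_lines if DOCKER_RUN.search(line)]


# First line: abbreviated from the README excerpt. Second line: invented
# counter-example of a job that did not use Docker.
sample = [
    'galaxy.jobs.runners DEBUG 2014-06-28 16:50:00,930 (18) command is: '
    'sudo docker run -e "GALAXY_SLOTS=$GALAXY_SLOTS" --net none '
    'apetkau/smalt:v3 container.sh',
    "galaxy.jobs.runners DEBUG (19) command is: python smalt_wrapper.py",
]

for line in docker_job_lines(sample):
    print(line)
```

If the filter returns nothing for a SMALT job, the tool is most likely still running outside the container.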
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/smalt/instructions.sh Tue Mar 08 11:43:08 2022 +0000
@@ -0,0 +1,34 @@
+# Run docker image 'debian:wheezy' in interactive mode
+sudo docker run -i -t debian:wheezy
+
+# Run the below in this image
+
+apt-get update
+apt-get install python
+apt-get install wget
+apt-get install mercurial
+
+mkdir tool
+cd tool
+
+wget ftp://ftp.sanger.ac.uk/pub4/resources/software/smalt/smalt-0.7.3.tgz
+tar -xvvzf smalt-0.7.3.tgz
+
+# because the smalt_wrapper.py finds the binary name based on `uname -i` which is unknown in docker
+mv smalt-0.7.3/smalt_x86_64 smalt_unknown
+
+hg clone https://toolshed.g2.bx.psu.edu/repos/cjav/smalt smalt_deps
+cp smalt_deps/smalt_wrapper.py .
+
+# add smalt tools to PATH (probably different ways to do this)
+ln -s /tool/smalt_unknown /usr/bin
+ln -s /tool/smalt_wrapper.py /usr/bin
+
+# make smalt_wrapper executable
+chmod a+x /tool/smalt_wrapper.py
+
+# exit the container and run the below to commit it to a new image. replace the number '07b...' with the container id of the above docker container.
+sudo docker commit -m "make smalt_wrapper executable" -a "Aaron Petkau" 07b937918961 apetkau/smalt:v3
+
+# push to dockerhub
+# please see instructions at http://docs.docker.com/userguide/dockerimages/#push-an-image-to-docker-hub
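The rename to `smalt_unknown` above exists because, according to the script comment, `smalt_wrapper.py` derives the binary name from `uname -i`. The wrapper source is not part of this changeset, so the following Python sketch only illustrates how such a lookup could behave; `detect_platform` and `smalt_binary_name` are hypothetical names, not the wrapper's actual functions.

```python
import subprocess


def detect_platform():
    """Return the output of `uname -i`, or "unknown" if it is unavailable."""
    try:
        result = subprocess.run(
            ["uname", "-i"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # `uname` is missing or does not support -i on this system.
        return "unknown"
    return result.stdout.strip() or "unknown"


def smalt_binary_name(platform):
    """Build the executable name the way the comment describes (assumed)."""
    return "smalt_" + platform.strip()


print(smalt_binary_name(detect_platform()))
```

Under this assumption, a container where `uname -i` prints `unknown` would look for `smalt_unknown`, which is why the binary is installed under that name.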
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/smalt/reads.fastq Tue Mar 08 11:43:08 2022 +0000
@@ -0,0 +1,8 @@
+@1
+AAAA
++
+IIII
+@2
+AAAA
++
+IIII
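The test reads above follow the four-line FASTQ layout (header, sequence, `+` separator, qualities). As a sanity check before feeding them to the tool, the sketch below parses that layout in Python and decodes the qualities, assuming Sanger (Phred+33) encoding as implied by the `fastqsanger` input type. `parse_fastq` and `phred_scores` are illustrative names, not part of SMALT or Galaxy.

```python
def parse_fastq(text):
    """Parse four-line FASTQ records into (name, sequence, quality) tuples."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], seq, qual))
    return records


def phred_scores(qual):
    """Decode a Sanger (Phred+33) quality string into integer scores."""
    return [ord(char) - 33 for char in qual]


# Contents of smalt/reads.fastq from this changeset.
READS = "@1\nAAAA\n+\nIIII\n@2\nAAAA\n+\nIIII\n"

for name, seq, qual in parse_fastq(READS):
    print(name, seq, phred_scores(qual))
```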
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/smalt/reference.fasta Tue Mar 08 11:43:08 2022 +0000
@@ -0,0 +1,2 @@
+>a
+AAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/smalt/smalt_wrapper.xml Tue Mar 08 11:43:08 2022 +0000 @@ -0,0 +1,291 @@ +<tool id="smalt_wrapper (docker)" name="SMALT" version="0.0.3"> + <requirements> + <container type="docker">apetkau/smalt-galaxy</container> + </requirements> + <description>maps query reads onto the reference sequences</description> + <command> + smalt_wrapper.py + --threads="4" + + ## reference source + --fileSource=$genomeSource.refGenomeSource + #if $genomeSource.refGenomeSource == "history": + ##build index on the fly + --ref="${genomeSource.ownFile}" + --dbkey=$dbkey + #else: + ##use precomputed indexes + --ref="${genomeSource.indices.fields.path}" + --do_not_build_index + #end if + + ## input file(s) + --input1=$paired.input1 + #if $paired.sPaired == "paired": + --input2=$paired.input2 + #end if + + ## output file + --output=$output + + ## run parameters + --genAlignType=$paired.sPaired + --params=$params.source_select + #if $params.source_select != "pre_set": + --scorDiff=$params.scorDiff + #if $paired.sPaired == "paired": + --insertMax=$params.insertMax + --insertMin=$params.insertMin + --pairTyp=$params.pairTyp + #end if + --minScor=$params.minScor + --partialAlignments=$params.partialAlignments + --minBasq=$params.minBasq + --seed=$params.seed + --complexityWeighted=$params.complexityWeighted + --exhaustiveSearch=$params.cExhaustiveSearch.exhaustiveSearch + #if $params.cExhaustiveSearch.exhaustiveSearch == "true" + --minCover=$params.cExhaustiveSearch.minCover + #end if + --minId=$params.minId + #end if + + ## suppress output SAM header + --suppressHeader=$suppressHeader + </command> + <inputs> + <conditional name="genomeSource"> + <param name="refGenomeSource" type="select" label="Will you select a reference genome from your history or use a built-in index?"> + <option value="indexed">Use a built-in index</option> + <option value="history">Use one from the history</option> + </param> + <when value="indexed"> + <param 
name="indices" type="select" label="Select a reference genome"> + <options from_data_table="smalt_indexes"> + <filter type="sort_by" column="2" /> + <validator type="no_options" message="No indexes are available" /> + </options> + </param> + </when> + <when value="history"> + <param name="ownFile" type="data" format="fasta" metadata_name="dbkey" label="Select a reference from history" /> + </when> + </conditional> + <conditional name="paired"> + <param name="sPaired" type="select" label="Is this library mate-paired?"> + <option value="single">Single-end</option> + <option value="paired">Paired-end</option> + </param> + <when value="single"> + <param name="input1" type="data" format="fastqsanger" label="FASTQ file" help="FASTQ with Sanger-scaled quality values (fastqsanger)" /> + </when> + <when value="paired"> + <param name="input1" type="data" format="fastqsanger" label="Forward FASTQ file" help="FASTQ with Sanger-scaled quality values (fastqsanger)" /> + <param name="input2" type="data" format="fastqsanger" label="Reverse FASTQ file" help="FASTQ with Sanger-scaled quality values (fastqsanger)" /> + </when> + </conditional> + <conditional name="params"> + <param name="source_select" type="select" label="Smalt settings to use" help="For most mapping needs use Commonly Used settings. If you want full control use Full Parameter List"> + <option value="pre_set">Commonly Used</option> + <option value="full">Full Parameter List</option> + </param> + <when value="pre_set" /> + <when value="full"> + <conditional name="cExhaustiveSearch"> + <param name="exhaustiveSearch" type="boolean" truevalue="true" falsevalue="false" checked="no" label="Do exhaustive search? (map -x)" help="This flag triggers a more exhaustive search for alignments at the cost of decreased speed." /> + <when value="true"> + <param name="minCover" type="float" value="0" label="Minimum cover (map -c)" help="Only consider mappings where the k-mer word seeds cover the query read to a minimum extent." 
/> + </when> + <when value="no" /> + </conditional> + <param name="scorDiff" type="integer" value="0" label="Score diff (map -d)" help="Set a threshold of the Smith-Waterman alignment score relative to the maximum score." /> + <param name="insertMax" type="integer" value="500" label="Maximum insert size (map -i)" help="Only in paired-end mode." /> + <param name="insertMin" type="integer" value="0" label="Minimum insert size (map -j)" help="Only in paired-end mode." /> + <param name="pairTyp" type="text" size="2" value="pe" label="Type of read pair library (map -l)" help="Can be either 'pe', 'mp' or 'pp'." /> + <param name="minScor" type="integer" value="0" label="Minimum score (map -m)" help="Sets an absolute threshold of the Smith-Waterman scores." /> + <param name="partialAlignments" type="boolean" truevalue="true" falsevalue="false" checked="no" label="Partial alignments (map -p)" help="Report partial alignments if they are complementary on the read (split reads)." /> + <param name="minBasq" type="integer" value="0" label="Base quality threshold (map -q)" help="Sets a base quality threshold (0 <= minbasq <= 10, default 0)." /> + <param name="seed" type="integer" value="0" label="Seed (map -r)" help="See below." /> + <param name="complexityWeighted" type="boolean" truevalue="true" falsevalue="false" checked="no" label="Complexity weighted (map -w)" help="Smith-Waterman scores are complexity weighted." /> + <param name="minId" type="float" value="0" label="Identity threshold (map -y)" help="Sets an identity threshold for a mapping to be reported." 
/> + </when> + </conditional> + <param name="suppressHeader" type="boolean" truevalue="true" falsevalue="false" checked="False" label="Suppress the header in the output SAM file" help="Smalt produces SAM with several lines of header information" /> + </inputs> + <outputs> + <data format="sam" name="output" label="${tool.name} on ${on_string}: mapped reads"> + <actions> + <conditional name="genomeSource.refGenomeSource"> + <when value="indexed"> + <action type="metadata" name="dbkey"> + <option type="from_data_table" name="smalt_indexes" column="1"> + <filter type="param_value" column="0" value="#" compare="startswith" keep="False"/> + <filter type="param_value" ref="genomeSource.indices" column="0"/> + </option> + </action> + </when> + <when value="history"> + <action type="metadata" name="dbkey"> + <option type="from_param" name="genomeSource.ownFile" param_attribute="dbkey" /> + </action> + </when> + </conditional> + </actions> + </data> + </outputs> + <help> + +**What it does** + +SMALT is a pairwise sequence alignment program for the efficient mapping of DNA sequencing reads onto genomic reference sequences. It uses a combination of short-word hashing and dynamic programming. Most types of sequencing platforms are supported including paired-end sequencing reads. + +------ + +Please cite the website "http://www.sanger.ac.uk/resources/software/smalt/". + +------ + +**Know what you are doing** + +.. class:: warningmark + +There is no such thing (yet) as an automated gearshift in short read mapping. It is all like stick-shift driving in San Francisco. In other words, running this tool with default parameters will probably not give you meaningful results. A way to deal with this is to **understand** the parameters by carefully reading the `documentation`__ and experimenting. Fortunately, Galaxy makes experimenting easy. + + .. 
__: http://www.sanger.ac.uk/resources/software/smalt/ + +------ + +**Input formats** + +SMALT accepts files in Sanger FASTQ format (galaxy type *fastqsanger*). Use the FASTQ Groomer to prepare your files. + +------ + +**A Note on Built-in Reference Genomes** + +The default variant for all genomes is "Full", defined as all primary chromosomes (or scaffolds/contigs) including mitochondrial plus associated unmapped, plasmid, and other segments. When only one version of a genome is available in this tool, it represents the default "Full" variant. Some genomes will have more than one variant available. The "Canonical Male" or sometimes simply "Canonical" variant contains the primary chromosomes for a genome. For example, a human "Canonical" variant contains chr1-chr22, chrX, chrY, and chrM. The "Canonical Female" variant contains the primary chromosomes excluding chrY. + +------ + +**Outputs** + +The output is in SAM format. + +------- + +**SMALT parameter list** + +This is an exhaustive list of SMALT options: + +For **map**:: + + -a + Output explicit alignments along with the mappings. + + -c <mincover> + Only consider mappings where the k-mer word seeds cover the query read to + a minimum extent. If <mincover> is an integer or floating point > 1.0, at + least this many bases of the read must be covered by k-mer word seeds. If + <mincover> is a floating point <= 1.0, it specifies the fraction of the + query read length that must be covered by k-mer word seeds. This option + is only valid in conjunction with the '-x' flag. + + -d <scordiff> + Set a threshold of the Smith-Waterman alignment score relative to the + maximum score. When mapping single reads, all alignments are reported + that have Smith-Waterman scores within <scordiff> of the maximum. + Mappings with lower scores are skipped. If <scordiff> is set to a + value < 0, all alignments are printed that have scores above the + threshold specified with the '-m <minscor>' option. 
+ For paired reads, only a value of 0 is supported. With the option '-d 0' + all alignments (pairings) with the best score are output. By default + (without the option '-d 0') single reads/mates with multiple best mappings + are reported as 'not mapped'. + + -f <format> + Specifies the output format. <format> can be either 'bam', 'cigar', 'gff', + 'sam' (default), 'samsoft' or 'ssaha'. Optional extension 'sam:nohead,clip' + (see manual) + + -F <inform> + Specifies the input format. <inform> can be either 'fastq' (default), + 'sam' or 'bam' (see: samtools.sourceforge.net). SAM and BAM formats + require additional libraries to be installed. + + -g <insfil> + Use the distribution of insert sizes stored in the file <insfil>. This + file is in ASCII format and can be generated using the 'sample' task (see + 'smalt sample -H' for help). + + -H + Print these instructions. + + -i <insertmax> + Maximum insert size (only in paired-end mode). The default is 500. + + -j <insertmin> + Minimum insert size (only in paired-end mode). The default is 0. + + -l <pairtyp> + Type of read pair library. <pairtyp> can be either 'pe', i.e. for + the Illumina paired-end library for short inserts (|--> <--|), 'mp' + for the Illumina mate-pair library for long inserts (<--| |-->) or + 'pp' for mates sequenced on the same strand (|--> |-->). 'pe' is the + default. + + -m <minscor> + Sets an absolute threshold of the Smith-Waterman scores. Mappings with + scores below that threshold will not be reported. The default is + <minscor> = <wordlen> + <stepsiz> - 1 + + -n <nthreads> + Run smalt using multiple threads. <nthreads> is the number of additional + threads forked from the main thread. The order of the reads in the + input files is not preserved for the output unless '-O' is also specified. + + -o <oufilnam> + Write mapping output (e.g. SAM lines) to a separate file. If this option + is not specified, mappings are written to standard output together with + other messages. 
+ + -O + Output mappings in the order of the reads in the input files when using + multiple threads (option '-n <nthreads>'). + + -p + Report partial alignments if they are complementary on the read (split + reads). + + -q <minbasq> + Sets a base quality threshold (0 <= minbasq <= 10, default 0). + K-mer words of the read with nucleotides that have a base quality below + this threshold are not looked up in the hash index. + + -r <seed> + If <seed> >= 0 report an alignment selected at random where there are + multiple mappings with the same best alignment score. With <seed> = 0 + (default) a seed is derived from the current calendar time. If <seed> + < 0 reads with multiple best mappings are reported as 'not mapped'. + + -T <tmp_dir> + Write temporary files to directory <tmp_dir> (used with input files in + SAM/BAM format). + + -w + Smith-Waterman scores are complexity weighted. + + -x + This flag triggers a more exhaustive search for alignments at the cost + of decreased speed. In paired-end mode each mate is mapped independently. + (By default the mate with fewer hits in the hash index is mapped first + and the vicinity is searched for mappings of its mate.) + + -y <minid> + Sets an identity threshold for a mapping to be reported (default: 0). + <minid> specifies the number of exactly matching nucleotides either as + a positive integer or as a fraction of the read length (<= 1.0). + + </help> +</tool> + +
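To check the two changes that make the tool Docker-aware (a `<container type="docker">` requirement and a `<command>` element without `interpreter`), the following sketch parses a reduced fragment of the tool file with Python's standard `xml.etree.ElementTree`. The fragment and the helper names `docker_images` and `uses_interpreter` are illustrative assumptions. As displayed in this diff view, the full file contains unescaped `<` characters in attribute values and help text, so it would need those escaped before a strict XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical fragment of smalt_wrapper.xml: only the parts
# relevant to the Docker integration are kept.
TOOL_FRAGMENT = """
<tool id="smalt_wrapper (docker)" name="SMALT" version="0.0.3">
  <requirements>
    <container type="docker">apetkau/smalt-galaxy</container>
  </requirements>
  <command>smalt_wrapper.py</command>
</tool>
"""


def docker_images(tool_xml):
    """Return the Docker image names listed in the tool's requirements."""
    root = ET.fromstring(tool_xml)
    return [
        container.text.strip()
        for container in root.findall("./requirements/container")
        if container.get("type") == "docker" and container.text
    ]


def uses_interpreter(tool_xml):
    """Return True if the <command> element still sets `interpreter`."""
    command = ET.fromstring(tool_xml).find("command")
    return command is not None and command.get("interpreter") is not None


print(docker_images(TOOL_FRAGMENT))
print(uses_interpreter(TOOL_FRAGMENT))
```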