# HG changeset patch # User crs4 # Date 1413380470 14400 # Node ID 244073d9abc1ce5d04343edf6bf888e4855a8f01 Uploaded diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/README.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/README.md Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,73 @@ + +Galaxy wrapper for the Seal toolkit +==================================== + +These are the Galaxy wrappers for the Seal toolkit for Hadoop-based processing +of sequencing data (http://biodoop-seal.sf.net). + + +Installation +------------------- + +You can install the Seal-Galaxy wrappers through the Galaxy toolshed or like +any other Galaxy tool. The installation process will try to fetch and build +Seal and some of its dependencies. However, you'll need to make sure that +the build process can find any required headers, libraries and executables, +such as: + +* javac +* protobuf +* maven +* ant +* zlib +* git +* hadoop + +For details on Seal's installation process refer directly to [its +documentation](http://biodoop-seal.sourceforge.net/installation.html). + +Hadoop-Galaxy integration +---------------------------- + +These wrappers use the [Hadoop-Galaxy](https://github.com/crs4/hadoop-galaxy) +tool to implement the integration between Hadoop and Galaxy. You should have a +look at its documentation. + +An important issue +----------------------- + +An implication of the integration provided by Hadoop-Galaxy is that Galaxy +knows nothing about your actual data. Because of this, removing the Galaxy +datasets does not delete the files produced by your Hadoop runs, potentially +resulting in the waste of a lot of space. Also, be careful with situations +where you may end up with multiple pathsets pointing to the same data, or where +they point to data that you want to access from Hadoop but would not want to +delete (e.g., your run directories). + +Have a look at the Hadoop-Galaxy README for more details. 
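The warning above is easier to see with a toy sketch of how a pathset-backed dataset behaves: the Galaxy dataset is only a small file of references, so deleting it leaves the Hadoop output untouched. This is an illustration only — the "one data path per line" layout below is an invented simplification, not the actual Hadoop-Galaxy pathset format.

```python
# Toy illustration of the pathset indirection described above.  The real
# Hadoop-Galaxy pathset format is richer than this; the one-path-per-line
# layout used here is an assumption made purely for demonstration.
import os
import tempfile

def write_pathset(pathset_file, data_paths):
    # The Galaxy dataset stores only these references, never the data itself.
    with open(pathset_file, 'w') as f:
        for p in data_paths:
            f.write(p + '\n')

def read_pathset(pathset_file):
    with open(pathset_file) as f:
        return [line.strip() for line in f if line.strip()]

workdir = tempfile.mkdtemp()

# "Hadoop output": the bulky data living outside Galaxy's control.
data_file = os.path.join(workdir, 'part-00000')
with open(data_file, 'w') as f:
    f.write('actual job output\n')

# The Galaxy dataset is just a pathset pointing at it.
dataset = os.path.join(workdir, 'galaxy_dataset.pathset')
write_pathset(dataset, [data_file])
assert read_pathset(dataset) == [data_file]

# Deleting the Galaxy dataset removes the pointer, not the data.
os.remove(dataset)
assert os.path.exists(data_file)
```

This is exactly why removing a Galaxy dataset can leave orphaned Hadoop output behind.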
+ + +Authors +------------- + +Luca Pireddu + + +Support +------------- + +No support is provided. + + + +License +-------------- + +This code is released under the GPLv3. + + + +Copyright +-------------- + +Copyright CRS4, 2011-2014. diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/make_release.sh --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/make_release.sh Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,93 @@ +#!/bin/bash + +#set -x +set -o errexit +set -o nounset +set -o pipefail + +PackageName="seal-galaxy" + + +function error() { + if [ $# -ge 1 ]; then + echo $* >&2 + fi + exit 1 +} + +function usage_error() { + echo "Usage: $0 version" >&2 + echo "Specify version as a git revid (id or tag) for the Seal repository, and " >&2 + echo "optionally a '-n' suffix for the wrapper version; e.g., 0.4.1, 0.4.1-1, 0.4.1-2" >&2 + error +} + +function confirm() { + local prompt="${1}" + echo "${prompt} [Y/n]" + read -p "Answer: " yn + case "${yn}" in + ''|[Yy]) # do nothing and keep going + ;; + [Nn]) echo "Aborting"; exit 0 + ;; + *) usage_error "Unrecognized answer. Please specify Y or n" + ;; + esac + return 0 +} + +function rewrite_seal_version() { + local grep_expr='' + if ! grep "${grep_expr}" tool_dependencies.xml >/dev/null ; then + error "Couldn't find expected package line in tool_dependencies.xml" + fi + + printf -v sed_expr1 '//s/git reset --hard \([^<]\+\)\s*/git reset --hard %s/' "${seal_version}" + sed -i -e "${sed_expr1}" -e "${sed_expr2}" tool_dependencies.xml + echo "Edited tool_dependencies.xml" >&2 + + # edit the tools as well + printf -v sed_expr3 '/\s*seal\s*&2 +} + +############# main ############### + +if [ $# -eq 1 ]; then + wrapper_version="${1}" +else + usage_error +fi + +echo "Will rewrite tool_dependencies.xml setting the package version to '${wrapper_version}'." +confirm "Are you sure you want to proceed? 
[Y/n]" + +# ensure the tag doesn't already exist +if git tag -l | grep -w "${wrapper_version}" ; then + error "A release tag called '${wrapper_version}' already exists" +fi + +# remove the wrapper suffix, if it's there +seal_version=$(echo ${wrapper_version} | sed -e 's/-[^-]\+$//') +echo "Using seal version ${seal_version}" + +rewrite_seal_version "${seal_version}" + +git commit -a --allow-empty -m "Wrappers release for Seal '${seal_version}'" +git tag "${wrapper_version}" + +revid=$(git rev-parse HEAD) + +echo "Tagged new commit ${revid} with tag '${wrapper_version}'" + +short_revid=${revid::8} +archive_name=${PackageName}-${short_revid}.tar.gz + +git archive --format tar.gz --prefix ${PackageName}-${short_revid}/ HEAD -o "${archive_name}" + +echo "Don't forget to upload the archive to the toolshed!" diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/bcl2qseq.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/bcl2qseq.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,112 @@ + + + + + + Convert Illumina bcl files to qseq on Hadoop + + seal + pydoop + hadoop-galaxy + + + + hadoop_galaxy + --executable seal + --input $input_data + --output $output1 + bcl2qseq + #if $advanced.control == 'show' + #if $advanced.bcl2qseq_bin: + --bclToQseq-path $advanced.bcl2qseq_bin + #end if + + #if $advanced.additional_ld_path + --append-ld-library-path $advanced.additional_ld_path + #end if + + #if $advanced.ignore_missing_bcl + --ignore-missing-bcl + #end if + + #if $advanced.ignore_missing_control + --ignore-missing-control + #end if + + #if $advanced.exclude_controls + --exclude-controls + #end if + + #if $advanced.no_eamss + --no-eamss + #end if + #end if + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + This is a Pydoop-based distributed version of Illumina's bclToQseq tool. 
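The release script above derives the Seal version from the wrapper version by stripping the optional `-n` suffix with sed. The same transformation, sketched in Python for clarity (the function name is ours):

```python
import re

def seal_version_from_wrapper(wrapper_version):
    # Equivalent to the script's sed -e 's/-[^-]\+$//': drop a trailing
    # '-<suffix>' (the wrapper release number) if one is present.
    return re.sub(r'-[^-]+$', '', wrapper_version)

assert seal_version_from_wrapper('0.4.1-2') == '0.4.1'
assert seal_version_from_wrapper('0.4.1') == '0.4.1'  # no suffix: unchanged
```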
+ + diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/demux.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/demux.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,118 @@ + + + + + + Demultiplex Illumina runs on Hadoop + + seal + pydoop + hadoop-galaxy + + + + demux_galaxy.py + $input_data + $mismatches + $__new_file_path__ + #if $num_reducers + $num_reducers + #else + null + #end if + $output1 + $output1.id + $sample_sheet + $input_format + $output_format + $output_compression + #if $index.specify_index == 'present' + true + #else if $index.specify_index == 'not_present' + false + #else if $index.specify_index == 'dynamic' + $index_present + #else + #raise ValueError('Invalid index value!') + #end if + $separate_reads + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Demux is a Hadoop utility to demultiplex data from multiplexed Illumina runs. + + diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/demux_galaxy.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/demux_galaxy.py Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,141 @@ +#!/usr/bin/env python + +# Copyright (C) 2011-2014 CRS4. +# +# This file is part of Seal. +# +# Seal is free software: you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation, either version 3 of the License, or (at your option) +# any later version. +# +# Seal is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +# for more details. +# +# You should have received a copy of the GNU General Public License along +# with Seal. If not, see . + + + +""" +Calls the Seal Demux tool. 
Then, it calls the custom galaxy integration script +split_demux_output.py to generate one Galaxy dataset for each sample extracted +by Demux. +""" + +# parameters: +# INPUT_DATA +# MISMATCHES +# NEW_FILE_PATH +# NUM_REDUCERS +# OUTPUT1 +# OUTPUT_ID +# SAMPLE_SHEET +# INPUT_FORMAT +# OUTPUT_FORMAT +# OUTPUT_COMPRESSION +# INDEX_PRESENT +# SEPARATE_READS + +import os +import re +import subprocess +import sys + +# XXX: add --append-python-path to the possible arguments? + +def parse_indexed(s): + if s is not None: + normalized = s.lower().strip() + if normalized == 'notindexed': + return False + elif normalized == 'indexed': + return True + return None # failed to parse + +def parse_index_present(param): + is_indexed = parse_indexed(param) + if is_indexed is None: + # try to read it as a file + if os.path.isfile(param): + with open(param) as f: + contents = f.readline(1000) + uri, value = contents.split("\t", 1) + is_indexed = parse_indexed(value) + if is_indexed is None: + raise RuntimeError("Error determining whether run has an index read. 
" + \ + "Couldn't parse the dataset that was supposed to specify it (first 1000 chars): %s" % contents) + return is_indexed + +def usage_error(msg=None): + print >> sys.stderr, "Usage error" + if msg: + print >> sys.stderr, msg + print >> sys.stderr, "Usage:", os.path.basename(sys.argv[0]),\ + "INPUT_DATA MISMATCHES NEW_FILE_PATH NUM_REDUCERS OUTPUT1 OUTPUT_ID SAMPLE_SHEET INPUT_FORMAT OUTPUT_FORMAT OUTPUT_COMPRESSION INDEX_PRESENT SEPARATE_READS" + sys.exit(1) + + +if __name__ == "__main__": + if len(sys.argv) != 13: + usage_error() + + input_data = sys.argv[1] + mismatches = sys.argv[2] + new_file_path = sys.argv[3] + num_reducers = sys.argv[4] + output1 = sys.argv[5] + output_id = sys.argv[6] + sample_sheet = sys.argv[7] + input_format = sys.argv[8] + output_format = sys.argv[9] + output_compression = sys.argv[10] + index_present = sys.argv[11] + separate_reads = sys.argv[12] + + mydir = os.path.abspath(os.path.dirname(__file__)) + + # Run the demux program + cmd = [ + 'hadoop_galaxy', + '--input', input_data, + '--input-format', input_format, # --input-format for hadoop-galaxy + '--output', output1, + '--executable', 'seal', + 'demux', + '--sample-sheet', sample_sheet, + '--input-format', input_format, # --input-format for seal demux + '--output-format', output_format + ] + if re.match(r'\s*\d+\s*', num_reducers): + cmd.extend( ('--num-reducers', num_reducers) ) + + if output_compression.lower() != 'none': + cmd.extend( ('--compress-output', output_compression) ) + + if mismatches != '0': + cmd.extend( ('--mismatches', mismatches) ) + + is_indexed = parse_index_present(index_present) + if is_indexed is False: + cmd.append("--no-index") + + norm_separate_reads = separate_reads.lower().strip() + if norm_separate_reads == 'separate-reads': + cmd.append("--separate-reads") + elif norm_separate_reads.startswith('f'): + pass + else: + raise RuntimeError("Unrecognized value for separate-reads parameter: '%s'" % separate_reads) + + print >> sys.stderr, ' '.join(cmd) + 
subprocess.check_call(cmd) + + ### + # now the second phase: split_demux_output.py + cmd = [ + os.path.join(mydir, 'split_demux_output.py'), + output_id, output1, new_file_path ] + print >> sys.stderr, ' '.join(cmd) + subprocess.check_call(cmd) diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/generate_sam_header.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/generate_sam_header.py Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,41 @@ +#!/usr/bin/env python + +# Copyright (C) 2011-2014 CRS4. +# +# This file is part of Seal. +# +# Seal is free software: you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation, either version 3 of the License, or (at your option) +# any later version. +# +# Seal is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +# for more details. +# +# You should have received a copy of the GNU General Public License along +# with Seal. If not, see . + + + +# A really really thin wrapper. We only seem to need it because Galaxy won't +# search for the command in the PATH + +import os +import subprocess +import sys + +if __name__ == '__main__': + output_path = sys.argv[-1] + try: + # seal merge_alignments won't overwrite an existing file, so we first remove + # the file Galaxy creates for us. 
+ os.remove(output_path) + except OSError: + pass + hadoopized_output_path = 'file://' + os.path.abspath(output_path) + cmd = [ 'seal', 'merge_alignments' ] + sys.argv[1:-1] + cmd.append(hadoopized_output_path) + print "running command:", str(cmd) + subprocess.check_call(cmd) diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/generate_sam_header.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/generate_sam_header.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,100 @@ + + + + + + Generate a SAM header for the given reference + + seal + pydoop + hadoop-galaxy + + + + #set $ref_path = 'file://' + $reference.fields.path if $reference.fields.path.startswith('/') else $reference.fields.path + generate_sam_header.py + --header-only + --annotations ${ref_path}.ann + --sort-order $sort_order + + #if $compute_md5: + --md5 + #end if + + #if $assembly: + --sq-assembly "$assembly" + #end if + + #if $rg.set_rg == 'true': + --rg_cn "$rg.rg_cn" + --rg_dt "$rg.rg_dt" + --rg_id "$rg.rg_id" + --rg_lb "$rg.rg_lb" + --rg_pl "$rg.rg_pl" + --rg_pu "$rg.rg_pu" + --rg_sm "$rg.rg_sm" + #end if + + ${output} + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Generate SAM Header writes the SAM header for the selected reference, by running Seal's merge_alignments tool with its --header-only option. + + + diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/merge_alignments.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/merge_alignments.py Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,73 @@ +#!/usr/bin/env python + +# Copyright (C) 2011-2014 CRS4. +# +# This file is part of Seal. +# +# Seal is free software: you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation, either version 3 of the License, or (at your option) +# any later version. 
+# +# Seal is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +# for more details. +# +# You should have received a copy of the GNU General Public License along +# with Seal. If not, see . + + + +import os +import subprocess +import sys +import tempfile + +import hadoop_galaxy.pathset as pathset +import hadoop_galaxy.cat_paths as cat_paths + +def usage_error(msg=None): + if msg: + print >> sys.stderr, msg + print >> sys.stderr, os.path.basename(__file__), "INPUT_PATHSET OUTPUT [args...]" + sys.exit(1) + +def main(args): + if len(args) < 2: + usage_error() + + # We generate the header with seal_merge_alignments, insert it at the + # top of a copy of the input pathset, and then use cat_parts to + # join everything into a single file. + + input_pathset, output_path = map(os.path.abspath, args[0:2]) + + with tempfile.NamedTemporaryFile() as header_file: + print "generating header" + gen_header_cmd = [ 'seal', 'merge_alignments', '--header-only' ] + gen_header_cmd.extend(args[2:]) + header_text = subprocess.check_output(gen_header_cmd) + + header_file.write(header_text) + header_file.flush() + print "header ready" + print "generating new pathset" + + original_pathset = pathset.FilePathset.from_file(input_pathset) + new_pathset = pathset.FilePathset() + new_pathset.append(header_file.name) + for p in original_pathset: + new_pathset.append(p) + + with tempfile.NamedTemporaryFile() as temp_pathset: + new_pathset.write(temp_pathset) + temp_pathset.flush() + + print "concatenating pathset" + # TODO: Add ability to use dist_cat_paths + cat_paths.main([temp_pathset.name, output_path]) + print "operation complete" + +if __name__ == '__main__': + main(sys.argv[1:]) diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/merge_alignments.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ 
b/seal-galaxy-cc1b1911/seal/merge_alignments.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,100 @@ + + + + + + Merge a pathset of part-files of alignments into a single well-formatted SAM file + + seal + pydoop + hadoop-galaxy + + + #set $ref_path = 'file://' + $reference.fields.path if $reference.fields.path.startswith('/') else $reference.fields.path + merge_alignments.py + $input_data + $output + + --annotations ${ref_path}.ann + --sort-order $sort_order + + #if $compute_md5: + --md5 + #end if + + #if $assembly: + --sq-assembly "$assembly" + #end if + + #if $rg.set_rg == 'true': + --rg_cn "$rg.rg_cn" + --rg_dt "$rg.rg_dt" + --rg_id "$rg.rg_id" + --rg_lb "$rg.rg_lb" + --rg_pl "$rg.rg_pl" + --rg_pu "$rg.rg_pu" + --rg_sm "$rg.rg_sm" + #end if + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Merge Alignments concatenates the part-files of alignments produced by a Hadoop run +into a single, well-formatted SAM file with a proper header. 
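merge_alignments.py above works by writing the generated header to a temporary file, prepending that file to the input pathset, and concatenating everything in order. The sketch below imitates that strategy with plain local files standing in for a pathset and simple concatenation standing in for cat_paths; all names in it are ours, not Seal's.

```python
# Toy version of the merge strategy used by merge_alignments.py: write the
# header to its own file, put it first, then concatenate all paths in order.
import os
import tempfile

def merge_with_header(header_text, part_paths, output_path):
    with tempfile.NamedTemporaryFile('w', suffix='.header', delete=False) as hf:
        hf.write(header_text)
        header_path = hf.name
    try:
        with open(output_path, 'w') as out:
            # The header file comes first, then every part-file in order.
            for p in [header_path] + list(part_paths):
                with open(p) as f:
                    out.write(f.read())
    finally:
        os.remove(header_path)

workdir = tempfile.mkdtemp()
parts = []
for i, rec in enumerate(['read1\t...\n', 'read2\t...\n']):
    p = os.path.join(workdir, 'part-%05d' % i)
    with open(p, 'w') as f:
        f.write(rec)
    parts.append(p)

out = os.path.join(workdir, 'merged.sam')
merge_with_header('@HD\tVN:1.4\tSO:coordinate\n', parts, out)
with open(out) as f:
    merged = f.read()
assert merged.startswith('@HD')
```

The design choice worth noting: by turning the header into just another entry at the front of the pathset, the expensive concatenation step needs no special cases.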
+ + + diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/prq.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/prq.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,118 @@ + + + + + + Convert qseq or fastq files to prq on Hadoop + + seal + pydoop + hadoop-galaxy + + + hadoop_galaxy + --input $input_data + --input-format $input_format.type + --output $output1 + --executable seal + prq + --input-format $input_format.type + --num-reducers $num_reducers + -D hbam.qseq-input.base-quality-encoding=$input_format.bq_encoding + -D hbam.fastq-input.base-quality-encoding=$input_format.bq_encoding + + #if $bpr + -D seal.prq.min-bases-per-read=$bpr + #end if + #if $drop_failed + -D seal.prq.drop-failed-filter=$drop_failed + #end if + #if $warn_unpaired + -D seal.prq.warning-only-if-unpaired=$warn_unpaired + #end if + + + + + + + + + + + + + + + + + + + + + + + + + + +PairReadsQSeq (PRQ) is a Hadoop utility to convert Illumina qseq files into +prq file format. For the full help see the `manual <http://biodoop-seal.sourceforge.net/prq_index.html>`_. + + + diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/read_sort.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/read_sort.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,68 @@ + + + + + + Sort reads with Hadoop + + seal + pydoop + hadoop-galaxy + + + #set $ref_path = 'file://' + $reference.fields.path if $reference.fields.path.startswith('/') else $reference.fields.path + hadoop_galaxy + --input $input_data + --output $output + --executable seal + read_sort + --annotations ${ref_path}.ann + --num-reducers $num_reducers + + + + + + + + + + + + + + + + + + + + + +ReadSort is a Hadoop-based program for sorting reads by alignment position. +For the full help see the `manual <http://biodoop-seal.sourceforge.net/read_sort_index.html>`_. 
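A recurring pattern in these wrappers (prq above, recab_table below) is turning optional Galaxy parameters into Hadoop `-D property=value` arguments, emitting nothing when a parameter is unset. A sketch of that mapping — the helper function is ours; the property names are taken from prq.xml:

```python
def hadoop_property_args(props):
    # Turn {'seal.prq.min-bases-per-read': 30, ...} into
    # ['-D', 'seal.prq.min-bases-per-read=30', ...], skipping unset values,
    # just as the Cheetah #if blocks in prq.xml do.
    args = []
    for key, value in sorted(props.items()):
        if value is None or value == '':
            continue  # unset parameter: emit nothing for it
        args.extend(['-D', '%s=%s' % (key, value)])
    return args

args = hadoop_property_args({
    'seal.prq.min-bases-per-read': 30,
    'seal.prq.drop-failed-filter': 'true',
    'seal.prq.warning-only-if-unpaired': None,  # unset: omitted
})
assert args == ['-D', 'seal.prq.drop-failed-filter=true',
                '-D', 'seal.prq.min-bases-per-read=30']
```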
+ + + diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/recab_table.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/recab_table.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,122 @@ + + + + + + Calculate a base quality recalibration table on Hadoop. + + seal + pydoop + hadoop-galaxy + + + + recab_table_galaxy.py + $input_data + $output1 + + #if $dbsnp.db_source == "history": + $dbsnp.ownFile + #else: + ${dbsnp.built-inFile.fields.path} + #end if + + $num_reducers + + #if $default_rg: + -D seal.recab.rg-covariate.default-rg=$default_rg + #end if + + #if $smoothing: + -D seal.recab.smoothing=$smoothing + #end if + + #if $max_qscore: + -D seal.recab.max-qscore=$max_qscore + #end if + + + + + + + + + + + + + + + + +RecabTable is a Hadoop program to calculate a table of base qualities for all values of a given set of factors. It computes a result equivalent to the GATK CountCovariatesWalker. +For the full help see the `manual <http://biodoop-seal.sourceforge.net/recab_table_index.html>`_. + + diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/recab_table_galaxy.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/recab_table_galaxy.py Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,116 @@ +#!/usr/bin/env python + +# Copyright (C) 2011-2014 CRS4. +# +# This file is part of Seal. +# +# Seal is free software: you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation, either version 3 of the License, or (at your option) +# any later version. +# +# Seal is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +# for more details. +# +# You should have received a copy of the GNU General Public License along +# with Seal. If not, see . 
+ + + +""" +Calls the Seal RecabTable tool. Then, it calls recab_table_fetch to +concatenate all the partial tables and create a single csv file. +""" + + +# parameters: +# INPUT_DATA +# OUTPUT +# VCF +# NUM_REDUCERS +# [OTHER] + +import os +import sys + +import hadoop_galaxy.pathset as pathset +import subprocess +import tempfile +import pydoop.hdfs as phdfs + +# XXX: add --append-python-path to the possible arguments? + +def usage_error(msg=None): + if msg: + print >> sys.stderr, msg + print >> sys.stderr, os.path.basename(sys.argv[0]), "INPUT_DATA OUTPUT VCF NUM_REDUCERS [OTHER]" + sys.exit(1) + + +def run_recab(input_path, output_path, vcf, num_red, other_args): + mydir = os.path.abspath(os.path.dirname(__file__)) + cmd = [ + 'hadoop_galaxy', + '--input', input_path, + '--output', output_path, + '--executable', 'seal', + 'recab_table', + '--vcf-file', vcf, + '--num-reducers', num_red + ] + + if other_args: + cmd.extend(other_args) + + # now execute the hadoop job + subprocess.check_call(cmd) + +def collect_table(pset, output_path): + # finally, fetch the result into the final output file + cmd = ['seal', 'recab_table_fetch'] + cmd.extend(pset.get_paths()) + cmd.append(output_path) + try: + # remove the file that galaxy creates. 
recab_table_fetch refuses to + # overwrite it + os.unlink(output_path) + except OSError: + pass + subprocess.check_call(cmd) + +def cleanup(out_pathset): + # clean-up job output + for path in out_pathset: + try: + print >> sys.stderr, "Deleting output path", path + phdfs.rmr(path) + except StandardError as e: + print >> sys.stderr, "Error!", str(e) + +def main(args): + if len(args) < 4: + usage_error() + + input_data = args[0] + final_output = args[1] + vcf = args[2] + num_reducers = args[3] + other = args[4:] + + # Create a temporary pathset to reference the recab_table + # output directory + out_paths = None + with tempfile.NamedTemporaryFile(mode='w+b') as tmp_pathset_file: + try: + run_recab(input_data, tmp_pathset_file.name, vcf, num_reducers, other) + tmp_pathset_file.seek(0) + out_paths = pathset.FilePathset.from_file(tmp_pathset_file) + collect_table(out_paths, final_output) + finally: + if out_paths is not None: + cleanup(out_paths) + +if __name__ == "__main__": + main(sys.argv[1:]) + +# vim: et ai ts=2 sw=2 diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/seqal.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/seqal.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,83 @@ + + + + + + Map reads on Hadoop + + seal + pydoop + hadoop-galaxy + + + + hadoop_galaxy + --input $input_data + --output $output1 + --executable seal + seqal + #if $align_only.value: + --align-only --num-reducers 0 + #else + --num-reducers $align_only.num_reducers + #end if + --trimq $trimq + ${reference.fields.path} + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Seqal is a distributed short read mapping and duplicate removal tool. It + implements a distributed version of the BWA aligner, and adds a duplicate + read identification feature using the same criteria as the Picard + MarkDuplicates command. For a full description see the `manual + <http://biodoop-seal.sourceforge.net/seqal_index.html>`_. 
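The seqal wrapper's #if/#else block encodes one design decision worth noting: an align-only run skips the duplicate-removal reduce phase entirely, so it forces zero reducers. The same mapping as a small Python sketch (the function name is ours):

```python
def seqal_reducer_args(align_only, num_reducers=None):
    # Mirrors the #if/#else block in seqal.xml: an align-only run has no
    # reduce (duplicate-removal) phase, so it must force zero reducers;
    # otherwise the user-chosen reducer count is passed through.
    if align_only:
        return ['--align-only', '--num-reducers', '0']
    return ['--num-reducers', str(num_reducers)]

assert seqal_reducer_args(True) == ['--align-only', '--num-reducers', '0']
assert seqal_reducer_args(False, 8) == ['--num-reducers', '8']
```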
+ + diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal/split_demux_output.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal/split_demux_output.py Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,106 @@ +#!/usr/bin/env python + +# Copyright (C) 2011-2014 CRS4. +# +# This file is part of Seal. +# +# Seal is free software: you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation, either version 3 of the License, or (at your option) +# any later version. +# +# Seal is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +# for more details. +# +# You should have received a copy of the GNU General Public License along +# with Seal. If not, see . + + + +import logging +import os +import sys + +import pydoop.hdfs as phdfs + +from hadoop_galaxy.pathset import FilePathset + +Debug = os.environ.get('DEBUG', None) +logging.basicConfig(level=logging.DEBUG if Debug else logging.INFO) + +def usage_error(msg=None): + if msg: + print >> sys.stderr, msg + print >> sys.stderr, "Usage: %s OUTPUT_ID DEMUX_OUTPUT_PATHSET NEW_FILE_DIR" % os.path.basename(sys.argv[0]) + sys.exit(1) + + +class PathsetWriter(object): + # The format is dictated by the Galaxy documentation for tools that produce a variable + # number of output files: http://wiki.g2.bx.psu.edu/Admin/Tools/Multiple%20Output%20Files + # We fix the file_type to 'pathset'. 
+ Galaxy_output_name_template = "primary_%s_%s_visible_pathset" + + def __init__(self, output_dir, output_id, data_type): + self.output_dir = output_dir + self.output_id = output_id + self.data_type = data_type + + def write_pathset(self, dataset_path, name): + """ + dataset_path: the path of the dataset to which the new pathset needs to refer + name: name of dataset to appear in Galaxy + """ + if not name: + raise RuntimeError("Blank dataset name") + sanitized_name = name.replace('_', '-') # replace _ with - or galaxy won't like the name + opathset = FilePathset(dataset_path) + opathset.set_datatype(self.data_type) + opath = os.path.join(self.output_dir, self.Galaxy_output_name_template % (self.output_id, sanitized_name)) + logging.debug("writing dataset path %s to pathset file %s", dataset_path, opath) + with open(opath, 'w') as f: + opathset.write(f) + return self # to allow chaining + + + +def main(): + if len(sys.argv) != 4: + usage_error("Wrong number of arguments") + + output_id, demux_data, dest_dir = sys.argv[1:] + logging.debug("input args: output_id, demux_data, dest_dir = %s", sys.argv[1:]) + + ipathset = FilePathset.from_file(demux_data) + logging.debug("input path set: %s", ipathset) + + writer = PathsetWriter(dest_dir, output_id, ipathset.datatype) + + # ipathset points to the output directory given to demux. Inside it + # we should find all the project/sample subdirectories, plus 'unknown' (if there + # were any reads not attributable to a sample). So, we list the output + # dir and collect sample names and their paths. In theory, the pathset + # we receive as input should only contain the output from one demux; thus + # a sample should only occur once. + if len(ipathset) != 1: + raise RuntimeError("Unexpected demux output pathset size of %d. 
Expected 1 (the demux output path)" % len(ipathset)) + + project_paths = \ + filter(lambda p: os.path.basename(p)[0] not in ('_', '.'), # filter hadoop and regular hidden files + phdfs.ls(iter(ipathset).next()) # List the contents of the pathset. ls produces absolute paths + ) + # Each project_path points to a directory containing the data from one project. + # There may also be a directory 'unknown' + for project_path in project_paths: + if os.path.basename(project_path).lower() == 'unknown': + writer.write_pathset(project_path, 'unknown') + else: + for project_sample_path in phdfs.ls(project_path): + # take the last two elements of the path -- should be project, sample + complete_sample_name = "%s.%s" % tuple(project_sample_path.split(os.path.sep)[-2:]) + writer.write_pathset(project_sample_path, complete_sample_name) + +if __name__ == '__main__': + main() diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/seal_tool_conf.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/seal_tool_conf.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,14 @@ + + + +
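split_demux_output.py above builds each Galaxy dataset name from the last two components of the demux output path (project and sample), then replaces underscores, since Galaxy rejects them in these generated names. In the script those two steps live in main() and write_pathset() respectively; they are combined into one hypothetical helper here:

```python
def galaxy_dataset_name(project_sample_path):
    # Take the last two path components (project, sample), join them with
    # '.', then replace '_' with '-' as split_demux_output.py does, since
    # Galaxy does not accept underscores in these generated dataset names.
    parts = project_sample_path.rstrip('/').split('/')[-2:]
    return '.'.join(parts).replace('_', '-')

assert galaxy_dataset_name('/out/ProjectX/sample_01') == 'ProjectX.sample-01'
```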
+ + + + + + + + +
+
diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/tool_data_table_conf.xml.sample --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/tool_data_table_conf.xml.sample Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,16 @@ + + + name, value, path + +
+ + + name, value, path + +
+ + + name, value, path + +
+
diff -r 000000000000 -r 244073d9abc1 seal-galaxy-cc1b1911/tool_dependencies.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/seal-galaxy-cc1b1911/tool_dependencies.xml Wed Oct 15 09:41:10 2014 -0400 @@ -0,0 +1,39 @@ + + + + + + + + + + git clone https://github.com/crs4/seal.git + git checkout master + git reset --hard 13986416aa79561bd0102cb7ccc1e0668ac9f0a4 + + + $INSTALL_DIR/lib/python + + $INSTALL_DIR/lib/python + python setup.py build_hadoop_bam + python setup.py install --prefix=$INSTALL_DIR --install-lib=$INSTALL_DIR/lib/python + + $INSTALL_DIR/bin + $INSTALL_DIR/lib/python + + + + +This package has a number of dependencies that need to be installed before it: + +* Pydoop needs to be installed (it will be pulled down as a dependency; see +that package's instructions for its own installation pointers) + +* protobuf-python + +* JDK and Ant (Ant version 1.7 or later) + +Please see http://biodoop-seal.sourceforge.net/installation_dependencies.html for more details. + + + 
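tool_dependencies.xml above exposes $INSTALL_DIR/bin and $INSTALL_DIR/lib/python so that dependent tools can find Seal. A sketch of the resulting environment — the helper function is ours, and the real wiring is done by Galaxy's dependency machinery, not by code like this:

```python
import os

def seal_env(install_dir, base_env=None):
    # Prepend Seal's install locations, mirroring the $INSTALL_DIR entries
    # above: executables go on PATH, Python packages on PYTHONPATH.
    env = dict(base_env or {})
    env['PATH'] = os.path.join(install_dir, 'bin') + os.pathsep + env.get('PATH', '')
    env['PYTHONPATH'] = (os.path.join(install_dir, 'lib', 'python')
                         + os.pathsep + env.get('PYTHONPATH', ''))
    return env

env = seal_env('/opt/galaxy/deps/seal', {'PATH': '/usr/bin'})
assert env['PATH'].split(os.pathsep)[0].endswith('bin')
```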