hadoop_galaxy: cat_paths.xml comparison

comparison cat_paths.xml @ 0:7698311d4466 draft

Uploaded

author	crs4
date	Fri, 30 May 2014 06:48:47 -0400


--1:000000000000
+:7698311d4466
+<tool id="hadoop_galaxy_cat_paths" name="Cat paths" version="0.1.0">
+<description>Concatenate all components of a pathset into a single file.</description>
+<requirements>
+<requirement type="package" version="0.11">pydoop</requirement>
+<requirement type="package" version="0.1.1">hadoop-galaxy</requirement>
+</requirements>
+<command>
+#if $use_hadoop
+dist_cat_paths
+#else
+cat_paths
+#end if
+#if $delete_source
+--delete-source
+#end if
+$input_pathset $output_path
+</command>
+<inputs>
+<param name="input_pathset" type="data" format="pathset" label="Input pathset">
+<validator type="empty_field" />
+</param>
+<param name="delete_source" type="boolean" checked="false" label="Delete remote input data"
+help="This option makes the tool move the data rather than copy it" />
+<param name="use_hadoop" type="boolean" checked="false" label="Use Hadoop-based program"
+help="The Galaxy workspace must be accessible by the Hadoop cluster (see help for details)" />
+</inputs>
+<outputs>
+<!-- TODO: can we read the format from input pathset and transfer it to output? -->
+<data name="output_path" format="data" label="Concatenated dataset $input_pathset.name" />
+</outputs>
+<stdio>
+<exit_code range="1:" level="fatal" />
+</stdio>
+<help>
+Datasets represented as pathsets can be split into a number of files.
+This tool takes all of them and concatenates them into a single output file.
+In your workflow, you'll need to explicitly set the appropriate data format on the
+output dataset with an Action to "Change Datatype".
+
+"Delete remote input data" option
+====================================
+With this option, after the data has been concatenated into the new Galaxy dataset,
+the original files that were referenced by the pathset are deleted.  This effectively
+tells the tool to "move" the data instead of "copying" it and helps
+avoid amassing intermediate data in your Hadoop workspace.
+
+"Use Hadoop-based program" option
+====================================
+With this option you will use your entire Hadoop cluster to simultaneously write
+multiple parts of the final file.  For this to be possible, the Hadoop nodes
+must be able to access the Galaxy file space directly.  In addition, to achieve
+reasonable results the Galaxy workspace should be on a parallel shared file system.
+</help>
+</tool>
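
Note on the behaviour described in the help text: the following is a minimal Python sketch of the concatenate-then-delete semantics, not the actual `cat_paths` implementation. It assumes the pathset has already been resolved to an ordered list of local file paths; the function name `concatenate_parts` and its parameters are hypothetical, and HDFS access is out of scope here.

```python
import os
import shutil


def concatenate_parts(part_paths, output_path, delete_source=False):
    """Append each part, in the order given, to a single output file.

    Hypothetical helper: part_paths stands in for the components of a
    pathset and is assumed to contain readable local paths.
    """
    with open(output_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream in chunks so large parts are never held in memory.
                shutil.copyfileobj(src, out)

    # Sources are removed only after the whole output has been written and
    # closed, which is what turns the copy into a "move".
    if delete_source:
        for part in part_paths:
            os.remove(part)
```

Deleting the sources only after the output is complete means a failure part-way through leaves the original data untouched, which matches the "after the data has been concatenated" wording in the help.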
