annotate cat_paths.xml @ 0:7698311d4466 draft

Uploaded
author crs4
date Fri, 30 May 2014 06:48:47 -0400
<tool id="hadoop_galaxy_cat_paths" name="Cat paths" version="0.1.0">
  <description>Concatenate all components of a pathset into a single file.</description>
  <requirements>
    <requirement type="package" version="0.11">pydoop</requirement>
    <requirement type="package" version="0.1.1">hadoop-galaxy</requirement>
  </requirements>

  <command>
    #if $use_hadoop
      dist_cat_paths
    #else
      cat_paths
    #end if
    #if $delete_source
      --delete-source
    #end if
    $input_pathset $output_path
  </command>

  <inputs>
    <param name="input_pathset" type="data" format="pathset" label="Input pathset">
      <validator type="empty_field" />
    </param>
    <param name="delete_source" type="boolean" checked="false" label="Delete remote input data"
      help="This option makes the tool move the data rather than copy it" />
    <param name="use_hadoop" type="boolean" checked="false" label="Use Hadoop-based program"
      help="The Galaxy workspace must be accessible by the Hadoop cluster (see help for details)" />
  </inputs>

  <outputs>
    <!-- TODO: can we read the format from input pathset and transfer it to output? -->
    <data name="output_path" format="data" label="Concatenated dataset $input_pathset.name" />
  </outputs>

  <stdio>
    <exit_code range="1:" level="fatal" />
  </stdio>

  <help>
Datasets represented as pathsets may be split across a number of files.
This tool takes all of them and concatenates them into a single output file.

In your workflow, you'll need to explicitly set the appropriate data format on the
output dataset with a "Change Datatype" action.

"Delete remote input data" option
====================================
With this option, after the data has been concatenated into the new Galaxy dataset,
the original files that were referenced by the pathset are deleted. This effectively
makes the tool "move" the data instead of "copying" it and helps
avoid amassing intermediate data in your Hadoop workspace.
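
As a sketch based on this wrapper's command template, enabling this option
(without the Hadoop-based program) makes the tool run a command of the
following form, where INPUT_PATHSET and OUTPUT_PATH are placeholders for the
paths that Galaxy supplies::

    cat_paths --delete-source INPUT_PATHSET OUTPUT_PATH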


"Use Hadoop-based program" option
====================================

With this option you will use your entire Hadoop cluster to simultaneously write
multiple parts of the final file. For this to be possible, the Hadoop nodes
must be able to access the Galaxy file space directly. In addition, to achieve
reasonable results the Galaxy workspace should be on a parallel shared file system.
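
As a sketch based on this wrapper's command template, with both this option
and "Delete remote input data" enabled the tool runs a command of the
following form (INPUT_PATHSET and OUTPUT_PATH are placeholders for the paths
that Galaxy supplies)::

    dist_cat_paths --delete-source INPUT_PATHSET OUTPUT_PATH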
  </help>
</tool>