This tool is a Galaxy wrapper for the Pandas Data Analysis Library in Python, for manipulating and computing expressions upon tabular data and matrices. It can perform functions on the element, row, and column basis, as well as sub-select, duplicate, replace, and perform general and custom expressions on rows, columns, and elements.
Only a single operation can be performed on the data. Multiple operations can be performed by chaining successive runs of this tool. This is to provide a more transparent workflow for complex operations.
Many of the examples given below relate to common research use-cases such as filtering large matrices for specific values, counting unique instances of elements, conditionally manipulating the data, and replacing unwanted values. Full table operations such as normalisation can be easily performed by scaling the data via mean/median/min/max (and many other) metrics, and general expressions can even be computed across multiple tables.
We have the following table:
. c1 c2 c3 g1 10 20 30 g2 3 6 9 g3 4 8 12 g4 81 6 3
would result in the output table:
. | c1 | c1 | c3 |
---|---|---|---|
g1 | 10 | 10 | 30 |
g2 | 3 | 3 | 9 |
g3 | 4 | 4 | 12 |
g2 | 3 | 3 | 9 |
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Drop, keep or duplicate rows and columns
- List of columns to select → 1,1,3
- List of rows to select → 1:3,2
- Keep duplicate columns → Yes
- Keep duplicate rows → Yes
We have the following table:
. c1 c2 c3 g1 10 20 30 g2 3 6 9 g3 4 8 12 g4 81 6 3
and we want:
. c1 c2 c3 g2 3 6 9 g3 4 8 12
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Filter rows or columns by their properties
- Filter → Rows
- Filter Criterion → Result of function applied to columns/rows
- Keep column/row if its observed → Sum
- is → < (Less Than)
- this value → 50
We have the following table:
. c1 c2 c3 g1 10 20 30 g2 3 6 9 g3 4 8 12 g4 81 6 3
and we want to count how many elements in each row are smaller than 10, i.e., we want to obtain the following results table:
. vec g1 0 g2 3 g3 2 g4 2
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Manipulate selected table elements
- Operation to perform → Custom
- Custom Expression on 'elem' → elem < 10
- Operate on elements → All
Note: There are actually simpler ways to achieve our purpose, but here we are demonstrating the use of a custom expression.
After executing, we would then be presented with a table like so:
. c1 c2 c3 g1 False False False g2 True True True g3 True True False g4 False True True
To get to our desired table, we would then process this table with the tool again:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Compute Expression across Rows or Columns
- Calculate → Sum
- For each → Row
Executing this will sum all the 'True' values in each row. Note that the values must have no extra whitespace in them for this to work (e.g. 'True ' or ' True' will not be parsed correctly).
We want to perform a scaled log transformation on all values greater than 5, and set all other values to 1.
We have the following table:
. c1 c2 c3 g1 0 20 30 g2 3 0 9 g3 4 8 0 g4 81 0 0
and we want:
. c1 c2 c3 g1 1.00000000 0.1497866 0.1133732 g2 1.00000000 1.0000000 0.2441361 g3 1.00000000 0.2599302 1.0000000 g4 0.05425246 1.0000000 1.0000000
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Manipulate selected table elements
- Operation to perform → Custom
- Custom Expression → (math.log(elem) / elem) if (elem > 5) else 1
- Operate on elements → All
We have the following table:
. c1 c2 c3 g1 10 20 30 g2 3 10 9 g3 4 8 10 g4 81 10 10
deviation of it to yield:
. | c1 | c2 | c3 |
---|---|---|---|
g1 | 9.351737 | 17.784353 | 28.550737 |
g2 | 2.351737 | 7.784353 | 7.550737 |
g3 | 3.351737 | 5.784353 | 8.550737 |
g4 | 80.351737 | 7.784353 | 8.550737 |
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Perform a Full Table Operation
- Operation → Custom
- Custom Expression on 'table' along axis (0 or 1) → table - table.mean(0)/table.std(0)
We have the following three input tables:
Table 1
. c1 c2 c3 g1 10 20 30 g2 3 10 9 g3 4 8 10
Table 2
. c1 c2 g1 1 2 g2 3 4 g3 6 5
Table 3
. c1 c2 c3 g1 1 2 3 g2 1 2 3
Note that the dimensions of these tables do not match.
In order to perform simple operations between Tables, they must be of the same dimensions.
To add Table2 to Table3 we would have to transpose one of the tables using the in-built T method:
table2 + table3.T
or:
table2.T + table3
We can also perform more general operations using all 3 tables, such as taking the minimum value of the maximum values of Table2 and Table3, and dividing the Table1 values by it:
table1 / min(table2.values.max(), table3.values.max())
To perform these types of operations in Galaxy we would select the following:
Input Single or Multiple Tables → Multiple Tables
- (For each inserted table)
- Column names on first row? → Yes
- Row names on first column? → Yes
Custom Expression →
<insert your desired function>
Please note that the last example shown above was chosen to illustrate the limitations of the tool. Nested attributes like table2.values.max are disallowed in expressions in the tool so the above would have to be replaced with the harder to read workaround:
table1 / min(np.max(np.max(table2)), np.max(np.max(table3)))
Complex operations (like ones that would benefit from specifying nested attributes) can often be broken into subsequent runs ot the tool, in which the first run generates an intermediate table representing the result of the "inner" operation that the second run can then use as input to perform the "outer" operation.
Also note that, currently min(), max() and sum() are the only built-in Python functions that can be used inside expressions. If you want to use additional functions, these have to be qualified functions from the math, np or pd libraries.
We have the following table
. A B C 0 a B 1 1 b B 3 2 c B 5
and we want:
. A variable value 0 a B B 1 b B B 2 c B B 3 a C 1 4 b C 3 5 c C 5
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Perform a Full Table Operation
- Operation → Melt
- Variable IDs → A
- Unpivoted IDs → B,C
This converts the "B" and "C" columns into variables.
We have the following table
. foo bar baz zoo 0 one A 1 x 1 one B 2 y 2 one C 3 z 3 two A 4 q 4 two B 5 w 5 two C 6 t
and we want:
. A B C one 1 2 3 two 4 5 6
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Perform a Full Table Operation
- Operation → Pivot
- Index → foo
- Column → bar
- Values → baz
We have the following table
. c1 c2 c3 g1 10 20 30 g2 3 3 9 g3 4 8 12 g4 81 6 3
and we want to add "chr" to the elements in column 2 AND rows 2 and 4:
. c1 c2 c3 g1 10 20 30 g2 3 chr3 9 g3 4 8 12 g4 81 chr6 3
In Galaxy we would select the following:
Input Single or Multiple Tables → Single Table
Column names on first row? → Yes
Row names on first column? → Yes
Type of table operation → Manipulate selected table elements
Operation to perform → Replace values
Replacement value → chr{elem:.0f}
Here, the placeholder {elem} lets us refer to each element's current value, while the :.0f part is a format specifier that makes sure numbers are printed without decimals (for a complete description of the available syntax see the Python Format Specification Mini-Language).
Operate on elements → Specific Rows and/or Columns
List of columns to select → 2
List of rows to select → 2,4
Inclusive Selection → No
would repeat the steps above but set the Inclusive Selection to "Yes", to give:
. | c1 | c2 | c3 |
---|---|---|---|
g1 | 10 | chr20 | 30 |
g2 | chr3 | chr3 | chr9 |
g3 | 4 | 8 | 12 |
g4 | chr81 | chr6 | chr3 |
For an input table of:
A | B | C | D | E |
---|---|---|---|---|
foo | one | small | 1 | 2 |
foo | one | large | 2 | 4 |
foo | one | large | 2 | 5 |
foo | two | small | 3 | 5 |
foo | two | small | 3 | 6 |
bar | one | large | 4 | 6 |
bar | one | small | 5 | 8 |
bar | two | small | 6 | 9 |
bar | two | large | 7 | 9 |
we wish to pivot the table with the 'A' column as the new row index and use the values of the column 'C' as the new column indexes, based on the aggregated values of 'D'. By default the aggregator function is the mean, but here we will instead pick the max, to yield:
C | l | s |
---|---|---|
A | ||
bar | 7 | 6 |
foo | 2 | 3 |
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Perform a Full Table Operation
- Operation → Pivot
- Index → A
- Column → C
- Values → D
- Aggregator Function → max
For an input table of:
Name | Position | City | Age | Random |
---|---|---|---|---|
Mary | Manager | Boston | 34 | 0.678577 |
Josh | Programmer | New York | 37 | 0.973168 |
Jon | Manager | Chicago | 29 | 0.146668 |
Lucy | Manager | Los Angeles | 40 | 0.150120 |
Jane | Programmer | Chicago | 29 | 0.112769 |
Sue | Programmer | Boston | 31 | 0.185198 |
we wish to pivot the table with the 'Position' column as the new index and transform the 'Age' and 'Random' columns to have mean and standard deviation values
Position | Age | Random | Random |
---|---|---|---|
. | mean | mean | std |
Manager | 34.333333 | 0.325122 | 0.306106 |
Programmer | 32.333333 | 0.423712 | 0.477219 |
In Galaxy we would select the following:
- Input Single or Multiple Tables → Single Table
- Column names on first row? → Yes
- Row names on first column? → Yes
- Type of table operation → Perform a Full Table Operation
- Operation → Pivot
- Index → Position
- Column-Function Mapping
- Value Column → Age
- Function → mean
- Value Column → Random
- Function → mean
- Function → std