TIP: If your data is not TAB delimited, use Text Manipulation->Convert
What it does
This tool computes an expression on every row of a dataset and appends or inserts the result as a new column (field).
Several expressions can be specified and will be applied sequentially to each row.
Expression rules
Columns are referenced with c and a number. For example, c1 refers to the first column of a tab-delimited file
The following built-in Python functions are available for use in expressions:
abs | all | any | ascii | bin | bool | chr | complex | divmod | float | format | hex | int | len | list map | max | min | oct | ord | pow | range | reversed | round | set | sorted | str | sum | type acos | acosh | asin | asinh | atan | atan2 | atanh | cbrt | ceil | comb | copysign | cos | cosh | degrees dist | erf | erfc | exp | exp2 | expm1 | fabs | factorial | floor | fmod | frexp | fsum | gamma | gcd hypot | inf | isclose | isfinite | isinf | isnan | isqrt | ldexp | lgamma | log | log10 | log1p | log2 modf | nextafter | perm | pow | prod | remainder | sin | sqrt | tan | tanh | tau | trunc | ulp
In addition the numpy function format_float_positional is available to control the formatting of floating point numbers.
Expressions can be chained, and the tool will keep track of newly added columns while working through the chain. This means you can reference a column that was created as the result of a previous expression in later ones.
Simple examples
If this is your input:
chr1 151077881 151077918 2 200 - chr1 151081985 151082078 3 500 +
computing "c4 * c5" will produce:
chr1 151077881 151077918 2 200 - 400 chr1 151081985 151082078 3 500 + 1500
You can also use this tool to evaluate expressions. For example, computing "c3 >= c2" for the input above will result in the following:
chr1 151077881 151077918 2 200 - True chr1 151081985 151082078 3 500 + True
Similarly, computing "type(c2) == type(c3) will return:
chr1 151077881 151077918 2 200 - True chr1 151081985 151082078 3 500 + True
Error handling
The tool will always fail on syntax errors in and other unrecoverable parsing errors with any of your expressions. For other problems, however, it offers control over how they should be handled:
The default for "Autodetect column types" is "Yes", which means the tool will evaluate each column value as the type that Galaxy assumes for the column. This default behavior will allow you to write simpler expressions. The arithmetic expression "c4 * c5" from the first simple example, for instance, works only because Galaxy realizes that c4 and c5 are integer columns. Occasionally, this autodetection can cause issues. A common such situation are missing values in columns that Galaxy thinks are of numeric type. If you're getting errors like "Failed to convert some of the columns in line #X ...", a solution might be to turn off column type autodetection. The price you will have to pay for doing so is that now you will have to handle type conversions yourself. In the first example you would now have to use the epression: "int(c4) * int(c5)".
By default, if any expression references columns that are not existing before that expression gets computed, the tool will fail, but you can uncheck the "Fail on references to non-existent columns" option. If you do so, the result will depend on your choice for "If an expression cannot be computed for a row" (see 3.)
The default for rows, for which an expression fails to compute is, again, to fail the tool run, but you can also choose to:
skip the row on output
This is a simple way to only keep lines conforming to an expected standard. It is also easy to mask problems with your expressions with this option so take a look at the results and try to understand what gets skipped and for what reasons (the stdout of the tool will contain information about both).
keep the row unchanged
This can be a good solution if your input contains special separator lines that don't follow the general tabular format of other lines and you would like to keep those lines
produce an empty column value for the row
This will use the empty string as a substitute for non-computable items. Different from the "keep the row unchanged option" the problematic line will have a column added or changed. This option is a good choice for inputs in which all rows have the same tabular layout where you want to make sure that the same is true for the output, i.e. that all output lines still have the same number of columns.
fill in a replacement value
This option is very similar to the previous one, but lets you control the replacement value.
Example
In the following input:
chr1 151077881 151077918 2 200 - chr1 151081985 151082078 3 500 + chr1 151090031 151090938 4 700
the last line does not have a strand column. This violates the bed file format specification, which says that unknown strand is to be encoded as . in the strand column.
You can fix the file with the following tool run:
Add expression: c6
Mode of the operation: Replace
Use new column to replace column number: 6
Fail on references to non-existent columns: No
If an expression cannot be computed for a row: Fill in a replacement value
Replacement value: .