Csvtk - Filter2 Help

Info

Csvtk advanced filter (also called filter2) outputs rows that satisfy the given awk-like arithmetic/string expressions. See the documentation for further details and examples of how to write expressions.

Single quotes are not allowed in text inputs!

If the header of the column you want contains a space, use the column number instead. Example: use $1 if column #1 is called "Colony Counts".

Supported operators and types:

Modifiers: + - / * & | ^ ** % >> <<

Comparators: > >= < <= == != =~ !~

Logical ops: || &&

Numeric constants, as 64-bit floating point (12345.678)

String constants (double quotes: "foobar")

Date constants (double quotes)

Boolean constants: true false

Parentheses to control order of evaluation ( )

Arrays (anything separated by , within parentheses: (1, 2, "foo"))

Prefixes: ! - ~

Ternary conditional: ? :

Null coalescence: ??

Input Data

**Limitations of Input Data**

1. The CSV parser requires all lines to have the same number of fields/columns.
    If your file has illegal rows, set the "Ignore Illegal Rows" parameter to "Yes" to pass your data through.
    Even lines with spaces will cause an error.
    See the example bad table below.

2. By default, csvtk assumes files have a header row. If your file does not, set the global parameter
    "Has Header Row" to "No".

3. Column names should be unique and are case sensitive!

4. Lines starting with "#" or "$" will be ignored, if in the header row

5. If " appears in a tab-delimited file, set the "Lazy quotes" global parameter to "Yes".

Example bad table:

Head 1	Head 2	Head 3	Head 3
1	2	3
this	will		break

Bad tables may work if both the "Ignore Illegal Rows" and "Ignore Empty Rows" global parameters are set to "Yes", but there is no guarantee of that!
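
The limitations above can be checked before uploading a file. The following is a minimal Python sketch, not part of csvtk or this tool; the function name `find_table_problems` is hypothetical. It only tests two of the listed rules (unique column names and a consistent number of fields per line) and assumes a tab-delimited file with a header row.

```python
import csv
import io


def find_table_problems(text, delimiter="\t"):
    """Return a list of human-readable problems found in a delimited table."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    problems = []
    if not rows:
        return problems
    header = rows[0]

    # Column names must be unique (the comparison is case sensitive).
    seen = set()
    for name in header:
        if name in seen:
            problems.append(f"duplicate column name: {name!r}")
        seen.add(name)

    # Every data row must have as many fields as the header row.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
    return problems


# The example bad table from above, written with explicit tab characters.
bad_table = "Head 1\tHead 2\tHead 3\tHead 3\n1\t2\t3\nthis\twill\t\tbreak\n"
for problem in find_table_problems(bad_table):
    print(problem)
```

For the example bad table this reports the repeated "Head 3" header and the second line, which has three fields where four are expected.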

Usage

Ex. Filter2 on one column:

Suppose we had the following table:

Culture Label	Cell Count	Dilution
ECo-1	2523	1000
LPn-1	100	1000000
LPn-2	4	1000

If we wanted to find all samples with the label LPn, we could use the filter expression '$1 =~ "LPn*"' to get the following output:

Culture Label	Cell Count	Dilution
LPn-1	100	1000000
LPn-2	4	1000

Note that $1 was used to refer to column 1 because its name contains a space.
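
The pattern on the right of =~ is a regular expression, not a shell-style wildcard, so the * in "LPn*" means "zero or more n" rather than "anything". The Python sketch below illustrates the difference under the assumption that =~ performs an unanchored regular expression match. csvtk uses its own regular expression engine, not Python's, but for patterns this simple the behaviour should be the same. The label "LP-3" is a hypothetical addition to the example table.

```python
import re

labels = ["ECo-1", "LPn-1", "LPn-2", "LP-3"]  # "LP-3" is a hypothetical extra label

# "LPn*" as a regular expression: "LP" followed by zero or more "n".
loose = [s for s in labels if re.search(r"LPn*", s)]

# "^LPn" only matches labels that start with "LPn".
strict = [s for s in labels if re.search(r"^LPn", s)]

print(loose)   # ['LPn-1', 'LPn-2', 'LP-3']
print(strict)  # ['LPn-1', 'LPn-2']
```

On the example table both patterns select the same two rows. The difference only matters when other labels contain "LP".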

Ex2. Filter2 with multiple inputs:

Same input table

Culture Label	Cell Count	Dilution
ECo-1	2523	1000
LPn-1	100	1000000
LPn-2	4	1000

If we now filter with the expression '$1 =~ "LPn*" && $Dilution > 1000', we get the only row that satisfies both conditions:

Culture Label	Cell Count	Dilution
LPn-1	100	1000000
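
To make the logic of the combined expression explicit, here is a rough Python equivalent. It is an illustrative sketch only, not how csvtk evaluates expressions; the function name `filter_rows` is hypothetical, and the numeric comparison assumes the Dilution column always holds a number.

```python
import csv
import io
import re

TABLE = (
    "Culture Label\tCell Count\tDilution\n"
    "ECo-1\t2523\t1000\n"
    "LPn-1\t100\t1000000\n"
    "LPn-2\t4\t1000\n"
)


def filter_rows(text):
    """Keep the header plus rows where column 1 matches LPn* and Dilution > 1000."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    dilution_idx = header.index("Dilution")  # column looked up by name
    kept = [header]
    for row in reader:
        label_matches = re.search(r"LPn*", row[0]) is not None  # $1 =~ "LPn*"
        dilution_is_high = float(row[dilution_idx]) > 1000      # $Dilution > 1000
        if label_matches and dilution_is_high:                  # &&
            kept.append(row)
    return kept


for row in filter_rows(TABLE):
    print("\t".join(row))
```

LPn-2 is dropped because its dilution is exactly 1000, and the comparison is strictly greater than.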

Column Name Input Help

Multiple names can be given if separated by a ' , '.
- ex. 'ID,Organism' would target the columns named ID and Organism for the function
Column names are case sensitive.
Column numbers can also be given:
- ex. '1,2,3' or '1-3' for inputting columns 1-3.
You can also specify all but unwanted column(s) with a ' - '.
- ex. '-ID' would target all columns but the ID column
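
The selection rules above can be summarised in a short Python sketch. This is a simplified illustration, not csvtk's own parser; the function name `select_columns` and the example header are hypothetical, and the sketch assumes that exclusions are given by name and are not mixed with inclusions.

```python
def select_columns(spec, header):
    """Resolve a column spec such as 'ID,Organism', '1-3' or '-ID' to column names."""
    tokens = [t.strip() for t in spec.split(",") if t.strip()]

    # A leading '-' means "all columns except these" (names only in this sketch).
    excluded = [t[1:] for t in tokens if t.startswith("-")]
    if excluded:
        return [name for name in header if name not in excluded]

    selected = []
    for token in tokens:
        parts = token.split("-")
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            # A numeric range such as '1-3' (1-based, inclusive).
            start, end = int(parts[0]), int(parts[1])
            selected.extend(header[start - 1:end])
        elif token.isdigit():
            # A single 1-based column number.
            selected.append(header[int(token) - 1])
        else:
            # A column name; the match is case sensitive.
            if token not in header:
                raise KeyError(f"no column named {token!r}")
            selected.append(token)
    return selected


header = ["ID", "Organism", "Count", "Notes"]  # hypothetical header row
print(select_columns("ID,Organism", header))
print(select_columns("1-3", header))
print(select_columns("-ID", header))
```

Because names are matched case-sensitively, 'id' would not select the ID column in this sketch.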

More Information

For information from the creators of csvtk, please visit their site at: https://bioinf.shenwei.me/csvtk/

Be aware, however, that some features may not be available and that some small changes were made to work with Galaxy.

Notable changes from their documentation:

Cannot specify multiple file header names (i.e. cannot use "name;username" as a valid column match)
No single quotes / apostrophes allowed in text inputs