Filter a tabular dataset by applying line filters as it is being read. Multiple filters may be used, with each filter operating on the output of the previous filter.
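Because each filter consumes the output of the previous one, the order of the filters matters. The following is a minimal sketch of that chaining, not the tool's actual implementation; the function names `skip_leading`, `drop_comments`, and `chain_filters` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def skip_leading(n):
    """Hypothetical filter factory: drop the first n lines of the stream."""
    def apply(lines):
        for i, line in enumerate(lines):
            if i >= n:
                yield line
    return apply


def drop_comments(char):
    """Hypothetical filter factory: omit lines that start with char."""
    def apply(lines):
        return (line for line in lines if not line.startswith(char))
    return apply


def chain_filters(lines, filters):
    """Feed each filter the result of the previous filter."""
    return reduce(lambda stream, f: f(stream), filters, lines)


raw = ["#comment", "header", "a\tb", "#x", "c\td"]

# Comments are removed first, so the skipped line is "header".
result = list(chain_filters(raw, [drop_comments("#"), skip_leading(1)]))

# Reversing the order skips "#comment" instead, so "header" survives.
reversed_result = list(chain_filters(raw, [skip_leading(1), drop_comments("#")]))
```

In this sketch `result` holds only the two data lines, while `reversed_result` still contains the header line.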
Inputs
A tabular dataset.
Outputs
A filtered tabular dataset.
Input Line Filters
As a tabular file is being read, line filters may be applied:
- skip leading lines - skip the specified number of lines at the start of the file
- comment char - omit any lines that start with the specified comment character
- by regex expression matching - include or exclude lines that match the regular expression
- select columns - choose to include only selected columns in the order specified
- select columns by indices/slices - indices or slices of the columns to keep (Python list indexing)
- regex replace value in column - replace a field in a column using a regex substitution (good for date reformatting)
- regex replace value in column - alternatively, add a new column using a regex substitution of a column value
- prepend a line number column - each line has the ordinal value of the line read by this filter as the first column
- append a line number column - each line has the ordinal value of the line read by this filter as the last column
- prepend a text column - each line has the text string as the first column
- append a text column - each line has the text string as the last column
- prepend the dataset name - each line has the dataset name as the first column
- append the dataset name - each line has the dataset name as the last column
- normalize list columns - replicates the line for each item in the specified list columns
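A few of the filters listed above can be sketched on a single row that has already been split into fields. This is an illustrative sketch only: the helper names are hypothetical, column numbers are assumed to be 1-based, and the new column produced by the regex filter is assumed to be appended at the end of the row.

```python
import re


def select_columns(fields, columns):
    """Keep only the listed columns (assumed 1-based), in the order given."""
    return [fields[c - 1] for c in columns]


def regex_add_column(fields, column, pattern, replacement):
    """Add a new column holding a regex substitution of an existing column.

    Assumption: the new column is appended as the last column.
    """
    return fields + [re.sub(pattern, replacement, fields[column - 1])]


def prepend_text(fields, text):
    """Put a fixed text string in front of the row as the first column."""
    return [text] + fields


def append_text(fields, text):
    """Put a fixed text string at the end of the row as the last column."""
    return fields + [text]


row = ["Paula", "Brown", "24/05/78"]

reordered = select_columns(row, [2, 1])
with_iso_date = regex_add_column(row, 3, r"(\d+)/(\d+)/(\d+)", r"19\3-\2-\1")
labelled = prepend_text(row, "pets")
```

Here `reordered` is `["Brown", "Paula"]`, `with_iso_date` keeps the original date and gains `"1978-05-24"` as a fourth field, and `labelled` starts with `"pets"`.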
Example: six filters are applied as the following file is read.

Input Tabular File:

  #People with pets
  Pets  FirstName  LastName  DOB       PetNames   PetType
  2     Paula      Brown     24/05/78  Rex,Fluff  dog,cat
  1     Steven     Jones     04/04/74  Allie      cat
  0     Jane       Doe       24/05/78
  1     James      Smith     20/10/80  Spot

Filter 1 - append a line number column:

  #People with pets                                        1
  Pets  FirstName  LastName  DOB       PetNames   PetType  2
  2     Paula      Brown     24/05/78  Rex,Fluff  dog,cat  3
  1     Steven     Jones     04/04/74  Allie      cat      4
  0     Jane       Doe       24/05/78                      5
  1     James      Smith     20/10/80  Spot                6

Filter 2 - by regex expression matching [include]: '^\d+' (include lines that start with a number)

  2     Paula      Brown     24/05/78  Rex,Fluff  dog,cat  3
  1     Steven     Jones     04/04/74  Allie      cat      4
  0     Jane       Doe       24/05/78                      5
  1     James      Smith     20/10/80  Spot                6

Filter 3 - append a line number column:

  2     Paula      Brown     24/05/78  Rex,Fluff  dog,cat  3  1
  1     Steven     Jones     04/04/74  Allie      cat      4  2
  0     Jane       Doe       24/05/78                      5  3
  1     James      Smith     20/10/80  Spot                6  4

Filter 4 - regex replace value in column[4]: '(\d+)/(\d+)/(\d+)' '19\3-\2-\1' (convert dates to sqlite format)

  2     Paula      Brown     1978-05-24  Rex,Fluff  dog,cat  3  1
  1     Steven     Jones     1974-04-04  Allie      cat      4  2
  0     Jane       Doe       1978-05-24                      5  3
  1     James      Smith     1980-10-20  Spot                6  4

Filter 5 - normalize list columns[5,6]:

  2     Paula      Brown     1978-05-24  Rex    dog  3  1
  2     Paula      Brown     1978-05-24  Fluff  cat  3  1
  1     Steven     Jones     1974-04-04  Allie  cat  4  2
  0     Jane       Doe       1978-05-24              5  3
  1     James      Smith     1980-10-20  Spot        6  4

Filter 6 - select columns by indices/slices: '1:6'

  Paula   Brown  1978-05-24  Rex    dog
  Paula   Brown  1978-05-24  Fluff  cat
  Steven  Jones  1974-04-04  Allie  cat
  Jane    Doe    1978-05-24
  James   Smith  1980-10-20  Spot
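The six-filter example can be reproduced with a short Python sketch. It is not the tool's implementation: all function names are hypothetical, the input is assumed to be tab-separated with empty trailing fields preserved, filter column numbers are assumed to be 1-based while the slice uses Python's 0-based indexing, and list columns of unequal length are assumed to be padded with empty strings.

```python
import re
from itertools import zip_longest


def append_line_number(rows):
    """Append the ordinal of each row as read by this filter."""
    for n, fields in enumerate(rows, start=1):
        yield fields + [str(n)]


def regex_include(rows, pattern):
    """Keep rows whose tab-joined line matches the pattern."""
    for fields in rows:
        if re.search(pattern, "\t".join(fields)):
            yield fields


def regex_replace(rows, column, pattern, replacement):
    """Replace the value of a 1-based column using a regex substitution."""
    for fields in rows:
        fields = list(fields)
        fields[column - 1] = re.sub(pattern, replacement, fields[column - 1])
        yield fields


def normalize(rows, columns):
    """Replicate the row once per item in the comma-separated list columns."""
    for fields in rows:
        parts = [fields[c - 1].split(",") for c in columns]
        # Assumption: shorter lists are padded with empty strings.
        for items in zip_longest(*parts, fillvalue=""):
            new = list(fields)
            for c, item in zip(columns, items):
                new[c - 1] = item
            yield new


def select_slice(rows, start, stop):
    """Keep fields[start:stop], using Python list slicing."""
    for fields in rows:
        yield fields[start:stop]


text = (
    "#People with pets\n"
    "Pets\tFirstName\tLastName\tDOB\tPetNames\tPetType\n"
    "2\tPaula\tBrown\t24/05/78\tRex,Fluff\tdog,cat\n"
    "1\tSteven\tJones\t04/04/74\tAllie\tcat\n"
    "0\tJane\tDoe\t24/05/78\t\t\n"
    "1\tJames\tSmith\t20/10/80\tSpot\t\n"
)

rows = (line.split("\t") for line in text.splitlines())
rows = append_line_number(rows)                                   # Filter 1
rows = regex_include(rows, r"^\d+")                               # Filter 2
rows = append_line_number(rows)                                   # Filter 3
rows = regex_replace(rows, 4, r"(\d+)/(\d+)/(\d+)", r"19\3-\2-\1")  # Filter 4
rows = normalize(rows, [5, 6])                                    # Filter 5
rows = select_slice(rows, 1, 6)                                   # Filter 6
result = list(rows)
```

Each filter is a generator, so rows flow through all six steps one at a time as the input is read. The slice `1:6` drops the first column (index 0) and the two appended line number columns, which is why the final rows keep only the name, date, and pet fields.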