
Datamash (version 1.0.6)

Select Input Data:

Group by fields:

Example: 1,4 - To group by the first and fourth fields. Leave empty to perform operation on entire file as one group.

Input file has a header line:

Mark this if the input file's first line is a header line

Print header line:

Mark this if you want the first line to show the field names

Sort input:

Mark if the input file is not sorted. If the input file is already sorted, unmark this option to reduce computing time.

Print all fields from input file:

If set, all input fields will be printed. If unset, only fields used for grouping will be printed.

Ignore case when grouping:

If set, upper/lowercase differences will be ignored when grouping fields.

Operation to perform on each group

TIP: Input data must be TAB delimited. If the desired dataset does not appear in the input list, use Text Manipulation->Convert to convert it to Tabular type.

Syntax

This tool performs common operations (such as sum, count, mean, and standard deviation) on a tabular input file. The tool can also optionally group the input by one or more fields.
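The "Sort input" and "Ignore case when grouping" options above are easier to follow with a small model of grouping. The Python sketch below is not Datamash's implementation. It assumes that grouping merges only adjacent lines with equal keys, which is what the advice to sort unsorted input suggests. The function name group_runs and the sample rows (including the lowercase "arts") are hypothetical.

```python
from itertools import groupby


def group_runs(rows, key_index, sort_input=False, ignore_case=False):
    """Group rows by one field, merging only adjacent rows with equal keys."""

    def key(row):
        value = row[key_index]
        return value.lower() if ignore_case else value

    if sort_input:
        # sorted() is stable, so rows keep their input order within a group
        rows = sorted(rows, key=key)
    return [(k, list(grp)) for k, grp in groupby(rows, key=key)]


# Hypothetical (Name, Major) rows; the third row differs from the first only in case
rows = [("Bryan", "Arts"), ("Tysza", "Business"), ("Isaiah", "arts")]

unsorted_groups = group_runs(rows, 1)
sorted_groups = group_runs(rows, 1, sort_input=True, ignore_case=True)

print([k for k, _ in unsorted_groups])
print([(k, len(g)) for k, g in sorted_groups])
```

Without sorting or case folding, the three rows form three separate groups. With both enabled, the two arts rows end up in a single group. In this sketch the group key is reported in lowercase, which is a simplification.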

Example 1

Find the average score of college students in a statistics course, grouped by college major. The input file has three fields (Name, Major, Score) and a header line:

Name        Major            Score
Bryan       Arts             68
Isaiah      Arts             80
Gabriel     Health-Medicine  100
Tysza       Business         92
Zackery     Engineering      54
...
...

Grouping the input by the second column (Major) and performing the mean and sample standard deviation operations on the third column (Score) gives:

GroupBy(Major)     mean(Score)   sstdev(Score)
Arts               68.9474       10.4215
Business           87.3636       5.18214
Engineering        66.5385       19.8814
Health-Medicine    90.6154       9.22441
Life-Sciences      55.3333       20.606
Social-Sciences    60.2667       17.2273

This sample file is available at http://www.gnu.org/software/datamash .
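As a rough illustration of what Example 1 computes, the following Python sketch groups scores by major and reports the mean and sample standard deviation. It uses only the five rows visible in the excerpt above, so its numbers will not match the table, which was produced from the full file. The function name summarize is hypothetical, and returning NaN for a group with a single value is a choice made for this sketch.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

# The five rows shown in the Example 1 excerpt (Name, Major, Score)
rows = [
    ("Bryan", "Arts", 68),
    ("Isaiah", "Arts", 80),
    ("Gabriel", "Health-Medicine", 100),
    ("Tysza", "Business", 92),
    ("Zackery", "Engineering", 54),
]


def summarize(rows):
    """Return {major: (mean, sample standard deviation)} ordered by major."""
    scores = defaultdict(list)
    for _name, major, score in rows:
        scores[major].append(score)

    result = {}
    for major in sorted(scores):
        values = scores[major]
        # The sample standard deviation needs at least two values;
        # this sketch reports NaN otherwise (an assumption, see lead-in).
        sd = stdev(values) if len(values) > 1 else math.nan
        result[major] = (mean(values), sd)
    return result


summary = summarize(rows)
for major, (avg, sd) in summary.items():
    print(f"{major}\t{avg:g}\t{sd:g}")
```

With these five rows, Arts is the only major with more than one score, so it is the only group with a defined sample standard deviation.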

Example 2

This example uses the UCSC RefSeq Human Gene Track, available at: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz

List the number and identifiers of isoforms per gene. The gene identifier is in column 13, and the isoform/transcript identifier is in column 2. Grouping by column 13 and performing the count and "Combine all values" (collapse) operations on column 2 gives:

GroupBy(field-13)     count(field-2)     collapse(field-2)
A1BG                  1                  NM_130786
A1BG-AS1              1                  NR_015380
A1CF                  6                  NM_001198818,NM_001198819,NM_001198820,NM_014576,NM_138932,NM_138933
A2M                   1                  NM_000014
A2M-AS1               1                  NR_026971
A2ML1                 2                  NM_001282424,NM_144670
...
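The count and collapse operations can be sketched in a few lines of Python. This is an illustration of the idea, not how Datamash itself is implemented. The (gene, transcript) pairs are copied from a few rows of the output table above, and the function name count_and_collapse is hypothetical.

```python
from collections import defaultdict

# (gene, transcript) pairs taken from rows of the output table above
pairs = [
    ("A1BG", "NM_130786"),
    ("A2M", "NM_000014"),
    ("A2ML1", "NM_001282424"),
    ("A2ML1", "NM_144670"),
]


def count_and_collapse(pairs):
    """Return (gene, number of transcripts, comma-joined transcripts) per gene."""
    grouped = defaultdict(list)
    for gene, transcript in pairs:
        grouped[gene].append(transcript)
    return [(gene, len(ids), ",".join(ids)) for gene, ids in sorted(grouped.items())]


result = count_and_collapse(pairs)
for gene, n, ids in result:
    print(f"{gene}\t{n}\t{ids}")
```

Each output row holds the group key, the number of values in the group, and all values joined by commas, mirroring the three columns of the table.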

Count how many transcripts are listed for each chromosome and strand. The chromosome is in column 3, the strand is in column 4, and the transcript identifier is in column 2. Grouping by columns 3,4 and performing the count operation on column 2 gives:

GroupBy(field-3)     GroupBy(field-4)     count(field-2)
chr1                 +                    2456
chr1                 -                    2431
chr2                 +                    1599
chr2                 -                    1419
chr3                 +                    1287
chr3                 -                    1249
...
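Grouping by two fields treats each distinct combination of values as one group. A minimal Python sketch of that idea follows. The records are hypothetical placeholders, not real refGene rows, so the counts are unrelated to the table above.

```python
from collections import Counter

# Hypothetical (transcript, chromosome, strand) records, not real refGene rows
records = [
    ("tx1", "chr1", "+"),
    ("tx2", "chr1", "-"),
    ("tx3", "chr1", "+"),
    ("tx4", "chr2", "-"),
]

# The group key is the (chromosome, strand) pair
counts = Counter((chrom, strand) for _tx, chrom, strand in records)

for (chrom, strand), n in sorted(counts.items()):
    print(f"{chrom}\t{strand}\t{n}")
```

Using a tuple as the key means chr1 on the + strand and chr1 on the - strand are counted separately, as in the table.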

GNU Datamash is free and open source software; see the Datamash website for more details.

GNU Datamash is also available as a command-line program, see http://www.gnu.org/software/datamash/download/ .

For more details about the supported statistical operations, see the Datamash website.