TIP: Input data must be TAB delimited. If the desired dataset does not appear in the input list, use Text Manipulation->Convert to convert it to Tabular type.
Syntax
This tool performs common operations (such as sum, count, mean, and standard deviation) on a tabular input file. The tool can also optionally group the input by a given field.
Example 1
Find the average score of college students in a statistics course, grouped by their college major. The input file has three fields (Name, Major, Score) and a header line:
Name     Major            Score
Bryan    Arts             68
Isaiah   Arts             80
Gabriel  Health-Medicine  100
Tysza    Business         92
Zackery  Engineering      54
...      ...
Grouping the input by the second column (Major), and performing operations mean and sample standard deviation on the third column (Score), gives:
GroupBy(Major)   mean(Score)  sstdev(Score)
Arts             68.9474      10.4215
Business         87.3636      5.18214
Engineering      66.5385      19.8814
Health-Medicine  90.6154      9.22441
Life-Sciences    55.3333      20.606
Social-Sciences  60.2667      17.2273
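As a rough illustration of what the grouping in Example 1 computes, the Python sketch below groups rows by one column and reports the mean and sample standard deviation of another. It is not how Datamash is implemented. The function name `group_mean_sstdev` is hypothetical, only the five rows visible in the example are used (so the numbers differ from the full-file output), and returning NaN for a group with a single value is an assumption of this sketch.

```python
import math
import statistics
from collections import defaultdict

# Only the five rows visible in the example; the real sample file has more.
SCORES = """Name\tMajor\tScore
Bryan\tArts\t68
Isaiah\tArts\t80
Gabriel\tHealth-Medicine\t100
Tysza\tBusiness\t92
Zackery\tEngineering\t54"""


def group_mean_sstdev(text, group_col=1, value_col=2):
    """Group TAB-delimited rows and summarise one numeric column.

    Column indexes are 0-based here, whereas the tool numbers columns from 1.
    """
    groups = defaultdict(list)
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split("\t")
        groups[fields[group_col]].append(float(fields[value_col]))

    result = {}
    for key in sorted(groups):
        values = groups[key]
        mean = statistics.mean(values)
        # Sample standard deviation needs at least two values;
        # NaN for smaller groups is an assumption of this sketch.
        sd = statistics.stdev(values) if len(values) > 1 else math.nan
        result[key] = (mean, sd)
    return result


summary = group_mean_sstdev(SCORES)
for major, (mean, sd) in summary.items():
    print(f"{major}\t{mean:g}\t{sd:g}")
```

With only two Arts rows available here, the Arts mean is 74 rather than the 68.9474 obtained from the complete file.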
This sample file is available at http://www.gnu.org/software/datamash .
Example 2
Using the UCSC RefSeq Human Gene Track, available at: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
List the number and identifiers of isoforms per gene. The gene identifier is in column 13, and the isoform/transcript identifier is in column 2. Grouping by column 13 and performing the count and collapse ("Combine all values") operations on column 2 gives:
GroupBy(field-13)  count(field-2)  collapse(field-2)
A1BG               1               NM_130786
A1BG-AS1           1               NR_015380
A1CF               6               NM_001198818,NM_001198819,NM_001198820,NM_014576,NM_138932,NM_138933
A2M                1               NM_000014
A2M-AS1            1               NR_026971
A2ML1              2               NM_001282424,NM_144670
...
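The count and collapse operations can be pictured with a short Python sketch. The function name `count_and_collapse` is hypothetical, and the input is reduced to (transcript, gene) pairs taken from a few of the output rows shown for this example, rather than the full 13+ column refGene.txt records.

```python
from collections import defaultdict

# (transcript, gene) pairs taken from the output of this example.
# In refGene.txt these values sit in columns 2 and 13 respectively.
PAIRS = [
    ("NM_130786", "A1BG"),
    ("NR_015380", "A1BG-AS1"),
    ("NM_001282424", "A2ML1"),
    ("NM_144670", "A2ML1"),
]


def count_and_collapse(pairs):
    """Return (gene, number of transcripts, comma-joined transcripts) per gene."""
    groups = defaultdict(list)
    for transcript, gene in pairs:
        groups[gene].append(transcript)
    return [(gene, len(ids), ",".join(ids)) for gene, ids in sorted(groups.items())]


table = count_and_collapse(PAIRS)
for gene, n, ids in table:
    print(f"{gene}\t{n}\t{ids}")
```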
Count how many transcripts are listed for each chromosome and strand. The chromosome is in column 3, the strand is in column 4, and the transcript identifier is in column 2. Grouping by columns 3 and 4 and performing the count operation on column 2 gives:
GroupBy(field-3)  GroupBy(field-4)  count(field-2)
chr1              +                 2456
chr1              -                 2431
chr2              +                 1599
chr2              -                 1419
chr3              +                 1287
chr3              -                 1249
...
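Grouping by two columns simply means that the pair of values acts as the group key. The Python sketch below shows the idea on a handful of made-up rows; the transcript names `tx1` to `tx4` and the function name `count_by_chrom_strand` are hypothetical and are not taken from the RefSeq track.

```python
from collections import Counter

# Hypothetical (transcript, chromosome, strand) rows, for illustration only.
TRANSCRIPTS = [
    ("tx1", "chr1", "+"),
    ("tx2", "chr1", "+"),
    ("tx3", "chr1", "-"),
    ("tx4", "chr2", "+"),
]


def count_by_chrom_strand(rows):
    """Count rows per (chromosome, strand) pair, sorted by that pair."""
    counts = Counter((chrom, strand) for _tx, chrom, strand in rows)
    return [(chrom, strand, n) for (chrom, strand), n in sorted(counts.items())]


counts = count_by_chrom_strand(TRANSCRIPTS)
for chrom, strand, n in counts:
    print(f"{chrom}\t{strand}\t{n}")
```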
GNU Datamash is Free and Open Source Software; see the Datamash website (http://www.gnu.org/software/datamash) for more details.
GNU Datamash is also available as a command-line program; see http://www.gnu.org/software/datamash/download/ .
For more details about the supported statistical operations, see the Datamash website.