datamash (operations on tabular data)
datamash
$header_in
$header_out
$need_sort
$print_full_line
$ignore_case
#if str($grouping).strip()
--group '$grouping'
#end if
#for $oper in $operations
${oper.op_name}
${oper.op_column}
#end for
< $in_file > $out_file
^[0-9, ]*$
.. class:: infomark
**TIP:** Input data must be TAB-delimited. If the desired dataset does not appear in the input list, use *Text Manipulation->Convert* to convert it to **Tabular** type.
-----
**Syntax**
This tool performs common operations (such as sum, count, mean, and standard deviation) on a tabular input file. It can also optionally group the input by one or more fields.
-----
**Example 1**
- Find the average score of college students in a statistics course, grouped by college major. The input file has three fields (Name, Major, Score) and a header line::
Name Major Score
Bryan Arts 68
Isaiah Arts 80
Gabriel Health-Medicine 100
Tysza Business 92
Zackery Engineering 54
...
...
- Grouping the input by the second column (*Major*) and performing the **mean** and **sample standard deviation** operations on the third column (*Score*) gives::
GroupBy(Major) mean(Score) sstdev(Score)
Arts 68.9474 10.4215
Business 87.3636 5.18214
Engineering 66.5385 19.8814
Health-Medicine 90.6154 9.22441
Life-Sciences 55.3333 20.606
Social-Sciences 60.2667 17.2273
This sample file is available at http://www.gnu.org/software/datamash .
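To make the grouping in Example 1 concrete, here is a minimal Python sketch of the same kind of computation, applied only to the five input rows shown above. It is not Datamash's implementation. The function name ``group_mean_sstdev`` and the use of NaN for groups with a single value are assumptions made for this illustration.

```python
import csv
import io
import math
import statistics
from collections import defaultdict

# First rows of the example input (tab-delimited, with a header line).
SAMPLE = (
    "Name\tMajor\tScore\n"
    "Bryan\tArts\t68\n"
    "Isaiah\tArts\t80\n"
    "Gabriel\tHealth-Medicine\t100\n"
    "Tysza\tBusiness\t92\n"
    "Zackery\tEngineering\t54\n"
)


def group_mean_sstdev(text, group_col, value_col):
    """Return {group: (mean, sample standard deviation)}; columns are 1-based."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    next(rows)  # skip the header line
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col - 1]].append(float(row[value_col - 1]))
    result = {}
    for key, values in groups.items():
        # The sample standard deviation needs at least two values; NaN is
        # used for single-value groups (an assumption of this sketch).
        sd = statistics.stdev(values) if len(values) > 1 else math.nan
        result[key] = (statistics.mean(values), sd)
    return result


summary = group_mean_sstdev(SAMPLE, group_col=2, value_col=3)
for major in sorted(summary):
    mean, sd = summary[major]
    print(f"{major}\t{mean:g}\t{sd:g}")
```

Because only five rows are used here, the numbers differ from the full-file results listed above.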
**Example 2**
- This example uses the UCSC RefSeq Human Gene Track, available at: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
- List the number and identifiers of isoforms per gene. The gene identifier is in column 13 and the isoform/transcript identifier is in column 2. Grouping by column 13 and performing **count** and **Combine all values** on column 2 gives::
GroupBy(field-13) count(field-2) collapse(field-2)
A1BG 1 NM_130786
A1BG-AS1 1 NR_015380
A1CF 6 NM_001198818,NM_001198819,NM_001198820,NM_014576,NM_138932,NM_138933
A2M 1 NM_000014
A2M-AS1 1 NR_026971
A2ML1 2 NM_001282424,NM_144670
...
- Count how many transcripts are listed for each chromosome and strand. The chromosome is in column 3, the strand is in column 4, and the transcript identifier is in column 2. Grouping by columns **3,4** and performing the **count** operation on column 2 gives::
GroupBy(field-3) GroupBy(field-4) count(field-2)
chr1 + 2456
chr1 - 2431
chr2 + 1599
chr2 - 1419
chr3 + 1287
chr3 - 1249
...
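The two groupings in Example 2 can be sketched in Python as follows. This is an illustration only, not how Datamash works internally. The records are made up (they are not real refGene data), and the helper names ``make_row`` and ``group_values`` are hypothetical. Only the column positions (2, 3, 4, and 13) follow the example.

```python
from collections import defaultdict


def make_row(transcript, chrom, strand, gene):
    """Build a 13-field row with only the columns used here filled in."""
    row = ["."] * 13
    row[1], row[2], row[3], row[12] = transcript, chrom, strand, gene
    return row


# Made-up records for illustration only; not real refGene data.
ROWS = [
    make_row("TX1", "chr1", "+", "GENE_A"),
    make_row("TX2", "chr1", "+", "GENE_A"),
    make_row("TX3", "chr1", "-", "GENE_B"),
    make_row("TX4", "chr2", "+", "GENE_C"),
]


def group_values(rows, group_cols, value_col):
    """Collect value_col for each distinct combination of group_cols (1-based)."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c - 1] for c in group_cols)
        groups[key].append(row[value_col - 1])
    return groups


# Count and collapse transcripts (column 2) per gene (column 13).
per_gene = {
    key: (len(vals), ",".join(vals))
    for key, vals in group_values(ROWS, [13], 2).items()
}

# Count transcripts (column 2) per chromosome and strand (columns 3 and 4).
per_chrom_strand = {
    key: len(vals) for key, vals in group_values(ROWS, [3, 4], 2).items()
}

for key, (count, collapsed) in sorted(per_gene.items()):
    print("\t".join(key), count, collapsed, sep="\t")
for key, count in sorted(per_chrom_strand.items()):
    print("\t".join(key), count, sep="\t")
```

Grouping by several columns simply uses the tuple of their values as the group key, which is why columns **3,4** yield one output row per chromosome and strand pair.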
-----
**GNU Datamash** is free and open source software; see the Datamash_ website for more details.
**GNU Datamash** is also available as a command-line program; see http://www.gnu.org/software/datamash/download/ .
For more details about the supported statistical operations, see the Datamash_ website.
.. _Datamash: http://www.gnu.org/software/datamash/