ARTS: Automated Randomization of multiple Traits for Study design
Written by Mark Maienschein-Cline
mmaiensc@gmail.com
Center for Research Informatics
University of Illinois at Chicago

ARTS uses a genetic algorithm to optimize (minimize) a mutual information-based objective function, obtaining
an optimal randomization for studies of arbitrary size and design.

The publication for this code is in preparation; citation to be added soon (hopefully!). When it is published,
the section of the supplementary information will give more details about usage (in addition to what's below).

Please contact me at the email above with questions.


There are two ways of using this code: command-line (it's a perl script), or through Galaxy.

You can learn about, and download, Galaxy at http://galaxyproject.org.

################
# INSTALLATION #
################

#
# Command line version:
#
No installation needed, as long as you have a perl interpreter. Should work fine on a Mac or Linux system;
probably fine on Windows, but I haven't tested it.

#
# Galaxy version:
#
Two options:
1) You can download this tool from the Galaxy toolshed directly into your installation.
2) Move the ARTS.pl and .xml files into tools/ in your Galaxy distribution, and edit the tool configuration file
(tool_conf.xml in a standard Galaxy distribution) appropriately. If you don't know how to do this, you should
probably use strategy #1.

###########
# RUNNING #
###########

#
# Galaxy version
#
Once you get the tools installed in Galaxy, there are help sections in the tool descriptions you can refer to.
Also refer to the instructions for the command-line version below.

#
# Command line version:
#

Run ARTS.pl without any inputs to see the usage. All inputs are specified using the usual [-flag] [value]
syntax (e.g., -i input.txt).

Sample command using the sample_data.txt file:
./ARTS.pl -i sample_data.txt -c "2,3,4,5;2;3;4;5" -b 10 -o batched_data.txt -cc 2,4 -cd 4


More information about the inputs (*'ed remarks refer to the values in the sample command above):

-i  Input trait table: tab-delimited table, including 1 header line. See sample_data.txt for an example.
    You can prepare this table in Excel and save as a tab-delimited text, or just write it in a text file,
    or copy-paste from Excel to a text file. You can have more columns than you will actually care about
    randomizing here.
    * You can use the file sample_data.txt as an example input; there are 5 columns, Sample ID, Age, Sex,
      Collection Date, and Disease.

-c  Trait columns to randomize. This is a comma- and semicolon-delimited list. Its syntax is important,
    so pay attention.
    Columns are numbered starting from 1. Traits that should be considered jointly should be listed together
    separated by commas. Each set of jointly considered traits should be listed separated by semicolons. Hence,
    * -c "2,3,4,5;2;3;4;5" says to consider all the traits (columns 2-5) jointly (that's the 2,3,4,5 part), AND
      to consider each trait individually (that's the ;2;3;4;5 part).
    You could opt to only consider traits individually (-c "2;3;4;5"), or only jointly (-c "2,3,4,5"), or only
    pair-wise (-c "2,3;2,4;2,5;3,4;3,5;4,5"), or whatever you want.
    OUR GENERAL-PURPOSE RECOMMENDATION is to consider all traits jointly, plus all individually, as in the sample
    command. This corresponds to the MMI statistic discussed in the publication.
    GALAXY USERS: you just get to select the columns to consider, and the script will use the MMI statistic
    automatically (you don't get a choice).
    FINAL NOTE: you should put quotes around the value here, since otherwise the shell will interpret the
    semicolons as the end of the command.
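
To make the -c syntax concrete, here is a small Python sketch (not part of ARTS.pl; the function names
parse_trait_spec and mmi_spec are made up for this example) that splits a -c value into its trait groups
and builds the recommended "all joint + all individual" value from a list of columns.

```python
def parse_trait_spec(spec):
    """Split a -c style string into a list of trait-column groups."""
    groups = []
    for group in spec.split(";"):
        group = group.strip()
        if not group:
            continue  # tolerate a stray trailing semicolon
        # Columns inside one group are comma-separated and 1-based.
        groups.append([int(col) for col in group.split(",")])
    return groups


def mmi_spec(columns):
    """Build the recommended value: all columns jointly, then each one alone."""
    joint = ",".join(str(c) for c in columns)
    singles = ";".join(str(c) for c in columns)
    return joint + ";" + singles


print(parse_trait_spec("2,3,4,5;2;3;4;5"))
print(mmi_spec([2, 3, 4, 5]))
```

The first call returns one group with four columns followed by four single-column groups; the second
rebuilds the value used in the sample command.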

-b  Batch size (number of samples that can be processed at the same time). You have two options:
    1) Enter a single number. This will fill as many complete batches as possible, and put the remainder into a smaller
       batch. This is probably convenient, but you should do a quick count to make sure you don't end up with a really
       small last batch (e.g., if you have 105 samples and do batch size of 25, your last batch will only have 5 samples).
    2) Enter a comma-delimited list that adds up to the number of samples, which allows for uneven batch sizes.
       For example, -b 10,10,9,9 for 38 samples. If your math doesn't add up, the program will exit and let you know.
    * sample_data.txt has 30 samples, so "-b 10" makes 3 batches of 10 samples each.
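
If you want to do the "quick count" before running, the arithmetic for both -b options can be sketched in
Python as follows. This is only an illustration of the rules described above, not code taken from ARTS.pl;
the function name batch_sizes and the error message are made up.

```python
def batch_sizes(n_samples, spec):
    """Return the list of batch sizes implied by a -b style value."""
    sizes = [int(s) for s in str(spec).split(",")]
    if any(size <= 0 for size in sizes):
        raise ValueError("batch sizes must be positive")
    if len(sizes) == 1:
        # Single number: as many full batches as possible, then the remainder.
        full, remainder = divmod(n_samples, sizes[0])
        return [sizes[0]] * full + ([remainder] if remainder else [])
    # Explicit list: it must add up to the number of samples.
    if sum(sizes) != n_samples:
        raise ValueError(
            "batch sizes add up to %d, but there are %d samples" % (sum(sizes), n_samples)
        )
    return sizes


print(batch_sizes(30, "10"))
print(batch_sizes(105, "25"))
print(batch_sizes(38, "10,10,9,9"))
```

The second call shows the small-last-batch case from option 1: four batches of 25 and a final batch of 5.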

-o  Output file. Self-explanatory. The batch assignments are added as an extra column on the end; otherwise it
    looks like the input.
    * batched_data.txt is our output file.

-p  (sort-of optional: you MUST use both -b and -o, OR just -p) Print (to STDOUT) the statistics of an existing
    batch assignment, using this column as the batch column. The result will look like the last part of the STDOUT
    from an ARTS run (see below). Use this option to test batch assignments from another algorithm, or ones you
    made by hand.

-cc Indices of continuously-valued columns (comma-delimited list). ARTS uses discrete values for its statistics,
    so these columns must be discretized (binned). If ARTS encounters a column with more than 20 distinct values,
    it will generate a warning asking if you want it to be continuous.
    * In sample_data.txt, columns 2 (age) and 4 (date) could be considered continuous (that is, it's worth treating
      a 35 year-old similarly to a 36 year-old), so we set "-cc 2,4".

-cd Date-valued columns. These columns should also be listed under -cc, but this lets ARTS know to expect a date
    (format MUST be M/D/Y, where month is a number (1 instead of January)) and convert the date to a number before
    binning.
    * In sample_data.txt, column 4 is a date, so set "-cd 4".
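
As an illustration of what "convert the date to a number" can mean, the Python sketch below turns an M/D/Y
string into a day count and back. The choice of day count (Python's date ordinal) is an assumption made for
this example; this README does not say which numbering ARTS.pl uses internally, and the function names are
made up.

```python
from datetime import date


def date_to_number(text):
    """Convert an M/D/Y string (numeric month) to a day count."""
    month, day, year = (int(part) for part in text.strip().split("/"))
    return date(year, month, day).toordinal()


def number_to_date(number):
    """Convert a day count back to an M/D/Y string."""
    d = date.fromordinal(int(number))
    return "%d/%d/%d" % (d.month, d.day, d.year)


first = date_to_number("1/15/2010")
last = date_to_number("7/27/2013")
print(last - first)                      # span of the sample dates, in days
print(number_to_date(first + (last - first) / 5))
```

Once dates are numbers, they can be binned like any other continuous column.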

-cb Number of bins to use for discretizing the continuous columns. Again, you can set a single value, or give a comma-
    delimited list, which will match the order of the list given in the -cc flag.
    * For the sample run, we left the default value of 5, but we could do, for example, "-cb 5,7", which would bin
      the ages into 5 bins and the dates into 7 bins (since we set "-cc 2,4", and column 2 was age, column 4 was date).
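
The sketch below shows one simple way to discretize a continuous column: equal-width bins between the
smallest and largest value. It is an assumption that this matches what ARTS.pl does in every detail (for
example, how values exactly on a bin boundary are handled), and the function names are made up. The ages
are the Age column of sample_data.txt as shown in the output file later in this README.

```python
def bin_edges(values, n_bins=5):
    """Edges of n_bins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value, edges):
    """0-based index of the bin holding value; the maximum goes in the last bin."""
    n_bins = len(edges) - 1
    width = (edges[-1] - edges[0]) / n_bins
    if width == 0:
        return 0  # all values identical: a single bin
    return min(int((value - edges[0]) / width), n_bins - 1)


ages = [25, 37, 36, 52, 48, 60, 31, 28, 26, 44, 33, 25, 28, 30, 51,
        22, 28, 19, 35, 38, 25, 55, 33, 24, 42, 60, 34, 37, 41, 50]
edges = bin_edges(ages, 5)
counts = [0] * 5
for age in ages:
    counts[bin_index(age, edges)] += 1

print([round(e, 1) for e in edges])
print(counts)
```

For these ages (19 to 60) and 5 bins, the edges come out as 19, 27.2, 35.4, 43.6, 51.8, and 60, which are the
same age ranges that appear in the sample STDOUT below.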

-bn Name for the batch column added to the output. Default is "batch".

-s  Random number seed. Set as a large negative integer. By default the code uses the same fixed seed on every
    run; use this option if you want to rerun with a different seed.

----------------------------------------------

When you run the sample command, the STDOUT looks like this (I added the "N)" line numbers at the start of each line):

"""""""""""""""""""
1)  Using traits:	Age	Sex	Collection date	Disease
2)  Using trait combinations:	{Age,Sex,Collection date,Disease}	{Age}	{Sex}	{Collection date}	{Disease}
3)    Generation 1 of 300, average fitness 0.1432
4)    Generation 2 of 300, average fitness 0.1342
5)    Generation 3 of 300, average fitness 0.1298
6)    Generation 4 of 300, average fitness 0.1279
7)    Generation 5 of 300, average fitness 0.1250
8)    Generation 6 of 300, average fitness 0.1227
9)    Generation 7 of 300, average fitness 0.1211
10)   Generation 8 of 300, average fitness 0.1194
11)   Generation 9 of 300, average fitness 0.1187
12)   Generation 10 of 300, average fitness 0.1181
13)   Generation 11 of 300, average fitness 0.1175
14)   Generation 12 of 300, average fitness 0.1165
15)   Generation 13 of 300, average fitness 0.1143
16)   Generation 14 of 300, average fitness 0.1133
17)   Generation 15 of 300, average fitness 0.1132
18)   Generation 16 of 300, average fitness 0.1127
19)   Generation 17 of 300, average fitness 0.1123
20)   Generation 18 of 300, average fitness 0.1116
21)   Generation 19 of 300, average fitness 0.1119
22)   Generation 20 of 300, average fitness 0.1113
23)   Generation 21 of 300, average fitness 0.1113
24)   Generation 22 of 300, average fitness 0.1110
25)   Generation 23 of 300, average fitness 0.1110
26) Final MI 0.1045 ; Individual trait MIs (mean 0.0091 ): 	0.0155	0.0000	0.0209	0.0000
27) -----------------------------------------------------------------
28) 	Age values					Sex values		Collection date values					Disease values
29) Batch (size)	19-27.2	35.4-43.6	51.8-60	43.6-51.8	27.2-35.4	M	F	2/26/2012-11/11/2012	11/11/2012-7/27/2013	6/14/2011-2/26/2012	9/29/2010-6/14/2011	1/15/2010-9/29/2010	Y	N
30) -------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------
31) 1 (10)	2	2	2	1	3	5	5	3	2	2	2	1	5	5
32) 2 (10)	2	2	1	2	3	5	5	2	2	4	1	1	5	5
33) 3 (10)	3	2	1	1	3	5	5	3	2	2	2	1	5	5
34) -------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------	-------
35) Total	7	6	4	4	9	15	15	8	6	8	5	3	15	15
"""""""""""""""""""

Here's what the lines mean:
1)  Tells you what traits you've selected.
2)  Tells you what trait combinations you've selected.
3-25) Prints the progress for each generation of the genetic algorithm (GA). The GA is considered converged when
    the average fitness changes by less than 0.0001.
26) Final objective function value. Normalized between 0 and 1, ideal case is 0. Note that different choices of the
    objective function ARE NOT COMPARABLE: if you select fewer traits, or simpler combinations of traits (fewer
    joint traits) using different -c values, you will get lower MI values, but this does not necessarily indicate better
    overall randomization, because your choices may be overly simplistic. This is why we recommend sticking with the
    MMI definition (all joint + all individual) consistently. This line also gives the randomization values for all
    individual traits.
27-34) Individual trait counts per batch for different values. Continuously-valued columns are given as a range
    (e.g., age 19-27.2).
35) Total number of samples with each trait value (bin), summed over all batches.
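
To give a feel for the individual trait MIs on line 26, here is a Python sketch of the plain mutual information
between batch labels and one discrete trait. This is the textbook quantity, not ARTS's exact objective: the
normalization to the 0-1 range and the handling of joint trait combinations are not described in this README,
so they are left out. The function name is made up.

```python
from collections import Counter
from math import log


def mutual_information(batches, values):
    """Mutual information (in nats) between batch labels and one discrete trait."""
    n = len(batches)
    joint = Counter(zip(batches, values))
    batch_counts = Counter(batches)
    value_counts = Counter(values)
    mi = 0.0
    for (batch, value), count in joint.items():
        # p(b,v) * log( p(b,v) / (p(b) * p(v)) ), written with raw counts.
        mi += (count / n) * log(count * n / (batch_counts[batch] * value_counts[value]))
    return mi


# Sex in the sample run: 5 M and 5 F in each of the 3 batches.
batches = [1] * 10 + [2] * 10 + [3] * 10
sex = (["M"] * 5 + ["F"] * 5) * 3
print(mutual_information(batches, sex))

# Worst case for two batches: all M in one batch, all F in the other.
print(mutual_information([1, 1, 2, 2], ["M", "M", "F", "F"]))
```

A perfectly balanced trait such as Sex gives zero, consistent with the 0.0000 reported for it on line 26; a
trait that is completely confounded with batch gives the largest value.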

----------------------------------------------

The output, batched_data.txt, will look like this:

"""""""""""""""""""
Sample ID       Age     Sex     Collection date Disease batch
sample1 25      M       3/28/2012       Y       3
sample2 37      F       4/27/2013       N       3
sample3 36      F       3/10/2013       N       1
sample4 52      M       7/1/2012        Y       1
sample5 48      M       8/13/2011       Y       3
sample6 60      M       9/21/2011       N       3
sample7 31      F       10/22/2010      Y       3
sample8 28      F       1/15/2010       N       2
sample9 26      M       1/7/2012        N       1
sample10        44      F       4/5/2012        Y       1
sample11        33      M       5/18/2012       N       3
sample12        25      F       7/27/2013       N       3
sample13        28      M       1/20/2013       Y       2
sample14        30      F       8/11/2012       Y       3
sample15        51      M       11/23/2011      N       2
sample16        22      M       12/21/2011      N       2
sample17        28      M       9/26/2010       Y       1
sample18        19      F       1/18/2010       Y       3
sample19        35      M       2/10/2012       N       1
sample20        38      F       2/17/2012       N       2
sample21        25      F       4/28/2012       Y       1
sample22        55      M       1/7/2013        Y       2
sample23        33      F       6/30/2013       N       1
sample24        24      M       7/1/2012        Y       2
sample25        42      M       2/15/2011       N       3
sample26        60      M       5/21/2011       N       1
sample27        34      F       10/23/2010      Y       2
sample28        37      F       12/18/2010      Y       1
sample29        41      F       11/7/2012       N       2
sample30        50      F       2/15/2012       Y       2
"""""""""""""""""""

It looks the same as the input file, with a sixth column titled "batch" added, saying which of the three
batches each sample should be processed in (of course, you can permute the order of batches if you want).

The included file batched_data.txt shows what the output should look like.
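
If you want to check an output file yourself, the Python sketch below counts how often each value of one trait
occurs in each batch. It is independent of ARTS.pl, and the function name tally_by_batch is made up. To keep
the example self-contained it reads the first six rows of the output shown above from a string; for a real
file, pass open("batched_data.txt", newline="") instead.

```python
import csv
import io
from collections import Counter, defaultdict


def tally_by_batch(handle, trait, batch_column="batch"):
    """Count how often each value of `trait` occurs in each batch."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row[batch_column]][row[trait]] += 1
    return counts


# First six rows of the sample output, tab-delimited.
EXCERPT = (
    "Sample ID\tAge\tSex\tCollection date\tDisease\tbatch\n"
    "sample1\t25\tM\t3/28/2012\tY\t3\n"
    "sample2\t37\tF\t4/27/2013\tN\t3\n"
    "sample3\t36\tF\t3/10/2013\tN\t1\n"
    "sample4\t52\tM\t7/1/2012\tY\t1\n"
    "sample5\t48\tM\t8/13/2011\tY\t3\n"
    "sample6\t60\tM\t9/21/2011\tN\t3\n"
)

counts = tally_by_batch(io.StringIO(EXCERPT), "Sex")
for batch in sorted(counts):
    print(batch, dict(counts[batch]))
```

Note that batch labels are read as text here, so they are compared as strings ("1", "2", "3").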
author	mmaiensc
date	Wed, 13 Nov 2013 16:29:30 -0500