
<tool id="prep_data" name="PreProcess Expression Data" version="1.3.1">
  <description>Step 1 of analysis: discretizes expression data</description>
  <command interpreter="perl">PreProcess_Expression_Data.pl -i $data -c $compslist -a $annot -o $output -p $thresh -l $log -v n</command>
  <inputs>
    <param format="txt" name="data" type="data" label="Expression data"/>
    <param format="txt" name="compslist" type="data" label="Comparison list"/>
    <param format="txt" name="annot" type="data" label="Annotation file"/>
    <param name="thresh" type="float" min="0" max="1" label="Percentile threshold" value="0.63" optional="true"/>
    <param name="log" type="select" label="Log transform data?">
        <option value="n" selected="true">No</option>
        <option value="y">Yes</option>
    </param>
  </inputs>
  <outputs>
    <data format="txt" name="output"/>
  </outputs>

  <tests>
    <test>
      <param name="data" value="EMBER/expression.txt"/>
      <param name="compslist" value="EMBER/comparisons_list.txt"/>
      <param name="annot" value="EMBER/annotation.txt"/>
      <param name="thresh" value="0.63"/>
      <param name="log" value="n"/>
      <output name="output" file="EMBER/expression_profiles.txt"/>
    </test>
  </tests>

  <help>

This tool discretizes the gene expression data and adds genomic annotations.

More options for the EMBER tools (especially for the main program, EMBER, including searching for multiple expression patterns) are offered in the command-line version, which can be downloaded from http://dinner-group.uchicago.edu/downloads.html. That package also includes test data and sample outputs.

When using any of the EMBER tools, please cite: M Maienschein-Cline, J Zhou, KP White, R Sciammas, and AR Dinner. Discovering transcription factor regulatory targets using gene expression and binding data. *Bioinformatics*, 28:206-213 (2012).

-----

Description of inputs:

*Expression Data*:

   Microarray data, with data from N experiments (and at least 2 replicates per condition).

   *Format (N+1 columns)*: [ID] [expt 1 value] [expt 2 value] ... [expt N value]

   IMPORTANT: the first line should be a title line, first field "#ID", and subsequent fields giving the condition/replicate for each column, i.e.,

      #ID [condition]#[replicate]...

   where [condition] matches the values in the Comparison List, and [replicate] gives the replicate number for that column. [condition] and [replicate] are delimited by a "#" (so don't use that character in the condition name).
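
As a quick check before uploading, the title line can be validated with a few lines of Python. This is only a sketch: the function name and the example line are hypothetical, and it assumes the file is tab-delimited, which the tool's scripts may not strictly require.

```python
def parse_title_line(line):
    """Split an expression-data title line into (condition, replicate) pairs."""
    # Assumption: fields are separated by tabs.
    fields = line.rstrip("\r\n").split("\t")
    if fields[0] != "#ID":
        raise ValueError("first field of the title line must be '#ID'")
    columns = []
    for field in fields[1:]:
        # Exactly one '#' must separate the condition name from the replicate.
        parts = field.split("#")
        if len(parts) != 2 or not parts[0] or not parts[1]:
            raise ValueError("expected [condition]#[replicate], got %r" % field)
        columns.append((parts[0], parts[1]))
    return columns


# Hypothetical title line: two conditions with two replicates each.
example = "#ID\twt#1\twt#2\tko#1\tko#2"
print(parse_title_line(example))
```

A condition name that itself contains "#" makes the split produce more than two parts, so the sketch rejects it, in line with the restriction above.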

*Comparison List*:

   List of behavior dimension definitions. Each [condition] must match a condition name used in the title line of the expression data.

   *Format (2 columns)*: [condition1] [condition2]

*Annotation File*:

   Gives the genomic coordinates of each probe set.

   *Format (6 columns)*: [probe id] [gene name] [chromosome] [start] [end] [strand]

*Percentile Threshold* (p):

   Used to eliminate genes that are consistently expressed at a very low level. All data are concatenated into one list, and the pth percentile of that list is taken as the threshold. A probe set is then removed if its value is less than the threshold in ALL conditions.

   p = 1.0 means all probes are retained, p = 0.0 means none are. However, note that this does NOT necessarily imply that 0.63 means 63% of probe sets are retained.
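
The following Python sketch illustrates why p is not the fraction of probe sets retained. It is not the code used by the tool: the function name and data are hypothetical, and it assumes the percentile is counted from the top of the pooled list, an assumption chosen so that p = 1.0 keeps every probe as stated above. The actual script may compute the threshold differently.

```python
def filter_probes(profiles, p):
    """Drop probes whose values are below the threshold in ALL conditions.

    profiles: dict mapping probe id to a list of expression values.
    p: fraction in [0, 1]; ASSUMED to be counted from the top of the pooled list.
    """
    # Pool every value from every probe and sort from highest to lowest.
    pooled = sorted((v for values in profiles.values() for v in values), reverse=True)
    # Position of the threshold in the descending list (assumption, see above).
    index = min(int(p * len(pooled)), len(pooled) - 1)
    threshold = pooled[index]
    # Keep a probe if at least one of its values reaches the threshold.
    return {pid: values for pid, values in profiles.items() if max(values) >= threshold}


# Hypothetical data: four probes, two values each.
profiles = {"a": [1.0, 1.2], "b": [8.0, 9.0], "c": [1.5, 9.5], "d": [0.5, 0.8]}
print(sorted(filter_probes(profiles, 0.5)))
```

With these made-up values, p = 0.5 keeps three of the four probes, because a probe survives as long as any one of its values reaches the threshold.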

*Log Transform*: whether or not to take the log of the data before discretization.

A note on preparing the expression data: it may be convenient to prepare the data in an Excel worksheet, copying and pasting the expression levels from different experiments into the same file and adding titles in the first column. However, I have had some issues with then saving the file as tab-delimited text, as the line break character used by Excel is not always recognized by the text-processing routines in the scripts. The safest choice may be to select and copy the data from the open Excel worksheet and paste it into a text editor, which has worked for me.
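
If a file has already been saved from Excel, another option is to rewrite its line endings. The sketch below assumes the problem is carriage-return line breaks, which the note above does not state explicitly. The function name and file names are hypothetical.

```python
def normalize_newlines(src_path, dst_path):
    """Rewrite a text file so every line ends with a Unix newline."""
    with open(src_path, "rb") as src:
        raw = src.read()
    # Convert Windows line breaks (CR LF) first, then any remaining bare CR.
    fixed = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(dst_path, "wb") as dst:
        dst.write(fixed)


# Example call with hypothetical file names:
# normalize_newlines("expression_from_excel.txt", "expression.txt")
```

Working on raw bytes leaves the tabs and values untouched and changes only the line-break characters.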

  </help>

</tool>
author	mmaiensc
date	Thu, 22 Mar 2012 13:49:52 -0400