changeset 28:184d14e4270d

Update to Miller Lab devshed revision 4ede22dd5500
author Richard Burhans <burhans@bx.psu.edu>
date Wed, 17 Jul 2013 12:46:46 -0400
parents 8997f2ca8c7a
children fb944979bf35
files find_intervals.xml multiple_to_gd_genotype.py multiple_to_gd_genotype.xml rank_pathways.xml
diffstat 4 files changed, 228 insertions(+), 78 deletions(-) [+]
line wrap: on
line diff
--- a/find_intervals.xml	Mon Jul 15 10:47:35 2013 -0400
+++ b/find_intervals.xml	Wed Jul 17 12:46:46 2013 -0400
@@ -84,59 +84,75 @@
   </tests>
 
   <help>
-
 **Dataset formats**
 
-The input dataset is tabular_, with required columns of chromosome, position,
-and score (in any column).
-The output dataset is interval_.  (`Dataset missing?`_)
+The input dataset is tabular_ (which includes gd_snp_ and gd_genotype_),
+with required columns of chromosome, position, and score (in any column).
+The output dataset is interval_. (`Dataset missing?`_)
 
+.. _tabular: ./static/formatHelp.html#tab
+.. _gd_snp: ./static/formatHelp.html#gd_snp
+.. _gd_genotype: ./static/formatHelp.html#gd_genotype
 .. _interval: ./static/formatHelp.html#interval
-.. _tabular: ./static/formatHelp.html#tab
 .. _Dataset missing?: ./static/formatHelp.html
 
 -----
 
 **What it does**
 
-The user selects a tabular dataset (such as a gd_snp dataset) and 
-if the dataset is not also gd_snp format, specifies 
-the columns containing chromosome, position, and scores (such as an Fst-value for the SNP). 
-For gd_snp format the metadata can be used to specify the chromosome and 
-position.
-Other inputs include
-a percentage or raw score for the "score-shift" which should be greater than the 
-average value for the scores column.  A higher value will give smaller intervals
-in the output.
-If a percentage (e.g. 95%) is specified
-then that percentile of the scores is used as the shift; 
-percentile may not work well if many rows or SNPs have the same score
-(in that case use a raw score).  The program subtracts the
-shift from every score, then finds genomic intervals (i.e., consecutive runs
-of SNPs) whose total score cannot be increased by adding or subtracting one
-or more adjusted scores at the ends of the interval.
-Another input is the number of times the
-data should be randomized (only intervals with score exceeding the maximum for
-the randomized data are reported).  
-If 100 shuffles are requested, then any interval reported by the tool has a 
-score with probability less than 0.01 of being equaled or exceeded by chance.
+The user selects a tabular dataset (such as the SNV formats gd_snp and
+gd_genotype) and if the dataset is not in an SNV format, specifies the
+columns containing chromosome, position, and scores (such as an FST-value
+for the SNP).  With SNV formats, the metadata tells which columns hold the
+chromosome and position.  Other inputs include a percentage or raw score
+for the "score-shift" which should be greater than the average value
+for the scores column.  A higher value will give smaller intervals in
+the output.  If a percentage (e.g. 95%) is specified then that percentile
+of the scores is used as the shift; percentile may not work well if many
+rows or SNPs have the same score (in that case use a raw score).
+
+The program subtracts the shift from every score, then finds genomic
+intervals (i.e., consecutive runs of SNPs) whose total score cannot be
+increased by adding or subtracting one or more adjusted scores at the
+ends of the interval.  Another input is the number of times the data
+should be randomized (only intervals with score exceeding the maximum
+for the randomized data are reported).  If 100 shuffles are requested,
+then any interval reported by the tool has a score with probability
+less than 0.01 of being equaled or exceeded by chance, assuming that
+the scores vary independently by position.
 
 -----
 
 **Example**
 
-- input (gd_snp)::
+- Input (showing only the chromosome, position, and score columns)::
 
-    Contig222_chr2_9817738_9818143   220     C       T       888.0   chr2    9817960         C       17      0       2       78      12      0       2       63      20      0       2       87      8       0       2       51      11      0       2       60      12      0       2       63      Y       76      0.093   1
-    Contig47_chr2_25470778_25471576  126     G       A       888.0   chr2    25470896        G       12      0       2       63      14      0       2       69      14      0       2       69      10      0       2       57      18      0       2       81      13      0       2       66      N       11      0.289   1
+    chr2      39      0.40
+    chr2     103      0.97
+    chr2     188      0.72
+    chr2     203      0.68
+    chr2     321      0.92
+    ...
+    chr2    1132      0.85
+    chr2    1321      0.34
     ...
-    Contig115_chr2_61631913_61632510 310     G       T       999.3   chr2    61632216        G       7       0       2       48      9       0       2       54      7       0       2       48      11      0       2       60      10      0       2       57      10      0       2       57      N       13      0.184   0
-    Contig31_chr2_67331584_67331785  39      C       T       999.0   chr2    67331623        C       11      0       2       60      10      0       2       57      7       0       2       48      9       0       2       54      2       0       2       33      4       0       2       39      N       110     0.647   1
-    etc.
+
+- Suppose the user-specified score-shift is 0.75.  This value is subtracted from each score, giving::
 
-- output not reporting individual positions::
+    chr2      39     -0.35
+    chr2     103      0.22
+    chr2     188     -0.03
+    chr2     203     -0.07
+    chr2     321      0.17
+    ...
+    chr2    1132      0.10
+    chr2    1321     -0.41
+    ...
 
-    chr2    9817960 67331624        1272.2000
+- The output, not reporting individual positions, might be (depending on the values not shown above)::
 
+    chr2    103    1132    1.42
   </help>
 </tool>
+
+
--- a/multiple_to_gd_genotype.py	Mon Jul 15 10:47:35 2013 -0400
+++ b/multiple_to_gd_genotype.py	Wed Jul 17 12:46:46 2013 -0400
@@ -471,6 +471,7 @@
 			else:
 				if echl[0]!='#':
 					appnd=True
+					slf.write(echl)
 		slf.close()
 		#~ 
 		os.system('mv %stmp %s'%(outgd_gentp,outgd_gentp))
--- a/multiple_to_gd_genotype.xml	Mon Jul 15 10:47:35 2013 -0400
+++ b/multiple_to_gd_genotype.xml	Wed Jul 17 12:46:46 2013 -0400
@@ -1,5 +1,5 @@
-<tool id="gd_multiple_to_gd_genotype" name="Multiple" version="1.0.0">
-  <description>: to gd_genotype</description>
+<tool id="gd_multiple_to_gd_genotype" name="Convert" version="1.0.0">
+  <description>: CSV, FSTAT, Genepop or VCF to either gd_snp or gd_genotype</description>
 
   <command interpreter="python">
     #import base64
@@ -11,19 +11,19 @@
     <param name="input" type="data" format="txt" label="Input dataset" />
 
     <param name="input_format" type="select" label="Input format">
-      <option value="csv" selected="true">csv</option>
-      <option value="fstat">fstat</option>
-      <option value="genepop">genepop</option>
-      <option value="vcf">vcf</option>
+      <option value="csv" selected="true">CSV</option>
+      <option value="fstat">FSTAT</option>
+      <option value="genepop">Genepop</option>
+      <option value="vcf">VCF</option>
     </param>
 
-    <param name="species" type="text" label="Species">
+    <param name="species" type="text" label="Focus species">
       <sanitizer>
         <valid initial="string.printable"/>
       </sanitizer>
     </param>
 
-    <param name="dbkey" type="genomebuild" label="Genome" />
+    <param name="dbkey" type="genomebuild" label="Reference species" />
   </inputs>
 
   <outputs>
@@ -48,13 +48,124 @@
 
 **Dataset formats**
 
+The input dataset is formated as VCF, FSTAT, Genepop, or CSV, and is of
+Galaxy datatype text_.  Additionally, the name of the focus species (from
+which the SNPs in the VCF file were obtained) and a reference species
+are required.  The output dataset is in gd_genotype_ or gd_snp_ format.
+
+For input datasets in Genepop, FSTAT, or CSV formats, the program ignores
+population structures as well as alleles other than those encoded by 0,
+1, and 2.  For input datasets in FSTAT format the program accepts up to
+9 digits and for Genepop files only 2 digits.  Chromosome and position
+for each SNPs must be separated by a space or a tab.  Ancestral loci must
+be encoded as 1, derived as 2 and missing as 0.  In all cases ancestral
+and derived SNPs are returned as N.  Alternatively, a dataset in CSV
+format can include nucleotides.  In this case the ancestral nucleotide
+is defined as the most common allele.
+
+.. _text: ./static/formatHelp.html#text
+.. _gd_snp: ./static/formatHelp.html#gd_snp
+.. _gd_genotype: ./static/formatHelp.html#gd_genotype
+
 -----
 
 **What it does**
 
+This tool returns a gd_genotype dataset from VCF formatted files or three
+other conventional population genetics formats (i.e. FSTAT, Genepop,
+and CSV).  For VCF files that include the fields allelic depth, genotype
+quality and genotype ("AD", "GQ", and "GT" respectively in the "FORMAT"
+field) the input dataset can be converted into a gd_snp file.
+
 -----
 
-**Example**
+**Examples**
+
+- If the input format is VCF and includes the fields allelic depth, genotype quality and genotype ("AD", "GQ", and "GT" respectively in the "FORMAT" field).  Focus species name is "aye-aye" and reference species is "Human Feb. 2009 (GRCh37/hg19) (hg19)"::
+
+   #CHROM POS    ID          REF ALT QUAL    FILTER INFO FORMAT         19_F                    19.1_F             19.2_F
+   Chr21  382242 rs134033430 T   C   3296.97 .      .    GT:GQ:DP:PL:AD 1/1:75:26:943,75,0:0,26 1/1:3:1:30,3,0:0,1 ./.
+   Chr21  383680 rs137652597 T   C   2236.62 .      .    GT:GQ:DP:PL:AD 1/1:36:12:436,36,0:0,12 ./.                1/1:3:1:31,3,0:0,1
+   Chr21  387251 .           G   T   2407.88 .      .    GT:GQ:DP:PL:AD 1/1:30:12:394,30,0:0,10 ./.                ./.
+   etc.
+
+- output (gd_snp)::
+
+   #{"column_names":["chr","pos","A","B","Q","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q"],"dbkey":"aye-aye","individuals":[["19_F",6],["19.1_F",10],["19.2_F",14]],"pos":2,"rPos":2,"ref":1,
+   #"scaffold":1,"species":"hg19"}
+   Chr21   382242  T       C       3296.97 0       26      0       75      0       1       0       3       0       0       -1      -1
+   Chr21   383680  T       C       2236.62 0       12      0       36      0       0       -1      -1      0       1       0       3
+   Chr21   387251  G       T       2407.88 0       10      0       30      0       0       -1      -1      0       0       -1      -1
+   etc.
+
+- If the input format is VCF, focus species is "aye-aye" and reference species is "Human Feb. 2009 (GRCh37/hg19) (hg19)"::
+
+   #CHROM POS    ID          REF ALT QUAL    FILTER INFO FORMAT         19_F                    19.1_F             19.2_F
+   Chr21  382242 rs134033430 T   C   3296.97 .      .    GT:GQ:DP:PL    1/1:75:26:943,75,0      1/1:3:1:30,3,0:0,1 ./.
+   Chr21  383680 rs137652597 T   C   2236.62 .      .    GT:GQ:DP:PL    1/1:36:12:436,36,0      ./.                1/1:3:1:31,3,0:0,1
+   Chr21  387251 .           G   T   2407.88 .      .    GT:GQ:DP:PL    1/1:30:12:394,30,0      ./.                ./.
+   etc.
+
+- output (gd_genotype)::
+
+   #{"column_names":["chr","pos","A","B","1G","2G","3G"],"dbkey":"aye-aye","individuals":[["19_F",5],["19.1_F",6],["19.2_F",7]],"pos":2,"rPos":2,"ref":1,"scaffold":1,"species":"hg19"}
+   Chr21   382242  T       C       0       0       -1
+   Chr21   383680  T       C       0       -1      0
+   Chr21   387251  G       T       0       -1      -1
+   etc.
+
+- If the input format is Genepop::
 
+   Microsat on aye-aye from different locations
+        Chr21   382242, Chr21   383680, Chr21   387251
+   POP
+   19_F, 0202 0202 0202
+   Pop
+   19.1_F, 0202 0000 0000
+   19.2_F, 0000 0202 0000
+   etc.
+
+- or the input format is FSTAT::
+
+   300  3  2  1
+   Chr21   382242
+   Chr21   383680
+   Chr21   387251
+      19_F   22 22 22
+      19.1_F   22 00 00
+      19.2_F   00 22 00
+   etc.
+
+- or the input format is CSV::
+
+   ,19_F,19.1_F,19.2_F,...
+   Chr21   382242,22,22,00
+   Chr21   383680,22,00,22
+   Chr21   387251,22,00,00
+   etc.
+
+- output (gd_genotype)::
+
+   #{"column_names":["chr","pos","A","B","1G","2G","3G"],"dbkey":"aye-aye","individuals":[["19_F",5],["19.1_F",6],["19.2_F",7]],"pos":2,"rPos":2,"ref":1,"scaffold":1,"species":"hg19"}
+   Chr21   382242  N       N       0       0       -1
+   Chr21   383680  N       N       0       -1      0
+   Chr21   387251  N       N       0       -1      -1
+   etc.
+
+- if the input format is CSV::
+
+   ,19_F,19.1_F,19.2_F,...
+   Chr21   382242,C,C,N
+   Chr21   383680,C,N,C
+   Chr21   387251,T,N,N
+   etc.
+
+- output (gd_genotype)::
+
+   #{"column_names":["chr","pos","A","B","1G","2G","3G","4G"],"dbkey":"aye-aye","individuals":[["19_F",5],["19.1_F",6],["19.2_F",7],["...",8]],"pos":2,"rPos":2,"ref":1,"scaffold":1,"species":"hg19"}
+   Chr21   382242  C       N       2       2       -1
+   Chr21   383680  T       N       2       -1      2
+   Chr21   387251  T       N       2       -1      -1
+   etc.
   </help>
 </tool>
--- a/rank_pathways.xml	Mon Jul 15 10:47:35 2013 -0400
+++ b/rank_pathways.xml	Wed Jul 17 12:46:46 2013 -0400
@@ -62,11 +62,19 @@
 
 **Dataset formats**
 
-All of the input and output datasets are in tabular_ format.
-The input dataset must have columns with KEGG gene ID and pathways.
-[Need to update this, since input columns now depend on the "Rank by" choice.]
-The output datasets are described below.
-(`Dataset missing?`_)
+The query dataset has a column containing ENSEMBL transcript codes for
+the gene set of interest, while the background dataset has one column
+with ENSEMBL transcript codes and another with GO terms, for some larger
+universe of genes.
+
+All of the input and output datasets are in tabular_ format.  The input
+dataset (i.e. query) to rank by "percentage of genes affected" has a
+column containing ENSEMBL transcript codes for the gene set of interest,
+while the background dataset has one column with ENSEMBL transcript
+codes and another with KEGG pathways, for some larger universe of genes.
+The input dataset to rank by "change in length and number of paths"
+must have columns with KEGG gene ID and pathways.  The output datasets
+are described below.  (`Dataset missing?`_)
 
 .. _tabular: ./static/formatHelp.html#tab
 .. _Dataset missing?: ./static/formatHelp.html
@@ -75,37 +83,36 @@
 
 **What it does**
 
-This tool produces a table ranking the pathways based on the percentage
-of genes in an input dataset, out of the total in each pathway
-[please clarify w.r.t. query and background datasets].
-Alternatively, the tool ranks the pathways based on the change in
-length and number of paths connecting sources and sinks.  This change is
-calculated between graphs representing pathways with and without excluding
-the nodes that represent the genes in an input list.  Sources are all
-the nodes representing the initial reactants/products in the pathway.
-Sinks are all the nodes representing the final reactants/products in
-the pathway.
+Given a query set of genes from a larger background dataset, this tool
+evaluates the over- or under-representation of KEGG pathways in the query
+set, using the specified statistical test.  Alternatively, the tool ranks
+the pathways based on the change in length and number of paths connecting
+sources and sinks.  This change is calculated between graphs representing
+pathways with and without excluding the nodes that represent the genes
+in an input list.  Sources are all the nodes representing the initial
+reactants/products in the pathway.  Sinks are all the nodes representing
+the final reactants/products in the pathway.
 
-If pathways are ranked by percentage of genes affected, the output contains
-a row for each KEGG pathway, with the following columns:
+If pathways are ranked by percentage of genes affected, the output
+contains a row for each KEGG pathway, with the following columns:
 
 1. count: the number of genes in the query set that are in this pathway
 2. representation: the percentage of this pathway's genes (from the background dataset) that appear in the query set
 3. ranking of this pathway, based on its representation ("1" is highest)
 4. probability of depletion of this pathway in the query dataset
 5. probability of enrichment of this pathway in the query dataset
-6. KEGG pathway
+6. name of the pathway
 
 If pathways are ranked by change in length and number of paths, the
 output is a tabular dataset with the following columns:
 
 1. change in the mean length of paths between sources and sinks
-2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset.  If the pathway do not have sources/sinks, the length is assumed to be infinite (I)
-3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset.  If the pathway do not have sources/sinks, the length is assumed to be infinite (I)
+2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I)
+3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I)
 4. rank of the change in the mean length of paths between sources and sinks (from high change to low change)
 5. change in the number of paths between sources and sinks
-6. number of paths between sources and sinks in the pathway including the genes in the input dataset.  If the pathway do not have sources/sinks, it is assumed to be a circuit (C)
-7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset.  If the pathway do not have sources/sinks, it is assumed to be a circuit (C)
+6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C)
+7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C)
 8. rank of the change in the number of paths between sources and sinks (from high change to low change)
 9. name of the pathway
 
@@ -113,27 +120,42 @@
 
 **Examples**
 
-- input (column 10 for KEGG gene ID, column 12 for KEGG pathways)::
- 
+Rank by percentage of genes affected:
+
+- input background dataset (column 5 for ENSEMBL transcript, column 12 for KEGG pathways, two-tailed Fisher's exact test for statistic)::
+
    Contig39_chr1_3261104_3261850   414  chr1  3261546  ENSCAFT00000000001   ENSCAFP00000000001   S    667   F    476153  probably damaging    cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways
    Contig62_chr1_19011969_19012646 265  chr1  19012240 ENSCAFT00000000144   ENSCAFP00000000125   *    161   R    483960  probably damaging    N
    etc.
- 
-- output ranked by percentage of genes affected [need new sample output with more columns]::
+
+- input query dataset (column 5 for ENSEMBL transcript)::
+
+   Contig12_chr20_101969_112646    265  chr20 9822141  ENSCAFT00000001234   ENSCAFP00000021123   T    101   R    476153  probably damaging
+   Contig39_chr1_3261104_3261850   414  chr1  3261546  ENSCAFT00000000001   ENSCAFP00000000001   S    667   F    476153  probably damaging
+   etc.
 
-   3   0.25   1   cfa03450=Non-homologous end-joining
-   1   0.25   1   cfa00750=Vitamin B6 metabolism
-   2   0.2    3   cfa00290=Valine, leucine and isoleucine biosynthesis
-   3   0.18   4   cfa00770=Pantothenate and CoA biosynthesis
+- output::
+
+   3   0.20    1   1.0 0.0065  cfa03450=Non-homologous end-joining
+   1   0.067   2   1.0 0.019   cfa00750=Vitamin B6 metabolism
+   2   0.062   3   1.0 0.021   cfa00290=Valine, leucine and isoleucine biosynthesis
+   1   0.037   4   1.0 0.035   cfa00770=Pantothenate and CoA biosynthesis
    etc.
 
-- output ranked by change in length and number of paths::
+Rank by change in length and number of paths:
 
-    3.64   8.44   4.8     2   4    9    5   1   cfa00260=Glycine, serine and threonine metabolism
-    7.6    9.6    2       1   3    5    2   2   cfa00240=Pyrimidine metabolism
-    0.05   2.67   2.62    6   1   30   29   3   cfa00982=Drug metabolism - cytochrome P450
-   -0.08   8.33   8.41   84   1   30   29   3   cfa00564=Glycerophospholipid metabolism
+- input (column 10 for KEGG gene ID, column 12 for KEGG pathways)::
+
+   Contig39_chr1_3261104_3261850   414  chr1  3261546  ENSCAFT00000000001   ENSCAFP00000000001   S    667   F    476153  probably damaging    cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways
+   Contig62_chr1_19011969_19012646 265  chr1  19012240 ENSCAFT00000000144   ENSCAFP00000000125   *    161   R    483960  probably damaging    N
    etc.
 
+- output::
+
+   3.64   8.44   4.8     2   4    9    5   1   cfa00260=Glycine, serine and threonine metabolism
+   7.6    9.6    2       1   3    5    2   2   cfa00240=Pyrimidine metabolism
+   0.05   2.67   2.62    6   1   30   29   3   cfa00982=Drug metabolism - cytochrome P450
+   -0.08  8.33   8.41   84   1   30   29   3   cfa00564=Glycerophospholipid metabolism
+   etc.
   </help>
 </tool>