comparison rank_pathways.xml @ 28:184d14e4270d

Update to Miller Lab devshed revision 4ede22dd5500
author Richard Burhans <burhans@bx.psu.edu>
date Wed, 17 Jul 2013 12:46:46 -0400
parents 8997f2ca8c7a
children a631c2f6d913
comparison
27:8997f2ca8c7a 28:184d14e4270d
60 60
61 <help> 61 <help>
62 62
63 **Dataset formats** 63 **Dataset formats**
64 64
65 All of the input and output datasets are in tabular_ format. 65 The query dataset has a column containing ENSEMBL transcript codes for
66 The input dataset must have columns with KEGG gene ID and pathways. 66 the gene set of interest, while the background dataset has one column
67 [Need to update this, since input columns now depend on the "Rank by" choice.] 67 with ENSEMBL transcript codes and another with GO terms, for some larger
68 The output datasets are described below. 68 universe of genes.
69 (`Dataset missing?`_) 69
70 All of the input and output datasets are in tabular_ format. The input
71 dataset (i.e. query) to rank by "percentage of genes affected" has a
72 column containing ENSEMBL transcript codes for the gene set of interest,
73 while the background dataset has one column with ENSEMBL transcript
74 codes and another with KEGG pathways, for some larger universe of genes.
75 The input dataset to rank by "change in length and number of paths"
76 must have columns with KEGG gene ID and pathways. The output datasets
77 are described below. (`Dataset missing?`_)
70 78
71 .. _tabular: ./static/formatHelp.html#tab 79 .. _tabular: ./static/formatHelp.html#tab
72 .. _Dataset missing?: ./static/formatHelp.html 80 .. _Dataset missing?: ./static/formatHelp.html
73 81
74 ----- 82 -----
75 83
76 **What it does** 84 **What it does**
77 85
78 This tool produces a table ranking the pathways based on the percentage 86 Given a query set of genes from a larger background dataset, this tool
79 of genes in an input dataset, out of the total in each pathway 87 evaluates the over- or under-representation of KEGG pathways in the query
80 [please clarify w.r.t. query and background datasets]. 88 set, using the specified statistical test. Alternatively, the tool ranks
81 Alternatively, the tool ranks the pathways based on the change in 89 the pathways based on the change in length and number of paths connecting
82 length and number of paths connecting sources and sinks. This change is 90 sources and sinks. This change is calculated between graphs representing
83 calculated between graphs representing pathways with and without excluding 91 pathways with and without excluding the nodes that represent the genes
84 the nodes that represent the genes in an input list. Sources are all 92 in an input list. Sources are all the nodes representing the initial
85 the nodes representing the initial reactants/products in the pathway. 93 reactants/products in the pathway. Sinks are all the nodes representing
86 Sinks are all the nodes representing the final reactants/products in 94 the final reactants/products in the pathway.
87 the pathway.
88 95
89 If pathways are ranked by percentage of genes affected, the output contains 96 If pathways are ranked by percentage of genes affected, the output
90 a row for each KEGG pathway, with the following columns: 97 contains a row for each KEGG pathway, with the following columns:
91 98
92 1. count: the number of genes in the query set that are in this pathway 99 1. count: the number of genes in the query set that are in this pathway
93 2. representation: the percentage of this pathway's genes (from the background dataset) that appear in the query set 100 2. representation: the percentage of this pathway's genes (from the background dataset) that appear in the query set
94 3. ranking of this pathway, based on its representation ("1" is highest) 101 3. ranking of this pathway, based on its representation ("1" is highest)
95 4. probability of depletion of this pathway in the query dataset 102 4. probability of depletion of this pathway in the query dataset
96 5. probability of enrichment of this pathway in the query dataset 103 5. probability of enrichment of this pathway in the query dataset
97 6. KEGG pathway 104 6. name of the pathway
98 105
99 If pathways are ranked by change in length and number of paths, the 106 If pathways are ranked by change in length and number of paths, the
100 output is a tabular dataset with the following columns: 107 output is a tabular dataset with the following columns:
101 108
102 1. change in the mean length of paths between sources and sinks 109 1. change in the mean length of paths between sources and sinks
103 2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) 110 2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I)
104 3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) 111 3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I)
105 4. rank of the change in the mean length of paths between sources and sinks (from high change to low change) 112 4. rank of the change in the mean length of paths between sources and sinks (from high change to low change)
106 5. change in the number of paths between sources and sinks 113 5. change in the number of paths between sources and sinks
107 6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) 114 6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C)
108 7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) 115 7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C)
109 8. rank of the change in the number of paths between sources and sinks (from high change to low change) 116 8. rank of the change in the number of paths between sources and sinks (from high change to low change)
110 9. name of the pathway 117 9. name of the pathway
111 118
112 ----- 119 -----
113 120
114 **Examples** 121 **Examples**
115 122
116 - input (column 10 for KEGG gene ID, column 12 for KEGG pathways):: 123 Rank by percentage of genes affected:
117 124
125 - input background dataset (column 5 for ENSEMBL transcript, column 12 for KEGG pathways, two-tailed Fisher's exact test for statistic)::
126
118 Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways 127 Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways
119 Contig62_chr1_19011969_19012646 265 chr1 19012240 ENSCAFT00000000144 ENSCAFP00000000125 * 161 R 483960 probably damaging N 128 Contig62_chr1_19011969_19012646 265 chr1 19012240 ENSCAFT00000000144 ENSCAFP00000000125 * 161 R 483960 probably damaging N
120 etc. 129 etc.
121
122 - output ranked by percentage of genes affected [need new sample output with more columns]::
123 130
124 3 0.25 1 cfa03450=Non-homologous end-joining 131 - input query dataset (column 5 for ENSEMBL transcript)::
125 1 0.25 1 cfa00750=Vitamin B6 metabolism 132
126 2 0.2 3 cfa00290=Valine, leucine and isoleucine biosynthesis 133 Contig12_chr20_101969_112646 265 chr20 9822141 ENSCAFT00000001234 ENSCAFP00000021123 T 101 R 476153 probably damaging
127 3 0.18 4 cfa00770=Pantothenate and CoA biosynthesis 134 Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging
128 etc. 135 etc.
129 136
130 - output ranked by change in length and number of paths:: 137 - output::
131 138
132 3.64 8.44 4.8 2 4 9 5 1 cfa00260=Glycine, serine and threonine metabolism 139 3 0.20 1 1.0 0.0065 cfa03450=Non-homologous end-joining
133 7.6 9.6 2 1 3 5 2 2 cfa00240=Pyrimidine metabolism 140 1 0.067 2 1.0 0.019 cfa00750=Vitamin B6 metabolism
134 0.05 2.67 2.62 6 1 30 29 3 cfa00982=Drug metabolism - cytochrome P450 141 2 0.062 3 1.0 0.021 cfa00290=Valine, leucine and isoleucine biosynthesis
135 -0.08 8.33 8.41 84 1 30 29 3 cfa00564=Glycerophospholipid metabolism 142 1 0.037 4 1.0 0.035 cfa00770=Pantothenate and CoA biosynthesis
136 etc. 143 etc.
137 144
145 Rank by change in length and number of paths:
146
147 - input (column 10 for KEGG gene ID, column 12 for KEGG pathways)::
148
149 Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways
150 Contig62_chr1_19011969_19012646 265 chr1 19012240 ENSCAFT00000000144 ENSCAFP00000000125 * 161 R 483960 probably damaging N
151 etc.
152
153 - output::
154
155 3.64 8.44 4.8 2 4 9 5 1 cfa00260=Glycine, serine and threonine metabolism
156 7.6 9.6 2 1 3 5 2 2 cfa00240=Pyrimidine metabolism
157 0.05 2.67 2.62 6 1 30 29 3 cfa00982=Drug metabolism - cytochrome P450
158 -0.08 8.33 8.41 84 1 30 29 3 cfa00564=Glycerophospholipid metabolism
159 etc.
138 </help> 160 </help>
139 </tool> 161 </tool>
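
The revised help text describes six output columns for the "percentage of genes affected" ranking: count, representation, rank, probability of depletion, probability of enrichment, and pathway name. The sketch below illustrates how such a table could be computed. It is not the tool's actual implementation. It assumes the depletion and enrichment columns are the lower and upper hypergeometric tails (the one-sided forms of Fisher's exact test), that representation is reported as a fraction, and that tied pathways share a rank. The names `tail_probabilities` and `rank_pathways` and the dictionary input format are hypothetical.

```python
from math import comb


def tail_probabilities(k, N, K, n):
    """Lower and upper hypergeometric tails for observing k pathway genes.

    N: genes in the background, K: background genes in the pathway,
    n: genes in the query, k: query genes in the pathway.
    Assumption: "depletion" is P(X <= k) and "enrichment" is P(X >= k).
    """
    lo, hi = max(0, n - (N - K)), min(K, n)
    total = comb(N, n)
    pmf = {i: comb(K, i) * comb(N - K, n - i) / total for i in range(lo, hi + 1)}
    depletion = sum(p for i, p in pmf.items() if i <= k)
    enrichment = sum(p for i, p in pmf.items() if i >= k)
    return depletion, enrichment


def rank_pathways(background, query):
    """Return rows of (count, representation, rank, p_depletion, p_enrichment, pathway).

    background: dict mapping transcript ID -> set of pathway names (hypothetical format).
    query: iterable of transcript IDs; IDs absent from the background are ignored.
    """
    query = set(query) & set(background)
    N, n = len(background), len(query)

    # Invert the background: pathway -> set of transcripts annotated with it.
    members = {}
    for transcript, pathways in background.items():
        for pathway in pathways:
            members.setdefault(pathway, set()).add(transcript)

    stats = []
    for pathway, genes in members.items():
        count = len(genes & query)
        representation = count / len(genes)
        depletion, enrichment = tail_probabilities(count, N, len(genes), n)
        stats.append((count, representation, depletion, enrichment, pathway))

    # Highest representation first; pathway name breaks ties for stable output.
    stats.sort(key=lambda s: (-s[1], s[4]))

    rows = []
    for position, (count, rep, dep, enr, pathway) in enumerate(stats, start=1):
        # Assumption: pathways with equal representation share the same rank.
        rank = rows[-1][2] if rows and rows[-1][1] == rep else position
        rows.append((count, rep, rank, dep, enr, pathway))
    return rows


# Toy data, invented for illustration only.
background = {
    "T1": {"p1"}, "T2": {"p1"}, "T3": {"p1", "p2"},
    "T4": {"p2"}, "T5": set(), "T6": set(),
}
rows = rank_pathways(background, ["T1", "T2", "T5"])
for row in rows:
    print("\t".join(str(field) for field in row))
```

In the example datasets above, the pathways for one transcript are packed into a single column; splitting that column into the per-transcript sets used here is left out of the sketch.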