Galaxy |

Changeset 0:66ef158fa85c (2019-11-22)

Next changeset 1:c18d0db68d51 (2019-11-22)

Commit message:
Uploaded

added:
aurora_wgcna.Rmd

diff -r 000000000000 -r 66ef158fa85c aurora_wgcna.Rmd
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/aurora_wgcna.Rmd Fri Nov 22 19:44:46 2019 -0500

[

b'@@ -0,0 +1,376 @@\n+---\n+title: \'Aurora Galaxy WGCNA Tool: Gene Co-Expression Network Construction & Analysis\'\n+output:\n+ pdf_document:\n+ number_sections: false\n+---\n+\n+```{r setup, include=FALSE, warning=FALSE, message=FALSE}\n+knitr::opts_chunk$set(error = FALSE, echo = FALSE)\n+```\n+\n+```{r}\n+# Make a directory for saving the figures.\n+dir.create(\'figures\', showWarnings = FALSE)\n+```\n+\n+# Introduction\n+This report contains step-by-step results from use of the [Aurora Galaxy](https://github.com/statonlab/aurora-galaxy-tools) Weighted Gene Co-expression Network Analysis [WGCNA](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) tool. This tool wraps the WGCNA R package into a ready-to-use Rmarkdown file. It performs module discovery and network construction using a dataset and optional trait data matrix provided. \n+\n+If you provided trait data, a second report will be available with results comparing the trait values to the identified modules.\n+\n+This report was generated on:\n+```{r}\n+format(Sys.time(), "%a %b %d %X %Y")\n+```\n+\n+\n+## About the Input Data\n+### Gene Expression Matrix (GEM)\n+The gene expression data is an *n* x *m* matrix where *n* rows are the genes, *m* columns are the samples and the elements represent gene expression levels (derived either from Microarray or RNA-Seq). The matrix was provided in a file meething these rules:\n+- Housed in a comma-separated (CSV) file.\n+- The rows represent the gene expression levels\n+- The first column of each row is the gene, transcript or probe name.\n+- The header contains only the sample names and therefore is one value less than the remaining rows of the file.\n+\n+### Trait/Phenotype Matrix\n+The trait/phenotype data is an *n* x *m* matrix where *n* is the samples and *m* are the features such as experimental condition, biosample properties, traits or phenotype values. The matrix is stored in a comma-separated (CSV) file and has a header.\n+\n+## Parameters provided by the user.\n+The following describes the input arguments provided to this tool:\n+```{r}\n+\n+if (!is.null(opt$height_cut)) {\n+ print(\'The cut height for outlier removal of the sample dendrogram:\')\n+ print(opt$cut_height)\n+}\n+\n+if (!is.null(opt$power)) {\n+ print(\'The power to which the gene expression data is raised:\')\n+ print(opt$power)\n+}\n+print(\'The minimal size for a module:\')\n+print(opt$min_cluster_size)\n+\n+print(\'The block size for dividing the GEM to reduce memory requirements:\')\n+print(opt$block_size)\n+\n+print(\'The hard threshold when generating the graph file:\')\n+print(opt$hard_threshold)\n+\n+print(\'The character string used to identify missing values in the GEM:\')\n+print(opt$missing_value1)\n+\n+if (!is.null(opt$trait_data)) {\n+ print(\'The column in the trait data that contains the sample name:\')\n+ print(opt$sname_col)\n+ \n+ print(\'The character string used to identify missing values in the trait data:\')\n+ print(opt$missing_value2)\n+ \n+ print(\'Columns in the trait data that should be treated as categorical:\')\n+ print(opt$one_hot_cols)\n+ \n+ print(\'Columns in the trait data that should be ignored:\')\n+ print(opt$ignore_cols)\n+}\n+```\n+\n+## If Errors Occur\n+Please note, that if any of the R code encountered problems, error messages will appear in the report below. If an error occurs anywhere in the report, results should be thrown out. Errors are usually caused by improperly formatted input data or improper input arguments. Use the following checklist to find and correct potential errors:\n+\n+- Do the formats for the input datasets match the requirements listed above.\n+- Do the values set for missing values match the values in the input files, and is the missing value used consistently within the input files (i.e you don\'t have more than one such as 0.0 and 0, or NA and 0.0)\n+- If trait data was provided, check that the column specified for the sample name is correct.\n+- The block size should not exceed 10,000 and should not be lower than 1,000.\n+- Ensure that the s'..b' genes were selected to draw the heat maps in order to save on computational time.\n+\n+```{r}\n+for (i in blocks) {\n+ # Load the TOM from a file.\n+ load(net$TOMFiles[i])\n+ TOM_size = length(which(net$blocks == i))\n+ TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)\n+ dissTOM = 1-TOM\n+\n+ # For reproducibility, we set the random seed\n+ set.seed(10);\n+ select = sample(dim(TOM)[1], size = 1000);\n+ selectColors = module_labels[net$blockGenes[[i]][select]]\n+ selectTOM = dissTOM[select, select];\n+\n+ # There\xe2\x80\x99s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.\n+ selectTree = hclust(as.dist(selectTOM), method = "average")\n+\n+ # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing\n+ # the color palette; setting the diagonal to NA also improves the clarity of the plot\n+ plotDiss = selectTOM^7;\n+ diag(plotDiss) = NA;\n+ colors = sub(\'ME\',\'\', selectColors)\n+ \n+ png(paste0(\'figures/06-TOM_heatmap_block_\', i, \'.png\'), width=6 ,height=6, units="in", res=300)\n+ TOMplot(plotDiss, selectTree, colors, main = paste(\'TOM Heatmap, Block\', i))\n+ dev.off()\n+ TOMplot(plotDiss, selectTree, colors, main = paste(\'TOM Heatmap, Block\', i))\n+}\n+```\n+\n+```{r}\n+output = cbind(colnames(gemt), module_labels)\n+colnames(output) = c(\'Gene\', \'Module\')\n+write.csv(output, file = opt$gene_module_file, quote=FALSE, row.names=FALSE)\n+```\n+\n+A file has been generated named `gene_module_file.csv` which contains the list of genes and the modules they belong to.\n+\n+The TOM serves as both a simialrity matrix and an adjacency matrix. The adjacency matrix is typically identical to a similarity matrix but with values above a set threshold set to 1 and values below set to 0. This is known as hard thresholding. However, WGCNA does not set values above a threshold to zero but leaves the values as they are, hence the word "weighted" in the WGCNA name. Additionally, it does not use a threshold at all, so no elements are set to 0. This approach is called "soft thresholding", because the pairwise weights of all genes contributed to discover of modules. The name "soft thresholding" may be a misnomer, however, because no thresholding in the traditional sense actually occurs.\n+\n+Unfortunately, this "soft thresholding" approach can make creation of a graph representation of the network difficult. If we exported the TOM as a connected graph it would result in a fully complete graph and would be difficult to interpret. Therefore, we must still set a hard-threshold if we want to visualize connectivity in graph form. Setting a hard threshold, if too high can result in genes being excluded from the graph and a threshold that is too low can result in too many false edges in the graph.\n+\n+```{r}\n+edges = data.frame(fromNode= c(), toNode=c(), weight=c(), direction=c(), fromAltName=c(), toAltName=c())\n+for (i in blocks) {\n+ # Load the TOM from a file.\n+ load(net$TOMFiles[i])\n+ TOM_size = length(which(net$blocks == i))\n+ TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)\n+ colnames(TOM) = colnames(gemt)[net$blockGenes[[i]]]\n+ row.names(TOM) = colnames(gemt)[net$blockGenes[[i]]]\n+\n+ cydata = exportNetworkToCytoscape(TOM, threshold = opt$hard_threshold)\n+ edges = rbind(edges, cydata$edgeData)\n+}\n+\n+edges$Interaction = \'co\'\n+output = edges[,c(\'fromNode\',\'toNode\',\'Interaction\', \'weight\')]\n+colnames(output) = c(\'Source\', \'Target\', \'Interaction\', \'Weight\')\n+write.table(output, file = opt$network_edges_file, quote=FALSE, row.names=FALSE, sep="\\t")\n+```\n+\n+Using the hard threshold parameter provided, a file has been generated named `network_edges.txt` which contains the list of edges. You can import this file into [Cytoscape](https://cytoscape.org/) for visualization. If you would like a larger graph, you must re-run the tool with a smaller threshold.\n+\n+```{r}\n+# Save this image for the next step which is optional if theuser\n+# provides a trait file.\n+save.image(file=opt$r_data)\n+```\n'