Commit message:
Uploaded

added:
README.rst
R_functions/clustering_functions.R
R_functions/outlier_trimming.R
R_functions/pca_helpers.R
R_functions/pipeline_code.R
R_functions/plotting_functions.R
iterative_pca.py
iterative_pca_plot.R
pca_pipeline_def.xml
pca_report.html
pedify.py
tool_dependencies.xml
diff -r 000000000000 -r 64e75e21466e README.rst
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README.rst	Wed Jun 01 03:38:39 2016 -0400
b"@@ -0,0 +1,331 @@\n+.. class:: warningmark \n+\n+'''WARNING''' This tool requires the 'dbscan' (https://cran.r-project.org/web/packages/dbscan/index.html) and 'flashpcaR' (https://github.com/gabraham/flashpca/releases) R packages to be installed on the galaxy instance.\n+\n+======================================\n+Principle Component Analysis Pipeline \n+======================================\n+\n+\t:Author: Adrian Cheung\n+\t:Contact: adrian.che0222@gmail.com\n+\t:Date: 15-01-2015\n+\n+Contents\n+--------\n+\n+- `Overview`_\n+- `Primary Input`_\n+- `Primary Output`_\n+- `Options/Secondary Inputs`_\n+- `Other Output`_\n+- `Command Line Interface`_\n+- `Implementation Details`_\n+\n+Overview\n+--------\n+A tool which performs iterative principle component analysis. \n+The general idea is to seperate patient samples based on their ethnicity, by performing PCA on the variant data of each sample. \n+After each analysis step, outliers are identified. The PCA is then repeated, with the outliers removed.\n+This process continues for a set number of iterations specified by the user. After the pipeline completes, the user can see a \n+detailed summary, as well as have access to the outliers identified at each iteration.\n+\n+Primary Input\n+-------------\n+As primary input the tools accepts a single file, which may be formatted in the following ways:\n+\n+- **Variant data file:** This should be a tab-delimited text file, with each row containing data about a single variant site from a single person. If this option is selected, the column names which contain important information must also be specified, either via a configuration file (see below), or through the tool's form fields.\n+- **Numeric ped file:** See http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml for detailed information on PED format. This tool requires the affection status of each site to be specified numerically i.e.:\n+\n+\t- 0 = homozygous reference \n+\t- 1 = heterozygous\n+\t- 2 = homozygous alternate\n+ \n+ rather than consisting of pairs of genotypes for each site.\n+- **RData file:** File containing stored data from an R session. For this tool the input must meet certain requirements:\n+ \t\n+ \t- The file can only contain a SINGLE R object, which must be a list. \n+ \t- The list must contain a named 'bed' element. \n+ \t- The 'bed' element must be an n x m matrix/data frame, where n = number of samples, m = number of unique snps found in all the samples. \n+ \t- The A(i,j)th entry in the 'bed' matrix should indicate affectation status of the ith sample at the jth SNP site, according to the key for numeric ped files (as above). \n+ \t- The row names of the 'bed' matrix must contain the ids of the samples. \n+ \t- The column names of the 'bed' matrix must contain the ids of the SNPs. \n+\n+ If these very specific criteria are not met, the tool WILL fail.\n+\n+- **RDS file (command line only):** File containing a single R object. Object must follow same specifications as the RData file\n+\n+Primary Output\n+---------------\n+\n+HTML file containing plots of the PCA for each iteration.\n+Possible plots, depending on user specified options:\n+\n+- **Control vs Cases Plot:** If control and/or cases tags are provided, this plot will be output. ALL samples are plotted, with controls shown in blue, cases in red, unknown samples in black.\n+- **Cluster Plot:** Output if user opts to do clustering. Samples are plotted, with clusters colour-coded. Outliers as identified by DBSCAN are always read and use an open circle as the icon. 
Trimmed clusters use a cross for the icon, instead of a circle. Both the outliers (open circles) AND the rejected clusters (crosses) will be dropped in the next iteration.\n+- **Outliers Plot:** Output if user does NOT opt to do clustering. Samples which are considered outliers (as described above in 'Detecting outliers without clustering') are plotted as red open circles; all other samples are plotted as green full circles.\n+- **Standard Deviations Plot:** Samples are colour-coded by standard deviation. Samples whi"..b'ion number>", and will contain the pngs for the plots, as well as the outliers file and excluded samples file. The output basename can be set with the --out <custom_basename> command line argument. Here is a section of an example output directory tree, if the basename was \'data1\'::\n+\n+\tfull_output_data1\n+\t|-> data1.html\n+\t|-> data1.log\n+\t|-> output_data1_iteration_0\n+\t\t|-> data1_outliers.txt\n+\t\t|-> data1_xsamples.txt\n+\t\t|-> data1_plot1\n+\t\t|-> data1_plot2\n+\t\t...etc\n+\t|-> output_data1_iteration_1\n+\t...etc\n+\n+**Example Usage**\n+\n+To see more help text on how to run the program, do::\n+\t\n+\tpython iterative_pca.py -h\n+\n+If we have a directory containing some valid input files at <root_dir>/input, then some here are some example\n+use cases:\n+\n+- *Text file containing variant data input*::\n+\n+\tpython iterative_pca.py <root_dir>/input/parsed_all_Halo1_150528_wGT.txt variant_data 10 --config_file <root_dir>/input/halo1_pca.config --ethnicity_file <root_dir>/input/Halo_ethnicity_rf.txt --clustering_flag --clustering_method dbscan --cluster_trimming sd --out halo1_out\t\n+\n+ Does 10 iterations on the input file, using the ethnicity data from \'Halo_ethnicity_rf.txt\', and outputs data to \'full_output_halo1_out\'. Use the specified config file, and use DBSCAN for clustering, and trim by standard deviations. \n+\n+- *Numeric ped file input*::\n+\n+\tpython iterative_pca.py <root_dir>/input/halo1_numeric.ped numeric_ped 5\n+\n+ Does 5 iterations on the input file. Trim outliers purely by standard deviations (no clustering), and output data to \'full_output_halo1_numeric\'.\n+\n+- *RDS file input*::\n+\n+\tpython iterative_pca.py <root_dir>/input/HapMap3_flashPCA_data.rds rds 20 --ethnicity_file <root_dir>/input/HapMap3_ethnicity_rf.txt --clustering_flag --clustering_method hclust --cluster_trimming mcd --out hapmap3_test1 --control LP --cases HAPS\n+ \n+ Do 20 iterations on the input rds file, using ethnicity data from \'HapMap3_ethnicity_rf.txt\'. Use hierarchical clustering, and trim by mean cluster distance. Indicate that ids of control samples contain the pattern "LP", and ids of case samples contain the pattern "HAPS".\n+\n+\n+Implementation Details\n+----------------------\n+The program consists of two main scripts: a top level python script and an R script. The python script prepares the input files so they can be read into the R script. The R script is where most of the heavy lifting is done; the iterative pca steps as well as the plotting all occurs here. After this script finishes, the python script will generate the HTML output using the Jinja2 templating engine, before exiting.\n+\n+Here is a brief overview of each of the files\' functions:\n+\n+Python:\n+\n+- **iterative_pca.py**: Top level script which manages the entire pipeline. 
Main functions are preparing the input files for the R script and writing to the log file, and creating the HTML file at the end.\n+- **pedify.py**: Contains functions for parsing the configuration file and generating ped and map files from an input variant text file. Also responsible for filtering out unwanted samples/SNPs when parsing the text file.\n+\n+R:\n+\n+- **iterative_pca_plot.R**: reads in data, then runs multiple PCA iterations, keeping track of the data. This data is then used to generate the output folders, plots and files.\n+- **plotting_functions.R**: Functions used to prepare the different kinds of plots.\n+- **clustering_functions.R**: Clustering functions. Contains functions to optimise the DBSCAN and hclust algorithms, including ways to find k, the number of clusters. Also contains different methods for identifying cluster outliers.\n+- **outlier_trimming.R**: Functions to trim outliers (no clustering).\n+- **pca_helpers.R**: Functions to read in data and perform PCA, along with some other utility functions.\n+- **pipeline_code.R**: A function to run a single iteration of the pipeline.\n+\n+Other:\n+\n+- **pca_report.html**: HTML template which is filled in with the iteration data after completing a run successfully.\n+\n+\n+\n' |
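The RData contract described in the README is strict, so a minimal sketch of a conforming input may help (sample and SNP ids are hypothetical); the file is saved with exactly one R object::

	bed = matrix(sample(0:2, 6, replace=TRUE), nrow=2,
	             dimnames=list(c("SAMPLE-1", "SAMPLE-2"),        # sample ids as row names
	                           c("rs0001", "rs0002", "rs0003"))) # SNP ids as column names
	input = list(bed=bed)                    # a single list with a named 'bed' element
	save(input, file="example_input.RData")  # exactly one object, or the tool WILL fail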
diff -r 000000000000 -r 64e75e21466e R_functions/clustering_functions.R
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/R_functions/clustering_functions.R	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,134 @@
+# Automated clustering suite that makes use of the dbscan and hclust
+# clustering algorithms
+library(dbscan)
+library(cluster)
+
+# Do the OPTICS clustering algorithm on the input data.
+# Computationally determines the 'optimal' value for the epsilon neighbourhood argument.
+# Outputs an integer array indicating which cluster each sample from the input data belongs to.
+# 0 indicates the sample is an outlier and does not belong to any cluster
+automated_dbscan = function(pca_data, emax, mp) {
+  d = dist(pca_data$values[, c(1, 2)])
+  oc = optics(d, emax, mp)
+  eps = compute_eps(oc, 3)
+  oc_clusters = optics_cut(oc, eps)
+  return(oc_clusters$cluster)
+}
+
+# Do the hclust clustering algorithm on the input data, automatically
+# computing the optimal value for 'k', i.e. the number of clusters.
+# Outputs an integer array indicating which cluster each sample from the input data belongs to.
+automated_hclust = function(pca_data) {
+  d = dist(pca_data$values[, c(1, 2)])
+  hc = hclust(d)
+  k = compute_k(hc, d, 10)
+  clusters = cutree(hc, k)
+  return(clusters)
+}
+
+# Determine the optimal epsilon value for dbscan
+# from the input OPTICS data.
+compute_eps = function(oc, weight) {
+  rdist = oc$reachdist
+  iqr = IQR(rdist)
+  uc = median(rdist) + weight*iqr
+  outliers = rdist[which(rdist > uc)]
+  eps = median(outliers)
+  return(eps)
+}
+
+# Determine k using the silhouette method and hclust.
+# Values near 1 indicate clustering is appropriate.
+compute_k = function(hc, d, n) {
+  s = 1:n
+  # try clusters from size 2 to n, evaluating the 'silhouette coefficient'
+  # for each value of k.
+  for (i in 1:n) {
+    # silhouette method not valid for cluster size = 1
+    if (i == 1) {
+      # if the silhouette coeff is < 0.5 for all the other values,
+      # we assume there is just one big cluster
+      s[i] = 0.5
+    } else {
+      # get the silhouette coeff for every point, and take the mean
+      clusters = cutree(hc, i)
+      sil = silhouette(clusters, d)
+      sil_coeff = mean(sil[, 'sil_width'])
+      s[i] = sil_coeff
+    }
+  }
+  return(which.max(s))
+}
+
+# Compute the centroids of each cluster.
+# Returns an n x m matrix, where
+# n = number of clusters
+# m = dimension of data
+find_cluster_centers = function(clusters, x) {
+  cluster_ids = unique(clusters)
+  k = length(cluster_ids)
+  centers = matrix(nrow=k, ncol=ncol(x))
+  rownames(centers) = 1:k
+  for (i in 1:k) {
+    this_id = cluster_ids[i]
+    # don't work out the centre of the outliers
+    cluster_points = x[which(clusters==this_id), ]
+    if (is.null(dim(cluster_points))) {
+      l = length(cluster_points)
+      cluster_points = matrix(cluster_points, ncol=l)
+    }
+    centers[i, ] = apply(cluster_points, 2, FUN=mean)
+    rownames(centers)[i] = this_id
+  }
+  centers = centers[rownames(centers) != 0, , drop=FALSE]
+  return(centers)
+}
+
+# Determine which clusters are outliers based on how far away their cluster
+# centers are from the center of all the data. Returns a list containing the
+# 'ids' of all rejected CLUSTERS (not samples)
+find_cluster_outliers_sd = function(clusters, centers, pca_data, numsds) {
+  if (nrow(centers) <= 1) {
+    print("no centers, or only one center")
+    return(NULL)
+  }
+  # work out the standard deviation of the data in pc1 and pc2 directions
+  pc1_sd = sd(pca_data$values[, 1])
+  pc2_sd = sd(pca_data$values[, 2])
+  # work out the centroid of the pca data
+  centroid = apply(pca_data$values, 2, FUN=mean)
+  pc1_bar = centroid[1]
+  pc2_bar = centroid[2]
+  # get pc1 and pc2 hi-lo cutoffs
+  pc1_hi = pc1_bar + pc1_sd*numsds
+  pc1_lo = pc1_bar - pc1_sd*numsds
+  pc2_hi = pc2_bar + pc2_sd*numsds
+  pc2_lo = pc2_bar - pc2_sd*numsds
+  # check which clusters fall outside the cutoffs
+  pc1_centers = centers[, 1]
+  pc2_centers = centers[, 2]
+  pc1_ol = which(pc1_centers > pc1_hi | pc1_centers < pc1_lo)
+  pc2_ol = which(pc2_centers > pc2_hi | pc2_centers < pc2_lo)
+  all_ol = union(pc1_ol, pc2_ol)
+  rc = rownames(centers)[all_ol]  # ids of rejected clusters
+  return(rc)
+}
+
+# Determine which clusters are outliers based on how far away they are from other clusters.
+# Returns a list containing the 'ids' of all rejected CLUSTERS (not samples)
+find_cluster_outliers_mcd = function(clusters, centers, pca_data, multiplier, ndim) {
+  if (nrow(centers) <= 1) {
+    print("no centers, or only one center")
+    return(NULL)
+  }
+  d = dist(centers[, 1:ndim])
+  dm = as.matrix(d)
+  avg_distance = mean(as.numeric(d))
+  # ignore the diagonal
+  cluster_distances = rowSums(dm)/(ncol(dm)-1)
+  # trim clusters based on how far apart they are
+  cutoff = multiplier*avg_distance
+  indices = which(cluster_distances > cutoff)
+  rc = names(cluster_distances)[indices]
+  return(rc)
+}
\ No newline at end of file
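As a quick illustration of the hclust path above, a toy run on synthetic PCA coordinates (a minimal sketch; assumes the dbscan and cluster packages are installed)::

	source("R_functions/clustering_functions.R")
	set.seed(1)
	vals = rbind(matrix(rnorm(100, mean=0), ncol=2),
	             matrix(rnorm(100, mean=8), ncol=2))  # two well separated blobs
	pca_data = list(values=vals, ids=as.character(1:100))
	clusters = automated_hclust(pca_data)  # compute_k should settle on k = 2
	table(clusters)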
diff -r 000000000000 -r 64e75e21466e R_functions/outlier_trimming.R
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/R_functions/outlier_trimming.R	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,21 @@
+# Finding outliers by standard deviation
+
+# Get samples whose pc1 OR pc2 values lie more than 'numsds' s.devs
+# away from the sample median for that pc.
+outliers_by_sd = function(pca_data, xsamples, numsds) {
+  pc1_outliers = find_outliers(pca_data$values[, 1], numsds)
+  pc2_outliers = find_outliers(pca_data$values[, 2], numsds)
+  all_outliers = union(pc1_outliers, pc2_outliers)
+  return(all_outliers)
+}
+
+# Compute outliers.
+# Returns indices of all samples which lie more than
+# 'numsds' s.devs away from the sample median
+find_outliers = function(input_data, numsds) {
+  lower = median(input_data) - numsds*sd(input_data)
+  upper = median(input_data) + numsds*sd(input_data)
+
+  outliers = which(input_data < lower | input_data > upper)
+  return(outliers)
+}
\ No newline at end of file
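A quick sanity check of find_outliers() on synthetic data with one planted outlier (a toy sketch, not part of the tool)::

	source("R_functions/outlier_trimming.R")
	set.seed(1)
	x = c(rnorm(100), 25)  # 25 lies far beyond 2 s.devs of the median
	find_outliers(x, 2)    # expect index 101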
diff -r 000000000000 -r 64e75e21466e R_functions/pca_helpers.R
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/R_functions/pca_helpers.R	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,114 @@
+library(flashpcaR)
+
+# Read in data from a numeric ped file, an RDS file or an RData file.
+# Outputs numeric ped data, with the rownames set to the ids of the samples
+get_source_data = function(data_source, data_type) {
+  data_type = tolower(data_type)
+  if (data_type == "numeric_ped") {
+    # read in ped file
+    ped_data = read.table(data_source, sep="\t", row.names=1)
+  } else if (data_type == "rds") {
+    hapmap3_object = readRDS(data_source)
+    ped_data = hapmap3_object$bed
+  } else if (data_type == "rdata") {
+    hapmap3_object = load_obj(data_source)
+    ped_data = hapmap3_object$bed
+  } else {
+    print("Unrecognised data type, returning NULL")
+    ped_data = NULL
+  }
+  return(ped_data)
+}
+
+# A function that will read in and return a single object from an RData file.
+# This is a workaround so the program can run without needing to know the name of the object;
+# however the assumption is that the RData file contains only ONE object (the one we want)
+load_obj = function(filename) {
+  # create a new environment
+  env = new.env()
+  # load the rdata file into the new environment, and get the NAME
+  # of the first object
+  object_name = load(filename, env)[1]
+  # return the object
+  return(env[[object_name]])
+}
+
+# Remove unwanted rows and columns (samples and snps, respectively) from
+# the ped data
+filter_ped_data = function(ped_data, xsamples, xsnps) {
+  # rows to remove
+  rr = which(rownames(ped_data) %in% xsamples)
+  # remove rejected samples
+  if (length(rr) != 0) {
+    fpd1 = ped_data[-rr, , drop=FALSE]
+  } else {
+    fpd1 = ped_data
+  }
+  # remove all-zero and rejected snp columns
+  snps = which(colnames(ped_data) %in% xsnps)
+  zeros = which(colSums(abs(fpd1)) == 0)
+  cr = union(snps, zeros)
+  if (length(cr) != 0) {
+    fpd2 = fpd1[, -cr, drop=FALSE]
+  } else {
+    fpd2 = fpd1
+  }
+  # remove monomorphic snps
+  snp_sds = apply(fpd2, 2, sd)
+  clean_ped = fpd2[, snp_sds >= 0.01, drop=FALSE]
+  return(clean_ped)
+}
+
+# Ethnicity file requirements:
+# - tab delimited
+# - Must have at least two columns
+# - First column has sample IDs
+# - Second column has ethnicities
+# - First row must be a header
parse_ethnicity_file = function(eth_filename) {
+  if (file.exists(eth_filename) == FALSE) {
+    print(paste0("Warning: Ethnicity file: '", eth_filename, "' not found"))
+    return(NULL)
+  }
+  if (file.info(eth_filename)$size == 0) {
+    print(paste0("Warning: Ethnicity file: '", eth_filename, "' is empty"))
+    return(NULL)
+  }
+  eth_data = read.table(eth_filename, header=TRUE, sep="\t")
+  rownames(eth_data) = eth_data[, 1]
+  colnames(eth_data)[1] = "IID"
+  colnames(eth_data)[2] = "population"
+  return(eth_data)
+}
+
+# Read in a file and return the first column as a
+# character vector
+get_first_column = function(fname) {
+  rv = c()
+  if (file.exists(fname) == FALSE) {
+    print(paste0("Warning: File: '", fname, "' not found"))
+    return(rv)
+  }
+  if (file.info(fname)$size == 0) {
+    print(paste0("Warning: File: '", fname, "' is empty"))
+    return(rv)
+  } else {
+    rv = as.character(read.table(fname)[, 1])
+    return(rv)
+  }
+}
+
+# Do pca using flashpcaR. Returns a 2 element list:
+# values - contains the loadings of the pcs.
+#   Will be an n x m matrix, where
+#   - n = number of samples
+#   - m = number of pcs
+# ids - Character array of ids, same length as the number of rows in values
+do_pca = function(ped_data) {
+  pca_data = list()
+  pm = data.matrix(ped_data)
+  values = flashpca(pm, ndim=6)$vectors
+  pca_data$values = values
+  pca_data$ids = as.character(rownames(ped_data))
+  return(pca_data)
+}
\ No newline at end of file
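A toy end-to-end use of these helpers (a sketch with hypothetical ids and file paths; assumes flashpcaR is installed and works on such a small matrix)::

	source("R_functions/pca_helpers.R")
	set.seed(1)
	bed = matrix(sample(0:2, 200, replace=TRUE), nrow=10,
	             dimnames=list(paste0("SAMPLE-", 1:10), paste0("rs", 1:20)))
	saveRDS(list(bed=bed), "/tmp/toy.rds")
	ped = get_source_data("/tmp/toy.rds", "rds")
	ped = filter_ped_data(ped, xsamples=c("SAMPLE-1"), xsnps=c("rs3"))
	pca = do_pca(ped)  # pca$values: 9 x 6 matrix of PC loadings
	pca$ids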
diff -r 000000000000 -r 64e75e21466e R_functions/pipeline_code.R
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/R_functions/pipeline_code.R	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,112 @@
+### pipeline ###
+# Complete a single iteration, which consists of:
+# - Doing pca
+# - Clustering, if required
+# - Finding outliers
+# - Setting up plots
+# Outputs a list containing all the data regarding this iteration
+single_iteration = function(outdir, basename, ped_data, xsamples, numsds,
+                            cmethod, tmethod, control_tag, cases_tag, ethnicity_data=NULL) {
+  it_data = list()
+  it_data$old_xsamples = xsamples
+  # get data and do pca
+  pca_data = do_pca(ped_data)
+  it_data$pca_data = pca_data
+  it_data$plots = list()
+  plot_number = 1
+
+  # plot controls and cases
+  if (!is.null(control_tag) || !is.null(cases_tag)) {
+    it_data$plots[[plot_number]] = setup_cvc_plot(pca_data, control_tag, cases_tag)
+    plot_number = plot_number + 1
+  }
+
+  # if we have ethnicity data, set up a special plot
+  if (!is.null(ethnicity_data)) {
+    it_data$plots[[plot_number]] = setup_ethnicity_plot(pca_data, ethnicity_data)
+    plot_number = plot_number + 1
+  }
+
+  if (cmethod == "none") {
+    # get outliers by sd
+    all_outliers = outliers_by_sd(pca_data, xsamples, numsds)
+    new_xsamples = union(xsamples, pca_data$ids[all_outliers])
+    it_data$xsamples = new_xsamples
+    # prepare outlier plot
+    it_data$plots[[plot_number]] = setup_ol_plot(pca_data, all_outliers)
+    plot_number = plot_number + 1
+    # prepare sd plot
+    it_data$plots[[plot_number]] = setup_sd_plot(pca_data)
+    plot_number = plot_number + 1
+  } else {
+    # do clustering
+    if (cmethod == "dbscan") {
+      emax = 2
+      mp = 7
+      clusters = automated_dbscan(pca_data, emax, mp)
+    } else if (cmethod == "hclust") {
+      clusters = automated_hclust(pca_data)
+    } else {
+      clusters = automated_hclust(pca_data)
+    }
+
+    # get outliers
+    cluster_outliers = which(clusters == 0)
+    # get rejected clusters
+    centers = find_cluster_centers(clusters, pca_data$values)
+    if (tmethod == "mcd") {
+      rc = find_cluster_outliers_mcd(clusters, centers, pca_data, numsds, 2)
+    } else if (tmethod == "sd") {
+      rc = find_cluster_outliers_sd(clusters, centers, pca_data, numsds)
+    } else if (tmethod == "dbscan_outliers_only") {
+      rc = 0
+    }
+    rc_indices = which(clusters %in% rc)
+    all_ol = union(cluster_outliers, rc_indices)
+    # it is possible that all samples get removed, in which case the program will crash,
+    # so do not remove them
+    if (length(all_ol) == nrow(ped_data)) {
+      new_xsamples = xsamples
+    } else {
+      new_xsamples = union(xsamples, pca_data$ids[all_ol])
+    }
+    it_data$xsamples = new_xsamples
+    # prepare plot
+    it_data$plots[[plot_number]] = setup_cluster_plot(pca_data, clusters, rc=rc)
+    plot_number = plot_number + 1
+  }
+  it_data$outliers = setdiff(new_xsamples, xsamples)
+  it_data$num_plots = plot_number - 1
+  return(it_data)
+}
+
+# Basically an inner join on a list of ids and a table of ethnicity data.
+# If eth_data == NULL, then the second column is filled with NAs
+add_ethnicity_data = function(ids, eth_data) {
+  n = length(ids)
+  if (!is.null(eth_data)) {
+    output = matrix(nrow=n, ncol=ncol(eth_data))
+    colnames(output) = colnames(eth_data)
+    if (n > 0) {
+      for (i in 1:n) {
+        this_id = ids[i]
+        if (this_id %in% rownames(eth_data)) {
+          this_row = unlist(lapply(eth_data[this_id, ], as.character), use.names=FALSE)
+        } else {
+          this_row = c(this_id, rep(NA, ncol(output)-1))
+        }
+        output[i, ] = this_row
+      }
+    }
+  } else {
+    output = cbind(ids, rep(NA, n))
+    colnames(output) = c("IID", "Population")
+  }
+  return(output)
+}
+
+generate_directory_name = function(outdir, basename, iteration) {
+  newdir = paste0("output_", basename, "_iteration_", iteration - 1)
+  full_path = paste0(outdir, "/", newdir)
+  return(full_path)
+}
\ No newline at end of file
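A minimal sketch of driving one pass of the pipeline directly (hypothetical paths; assumes the other R_functions files are available and flashpcaR is installed)::

	source("R_functions/pca_helpers.R")
	source("R_functions/outlier_trimming.R")
	source("R_functions/clustering_functions.R")
	source("R_functions/plotting_functions.R")
	source("R_functions/pipeline_code.R")
	ped = get_source_data("data/example_numeric.ped", "numeric_ped")  # hypothetical file
	it = single_iteration("/tmp/out", "demo", ped, c(), numsds=2,
	                      cmethod="none", tmethod="sd",
	                      control_tag=NULL, cases_tag=NULL)
	it$outliers                                     # ids trimmed on this pass
	generate_directory_name("/tmp/out", "demo", 1)  # "/tmp/out/output_demo_iteration_0"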
diff -r 000000000000 -r 64e75e21466e R_functions/plotting_functions.R
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/R_functions/plotting_functions.R	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,225 @@
+## Plotting and grouping ##
+# input_data: some number of 2d observations. Each row represents a single observation;
+#   column 1 = variable 1, to be plotted on the x-axis,
+#   column 2 = variable 2, to be plotted on the y-axis
+# groups: integer vector with the same number of entries as there are rows in the input data,
+#   representing which group each observation belongs to. Negative numbers are not plotted
+# tags: the tag to put on the legend for each group
+# plot_colors: colors to use for each group
+# plot_symbols: symbols to use for each group
+# plot_title: as the name suggests
+# plot_filename: if this is not null, the graph is output to a png with the specified name
+plot_by_groups = function(input_data, groups, tags, plot_colors, plot_symbols, plot_title, plot_filename=NULL) {
+  if (!is.null(plot_filename)) {
+    png(plot_filename)
+  }
+  # leave some extra room on the RHS for the legend
+  par(mar=c(5.1, 4.1, 4.1, 8.1))
+  x = as.numeric(input_data[, 1])
+  y = as.numeric(input_data[, 2])
+  gids = sort(unique(groups[which(groups >= 0)]))
+  n = length(gids)
+
+  # first set up the plot area to the correct dimensions
+  plot(x, y, col="white")
+
+  for (i in 1:n) {
+    gid = gids[i]
+    pts_x = x[which(groups == gid)]
+    pts_y = y[which(groups == gid)]
+    pts_color = plot_colors[i]
+    pts_symbol = plot_symbols[i]
+    points(pts_x, pts_y, col=pts_color, pch=pts_symbol)
+  }
+  legend(x="topright",
+         xpd=TRUE,
+         inset=c(-0.3, 0),
+         col=plot_colors,
+         pch=plot_symbols,
+         legend=tags,
+         text.col=plot_colors)
+  title(main=plot_title)
+  if (!is.null(plot_filename)) {
+    dev.off()
+  }
+}
+
+# Controls vs cases plot. Colour controls blue, cases red;
+# samples which are neither control nor case are black.
+setup_cvc_plot = function(pca_data, control_tag, cases_tag) {
+  plot_info = list()
+  nsamples = length(pca_data$ids)
+  groups = rep(1, nsamples)
+  control_legend = paste0("CO: ", control_tag)
+  cases_legend = paste0("CA: ", cases_tag)
+  if (!is.null(control_tag)) {
+    groups[grep(control_tag, pca_data$ids)] = 2
+  }
+  if (!is.null(cases_tag)) {
+    groups[grep(cases_tag, pca_data$ids)] = 3
+  }
+  res = sort(unique(groups))
+  if (length(res) == 1) {
+    tags = c("UNKNOWN")
+    plot_colors = c("black")
+  } else if (length(res) == 3) {
+    tags = c("UNKNOWN", control_legend, cases_legend)
+    plot_colors = c("black", "blue", "red")
+  } else {
+    if (all(res == c(1, 2))) {
+      tags = c("UNKNOWN", control_legend)
+      plot_colors = c("black", "blue")
+    } else if (all(res == c(1, 3))) {
+      tags = c("UNKNOWN", cases_legend)
+      plot_colors = c("black", "red")
+    } else {
+      tags = c(control_legend, cases_legend)
+      plot_colors = c("blue", "red")
+    }
+  }
+  plot_info$groups = groups
+  plot_info$tags = tags
+  plot_info$plot_colors = plot_colors
+  plot_info$plot_symbols = rep(1, length(res))
+  plot_info$plot_title = "Control vs Cases Plot"
+  return(plot_info)
+}
+
+# Outliers plot; colour outliers red, non-outliers green
+setup_ol_plot = function(pca_data, outliers) {
+  plot_info = list()
+  nsamples = dim(pca_data$values)[1]
+  groups = 1:nsamples
+  groups[outliers] = 1
+  groups[setdiff(1:nsamples, outliers)] = 2
+  plot_info$groups = groups
+  plot_info$tags = c("outliers", "good data")
+  plot_info$plot_colors = c("red", "green")
+  plot_info$plot_symbols = c(1, 20)
+  plot_info$plot_title = "Outliers Plot"
+  return(plot_info)
+}
+
+# Standard deviations plot; colour samples by s.dev
+setup_sd_plot = function(pca_data) {
+  plot_info = list()
+  nsamples = dim(pca_data$values)[1]
+  pc1 = as.numeric(pca_data$values[, 1])
+  pc2 = as.numeric(pca_data$values[, 2])
+  pc1_sds = as.numeric(lapply(pc1, compute_numsds, pc1))
+  pc2_sds = as.numeric(lapply(pc2, compute_numsds, pc2))
+
+  groups = 1:nsamples
+  groups[get_sdset2d(pc1_sds, pc2_sds, 1)] = 1
+  groups[get_sdset2d(pc1_sds, pc2_sds, 2)] = 2
+  groups[get_sdset2d(pc1_sds, pc2_sds, 3)] = 3
+  groups[union(which(pc1_sds > 3), which(pc2_sds > 3))] = 4
+  plot_info$groups = groups
+  plot_info$tags = c("SD = 1", "SD = 2", "SD = 3", "SD > 3")
+  plot_info$plot_colors = rainbow(4)
+  plot_info$plot_symbols = rep(20, 4)
+  plot_info$plot_title = "Standard Deviations Plot"
+  return(plot_info)
+}
+
+# Plot samples, with coloured clusters. Rejected clusters use
+# a cross symbol instead of a filled circle
+setup_cluster_plot = function(pca_data, clusters, rc=NULL) {
+  plot_info = list()
+  groups = clusters
+  ids = sort(unique(groups))
+  n = length(ids)
+  tags = 1:n
+  for (i in 1:n) {
+    tags[i] = sprintf("cluster %s", ids[i])
+  }
+  outliers = which(groups == 0)
+  if (length(outliers) != 0) {
+    tags[1] = "outliers"
+  }
+  plot_colors = rainbow(n)
+  plot_symbols = rep(20, n)
+  if (length(outliers) != 0) {
+    plot_symbols[1] = 1
+  }
+  # labelling for rejected clusters
+  if (!is.null(rc)) {
+    for (i in 1:n) {
+      if ((ids[i] != 0) && (ids[i] %in% as.numeric(rc))) {
+        tags[i] = "rej. clust."
+        plot_symbols[i] = 4
+      }
+    }
+  }
+  plot_info$groups = groups
+  plot_info$tags = tags
+  plot_info$plot_colors = plot_colors
+  plot_info$plot_symbols = plot_symbols
+  plot_info$plot_title = "Cluster Plot"
+  return(plot_info)
+}
+
+# Plot samples, colouring by ethnicity. Different ethnicities also
+# have different symbols.
+setup_ethnicity_plot = function(pca_data, ethnicity_data) {
+  plot_info = list()
+  nsamples = dim(pca_data$values)[1]
+  eth = 1:nsamples
+
+  for (i in 1:nsamples) {
+    sample_id = pca_data$ids[i]
+    eth[i] = as.character(ethnicity_data[sample_id, "population"])
+    if (is.na(eth[i])) {
+      eth[i] = "UNKNOWN"
+    }
+  }
+  n = length(unique(eth))
+  plot_info$groups = as.numeric(as.factor(eth))
+  plot_info$tags = sort(unique(eth))
+  plot_info$plot_colors = rainbow(n)
+  plot_info$plot_symbols = 1:n
+  plot_info$plot_title = "Ethnicity Plot"
+  return(plot_info)
+}
+
+draw_cutoffs = function(input_data, x, y, numsds) {
+  pcx = as.numeric(input_data[x, ])
+  pcy = as.numeric(input_data[y, ])
+
+  vlines = c(median(pcx) - numsds*sd(pcx),
+             median(pcx) + numsds*sd(pcx))
+  hlines = c(median(pcy) - numsds*sd(pcy),
+             median(pcy) + numsds*sd(pcy))
+  abline(v=vlines)
+  abline(h=hlines)
+}
+
+# The following helper functions are used in the 'setup_sd_plot' function.
+# Given a list of standard deviations, work out which points are n standard deviations away
+get_sdset2d = function(x1, x2, n) {
+  if (n == 1) {
+    ind = intersect(which(x1 == 1), which(x2 == 1))
+  } else {
+    lower = get_sdset2d(x1, x2, n - 1)
+    upper = union(which(x1 > n), which(x2 > n))
+    xset = union(lower, upper)
+    bigset = union(which(x1 == n), which(x2 == n))
+    ind = setdiff(bigset, xset)
+  }
+  return(ind)
+}
+
+# Work out how many standard deviations away from the sample median a single point is.
+# The accuracy of this decreases for outliers, as the error in the estimated sd is
+# multiplied
+compute_numsds = function(point, x) {
+  x_sd = sd(x)
+  sum = x_sd
+  m = median(x)
+  i = 1
+  while (abs(point - m) > sum) {
+    i = i + 1
+    sum = sum + x_sd
+  }
+  return(i)
+}
\ No newline at end of file
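A toy call of plot_by_groups() with synthetic points (a sketch; the output path is hypothetical)::

	source("R_functions/plotting_functions.R")
	xy = cbind(c(1, 2, 3, 8, 9), c(1, 2, 3, 8, 9))
	plot_by_groups(xy, groups=c(1, 1, 1, 2, 2),
	               tags=c("blob A", "blob B"),
	               plot_colors=c("red", "blue"),
	               plot_symbols=c(20, 4),  # filled circle vs cross
	               plot_title="toy cluster plot",
	               plot_filename="/tmp/toy_plot.png")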
diff -r 000000000000 -r 64e75e21466e iterative_pca.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/iterative_pca.py	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,393 @@
+import os
+import subprocess
+import sys
+import argparse
+import datetime
+import csv
+
+import jinja2
+
+import pedify
+
+JINJA_ENVIRONMENT = jinja2.Environment(
+    loader=jinja2.FileSystemLoader(os.path.dirname(__file__)),
+    extensions=['jinja2.ext.autoescape'],
+    autoescape=True)
+DATA_TYPES = ["variant_data", "rds", "rdata", "numeric_ped"]
+CLUSTERING_METHODS = ["dbscan", "hclust", "kmeans"]
+CLUSTER_TRIMMING = ["sd", "mcd", "dbscan_outliers_only"]
+HTML_TEMPLATE_NAME = "pca_report.html"
+DEFAULT_SD_CUTOFF = 2
+PATH_TO_R_FUNCTIONS = "{}/R_functions".format(os.path.dirname(os.path.abspath(__file__)))
+
+
+def main():
+    ###===ARGUMENT PARSING===###
+    parser = argparse.ArgumentParser()
+    parser.add_argument("datafile_name", help="name of data file")
+    parser.add_argument("data_type", type=str.lower, choices=DATA_TYPES, help="type of data file")
+    parser.add_argument("iterations", help="max number of iterations to run", type=int)
+    parser.add_argument("--control_tag", help="substring present in the ids of all control samples, e.g. LP")
+    parser.add_argument("--cases_tag", help="substring present in the ids of all case samples, e.g. HAPS")
+    parser.add_argument("--config_file", help="name of file containing config info")
+    parser.add_argument("--clustering_flag", help="flag to indicate whether to do clustering", action="store_true")
+    parser.add_argument("--clustering_method", type=str.lower, choices=CLUSTERING_METHODS,
+                        help="algorithm used to find clusters")
+    parser.add_argument("--cluster_trimming", type=str.lower, choices=CLUSTER_TRIMMING,
+                        help="method used to identify and trim outlier clusters")
+    parser.add_argument("-s", "--sd_cutoff", help="number of standard deviations at which to cut outliers")
+    parser.add_argument("-o", "--out", help="base name of output files and directories; defaults to datafile name")
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument("-r", "--relative_prefix", help="RELATIVE path CONTAINING the output directory")
+    group.add_argument("-a", "--absolute_prefix", help="ABSOLUTE path CONTAINING the output directory")
+    parser.add_argument("--html_file", help="ABSOLUTE PATH of output html file")
+    parser.add_argument("--ethnicity_file", help="file containing data about the ethnicity of the samples")
+    parser.add_argument("--galaxy", help="flag to indicate we are running this script in Galaxy",
+                        action="store_true")
+    parser.add_argument("--reject_snps", help="file containing SNP ids to ignore/remove")
+    parser.add_argument("--reject_samples", help="file containing sample ids to ignore/remove")
+
+    args = parser.parse_args()
+
+    # enforce conditionally required args
+    if args.data_type == "variant_data":
+        if args.config_file == None:
+            parser.error("Input variant data files require a complementary config file\
+ specified with '--config_file <filename>' flag")
+
+    if args.config_file:
+        if args.data_type != "variant_data":
+            print "Warning! Your config file will be ignored as the input data is not a\
+ text file containing variant data"
+
+    if args.clustering_flag:
+        if (args.clustering_method == None) or (args.cluster_trimming == None):
+            parser.error("If --clustering_flag is set,\
+ --clustering_method and --cluster_trimming options must also be specified")
+
+    if args.clustering_method or args.cluster_trimming:
+        if args.clustering_flag == False:
+            parser.error("--clustering_method and --cluster_trimming\
+ options only valid if --clustering_flag is also set")
+
+    # dbscan_outliers_only is valid only if the cluster method == "dbscan"
+    if args.cluster_trimming == "dbscan_outliers_only":
+        if args.clustering_method != "dbscan":
+            parser.error("dbscan_outliers_only cluster trimming method only\
+ valid if --clustering_method == dbscan")
+
+    dfname = args.datafile_name
+    data_type = args.data_type
+    n = args.iterations
+    control_tag = args.control_tag
+    cases_tag = args.cases_tag
+    cfname = args.config_file
+    cflag = args[...]
+
+[...]
+
+[...]ns:
+	The html output that was written to the html file
+"""
+def produce_report(hfname, dfname, lfname, cflag, cmethod, tmethod, sd_cutoff, iteration_list, galaxy_flag):
+    hfdir = os.path.dirname(hfname)
+    htmldoc = open(hfname, 'w')
+    iteration_data = []
+    for it in iteration_list:
+        outdir = it['outdir']
+        if galaxy_flag:
+            dirs = outdir.split("/")
+            relpath = "{}/{}".format(dirs[-2], dirs[-1])
+        else:
+            relpath = os.path.relpath(outdir, hfdir)
+        it_dict = {}
+
+        ofname = "{}/{}".format(outdir, it['ofname'])
+        xsname = "{}/{}".format(outdir, it['xsname'])
+        of = open(ofname, 'r')
+        xf = open(xsname, 'r')
+        ol = []
+        xl = []
+        for line in of:
+            ol.append(line.split(','))
+        for line in xf:
+            xl.append(line.split(','))
+        of.close()
+        xf.close()
+        # fix the paths to the images
+        rel_plot_paths = []
+        for pp in it['plot_paths']:
+            rel_plot_paths.append("{}/{}".format(relpath, pp))
+
+        it_dict['count'] = it['count']
+        it_dict['ol'] = ol
+        it_dict['xl'] = xl
+        it_dict['rel_plot_paths'] = rel_plot_paths
+        iteration_data.append(it_dict)
+
+    logcontent = open(lfname, 'r').read()
+    template_values = {
+        "date": datetime.datetime.now(),
+        "dfname": dfname,
+        "lfname": lfname,
+        "logcontent": logcontent,
+        "iteration_data": iteration_data,
+        "cflag": cflag,
+        "cmethod": cmethod,
+        "tmethod": tmethod,
+        "sd_cutoff": sd_cutoff,
+    }
+    template = JINJA_ENVIRONMENT.get_template(HTML_TEMPLATE_NAME)
+    rendered_template = template.render(template_values)
+    htmldoc.write(rendered_template)
+    htmldoc.close()
+    return rendered_template
+
+"""
+Generate the expected directory structure of the root output directory,
+given the basename, number of iterations, and number of expected plots output by
+the R script
+"""
+def create_output_dict(outdir, basename, n, num_plots):
+    iteration_list = []
+    for i in range(n):
+        it_dict = {}
+        i_outdir = "{}/output_{}_iteration_{}".format(outdir, basename, i)
+        it_dict['outdir'] = i_outdir
+        it_dict['ofname'] = "{}_outliers.txt".format(basename)
+        it_dict['xsname'] = "{}_xfile.txt".format(basename)
+        it_dict['count'] = i
+        it_dict['plot_paths'] = []
+        for j in range(num_plots):
+            it_dict['plot_paths'].append("{}_plot_number_{}.png".format(basename, j+1))
+        iteration_list.append(it_dict)
+    return iteration_list
+
+
+###########################UTILITY########################################
+def prepare_directory(outdir, logfile):
+    # set up output directory
+    logfile.write("##Setting up output directory:{}##\n".format(outdir))
+    subprocess.call("rm -rf '{}'".format(outdir), shell=True)
+    logfile.write("rm -rf '{}'\n".format(outdir))
+    subprocess.call("mkdir -p '{}'".format(outdir), shell=True)
+    logfile.write("mkdir -p '{}'\n".format(outdir))
+
+# takes a list of dicts, merges them into a single big dict
+def merge_dicts(dictList):
+    result = {}
+    for d in dictList:
+        if type(d) == dict:
+            result.update(d)
+    return result
+
+# Some odd rules regarding character escaping and such in galaxy: fix those here
+def format_galaxy_config_file(cfname):
+    CHAR_CONVERSION = {
+        'g': '>',
+        'l': '<',
+        'e': '==',
+        'ge': '>=',
+        'le': '<='
+    }
+    cfile = open(cfname, 'r')
+    lines = cfile.readlines()
+    cfile.close()
+
+    newlines = []
+    section = None
+    for l in lines:
+        nl = l
+        if l[0] == '#':
+            if l.strip() == "#numeric_filters":
+                section = "numeric_filters"
+            else:
+                section = None
+        else:
+            if section:
+                tokens = l.split(',')
+                op = tokens[2]
+                if op in CHAR_CONVERSION.keys():
+                    newop = CHAR_CONVERSION[op]
+                elif op in CHAR_CONVERSION.values():
+                    newop = op
+                else:
+                    # error code 1; bad op codes
+                    return 1
+                tokens[2] = newop
+                nl = ','.join(tokens)
+
+        nl = nl.replace("__pd__", '#')
+        newlines.append(nl)
+
+    cfile = open(cfname, 'w')
+    cfile.writelines(newlines)
+    cfile.close()
+    return 0
+##############################################################################
+
+if __name__ == "__main__":
+    main()
diff -r 000000000000 -r 64e75e21466e iterative_pca_plot.R
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/iterative_pca_plot.R	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,97 @@
+library(flashpcaR)
+library(dbscan)
+library(cluster)
+
+## MAIN ###
+# get command line arguments
+CLI_FLAG = 1
+if (CLI_FLAG == 1) {
+  n = commandArgs(trailingOnly=TRUE)[1]
+  outdir = commandArgs(trailingOnly=TRUE)[2]
+  basename = commandArgs(trailingOnly=TRUE)[3]
+  data_source = commandArgs(trailingOnly=TRUE)[4]
+  data_type = commandArgs(trailingOnly=TRUE)[5]
+  eth_filename = commandArgs(trailingOnly=TRUE)[6]
+  control_tag = commandArgs(trailingOnly=TRUE)[7]
+  if (control_tag == "None") {control_tag = NULL}
+  cases_tag = commandArgs(trailingOnly=TRUE)[8]
+  if (cases_tag == "None") {cases_tag = NULL}
+  numsds = as.numeric(commandArgs(trailingOnly=TRUE)[9])
+  cmethod = commandArgs(trailingOnly=TRUE)[10]
+  tmethod = commandArgs(trailingOnly=TRUE)[11]
+  path_to_r_functions = commandArgs(trailingOnly=TRUE)[12]
+  xsamples_filename = commandArgs(trailingOnly=TRUE)[13]
+  xsnps_filename = commandArgs(trailingOnly=TRUE)[14]
+} else {
+  n = 10
+  basename = "test_eth2"
+  data_source = "./data/halo1_numeric.ped"
+  data_type = "numeric_ped"
+  outdir = paste0(getwd(), "/full_output_", basename)
+  #data_source = "./data/HapMap3_flashPCA_data.rds"
+  #data_type = "rds"
+  #eth_filename = "./data/HapMap3_ethnicity_rf.txt"
+  eth_filename = "./data/Halo_ethnicity_rf.txt"
+  control_tag = "HAPS"
+  cases_tag = NULL
+  numsds = 1.1
+  cmethod = "hclust"
+  tmethod = "mcd"
+  path_to_r_functions = paste0(getwd(), "/R_functions/")
+  xsamples_filename = "./xfiles/halo1_xsamples.txt"
+  xsnps_filename = "./xfiles/halo1_xsnps.txt"
+}
+
+# get source code
+source(paste0(path_to_r_functions, "/", "plotting_functions.R"))
+source(paste0(path_to_r_functions, "/", "pca_helpers.R"))
+source(paste0(path_to_r_functions, "/", "pipeline_code.R"))
+source(paste0(path_to_r_functions, "/", "clustering_functions.R"))
+source(paste0(path_to_r_functions, "/", "outlier_trimming.R"))
+
+if (CLI_FLAG != 1) {
+  unlink(paste0(getwd(), "/", "full_output_", basename), recursive=TRUE)
+}
+
+# read in data
+ped_data = get_source_data(data_source, data_type)
+eth_data = parse_ethnicity_file(eth_filename)
+xsamples = get_first_column(xsamples_filename)
+xsnps = get_first_column(xsnps_filename)
+
+# do the pca and prepare the plots
+iterations = list()
+for (i in 1:n) {
+  fpd = filter_ped_data(ped_data, xsamples, xsnps)
+  iterations[[i]] = single_iteration(outdir, basename, fpd, xsamples, numsds,
+                                     cmethod, tmethod, control_tag, cases_tag, ethnicity_data=eth_data)
+  iterations[[i]]$dirname = generate_directory_name(outdir, basename, i)
+  xsamples = iterations[[i]]$xsamples
+}
+
+# create folders and plots
+for (i in 1:n) {
+  titer = iterations[[i]]
+  dir.create(titer$dirname, recursive=TRUE)
+  num_plots = titer$num_plots
+
+  for (j in 1:num_plots) {
+    plot_filename = sprintf("%s/%s_plot_number_%d.png", titer$dirname, basename, j)
+    plot_by_groups(titer$pca_data$values[, c(1, 2)],
+                   titer$plots[[j]]$groups,
+                   titer$plots[[j]]$tags,
+                   titer$plots[[j]]$plot_colors,
+                   titer$plots[[j]]$plot_symbols,
+                   titer$plots[[j]]$plot_title,
+                   plot_filename=plot_filename)
+  }
+
+  # write outliers to file
+  xfilename = paste0(titer$dirname, "/", basename, "_xfile.txt")
+  outliers_filename = paste0(titer$dirname, "/", basename, "_outliers.txt")
+  xscon = add_ethnicity_data(titer$old_xsamples, eth_data)
+  olcon = add_ethnicity_data(titer$outliers, eth_data)
+  write.table(xscon, file=xfilename, row.names=FALSE, col.names=TRUE, sep=",")
+  write.table(olcon, file=outliers_filename, row.names=FALSE, col.names=TRUE, sep=",")
+}
\ No newline at end of file
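This script is normally launched by iterative_pca.py, but it can also be driven directly; a hypothetical invocation matching the fourteen positional commandArgs() reads above (all paths invented)::

	Rscript iterative_pca_plot.R 5 /tmp/full_output_demo demo data/halo1_numeric.ped \
		numeric_ped data/Halo_ethnicity_rf.txt None None 2 hclust mcd \
		./R_functions xfiles/xsamples.txt xfiles/xsnps.txt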
diff -r 000000000000 -r 64e75e21466e pca_pipeline_def.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pca_pipeline_def.xml	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,377 @@
+<tool id="pca_pipeline" name="PCA Pipeline" version="1.0.0">
+	<description>Iterative PCA pipeline</description>
+	<requirements>
+		<requirement type="package" version="2.8">Jinja2</requirement>
+		<!--
+		<requirement type="package" version="3.2.1">R</requirement>
+		<requirement type="package" version="1.2.5">flashpcaR</requirement>
+		-->
+	</requirements>
+	<command interpreter="python">
+		<![CDATA[
+			iterative_pca.py
+			$datafile
+			$data_type
+			$iterations
+			--sd_cutoff $sd_cutoff
+			--absolute_prefix $output.files_path
+			--html_file $output
+			#if $control_tag
+				--control_tag $control_tag
+			#end if
+			#if $cases_tag
+				--cases_tag $cases_tag
+			#end if
+			#if $data_type.value == "variant_data"
+				#if $user_configfile
+					--config_file $user_configfile
+				#else
+					--config_file $cfile
+				#end if
+			#end if
+			#if $clustering_flag.value == "yes"
+				--clustering_flag
+				--clustering_method $clustering_method
+				--cluster_trimming $cluster_trimming
+			#end if
+			#if $ethnicity_file
+				--ethnicity_file $ethnicity_file
+			#end if
+			#if $xsamples_file
+				--reject_samples $xsamples_file
+			#end if
+			#if $xsnps_file
+				--reject_snps $xsnps_file
+			#end if
+			--galaxy
+		]]>
+	</command>
+	<configfiles>
+		<configfile name="cfile">#control
+control_tag,#Sample,$control_tag
+cases_tag,#Sample,$cases_tag
+#column_names
+genotype_column,$genotype_column
+reference_column,$reference_column
+alternate_column,$alternate_column
+sample_id_column,$sample_id_column
+chromosome_column,$chromosome_column
+position_column,$position_column
+variant_id_column,$variant_id_column
+#numeric_filters
+#for $i, $f in enumerate($numeric_filters)
+$f.filter_name,$f.column_name,$f.operation,$f.cutoff
+#end for
+#string_filters
+#for $i, $s in enumerate($string_filters)
+$s.filter_name,$s.column_name,$s.exact_flag,$s.accept_flag
+$(','.join($s.patterns.split('\n')))
+#end for</configfile>
+	</configfiles>
+	<inputs>
+		<param name="datafile" type="data" label="Input datafile"/>
+		<param name="data_type" type="select" label="Type of input data file">
+			<option value="variant_data">Variant Data</option>
+			<option value="numeric_ped">Numeric PED File</option>
+			<option value="rdata">RData file</option>
+		</param>
+		<param name="iterations" type="integer" value="1" label="Number of iterations to complete"/>
+		<param name="clustering_flag" type="select" display="radio" label="Do clustering?">
+			<option value="yes">Yes</option>
+			<option value="no">No</option>
+		</param>
+		<param name="clustering_method" type="select" label="Clustering method (ignore if you selected 'No' for 'Do clustering?')">
+			<option value="dbscan">DBSCAN</option>
+			<option value="hclust">Hierarchical Clustering</option>
+		</param>
+		<param name="cluster_trimming" type="select" label="Algorithm used to identify and remove cluster outliers (ignore if you selected 'No' for 'Do clustering?')">
+			<option value="sd">Standard Deviations</option>
+			<option value="mcd">Mean Cluster Distance</option>
+			<option value="dbscan_outliers_only">DBSCAN outliers only (only valid if DBSCAN is selected as the clustering method)</option>
+		</param>
+		<param name="sd_cutoff" type="float" value="2" label="Strictness of outlier trimming. Lower = more outliers cut at each stage; higher = fewer outliers cut at each stage."/>
+		<!-- Control and cases tag info -->
+		<param name="control_tag" type="text" value="LP" label="Control Tag"/>
+		<param name="cases_tag" type="text" value="HAPS" label="Cases Tag"/>
+		<param name="user_configfile" type="data" format="txt" optional="true" label="Optional user provided config file.
+			NB:
+			- If this is set, ALL the fields below will be ignored.
+			- If no input is provided, and the input data is a text file containing variant data, ALL the fields below except the filters MUST be filled in"/>
+		<param name="ethnicity_file" type="data" format="[...]
+
+[...]
+
+[...] in the variant file to filter on. If this column is not found, a warning is displayed
+		- Column 3: The criteria of the filter which must be passed in order for us to accept a particular row. E.g. less than, greater than
+		- Column 4: The cutoff value to be compared against.
+
+	- *'#string_filters' section:* Each filter takes up two lines.
+
+		- Line 1, Column 1: Arbitrary filter name
+		- Line 1, Column 2: Column name to filter on
+		- Line 1, Column 3: Do the patterns have to be exact matches, or just substrings? E.g. if pattern = "HAPS" and the string being compared = "HAPS-909090", if 'exact' was true this would not be a successful match, whereas if 'not_exact' was true it would be a match.
+		- Line 1, Column 4: What to do with the row if a successful match is detected, e.g. accept or reject
+		- Line 2: A comma separated list of patterns to match on
+
+
+- **Ethnicity file:** An ethnicity file containing ethnicity data, and possibly other data, on the samples. Note this data is not used to sort the input and has no effect on the PCA itself. It is used only to label the results of the output.
+
+  Requirements:
+
+	- tab delimited
+	- Must have at least two columns
+	- First column has sample IDs
+	- Second column has ethnicities
+	- First row must be a header
+
+  The first few lines of a correctly formatted ethnicity file are given below::
+
+	IID	population	Halo1.or.2.	BloodAge	SalivaAge	COB	ethnicity
+	LP-10000001	AUSTRALIAN	Halo2 - LP-BC	67	NA	Australia	australian
+	LP-10000003	AUSTRALIAN	Halo1	45	NA	Australia	australian southern_european
+	LP-10000005	AUSTRALIAN	Halo1	73	NA	Australia	australian southern_european
+	LP-10000008	EUROPE	Halo1	54	NA	South Eastern Europe	south_east_european
+	LP-10000009	OTHER	Halo1	65	NA	Southern & East Africa	jewish
+
+- **Exclude samples file:** A text file containing exact ids of samples to exclude from the PCA.
+
+  Requirements:
+
+	- single column
+	- sample ids separated by newlines
+	- one sample id per line
+
+  Example::
+
+	HAPS-90573
+	HAPS-90578R
+	HAPS-110542
+	HAPS-110605
+	HAPS-110620
+	HAPS-110638
+	HAPS-110649
+	HAPS-110668
+	HAPS-110799
+	HAPS-110813
+	HAPS-110959
+	HAPS-111186
+	HAPS-111298
+	HAPS-111404
+	HAPS-111493
+	HAPS-111512
+	HAPS-111538
+
+- **Exclude SNPs file:** A text file containing exact ids of SNPs to exclude from the PCA.
+
+  Requirements:
+
+	- single column
+	- snp ids separated by newlines
+	- one snp id per line
+
+  Example::
+
+	rs72896283
+	rs7534447
+	rs4662775
+	rs10932813
+	rs10932816
+	rs12330369
+	rs1802904
+	rs10902762
+	rs9996817
+	rs6446393
+	rs871133
+	rs4301095
+	rs941849
+	rs6917467
+	rs75834296
+	rs142922667
+
+- **Required Column Headers:** If a variant text file is the primary input, the following information MUST be provided, either through the config file, or by filling out the corresponding fields in the tool submission form.
+
+	- Sample IDs: Name of the column containing the sample ids
+	- Chromosome: Name of the column indicating which chromosome the SNP is found on
+	- Position: Name of the column indicating at which position on the chromosome the SNP is found
+	- Genotype: The genotype of the sample for this site
+	- Reference: The 'normal'/'common' genotype for this site
+	- Alternate: The alternate genotype for this site
+	- Variant IDs: Name of the column indicating the ID of the SNP
+
+- **Numeric Filters:** See Configuration file section
+- **String Filters:** See Configuration file section
+
+Other Output
+------------
+
+- Tool will output a root folder containing the HTML file and all the plots, placed in directories separated by iteration.
+- If the input data was a variant file, the output folder will also contain a numeric ped file, generated before the first iteration, as well as a config file. The config file is either the exact one passed in by the user, or one automatically generated from the form input, which can be used for future PCA runs.
+
+	]]>
+
+
+	</help>
+</tool>
\ No newline at end of file
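Pulling the config-file rules in the help text together, a hypothetical configuration for command-line use might look like the following (the column names GT, REF, ALT, QUAL, CHROM, POS and ID are invented; real names must match the variant file's header)::

	#control
	control_tag,#Sample,LP
	cases_tag,#Sample,HAPS
	#column_names
	genotype_column,GT
	reference_column,REF
	alternate_column,ALT
	sample_id_column,#Sample
	chromosome_column,CHROM
	position_column,POS
	variant_id_column,ID
	#numeric_filters
	quality_filter,QUAL,>,30
	#string_filters
	chromosome_filter,CHROM,exact,reject
	chrX,chrY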
diff -r 000000000000 -r 64e75e21466e pca_report.html
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pca_report.html	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,78 @@
+{% macro display_iterations(ilist) %}
+	{% for it in ilist %}
+		{{ display_single_iteration(it) }}
+	{% endfor %}
+{% endmacro %}
+
+{% macro display_single_iteration(idata) %}
+<div><h3>Iteration {{ idata.count }}</h3></div>
+	{% for pp in idata.rel_plot_paths %}
+	<img src="{{ pp }}">
+	{% endfor %}
+	<div>
+		<strong>{{ idata.xl|length - 1 }} samples were excluded from the PCA</strong>
+		<details>
+			<table>
+				{% for x in idata.xl %}
+				<tr>
+					{% if loop.first %}
+						{% for cell in x %}
+						<th>{{ cell }}</th>
+						{% endfor %}
+					{% else %}
+						{% for cell in x %}
+						<td>{{ cell }}</td>
+						{% endfor %}
+					{% endif %}
+				</tr>
+				{% endfor %}
+			</table>
+		</details>
+	</div>
+	<div>
+		<strong>Detected {{ idata.ol|length - 1 }} outliers</strong>
+		<details>
+			<table>
+				{% for x in idata.ol %}
+				<tr>
+					{% if loop.first %}
+						{% for cell in x %}
+						<th>{{ cell }}</th>
+						{% endfor %}
+					{% else %}
+						{% for cell in x %}
+						<td>{{ cell }}</td>
+						{% endfor %}
+					{% endif %}
+				</tr>
+				{% endfor %}
+			</table>
+		</details>
+	</div>
+</div>
+{% endmacro %}
+
+<!DOCTYPE html>
+<html>
+<head>
+	<title>PCA Report</title>
+</head>
+<body>
+	<h1>PCA Report</h1>
+	<div><strong>Date:</strong> {{ date }}</div>
+	<div><strong>Input data file:</strong> {{ dfname }}</div>
+	<div><strong>Log File:</strong> {{ lfname }}
+		<details><pre>{{ logcontent }}</pre></details>
+	</div>
+	<div><strong>Standard Deviation Multiplier:</strong> {{ sd_cutoff }}</div>
+	<div><strong>Clustering Method:</strong> {{ cmethod }}</div>
+	<div><strong>Cluster Trimming:</strong> {{ tmethod }}</div>
+
+	<div><h2>Iteration data</h2>
+	{{ display_iterations(iteration_data) }}
+	</div>
+</body>
+</html>
+
+
diff -r 000000000000 -r 64e75e21466e pedify.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pedify.py	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,395 @@
+import sys
+import csv
+import argparse
+
+DEBUG = 0
+
+REQ_KEYS = ['genotype_column', 'reference_column', 'alternate_column',
+            'sample_id_column', 'chromosome_column', 'position_column',
+            'variant_id_column']
+
+GENOTYPE_DICT = {
+    "'0/0": "hom_ref",
+    "'0/1": "het",
+    "'1/1": "hom_alt",
+    "'1/2": "tri_allelic"
+}
+
+GENOTYPE_TO_NUMERIC = {
+    "'0/0": 0,
+    "'0/1": 1,
+    "'1/1": 2,
+    "'1/2": 2
+}
+
+class PedConverter:
+    def __init__(self):
+        self.cfg = None
+        self.samples = {}
+        self.sites = {}
+        self.xsamples = []
+
+    def verify_column_names(self, datafile_header):
+        # check all the config column names actually exist in the data file
+        for col in self.cfg.col_names.values():
+            if col not in datafile_header:
+                print "The '{}' column was not found in the datafile! Double check your config file is correct. Exiting...".format(col)
+                sys.exit(1)
+
+    def verify_filters(self, datafile_header):
+        # print warning messages if filters are invalid
+        all_filters = self.cfg.nfilters.copy()
+        all_filters.update(self.cfg.sfilters)
+        for key in all_filters:
+            col_name = all_filters[key]["col_name"]
+            if col_name not in datafile_header:
+                print "Warning! The '{}' filter was not applied as the datafile does not contain the column '{}'".format(key, col_name)
+
+    def read_config_file(self, cfname):
+        self.cfg = ConfigSettings()
+        rc = self.cfg.parse_config_file(cfname)
+        return rc
+
+    def read_data_file(self, dfname):
+        if (self.cfg == None) or (not self.cfg.is_valid()):
+            return 1
+
+        datafile = open(dfname, 'r')
+        dreader = csv.DictReader(datafile, delimiter='\t')
+        # verify datafile data matches config file
+        self.verify_column_names(dreader.fieldnames)
+        self.verify_filters(dreader.fieldnames)
+        all_sample_ids = set()
+        i = 0
+
+        for row in dreader:
+            failed_filters = self.filter_all(row)
+            sample_key = row[self.cfg.col_names['sample_id_column']]
+            all_sample_ids.add(sample_key)
+            if not failed_filters:
+                # add to sample dict
+                # key is a tuple made up of which chromosome the snp is found on
+                # and the position on the chromosome itself
+                SNP_key = (row[self.cfg.col_names['chromosome_column']], int(row[self.cfg.col_names['position_column']]))
+                genotype = row[self.cfg.col_names['genotype_column']]
+
+                # create a dictionary for each sample (person); each person is associated
+                # with another dictionary of all the SNPs found in that person
+                if sample_key not in self.samples:
+                    self.samples[sample_key] = {SNP_key: genotype}
+                else:
+                    self.samples[sample_key][SNP_key] = genotype
+
+                # create a dict of all the sites where SNPs exist
+                if SNP_key not in self.sites:
+                    # generate arbitrary IDs if there is no pre-existing ID for the SNP
+                    if row[self.cfg.col_names['variant_id_column']] == '.':
+                        SNP_id = "SNP_" + str(i)
+                    else:
+                        SNP_id = row[self.cfg.col_names['variant_id_column']]
+
+                    SNP_data = {'ref_col': row[self.cfg.col_names['reference_column']],
+                                'alt_col': row[self.cfg.col_names['alternate_column']],
+                                'SNP_id': SNP_id}
+                    self.sites[SNP_key] = SNP_data
+            i += 1
+
+        # make sure every sample contains a genotype for every SNP found
+        for sample_key in self.samples:
+            this_sample = self.samples[sample_key]
+            for SNP_key in self.sites:
+                if SNP_key not in this_sample:
+                    this_sample[SNP_key] = "'0/0"
+        datafile.close()
+
+        # get the list of samples which were filtered out
+        self.xsamples = list(all_sample_ids.difference(set(self.samples.keys())))
+        return 0
+
+    # returns keys of the specific filter(s) that failed
+    def filter_numeric(self, row):
+        failed_filters = set()
+        for key in self.cfg.nfilters.keys():
+            nfilter = self.cfg.nfilters[key]
+            cutoff = float(nfilter["cutoff"])
+            op = nfilter["op"]
+            col_name = nfilter["col_name"]
+            if col_name in row.keys():
+                cv = float(row[col_name])
+                if not string_as_operator(cv, cutoff, op):
+                    failed_f[...]
+
+[...]
+
+[...]g
+            if line[0] == '#':
+                if line in ConfigSettings.SECTIONS:
+                    section = line
+                else:
+                    continue
+            else:
+                # fill up config dicts
+                if section == "#control":
+                    (key, col_name, tag) = line.split(',')
+                    self.control_info[key] = {'col_name': col_name, 'tag': tag}
+                elif section == "#column_names":
+                    (key, col_name) = line.split(',')
+                    self.col_names[key] = col_name
+                elif section == "#numeric_filters":
+                    (key, col_name, op, cutoff) = line.split(',')
+                    self.add_numeric_filter(key, col_name, op, float(cutoff))
+                elif section == "#string_filters":
+                    (key, col_name, exact_flag, accept_flag) = line.split(',')
+                    patterns = next(cffile).strip().split(',')
+                    self.add_string_filter(key, col_name, exact_flag, accept_flag, patterns)
+                else:
+                    rc = 2
+                    break
+
+        cffile.close()
+        if rc != 0:
+            return rc
+        if not self.is_valid():
+            rc = 3
+        return rc
+
+    def is_valid(self):
+        for k in REQ_KEYS:
+            if k not in self.col_names.keys():
+                return False
+        return True
+
+    def add_numeric_filter(self, key, col_name, op, cutoff):
+        self.nfilters[key] = {
+            'col_name': col_name,
+            'op': op,
+            'cutoff': cutoff
+        }
+
+    def add_string_filter(self, key, col_name, exact_flag, accept_flag, patterns):
+        if exact_flag == "exact":
+            ef = True
+        elif exact_flag == "not_exact":
+            ef = False
+        else:
+            return False
+
+        if accept_flag == "accept":
+            af = True
+        elif accept_flag == "reject":
+            af = False
+        else:
+            return False
+
+        self.sfilters[key] = {
+            'col_name': col_name,
+            'exact_flag': ef,
+            'accept_flag': af,
+            'patterns': patterns
+        }
+
+    def __str__(self):
+        rv = "is Valid: {} || control info: {} || col names: {} || numeric filters: {} || string filters: {}".format(
+            self.is_valid(), self.control_info, self.col_names, self.nfilters, self.sfilters)
+        return rv
+
+
+### Utility ###
+def string_as_operator(arg1, arg2, op):
+    if op == "==":
+        return arg1 == arg2
+    elif op == ">":
+        return arg1 > arg2
+    elif op == "<":
+        return arg1 < arg2
+    elif op == "<=":
+        return arg1 <= arg2
+    elif op == ">=":
+        return arg1 >= arg2
+
+def genotype_to_bases(genotype, SNPdata):
+    bases = ""
+    if genotype in GENOTYPE_DICT:
+        gtype = GENOTYPE_DICT[genotype]
+        if gtype == "hom_ref":
+            bases = "{} {}".format(SNPdata['ref_col'], SNPdata['ref_col'])
+        elif gtype == "hom_alt":
+            bases = "{} {}".format(SNPdata['alt_col'], SNPdata['alt_col'])
+        elif gtype == "het":
+            bases = "{} {}".format(SNPdata['ref_col'], SNPdata['alt_col'])
+        elif gtype == "tri_allelic":
+            aa_col = SNPdata['alt_col']
+            if len(aa_col) > 1:
+                # arbitrarily choose the first one
+                alt_allele = aa_col[0]
+            else:
+                alt_allele = aa_col
+            bases = "{} {}".format(alt_allele, alt_allele)
+    else:
+        print genotype
+        print "Unrecognized genotype!"
+        sys.exit(1)
+    return bases
+
+### Main ###
+def main():
+    # argument parsing
+    parser = argparse.ArgumentParser()
+    parser.add_argument("dfname", help="name of input data file")
+    parser.add_argument("cfname", help="name of input configuration file")
+    parser.add_argument("pfname", help="name of output ped file")
+    parser.add_argument("mfname", help="name of output map file")
+    parser.add_argument("xsname", help="name of output file containing exact IDs of samples which were excluded")
+    args = parser.parse_args()
+
+    pc = PedConverter()
+    # read in config file
+    rc = pc.read_config_file(args.cfname)
+    if rc == 0:
+        print 'config file read successfully'
+    else:
+        print 'failed to read in config file successfully. Error code: {}'.format(rc)
+    # read in data file
+    rc = pc.read_data_file(args.dfname)
+    if rc == 0:
+        print 'data file read successfully'
+    else:
+        print 'failed to read in data file successfully. Error code: {}'.format(rc)
+    pc.create_ped_file(args.pfname, numeric=True)
+    pc.create_map_file(args.mfname)
+    pc.create_excluded_samples_file(args.xsname)
+
+if __name__ == "__main__":
+    main()
diff -r 000000000000 -r 64e75e21466e tool_dependencies.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_dependencies.xml	Wed Jun 01 03:38:39 2016 -0400
@@ -0,0 +1,22 @@
+<?xml version="1.0"?>
+<tool_dependency>
+	<package name="Jinja2" version="2.8">
+		<install version="1.0">
+			<actions>
+				<action type="download_by_url">https://pypi.python.org/packages/source/J/Jinja2/Jinja2-2.8.tar.gz#md5=edb51693fe22c53cee5403775c71a99e</action>
+				<action type="make_directory">$INSTALL_DIR/lib/python</action>
+				<action type="shell_command">
+					export PYTHONPATH=$PYTHONPATH:$INSTALL_DIR/lib/python &amp;&amp;
+					python setup.py install --install-lib $INSTALL_DIR/lib/python --install-scripts $INSTALL_DIR/bin
+				</action>
+				<action type="set_environment">
+					<environment_variable action="append_to" name="PYTHONPATH">$INSTALL_DIR/lib/python</environment_variable>
+					<environment_variable action="prepend_to" name="PATH">$INSTALL_DIR/bin</environment_variable>
+				</action>
+			</actions>
+		</install>
+		<readme>
+			Jinja2 python library for templating
+		</readme>
+	</package>
+</tool_dependency>